The user requests scripts or containers to launch llama.cpp (CPU) and vLLM (GPU) for reproducible local serving backends, including health/readiness endpoints and config-driven model/adapter loading. This would improve portability and testability compared to tight coupling with a single backend.
**Is your feature request related to a problem? Please describe.**
We need reproducible local serving backends for development (CPU) and throughput (GPU) scenarios.

**Describe the solution you'd like**
Scripts/containers to launch llama.cpp (CPU) and vLLM (GPU), with health/readiness endpoints and config-driven model/adapter loading.

**Describe alternatives you've considered**
Staying tightly coupled to a single backend; rejected because it reduces portability and testability.

**Additional context**

**Acceptance Criteria**
- Launch either backend via a flag; health and readiness endpoints exposed.
- Config-driven model/adapter loading; documented setup.

**KPIs**
- llama.cpp: p50 latency ≤ 2.0 s at 256 tokens on a 7B Q4 model (hardware documented).
- vLLM: documented throughput/latency baseline.

**Tests**
- Integration: chat endpoint smoke tests; JSON-only mode enforcement; relaxed latency budget on CI.

**Dependencies**
- Writer
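A minimal sketch of what the launcher could look like, assuming both backends expose an OpenAI-compatible HTTP server with a `/health` route (llama.cpp's `llama-server` and vLLM's `api_server` do by default, but exact flags, ports, and the `MODEL_PATH`/`MODEL_NAME` variables here are assumptions to validate during implementation):

```shell
#!/usr/bin/env sh
# Hypothetical launcher sketch; binary names, flags, and endpoint
# paths are assumptions to be checked against each backend's docs.

start_backend() {
  backend="$1"; port="$2"
  case "$backend" in
    llamacpp)
      # llama.cpp HTTP server (CPU); MODEL_PATH points at a local GGUF file.
      llama-server --model "$MODEL_PATH" --port "$port" &
      ;;
    vllm)
      # vLLM OpenAI-compatible server (GPU); MODEL_NAME is a model identifier.
      python -m vllm.entrypoints.openai.api_server --model "$MODEL_NAME" --port "$port" &
      ;;
    *)
      echo "unknown backend: $backend" >&2
      return 1
      ;;
  esac
}

# Poll the health endpoint until the backend is ready or the retry budget runs out.
wait_ready() {
  url="$1"; tries="${2:-60}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS "$url" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# Entry point runs only when a backend flag is given, e.g.: ./serve.sh llamacpp 8000
if [ "$#" -ge 1 ]; then
  start_backend "$1" "${2:-8000}" && wait_ready "http://127.0.0.1:${2:-8000}/health"
fi
```

Separating `start_backend` from `wait_ready` keeps the readiness probe reusable for CI smoke tests, where the same polling loop can gate the integration suite on either backend.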