Integrate Candle-based inference primitives (e.g., `candle-vllm`, `mistralrs`) for model handling. This is expected to provide tighter coupling and potentially better performance, especially on older GPU hardware and with large context windows, compared to existing solutions like `llamacpp`.
**Please describe the feature you want**

`candle-vllm`, `mistralrs`, or other Candle-based primitives for model handling should provide tighter coupling and possibly better performance. At present, `candle-vllm` can sustain ~55 T/s on a `q8_0` Qwen3-Coder model even on NVCC7 (Volta-generation) hardware with a 512k context (fairly stable into the ~400k range, owing to how it handles ISQ and attention), whereas `llamacpp` achieves a fraction of that throughput and appears to lose track of earlier content in large context windows.

---

Please reply with a 👍 if you want this feature.