The user asks to ensure that the `stop_token_ids` are passed correctly so the vLLM engine stops immediately and avoids generating excess tokens.
Thanks for bringing up this vLLM implementation detail. The current implementation generates excess tokens, negating the potential speed gains. The primary causes are reliance on `max_tokens` instead of proper stopping via `stop_token_ids`, and inefficient post-processing.

### 1. Model Configuration Fix: `model_vllm_v2.py`

Ensure the `stop_token_ids` are passed to **guarantee immediate stopping** by the vLLM engine.

```python
# model_vllm_v2.py
def _create_sampling_params(self):
    """
    Creates SamplingParams, ensuring vLLM stops at the stop token ID
    instead of generating all the way up to max_tokens.
    """
    return SamplingParams(
        temperature=1.0,
        top_p=0.8,
        top_k=30,
        repetition_penalty=10.0,
        # With stop_token_ids set, vLLM stops at the token ID, making
        # max_tokens a functional upper bound rather than the typical
        # stopping point. 2048 is sufficient for most cases; 768 is not
        # enough for English before the stop token appears.
        max_tokens=2048,
        # Assumes the stop token ID is stored on the model as stop_mel_token.
        stop_token_ids=[self.stop_mel_token],
    )
```
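On the post-processing side, trimming the output at the stop token is cheap and engine-agnostic. Below is a minimal sketch; the function name `trim_at_stop` and the stop-token ID used in the example are illustrative, not taken from the original code:

```python
def trim_at_stop(token_ids, stop_id):
    """Truncate a generated sequence at the first occurrence of stop_id.

    Even with stop_token_ids set, the returned sequence may include the
    stop token itself; this drops it and anything after it.
    """
    try:
        return token_ids[:token_ids.index(stop_id)]
    except ValueError:
        # Stop token never appeared (e.g. max_tokens was hit).
        return token_ids

# 8192 here is an illustrative stop-token ID, not the real stop_mel_token.
print(trim_at_stop([5, 9, 12, 8192, 0, 0], 8192))  # → [5, 9, 12]
```

This keeps the expensive decoding path from ever touching padding tokens emitted after the stop token.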