TensorRT-LLM and vLLM are considered the current state-of-the-art (SOTA) systems for LLM serving, not only because of the performance of their CUDA kernels but also because they incorporate many techniques that target the serving challenges and metrics mentioned above.

For latency, vLLM's scheduling mechanism favors Time-to-First-Token (TTFT): prefill requests are prioritized, which leaves tokens in the decode phase waiting. As a result, tokens per second (TPS) may look less impressive than TTFT, but this is a deliberate compromise among throughput, TTFT, and TPS. For TPS and throughput, the baseline is autoregressive decoding, where tokens are generated serially; this has prompted parallel decoding methods and frameworks represented by non-autoregressive decoding, speculative decoding, and Medusa.

Regarding memory footprint and model size, inference frameworks including vLLM have adopted numerous optimizations to reduce VRAM usage. Paged attention drastically reduces the KV cache footprint, and AWQ, a SOTA weight-only quantization (WOQ) method, is also implemented in the framework. A natural next step for vLLM is to explore activation quantization, to compress the memory footprint further.

In terms of scalability and throughput, vLLM supports tensor parallelism (TP) and offloading, attempting to overlap the swap-in/out of the KV cache with transformer computation; TP also enables multi-GPU inference. vLLM's scheduling mechanism (continuous batching), combined with the VRAM savings from paged attention, significantly increases the batch size (or the number of batched tokens), which naturally raises throughput. For hardware acceleration, mapping algorithms cleanly onto hardware units is an important way to fully utilize hardware resources.
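To make the paged-attention point concrete, here is a minimal sketch of the block-table idea: KV cache is allocated in fixed-size blocks on demand instead of reserving one contiguous max-length slab per request. All names and sizes here are illustrative, not vLLM's actual implementation.

```python
# Toy sketch of the block-table idea behind paged attention. KV blocks are
# allocated lazily as sequences grow; a contiguous allocator would instead
# reserve MAX_SEQ_LEN worth of cache per request up front.
BLOCK_SIZE = 16        # tokens per KV block (illustrative)
MAX_SEQ_LEN = 2048     # what a contiguous allocator would reserve per request

class PagedKVCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # physical block pool
        self.block_tables = {}                      # seq_id -> [physical block ids]

    def append_token(self, seq_id, pos):
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % BLOCK_SIZE == 0:                   # need a fresh block
            table.append(self.free_blocks.pop())
        return table[pos // BLOCK_SIZE]             # physical block for this token

cache = PagedKVCache(num_blocks=1024)
for seq_id, length in [(0, 100), (1, 37), (2, 500)]:
    for pos in range(length):
        cache.append_token(seq_id, pos)

# Blocks actually used vs. contiguous reservation for the same 3 requests:
used = sum(len(t) for t in cache.block_tables.values())
reserved = 3 * MAX_SEQ_LEN // BLOCK_SIZE
print(used, reserved)  # 42 blocks used vs 384 reserved
```

Because the freed VRAM goes straight back into the block pool, the scheduler can admit more sequences per step, which is exactly how paged attention feeds the continuous-batching throughput gains described above.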
A notable example is the family of efficient CUDA kernels that have evolved alongside algorithmic architectures: DeepSpeed's efficient dequantization from int4/int8 to fp16; the evolution from MHA to MQA and GQA, and then to DeepSeek-V2's MLA; the transformation of the FFN into MoE; and the various attention kernels tailored to specific scenarios, such as FlashAttention, FlashDecoding, paged attention, and most recently the attention kernel library FlashInfer.

On the tradeoff between accuracy and hardware efficiency, there has also been significant algorithmic work. Tim Dettmers released LLM.int8() in 2022, a pioneering work on large-model quantization that improves accuracy from the algorithmic side; it has been followed by a surge of quantization papers, including the recent I-LLM.

vLLM is currently the most popular fully open-source LLM inference engine. Taking it as an example, I see four potential breakthroughs. First, in the scheduler, decode tokens must wait for new prefill tokens to complete the current step's forward pass; one could break this limitation by letting prefill and decode tokens run forward together. Chunked prefill already attempts this for long inputs, and Microsoft's Splitwise, which separates prefill and decode onto different GPUs, is another interesting trial, though I have not examined it in detail yet. Second, there may be a deficiency in vLLM's scheduler and cache engine: the scheduling result is generated dynamically, and the Python/PyTorch runtime overhead cannot be ignored. Kernel dispatch and submission then incur costs on the CPU side, potentially stalling the GPU timeline (though more profiling is needed to confirm this).
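The core trick of LLM.int8() mentioned above is a mixed-precision decomposition: activation columns with outlier magnitudes stay in floating point, while the rest are quantized to int8 with absmax scaling. Below is a minimal NumPy sketch of that idea; the threshold, shapes, and data are illustrative, not the paper's exact setup.

```python
import numpy as np

# Sketch of the LLM.int8() mixed-precision decomposition (illustrative only):
# columns of the activation X whose magnitude exceeds a threshold are kept in
# fp32; the remaining columns use absmax int8 quantization with int32 matmul.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)
X[:, 3] *= 30.0                      # plant an "outlier" feature dimension
W = rng.normal(size=(64, 32)).astype(np.float32)

threshold = 6.0
outlier = np.abs(X).max(axis=0) > threshold     # per-column outlier mask

# Outlier part stays in fp32 (a thin matmul over few columns).
Y_out = X[:, outlier] @ W[outlier, :]

# Regular part: absmax int8 quantization, per row of X and per column of W.
Xr, Wr = X[:, ~outlier], W[~outlier, :]
sx = np.abs(Xr).max(axis=1, keepdims=True) / 127.0
sw = np.abs(Wr).max(axis=0, keepdims=True) / 127.0
Xq = np.round(Xr / sx).astype(np.int8)
Wq = np.round(Wr / sw).astype(np.int8)
# Accumulate in int32, then dequantize with the outer product of the scales.
Y_reg = (Xq.astype(np.int32) @ Wq.astype(np.int32)).astype(np.float32) * sx * sw

Y = Y_reg + Y_out
err = np.abs(Y - X @ W).max() / np.abs(X @ W).max()
print(f"max relative error: {err:.4f}")  # small, despite the outlier column
```

Keeping the outlier columns out of the int8 path is what preserves accuracy: absmax scaling against a single huge column would otherwise crush the resolution of every other feature.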
Although vLLM uses CUDA graph mechanisms to cache graphs for certain shapes in the decode phase, this does not cover the prefill phase, so converting vLLM's dynamic scheduling into a static graph is a potential research direction. I have not tried torch.compile yet, so I withhold judgment there. Third, vLLM does not support pipeline parallelism (PP), perhaps because it is difficult to construct microbatches that make the pipeline bubbles negligible, which has kept this feature pending. Fourth, this is not precisely a breakthrough for vLLM but for LLM inference in general: the decode phase is memory-bound, and making it compute-bound requires batch sizes in the thousands, a rather stringent requirement. Adjusting the model architecture so that the decode phase becomes compute-bound is also an interesting research direction.
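A back-of-the-envelope roofline check makes the memory-bound claim concrete. The numbers below are rough A100-class assumptions (312 TFLOPS fp16, ~2 TB/s HBM), not measurements:

```python
# Rough roofline check for the decode phase; all figures are assumptions.
peak_flops = 312e12          # fp16 tensor-core peak, FLOP/s (A100-class)
hbm_bw     = 2.0e12          # HBM bandwidth, byte/s
ridge = peak_flops / hbm_bw  # FLOP/byte needed to become compute-bound
print(f"ridge point: {ridge:.0f} FLOP/byte")

# One decode step of an N x N projection at batch size B:
#   FLOPs  = 2 * B * N * N    (one GEMV per sequence)
#   bytes ~= 2 * N * N        (stream the fp16 weights once, shared by the batch)
# so arithmetic intensity ~= B FLOP/byte for the projection alone.
def gemm_intensity(batch):
    return 1.0 * batch

min_batch = next(b for b in range(1, 10_000) if gemm_intensity(b) >= ridge)
print(f"batch needed (weight streaming only): {min_batch}")

# The KV cache is per-sequence, so its traffic grows with the batch and keeps
# the attention part memory-bound regardless of B; this is why the batch size
# needed to make the whole decode step compute-bound in practice lands in the
# thousands, as noted above.
```

Even the optimistic weights-only bound needs batch sizes above a hundred; once per-sequence KV-cache traffic is counted, the requirement grows to the "several thousand" figure, which is what motivates architectural changes to the decode phase.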