Optimize on-device sampling performance in vLLM on Tenstorrent hardware. Non-greedy device sampling is currently ~2x slower than CPU sampling at batch=1. Investigation will start with IR analysis and Tracy profiling to identify the bottlenecks.
Tracking issue for optimizing on-device sampling performance in vLLM on Tenstorrent hardware. Individual issues will be opened as specific areas are investigated.

Non-greedy device sampling is currently ~2x slower than CPU sampling at batch=1:

| Configuration | OPT-125M (tok/s) | Llama-3.1-8B (tok/s) |
|---|---|---|
| Greedy device | 11.61 | 9.82 |
| Non-greedy device | 5.94 | 4.02 |
| Non-greedy CPU | 11.00 | 7.89 |

Investigation will start with IR analysis and Tracy profiling to identify the dominant ops in the non-greedy sampling graph.

Branch: `kmabee/vllm_perf_debug`
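For reference while profiling, here is a minimal sketch that derives the slowdown factors from the throughput numbers above. It is not part of the branch; the dictionary keys and helper names (`throughput`, `slowdown`, `summarize`) are made up for illustration, and only the tok/s values come from the measurements in this issue.

```python
# Throughput in tokens/s at batch=1, copied from the measurements in this issue.
# Key names are illustrative, not identifiers from vLLM or the branch.
throughput = {
    "OPT-125M": {
        "greedy_device": 11.61,
        "non_greedy_device": 5.94,
        "non_greedy_cpu": 11.00,
    },
    "Llama-3.1-8B": {
        "greedy_device": 9.82,
        "non_greedy_device": 4.02,
        "non_greedy_cpu": 7.89,
    },
}


def slowdown(baseline_tps: float, candidate_tps: float) -> float:
    """Return how many times slower the candidate is than the baseline."""
    return baseline_tps / candidate_tps


def summarize(table: dict) -> dict:
    """Compute non-greedy device slowdown vs. CPU sampling and vs. greedy device."""
    summary = {}
    for model, row in table.items():
        device = row["non_greedy_device"]
        summary[model] = {
            "vs_cpu": round(slowdown(row["non_greedy_cpu"], device), 2),
            "vs_greedy": round(slowdown(row["greedy_device"], device), 2),
            # Extra wall-clock time per token relative to CPU sampling, in ms.
            "extra_ms_per_token": 1000.0 / device - 1000.0 / row["non_greedy_cpu"],
        }
    return summary


summary = summarize(throughput)
for model, stats in summary.items():
    print(
        f"{model}: {stats['vs_cpu']}x slower than CPU sampling, "
        f"{stats['vs_greedy']}x slower than greedy device, "
        f"+{stats['extra_ms_per_token']:.1f} ms/token vs CPU"
    )
```

The per-token delta is a rough budget for what the non-greedy sampling graph costs beyond CPU sampling, which may help when comparing against per-op times in the Tracy capture.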