Enable users to easily deploy multiple vLLM instances on a single machine with multiple GPUs, sharding the work across replicated instances for offline use cases.
### 🚀 The feature, motivation and pitch

It is common to have a scenario where folks want to deploy multiple vLLM instances on a single machine because the machine has several GPUs (commonly 8). The work can then be sharded across replicated instances. This issue describes the intended UX for such a feature. Notably, we might not want to tackle large distributed settings (100s of parallel vLLM instances), which are better handled by higher layers.

* Offline use case: for the `LLM` class, add a new argument `data_parallel_size` and support dispatching requests to one engine per GPU (or per tensor-parallel group).

```python
from vllm import LLM

llm = LLM(model="...", data_parallel_size=X)
# spawn X engine processes and shard the work among them

llm = LLM(model="...", data_parallel_size=X, tensor_parallel_size=Y)
# this is supported if X * Y <= total number of GPUs
```

For the server, the same argument would route requests to different engine processes; we can start with s
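To make the proposed semantics concrete, here is a minimal sketch of how requests could be sharded across replicas and how each replica could be pinned to its own group of GPUs. The helper names (`shard_requests`, `gpu_ids_for_replica`) are hypothetical illustrations, not part of the vLLM API; the assumption is simple round-robin dispatch and contiguous GPU assignment per replica.

```python
from typing import Dict, List


def shard_requests(prompts: List[str], data_parallel_size: int) -> Dict[int, List[str]]:
    """Round-robin prompts across data-parallel engine replicas (hypothetical helper)."""
    shards: Dict[int, List[str]] = {rank: [] for rank in range(data_parallel_size)}
    for i, prompt in enumerate(prompts):
        shards[i % data_parallel_size].append(prompt)
    return shards


def gpu_ids_for_replica(rank: int, tensor_parallel_size: int) -> str:
    """CUDA_VISIBLE_DEVICES value for one replica, assuming each replica
    owns a contiguous group of tensor_parallel_size GPUs."""
    start = rank * tensor_parallel_size
    return ",".join(str(g) for g in range(start, start + tensor_parallel_size))


if __name__ == "__main__":
    # e.g. data_parallel_size=4, tensor_parallel_size=2 on an 8-GPU machine
    prompts = [f"prompt-{i}" for i in range(10)]
    for rank, shard in shard_requests(prompts, data_parallel_size=4).items():
        print(rank, gpu_ids_for_replica(rank, tensor_parallel_size=2), shard)
```

In a real implementation each rank would set `CUDA_VISIBLE_DEVICES` before spawning its engine process, so every replica sees only its own GPUs and the `X * Y <= total GPUs` constraint falls out naturally.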