Allow specifying multiple GPUs for the vLLM device in the GRPO Trainer, instead of being limited to a single GPU. This would improve RL performance by removing the single-GPU generation bottleneck.
### Feature request

It seems that the vLLM device can only be set via [`GRPOConfig.vllm_device`](https://github.com/huggingface/trl/blob/main/trl/trainer/grpo_config.py#L130), which is a string corresponding to a single CUDA device identifier. This implies that vLLM generation is restricted to one GPU, which can be a bottleneck for RL. It is also possible to expose only a subset of GPUs by setting the `CUDA_VISIBLE_DEVICES` environment variable, but this might break TRL's own device assignment. Is there a more convenient way to specify multiple GPUs on a single node for training (or any hacks that would work now)? Furthermore, multi-node vLLM/GRPO training runs would likely need additional, more detailed configuration options.

### Motivation

Enhance RL training efficiency by sampling on more than a single GPU.

### Your contribution

N/A
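For context, a minimal sketch of the `CUDA_VISIBLE_DEVICES` workaround mentioned above (the GPU indices here are hypothetical, and the variable must be set before CUDA is initialized, i.e. before importing `torch`/`vllm`/`trl`; as noted, this may still conflict with TRL's internal device assignment):

```python
import os

# Hypothetical workaround: restrict which GPUs this process can see.
# Must happen before any CUDA-initializing import (torch, vllm, trl).
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

# Frameworks launched afterwards see only these GPUs, renumbered
# within the process as cuda:0 and cuda:1.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # number of GPUs exposed to this process
```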