Enhance verl's scalability to support online DPO-like training for diffusion image and video generation models like Qwen-Image, using Flow-GRPO as a representative algorithm.
# Motivation

The goal is to enhance `verl`'s scalability so that it can support online DPO-like training for state-of-the-art diffusion image and video generation models, including `Qwen-Image`, `Z-Image`, `Wan2.2`, and others. We choose `Flow-GRPO` as the representative algorithm in this domain; additional algorithms such as `DiffusionNFT` and `DanceGRPO` can then be integrated seamlessly following this update. As an initial step, `Qwen-Image` has been selected as the first supported model for multimodal generation tasks.

At present, `verl` does not support diffusion-based generation models. Enabling this functionality requires two major extensions: first, a rollout engine capable of handling image and video generation tasks, incorporating components such as `vLLM-Omni`; and second, a training engine for diffusion models, which will rely on `diffusers` with an `FSDP` backend. Consequently, integrating `diffusers` and `vLLM-Omni` becomes