Optimize the performance of the WAN2.2 image-to-video diffusion model on NPU accelerators (Ascend, Cambricon, etc.) to meet the increasing demand for running such models efficiently in production environments.
### Motivation

WAN2.2 is a state-of-the-art image-to-video (I2V) diffusion model that unlocks new possibilities in creative content generation, autonomous driving simulation, and interactive media. As one of the I2V models integrated into vLLM-OMNI, it represents a pioneering step toward accessible, high-performance video AI.

##### NPU demand

With NPU accelerators (Ascend, Cambricon, etc.) becoming increasingly prevalent in production environments, there is strong demand to run WAN2.2 efficiently on these platforms. Currently, offline serving on 8× NPU cards at 480×832 resolution with 81 frames takes 215 seconds of total inference time. This baseline shows clear room for optimization in operators, distributed strategies, and NPU-specific execution models. At the same time, low-latency online serving on NPU is an emerging requirement that this project will also address.

##### GPU demand

GPUs remain the most widely adopted acceleration platform for generative AI today. vLLM-O