Add support for automatic compute domain creation and topology-aware placement for TrainJob pods, so distributed training workloads can take advantage of high-bandwidth GPU-to-GPU communication across nodes connected via NVSwitch fabric.
Multi-Node NVLink (MNNVL) enables high-bandwidth GPU-to-GPU communication across nodes connected via an NVSwitch fabric. For distributed training workloads to achieve the best performance, all TrainJob pods should be placed on nodes within the same MNNVL domain.

Currently, users must manually create the compute domain CRD to leverage IMEX channels, and manually configure placement constraints for their workloads. This doesn't scale.

We should add support for:

1. **Automatic compute domain creation** — Trainer should automatically create the compute domain CRD required for MNNVL and IMEX channel configuration, removing the manual setup burden from users.
2. **Topology Aware Scheduling (TAS) integration** — Use [TAS](https://kueue.sigs.k8s.io/docs/concepts/topology_aware_scheduling/) to ensure the best placement of TrainJob pods within MNNVL domains.

/kind feature
/area controller

CC @andreyvelich
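For illustration, a rough sketch of what the controller could create per TrainJob. This assumes the NVIDIA DRA driver's `ComputeDomain` API (`resource.nvidia.com/v1beta1`) and Kueue's TAS pod-set annotation; all names, counts, and the topology label are placeholders, not a proposed spec:

```yaml
# Hypothetical manifest the Trainer controller could create automatically
# for a 4-node TrainJob; names are illustrative.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: trainjob-example-compute-domain
spec:
  numNodes: 4  # number of TrainJob nodes sharing the MNNVL domain
  channel:
    resourceClaimTemplate:
      # IMEX channel claim template that the TrainJob pods would reference
      name: trainjob-example-imex-channel
```

For the TAS side, the controller could set an annotation such as `kueue.x-k8s.io/podset-required-topology` on the TrainJob pod template, pointing at whichever node label identifies an MNNVL domain in the cluster (e.g. an NVLink clique label) — the exact label to use is an open question for this proposal.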