Towards Unified Video-Text-to-Audio Generation
Yusheng Dai2,3,
Zehua Chen1,3†,
Yuxuan Jiang1,3,
Baolong Gao1,3,
Qiuhong Ke2,
Jianfei Cai2,
Jun Zhu1,3†
1Tsinghua University, Beijing, China 2Monash University, Melbourne, Australia 3Shengshu AI, Beijing, China
Training a unified model for video-to-audio (V2A), text-to-audio (T2A), and joint video-text-to-audio (VT2A) generation offers significant flexibility but faces critical, underexplored challenges. In this paper, we identify two foundational problems: (1) cross-task competition, where naively training the V2A, T2A, and VT2A objectives jointly causes the tasks to interfere with one another, and (2) intra-task modality bias, where the model over-relies on one conditioning modality at the expense of audio-visual alignment or faithfulness to off-screen sounds.
In this work, we introduce SoundAtlas, a large-scale dataset of 470k audio-caption pairs. To our knowledge, it is the first dataset whose caption quality substantially surpasses that of existing datasets, even exceeding human-expert annotation quality. Our construction process relies on a novel multi-turn agentic annotation pipeline powered by Gemini-2.5 Pro and Qwen-2.5-VL (Figure 2). Specifically, we employ Vision-to-Language Compression to mitigate hallucinations caused by visual bias (Figure 1), alongside a Junior-Senior Agent Handoff mechanism that achieves a 5× cost reduction, followed by post-hoc filtering to ensure fidelity. Derived from VGGSound and AudioSet via this pipeline, SoundAtlas exhibits tight V-A-T alignment, delivering semantically rich captions capable of correcting errors in human-annotated benchmarks.
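The Junior-Senior Agent Handoff can be sketched as a simple escalation loop: a cheap junior annotator drafts every caption, and only low-confidence drafts are forwarded to the expensive senior model. The sketch below is illustrative only; the `junior`/`senior` callables, the confidence threshold, and the escalation criterion are assumptions standing in for the actual Qwen-2.5-VL and Gemini-2.5 Pro agents.

```python
def annotate(clips, junior, senior, threshold=0.8):
    """Junior-Senior handoff: the junior agent drafts every caption;
    only drafts below the confidence threshold are escalated to the
    senior agent, which keeps expensive senior calls rare."""
    captions, senior_calls = [], 0
    for clip in clips:
        caption, confidence = junior(clip)      # cheap first pass
        if confidence < threshold:              # escalate uncertain drafts
            caption = senior(clip, caption)     # senior refines the draft
            senior_calls += 1
        captions.append(caption)
    return captions, senior_calls

# Toy annotators standing in for Qwen-2.5-VL (junior) and Gemini-2.5 Pro
# (senior); every fifth clip is "hard" and gets a low confidence score.
junior = lambda clip: (f"draft:{clip}", 0.3 if clip % 5 == 0 else 0.9)
senior = lambda clip, draft: draft.replace("draft", "refined")

captions, calls = annotate(range(10), junior, senior)
# Only 2 of 10 clips reach the senior agent in this toy run.
```

With a low escalation rate, most clips never touch the senior model, which is the source of the reported cost reduction; the post-hoc filter would then run over `captions` as a separate pass.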
Building on SoundAtlas, we propose Omni2Sound, a diffusion-based unified model that supports flexible input modalities while maintaining both fine-grained audio-visual synchronization and high-fidelity generation. To address the identified cross-task and intra-task competition, we design a three-stage progressive training schedule that departs from naive joint training. This strategy first establishes a robust T2A prior and leverages high-quality VT2A data to map distinct conditional spaces into a unified joint embedding, effectively converting cross-task competition into a cooperative dynamic. Furthermore, it employs a decoupled robustness training stage with push-pull synergistic augmentations to mitigate intra-task modality bias, ensuring both A-V alignment and faithfulness in off-screen audio generation.
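The three-stage schedule above can be sketched as a stage-dependent choice of conditioning modalities during diffusion training. The stage boundaries, dropout probabilities, and function names below are hypothetical placeholders, not the paper's actual hyperparameters:

```python
import random

def stage_for_step(step, bounds=(10_000, 30_000)):
    """Map a global training step to a stage (hypothetical boundaries):
    1: establish the T2A prior; 2: joint VT2A alignment; 3: decoupled
    robustness training with push-pull augmentations."""
    if step < bounds[0]:
        return 1
    if step < bounds[1]:
        return 2
    return 3

def sample_condition(batch, stage, rng):
    """Pick which conditioning modalities the model sees this step."""
    if stage == 1:                    # T2A prior only
        return {"text": batch["text"]}
    if stage == 2:                    # map both conditions into a joint space
        return {"text": batch["text"], "video": batch["video"]}
    # Stage 3: push-pull augmentation — randomly drop a modality so the
    # model neither over-relies on video (hurting off-screen faithfulness)
    # nor on text (hurting A-V synchronization).
    r = rng.random()
    if r < 1 / 3:
        return {"video": batch["video"]}                      # V2A
    if r < 2 / 3:
        return {"text": batch["text"]}                        # T2A
    return {"text": batch["text"], "video": batch["video"]}   # VT2A

batch = {"text": "a dog barks", "video": "frame_features"}
cond = sample_condition(batch, stage_for_step(5_000), random.Random(0))
```

At step 5,000 the schedule is still in stage 1, so only the text condition is passed; later stages progressively expose the model to joint and randomly dropped conditions.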
As a result, Omni2Sound achieves unified state-of-the-art performance across V2A, T2A, and VT2A tasks on the comprehensive VGGSound-Omni benchmark, surpassing both previous unified frameworks and specialized baselines. Extensive evaluations further demonstrate its strong generalization on external benchmarks (e.g., Kling-Audio-Eval and Video-LLaMA-generated captions).
Qualitative Demonstrations and Comparisons
We introduce SoundAtlas, the first large-scale, human-expert-level audio caption dataset, augmenting VGGSound and AudioSet with semantically rich and temporally detailed captions. It features tight visual–audio–text (V–A–T) alignment and markedly higher text–audio faithfulness than prior datasets.
We propose Omni2Sound, a diffusion-based unified model supporting flexible input modalities while maintaining both fine-grained audio-visual synchronization and high-fidelity generation.
Video-Text-to-Audio: Joint conditioning on Video and Text for precise semantic control.
@article{dai2026omni2sound,
title = {Omni2Sound: Towards Unified Video-Text-to-Audio Generation},
author = {Dai, Yusheng and Chen, Zehua and Jiang, Yuxuan and Gao, Baolong and
Ke, Qiuhong and Cai, Jianfei and Zhu, Jun},
journal = {arXiv preprint arXiv:2601.02731},
year = {2026}
}
If you have any comments or questions, feel free to contact: yusheng.dai@monash.edu