A new task for generating 3D holistic body motions and singing vocals from textual lyrics is introduced. To support this research, the RapVerse dataset, containing synchronized rapping vocals, lyrics, and 3D holistic body meshes, is collected. The study employs a vector-quantized variational autoencoder to encode motion sequences into discrete tokens and a vocal-to-unit model to convert audio into quantized tokens. By applying unified transformer modeling across language, audio, and motion tokens, a single framework is developed that produces realistic vocal and motion outputs, surpassing traditional single-modality systems and setting new benchmarks in this domain.
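A minimal sketch of the kind of unified token-based transformer such a framework implies: lyric, vocal, and motion tokens share one vocabulary and one autoregressive decoder. All class names, vocabulary sizes, and hyperparameters below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class UnifiedMultimodalLM(nn.Module):
    """Illustrative decoder that models text, vocal, and motion tokens in one stream."""

    def __init__(self, text_vocab=30000, audio_vocab=1024, motion_vocab=512,
                 d_model=512, n_layers=6, n_heads=8, max_len=2048):
        super().__init__()
        # A shared vocabulary: offsets keep the three token types disjoint.
        self.audio_offset = text_vocab
        self.motion_offset = text_vocab + audio_vocab
        total_vocab = text_vocab + audio_vocab + motion_vocab

        self.embed = nn.Embedding(total_vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, total_vocab)

    def forward(self, tokens):
        # tokens: (batch, seq) ids already shifted into the shared vocabulary.
        seq_len = tokens.size(1)
        positions = torch.arange(seq_len, device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        # Causal mask so each position attends only to earlier tokens.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf"),
                                     device=tokens.device), diagonal=1)
        h = self.decoder(x, mask=mask)
        return self.head(h)  # next-token logits over the joint vocabulary


# Usage: a dummy sequence of lyric ids followed by shifted audio and motion ids.
model = UnifiedMultimodalLM()
text = torch.randint(0, 30000, (2, 32))
audio = torch.randint(0, 1024, (2, 16)) + model.audio_offset
motion = torch.randint(0, 512, (2, 16)) + model.motion_offset
tokens = torch.cat([text, audio, motion], dim=1)   # (2, 64)
logits = model(tokens)
print(logits.shape)                                # torch.Size([2, 64, 31536])
```

Offsetting each modality's ids into disjoint ranges is one simple way to let a single softmax predict a token of any modality at every step.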
This work introduces a challenging task of jointly generating 3D holistic body motions and singing vocals from textual lyrics, advancing beyond existing approaches that treat each modality in isolation.
The RapVerse dataset contains synchronized rapping vocals, lyrics, and high-quality 3D holistic body meshes, providing the paired data needed for multimodal generation tasks.
A vector-quantized variational autoencoder is employed to encode whole-body motion sequences into discrete motion tokens, allowing motion to be modeled in the same token-based framework as text and audio (see the sketch after this list).
The unified generation framework produces coherent vocals and human motions directly from textual inputs, outperforming specialized single-modality systems and establishing new benchmarks.
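As referenced above, here is a minimal sketch of the vector-quantization step at the core of a VQ-VAE motion tokenizer: continuous encoder features are snapped to their nearest codebook entry, and the resulting indices become the discrete motion tokens fed to the transformer. The codebook size, feature dimension, and names are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn


class MotionVectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup used in a VQ-VAE-style motion tokenizer."""

    def __init__(self, codebook_size=512, code_dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, z):
        # z: (batch, frames, code_dim) continuous features from a motion encoder.
        flat = z.reshape(-1, z.size(-1))                 # (batch*frames, code_dim)
        # Distances to every codebook vector.
        dists = torch.cdist(flat, self.codebook.weight)  # (batch*frames, codebook_size)
        indices = dists.argmin(dim=-1)                   # discrete motion tokens
        quantized = self.codebook(indices).view_as(z)
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[:-1])


# Usage: tokenize a dummy clip of encoded motion features.
vq = MotionVectorQuantizer()
features = torch.randn(1, 60, 256)   # (batch, frames, code_dim)
quantized, tokens = vq(features)
print(tokens.shape)                  # torch.Size([1, 60]) integer motion tokens
```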