A new framework enables the simultaneous generation of 3D whole-body motions and singing vocals from textual lyrics. It is trained on the RapVerse dataset, which pairs synchronous rap vocals, lyrics, and 3D motions, and it uses autoregressive transformers to generate audio and motion coherently. A current limitation is the focus on rap music, though the framework should adapt to other genres given suitable datasets. Future work targets multi-performer audio and motion generation for applications such as virtual live bands.
This work presents a framework that generates 3D whole-body motions and singing vocals directly from textual lyrics, producing coherent and synchronized output.
The RapVerse dataset combines synchronous rap vocals, lyrics, and 3D body motions, which makes it possible to train autoregressive transformers for joint motion and audio generation.
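To make the autoregressive setup concrete, below is a minimal sketch of text-conditioned joint generation, assuming lyrics, vocal audio, and motion have each been tokenized into discrete IDs (e.g., via a text tokenizer, an audio codec, and a motion VQ-VAE) and concatenated into one sequence modeled by a single decoder-only transformer. All names here (`JointMotionVocalLM`, `generate`, the shared vocabulary size) are hypothetical illustrations, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class JointMotionVocalLM(nn.Module):
    """Decoder-only transformer over a shared lyric/vocal/motion token vocabulary."""
    def __init__(self, vocab_size=2048, d_model=512, n_layers=8, n_heads=8, max_len=4096):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        # ids: (batch, seq) of lyric + vocal + motion token IDs in one stream
        b, t = ids.shape
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        # Causal mask so each position attends only to earlier tokens
        mask = torch.triu(torch.full((t, t), float("-inf"), device=ids.device), diagonal=1)
        h = self.blocks(x, mask=mask)
        return self.head(h)  # next-token logits over the shared vocabulary

@torch.no_grad()
def generate(model, lyric_ids, n_new=256, temperature=1.0):
    """Autoregressively extend lyric tokens with vocal/motion tokens."""
    ids = lyric_ids
    for _ in range(n_new):
        logits = model(ids)[:, -1] / temperature
        nxt = torch.multinomial(logits.softmax(-1), num_samples=1)
        ids = torch.cat([ids, nxt], dim=1)
    return ids
```

Because vocal and motion tokens are sampled from one causal model conditioned on the same lyric prefix, the two modalities stay temporally aligned by construction; how the modality streams are interleaved or separated within the sequence is a design choice the sketch leaves open.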
[Figure: Collection]