Meta has introduced Emu Video, a text-to-video model that can generate high-quality videos based on text-only, image-only, or combined text and image inputs.
The model uses a factorized approach that splits the process into two steps: first generating an image conditioned on the text prompt, then generating the video conditioned on both the text and that generated image.
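The two-step pipeline can be sketched as follows. This is an illustrative stand-in, not Meta's actual API: the function names, return types, and the placeholder string outputs are all hypothetical, chosen only to show how stage 2 consumes both the prompt and the stage-1 image.

```python
def generate_image(prompt: str) -> str:
    """Stage 1: text-to-image. Placeholder for the real diffusion model."""
    return f"image<{prompt}>"

def generate_video(prompt: str, image: str, num_frames: int = 16) -> list:
    """Stage 2: video conditioned on BOTH the text prompt and the stage-1 image."""
    return [f"frame{i}<{prompt}|{image}>" for i in range(num_frames)]

def emu_video_pipeline(prompt: str, num_frames: int = 16) -> list:
    image = generate_image(prompt)                     # step 1: image from text
    return generate_video(prompt, image, num_frames)   # step 2: video from text + image

frames = emu_video_pipeline("a corgi surfing", num_frames=4)
```

The factorization means each stage can be trained and evaluated as a simpler conditional generation problem, rather than learning text-to-video in one shot.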
In human evaluations, Emu Video's outputs were preferred over those of Make-A-Video, Meta's previous generative video model, by 96% of respondents for quality and by 85% for faithfulness to the text prompt.