We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, in a single pass through the model. This contrasts with existing video models, which synthesize distant keyframes and then apply temporal super-resolution, an approach that inherently makes global temporal consistency difficult to achieve.
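The core idea of processing the full clip in one pass can be illustrated with a minimal sketch of space-time downsampling, the operation a Space-Time U-Net applies in its contracting path. The function below is a hypothetical stand-in (not the paper's implementation): it average-pools a video tensor jointly over time and space, so that the deepest levels of such a network see a compact representation of the *entire* clip, giving the model a global temporal receptive field.

```python
import numpy as np

def pool_space_time(x, ft=2, fs=2):
    """Average-pool a video tensor of shape (T, H, W) by a factor
    of `ft` in time and `fs` in each spatial dimension."""
    T, H, W = x.shape
    return x.reshape(T // ft, ft, H // fs, fs, W // fs, fs).mean(axis=(1, 3, 5))

# A toy 16-frame, 8x8 "video".
video = np.arange(16 * 8 * 8, dtype=float).reshape(16, 8, 8)

# Two space-time pooling stages, as in a U-Net contracting path:
# the whole temporal extent is compressed, not split into keyframes.
coarse = pool_space_time(pool_space_time(video))
print(coarse.shape)  # (4, 2, 2)
```

Because the temporal axis is downsampled alongside the spatial axes rather than handled by a separate keyframe-then-interpolate cascade, every output frame is produced jointly, which is what makes globally consistent motion easier to enforce.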