To model motion dynamics in video, we extend a conventional 2D (image) diffusion model to 3D by adding a temporal dimension. This means modifying the architecture so that it processes information across frames as well as within each frame, giving the model a coherent view of motion over time.
The proposed method has two parts: the existing model is inflated so that it accepts 3D (video) inputs, and a new sub-module is added specifically to exchange temporal information between frames.
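The two parts described above can be illustrated with a minimal NumPy sketch. It is not the method's actual implementation, and it rests on assumptions the text does not confirm: that inflation is done by folding the time axis into the batch axis so the 2D layers run unchanged on every frame, and that the temporal sub-module is a self-attention layer applied along the frame axis at each spatial location. The function names, the `(B, T, C, H, W)` layout, and the weight matrices `w_q`, `w_k`, `w_v` are all hypothetical.

```python
import numpy as np


def apply_spatial_per_frame(video, spatial_fn):
    """Run a 2D (per-image) layer on every frame of a video.

    Assumed layout: video is (B, T, C, H, W). Time is folded into the
    batch axis, so spatial_fn only ever sees (B*T, C, H, W) images.
    """
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)
    out = spatial_fn(frames)
    return out.reshape(b, t, *out.shape[1:])


def temporal_self_attention(video, w_q, w_k, w_v):
    """Hypothetical temporal sub-module: attention along the frame axis.

    Each spatial location attends over its own T frames, which lets
    information flow between frames. w_q, w_k, w_v are (C, C) matrices.
    """
    b, t, c, h, w = video.shape
    # (B, T, C, H, W) -> (B, H, W, T, C) -> (B*H*W, T, C)
    tokens = video.transpose(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v

    # Scaled dot-product attention over the T frames.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(c)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    mixed = weights @ v  # (B*H*W, T, C)

    # Restore the (B, T, C, H, W) layout and add a residual connection.
    mixed = mixed.reshape(b, h, w, t, c).transpose(0, 3, 4, 1, 2)
    return video + mixed


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip = rng.standard_normal((1, 8, 4, 16, 16))  # 8 frames, 4 channels
    w = rng.standard_normal((4, 4)) * 0.1
    spatial = apply_spatial_per_frame(clip, lambda x: x * 0.5)
    print(temporal_self_attention(spatial, w, w, w).shape)
```

Because of the residual connection, setting `w_v` to zeros makes the temporal layer an identity, so in this sketch the inflated model starts out behaving exactly like the per-frame image model. That is one common design choice, not something the text specifies.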