Researchers from Google DeepMind have recently described a new approach for teaching intelligent agents to solve complex, long-term tasks by training them exclusively on video footage rather than through direct interaction with the environment. Their new agent, called Dreamer 4, demonstrated the ability to mine diamonds in Minecraft after being trained on videos, without ever actually playing the game. The researchers dubbed their approach imagination training to emphasize that the agent learns solely from offline data, without any interaction with the physical world.
"Traditional video methods [are a] brute-force approach to pixel generation, where you're trying to squeeze motion in a couple of frames to create the illusion of movement, but the model actually doesn't really know or reason about what's going on in that scene." Previous video-generation models had physics that were unlike the real world, he added, which general-purpose world model systems help to address.