A neural network from Google’s DeepMind has shown that it can generate short videos from a single image frame, and it’s worth understanding how it works.
As DeepMind noted on Twitter, the artificial intelligence model, named “Transframer” — a riff on the “transformer,” a common AI architecture that generates output, such as text, from partial prompts — “excels in video prediction and view synthesis,” and is able to “generate 30 [second] videos from a single image.”
Transframer is a general-purpose generative framework that can handle many image and video tasks in a probabilistic setting. New work shows it excels in video prediction and view synthesis, and can generate 30s videos from a single image: https://t.co/wX3nrrYEEa 1/ pic.twitter.com/gQk6f9nZyg
— DeepMind (@DeepMind) August 15, 2022
As the Transframer website notes, the AI generates its perspective videos by predicting target images from “context images” — in brief, by correctly predicting what one of the chairs below would look like from different views and angles, drawing on extensive training data that lets it “imagine” an object from another perspective. Transframer also unifies a wide array of tasks, including image segmentation, view synthesis, and video interpolation.
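The conditioning idea described above can be caricatured in a few lines of code: a predictor receives context images annotated with their viewpoints and guesses the image for a new, unseen viewpoint. Everything here — the function name, the proximity-weighted averaging, the flat-list “images” — is an illustrative stand-in, not DeepMind’s actual method, which uses a learned probabilistic model over compressed image codes.

```python
import math

def predict_target_view(contexts, target_angle):
    """Toy view-synthesis sketch (not Transframer itself).

    contexts: list of (angle_degrees, image) pairs, each image a flat
    list of floats. Returns a guessed image for target_angle by
    averaging the context images, weighted by viewpoint proximity.
    """
    # Closer viewpoints get exponentially more weight.
    weights = [math.exp(-abs(angle - target_angle) / 30.0)
               for angle, _ in contexts]
    total = sum(weights)
    size = len(contexts[0][1])
    return [
        sum(w * img[i] for w, (_, img) in zip(weights, contexts)) / total
        for i in range(size)
    ]

# Two context "images" of the same scene, seen from 0 and 90 degrees.
contexts = [(0.0, [1.0, 0.0, 0.0]), (90.0, [0.0, 0.0, 1.0])]
guess = predict_target_view(contexts, 45.0)  # an even blend of both views
```

A real model, of course, learns what a chair looks like from other angles rather than blending pixels, but the input/output shape — annotated context frames in, a novel view out — is the same.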
Transframer performs well on several video generation benchmarks. The research team reports that it is state-of-the-art on these benchmarks, competitive on few-shot view synthesis, and able to produce coherent 30-second videos from a single image.
The model is quite impressive: it appears to apply artificial depth perception and perspective to render what a scene would look like if someone were to “move” around it, raising the possibility of entire video games built on machine learning techniques instead of conventional rendering.
The proposed model also yielded promising results on eight tasks, including image classification, semantic segmentation, and optical flow prediction with no task-specific architectural components.
Transframer can also be applied to tasks that require learning conditional structure from text or a single image, spanning video prediction and generation, novel view synthesis, and multi-task vision.