Google Responds to Meta’s Video-creation AI With Imagen Video
October 06, 2022 By Monica Green
(Image credit: Google)
Google, not to be outdone by Meta's Make-A-Video, today detailed its work on Imagen Video, an AI system that can generate video clips based on a text prompt (e.g. "a teddy bear washing dishes"). While the results aren't perfect (the system's looping clips have artifacts and noise), Google claims that Imagen Video is a step toward a system with a "high degree of controllability" and world knowledge, including the ability to generate footage in a variety of artistic styles.
Text-to-video systems, Devin Coldewey pointed out in his piece about Make-A-Video, aren't new. CogVideo, developed by Tsinghua University and the Beijing Academy of Artificial Intelligence, was released earlier this year. It can translate text into reasonably high-fidelity short clips.
Imagen Video generates video using diffusion, the same technique behind image generators like OpenAI's DALL-E 2 and Stable Diffusion. A diffusion model learns to generate new data by "destroying" existing samples with noise and then "recovering" them. As it's fed more data, the model gets better at recovering the data it previously destroyed, and it can eventually create new works from noise alone.
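To make the "destroy and recover" idea concrete, here is a minimal toy sketch (not Google's code) of the forward, destructive half of a diffusion process: data is progressively corrupted with Gaussian noise on a fixed schedule, and a model would then be trained to predict that noise so it can reverse the process. The schedule values and array sizes below are illustrative assumptions, not Imagen Video's actual parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noise schedule: beta_t grows, so each step destroys more signal.
T = 100
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)  # fraction of original signal surviving at step t

def forward_noise(x0, t):
    """'Destroy' step: sample x_t directly from x_0 using the closed form
    q(x_t | x_0) = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * noise."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# A denoising network would be trained to predict eps from (xt, t);
# generation then runs the chain in reverse, starting from pure noise.
x0 = np.ones(8)                      # stand-in for an image/video frame
xt_early, _ = forward_noise(x0, 5)   # early step: mostly signal
xt_late, _ = forward_noise(x0, 95)   # late step: mostly noise
print("signal remaining:", np.sqrt(alphas_bar[5]), np.sqrt(alphas_bar[95]))
```

The key property the sketch shows is that the surviving signal fraction shrinks monotonically, which is what lets a trained denoiser generate entirely new samples by walking back from random noise.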
Imagen Video can produce high-definition 1280×768 video at 24 frames per second. It was trained in part on the publicly available LAION-400M image-text dataset, which was also used to train Stable Diffusion, and it can render drone flythroughs and show objects from different angles without distorting them.
Alongside Imagen Video, Google detailed Phenaki, a second text-to-video system that can turn long, detailed, paragraph-length prompts into films two minutes or longer, of effectively arbitrary length. Sample subjects range from a motorcycle rider to an alien spaceship flying over a futuristic city. Phenaki's clips have the same flaws as Imagen Video's, but it's striking how closely they track their text descriptions.
Returning to Imagen Video, the researchers point out that the data used to train the system contained problematic content, which could lead to Imagen Video producing graphically violent or sexually explicit clips. Google says it will not release the Imagen Video model or source code "until these concerns are addressed," and, unlike Meta, it will not provide a sign-up form to register interest.
Nonetheless, with text-to-video technology advancing at such a rapid pace, it may not be long before an open-source model emerges, supercharging human creativity while posing thorny challenges around deepfakes, copyright, and misinformation.