Emu Video: Meta’s Most Advanced Video Generation Model (Text-to-Video)
– Meta has released their most advanced video generation model called Emu Video.
In the competitive world of generative AI, top tech companies are racing to release state-of-the-art models. We have incredible image-generation models, but video generation still lags behind. This is where Meta has stepped in with their most advanced model yet. Meta has introduced the Emu Video and Emu Edit models, setting new benchmarks in text-to-video generation and image editing.
Just a few months back at Connect 2023, Meta introduced their Emu (Expressive Media Universe) model, a cutting-edge image generation model that rivaled others like DALL-E 3, Stable Diffusion, and Midjourney. Now they have taken a significant leap forward with the launch of Emu Video and Emu Edit. These new additions mark major advancements in Meta’s capabilities and go to show the pace of progress in the world of AI.
Emu Video Comparison
Let’s start off by saying that Emu Video is the most advanced video generation model currently available for anyone to try. It beats both open- and closed-source models by a wide margin, at least according to Meta’s research paper.
Below is a table showing the Emu Video model’s performance, based on human evaluation, compared to other text-to-video models.
| Compared Model | Quality (Emu Video Win Rate %) | Text Faithfulness (Emu Video Win Rate %) |
|---|---|---|
| Pika Labs | 96.9 | 98.5 |
| Gen2 | 87.7 | 78.5 |
| CogVideo | 100.0 | 100.0 |
| Reuse & Diffuse | 87.0 | 95.7 |
| PYOCO | 81.1 | 90.5 |
| Align Your Latents | 97.0 | 92.3 |
| Imagen Video | 56.4 | 81.8 |
| Make-A-Video | 96.8 | 85.1 |
From the table, we can clearly see that Emu Video outperforms almost all video generation models currently available. Keep in mind that some of these are commercial models, while Emu Video will be available for free.
Let’s look at some of the technical details behind the Emu Video model.
Emu Video Model Technical Specifications
Emu Video is a diffusion model built on top of the previously released Emu model. Diffusion models, which have gained significant popularity in the past year, start with ‘noise’ and slowly refine this noise through a series of iterative steps, gradually transforming it into a coherent and detailed image or video frame. This process is guided by learned data patterns, allowing for the generation of high-quality visual content that closely aligns with specified textual prompts. Emu Video uniquely extends this approach, not only creating still images but also generating dynamic video content, leveraging the strength of diffusion models in capturing intricate details and textures.
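To make the idea concrete, here is a minimal sketch of DDPM-style diffusion sampling in Python: a network repeatedly predicts the noise in the current tensor and the sampler removes it step by step, guided by a text embedding. The `denoiser` callable, the linear noise schedule, and the tensor shapes are illustrative assumptions, not Emu Video’s actual implementation.

```python
import torch

def sample(denoiser, text_emb, steps=50, shape=(1, 4, 64, 64)):
    """Minimal DDPM-style sampling sketch: start from pure noise and iteratively
    refine it, guided by a text embedding. `denoiser(x, t, text_emb)` is assumed
    to predict the noise contained in x at timestep t (hypothetical interface)."""
    betas = torch.linspace(1e-4, 0.02, steps)      # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                         # begin from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t, text_emb)             # predicted noise at this step
        # Remove the predicted noise (DDPM posterior mean update).
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                  # re-inject a little noise except at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                       # denoised latent/image aligned with the prompt
```

In a real text-to-image or text-to-video system, `x` would be a latent decoded by a separate decoder, but the iterative refine-from-noise loop is the core idea described above.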
Emu Video’s training involves a unique, efficient multi-stage process:
- Multi-Resolution Training: It begins with training on simpler, lower-resolution videos (256px, 8fps, 1-second long), which reduces the complexity and speeds up the process.
- Progression to Higher Resolution: After the initial phase, the model is trained on higher-resolution (512px) videos at 4fps for 2 seconds, adjusting for more detailed and refined output.
- Finetuning for Motion: The model undergoes finetuning on a selected subset of high-motion videos, enhancing the quality and dynamics of motion in the generated videos.
- Interpolation Model Training: Additionally, an interpolation model is used to increase the number of output frames, further improving the smoothness of the final videos.
This staged and specialised training approach allows Emu Video to efficiently produce high-quality, text-aligned video content.
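As a rough illustration, that staged recipe can be expressed as a list of training configurations that a driver loops over. The `Stage` fields mirror the values listed above; the `train_one_stage` helper and the dataset names are hypothetical placeholders, not Meta’s code.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    resolution: int   # spatial resolution in pixels
    fps: int          # frames per second of the training clips
    seconds: float    # clip duration
    data: str         # which subset of videos to train on

# Mirrors the multi-stage recipe described above.
SCHEDULE = [
    Stage("low-res pretraining", resolution=256, fps=8, seconds=1.0, data="all_videos"),
    Stage("high-res training",   resolution=512, fps=4, seconds=2.0, data="all_videos"),
    Stage("motion finetuning",   resolution=512, fps=4, seconds=2.0, data="high_motion_subset"),
]

def run_training(model, train_one_stage):
    """`train_one_stage(model, stage)` is a hypothetical helper that trains the
    model on clips sampled at the stage's resolution, fps and duration."""
    for stage in SCHEDULE:
        print(f"Stage: {stage.name} ({stage.resolution}px, {stage.fps}fps, {stage.seconds}s)")
        train_one_stage(model, stage)
    return model
```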
Now let’s take a look at how the video is actually generated.
| Stage | Description |
|---|---|
| 1. Initiation with Text Prompt | Starts with a descriptive text prompt setting the thematic stage for the video. |
| 2. Image Synthesis from Noise | Uses diffusion models to generate a detailed still image from random noise, guided by the text prompt. |
| 3. Preparation for Temporal Extension | Prepares for video creation by concatenating the initial image with a binary mask, outlining static and dynamic regions across frames. |
| 4. Video Creation via Sequential Framing | Sequentially generates video frames using a second diffusion process, ensuring continuity and relevance from the initial image and text. |
One of the most important stages is the third stage, “Preparation for Temporal Extension”. In essence, this means adding a time dimension to the initially generated image. The model concatenates the image with a binary mask, which helps to distinguish between the parts of the image that will change over time (dynamic regions) and those that will remain static. This step sets the stage for the subsequent video generation, where the model uses this prepared input to create a sequence of frames, effectively turning the image into a moving video.
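One plausible way to picture that conditioning input is sketched below: the generated image is repeated along a new time axis and concatenated channel-wise with a binary mask that marks which frame is given and which frames are left for the model to generate. The tensor shapes and the single-known-frame mask are illustrative assumptions, not the model’s actual interface.

```python
import torch

def build_video_conditioning(image_latent, num_frames=16):
    """Prepare a conditioning tensor for the video diffusion stage.

    image_latent: (C, H, W) latent of the image generated in stage 2.
    Returns a (num_frames, C + 1, H, W) tensor: the image broadcast across
    time, concatenated with a binary mask that is 1 for the frame where the
    image is actually given and 0 for the frames to be generated.
    """
    c, h, w = image_latent.shape
    # Repeat the still image along a new time dimension.
    frames = image_latent.unsqueeze(0).expand(num_frames, c, h, w)
    # Binary mask: mark the first frame as "known", the rest as "to be generated".
    mask = torch.zeros(num_frames, 1, h, w)
    mask[0] = 1.0
    # Channel-wise concatenation gives the video denoiser both the image and
    # the information about which frames/regions are fixed vs. dynamic.
    return torch.cat([frames, mask], dim=1)

# Example: a 4-channel 64x64 latent expanded to 16 conditioning frames.
cond = build_video_conditioning(torch.randn(4, 64, 64), num_frames=16)
print(cond.shape)  # torch.Size([16, 5, 64, 64])
```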
Design
The research team behind Emu Video implemented a ‘two-step factorization’ approach: first generating an image from the text prompt, then creating the video conditioned on both that image and the text. This method streamlines the task, significantly improving training efficiency compared to older models that attempted the entire process in one step.
Additionally, Emu Video’s training strategy is finely tuned with custom noise schedules and multi-stage training, enabling it to directly produce high-resolution videos. Notably, Emu Video achieves this efficiency with fewer resources than more complex, multi-model systems, making it a more effective solution for generating high-quality videos.
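Putting the pieces together, the two-step factorization can be read as two diffusion passes chained end to end. The sketch below reuses the hypothetical `sample` and `build_video_conditioning` helpers from the earlier snippets and only illustrates the data flow, not Meta’s actual API.

```python
def generate_video(prompt, text_encoder, image_denoiser, video_denoiser, num_frames=16):
    """Two-step ('factorized') generation sketch:
    1) text -> image via one diffusion model,
    2) (text, image) -> video via a second diffusion model.
    All model callables are hypothetical placeholders."""
    text_emb = text_encoder(prompt)

    # Step 1: generate a still image latent from the prompt (see the sampling sketch above).
    image_latent = sample(image_denoiser, text_emb)[0]            # (C, H, W)

    # Step 2: condition the video model on both the text and the image,
    # using the image+mask conditioning prepared earlier.
    conditioning = build_video_conditioning(image_latent, num_frames)
    video_latents = sample(lambda x, t, e: video_denoiser(x, t, e, conditioning),
                           text_emb, shape=(num_frames, 4, 64, 64))
    return video_latents  # a real system would decode these latents into frames
```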
Meta has also released another model called Emu Edit (technical details will be covered in a separate article), which enables image editing. Combining Emu Video and Emu Edit is an incredibly powerful capability that will be available to the public soon. But until then, you can try out the demos.