Emu: The Most Advanced Next-Generation Image Model From Meta
Emu from Meta outperforms the current state-of-the-art image generation models.
There is a new image generation model in town, and it pretty much blows everything out of the water. Expressive Media Universe, or Emu for short, is Meta’s next-generation image model. We have seen quite a few image generation models such as DALL-E, Midjourney and Stable Diffusion, but based on the early benchmarks, Meta’s model looks like the most advanced image model yet.
During Connect 2023, Mark Zuckerberg talked briefly about Emu, but once you dig into the details, you start to realise Meta has not only entered the image generation space but has come in with a big bang. In the research paper published by Meta, we get the technical details and a comparison with the state-of-the-art image models currently available, mostly Stable Diffusion XL (SDXL).
Quality-tuning Approach
The pre-trained model is a Latent Diffusion Model (LDM) trained on 1.1 billion image-text pairs, but what makes Emu truly incredible is what Meta did during the fine-tuning stage, or as they like to call it, ‘quality-tuning’.
Automatic filtering was first applied to the more than one billion images, reducing them to hundreds of millions. The next step, human filtering, was done using a two-stage approach. The first stage used generalist annotators, who reduced the pool to around 20K images. The second stage used specialist annotators, experts in image quality, who focused mainly on the aesthetics of the images and on how faithful the images are to a given prompt.
This approach resulted in the 2,000 highest-quality images being selected for quality-tuning. That is a very low number of images, because the team behind Emu put quality over quantity, which, as can be seen below, resulted in a massive difference between the pre-trained model and the quality-tuned Emu model.
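To make the funnel concrete, here is a minimal Python sketch of such a staged filter. Every name and threshold here (ImageSample, aesthetic_score, the keep_top cut-offs) is an illustrative assumption, not Meta’s actual pipeline; only the rough stage sizes come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class ImageSample:
    image_path: str
    caption: str
    aesthetic_score: float = 0.0   # filled in by automatic scoring
    annotator_rating: float = 0.0  # filled in by human review

def automatic_filter(samples: Iterable[ImageSample],
                     min_aesthetic: float = 0.9) -> List[ImageSample]:
    """Stage 0: automatic filtering, reducing ~1B pairs to hundreds of millions."""
    return [s for s in samples if s.aesthetic_score >= min_aesthetic]

def human_filter(samples: List[ImageSample],
                 rate: Callable[[ImageSample], float],
                 keep_top: int) -> List[ImageSample]:
    """Human annotators rate images; keep only the top-rated ones."""
    rated = sorted(samples, key=rate, reverse=True)
    return rated[:keep_top]

def build_quality_tuning_set(candidates: Iterable[ImageSample],
                             generalist_rate: Callable[[ImageSample], float],
                             specialist_rate: Callable[[ImageSample], float]) -> List[ImageSample]:
    pool = automatic_filter(candidates)                           # billions -> hundreds of millions
    pool = human_filter(pool, generalist_rate, keep_top=20_000)   # generalist annotators -> ~20K
    pool = human_filter(pool, specialist_rate, keep_top=2_000)    # specialist annotators -> 2K
    return pool
```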
Emu Technical Details
Below are some of the technical aspects of the Emu model:
- The model is pre-trained on 1.1 billion image-text pairs, with the bulk of the optimisation happening at the pre-training stage, and is then fine-tuned with a few thousand carefully selected high-quality images.
- The quality-tuning approach involves human filtering and an early-stopping mechanism to prevent overfitting: training stops at around 15K iterations even though the loss is still decreasing (a minimal sketch of the stopping rule follows this list).
- The model generates detailed 1024×1024 images using the common latent diffusion architecture, in which an autoencoder compresses an image into a compact latent representation and reconstructs it accurately, so the diffusion process can operate in that latent space.
- The model is trained with progressively increasing resolutions, resembling the approach used in progressively growing GANs.
- The autoencoder originally mapped the RGB channels to four latent channels, but increasing the latent channel count to 16 significantly improved reconstruction; 32 channels were also tried, but the gains over 16 were not significant (see the autoencoder sketch after this list).
- The image generation time of Emu is five seconds.
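To illustrate the early-stopping point from the list above, here is a minimal sketch of a quality-tuning loop with a hard cap on update steps. The function and loader names are hypothetical; only the roughly 15K-iteration cap comes from the description above.

```python
# Assumed cap, per the reported ~15K quality-tuning iterations.
MAX_QUALITY_TUNING_STEPS = 15_000

def quality_tune(model, curated_loader, optimizer, loss_fn):
    """Fine-tune on the small curated set, but stop at a fixed number of updates
    to avoid overfitting, even if the training loss is still going down."""
    step = 0
    while step < MAX_QUALITY_TUNING_STEPS:
        for batch in curated_loader:
            if step >= MAX_QUALITY_TUNING_STEPS:
                return model          # stop early despite the decreasing loss
            loss = loss_fn(model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
    return model
```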
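The wider latent space is easiest to see in terms of tensor shapes. Below is a toy PyTorch sketch, not Meta’s released code, of an autoencoder that maps a 1024×1024 RGB image to 16 latent channels and back; the layer choices and the 8× spatial downsampling factor are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LatentAutoencoder(nn.Module):
    """Toy autoencoder: 3-channel RGB image -> 16-channel latent -> RGB image.
    Real LDM autoencoders use deep ResNet blocks and a KL/VQ regulariser;
    this sketch only shows the channel and resolution bookkeeping."""

    def __init__(self, latent_channels: int = 16, downsample: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=downsample, stride=downsample),  # 1024 -> 128
            nn.SiLU(),
            nn.Conv2d(64, latent_channels, kernel_size=1),                 # 16 latent channels
        )
        self.decoder = nn.Sequential(
            nn.Conv2d(latent_channels, 64, kernel_size=1),
            nn.SiLU(),
            nn.ConvTranspose2d(64, 3, kernel_size=downsample, stride=downsample),  # 128 -> 1024
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.encoder(x)           # diffusion would run in this latent space
        return self.decoder(z)

x = torch.randn(1, 3, 1024, 1024)     # one 1024x1024 RGB image
model = LatentAutoencoder(latent_channels=16)
print(model.encoder(x).shape)          # torch.Size([1, 16, 128, 128])
print(model(x).shape)                  # torch.Size([1, 3, 1024, 1024])
```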
Meta has been a pioneer in image segmentation with its Segment Anything Model (SAM), so it was only a matter of time before it rolled out an image generation model. Compared to current state-of-the-art models such as Stable Diffusion, images generated by Emu are preferred 68.4% of the time.
Emu is going to be rolled out as part of the Meta AI product announcement. It will be directly incorporated into various Meta products such as Facebook, Instagram and WhatsApp.