The Large Language and Vision Assistant (LLaVA) is a cutting-edge Large Multimodal Model (LMM) introduced earlier this year. Like many LMMs, its goal is to achieve human-level proficiency in both language and vision. While no LMM has reached this pinnacle yet, GPT-4 stands out as the current state of the art, particularly with its recent enhancements in vision and audio processing. However, there's a new contender on the block: LLaVA 1.5. It's garnering significant attention as a top-tier competitor to GPT-4, especially given that it is open access and completely free to the public.
At its core, LLaVA is an open-source chatbot that starts from the LLaMA/Vicuna model and is then fine-tuned on multimodal instruction-following data generated by GPT-4. The novel approach here, which I believe will be a big part of future generations of models, is that the fine-tuning data is generated by another model, in this case GPT-4. In fact, this was the major breakthrough of their initial paper, titled Visual Instruction Tuning.
The original LLaVA paper was published in April 2023, and fast forward a couple of months and we have a major upgrade to the original model. LLaVA 1.5 is a 13-billion-parameter model trained in September 2023. The new version not only surpasses the original model but in some aspects is on par with GPT-4. LLaVA 1.5 is currently available on Hugging Face for download.
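As a quick illustration of what interacting with the model looks like, here is a minimal sketch of the Vicuna-style conversation format that LLaVA checkpoints typically expect, with an `<image>` placeholder where the vision tokens are injected. This is an illustrative assumption about the template, not the model's canonical API; check the Hugging Face model card for the exact format of the checkpoint you download.

```python
# Sketch of the Vicuna-style chat template commonly used by LLaVA.
# The exact template can vary between checkpoints; treat this as an
# assumption for illustration, not a definitive specification.

def build_llava_prompt(question: str) -> str:
    """Wrap a user question in a LLaVA-style prompt, with an
    <image> placeholder marking where the image features go."""
    return f"USER: <image>\n{question} ASSISTANT:"

prompt = build_llava_prompt("What is shown in this image?")
print(prompt)
```

The resulting string (together with the raw image) would then be handed to the model's processor before generation.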
Below are the technical specifications of LLaVA vs LLaVA 1.5.
| | LLaVA | LLaVA 1.5 |
|---|---|---|
| Base model | LLaMA/Vicuna (with modifications) | LLaMA/Vicuna (with modifications) |
| Instruction data | GPT-generated multimodal instruction-following data | Enhanced with academic-task-oriented VQA data and response formatting prompts |
| Vision encoder | Standard vision encoder with a fully-connected vision-language cross-modal connector, which is powerful and data-efficient | Upgraded to CLIP-ViT-L-336px with an MLP projection |
| Benchmark results | Set a new state-of-the-art accuracy on Science QA | Achieved state-of-the-art across 11 benchmarks |
| Training cost | — | Final 13B checkpoint uses only 1.2M publicly available data and finishes training in ~1 day on a single 8-A100 node |
Based on my initial testing, LLaVA 1.5 is an incredible chatbot with vision capabilities. It is currently the best free LMM available to the public. In fact, if you have never used ChatGPT Plus, this is an excellent free alternative.
As can be seen from the image above, when asked to explain an image, LLaVA 1.5 does an incredible job, sometimes even providing more detail than ChatGPT (although ChatGPT can provide extensive detail if asked to elaborate). However, when asked to explain the two jokes featuring Drake and a couch in the images below, it is not able to explain them correctly, while ChatGPT can.
Impressively, LLaVA 1.5 achieves state-of-the-art ranking in 11/12 benchmarks. What's even more remarkable is that it achieved this with a simpler architecture, less training data, and only publicly available datasets compared to other methods. This challenges the belief that such models need extensive vision-language alignment training. In fact, LLaVA 1.5 even outperformed models with billions of trainable parameters. This is not just an incredible achievement but also a testament to the overall pace at which AI is developing.
Base Model and Fine-tuning: As mentioned earlier, LLaVA-1.5 started with the LLaMA/Vicuna model and was subsequently fine-tuned on GPT-generated multimodal instruction-following data. This data involves both language and visuals, allowing the model to understand and follow instructions that merge these two modalities.
- Image-Text Pairs: LLaVA-1.5 was trained on 558K image-text pairs sourced from LAION/CC/SBU. These pairs were captioned using the BLIP model.
- GPT-Generated Data: The model also utilized 158K pieces of instruction-following data that were generated by GPT. This data is multimodal, meaning it combines both text and images.
- Academic-Task-Oriented VQA Data: LLaVA-1.5 was further trained on 450K pieces of data designed for Visual Question Answering (VQA) tasks that are academic in nature.
- ShareGPT Data: An additional 40K conversations from ShareGPT were included in the training process.
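The four data sources above add up to roughly the 1.2M publicly available samples cited for the final 13B checkpoint. A quick sanity check (my arithmetic, not the paper's):

```python
# Approximate sizes of LLaVA-1.5's training data mix, in thousands
# of samples, as reported above.
data_mix_k = {
    "LAION/CC/SBU image-text pairs (BLIP-captioned)": 558,
    "GPT-generated instruction-following data": 158,
    "Academic-task-oriented VQA data": 450,
    "ShareGPT conversations": 40,
}

total_k = sum(data_mix_k.values())
print(f"Total: ~{total_k}K samples (~{total_k / 1000:.1f}M)")
```

The total comes out to about 1.2M samples, consistent with the reported size of the training set.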
Training Duration and Hardware:
- Due to the increased image resolution processed (336 pixels), the training took approximately twice as long as the original model.
- The pretraining phase lasted about 6 hours.
- The visual instruction tuning phase took around 20 hours.
- The entire training was conducted using eight A100 GPUs, which are high-performance graphics processing units designed for compute-intensive tasks.
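Taken together, these figures imply a strikingly modest compute budget for a 13B multimodal model. A rough tally (my back-of-the-envelope arithmetic based on the numbers above):

```python
# Rough training-cost estimate from the reported figures:
# 6 h pretraining + 20 h visual instruction tuning on 8x A100.
pretrain_hours = 6
finetune_hours = 20
num_gpus = 8

wall_clock_hours = pretrain_hours + finetune_hours  # total wall-clock time
gpu_hours = wall_clock_hours * num_gpus             # aggregate A100 GPU-hours
print(f"~{wall_clock_hours} h wall clock, ~{gpu_hours} A100 GPU-hours")
```

That works out to roughly a day of wall-clock time and on the order of 200 A100 GPU-hours, which matches the paper's "~1 day on a single 8-A100 node" claim.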
Competition is always good for the end user, so seeing models such as LLaVA made free for the public to use is going to spur more growth in the world of AI. ChatGPT now has a serious competitor, and I, for one, can't wait to see what's coming next. Go ahead and give LLaVA a try: it's completely free!
Interested in Learning More?
Check out our comprehensive courses to take your knowledge to the next level!