
LLM Match-up: Gemini 1.0 Pro vs GPT-4

Welcome to the second edition of the Kmeleon Model Tournament, where we compare the performance of leading large language models (LLMs) and their embedding counterparts across various industry-specific challenges. Our goal remains to uncover deep insights into each model's capabilities and practical applications. This time, our spotlight turns to a match-up between OpenAI's GPT-4 Turbo and Embeddings V3 Large and Google's Gemini 1.0 Pro and Embedding-001. You can find the most recent results for each model on our online leaderboard.

Second Match: GPT-4 Turbo vs. Gemini 1.0 Pro 

GPT-4 Turbo (the 0125 version) continues to be OpenAI's flagship model. Gemini 1.0 was unveiled in December 2023 as Google's most advanced and capable AI model to date. According to Google, its Ultra variant was the first model to outperform human experts on the Massive Multitask Language Understanding (MMLU) benchmark, showcasing strong reasoning across diverse subjects from mathematics to ethics. The 1.0 Pro version, which we test in this match, is the only generally available version in the Gemini API. The 1.0 Ultra and 1.5 Pro versions, which we will test later in this series, are available only to a limited number of users.

The embedding models to be tested are:

- OpenAI Embeddings V3 Large (text-embedding-3-large)
- Google Embedding-001 (embedding-001)

Results 

We compared the models using our standardized financial data analysis test, whose definition can be found here. The following table shows the average results for both models; the winner is determined by Human-Reviewed Accuracy.
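To make the cosine-similarity metric concrete, here is a minimal sketch of how a single question could be scored. It assumes the official openai Python package (v1+) with an OPENAI_API_KEY environment variable; the sample strings and helper names (embed, cosine_similarity) are illustrative and not part of our actual evaluation harness. Google's embedding-001 can be swapped in via the google-generativeai package.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    """Embed a string with OpenAI's Embeddings V3 Large (text-embedding-3-large)."""
    response = client.embeddings.create(model="text-embedding-3-large", input=text)
    return np.array(response.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One evaluated question: the model's answer vs. the reference ("ground truth") answer.
model_answer = "Net revenue grew 12% year over year, driven by subscription sales."
reference_answer = "Revenue increased 12% compared with the prior year."

similarity = cosine_similarity(embed(model_answer), embed(reference_answer))

# Human-Reviewed Accuracy is a separate 0/1 judgment made by a reviewer for each
# question; both metrics are then averaged across the whole question set.
human_correct = 1
print(f"cosine similarity: {similarity:.2f}, human-reviewed accuracy: {human_correct}")
```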

The following images show sample questions and answers from the actual conversation with the models: 

Figure 1. Successful responses from both models.

Figure 2. Examples of failed responses from Gemini 1.0 Pro.

Global Question Visualization 

The following figures show the Human Evaluation and Response Time results for each question in the dataset.

Figure 3. Human Evaluation and Response Time results.

Our Conclusion 

In the head-to-head comparison between OpenAI's GPT-4 Turbo and Google's Gemini 1.0 Pro, the results reveal a striking performance gap in favor of GPT-4 Turbo across the average accuracy and cosine similarity metrics. Despite GPT-4 Turbo taking longer on average to generate responses, its significantly higher accuracy (0.76 vs. 0.36) and cosine similarity (0.78 vs. 0.54) scores indicate a deeper understanding of the input and a closer match to the expected outcomes, highlighting its superior comprehension and contextual alignment capabilities. 

The pronounced underperformance of Gemini 1.0 Pro suggests that while it excels in speed (with an average response time of just 3.35 seconds), it may struggle with the complexity and nuance required in real-world analytics workflows built on retrieval-augmented generation (RAG). This discrepancy underscores the importance of optimizing for more than response-time efficiency; understanding and contextual relevance are crucial for delivering valuable and actionable insights.
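For context, the sketch below illustrates the retrieval step of a simple RAG pipeline over financial documents; the chunk texts, vectors, and helper names are illustrative placeholders rather than our benchmark code. The quality of the final answer depends on both the retrieved context and the LLM's reasoning over it, which is why we evaluate embedding models and LLMs as pairs in this series.

```python
import numpy as np

# Toy document chunks with pre-computed embeddings; in practice the vectors would
# come from one of the embedding models tested above (e.g. text-embedding-3-large).
chunks = [
    {"text": "Q4 revenue was $1.2B, up 8% year over year.", "vec": np.array([0.9, 0.1, 0.2])},
    {"text": "Operating margin declined to 14% due to higher costs.", "vec": np.array([0.2, 0.8, 0.3])},
]

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, k: int = 1) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to the query embedding."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

# Embedding of the user question "How did revenue change in Q4?" (hard-coded here).
query_vec = np.array([0.85, 0.15, 0.25])
context = "\n".join(c["text"] for c in retrieve(query_vec))

# The retrieved context is prepended to the question and sent to the LLM
# (GPT-4 Turbo or Gemini 1.0 Pro), which must answer using only that context.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: How did revenue change in Q4?"
print(prompt)
```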

 

Future Plans 

Thank you to everyone who has followed this comparison. We appreciate your interest and look forward to bringing you more analyses in our LLM Match-Up series. Stay tuned for future updates as we continue to explore the evolving world of AI models.  

To learn more about using generative AI technologies in your business, check out Kmeleon's website. See how our generative AI expertise can help your company grow in a fast-changing tech landscape, ensuring your projects succeed and lead with new ideas.

Make sure to check out the upcoming YouTube video about this research, and don’t forget to follow us on our social media channels, GitHub, and YouTube.
