LLM Match-up: Mistral vs GPT-4
Welcome to the first edition of the Kmeleon Model Tournament, where our research team will determine which model performs best across a spectrum of challenges. Our aim is to provide practical insights into each model's applications in industry-specific scenarios. In this first match-up we have selected two of the most relevant LLMs today: GPT-4 Turbo and Mistral Large, along with their respective embedding models, text-embedding-3-large and mistral-embed. You can find the most recent results for each model on our online leaderboard.
First Match: GPT-4 Turbo vs. Mistral Large
These two models are the flagship options from OpenAI and Mistral. GPT-4 0125 is the latest version of the GPT-4 Turbo model, an enhanced iteration of the industry-leading GPT-4, released in January this year. Meanwhile, Mistral Large claims to be the world's second-ranked model generally available through an API, according to its official announcement.
Embedding models also play a crucial role when testing a RAG implementation. Each LLM will be tested with its provider's most recent embedding model: GPT-4 Turbo with text-embedding-3-large, and Mistral Large with mistral-embed.
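As a rough illustration of how these pairings feed into a RAG evaluation, here is a minimal Python sketch. The model identifiers are the publicly documented ones we believe correspond to these releases, and the `run_rag_eval` harness is a hypothetical placeholder rather than our actual benchmark code.

```python
# Illustrative pairing of each contender LLM with its provider's embedding model.
from dataclasses import dataclass

@dataclass
class Contender:
    name: str
    chat_model: str       # model id passed to the provider's chat API
    embedding_model: str  # model id passed to the provider's embeddings API

CONTENDERS = [
    # Assumed public model identifiers for the versions discussed in this post.
    Contender("GPT-4 Turbo", "gpt-4-0125-preview", "text-embedding-3-large"),
    Contender("Mistral Large", "mistral-large-latest", "mistral-embed"),
]

def run_rag_eval(contender: Contender, questions: list[str]) -> list[dict]:
    """Hypothetical harness: embed the document corpus with the contender's
    embedding model, retrieve context for each question, answer with its chat
    model, and record accuracy plus response time per question."""
    raise NotImplementedError  # provider-specific wiring goes here
```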
Results
We compared the models using our standardized financial data analysis test, whose definition can be found here. The following table shows the average results for both models; the winner is determined by Human Reviewed Accuracy.
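For readers curious about the aggregation, below is a minimal sketch of how the per-model averages and the winner could be computed. The per-question record layout is an assumption for illustration, not the exact schema of our test harness.

```python
from statistics import mean

def aggregate(results: list[dict]) -> dict:
    """Average per-question metrics for one model.

    Each record is assumed to look like:
    {"question_id": 1, "human_accuracy": 1.0, "response_time_s": 4.2}
    """
    return {
        "human_reviewed_accuracy": mean(r["human_accuracy"] for r in results),
        "avg_response_time_s": mean(r["response_time_s"] for r in results),
    }

def pick_winner(per_model: dict[str, dict]) -> str:
    """The winner is the model with the highest Human Reviewed Accuracy."""
    return max(per_model, key=lambda m: per_model[m]["human_reviewed_accuracy"])
```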
Global Question Visualization
The following figures show the Human Evaluation and Response time results across each question in the dataset.
Our Analysis
We can see that GPT-4 Turbo achieves better accuracy across the dataset, with a 6% advantage over Mistral Large, while Mistral Large's response times significantly outperform GPT-4 Turbo's, coming in almost 400% faster.
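For concreteness, the relative figures above can be derived from the aggregated averages roughly as follows; the field names reuse the hypothetical schema from the earlier sketch.

```python
def compare(gpt4: dict, mistral: dict) -> dict:
    """Relative comparison of two aggregated result dicts (see aggregate())."""
    return {
        # accuracy gap, in percentage points of Human Reviewed Accuracy
        "accuracy_advantage_pp": 100 * (gpt4["human_reviewed_accuracy"]
                                        - mistral["human_reviewed_accuracy"]),
        # how many times faster Mistral Large responds on average
        "mistral_speedup_x": gpt4["avg_response_time_s"]
                             / mistral["avg_response_time_s"],
    }
```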
In conclusion, the Mistral Large and mistral-embed duo impresses with its speed and efficiency, making it ideal for tasks requiring rapid responses and strong reasoning capabilities, while GPT-4 Turbo and text-embedding-3-large excel in accuracy and depth, suited for complex analyses where detailed understanding is paramount. This suggests that the choice between the two should be based on specific needs: speed and linguistic versatility with Mistral Large, or depth and accuracy with GPT-4 Turbo.
Future Plans
Thank you for joining us in this first exciting round of our LLM tournament series! We're just getting started, and there's much more to come. In our next episode, we'll bring even more advanced models to the table and open up different model categories. We'll also keep expanding our tests with more real-life scenarios to challenge the contenders.
References
Benchmarking LLM Powered Chatbots: https://arxiv.org/abs/2308.04624