Introduction

Motivations behind the provider leaderboard project

The open sourcing of Meta's Llama model series in February 2023 was a watershed moment: it catalyzed the training and development of open source Large Language Models (LLMs), leading to commonly used models like Llama-2-70b-chat and Mixtral-8x7b. In some instances, these models are not just competing with but surpassing the capabilities of their closed source counterparts. A compelling aspect of these open source models is their cost-effectiveness: some, like Llama-2-7b-chat, are up to 300 times cheaper to run, offering a practical alternative for a wide range of applications.

However, the choice between different LLMs is only part of the equation. The selection of inference endpoint providers has become equally crucial. The latter half of 2023 saw a significant increase in providers offering text-to-text open source model inference endpoints. Their offerings vary widely: some prioritize cost-effectiveness, while others focus on maximizing throughput or providing exceptionally high rate limits. For instance, among public Llama-2-70b-chat endpoints the differences are stark: more than a 5x difference in cost, over 6x variation in throughput, and rate limits that differ by more than 1000x across providers. This diversity makes a careful, informed choice of provider essential to meet specific application needs.

For developers, however, evaluating and comparing these various inference providers presents significant challenges:

  1. Discovering Emerging Providers: Newer and smaller providers, potentially offering unique advantages, are often difficult to identify in a rapidly expanding market.

  2. Resource-Intensive Evaluation Process: Conducting a thorough assessment of each provider involves substantial time and financial investment. It requires registering accounts, navigating detailed documentation for cost and rate limit information, and executing a statistically significant number of requests to gather vital performance metrics (a minimal measurement sketch follows this list).

  3. Tracking Continuous Optimizations: Providers are constantly enhancing their systems, leading to improvements in latency and cost-efficiency. Staying on top of these developments is crucial yet challenging.
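
To give a feel for what the measurement in point 2 involves, here is a minimal sketch of timing repeated streaming requests against an OpenAI-compatible chat endpoint. The base URL, API key, model name, prompt, and request count are placeholders, and this is an illustration rather than the leaderboard's actual benchmarking harness:

```python
import time
import statistics
import requests

# Hypothetical OpenAI-compatible endpoint and credentials -- substitute the
# provider's actual base URL, key, and the model name it exposes.
BASE_URL = "https://api.example-provider.com/v1/chat/completions"
API_KEY = "YOUR_API_KEY"
MODEL = "llama-2-70b-chat"

def run_request(prompt: str) -> dict:
    """Send one streaming request and record simple latency metrics."""
    headers = {"Authorization": f"Bearer {API_KEY}"}
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,
    }
    start = time.perf_counter()
    first_chunk_at = None
    chunks = 0
    with requests.post(BASE_URL, json=payload, headers=headers,
                       stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            # First streamed chunk serves as a rough proxy for time to first token.
            if first_chunk_at is None:
                first_chunk_at = time.perf_counter()
            chunks += 1
    end = time.perf_counter()
    return {
        "time_to_first_chunk_s": (first_chunk_at or end) - start,
        "total_time_s": end - start,
        "chunks": chunks,
    }

# Repeat the measurement enough times for the median to stabilize.
results = [run_request("Summarize the plot of Hamlet in two sentences.") for _ in range(20)]
print("median time to first chunk (s):",
      statistics.median(r["time_to_first_chunk_s"] for r in results))
print("median total time (s):",
      statistics.median(r["total_time_s"] for r in results))
```

In practice one would repeat this across different prompt lengths, output lengths, and times of day before drawing any conclusions about a provider.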

While resources like the Open LLM Leaderboard from Hugging Face and the Chatbot Arena Leaderboard from LMSYS provide valuable insights into model performance, a gap remains in the comprehensive comparison of open source model inference providers.

At Martian, we feel this need acutely. Our product is an LLM Router: an API that takes in a prompt and sends it to the best LLM based on factors like output quality, cost, and latency. Because getting the best performance from AI models is not just a function of which model you use but also of how you run inference for that model, we need to understand the pros and cons of each inference provider in order to use the best one for any given request.
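
As a toy illustration of the kind of trade-off a router has to weigh (this is not Martian's routing logic; the provider names, numbers, and weights below are invented), one could score candidate providers on normalized cost, throughput, and quality:

```python
from dataclasses import dataclass

@dataclass
class ProviderStats:
    name: str
    cost_per_1m_tokens: float     # USD per million tokens
    median_throughput_tps: float  # tokens per second
    quality_score: float          # 0-1, from whatever evaluation you trust

def pick_provider(providers, cost_weight=0.3, speed_weight=0.3, quality_weight=0.4):
    """Return the provider with the best weighted score.

    Each column is normalized against the best value so the weights are
    comparable; a real router could also condition on the prompt itself.
    """
    min_cost = min(p.cost_per_1m_tokens for p in providers)
    max_tps = max(p.median_throughput_tps for p in providers)
    max_quality = max(p.quality_score for p in providers)

    def score(p):
        return (
            cost_weight * (min_cost / p.cost_per_1m_tokens)
            + speed_weight * (p.median_throughput_tps / max_tps)
            + quality_weight * (p.quality_score / max_quality)
        )

    return max(providers, key=score)

# Made-up numbers purely for illustration.
candidates = [
    ProviderStats("provider-a", cost_per_1m_tokens=0.9, median_throughput_tps=45, quality_score=0.82),
    ProviderStats("provider-b", cost_per_1m_tokens=1.8, median_throughput_tps=110, quality_score=0.84),
]
print(pick_provider(candidates).name)
```

Even this toy version shows why up-to-date provider measurements matter: shift the weights or the throughput numbers slightly and the best choice changes.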

With the provider leaderboard from Martian, we aim to share what we've learned from benchmarking the performance of different LLM providers. We hope that this leaderboard serves as a useful tool in multiple ways:

  1. As an easy-to-use tool to evaluate the strengths and weaknesses of different LLM providers

  2. As a way to share those evaluation results within teams and across the AI community

  3. As a way to highlight the difficulties in doing such evaluations (see, e.g., our limitations section) and to help the community address those difficulties

  4. As a way to provide an unbiased open-source project that the community can build on and improve
