Limitations

Limitations of our benchmarking system

Below is a list of known limitations of our evaluation system. This is a living project, and future versions will work to resolve these issues:

  • Reliability and uptime are important metrics for evaluating a provider, but they require a longer data-collection period and will be added shortly.

  • Many providers offer customized rate limits upon request, including for public endpoints. We display only the default rate limits listed in their documentation.

  • TTFT (time to first token) depends on the location from which the request is made. Our requests are sent from US East.

  • To reduce the time it takes to collect the leaderboard data, we parallelize the requests we make to each provider by testing multiple models at the same time (e.g. testing llama-2-70b and mixtral simultaneously; see the first sketch after this list). If a provider co-locates multiple models (i.e. runs both models on the same machines), their servers will be under heavier load during our data collection than those of providers who do not co-locate, leading to slower throughput and longer time to first token. This calling pattern may not mimic the normal usage these models see (e.g. if higher llama usage typically coincides with lower mixtral usage, then co-location makes things more efficient in practice, whereas it makes things less efficient in our testing).

  • Providers who over-provision GPUs have more machines available to absorb increased traffic and therefore have lower TTFT, because any request that is not immediately serviced by a GPU has to be queued. Popular providers and providers with less funding to spend on GPUs may therefore see longer queuing times on their public endpoints.

  • More generally, public inference endpoints have different performance characteristics than private endpoints with dedicated hardware. For example, in addition to the point above about the number of idle GPUs, our backend simulates a user sending a series of concurrent requests at random times of day, which means any network latency, load balancing, or queuing time counts as part of the latency and throughput measurements (see the second sketch after this list). At the moment, we are focusing on the end-to-end experience of users of public inference endpoints. However, we are also in the process of testing private endpoints in order to measure providers' performance characteristics in that setting.
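
As a rough illustration of the parallelized collection pattern described above, here is a minimal sketch of sending requests to two models at the same time with asyncio. The endpoint URL, payload shape, and request parameters are assumptions for illustration only; this is not the leaderboard's actual harness.

```python
import asyncio
import time

import httpx

# Hypothetical OpenAI-style endpoint; real providers, URLs, and payloads differ.
ENDPOINT = "https://api.example-provider.com/v1/chat/completions"
MODELS = ["llama-2-70b", "mixtral"]  # models tested in parallel, per the text above
PROMPT = "Summarize the causes of the French Revolution in one paragraph."


async def time_one_request(client: httpx.AsyncClient, model: str) -> dict:
    """Send a single request and record its wall-clock latency."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": PROMPT}],
        "max_tokens": 256,
    }
    start = time.perf_counter()
    response = await client.post(ENDPOINT, json=payload, timeout=120)
    elapsed = time.perf_counter() - start
    return {"model": model, "status": response.status_code, "latency_s": elapsed}


async def main() -> None:
    async with httpx.AsyncClient() as client:
        # Both models are queried concurrently, so a provider that co-locates
        # them on the same machines sees the combined load at once.
        results = await asyncio.gather(
            *(time_one_request(client, model) for model in MODELS)
        )
    for result in results:
        print(result)


if __name__ == "__main__":
    asyncio.run(main())
```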

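Similarly, here is a hedged sketch of measuring TTFT and throughput end to end from the client side of a streaming request, so that network latency, load balancing, and queuing all show up in the numbers. The endpoint, payload, and the chunk-based token approximation are illustrative assumptions, not the exact measurement implementation.

```python
import time

import httpx

# Hypothetical streaming endpoint and payload; real providers vary.
ENDPOINT = "https://api.example-provider.com/v1/chat/completions"
PAYLOAD = {
    "model": "llama-2-70b",
    "messages": [{"role": "user", "content": "Explain TCP slow start."}],
    "max_tokens": 256,
    "stream": True,
}


def measure_streaming_request() -> dict:
    """Time a single streamed completion from the client's point of view."""
    start = time.perf_counter()
    first_chunk_at = None
    chunk_count = 0

    with httpx.Client(timeout=120) as client:
        # Everything between `start` and the first received chunk counts as TTFT,
        # including network round trips and any server-side queuing.
        with client.stream("POST", ENDPOINT, json=PAYLOAD) as response:
            for chunk in response.iter_bytes():
                if first_chunk_at is None:
                    first_chunk_at = time.perf_counter()
                chunk_count += 1

    end = time.perf_counter()
    return {
        "ttft_s": (first_chunk_at - start) if first_chunk_at else None,
        "total_s": end - start,
        # Chunks are a crude proxy for tokens here; a real harness would parse
        # the streamed events and count tokens with the model's tokenizer.
        "chunks_per_s": chunk_count / (end - start),
    }


if __name__ == "__main__":
    print(measure_streaming_request())
```
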
If you are a developer who would like to help fix these issues, you can contribute to the open source repo.

If you are a provider who would like us to do a more in-depth review of your systems with a private instance, contact us: contact@withmartian.com
