Evaluation Methodology

How we measure the chosen metrics

Cost and Rate Limit

We obtain cost and rate limit information from each provider’s documentation. If a provider’s endpoint cannot handle a certain number of concurrent requests, we remove that endpoint from the leaderboard when that use case is selected. See additional limitations in the limitations section. If any displayed information is out of date, please contact us at contact@withmartian.com!


Before we dive into how we collect throughput and time to first token, here’s a quick walkthrough of the different sources of latency in an API endpoint product, with an illustrative summary sketch after the list below.

Latency Sources in End-to-End Inference Systems

  • Network Latency: When a user calls a provider’s endpoint, the request first encounters network latency. Depending on the location of the user and of the provider’s server, this network latency can differ.

  • Load Balancing: The provider’s server then receives and processes the incoming request and forwards it to the LLM inference endpoint. If there is a load balancer, it can add latency while determining the best server to forward the request to.

  • Queuing Time: Especially for shared public endpoints, providers usually have a queuing system, so other users’ inference requests may be processed first if they were sent earlier.

  • Inference Computation Time: This is the time it takes for the GPU to actually process your inference request. Many providers have optimizations to accelerate inference and decrease computation time.

    • Scaling up additional hardware resources: When there are more concurrent requests than the existing system can handle, the underlying provider will sometimes spin up additional GPUs to handle the spike in requests. The ability to scale up additional compute effectively also influences latency.

  • Finally, we encounter network latency again when the server sends the response back to the user.
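
To make the breakdown concrete, here is a purely illustrative sketch that sums these components into an end-to-end figure. The field names are hypothetical and are not part of our measurement pipeline.

```python
# Illustrative only: end-to-end latency as the sum of the components above.
# Field names are hypothetical; none of these values come from our measurements.
from dataclasses import dataclass

@dataclass
class LatencyBreakdown:
    network_out_ms: float     # user -> provider network latency
    load_balancing_ms: float  # routing the request to an inference server
    queuing_ms: float         # waiting behind other users' requests
    inference_ms: float       # GPU computation time (plus any scale-up delay)
    network_back_ms: float    # provider -> user network latency

    def total_ms(self) -> float:
        return (self.network_out_ms + self.load_balancing_ms + self.queuing_ms
                + self.inference_ms + self.network_back_ms)
```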


Now, we’ll walk through how we collect and display throughput and TTFT.

Warm Up Requests

We send 1-3 warm-up requests to each model from each provider before measuring throughput and TTFT to eliminate the effect of cold starts.
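
As an illustration, here is a minimal sketch of the warm-up step, assuming an OpenAI-compatible chat completions endpoint; the base URL, API key, and model name are placeholders, not a specific provider’s values.

```python
# Minimal warm-up sketch, assuming an OpenAI-compatible endpoint.
# PROVIDER_BASE_URL, PROVIDER_API_KEY, and MODEL are placeholders.
import asyncio
from openai import AsyncOpenAI

PROVIDER_BASE_URL = "https://api.example-provider.com/v1"  # placeholder
PROVIDER_API_KEY = "sk-..."                                # placeholder
MODEL = "example-model"                                    # placeholder

client = AsyncOpenAI(base_url=PROVIDER_BASE_URL, api_key=PROVIDER_API_KEY)

async def warm_up(n_requests: int = 3) -> None:
    """Send a few short requests so later measurements are not skewed by cold starts."""
    for _ in range(n_requests):
        await client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": "Hello!"}],
            max_tokens=16,
        )

# Example: asyncio.run(warm_up())
```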

Throughput

We calculate throughput by sending a request, measuring the total latency, and dividing the total number of output tokens by that latency (tokens per second).

Since most client applications are written in Python, we send the request via OpenAI’s async Python SDK if the provider supports it; otherwise we use the provider’s custom SDK.
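
A minimal sketch of a single throughput measurement under those assumptions follows; it reuses the client and MODEL placeholders from the warm-up sketch and reads the output token count from the response’s usage field.

```python
import time

# Sketch of one throughput sample: output tokens divided by total request latency.
# Reuses the `client` and `MODEL` placeholders from the warm-up sketch.
async def measure_throughput(prompt: str, max_tokens: int) -> float:
    start = time.perf_counter()
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    latency_s = time.perf_counter() - start
    output_tokens = resp.usage.completion_tokens  # token count reported by the API
    return output_tokens / latency_s  # tokens per second
```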

Parameters that affect throughput:

  • output token length: Token length can affect throughput in practice. We measure throughput for ~100 and ~1000 output tokens since these are common output lengths.

  • concurrency: We measure throughput for 2, 20, and 50 concurrent requests to simulate systems under different loads. This attempts to measure different providers’ ability to batch process, manage queues, and scale up additional computational resources.

For each model offered by each provider, we measure throughput for the 6 combinations of the 2 variables above; a minimal sketch of one measurement pass follows below. We've found input token length to have minimal influence on throughput, so we use a prompt of ~100 tokens for all throughput measurements.
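
For illustration, here is a hedged sketch of one pass over those combinations, using asyncio to fire concurrent requests; measure_throughput is the helper sketched above and the prompt is a placeholder.

```python
import asyncio

# Sketch of one pass over the 2 output lengths x 3 concurrency levels.
# `measure_throughput` is the helper sketched above; the prompt is a placeholder.
async def run_measurement_pass(prompt: str) -> dict[tuple[int, int], list[float]]:
    results: dict[tuple[int, int], list[float]] = {}
    for max_tokens in (100, 1000):
        for concurrency in (2, 20, 50):
            tasks = [measure_throughput(prompt, max_tokens) for _ in range(concurrency)]
            results[(max_tokens, concurrency)] = await asyncio.gather(*tasks)
    return results
```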

TTFT

We calculate time to first token by sending a request via the OpenAI Python SDK’s streaming feature and measuring the time until we receive the first token. If the provider doesn’t support the OpenAI Python SDK, we fall back to their custom SDK.
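
A minimal sketch of a single TTFT measurement with the OpenAI Python SDK’s streaming interface is below; it reuses the client and MODEL placeholders from the warm-up sketch.

```python
import time

# Sketch of one TTFT sample using streaming; reuses `client` and `MODEL` from above.
async def measure_ttft(prompt: str) -> float:
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        stream=True,
    )
    async for chunk in stream:
        # Stop at the first chunk that carries generated text.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    return float("nan")  # stream ended without any content
```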

Concurrency affects TTFT in the same way it affects throughput, so we measure TTFT for each model supported by each provider at 2, 20, and 50 concurrent requests using a prompt of ~100 tokens.

P50 and P90

For both throughput and TTFT, we report the median and 90th percentile values, as these give developers an informative overview of both the typical performance and the performance consistency of the system.

When we measure performance at 2 concurrent requests, we repeat the measurement 5 times within one collection to obtain a more representative sample.
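
As a sketch, these summary statistics can be computed with Python’s standard library; the sample list here stands in for the per-request measurements collected above.

```python
import statistics

# Sketch: median (P50) and 90th percentile (P90) over per-request measurements.
def summarize(samples: list[float]) -> dict[str, float]:
    ordered = sorted(samples)
    # statistics.quantiles with n=10 returns the 10th..90th percentiles; take the last.
    return {
        "p50": statistics.median(ordered),
        "p90": statistics.quantiles(ordered, n=10)[-1],
    }
```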

Long Term Tracking

We collect the live metrics (throughput and TTFT) twice a day at random times during the day and report the average of the past 5 days on the leaderboard. This removes bias from any individual measurement and allows the leaderboard to reflect a provider’s new optimizations within a relatively short period of time.
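
As a hypothetical illustration of this aggregation step (the data structure is ours, not a description of our production pipeline), the leaderboard value for one provider, model, and metric could be computed like this:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical aggregation: average all collection runs from the past five days
# for one (provider, model, metric) combination.
def five_day_average(history: list[tuple[datetime, float]]) -> float:
    """history holds (collection_time, metric_value) pairs, two per day."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=5)
    recent = [value for ts, value in history if ts >= cutoff]
    return sum(recent) / len(recent) if recent else float("nan")
```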

If you have questions and concerns about how we measure these metrics, or if you’d like to include any other metrics on the leaderboard, feel free to contact us at contact@withmartian.com.
