Benchmarking The Router

Model routing can achieve strong results on a number of open-source datasets. We’re even able to beat GPT-4 across hundreds of datasets on OpenAI’s own evals (and you can replicate the results here). But how do you know that the router will work for your specific application?

We built a benchmarking tool directly into our API so that you can use the router confidently.

Here, we’ll walk you through how our benchmarking tool works.

How To Integrate Our Benchmarking Tool

To use our benchmarking tool, all you need to do is install our package in your codebase.

When you make API calls to LLMs with Martian, you can specify what model or set of models you want to call. If you specify a single model instead of multiple models, we will pass your request directly through to the provider. This allows you to keep the behavior of your application exactly the same as before integrating Martian.
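As a rough sketch of this behavior (the `model`/`models` request fields and the model names below are illustrative assumptions, not Martian's documented API):

```python
def build_chat_request(prompt, models):
    """Build an OpenAI-style chat request, illustrating the difference
    between specifying a single model and a set of models.
    Field names here are hypothetical, for illustration only."""
    request = {"messages": [{"role": "user", "content": prompt}]}
    if isinstance(models, str) or len(models) == 1:
        # A single model: the request is passed straight through to
        # that provider, so application behavior is unchanged.
        request["model"] = models if isinstance(models, str) else models[0]
    else:
        # Multiple models: the router chooses among them per request.
        request["models"] = list(models)
    return request

# Single model -> direct pass-through to the provider
single = build_chat_request("Hello", "gpt-3.5-turbo")
# Multiple models -> the router picks one per request
routed = build_chat_request("Hello", ["gpt-3.5-turbo", "claude-2"])
```

In the single-model case your requests behave exactly as before, which is what lets the background comparison run risk-free.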

Simultaneously, we are running the router on your requests in the background. These requests to the router allow us to measure how the router compares to the existing model.

Once we have enough data from this process, we send you a report detailing the router performance, allowing you to make the decision to switch to the router in a risk-free way.

Once you’re ready to switch to the router, you can follow the instructions here.

What Benchmarks Does Martian Provide?

  • Side-by-side comparison of outputs from your existing model(s) and the model router

  • Identification of the most similar tasks in our task database, with the relative performance of your existing model and the model router on those tasks

  • Latency and cost comparisons

  • Human preference results: the percentage of the time that human annotators, or a model trained on human annotation data, prefer the model router over your existing model

  • Custom evaluations. For enterprise customers, we can integrate the model router with evaluations your team is currently using, or work with your team to develop custom evaluations for the router.
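To make the human preference results concrete, here is a minimal sketch of how a preference win rate can be computed from pairwise annotations (the `'router'`/`'existing'` labels are hypothetical, not Martian's actual report schema):

```python
def preference_win_rate(annotations):
    """Fraction of pairwise comparisons in which the router's output
    was preferred. Each annotation is the label of the preferred
    side: 'router' or 'existing' (illustrative labels only)."""
    if not annotations:
        raise ValueError("no annotations provided")
    wins = sum(1 for a in annotations if a == "router")
    return wins / len(annotations)

# Three of four annotators preferred the router's output
print(preference_win_rate(["router", "existing", "router", "router"]))  # 0.75
```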

If you're interested in more details about the benchmarks we provide, in particular getting human preference results or custom evaluations, contact us for more information:

Continued Benchmarking And Improvement After You Deploy The Router

Coming soon! We're building a dashboard and SDK that allows for continuous evaluation, anomaly detection, and other means of tracking and improving router performance.

Other ways of benchmarking the router

  • Our router is fully compatible with the OpenAI API format. That means that if you have an existing benchmarking suite, the Model Router should work with that tooling out of the box.

    • Note that the router will not yet have had a chance to fit itself to your particular application, so initial testing may not give optimal results. We recommend using the benchmarking tool available through the API, or reaching out to us to learn how you can maximize router performance for your use case:

  • If your company wants to run a batch job on an evaluation dataset to benchmark the Model Router, reach out to us at:
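Because the router speaks the OpenAI API wire format, an existing benchmark harness usually only needs its base URL pointed at the router. Below is a stdlib-only sketch that builds (but does not send) a chat-completions request in that format; the base URL and routed-model name are placeholders, not Martian's real endpoint:

```python
import json
from urllib import request as urlrequest

def make_openai_style_request(base_url, api_key, model, prompt):
    """Build a chat-completions request in the OpenAI wire format.
    Swapping base_url is, in principle, all an existing benchmark
    suite should need in order to target the router instead of
    OpenAI directly. The request is constructed but not sent."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urlrequest.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

req = make_openai_style_request(
    "https://example-router.invalid/v1",  # placeholder, not the real endpoint
    "YOUR_API_KEY",
    "router",  # hypothetical routed-model name
    "What is 2 + 2?",
)
```

Your benchmarking suite would then send this request with whatever HTTP client it already uses and score the responses as usual.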
