Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions

We present Auto Arena of LLMs, a fully automated large language model (LLM) evaluation framework that comprehensively probes an LLM's capabilities by engaging multiple LLM agents in peer-battles and committee discussions. On the models included in Chatbot Arena, our framework recovers human preference scores with a 94.5% correlation, a higher agreement with human judgments than any existing benchmark.
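
To make the two core mechanisms concrete, the sketch below shows one possible shape of the evaluation loop: two candidate LLMs debate a question in a multi-round peer-battle, and a committee of judge LLMs then discusses the transcript and votes on a winner. This is a minimal illustration under our own assumptions, not the framework's exact implementation; the `ask(model, prompt)` callable, model names, round counts, and prompts are all hypothetical placeholders for whatever LLM API the reader has available.

```python
# Minimal sketch of a peer-battle followed by a committee discussion.
# `ask(model, prompt)` is a user-supplied callable wrapping an LLM API;
# all prompts and defaults here are illustrative assumptions.
from collections import Counter
from typing import Callable, List


def peer_battle(ask: Callable[[str, str], str],
                candidate_a: str, candidate_b: str,
                question: str, rounds: int = 2) -> str:
    """Two candidate LLMs answer a question and rebut each other in turns."""
    transcript = f"Question: {question}\n"
    for r in range(rounds):
        for name in (candidate_a, candidate_b):
            reply = ask(name, f"{transcript}\nYou are {name}. Answer the "
                              "question and rebut your opponent's last turn.")
            transcript += f"\n[{name}, round {r + 1}]: {reply}\n"
    return transcript


def committee_discussion(ask: Callable[[str, str], str],
                         judges: List[str], transcript: str) -> str:
    """Judge LLMs read the battle, share opinions, then vote on the winner."""
    opinions = []
    for judge in judges:
        opinion = ask(judge, f"{transcript}\nAs judge {judge}, state which "
                             "candidate argued better and why.")
        opinions.append(f"[{judge}]: {opinion}")
    shared = "\n".join(opinions)
    votes = [ask(judge, f"{transcript}\nOther judges said:\n{shared}\n"
                        "Reply with only the name of the winning candidate.")
             for judge in judges]
    # Majority vote over the committee decides the battle outcome.
    return Counter(v.strip() for v in votes).most_common(1)[0][0]
```

In this sketch, outcomes from many such battles across a pool of candidate models would be aggregated into a ranking that can then be compared against human preference scores.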