We present Auto Arena of LLMs, a fully automated large language model (LLM) evaluation framework that comprehensively examines an LLM's capabilities by engaging LLM agents in peer battles and committee discussions.
On the models included in Chatbot Arena, our evaluation achieves a 94.5% correlation with human preference scores, exceeding all current benchmarks.
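To make the workflow concrete, the following is a minimal sketch, not the paper's implementation, of how a peer battle followed by a committee discussion could be structured. The names `LLM`, `peer_battle`, and `committee_verdict` are hypothetical, and the `LLM` callables stand in for actual model API calls.

```python
# Hypothetical sketch of the peer-battle / committee-discussion loop.
# Each LLM is modeled as a function from a prompt string to a response string.
from collections import Counter
from typing import Callable

LLM = Callable[[str], str]  # stand-in for an actual LLM API call


def peer_battle(question: str, model_a: LLM, model_b: LLM, rounds: int = 2) -> str:
    """Two candidate models answer the question, then critique each other."""
    transcript = f"Question: {question}\n"
    transcript += f"A's answer: {model_a(transcript)}\n"
    transcript += f"B's answer: {model_b(transcript)}\n"
    for _ in range(rounds):  # alternating rebuttals form the peer battle
        transcript += f"A's rebuttal: {model_a(transcript)}\n"
        transcript += f"B's rebuttal: {model_b(transcript)}\n"
    return transcript


def committee_verdict(transcript: str, judges: list[LLM]) -> str:
    """Judge agents discuss the battle transcript, then vote; majority wins."""
    discussion = transcript
    for judge in judges:  # each judge sees the battle plus earlier comments
        discussion += f"Judge comment: {judge(discussion)}\n"
    votes = [judge(discussion + "Final vote (A or B): ").strip()[:1] for judge in judges]
    return Counter(votes).most_common(1)[0][0]
```

In this sketch, repeating peer battles over many questions and aggregating the committee verdicts would yield pairwise win rates that can be converted into a model ranking.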