BELLS

Benchmark for the Evaluation of LLM Supervision Systems

A comprehensive framework for evaluating and comparing the effectiveness of Large Language Model supervision systems across multiple safety-critical dimensions.

Compare Supervision Systems

Explore performance rankings and detailed benchmarks

View Leaderboard

Explore Supervision Systems

Browse our dataset and see how different supervision systems perform against various types of prompts

Open Playground

Get Recommendations

Find the best supervision system for your specific needs

Get Started

Learn About Our Metrics

Understand how we evaluate and score supervision systems

View Metrics

Explore Our Code

Check out the BELLS benchmark implementation on GitHub

View Repository