BELLS: Benchmark for the Evaluation of LLM Supervision Systems
A comprehensive framework for evaluating and comparing Large Language Model (LLM) supervision systems across multiple safety-critical dimensions.
With BELLS you can:

- Explore performance rankings and detailed benchmark results
- Browse the dataset to see how different supervision systems perform against various types of prompts
- Find the best supervision system for your specific needs
- Understand how we evaluate and score supervision systems
- Check out the BELLS benchmark implementation on GitHub