BELLS: Benchmark for the Evaluation of LLM Supervision Systems
A comprehensive framework for evaluating and comparing Large Language Model (LLM) supervision systems across multiple safety-critical dimensions.
With BELLS you can:

- Explore performance rankings and detailed benchmark results
- Browse the dataset to see how different supervision systems perform against various types of prompts
- Find the best supervision system for your specific needs
- Understand how we evaluate and score supervision systems
- Check out the BELLS benchmark implementation on GitHub