BELLS

Research Report: Repurposed LLMs vs. Specialized Supervision Systems

This report presents the results of a systematic comparison between repurposed large language models (LLMs) and specialized supervision systems, covering key findings and performance across multiple safety-critical dimensions.

Compare Supervision Systems

Explore performance rankings and detailed benchmarks

Explore Our Dataset

Browse the prompts and see how different supervision systems perform against each type

Get Recommendations

Find the best supervision system for your specific needs

Learn About Our Metrics

Understand how we evaluate and score supervision systems

Explore Our Code

Check out the BELLS benchmark implementation on GitHub
