FAQ - BELLS Benchmark

How is the BELLS Score computed?

The BELLS score combines three key dimensions:

Detection rate on adversarial harmful prompts (25% weight)
Detection rate on direct harmful prompts (25% weight)
False positive rate on benign prompts (50% weight)

This weighting ensures a balanced evaluation between robust detection and precision (i.e., minimizing false positives). We don't include the benign adversarial prompts, nor the borderline prompts in the computation of the score. The reason is adversariality adds ambiguity to the content even for benign requests, and that borderline requests are by definition not undoubtedly harmful. This way, the BELLS score is as denoised as possible from ambiguity.

Why did you evaluate supervisors specialized on jailbreaks / prompt injections detection like LLM Guard or Prompt Guard on content moderation tasks?

Our evaluation reflects a realistic deployment setting, where a misuse filter must handle a wide range of threats without relying on very narrow use cases. We encourage future systems to clarify their scope and evaluate themselves accordingly within general-purpose frameworks like BELLS.

Why didn't you evaluate the LLaMA model family?

We tried but all the llama models we tested were not answering the repurposed prompt question ("is this prompt harmful or not?") more than 50% of the time, the results were therefore not exploitable.

Why did you evaluate GPT-4, Claude 3.5, Grok 2, Gemini 1.5 and DeepSeek V3 and not other / newer models?

We conducted evaluations in January–February 2025 with limited resources and access. Our goal was not to exhaustively benchmark every model, but to show that even a simple, prompt-based repurposing of recent frontier LLMs consistently outperforms dedicated supervision systems. This suggests a general capability gap between old and new models in misuse detection, what we call the bitter lesson.

Why is LLaMA Guard only evaluated on content moderation?

We have evaluated Llama Guard 4 12B - the SOTA content moderation model from Meta to test whether our conclusions hold even for SOTA moderation tools. In fact, LLaMA Guard underperforms general models significantly on various harmful categories, showing its irrelevance regarding its high number of parameters.

Why is Claude 3.7 evaluated in the metacognitive incoherence section?

Claude 3.7 was released just before we ran the metacognitive incoherence evaluations. The point of the evaluation was to show the general trend of metacognitive incoherence among a representative set of models, therefore the newer the better. Also the metacognitive incoherence evaluations are independent from the supervision systems evaluations.

Supervisors from the market are likely trained to be low sensitive because false positive might be more expensive than false negative

Yes and that's part of the problem. Some companies may prioritize low sensitivity (to avoid rejecting benign content), which can lead to unacceptably high false negatives. In safety-critical contexts, missing harmful content is much riskier than flagging an occasional benign input. False negatives at scale can be dangerous.

Your solution is nice but it doubles the inference compute costs and increase the latency

To some extent, but it's a tradeoff worth considering. Modern frontier LLMs like Gemini are relatively cheap; Regarding latency, inference costs can be further optimized with low-latency infrastructure like Groq. And for critical applications, performance and robustness should be more important optimization targets than cost and latency.

Where can we access your dataset?

For security reasons and to maintain benchmark integrity, we do not publicly release the full dataset to prevent potential misuse of harmful prompts and avoid gaming of the benchmark. Instead, we provide representative examples in our data playground and raw data at our leaderboard GitHub repository.

Why don't you have a graph showing the number of parameters and the scaling law?

It would be great to have it but we don't know the number of parameters of the majority of the market deployed supervision systems. Regarding general LLMs, our results can also be the reflection of internal safety policy rather than a real capability of the model to detect misuse.

How do you compare with constitutional classifiers?

Anthropic's constitutional classifiers are a promising and relatively recent approach that was released during the course of our evaluations (January 2025). However, we were unable to test these systems within BELLS, as they are not publicly accessible. While their architecture aligns with many of our findings, emphasizing general model capability, scalability, and modularity, we strongly advocate for third-party access and reproducible evaluation protocols to validate claims of robustness across a wider range of threat models.

Frequently Asked Questions