Find answers to common questions about our methodology, evaluation process, and key findings.
The BELLS score combines three key dimensions:
This weighting ensures a balanced evaluation between robust detection and precision (i.e., minimizing false positives). We don't include the benign adversarial prompts, nor the borderline prompts in the computation of the score. The reason is adversariality adds ambiguity to the content even for benign requests, and that borderline requests are by definition not undoubtedly harmful. This way, the BELLS score is as denoised as possible from ambiguity.
Our evaluation reflects a realistic deployment setting, where a misuse filter must handle a wide range of threats without relying on very narrow use cases. We encourage future systems to clarify their scope and evaluate themselves accordingly within general-purpose frameworks like BELLS.
We tried but all the llama models we tested were not answering the repurposed prompt question ("is this prompt harmful or not?") more than 50% of the time, the results were therefore not exploitable.
We conducted evaluations in January–February 2025 with limited resources and access. Our goal was not to exhaustively benchmark every model, but to show that even a simple, prompt-based repurposing of recent frontier LLMs consistently outperforms dedicated supervision systems. This suggests a general capability gap between old and new models in misuse detection, what we call the bitter lesson.
We have evaluated Llama Guard 4 12B - the SOTA content moderation model from Meta to test whether our conclusions hold even for SOTA moderation tools. In fact, LLaMA Guard underperforms general models significantly on various harmful categories, showing its irrelevance regarding its high number of parameters.
Claude 3.7 was released just before we ran the metacognitive incoherence evaluations. The point of the evaluation was to show the general trend of metacognitive incoherence among a representative set of models, therefore the newer the better. Also the metacognitive incoherence evaluations are independent from the supervision systems evaluations.
Yes and that's part of the problem. Some companies may prioritize low sensitivity (to avoid rejecting benign content), which can lead to unacceptably high false negatives. In safety-critical contexts, missing harmful content is much riskier than flagging an occasional benign input. False negatives at scale can be dangerous.
To some extent, but it's a tradeoff worth considering. Modern frontier LLMs like Gemini are relatively cheap; Regarding latency, inference costs can be further optimized with low-latency infrastructure like Groq. And for critical applications, performance and robustness should be more important optimization targets than cost and latency.
For security reasons and to maintain benchmark integrity, we do not publicly release the full dataset to prevent potential misuse of harmful prompts and avoid gaming of the benchmark. Instead, we provide representative examples in our data playground and raw data at our leaderboard GitHub repository.
It would be great to have it but we don't know the number of parameters of the majority of the market deployed supervision systems. Regarding general LLMs, our results can also be the reflection of internal safety policy rather than a real capability of the model to detect misuse.
Anthropic's constitutional classifiers are a promising and relatively recent approach that was released during the course of our evaluations (January 2025). However, we were unable to test these systems within BELLS, as they are not publicly accessible. While their architecture aligns with many of our findings, emphasizing general model capability, scalability, and modularity, we strongly advocate for third-party access and reproducible evaluation protocols to validate claims of robustness across a wider range of threat models.