5.9 Limitations¶
Previous sections outlined various evaluation techniques and methodologies, but building out a proper safety infrastructure means that we should also maintain an appropriate amount of skepticism about evaluation results and avoid overconfidence in safety assessments. So the last section of this chapter is dedicated to exploring the limitations, constraints and challenges to AI evaluations.
5.9.1 Fundamental Challenges¶
Absence of evidence vs. evidence of absence. A core challenge in AI evaluation is the fundamental asymmetry between proving the presence versus absence of capabilities or risks. While evaluations can definitively confirm that a model possesses certain capabilities or exhibits particular behaviors, they cannot conclusively prove the absence of concerning capabilities or behaviors (Anthropic, 2024). This asymmetry creates a persistent uncertainty in safety assessments - even if a model passes numerous safety evaluations, we cannot be certain it is truly safe.
The challenge of interpreting negative results. Related to this asymmetry is the difficulty of interpreting negative results in AI evaluation. When a model fails to demonstrate a capability or behavior during testing, this could indicate either a genuine limitation of the model or simply a failure of our evaluation methods to properly elicit the capability or propensity. The discovery that GPT-4 could effectively control Minecraft through appropriate scaffolding, despite initially appearing incapable of such behavior, illustrates this challenge. This uncertainty about negative results becomes particularly problematic when evaluating safety-critical properties, where false negatives could have severe consequences.
The problem of unknown unknowns. Perhaps most fundamentally, our evaluation methods are limited by our ability to anticipate what we need to evaluate. As AI systems become more capable, they might develop behaviors or capabilities that we haven't thought to test for. This suggests that our evaluation frameworks might be missing entire categories of capabilities or risks simply because we haven't conceived of them yet.
5.9.2 Technical Challenges¶
The challenge of measurement precision. The extreme sensitivity of language models to minor changes in evaluation conditions poses a fundamental challenge to reliable measurement. As we discussed in earlier sections about evaluation techniques, even subtle variations in prompt formatting can lead to dramatically different results. Research has also shown that performance can vary by up to 76 percent based on seemingly trivial changes in few-shot prompting formats (Scalar et al., 2023). As another example, changes as minor as switching from alphabetical to numerical option labeling (e.g., from (A) to (1)), altering bracket styles, or adding a single space can result in accuracy fluctuations of around 5 percentage points. (Anthropic, 2024) This isn't just a minor inconvenience - it raises serious questions about the reliability and reproducibility of our evaluation results. When small changes in input formatting can cause such large variations in measured performance, how can we trust our assessments of model capabilities or safety properties?
Sensitivity to environmental factors. Beyond just measurement issues due to prompt formatting, evaluation results can be affected by various environmental factors that we might not even be aware of. Just like the unknown unknowns that we highlighted in the last section, this might include things like the specific version of the model being tested, the temperature and other sampling parameters used, the compute infrastructure running the evaluation and maybe even the time of day or system load during testing.
The choice of evaluation methods must also consider practical constraints. Third party evaluators might have only limited access to a model's internals. They might have access to observe activations, but not modify the weights. They might not have access to the model at all, and might be restricted to observing the model functioning "in the wild". Depending on the specific techniques used, computational resources might restrict certain types of analysis.
Price and time are also practical constraints on running comprehensive evaluations. Running agents (or "AI systems") is much more expensive and time intensive than benchmarks. Developing complex novel tasks, verifying and measuring capabilities properly is human labor intensive. As an example, during the evaluation for AI capabilities in RnD the human "control group" spent more than 568 hours on just 7 tasks (Wijk et al., 2024). This doesn't even include time to develop and test the tasks. Real-world tasks in AI R&D might take hundreds of man-hours each, and most independent evaluators, or even internal teams at big AI labs don’t have the budget and time to properly evaluate performance given these considerations. All of this needs to be kept in mind when designing an evaluation protocol.
The combinatorial explosion of interaction scenarios. When AI systems interact in complex environments, the number of possible scenarios grows exponentially with each additional interaction pattern. Unlike evaluating isolated capabilities, we must consider how multiple components interact in unpredictable ways - similar to how normally benign chemicals can form dangerous compounds when combined. For example, when model A with planning abilities interacts with model B possessing information access, and both operate within system C that has external tool permissions, entirely new risk surfaces emerge that weren't present in any individual component. This combinatorial complexity makes comprehensive testing fundamentally intractable. With millions of potential interaction patterns, we can only evaluate a microscopic fraction of possible scenarios, leaving countless blind spots where critical emergent behaviors might develop. This fundamental limitation raises a sobering question: if we can only explore 0.001% of the interaction space, how can we be confident we've identified the most dangerous emergent patterns?
The resource reality. Related to the problem of exploding number of evaluations is the sheer computational cost of running thorough evaluations. This creates another hard limit on what we can practically test. Making over 100,000 API calls just to properly assess performance on a single benchmark becomes prohibitively expensive when scaled across multiple capabilities and safety concerns. Independent researchers and smaller organizations often can't afford such comprehensive testing, and even if money isn't a bottleneck because you have state sponsorship, GPUs currently absolutely are. This can lead to potential blind spots in our understanding of model behavior. The resource constraints become even more pressing when we consider the need for repeated testing as models are updated or new capabilities emerge.
US and UK AI Safety Institute Evaluation of Claude 3.5 Sonnet (US & UK AISI, 2024)
While these tests were conducted in line with current best practices, the findings should be considered preliminary. These tests were conducted in a limited time period with finite resources, which if extended could expand the scope of findings and the subsequent conclusions drawn.
The difficulty in estimating "Maximum Capabilities". Also related to the combinatorial complexity problem is the problem of emergence. We frequently discover that models can do things we thought impossible when given the right scaffolding or tools. The Voyager project demonstrated this when it used black box queries to GPT-4 revealing the unexpected ability to play Minecraft (Wang et al., 2023). Either through continued increases in scale, or through simple combination of some scaffolding approach that researchers hadn’t considered before, a model might display some significant step change in capabilities or propensities that we had not anticipated before. This emergent behavior is particularly difficult to evaluate because it often arises from the interaction between capabilities. There are few if any ways to guarantee that the whole will not become more than the sum of its parts. This makes it extremely difficult to predict what a model might be capable of in real-world situations, even if we thoroughly test its individual capabilities.
Case study of capability surprises: Kalamang
Gemini is able to translate language in a zero-shot way, without any training. The case of Gemini's ability to translate Kalamang - a language spoken by only 100 people in Papua New Guinea - demonstrates how models can acquire unexpected capabilities when provided with the right context. It's highly unlikely that this language was part of the training data of any major AI model. Researchers at Google DeepMind demonstrated that by leveraging a sufficiently large context window—around a million tokens—they could input a full grammar book and dictionary of Kalamang. When asked to translate a sentence from Kalamang to English, the model was able to perform this task (Tanzer et al., 2024).
5.9.3 Sandbagging¶

Behavioral testing is not sufficient. We have talked about this numerous times in previous sections, but it's worth mentioning again. Watching what a model does can tell us a lot, but it's still like trying to understand how a person thinks by only observing their actions. Our heavy reliance on behavioral testing - analyzing model outputs - leaves us with significant blind spots about how models actually work. A model might give the right answers while using internal strategies we can't detect. So in addition to being limited by breadth due to combinatorial complexity of the number of things we need to test, we are also limited on the depth of evaluation until we make true progress on interpretability. Unless interpretability makes progress, evaluations have a definite upper limit on how much safety they can guarantee.
Sandbagging refers to strategic underperformance on evaluations, where an AI system is intentionally made to appear less capable than it actually is. There are concerns about AI systems themselves choosing to sandbag, independent of developer intentions. An AI system might deliberately hide its capabilities to avoid modifications or restrictions that would prevent it from achieving its goals (Hubinger, 2023). We talked more about why this is a concern in the subsection on scheming. As evidence that is is possible, multiple frontier models including Claude 3.5 Sonnet, Claude 3 Opus, and OpenAI's o1 have been shown to be capable of sandbagging behavior. They deliberately underperformed on arithmetic tasks when they believed that high performance would trigger model unlearning1 (Meinke et al., 2025).
What does the potential for sandbagging imply for the field of evaluations? The possibility of sandbagging has big implications for how we approach AI evaluations, both now and in the future. Simply having access to model outputs (black-box testing) is insufficient for rigorous evaluation (Casper et al., 2024). Instead, evaluators need deeper access to model internals and training procedures to detect potential sandbagging attempts. This might require new institutional arrangements, such as trusted third-party evaluators with secured access to model weights and architectures. Supervised fine-tuning has shown to be promising as a way to greatly mitigate sandbagging and elicit hidden capabilities (Greenblatt et al, 2024), but fine-tuning is not so easy to use in a comprehensive way on dangerous capability benchmarks, and not usable for dangerous propensity benchmarks. Without addressing this challenge, our ability to make informed decisions about AI development and deployment based on evaluation results becomes severely compromised.
5.9.4 Systemic Limitations¶
Similar to how companies might engage in "greenwashing" by misleadingly portraying themselves as environmentally friendly, AI companies might engage in "safety washing". This means misrepresenting capability improvements as safety advancements. This happens when safety benchmarks have high correlation with general model capabilities and compute used for training. When a benchmark highly correlates with capabilities, improving performance on it may simply reflect that the model got more capable overall, rather than specifically safer.

How does safety washing happen in practice? There are typically three ways this occurs (Ren et al., 2024):
-
Safety by association: Research gets labeled as safety-relevant simply because it comes from safety teams or researchers, even if the work is essentially capabilities research.
-
Public relations: Companies portray capability advancements as safety progress by highlighting correlated safety metrics, especially when facing public pressure or regulatory scrutiny.
-
Grant optimization: Researchers frame capabilities work in safety terms to appeal to funders, even when the advances don't make differential safety progress.
Capabilities correlations might enable safety washing. When safety benchmarks correlate strongly with capabilities, it becomes easy to misrepresent capability improvements as safety advances. For example, if a benchmark meant to measure "truthfulness" actually correlates 80% with general capabilities, then making models bigger and more capable will improve their "truthfulness" scores - even if they aren't becoming inherently more honest (Ren et al., 2024). This creates a systemic problem where organizations can claim safety improvements by simply scaling up their models. The solution isn't just establishing norms around benchmark use - these tend to be weak and easily ignored. Instead, we need evaluation methods that can empirically distinguish between capability improvements and genuine safety advances.
Companies and AI labs can also choose to sandbag their evaluations. We put another form of sandbagging here as systemic limitation to refer to corporations intentionally sandbagging to score better on evaluations. This is in contrast to misalignment/technical sandbagging as a limitation which points to when the model does it because of scheming.
In the immediate term, developers face various pressures that could incentivize sandbagging during evaluations. Regulatory requirements and safety standards may impose capability thresholds above which additional oversight, restrictions, or costs are incurred. For instance, the EU AI Act requires more stringent controls and third-party auditing for high-risk AI systems (European Commission, 2024). This creates a direct financial incentive to appear below these thresholds. Current research shows that even today's language models can be made to selectively underperform in ways that are difficult to detect through standard evaluations, particularly on tests for dangerous capabilities while maintaining performance on general benchmarks (van der Weij et al., 2024).
In the future, companies may want to downplay certain capabilities of their systems to avoid drawing regulatory attention or public concern, while still being able to deploy these capabilities in practice. This allows them to maintain development and deployment speed without triggering additional oversight. For example, if an AI system shows strong capabilities in areas that might raise security concerns, developers might be motivated to have it underperform specifically on those evaluations while maintaining high performance on general benchmarks (van der Weij et al., 2024). They might do this to maintain development and deployment speed without triggering additional oversight.
The gap between evaluation and deployment. Current evaluation approaches often fail to capture the true complexity of real-world deployment contexts. While we can measure model performance in controlled settings, these evaluations rarely reflect how systems will behave when embedded in complex sociotechnical environments. A model that performs safely in isolated testing might behave very differently when interacting with real users, facing novel scenarios, or operating within broader societal systems. For instance, a medical advice model might perform well on standard benchmarks but pose serious risks when deployed in healthcare settings where users lack the expertise to validate its recommendations. This disconnect between evaluation and deployment contexts means we might miss critical safety issues that only emerge through complex human-AI interactions or system-level effects (Weidinger et al., 2023).
The challenge of domain-specific evaluations. Evaluating AI systems in high-stakes domains presents unique challenges that our current approaches struggle to address. Consider the evaluation of AI systems for potential misuse in biosecurity - this requires not just technical understanding of AI systems and evaluation methods, but also deep expertise in biology, biosafety protocols, and potential threat vectors. This combination of expertise is exceedingly rare, making thorough evaluation in these critical domains particularly challenging. Similar challenges arise in other specialized fields like cybersecurity, where the complexity of potential attack vectors and system vulnerabilities requires sophisticated domain knowledge to properly assess. The difficulty of finding evaluators with sufficient cross-domain expertise often leads to evaluations that miss important domain-specific risks or fail to anticipate novel forms of misuse. Besides just designing evaluations being challenging, running the evaluation for such high stakes domains is also challenging. We need to elicit the maximum capabilities of a model to generate pathogens or malware to test what it is capable of. This can be obviously problematic if not handled extremely carefully.
5.9.5 Governance Limitations¶

The policy-evaluation gap. Current legislative and regulatory efforts around AI safety are severely hampered by the lack of robust, standardized evaluation methods. Consider the recent executive orders and proposed regulations requiring safety assessments of AI systems - how can governments enforce these requirements when we lack reliable ways to measure what makes an AI system "safe"? This creates a circular problem: regulators need evaluation standards to create effective policies, but the development of these standards is partly driven by regulatory requirements. The U.S. government's recent initiative to have various agencies evaluate AI capabilities in cybersecurity and biosecurity highlights this challenge. These agencies, despite their expertise in their respective domains, often lack the specialized knowledge needed to evaluate advanced AI systems comprehensively (Weidinger et al., 2023).
The role of independent evaluation. Currently, a lot of AI evaluations research happens within the same companies developing these systems. While these companies often have the best resources and expertise to conduct evaluations, this creates an inherent conflict of interest. Companies might be incentivized to design evaluations that their systems are likely to pass or to downplay concerning results. This is also sometimes called "safety washing" (akin to greenwashing). To address the challenges around evaluation independence and expertise, we need to significantly expand the ecosystem of independent evaluation organizations. The emergence of independent evaluation organizations like Model Evaluation for Trustworthy AI (METR) and Apollo Research represents an important step toward addressing this issue, but these organizations often face significant barriers including limited model access (Perlman, 2024), resource constraints, and difficulties matching the technical capabilities of major AI labs.
Overall, while the limitations we've discussed in this final section are significant, they aren't insurmountable. Progress in areas like mechanistic interpretability, formal verification methods, and evaluation protocols shows promise for addressing many current limitations. However, overcoming these challenges requires sustained effort and investment.

-
This behavior was observed in a controlled evaluation setting where models were explicitly informed through the context window about potential unlearning consequences for high performance. The researchers noted: "We have only shown that some models possess basic scheming capabilities if both the goal and the situational awareness are acquired in-context" rather than demonstrating this behavior emerging spontaneously in more realistic scenarios (Meinke et al., 2025). ↩