Introduction to AI Values
AI models such as Anthropic's Claude are increasingly asked not just for factual recall but for guidance involving complex human values. Whether it’s parenting advice, workplace conflict resolution, or help drafting an apology, the AI’s response inherently reflects a set of underlying principles. But how can we truly understand which values an AI expresses when interacting with millions of users?
The Challenge of Understanding AI Values
The core challenge lies in the nature of modern AI. These aren’t simple programs following rigid rules; their decision-making processes are often opaque. Anthropic says it explicitly aims to instil certain principles in Claude, striving to make it “helpful, honest, and harmless.” This is achieved through techniques like Constitutional AI and character training, where preferred behaviours are defined and reinforced. However, the company acknowledges the uncertainty. “As with any aspect of AI training, we can’t be certain that the model will stick to our preferred values,” the research states.
Analysing Claude to Observe AI Values at Scale
To answer this question, Anthropic developed a sophisticated system that analyses anonymised user conversations. The system removes personally identifiable information before using language models to summarise interactions and extract the values being expressed by Claude, allowing researchers to build a high-level taxonomy of these values without compromising user privacy. The study analysed a substantial dataset: 700,000 anonymised conversations from Claude.ai Free and Pro users over one week in February 2025, predominantly involving the Claude 3.5 Sonnet model.
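Anthropic has not published its implementation, but the pipeline it describes can be pictured in rough outline. In the sketch below, the helper names (remove_pii, extract_values, build_value_counts), the regex-based redaction, and the generic summarise callable standing in for a language model are all illustrative assumptions, not the actual system:

```python
import re
from collections import Counter

def remove_pii(text: str) -> str:
    """Crude anonymisation placeholder: redact e-mail addresses and long digit
    runs. A production pipeline would use a far more thorough anonymisation step."""
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "[EMAIL]", text)
    return re.sub(r"\b\d{7,}\b", "[NUMBER]", text)

def extract_values(conversation: str, summarise) -> list[str]:
    """Ask a language model (passed in as the `summarise` callable) to name the
    values the assistant expresses in the exchange, one value per line."""
    prompt = ("List, one per line, the values the assistant expresses in this "
              "conversation:\n\n" + conversation)
    return [line.strip().lower() for line in summarise(prompt).splitlines() if line.strip()]

def build_value_counts(conversations: list[str], summarise) -> Counter:
    """Aggregate extracted values across many anonymised conversations,
    producing the raw material for a higher-level taxonomy."""
    counts: Counter = Counter()
    for convo in conversations:
        counts.update(extract_values(remove_pii(convo), summarise))
    return counts
```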
Hierarchical Structure of Values
The analysis revealed a hierarchical structure of values expressed by Claude. Five high-level categories emerged, ordered by prevalence (see the sketch after this list):
- Practical values: Emphasising efficiency, usefulness, and goal achievement.
- Epistemic values: Relating to knowledge, truth, accuracy, and intellectual honesty.
- Social values: Concerning interpersonal interactions, community, fairness, and collaboration.
- Protective values: Focusing on safety, security, well-being, and harm avoidance.
- Personal values: Centred on individual growth, autonomy, authenticity, and self-reflection.
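For illustration, the hierarchy could be stored as a simple nested mapping from category to example sub-values. The structure and the categorise helper below are assumptions made for the sake of the sketch, and the sub-values are taken from the category descriptions above rather than from Anthropic's full published taxonomy:

```python
# Illustrative representation of the hierarchical taxonomy described above.
VALUE_TAXONOMY: dict[str, list[str]] = {
    "Practical":  ["efficiency", "usefulness", "goal achievement"],
    "Epistemic":  ["truth", "accuracy", "intellectual honesty"],
    "Social":     ["fairness", "collaboration", "community"],
    "Protective": ["safety", "security", "harm avoidance"],
    "Personal":   ["autonomy", "authenticity", "self-reflection"],
}

def categorise(value: str) -> str | None:
    """Map an extracted value back to its top-level category, if known."""
    for category, members in VALUE_TAXONOMY.items():
        if value in members:
            return category
    return None
```

Any value extracted in the earlier step can then be rolled up to one of the five categories before computing its prevalence.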
Nuance, Context, and Cautionary Signs
However, the picture isn’t uniformly positive. The analysis identified rare instances where Claude expressed values starkly opposed to its training, such as “dominance” and “amorality.” Anthropic suggests a likely cause: “The most likely explanation is that the conversations that were included in these clusters were from jailbreaks, where users have used special techniques to bypass the usual guardrails that govern the model’s behavior.” Far from being solely a concern, this finding highlights a potential benefit: the value-observation method could serve as an early warning system for detecting attempts to misuse the AI.
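In practice, such an early warning signal could be as simple as flagging conversations whose extracted values appear on a watch-list of terms opposed to the model's training. The WATCHLIST set and the flag_suspect_conversations helper below are a hypothetical sketch, not Anthropic's actual monitoring code:

```python
# Hypothetical early-warning check: flag conversations expressing values
# starkly opposed to the model's intended character.
WATCHLIST = {"dominance", "amorality"}

def flag_suspect_conversations(extracted: dict[str, list[str]]) -> list[str]:
    """Return IDs of conversations whose extracted values hit the watch-list,
    which may indicate jailbreak attempts worth human review."""
    return [conv_id for conv_id, values in extracted.items()
            if WATCHLIST & set(values)]
```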
Interaction with User-Expressed Values
Claude’s interaction with user-expressed values proved multifaceted (a tallying sketch follows the list):
- Mirroring/strong support (28.2%): Claude often reflects or strongly endorses the values presented by the user.
- Reframing (6.6%): In some cases, especially when providing psychological or interpersonal advice, Claude acknowledges the user’s values but introduces alternative perspectives.
- Strong resistance (3.0%): Occasionally, Claude actively resists user values. This typically occurs when users request unethical content or express harmful viewpoints.
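The percentages above are, in effect, shares of conversations given a stance label by the analysis model. A minimal tally, assuming hypothetical per-conversation labels, might look like this:

```python
from collections import Counter

# Hypothetical stance labels assigned to each conversation by the analysis model.
STANCES = ["mirroring/strong support", "reframing", "strong resistance", "other"]

def stance_breakdown(labels: list[str]) -> dict[str, float]:
    """Turn per-conversation stance labels into percentage shares,
    matching the kind of breakdown reported above."""
    counts = Counter(labels)
    total = len(labels) or 1
    return {stance: 100 * counts[stance] / total for stance in STANCES}
```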
Limitations and Future Directions
Anthropic is candid about the method’s limitations. Defining and categorising “values” is inherently complex and potentially subjective, and using Claude itself to power the categorisation might introduce bias towards its own operational principles. The method is also designed for monitoring AI behaviour post-deployment: it requires substantial real-world data and cannot replace pre-deployment evaluations.
Conclusion
The research concludes that understanding the values AI models express is fundamental to the goal of AI alignment. “AI models will inevitably have to make value judgments,” the paper states. “If we want those judgments to be congruent with our own values […] then we need to have ways of testing which values a model expresses in the real world.” This work provides a powerful, data-driven approach to achieving that understanding.
FAQs
- Q: What is the main challenge in understanding AI values?
A: The main challenge lies in the opaque decision-making processes of modern AI models.
- Q: How does Anthropic aim to instil values in Claude?
A: Through techniques like Constitutional AI and character training, where preferred behaviours are defined and reinforced.
- Q: What is the purpose of the value-observation method developed by Anthropic?
A: To observe and categorise the values Claude exhibits “in the wild” without compromising user privacy.
- Q: What are the high-level categories of values expressed by Claude?
A: Practical, Epistemic, Social, Protective, and Personal values.
- Q: Can the value-observation method serve as an early warning system for detecting attempts to misuse the AI?
A: Yes, it can identify rare instances where Claude expresses values opposed to its training, potentially indicating jailbreaks or misuse.