The Alignment Problem Is Human
Érica · February 10, 2026

13 min read

Brian Christian’s The Alignment Problem traces the history of a deceptively simple question: how do you make a machine do what you want? The book follows the question from early reinforcement learning through modern large language models, documenting the increasingly sophisticated — and increasingly frustrated — attempts to specify human values in terms a machine can follow.

The conventional reading of the alignment problem is technical: the machine doesn’t understand what we want. The field is racing to fix this. Constitutional AI, RLHF, automated red-teaming, interpretability research — all aimed at making the machine better at understanding and following human intention.

I want to offer a different reading. The harder problem is not that the machine fails to understand our values. The harder problem is that we fail to understand our own values clearly enough to specify them.

The alignment problem is human before it is technical. And the evidence for this is not in a research lab. It is in every office where an AI tool has been deployed and the team cannot agree on what “good” looks like.

The Specification Problem

When Bluewaves deploys an AI tool for a client — say, a customer service classification system — the first step is specification: what do you want the tool to do? The answer seems obvious. “Classify incoming support tickets by urgency and route them to the right team.”

Then the questions start.

What counts as urgent? A customer threatening to leave? A customer reporting a safety issue? A customer asking for a refund over €500? All three? The team discusses. They discover that different team members have different implicit definitions of urgency. The customer service lead defines urgency by customer lifetime value. The operations manager defines urgency by SLA breach risk. The legal team defines urgency by liability exposure.

Three definitions. Three legitimate perspectives. No shared specification.

This is not a technology problem. The AI model can classify by any definition of urgency it is given. The problem is that the organisation has operated with three implicit, overlapping, partially contradictory definitions — and it worked because human agents unconsciously triangulated between them, using judgment that integrated all three perspectives without formalising any of them.

The machine cannot triangulate implicitly. It needs an explicit specification. The act of writing the specification forces the organisation to confront the ambiguity it has lived with comfortably for years.
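The difference between implicit triangulation and explicit specification is easy to see in code. Below is a minimal sketch of what the routing rule looks like once it is written down; every field name and threshold is a hypothetical assumption for illustration, but each one is a decision the organisation must make before the system can run at all.

```python
# Hypothetical ticket-routing rule. All field names and thresholds are
# illustrative assumptions -- the point is that each must be chosen
# explicitly, where human agents triangulated between them implicitly.

def is_urgent(ticket: dict) -> bool:
    """An explicit urgency definition the organisation must agree on."""
    # Legal's definition: liability exposure always wins.
    if ticket.get("safety_issue"):
        return True
    # Operations' definition: SLA breach risk.
    if ticket.get("hours_until_sla_breach", 999) < 4:
        return True
    # Customer service's definition: high-value customer at risk.
    if ticket.get("churn_threat") and ticket.get("lifetime_value", 0) > 5000:
        return True
    return False

print(is_urgent({"churn_threat": True, "lifetime_value": 8000}))  # True
```

Note that even the ordering of the three checks encodes a priority that no human agent ever had to state.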

Brian Christian describes this as the central insight of the alignment problem: “The difficulty is not just in getting the AI to do what we want; it’s in knowing what we want.” The AI deployment becomes an alignment audit — not of the machine, but of the organisation.

The Revealed Preference Gap

Economists distinguish between stated preferences (what people say they want) and revealed preferences (what their behaviour shows they actually want). The gap between the two is the subject of entire research programmes in behavioural economics.

AI deployment surfaces this gap with uncomfortable clarity.

A team says it wants “consistent customer service quality.” The AI tool, trained on the team’s historical responses, reveals that “consistent” means different things to different agents. Agent A writes detailed, empathetic responses that average 340 words. Agent B writes direct, efficient responses that average 80 words. Agent C escalates 40% of tickets that Agents A and B would handle directly. The team’s stated preference is consistency. Their revealed practice is radical inconsistency — held together by the fact that customers rarely compare the response styles they receive.

The AI tool, asked to produce “consistent” responses, must choose: consistent like Agent A, consistent like Agent B, or a hybrid that satisfies neither. The specification requires a decision that the team has never made because the ambiguity was invisible until the machine required resolution.

This is the human alignment problem. The machine holds up a mirror. The organisation does not always like what it sees.

The Values Hierarchy Problem

Christian’s book documents the challenge of value alignment at the model level — how do you encode “be helpful but not harmful” in a way that handles edge cases? The workplace version of the same problem is the values hierarchy: when two legitimate values conflict, which one wins?

Every organisation has this hierarchy. Most organisations have never articulated it.

A financial services company deploys an AI tool for loan application screening. The stated values: fairness, efficiency, and risk management. These values coexist comfortably in the abstract. In practice, they conflict regularly:

Fairness says: evaluate each application on its individual merits. Efficiency says: use statistical patterns to fast-track obvious approvals and rejections. Risk management says: flag any application with characteristics associated with higher default rates.

The statistical patterns that enable efficiency are built from historical data that reflects historical biases. The characteristics associated with higher default rates correlate with demographic factors that fairness requires you to ignore. The three values cannot all be maximised simultaneously. The organisation must choose — explicitly — which value takes priority in which context.

Before the AI tool, the human loan officer managed this conflict intuitively, case by case, with implicit judgment that was never formalised. The decisions were defensible individually (each officer could explain their reasoning) but inconsistent collectively (different officers resolved the same conflict differently).

The AI tool requires a hierarchy. Not “these values are all important” — that is a statement, not a hierarchy. “When fairness and efficiency conflict, fairness takes precedence. When fairness and risk management conflict, here is the specific threshold where risk management overrides.” These are the decisions the alignment problem forces — not about the machine, but about the organisation.
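What a written hierarchy looks like in practice can be sketched in a few lines. The values, thresholds, and field names below are illustrative assumptions, not a real screening model; what matters is that the precedence is stated rather than left to case-by-case intuition.

```python
# Hypothetical loan-screening hierarchy. Thresholds and field names are
# assumptions for illustration; the structure shows fairness taking
# precedence, with an explicit threshold where risk management overrides
# efficiency.

PROTECTED_FIELDS = {"age", "postcode", "marital_status"}  # fairness: never used

def screen(application: dict) -> str:
    # Fairness first: strip the features fairness requires us to ignore.
    features = {k: v for k, v in application.items() if k not in PROTECTED_FIELDS}
    risk = features.get("default_risk_score", 0.0)
    # Risk management overrides efficiency only above an explicit threshold.
    if risk > 0.8:
        return "manual_review"
    # Efficiency: fast-track the clear cases.
    if risk < 0.2:
        return "fast_track_approve"
    return "standard_review"

print(screen({"default_risk_score": 0.1, "age": 63}))  # fast_track_approve
```

The 0.8 threshold may well be wrong. But a wrong explicit threshold can be audited and moved; an implicit one cannot.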

The Proxy Problem

In The Alignment Problem, Christian describes Goodhart’s Law — “When a measure becomes a target, it ceases to be a good measure” — as the central failure mode of aligned systems. You want the AI to maximise customer satisfaction. You measure customer satisfaction with a survey score. The AI optimises for survey score. Survey scores go up. Customer satisfaction may or may not follow — because the survey was a proxy, not the thing itself.

This is not a technical failure. It is a human failure of specification. We chose the proxy. The machine optimised for it. The outcome we didn’t want was predictable from the specification we did write.

In workplace AI deployments, proxy failures are pervasive:

The ticket closure proxy. An AI system is measured on “tickets resolved per day.” The system learns to resolve tickets quickly. Resolution quality drops because speed was the proxy, not quality. But nobody specified what “quality” means in operational terms — so the machine optimised for the proxy that was specified.
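The divergence between proxy and intent can be shown with a toy example. The numbers below are invented for illustration: two response policies, where optimising the specified proxy selects the policy the organisation actually wants less.

```python
# Toy illustration of the proxy failure (all numbers are made up).
policies = {
    "thorough": {"tickets_per_day": 20, "resolution_quality": 0.9},
    "rushed":   {"tickets_per_day": 45, "resolution_quality": 0.4},
}

# Optimise the proxy that was specified...
best_by_proxy = max(policies, key=lambda p: policies[p]["tickets_per_day"])
# ...versus the quality that was intended but never operationalised.
best_by_intent = max(policies, key=lambda p: policies[p]["resolution_quality"])

print(best_by_proxy)   # rushed
print(best_by_intent)  # thorough
```

The optimiser is not malfunctioning in this sketch; it is faithfully maximising the only quantity anyone wrote down.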

The engagement proxy. An AI content tool is measured on “user engagement.” The tool learns to produce content that generates clicks, comments, and shares. The content becomes increasingly provocative because engagement was the proxy, and provocation drives engagement. But the organisation wanted “meaningful engagement,” which is harder to specify and harder to measure.

The compliance proxy. An AI risk assessment tool is measured on “compliance with guidelines.” The tool learns to produce assessments that satisfy the checklist. The assessments become formulaic because compliance was the proxy. But the organisation wanted “genuine risk assessment,” which requires judgment that a checklist cannot capture.

In each case, the human chose the proxy. The machine followed the proxy faithfully. The outcome disappointed the human — not because the machine was misaligned, but because the human’s specification was misaligned with their actual intention.

The alignment problem is a mirror. The machine does what you specified. If you don’t like the result, the problem is in the specification.

The Articulation Burden

Here is the part that I find most compelling in Christian’s framework, and the part that connects most directly to my own work in organisational psychology.

The alignment problem creates an articulation burden — the requirement to make explicit what has always been implicit. This burden falls on the humans, not the machine. The machine does not care whether you can articulate your values. It will follow whatever specification it is given. The consequence of a poor specification falls entirely on the specifier.

For organisations, the articulation burden is significant because most organisational knowledge is tacit. Michael Polanyi’s distinction between tacit knowledge (what we know but cannot express) and explicit knowledge (what we can state and codify) applies directly. The experienced customer service agent who “just knows” how to handle a difficult customer is operating on tacit knowledge — pattern recognition built from thousands of interactions, refined by feedback, and stored in a form that resists articulation.

When the AI tool needs to replicate this judgment, the tacit knowledge must become explicit. “Handle difficult customers well” must become “When a customer expresses frustration, acknowledge the emotion before addressing the problem. When a customer threatens to leave, check their account history and, if they have been a customer for more than two years, offer retention discount tier B.” The specificity required is exhausting. The original agent never thought in these terms. They “just knew.”
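The retention rule from the paragraph above, once made explicit, might look like this. “Tier B” and the two-year tenure threshold come from the example in the text; the function name, field names, and the escalation branch are hypothetical additions.

```python
# One piece of tacit judgment made explicit. Field names and the
# escalation branch are illustrative assumptions.

def handle_churn_threat(customer: dict) -> str:
    """What 'handle difficult customers well' becomes once written down."""
    if customer.get("threatens_to_leave"):
        if customer.get("tenure_years", 0) > 2:
            return "offer_retention_discount_tier_B"
        # The spec must also cover the case the agent never had to name:
        # what happens to a one-year customer who threatens to leave?
        return "escalate_to_retention_team"
    return "standard_response"
```

Notice that the escalation branch is itself a new decision: the agent’s tacit knowledge never had to state what happens below the tenure threshold, but the specification cannot leave it blank.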

The articulation burden is the hidden cost of AI deployment. Not the licence fee. Not the compute cost. Not the integration engineering. The cognitive and organisational effort of making explicit what has always been implicit — and discovering, in the process, that the implicit knowledge was less consistent, less coherent, and less aligned than anyone assumed.

The Tuesday Morning Test

I keep coming back to a test I apply to every AI alignment question I encounter: the Tuesday morning test. Forget the philosophy. Forget the research papers. Forget the abstract values discussion. It is Tuesday morning. A specific person is sitting at a specific desk with a specific task. The AI tool is open. The person types a query. The tool responds.

Is the response what the person needed?

The answer depends on whether the tool’s specification captured what the person actually needs — which depends on whether the organisation articulated what it actually values — which depends on whether the organisation knows what it actually values.

On Tuesday morning, the alignment problem is not about the machine. It is about the procurement officer who needs the tool to understand that “urgent” means “the customer mentioned our competitor” — a definition that exists in no specification, no training data, and no policy document, but is the operational reality of that team’s definition of urgency.

The machine cannot know this unless a human articulates it. And the human has never articulated it because, until the machine arrived, no one asked.

The Organisational Alignment Process

What does it look like to do this work? To actually align the organisation before trying to align the tool?

Phase 1: Surface the implicit. Bring together the people who will use the tool and ask them to define, independently, what “good” looks like for the tool’s output. Don’t discuss it first — independent articulation prevents conformity bias. Compare the definitions. The divergence is the data. Where definitions disagree is where the alignment work begins.

Phase 2: Name the conflicts. Where the implicit definitions contradict each other, name the contradiction. Not “we have different perspectives” (that is a euphemism for conflict avoidance). Name the specific conflict: “You define urgency by customer value. You define urgency by SLA risk. These produce different classifications for the same ticket. Which definition does the tool use?”

Phase 3: Decide the hierarchy. For each conflict, make a decision. Not a consensus (consensus is often a refusal to decide). A decision. “For classification purposes, urgency is defined by SLA breach risk. Customer value is a secondary factor surfaced to the agent but not used for routing.” The decision may be wrong. It is still more useful than ambiguity, because a wrong decision can be identified and corrected. Ambiguity cannot be corrected — it persists until someone confronts it.

Phase 4: Specify the proxies. For each value the tool is asked to optimise, define the proxy and acknowledge its limitations. “We measure quality by customer satisfaction score. We know this proxy does not capture long-term relationship health. We will supplement it with a quarterly review of customer retention rates among tickets handled by the tool.” The proxy is a compromise. Name it as one.

Phase 5: Iterate. The first specification will be wrong. Not catastrophically wrong — practically wrong. The tool will produce outputs that are technically aligned with the specification but misaligned with the intention. Each misalignment is a lesson in specification clarity. Use it to refine.

The Ongoing Alignment

Alignment is not a one-time activity. It is ongoing — because the organisation’s values, priorities, and operational context change over time.

The specification that was correct in January may be miscalibrated by June. The customer base changed. The regulatory environment shifted. The team composition evolved. The definition of “urgent” that worked six months ago no longer captures the current operational reality.

This ongoing misalignment is a feature of organisational life, not a failure of specification. Organisations are dynamic systems. Their values and priorities are in continuous flux. The specification — which is static — drifts away from the reality — which is dynamic.

In traditional operations, this drift is absorbed by human judgment. The customer service agent who has been on the team for three years implicitly adjusts their definition of “urgent” as the context changes. They don’t rewrite the policy. They adjust their practice. The adjustment is invisible, gradual, and effective.

The AI tool does not adjust implicitly. It follows the specification. If the specification drifts away from the reality, the tool’s outputs drift with it — still aligned with the specification, but misaligned with the intention.

The operational response: scheduled alignment reviews. Every quarter, the team that uses the AI tool should revisit the specification: are the definitions still accurate? Have the priorities changed? Are there new edge cases the specification doesn’t cover? The review is short — an hour. The cost of not conducting it is the gradual accumulation of misalignment, producing outputs that are technically correct and operationally wrong.

This is the maintenance cost of alignment. Not technical maintenance. Organisational maintenance. The work of keeping the specification current with the organisation’s evolving understanding of its own values.

The Integration

Brian Christian wrote about the alignment problem as a technical challenge. It is. But it is also a human challenge — and the human challenge precedes and subsumes the technical one.

You cannot align a machine with values you haven’t articulated. You cannot articulate values you haven’t examined. You cannot examine values in an environment where examination is unsafe — which brings us back to psychological safety, to the incentive structures that reward stated values over practised values, to the gap between what organisations say and what they do.

The alignment problem is not a problem to be solved. It is a condition to be managed. The gap between intention and specification is permanent. The best you can do is narrow it — through articulation, through conflict resolution, through iteration, and through the humility to recognise that the machine’s most common failure mode is not misunderstanding your values but understanding them exactly as you specified them.

The machine is aligned. The question is whether you are.

Written by
Érica
Organizational Psychologist

She knows why people resist tools — and how to design tools they’ll love. When Érica speaks, companies change direction. Not from persuasion. From understanding.
