The Model Card Nobody Reads
Anthropic publishes a model card for every Claude release. OpenAI publishes a system card for every GPT release. Google DeepMind publishes technical reports for Gemini. Meta publishes model cards for Llama. Mistral publishes them for their models. These are the primary source documents — written by the people who built the models — that describe exactly what the model can do, what it cannot do, where it fails, and under what conditions its outputs should not be trusted.
Almost nobody reads them.
The marketing page gets millions of visits. The model card gets thousands. The blog post announcing the model gets shared across every AI newsletter and LinkedIn feed. The model card — the document that actually tells you whether this model is suitable for your use case — sits quietly on a documentation site, unread, uncited, unused.
This is a problem. Specifically, it is the kind of problem that costs companies money, produces bad deployments, and erodes trust in AI tools — all because the most important document shipped with every model is treated as a technical appendix instead of an operational manual.
What a Model Card Actually Contains
The term “model card” comes from a 2019 paper by Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. The paper proposed a standardised documentation framework for machine learning models — analogous to a nutritional label for food or a safety data sheet for chemicals.
The original framework specified: model details, intended use, factors (relevant demographic or contextual factors), metrics, evaluation data, training data, quantitative analyses, ethical considerations, and caveats and recommendations.
In practice, model cards from the major AI labs have evolved beyond this template, but the core purpose remains: honest documentation of a model’s capabilities, limitations, and appropriate use cases, written by the people who know the model best.
Anthropic’s Claude model cards, for example, contain:
Capability assessments with specific benchmarks. Not “Claude is good at reasoning” but “Claude achieves X% on the MMLU benchmark, Y% on HumanEval, Z% on MATH.” These numbers are comparable across models. They tell you, specifically, how the model performs on standardised tests of knowledge, coding ability, and mathematical reasoning.
Known limitations documented explicitly. The model card states where the model fails. Where it hallucinates. Where its outputs should not be trusted without human verification. This information is not buried in disclaimers — it is foregrounded as operational guidance.
Safety evaluations. How the model was tested for harmful outputs, bias, and misuse potential. What mitigations were applied. What residual risks remain. This is the most honest assessment of a model’s safety profile available anywhere — more honest than a marketing blog post, more specific than a journalist’s summary.
Intended use cases and misuse potential. What the model was designed for, what it was not designed for, and what uses the developers specifically advise against. For an SME evaluating whether to deploy this model for a specific task, this section is the single most valuable piece of guidance in existence.
OpenAI’s system cards provide equivalent information in a different format, with particular depth on their safety evaluation methodology — red-teaming results, automated evaluation pipelines, and the specific categories of risk they test for.
These documents are not marketing materials. They are technical disclosures. They are the closest thing the AI industry produces to honest self-assessment. And they are ignored.
Why Nobody Reads Them
Three reasons, all of them structural.
The documents are written for researchers, not operators. Model cards use the language of machine learning research: benchmark names, evaluation methodologies, statistical measures. A procurement director evaluating whether to deploy Claude for customer inquiry classification does not know what MMLU stands for, does not have a baseline for interpreting a HumanEval score, and does not know how to translate a safety evaluation into an operational risk assessment. The information is valuable. The translation layer is missing.
The marketing is easier to consume. A blog post announcing a new model is 1,500 words of accessible prose with clear claims: “faster,” “more accurate,” “better at coding.” The model card is 15,000 words of technical documentation with caveats, limitations, and conditional statements. The blog post confirms what you want to hear. The model card tells you what you need to hear. These two documents compete for the same attention, and the marketing always wins.
Nobody’s job includes reading model cards. In a 200-person company evaluating an AI deployment, no one is responsible for reading the model card. The CTO may have the technical background but lacks the time. The project manager has the time but lacks the technical background. The external consultant has a model recommendation ready before the model card has been downloaded. The model card falls into a responsibility gap — too technical for the business decision-maker, too operational for the research team, too detailed for the consultant’s timeline.
What a Model Card Tells You That Nothing Else Does
Let me demonstrate with specific examples. I will walk through four categories of information from model cards that directly affect whether an EU SME should deploy a specific model for a specific use case.
Category 1: Language Performance Variance
Model cards report multilingual performance benchmarks. These benchmarks reveal performance gaps across languages that marketing materials never mention.
A model that scores 89% on English-language question answering may score 72% on German and 58% on Portuguese. The marketing page says “supports 95+ languages.” The model card shows you the actual performance gradient — and for an EU SME operating across multiple markets, the difference between 89% and 58% is the difference between a useful tool and a liability.
When a Portuguese customer submits a query and the model’s comprehension accuracy is 31 percentage points lower than for an English query, the output quality degrades. The customer receives a less accurate response. If the response involves a recommendation, a classification, or a decision, the accuracy gap becomes a quality gap, a fairness gap, and potentially a legal gap under GDPR Article 22.
The model card tells you this. The blog post does not.
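The gap check described above takes a few lines to operationalise. A minimal sketch, using the hypothetical scores from this section and an illustrative 10-point threshold (substitute the actual figures and the threshold that fits your risk appetite):

```python
# Flag deployment languages whose benchmark score trails the primary
# language by more than a chosen threshold. All scores are illustrative;
# use the figures reported in the model card itself.

PRIMARY = "en"
GAP_THRESHOLD = 10.0  # percentage points; an assumption, not a standard

qa_scores = {"en": 89.0, "de": 72.0, "pt": 58.0}  # hypothetical QA accuracy, %

def flag_language_gaps(scores: dict[str, float]) -> dict[str, float]:
    """Return languages whose gap to the primary language exceeds the threshold."""
    baseline = scores[PRIMARY]
    return {
        lang: baseline - score
        for lang, score in scores.items()
        if lang != PRIMARY and baseline - score > GAP_THRESHOLD
    }

print(flag_language_gaps(qa_scores))  # {'de': 17.0, 'pt': 31.0}
```

Both German and Portuguese are flagged here; Portuguese, 31 points adrift, is the deployment that needs a quality assurance layer before launch.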
Category 2: Hallucination Rates by Domain
Model cards increasingly report hallucination rates — the frequency with which the model generates plausible-sounding but factually incorrect information. These rates vary dramatically by domain.
A model may hallucinate at 2% on general knowledge questions and 12% on domain-specific technical questions. For an SME deploying the model to answer customer queries about a specialised product line, the relevant hallucination rate is the domain-specific one, not the headline number.
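The headline versus domain-specific distinction becomes concrete once you multiply by volume. A back-of-envelope sketch, with placeholder rates and query counts (use the figures from the model card you are evaluating):

```python
# Rough sizing of hallucination exposure. The rates below are
# placeholders; substitute the domain-specific figure from the
# model card, not the headline number.

def expected_hallucinations(monthly_queries: int, rate: float) -> int:
    """Expected number of factually incorrect responses per month."""
    return round(monthly_queries * rate)

GENERAL_RATE = 0.02   # hypothetical headline rate, general knowledge
DOMAIN_RATE = 0.12    # hypothetical rate, domain-specific questions
QUERIES = 10_000      # assumed customer queries per month

print(expected_hallucinations(QUERIES, GENERAL_RATE))  # 200
print(expected_hallucinations(QUERIES, DOMAIN_RATE))   # 1200
```

Two hundred bad answers a month is a review queue; twelve hundred is an operational problem. The same model, the same deployment, and a sixfold difference depending on which number you planned around.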
More critically, model cards describe the types of hallucinations the model is prone to. Some models hallucinate specific details (dates, numbers, names) while getting the general direction right. Others hallucinate entire causal chains — producing explanations that sound authoritative and are completely fabricated. The type of hallucination determines the type of human oversight required.
A model that sometimes gets dates wrong needs a fact-checking layer. A model that fabricates explanations needs a domain expert reviewer. The operational response is different. The model card tells you which response is needed.
Category 3: Safety Evaluation Results
Model cards from responsible AI labs include red-teaming results — the outcomes of systematic attempts to make the model produce harmful, biased, or inappropriate outputs.
For an EU SME, the relevant safety considerations are specific: whether the model generates biased outputs that could affect employment decisions (relevant under GDPR Article 22, and classified as high-risk under Article 6 and Annex III of the EU AI Act), whether it produces discriminatory content in customer-facing applications, and whether it leaks training data that includes personal information.
The model card addresses these questions with specific test results. Not “we tested for bias” but “we tested for demographic bias across X categories using Y methodology, and observed Z pattern of residual bias in the following conditions.”
This information is essential for the conformity assessment required by the EU AI Act for high-risk AI systems. Article 9 requires a risk management system that includes identification and analysis of known and foreseeable risks. The model card is the primary source for known risks. Ignoring it is not just operationally foolish — it may be legally insufficient.
How to Read a Model Card
For an SME evaluating an AI deployment, here is the operational approach to reading a model card. This takes approximately two hours, which is less than the average steering committee meeting and produces more useful information.
Step 1: Read the intended use section first. Does the intended use match your use case? If the model card says the model is “designed for conversational assistance and content generation” and you want to use it for automated credit scoring, there is a mismatch. The mismatch does not mean the model cannot do it. It means the developers have not tested it for that purpose, which means the responsibility for testing falls on you.
Step 2: Check the multilingual benchmarks. Find the performance numbers for every language your deployment will use. If the performance gap between your primary language and secondary languages exceeds 10 percentage points, plan for a quality assurance layer in the lower-performing languages.
Step 3: Read the limitations section completely. This is the most valuable section. The developers are telling you where their model fails. They know, because they tested it. Ignoring this section is the AI equivalent of ignoring the structural engineer’s report before building on a plot of land. The information is there. The consequences of ignoring it are predictable.
Step 4: Review the safety evaluation. Identify the categories of harmful output that were tested and the residual risks that remain. Map these to your use case. If your deployment involves vulnerable populations (customers applying for financial products, job applicants, patients), the safety evaluation is not supplementary reading. It is a compliance requirement.
Step 5: Compare across models. Model cards are comparable. The same benchmarks, the same categories, the same evaluation methodologies appear across different labs’ model cards. Read three model cards for competing models and the performance differences — including the non-obvious ones buried in the appendices — become clear.
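Because the benchmark names recur across labs, the Step 5 comparison reduces to a small table. A sketch with invented model names and scores (the structure is the point; the numbers come from the model cards you are comparing):

```python
# Side-by-side comparison of benchmark figures pulled from three
# model cards. Model names and scores are placeholders.

benchmarks = {
    "model_a": {"MMLU": 86.4, "HumanEval": 84.1, "MATH": 60.1},
    "model_b": {"MMLU": 83.7, "HumanEval": 88.2, "MATH": 52.9},
    "model_c": {"MMLU": 79.5, "HumanEval": 71.0, "MATH": 45.3},
}

def best_per_benchmark(results: dict[str, dict[str, float]]) -> dict[str, str]:
    """For each benchmark, name the model with the highest reported score."""
    metrics = next(iter(results.values())).keys()
    return {
        metric: max(results, key=lambda model: results[model][metric])
        for metric in metrics
    }

print(best_per_benchmark(benchmarks))
# {'MMLU': 'model_a', 'HumanEval': 'model_b', 'MATH': 'model_a'}
```

Note that no single model wins every column, which is exactly the kind of non-obvious difference the marketing pages flatten into “best in class.”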
Category 4: Appropriate Use and Misuse Documentation
Model cards increasingly include explicit lists of intended use cases and documented misuse scenarios. These lists are not hypothetical. They are drawn from observed user behaviour during testing and deployment.
For an SME deploying a language model for customer-facing applications, the misuse documentation is operationally critical. The model card may specify: “This model is not designed for medical diagnosis, legal advice, or financial recommendations.” If your deployment uses the model to generate financial product recommendations, the model card has just told you — in writing, from the people who built the model — that your use case is outside the intended scope.
This does not mean the model cannot perform the task. It may perform it adequately. But the model card’s misuse documentation means the model developers have not tested or validated the model for that specific application. The safety evaluations do not cover your use case. The performance benchmarks are not calibrated for your domain. The liability, in the event of a harmful output, falls entirely on you — because the model card explicitly stated that your use was not intended.
For EU AI Act compliance, this documentation is directly relevant. Article 13 requires transparency about an AI system’s intended purpose. If the model card says the model is not intended for your use case, and you deploy it for that use case, you have created a compliance gap that no amount of post-hoc documentation can fill.
The model card told you. You chose not to read it. The consequence is foreseeable.
The Primary Source Principle
I read ECB reports, not what journalists say about ECB reports. I read Eurostat datasets, not what commentators say about Eurostat datasets. I read EU AI Act articles, not what consulting firms say about EU AI Act articles.
The model card is the primary source for what an AI model can and cannot do. Everything else — the blog post, the analyst report, the consultant’s recommendation, the LinkedIn hot take — is commentary. Commentary has its uses. But commentary introduces bias, compression, and agenda. The primary source does not.
The model card is not perfect. It is written by the lab that built the model, and labs have incentives to present their models favourably. But the model card is constrained by reproducibility — the benchmarks can be independently verified, the limitations can be independently tested, and the safety evaluations can be independently replicated. Marketing is unconstrained by any of these.
When I evaluate an AI model for a Bluewaves deployment, the model card is the first document I read and the last document I reference. Not the first because it is easy — because it is honest. Not the last because it is comprehensive — because the decisions we make about deployment are anchored in what the developers actually know about their model, not in what their marketing team wants us to believe.
The Operational Implication
For every AI deployment at your company, one person should read the model card. Completely. Not skimming. Not the executive summary. The full document.
That person should translate the model card’s technical assessments into three operational documents:
A capability assessment that states, in plain language, what the model can and cannot do for your specific use case, based on the model card’s benchmarks and limitations.
A risk register that maps the model card’s safety evaluations and known limitations to your specific deployment context, identifying which risks are relevant, which mitigations are needed, and which residual risks must be accepted.
A monitoring plan that specifies how you will verify, in production, that the model’s actual performance matches the model card’s documented performance — because models can degrade, use cases can drift, and the only check on the model card’s claims is your own observation.
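The monitoring plan can start as something very simple: a weekly check of a hand-labelled sample against the model card's documented figure. A minimal sketch, with an assumed baseline and tolerance (set both from your own model card and risk appetite):

```python
# Minimal production drift check: compare observed accuracy on a
# labelled sample against the model card's documented figure, and
# alert when the shortfall exceeds a tolerance. Numbers are illustrative.

DOCUMENTED_ACCURACY = 0.89   # assumed figure from the model card
TOLERANCE = 0.05             # acceptable shortfall before escalation

def check_drift(correct: int, total: int) -> tuple[float, bool]:
    """Return (observed accuracy, whether it fell below the tolerance band)."""
    observed = correct / total
    drifted = observed < DOCUMENTED_ACCURACY - TOLERANCE
    return observed, drifted

# Weekly review of a hand-labelled sample of production outputs
obs, alert = check_drift(correct=161, total=200)
print(f"observed={obs:.3f} alert={alert}")  # observed=0.805 alert=True
```

Two hundred labelled samples a week is an afternoon of one person's time, and it is the only mechanism that tells you the model card's numbers still describe the model you are actually running.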
These three documents take one person approximately four hours to produce. They cost nothing. They prevent the most common and most expensive AI deployment failures: deploying a model for a use case it was never designed for, deploying in a language where performance is materially lower, and deploying without a monitoring system that catches degradation before users do.
The model card is free. Reading it is free. Acting on it is free.
The cost of not reading it is the deployment that fails and the team that loses trust in AI tools because nobody read the document that would have predicted the failure.
Read the model card.
The primary source is available. The primary source is free. The primary source contains information that no secondary source — no blog post, no analyst report, no consultant’s recommendation — can replicate.
The model card is written by the people who built the model. They know things about its behaviour that nobody else knows. They documented those things — honestly, specifically, with benchmarks and caveats — in a document that is publicly available and systematically ignored.
The gap between the marketing page and the model card is the gap between what you want to hear and what you need to know. The model card is what you need to know.
Read it.