The Curse of Multilinguality
Bernardo · March 31, 2026


Every language you add to a model makes every other language worse.

This is not a metaphor. It is a measured phenomenon. Google DeepMind calls it the curse of multilinguality — and their ATLAS study, presented at ICLR 2026, is the largest empirical confirmation to date. Seven hundred and seventy-four training runs. Over four hundred languages. Thirty-eight evaluation languages. A cross-lingual transfer matrix spanning 1,444 language pairs.

The findings are precise. To support twice as many languages without losing performance, a model needs 1.18 times the parameters and 1.66 times the training data. The arithmetic is non-negotiable. Capacity is finite. Languages compete for it. And the competition is not fair.

The Capacity Problem

The curse of multilinguality was first named by Alexis Conneau and colleagues in 2020, in their work on XLM-R — the cross-lingual language model that demonstrated, for the first time, that multilingual pretraining could approach monolingual performance. The finding was paradoxical. The same paper that proved multilingual models could work also proved they came with a structural penalty.

The mechanism is straightforward. A language model has a fixed number of parameters. Each parameter is a slot for learned information — vocabulary, grammar, semantics, pragmatics, world knowledge. A monolingual English model dedicates all its parameters to English. A bilingual English-French model splits its capacity. A model trained on a hundred languages divides the same finite resource a hundred ways.

The division is not equal. English, with its vast training corpus, consumes more capacity. Low-resource languages receive less. But the constraint is absolute: every language added to the model reduces the per-language allocation. The model becomes broader and shallower simultaneously.

Conneau’s original insight was that this trade-off produces a characteristic curve. Adding the first few languages improves performance — especially for low-resource languages, which benefit from cross-lingual transfer. A Swahili model trained alongside English performs better than a Swahili model trained alone, because English syntax and semantic patterns transfer. But beyond a threshold, the returns reverse. Each additional language begins to degrade performance on every existing language. The capacity is saturated. The interference exceeds the transfer.

This is the curse. Not a bug. A structural property of shared-capacity architectures.

What ATLAS Measured

The ATLAS study — Adaptive Transfer Scaling Laws — did what no previous study had attempted at scale. It quantified the curse across 774 separate training experiments, ranging from 10 million to 8 billion parameters, and derived the first practical scaling laws for multilingual model design.

Three findings matter for anyone deploying multilingual AI.

The scaling tax is real but moderate. Doubling the number of languages requires increasing model size by a factor of 1.18 and total training data by a factor of 1.66. The per-language data decreases — each language receives 83 per cent of what it would get in a model supporting half as many languages. The positive transfer between related languages partially compensates, but cannot fully offset the capacity tax.
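The scaling tax can be turned into a back-of-the-envelope budget. A minimal sketch in Python, assuming the 1.18 and 1.66 factors compound per doubling (the compounding form is an extrapolation from the stated rule, not something the study spells out, and the baseline figures below are hypothetical):

```python
import math

def multilingual_budget(base_params, base_tokens, n_languages,
                        param_factor=1.18, data_factor=1.66):
    """ATLAS rule of thumb: each doubling of supported languages needs
    1.18x the parameters and 1.66x the training tokens. Compounding the
    factors across doublings is an assumption, not a study result."""
    doublings = math.log2(n_languages)  # doublings vs a 1-language baseline
    params = base_params * param_factor ** doublings
    tokens = base_tokens * data_factor ** doublings
    # Per-language share: 1.66x total data split over 2x languages = 83% each
    per_language_tokens = tokens / n_languages
    return params, tokens, per_language_tokens

# Hypothetical 1B-parameter, 100B-token monolingual baseline, scaled to 8 languages
params, tokens, per_lang = multilingual_budget(1e9, 100e9, n_languages=8)
```

Three doublings later, each language's own data share has shrunk to 0.83³, about 57 per cent of the monolingual baseline, even though the total corpus has grown.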

Transfer is asymmetric. The cross-lingual transfer matrix — 38 languages evaluated pairwise, producing 1,444 measured interactions — reveals that the relationship between languages is directional. English, French, and Spanish are what the researchers call “widely helpful” languages. Training on English data improves performance on dozens of other languages. Training on Yoruba data does not improve English. The transfer flows downhill — from high-resource to low-resource, from languages with large, diverse corpora to languages with small, homogeneous ones. The reverse flow is negligible.
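The directionality is easy to state computationally. A toy sketch of how "widely helpful" languages could be read off a pairwise transfer matrix; the values below are invented for illustration and only mimic the downhill direction the study reports:

```python
# Toy pairwise transfer matrix: transfer[src][tgt] is the score change on the
# target language when source-language data is added. Illustrative values only.
transfer = {
    "en": {"fr": 2.1, "sw": 3.4},
    "fr": {"en": 0.3, "sw": 1.2},
    "sw": {"en": 0.0, "fr": 0.1},
}

def rank_helpfulness(matrix):
    """Rank source languages by the total benefit they confer on others."""
    totals = {src: sum(row.values()) for src, row in matrix.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_helpfulness(transfer)
# English ranks first; the reverse flow from Swahili is negligible
```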

Language families cluster. Languages that share scripts and grammatical structures transfer more effectively. Romance languages help each other. Germanic languages help each other. But the help is still asymmetric within families. French helps Portuguese more than Portuguese helps French. The mechanism is data quality: French has a larger, more diverse web corpus. The model learns patterns from the richer source and applies them to the poorer one.

The implication is architectural. A multilingual model is not a democratic assembly of languages. It is a hierarchy — with English at the top, major European languages in the middle, and low-resource languages receiving whatever capacity remains after the dominant languages have been served.

The Chang Confirmation

ATLAS did not arrive in isolation. At EMNLP 2024, Tyler Chang and colleagues published “When Is Multilinguality a Curse?” — a study spanning over 10,000 training runs across 250 languages. Their findings anticipated ATLAS on every significant dimension.

Low-resource languages benefit from multilingual pretraining — up to a point. The benefit is equivalent to increasing the low-resource language’s dataset by up to 33 per cent. The syntactic similarity of the added languages determines the magnitude of the transfer. Vocabulary overlap provides marginal additional benefit.

High-resource languages perform worse in every multilingual configuration. Without exception. English in a multilingual model is always weaker than English in a monolingual model of the same size. The degradation is consistent, measurable, and unremarkable — in the sense that no one in the research community is surprised by it. The surprise, such as it is, belongs to the practitioners who deploy these models without understanding the trade-off they have accepted.

The Chang study’s critical finding: as dataset sizes increase, the curse intensifies. Larger training corpora do not solve the capacity problem. They expose it. More data per language means more competition for the same parameters. The model’s performance curve bends downward earlier and more steeply.

The implication for production systems is direct. A model trained on twenty languages with abundant data will show larger per-language degradation than a model trained on twenty languages with limited data. Scale amplifies the curse.

The Benchmark Evidence

The theoretical findings map to observable performance gaps. MMLU-ProX — a multilingual benchmark published at EMNLP 2025, covering 29 languages with 11,829 identical questions per language — provides the most controlled measurement of what the curse looks like in practice.

The best-performing model achieved 70.3 per cent accuracy on English. The same model, on the same questions translated into Bengali, achieved 52.7 per cent. On Swahili, 40.1 per cent. The gap between English and the lowest-performing language: 30.2 percentage points. Nearly half the model’s English capability, lost.

European languages fare better than Bengali or Swahili — but they do not fare well. French, German, and Spanish cluster in a band approximately 5 to 10 percentage points below English. Portuguese, Dutch, and Swedish sit lower still. The gap is not catastrophic. It is consistent. And it is structural — the same gap appears across every model tested, regardless of architecture, training procedure, or proclaimed multilingual capability.

The gap means something specific. A model that achieves 70 per cent accuracy on English business questions achieves approximately 60 to 65 per cent on the same questions in German, and approximately 55 to 60 per cent in Portuguese. The Portuguese user is not receiving a slightly degraded service. The Portuguese user is receiving a measurably less capable tool — on the same task, with the same complexity, in a language the model claims to support.
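The gap arithmetic can be restated as capability retention relative to English, using only the MMLU-ProX figures already quoted:

```python
# Best-model accuracy on MMLU-ProX, in per cent, from the figures quoted above
accuracy = {"en": 70.3, "bn": 52.7, "sw": 40.1}

def retention(scores, reference="en"):
    """Fraction of the reference language's measured capability that each
    language retains on the identical question set."""
    ref = scores[reference]
    return {lang: acc / ref for lang, acc in scores.items()}

r = retention(accuracy)
# Swahili retains roughly 57 per cent of the English score
```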

What the EU SME Experiences

The research is abstract. The experience is not.

Consider a mid-sized company in the Netherlands — 200 employees, operations in six EU markets. The company deploys an AI-powered customer service tool across its markets: Dutch, German, French, Spanish, Portuguese, and English. The vendor’s marketing page lists all six languages as “supported.” The pricing is the same for all markets.

The tool works well in English. Responses are accurate, well-structured, and contextually appropriate. The English-speaking customers report high satisfaction.

In German, the tool is noticeably weaker. Formality registers are inconsistent — the tool occasionally uses du where Sie is expected. Technical vocabulary is sometimes approximate. The responses are usable but require more human review.

In Portuguese, the degradation is more pronounced. The tool generates grammatically correct text that sounds translated. Idiomatic expressions are off. The response structure follows English patterns — direct, task-oriented, with minimal relational preamble — in a market where customer service expectations include warmth and personal acknowledgment. The Portuguese-speaking customers do not file complaints about AI quality. They simply prefer the human agent. The adoption numbers tell the story.

In Dutch, the tool performs adequately but the small size of the Dutch-language training corpus means it occasionally hallucinates terminology or produces constructions that read as Belgian Dutch rather than Netherlands Dutch. The distinction matters. A Flemish formality register deployed in Amsterdam is a subtle but persistent signal of foreignness.

In Swedish, the output is functional but sparse. The model has less Swedish training data than French or German data. The responses are shorter, less nuanced, and occasionally default to English terminology where Swedish equivalents exist but are less common in the training corpus.

The company pays the same price for all six languages. The company receives six different levels of capability. The vendor’s marketing page does not disclose this variance. The ATLAS study explains why the variance exists. The vendor may not know the explanation. The variance exists regardless.

This is not a vendor failure. It is a structural property of the technology. The curse of multilinguality is built into the architecture. Every model that claims multilingual support delivers unequal support — with the inequality following a predictable pattern that favours English and penalises everything else.

The Asymmetry Problem

The transfer asymmetry in the ATLAS matrix deserves a closer examination, because it has implications that extend beyond model performance.

English, French, and Spanish are the most beneficial training languages for other languages. This is not because they are linguistically superior. It is because the web — the primary source of training data — contains vastly more high-quality text in these languages. English alone accounts for an estimated 55 to 60 per cent of web content. French and Spanish contribute substantially. German less so. Portuguese, Dutch, and Swedish are minor contributors.

The asymmetry creates a subsidy structure. High-resource languages subsidise low-resource languages through positive transfer. English training data improves Swahili performance. Swahili training data does not improve English performance. The subsidy flows in one direction.

For European languages, the subsidy dynamics are more nuanced. French subsidises Portuguese — both are Romance languages sharing syntactic structures and a significant portion of their vocabulary. But Portuguese does not subsidise French to the same degree. The relationship is asymmetric because the training corpora are asymmetric. More French text means more patterns for the model to learn. The model transfers those patterns to Portuguese. The reverse transfer is weaker because there are fewer Portuguese patterns to transfer.

The practical consequence: in a multilingual model, Portuguese quality is partially dependent on French data quality. Dutch quality is partially dependent on German and English data quality. Swedish quality is partially dependent on Danish, Norwegian, and English data quality. Each smaller language is downstream of its larger relatives.

The dependency is invisible to the end user. The Portuguese customer interacting with a chatbot does not know that the chatbot’s Portuguese capability is partially a function of how much French data was in the training set. The dependency is invisible to the vendor, too, unless the vendor has read the ATLAS paper. Most have not.

The Disclosure Problem

Eurostat reported in December 2025 that 20 per cent of EU enterprises with ten or more employees use artificial intelligence technologies. The adoption rate has grown 6.5 percentage points in a single year. Among large enterprises, adoption exceeds 40 per cent. Among small enterprises — the core of the EU economy — adoption sits at approximately 11 per cent.

The most common AI use is analysing written language. The second fastest-growing use is generating written and spoken language. These are precisely the applications where the curse of multilinguality operates most directly.

An EU SME deploying an AI writing tool across multiple markets is deploying a tool with built-in linguistic inequality. The inequality is a structural property of the model. It is not disclosed in marketing materials. It is not quantified in vendor documentation. It is not addressed in service-level agreements.

The EU AI Act — specifically Article 10 — requires that high-risk AI systems are trained on data that is “relevant and sufficiently representative” in view of the intended purpose. The legislation does not define what “representative” means for multilingual deployment. It does not specify a minimum per-language performance threshold. It does not require vendors to disclose the performance differential between supported languages.

The gap between the regulatory requirement and the technical reality is the disclosure problem. A model that claims to support Portuguese but delivers measurably inferior Portuguese output compared to its English output is making a claim that is technically true and practically misleading. The Portuguese is supported. The Portuguese is also structurally worse.

No one discloses this. Not the model builders, who publish aggregate multilingual benchmarks. Not the vendors, who list supported languages without performance qualifications. Not the procurement departments, who evaluate the tool in English and deploy it in six languages.

The curse of multilinguality is an open secret in the research community. It is an unknown fact in the business community. The ATLAS study, with its 774 training runs and its 1,444 language pairs, has quantified what researchers have known for years. The quantification has not reached the people who need it.

The Monolingual Alternative

The ATLAS study also quantifies when monolingual models outperform multilingual ones — and the threshold is informative.

For a language with sufficient training data, a monolingual model of the same size always outperforms a multilingual model. The breakeven point depends on the language’s data availability. For English, a monolingual model is always better. For French and German, a monolingual model is better above a moderate data threshold. For low-resource languages with limited data, the multilingual model remains superior — the cross-lingual transfer outweighs the capacity tax.

The practical implication for an EU SME: if your primary market is German-speaking, a monolingual German model will outperform the German capability of a multilingual model. If you operate across six EU markets, you face a choice. Deploy one multilingual model and accept the per-language degradation. Or deploy six monolingual models and accept the infrastructure cost.

The first option is cheaper. The second option is better. Most companies choose the first option without knowing they have made a trade-off. The marketing page says “supports 95 languages.” The marketing page does not say “supports English at 100 per cent capability and Portuguese at 82 per cent capability.”

The choice is not binary. Fine-tuning offers a middle path — a multilingual base model fine-tuned on language-specific data can recover some of the lost performance. The ATLAS study finds that fine-tuning is more compute-efficient than pretraining from scratch at lower token budgets, with pretraining becoming advantageous only when data and compute exceed a language-dependent threshold.
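The decision the study describes reduces to a token-budget comparison. A deliberately minimal sketch; the breakeven value is language-dependent and not given in the text, so it appears here as an unfilled placeholder parameter:

```python
def training_strategy(available_tokens, breakeven_tokens):
    """Decision rule sketched from the text: fine-tune a multilingual base
    at low token budgets; pretrain a monolingual model from scratch only
    once data and compute pass a language-dependent threshold.
    `breakeven_tokens` is a placeholder; the study gives no single value."""
    if available_tokens > breakeven_tokens:
        return "pretrain monolingual"
    return "fine-tune multilingual base"
```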

For most EU SMEs, fine-tuning is the realistic path. But fine-tuning requires language-specific data, language-specific evaluation, and language-specific quality standards — none of which are included in a standard multilingual AI deployment.

The Democratic Illusion

The marketing language of multilingual AI is democratic. “Supports 95 languages.” The implication: all languages are supported equally. The reality: all languages are supported unequally, with the inequality following the exact contours of global linguistic power.

English, the language of the internet, of academic publishing, of technology documentation, receives the most training data and delivers the best performance. French, Spanish, and German — the other languages of the web — follow. Portuguese, Dutch, Swedish, and the rest of the EU’s 24 official languages receive progressively less.

The pattern is not arbitrary. It reproduces the existing hierarchy of linguistic power in digital infrastructure. Languages that are well-represented on the web are well-served by AI. Languages that are poorly represented on the web are poorly served by AI. The model does not create the inequality. The model inherits it — and propagates it to every application built on top of it.

For the EU — an institution built on the principle of linguistic equality across its member states — the curse of multilinguality is not merely a technical problem. It is a structural contradiction. The EU mandates that every citizen can interact with EU institutions in their official language. The AI tools that EU institutions and businesses deploy cannot deliver on that mandate equally. The tools deliver English-quality output in English, and degraded output in everything else.

The ATLAS study makes this measurable. The transfer matrix shows, with quantitative precision, that a model trained on all EU official languages will deliver unequal quality across those languages. The inequality is not a failure of the model. It is a property of the architecture — and of the data ecosystem that feeds it.

What This Means for the Builder

The curse of multilinguality is not a problem that individual companies can solve. The architecture of shared-capacity models produces unequal per-language performance. This is physics, not policy.

What individual companies can do is stop pretending the inequality does not exist.

Measure per-language. Do not evaluate your AI tool in English and assume equivalent performance in Portuguese. Test each language independently. Measure accuracy, fluency, register appropriateness, and task completion in each language you claim to support. The MMLU-ProX benchmark methodology offers a template: identical tasks across languages, with per-language scoring.
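The per-language measurement loop is simple to sketch. Everything here is hypothetical scaffolding: `run_model` stands in for whichever deployed tool you are testing, and the toy model exists only to make the per-language gap visible:

```python
def evaluate_per_language(run_model, test_sets):
    """Score the same task set independently in every supported language.
    `run_model(prompt)` is a hypothetical callable wrapping the deployed tool;
    `test_sets` maps language codes to (prompt, expected) pairs."""
    results = {}
    for lang, cases in test_sets.items():
        correct = sum(1 for prompt, expected in cases
                      if run_model(prompt).strip() == expected)
        results[lang] = correct / len(cases)
    return results

# Toy stand-in model that only "knows" English, to make the gap visible
toy = lambda p: "42" if p.startswith("EN:") else "?"
scores = evaluate_per_language(toy, {
    "en": [("EN: q1", "42"), ("EN: q2", "42")],
    "pt": [("PT: q1", "42"), ("PT: q2", "42")],
})
# scores -> {"en": 1.0, "pt": 0.0}
```

The point of the identical-task design, borrowed from the MMLU-ProX methodology, is that any score difference between languages is attributable to the model, not the task.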

Disclose per-language. If your tool delivers 70 per cent accuracy in English and 58 per cent in Portuguese, say so. The disclosure is uncomfortable. The alternative is a service-level agreement that promises something the technology cannot deliver.

Invest per-language. Fine-tuning on language-specific data is the most accessible mitigation. It does not eliminate the curse. It reduces its impact. The investment must be proportional to the performance gap — more fine-tuning for Portuguese than for French, because the gap is larger.

Design for the weakest language. If your tool operates across six EU markets, design the user experience for the language where the model performs worst. If the Portuguese output requires human review, build human review into the workflow for all markets — not as a correction mechanism for “lesser” languages, but as a quality assurance standard that respects all users equally.

The curse of multilinguality will persist as long as models share capacity across languages. Larger models reduce the curse but do not eliminate it. Better data helps but does not solve it. The problem is structural. The response must be structural too — not a single multilingual deployment, but a language-aware infrastructure that acknowledges, measures, and compensates for the inequality that the architecture produces.

Adding a language to a model costs every other language something. The cost is real. The cost is unequal. And until the people deploying these models understand that, every “multilingual” AI tool will be a promise kept in English and broken, by degrees, in everything else.

Written by
Bernardo
Cultural Translator

He ensures your Gizmo doesn’t just speak Spanish — it sounds Spanish. When a Nordic client’s team calls their Gizmo by a Finnish nickname, that’s his work showing.
