Bernardo May 12, 2026

The Model Speaks Fifteen Languages. It Sells in One.


The model speaks fifteen languages. It sells in one.

This is not a rhetorical flourish. It is the finding of a peer-reviewed benchmark published in February 2026 by four Appen researchers — Madison Van Doren, Casey Ford, Jennifer Barajas, and Cory Holland — under the title “Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs. Seven state-of-the-art models. Fifteen target language-locales. Five native-speaker raters per language. Thirteen thousand one hundred and twenty-five segment-level annotations. The data is precise. The conclusion is austere.

The top performer achieved 2.10 out of 3 on a four-point ordinal scale of full-text translation quality. Seven tenths of the maximum. The strongest commercial models on the market, translating a marketing email, produce text that native readers score as adequate at best, on a scale where 3 is the standard a published translation must meet.

Fluent. Not commercial. The distinction is the architecture of the entire piece.

What the Study Actually Did

The methodology is worth stating before the conclusions, because the methodology is the load-bearing element of any benchmark that claims to measure cultural competence.

The researchers fed five e-commerce marketing emails — adapted from authentic commercial campaigns containing puns, idioms, holiday references, brand voice, and culturally embedded concepts — to seven multilingual LLMs. The models tested were GPT-5, Claude Sonnet 3.7, Mistral Medium 3.1, DeepSeek V3.1, gpt-oss 120B, Meta’s Llama 4, and Cohere’s Aya Expanse 8B. The mix is deliberate: closed-weight and open-weight, frontier and accessible, American and European and Chinese.

Each model received the same prompt: “Translate the following email for use in [language] in [country/region].” The instruction is the instruction any EU SME would issue. No prompt engineering. No retrieval pipeline. No fine-tuning on the destination market. The raw task as a small business would actually perform it.

The fifteen target locales spanned typologies and continents: Afrikaans (ZA), Arabic (EG), Brazilian Portuguese (BR), Cantonese (HK), Czech (CZ), Dutch (NL), Hebrew (IL), Hindi (IN), Japanese (JP), Korean (KR), Mandarin (TW), Russian (KZ), Spanish (MX), Swahili (KE), and Urdu (PK). Each translation was then evaluated by five native speakers, resident in the relevant region, fluent in both English and the target language. Seventy-five raters in total. Each rater scored both the full translated email — on content fidelity, style fidelity, audience appropriateness, and overall quality — and predefined segments containing idioms, puns, holidays, and cultural concepts. The scale was 0 to 3. There was also an NA category. The NA category turned out to matter.

This is the most rigorous published evaluation of cultural localisation in machine translation to date. It is also the most damning.

The Top Tier

Mean overall full-text quality across all models and languages was 1.68 out of 3.

GPT-5 led at 2.10. Claude Sonnet 3.7 followed at 1.97. Mistral Medium 3.1 reached 1.84. These three formed what the authors describe as a “statistically indistinguishable top tier” — significantly better than the rest, statistically equivalent to each other. DeepSeek V3.1 came in at 1.72 and gpt-oss 120B at 1.60. Llama 4 scored 1.47. Aya Expanse 8B, the smallest of the seven and the only model in the set specifically designed for multilingual coverage, scored 1.09.

The ranking is striking but not the point. The point is the ceiling. The best multilingual LLM on the market, on a marketing email, scores seven-tenths of the way to a publishable translation. The worst scores just over a third.
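The arithmetic behind the ceiling can be checked in a few lines. This is a minimal sketch that uses only the per-model overall scores quoted in this article and assumes each model contributes equally to the mean; the variable names are illustrative.

```python
# Overall full-text quality per model, as quoted in this article (0 to 3 scale).
overall = {
    "GPT-5": 2.10,
    "Claude Sonnet 3.7": 1.97,
    "Mistral Medium 3.1": 1.84,
    "DeepSeek V3.1": 1.72,
    "gpt-oss 120B": 1.60,
    "Llama 4": 1.47,
    "Aya Expanse 8B": 1.09,
}
SCALE_MAX = 3.0  # top of the rating scale

# Unweighted mean across the seven models (assumes equal weight per model).
mean_score = sum(overall.values()) / len(overall)

# Each model's score as a share of the maximum.
share_of_max = {model: score / SCALE_MAX for model, score in overall.items()}

print(f"mean across models: {mean_score:.2f}")
for model, share in share_of_max.items():
    print(f"{model}: {share:.0%} of the maximum")
```

The unweighted mean of the seven quoted scores rounds to the 1.68 reported above, and the best and worst shares come out at roughly 70% and 36% of the scale.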

These are not failure cases produced by exotic languages. The set includes Spanish, Portuguese, Dutch, Japanese — languages with abundant training data and decades of machine translation history. The scores are not the residue of low-resource neglect. They are the limit of the technology, measured at the top end.

A marketing email is not a difficult genre. It is a short, structured, commercially valuable form. If contemporary LLMs cannot localise a marketing email well enough for a native speaker to rate above 2.10 out of 3, the implication for everything else — customer support replies, product descriptions, regulatory disclosures, internal communications — is direct.

Where the Models Break

The aggregate score conceals the structural finding. The structural finding is in the segment-level results.

When the raters scored the four categories of culturally marked language separately, the pattern was sharp. Holiday references averaged 2.20 out of 3. Cultural concepts averaged 2.19. Idioms scored 1.65. Puns scored 1.45.

The gap between holidays and idioms is 0.55 points. That is large. The gap between holidays and puns is 0.75 points. That is structural. Holidays and culturally embedded concepts are nouns. Idioms and puns are figures of speech. Models translate nouns. Models fail at figures.
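The size of each gap follows directly from the four segment-level means quoted above. A short sketch, using only those four figures; the grouping into "nouns" and "figures" mirrors this article's framing and is not a category taken from the paper.

```python
# Segment-level means as quoted in this article (0 to 3 scale).
segment_means = {
    "holidays": 2.20,
    "cultural concepts": 2.19,
    "idioms": 1.65,
    "puns": 1.45,
}

# Gaps measured from the best-scoring category.
gap_holidays_idioms = segment_means["holidays"] - segment_means["idioms"]
gap_holidays_puns = segment_means["holidays"] - segment_means["puns"]

# This article's grouping: reference-like items versus figures of speech.
nouns = (segment_means["holidays"] + segment_means["cultural concepts"]) / 2
figures = (segment_means["idioms"] + segment_means["puns"]) / 2

print(f"holidays vs idioms: {gap_holidays_idioms:.2f}")
print(f"holidays vs puns:   {gap_holidays_puns:.2f}")
print(f"nouns {nouns:.3f} vs figures {figures:.3f}")
```

On these figures the distance from holidays to idioms is a little over half a point and the distance to puns is three quarters of a point, on a scale that only spans three.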

The reason is architectural, not anecdotal. A holiday — Valentine’s Day, Singles Day, Diwali — has a target-language equivalent or a known cultural mapping. The model retrieves the mapping. The retrieval succeeds because the mapping is documented in the training data. A pun is the inverse. A pun depends on the sound, the form, or the cultural resonance of a specific word in a specific language. It does not have an equivalent. It has to be reconstructed in the target language using different material. Reconstruction requires invention. Models do not invent. Models retrieve.

The authors quantify the consequence with a specific metric: omission rate. Idioms were the category most frequently rated NA, meaning the model declined to translate them at all and left the English original embedded in the otherwise translated text. Aya Expanse 8B exhibited the highest omission rates and the lowest quality scores when it did translate. Conservative behaviour did not protect the smaller model. It compounded its weakness.

When a model encounters a pun it cannot reconstruct, it has three options. Translate it literally, producing nonsense. Translate it loosely, producing a different joke or no joke at all. Or refuse and leave the English. All three options are visible to the native reader. All three signal that the text was generated, not written. All three reduce the probability that the reader buys what the email is selling.

The Pun That Names the Paper

The paper’s title is a clue. “Be My Cheese?” is the literal translation of a cheese-themed Valentine’s Day pun, “Will you brie mine?”, that appeared in one of the source emails. The pun depends entirely on the near-homophony between “brie” and “be” in English. In the target languages, that sound-alike does not exist. The pun cannot survive translation. It must be reconstructed.

What the models produced was not reconstruction. It was literal rendering of the words “brie” and “mine,” producing text that referenced cheese but contained no joke, no rhythm, and no Valentine’s Day. The marketing function — emotional connection to a seasonal moment — collapsed into a sentence about dairy.

This is the structural finding of the paper, illustrated. The model translated the words. The model did not translate the function. The function was the entire reason the words existed.

The Default Is American

The authors do not phrase the next observation this way. The data does.

When a model trained on internet text encounters a register, a tone, or a relational stance it does not recognise, it defaults to the most common pattern in its training distribution. The most common pattern in the training distribution is American English commercial writing. The result is text that is grammatically Portuguese, Dutch, or Japanese, and pragmatically Californian.

The formality calibration is uniform. The tone is informal-bordering-on-familiar. The address form is the egalitarian one. The relational acknowledgment is brief. The call to action is direct. This works in American marketing because American marketing is the corpus the models were optimised on. It does not work in Munich, where commercial communication in B2C contexts uses “Sie” until a relationship is established. It does not work in Milan, where commercial warmth precedes the transaction by an investment of social ritual. It does not work in Malmö, where the consensus-seeking moderation of Swedish prose makes the assertive American close read as desperate.

The model speaks fifteen languages. The model communicates in one culture. The fifteen are the surface. The one is the architecture.

This is not a failure of the seven models tested. It is a structural feature of any model trained predominantly on web text. The web is not a neutral corpus. The web is a culture. The culture is American English with a global distribution network. Every multilingual model inherits the culture along with the languages.

The Performance Table

Model	Overall quality	Audience appropriateness	Style fidelity	Content fidelity
GPT-5	2.10	2.38	2.23	2.23
Claude Sonnet 3.7	1.97	2.25	2.08	2.10
Mistral Medium 3.1	1.84	2.19	2.04	1.92
DeepSeek V3.1	1.72	2.05	1.98	1.77
gpt-oss 120B	1.60	1.94	1.83	1.72
Llama 4	1.47	1.81	1.72	1.59
Aya Expanse 8B	1.09	1.55	1.41	1.21

The columns deserve a closer reading. For every model, the highest sub-score is “audience appropriateness.” For five of the seven, the lowest is “content fidelity”; GPT-5 ties content with style, and Claude Sonnet 3.7 scores style two hundredths below content. This is the inverse of the intuition. One would expect a translator to be most reliable at preserving content and least reliable at matching the audience. The data shows the opposite. The models produce text that sounds appropriate to the target audience but distorts the source. The fluency is performative. The accuracy is unstable.
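The column pattern can be verified directly against the table. A small sketch using the sub-scores exactly as printed above; the dictionary layout and variable names are illustrative.

```python
# Sub-scores per model, copied from the table above (0 to 3 scale).
scores = {
    "GPT-5": {"audience": 2.38, "style": 2.23, "content": 2.23},
    "Claude Sonnet 3.7": {"audience": 2.25, "style": 2.08, "content": 2.10},
    "Mistral Medium 3.1": {"audience": 2.19, "style": 2.04, "content": 1.92},
    "DeepSeek V3.1": {"audience": 2.05, "style": 1.98, "content": 1.77},
    "gpt-oss 120B": {"audience": 1.94, "style": 1.83, "content": 1.72},
    "Llama 4": {"audience": 1.81, "style": 1.72, "content": 1.59},
    "Aya Expanse 8B": {"audience": 1.55, "style": 1.41, "content": 1.21},
}

highest = {}
lowest = {}
for model, subs in scores.items():
    # Name of the single highest sub-score.
    highest[model] = max(subs, key=subs.get)
    # All sub-scores sharing the minimum value, so ties stay visible.
    floor = min(subs.values())
    lowest[model] = [name for name, value in subs.items() if value == floor]

for model in scores:
    print(f"{model}: highest={highest[model]}, lowest={'/'.join(lowest[model])}")
```

Audience appropriateness comes out highest for all seven models. Content fidelity is the sole lowest score for five of them, ties with style for GPT-5, and sits marginally above style for Claude Sonnet 3.7.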

A marketing email translated by a top-tier model in 2026 sounds right and says something subtly different from what the brand intended. This is the most expensive kind of failure. It is invisible to the deploying company, which evaluated the tool in English. It is visible to the customer, who notices that the text feels generated. The gap between feel and intent is where commercial conversion is lost.

The Cross-Cultural Test

Consider the same marketing email in three contexts.

In Brazil, the email opens with relational warmth before any commercial content. The reader expects acknowledgment before transaction. A model that opens with the offer signals foreignness. The reader continues to read, but the trust gradient has shifted.

In Germany, the email opens with the offer and uses “Sie.” Formality precedes warmth. A model that opens with “Hey, Marta!” — a default it inherits from American email templates — has committed a register transgression in the first three words. The reader does not consciously catalogue the error. The reader simply experiences the sender as a stranger overreaching.

In Japan, the email opens with seasonal acknowledgment, a phrase indicating awareness of the recipient’s likely circumstances, and only then introduces the commercial frame. The structure is non-negotiable for B2C communication that aims at long-term customer relationships. A model that skips the opening produces text that is technically correct and socially clumsy. The clumsiness costs the conversion.

Three cultures. Three different opening architectures. The model uses one, the American one, in all three. The text is fluent in three languages and culturally illegible in all three.

This is what the Appen study measures, expressed at the level where the EU SME meets the consequence.

What the Numbers Mean for Milan, Munich, Malmö

An EU SME selling across the single market is the audience that this finding most directly concerns.

The arithmetic is simple. A model that scores 2.10 out of 3 on a marketing email produces text that requires human review before it can be sent. The required review is not proofreading. It is cultural editing. The Italian copy needs to be checked for warmth calibration. The German copy needs to be checked for register. The Swedish copy needs to be checked for the absence of consensus-seeking moderation. The Dutch copy needs to be checked for whether the direct close lands as confident or aggressive.

Each of these reviews requires a native speaker with brand voice fluency and cultural-pragmatic competence. The cost of these reviews is not factored into the per-token economics that made AI translation attractive in the first place. The vendor’s pricing model assumes the output is publishable. The Appen data shows it is not.

The EU SME has three options.

Accept the cultural distortion. Send the model output unedited and absorb the conversion penalty silently. This is the most common choice because the penalty is invisible — the customers who did not buy do not write back.

Hire native reviewers across markets. This restores quality at the cost of the operational simplicity that motivated the AI deployment. The economics shift. The investment may still pay off relative to full human translation, but only if the company measures the per-market conversion impact, which most do not.

Build cultural infrastructure into the prompt and the workflow. This is the path Bluewaves takes with every multilingual Gizmo. The cultural context is structured before the language is selected. The model is given the relational stance, the formality register, the directness calibration, and the temporal orientation appropriate to the market. The language is the last decision. The model is constrained — explicitly — to produce text that respects the architecture.

The third option does not eliminate the gap the Appen study measures. It compresses it. A constrained prompt, a culture-specific system message, and a per-market quality check produce output that scores closer to publishable than the raw 2.10. Closer is not enough for high-stakes communication. It is enough for most communication, most of the time, at a fraction of the cost of human translation.
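One way to picture the third option is as a small data structure that is filled in before any language is chosen. The following is a minimal sketch under stated assumptions: `MarketProfile` and `build_system_prompt` are hypothetical names invented for this illustration, not Bluewaves code and not part of any vendor API, and the German values are illustrative, drawn from the register described earlier in this piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MarketProfile:
    """Hypothetical per-market cultural brief; every field name is illustrative."""
    locale: str
    language: str
    relational_stance: str
    formality_register: str
    directness: str
    temporal_orientation: str
    preserve_functions: tuple[str, ...] = ()


def build_system_prompt(profile: MarketProfile) -> str:
    """Assemble a system message that states culture first and language last."""
    lines = [
        f"You are localising a marketing email for {profile.locale}.",
        "Apply these cultural constraints before choosing any wording:",
        f"- Relational stance: {profile.relational_stance}",
        f"- Formality register: {profile.formality_register}",
        f"- Directness: {profile.directness}",
        f"- Temporal orientation: {profile.temporal_orientation}",
    ]
    if profile.preserve_functions:
        lines.append("Preserve the function, not the literal form, of:")
        lines.extend(f"  * {item}" for item in profile.preserve_functions)
    # The language instruction comes last, after the cultural frame is set.
    lines.append(f"Write the final text in {profile.language}.")
    return "\n".join(lines)


# Illustrative values only; a native reviewer would own the real brief.
de_profile = MarketProfile(
    locale="Germany (B2C, recipient has no prior relationship with the brand)",
    language="German",
    relational_stance="offer first, warmth after the relationship exists",
    formality_register='address the reader as "Sie"',
    directness="clear and factual, no familiar greeting",
    temporal_orientation="state deadlines plainly, without pressure",
    preserve_functions=("seasonal puns", "brand voice"),
)

prompt = build_system_prompt(de_profile)
print(prompt)
```

The design choice worth noting is the ordering: the language instruction is the final line of the message, so the constraints that precede it frame the translation rather than decorate it. Whether a given model honours such constraints is an empirical question that still needs a per-market check by a native reader.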

The condition is that someone in the deployment loop knows what to constrain. The model does not know. The vendor does not know. The procurement department certainly does not know. Cultural-pragmatic competence is not a setting in any AI translation product on the market. It is a discipline the deploying company must bring.

What Cultural Competence Would Require

The Appen authors point toward the requirement implicitly. The remedy is in the gap their data exposes.

A culturally competent translation model would need to know — and apply — five things that no model currently knows.

The target audience’s cultural baseline. Not the language. The culture. Brazilian Portuguese and European Portuguese are the same language and two different commercial cultures. The model must distinguish them, not as a locale code but as a different architecture of trust.

The formality register appropriate to the channel and the relationship. A marketing email from an unknown brand in Germany requires “Sie.” The same email from a brand the recipient has purchased from before may shift to “du” if the brand voice has established that register. The model must read the relationship, not the prompt.

The directness calibration appropriate to the message and the culture. A Dutch reader expects directness. A Japanese reader expects indirection. A model that uses uniform directness produces hesitant Dutch and intrusive Japanese in the same generation cycle. Both are wrong. Both reduce conversion. Both pass token-level evaluation.

The temporal orientation of the offer. Limited-time offers landing in a monochronic culture activate urgency. Limited-time offers landing in a polychronic culture activate suspicion. The same call to action requires different framing in different cultures. The model must know which framing applies.

The cultural mapping of figurative language. Not the literal substitution. The functional equivalent. A Valentine’s Day pun in English needs to become a Valentine’s Day pun in Italian — or, if the form does not survive, a different rhetorical move that performs the same emotional function. The model must distinguish form from function. Current models do not.

These five capabilities are not language capabilities. They are cultural capabilities. The training data does not contain them — because they are rarely made explicit in text. No one writes “I am now using the formal register because I do not yet know this person.” The register is simply used. The model has to infer the rule from instances. The inference is weak when the patterns are implicit and culturally variable.

Cultural competence in AI models will require explicit cultural annotation, cultural instruction tuning, or retrieval pipelines that access cultural knowledge bases. These approaches exist in research. They do not exist in any of the seven models the Appen study tested.

The Principle

A model trained on internet text inherits the internet’s culture. The internet’s culture is American English with a global distribution network. Fifteen languages of output do not change the architecture. Fifteen languages of output expose the architecture.

Fluency is table stakes. Every major model achieves it. The Appen scores demonstrate that fluency is no longer the differentiator.

Cultural competence is the differentiator. The 2.10 ceiling is the measure of how far the best models are from that differentiator. The 0.55-point gap between holiday references and idioms is the shape of the failure. The American default that emerges in every output is the source of the gap.

For an EU SME, the implication is direct. The marketing email translated by GPT-5 will not sell as well as the same email written by a Milanese copywriter. The gap is not catastrophic. The gap is consistent. And the gap is the difference between a market entry that works and a market entry that quietly underperforms for years before anyone diagnoses the cause.

At Bluewaves, no multilingual Gizmo ships without an explicit cultural architecture: the formality register, the directness calibration, the relational stance, and the rhetorical functions that the model must preserve, named in the system prompt and tested per market. The model still produces the output. The architecture constrains what the output is allowed to be. The constraint is where cultural competence enters the system, because the model cannot supply it.

The seven models tested by Appen do not lack data. They lack culture. The text is fluent because the words are present. The text does not sell because the culture is absent.

Speaking is not selling. Fluency is not competence. Translation is not localisation.

The model speaks fifteen languages. It sells in one. Until the deploying company supplies what the model cannot, that ratio is the ceiling.

Written by

Bernardo

Cultural Translator

He ensures your Gizmo doesn’t just speak Spanish — it sounds Spanish. When a Nordic client’s team calls their Gizmo by a Finnish nickname, that’s his work showing.
