We’ll never reach perfect machine translations, and that’s OK
Ever wondered how accurate AI-powered translation really is? This article dives into a paper I wrote (with some sensitive information redacted) while leading Google’s global rollouts of machine translation across various products and markets. The project delivered millions of dollars in cost savings - but not without first overcoming stakeholder resistance, customer service agents’ attitudes towards AI, and end users’ misconceptions about machine translation. The artifact I’ve included goes over the NLP technology behind tools like Google Translate and Microsoft Translator, exploring their strengths, their limitations, and what to consider for real-world use cases like customer support.
Background: Live machine translation is a form of generative AI, as the target language must be generated by a decoder from the embeddings of the source language. Companies like GitHub and Google use real-time machine translation in live support channels to help agents support more users in global markets. As part of the product incubation PM team, I led the global chat launches and rollouts to productionize and scale the ML models.
What this artifact was: The artifact this is based on provided a deeper dive, written specifically for non-technical readers, to help business and operations stakeholders understand machine translation quality for production use cases.
Why this artifact mattered: Having this deeper dive was what won over the most risk-averse stakeholders when quality for certain languages was frankly unknown at the time. By demonstrating that I shared their concerns, the artifact helped manage stakeholder expectations (aka alleviate their fears) about the unpredictability inherent in generative AI in production. Analyzing why machine translation models still make errors also led me to realize that productionizing this particular GenAI-powered product would require not only cutting-edge models, but also AI guardrails (which my engineering team ended up building), human-in-the-loop processes, and continuous improvement post-launch - all standard today, but far less prevalent back when I led this launch!
How you can use this: You can use this to supplement PRDs and/or save precious time in meetings to answer repetitive questions. This artifact goes hand-in-hand with a concrete risk mitigation or contingency plan. The contingency plan should outline specific steps to address potential quality issues at launch, such as proactive monitoring, real-time human intervention protocols, and clear communication channels for user feedback. This will further reinforce stakeholder confidence and ensure a smooth product rollout.
Understanding Translation Quality
The following is a brief overview and analysis of existing information about the translation quality of the NLP technology behind systems like Microsoft Translator and Google Translate. It is meant to inform stakeholders about what to expect for translation quality when using ML models in production use cases such as live 1:1 chat support.
Breaking Down Translation Errors:
Types of Bias in Machine Translation
Drivers of Machine Translation Quality in Production: Channel, Human-Error, Languages
Human-caused translation errors
Reasons for Shortcomings in Specific Languages
Why Existing Metrics Don’t Work for Enterprise Use Cases
Low-resource language translation quality with multilingual models
At a Glance
How far are we from perfect translations by machines?
Machine translation does not aim for “perfect” translations, as a “perfect” translation requires always understanding cultural nuances and multi-word expressions like idioms, which are unique to each culture.
In side-by-side evaluations of human vs. machine translation, human translators do not achieve a “perfect” score of 6/6 either.
Rather than striving for “perfect” translation, which even humans have not achieved, machine translation aims to provide translations that are as comprehensible as possible, preserve the original meaning as much as possible, and ideally have correct spelling, punctuation, and grammar.
What risks does launching live machine translation (i.e., in customer support channels) pose for our company, and how severe are they?
Errors vary in severity and frequency, generally in the following categories: Punctuation, Grammar, Spelling, Cultural Taboo, Professionalism.
Granular translation quality metrics will be assessed by bilingual experts who review audited messages from live traffic.
Evaluations show messages that contain the above errors still generally pass the Comprehensibility threshold for most languages. For all languages, we have outlined a rigorous testing plan & HITL process <insert links here> for ensuring quality at launch.
Why is translation better for some languages than others and what are we doing about it?
In short, availability of web text in the target language varies.
English text on the web, which goes into training machine translation models, exists in the petabytes. Low-resource languages, in contrast, tend to have orders of magnitude less data available.
Achieving consistent language quality across languages in 1 underlying model is an active area of research today. In the meantime, we have implemented AI guardrails <insert links here> to ensure quality at launch.
Deep Dive
Breaking Down Translation Errors:
There are five types of sentence-level translation errors. Note that BLEU, a long-standing yet outdated standard machine translation quality metric, does not identify which of the following errors are made (see the sketch after this list):
Named entity errors
Numerical errors
Meaning errors
Insertion of content
Content missing
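To make this limitation concrete, here is a minimal sketch using the open-source sacrebleu package with invented example sentences (both are assumptions for illustration, not project data). Two hypotheses with very different error types, a named entity error and a numerical error, receive nearly identical sentence-level BLEU scores:

```python
# Minimal sketch: BLEU scores hypotheses by n-gram overlap with a reference,
# so it cannot tell *which kind* of error was made. Assumes `pip install sacrebleu`;
# the sentences are invented for illustration only.
import sacrebleu

reference = "Your refund of 50 dollars was issued by PayPal on Friday."

hypotheses = {
    "named entity error": "Your refund of 50 dollars was issued by Venmo on Friday.",
    "numerical error": "Your refund of 15 dollars was issued by PayPal on Friday.",
    "content missing": "Your refund was issued on Friday.",
}

for error_type, hyp in hypotheses.items():
    score = sacrebleu.sentence_bleu(hyp, [reference]).score
    print(f"{error_type:20s} BLEU = {score:5.1f}")

# The first two hypotheses score almost identically, even though a wrong payment
# provider and a wrong refund amount carry very different risk in support.
```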
Expanding these further, common translation errors include:
Vocabulary / Terminology
Literal rendition of common idioms
Formal / Informal style
Overly long sentences
Single word errors, errors of relation, structural/informational errors
Incorrect verb forms, tense
Translating a word the same way regardless of context
Grammar and syntax errors
Punctuation errors
Omissions / additions
Compound words translated as individual words
Machine neologisms
Human analysis of the [redacted] translation model used for webpage translation <link redacted> showed the following breakdown of errors that human evaluators found:
Lexical Choice/Multi-word expression error (X%)
Lexical Choice/Other wrong word (Y%)
Lexical Choice/Named entity error (Z%)
Lexical Choice/Archaic word choice (T%)
Reordering Error (W%)
How does one interpret this for product launch readiness? X% of the errors made by the machine were on multi-word expressions – these errors are often caused by idioms that exist only in the source language. The question now is: how risky are these errors in a customer support conversation? How often do agents use idioms, such as “let’s get the ball rolling,” with customers who perhaps want a refund?
To assess how severe these likely and expected mistakes might be, one could analyze existing chat transcripts to find the frequency of idioms and multi-word expressions; we understand this may differ per product, per language, and per agent. Still, conventional wisdom and a quick pass through chat case transcripts indicate that the rate of idioms and difficult-to-translate words in English (what the human agent will say) is quite low.
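Below is a minimal sketch of such a transcript scan. The transcripts.txt file (one agent message per line) and the hand-picked idiom list are hypothetical placeholders for whatever transcript export and phrase inventory a team actually has:

```python
# Sketch: estimate how often agents use idioms / multi-word expressions in chat.
# `transcripts.txt` (one agent message per line) and the idiom list below are
# illustrative placeholders, not real project assets.
from collections import Counter

IDIOMS = [
    "get the ball rolling",
    "touch base",
    "on the same page",
    "keep you in the loop",
    "bear with me",
]

def idiom_counts(path: str) -> tuple[Counter, int]:
    counts, total_messages = Counter(), 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            total_messages += 1
            text = line.lower()
            for idiom in IDIOMS:
                if idiom in text:
                    counts[idiom] += 1
    return counts, total_messages

if __name__ == "__main__":
    counts, total = idiom_counts("transcripts.txt")
    print(f"{sum(counts.values())} idiom hits across {total} agent messages")
    for idiom, n in counts.most_common():
        print(f"  {idiom!r}: {n}")
```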
Types of Bias in Machine Translation
Gender Bias
The field of machine translation has made significant progress in gender-related translation issues, but this is still an active area of research. While we don’t anticipate this being a severe problem in Chat conversations between users and agents, it’s important to highlight that MT trained on a large text corpus scraped off the web has bias(es), as we’ll go over in the next section.
Informality Bias
Often, in languages other than English, the formal “you” is preferred for business communications. If using <Model Names>, it is likely the translation will switch between the formal and informal “you” within the same conversation. This is a drawback for our company when trying to uphold brand guidelines and professionalism in agent communications, which matters more in some regions and markets than in others.
Not only do we face the challenge of accurately translating a short phrase that may be missing context, but we must also be aware of the shortcomings of <Model Name> when used in business communications. Other times, short phrases used in a business context can mean entirely different things than in a consumer setting, as in the example below:
[Image Redacted]
Above, “Get a Quote” was translated to “Speak a proverb” rather than the intended commercial meaning
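One way to limit the impact of register switches like the ones above is a post-translation guardrail that flags informal pronouns for review before a message is sent. Below is a minimal sketch for Spanish only; the pronoun list and the flag-for-review behavior are illustrative assumptions, not the guardrails my team actually shipped:

```python
# Sketch: flag Spanish translations that slip into the informal register ("tú")
# so a human can review them before sending. The token list is a simplistic,
# illustrative assumption; a production guardrail would need morphology-aware
# checks per language.
import re

INFORMAL_ES = {"tú", "te", "ti", "contigo", "tu", "tus", "tuyo", "tuya"}

def uses_informal_spanish(translated_text: str) -> bool:
    tokens = re.findall(r"\w+", translated_text.lower())
    return any(token in INFORMAL_ES for token in tokens)

messages = [
    "¿Podría usted confirmar su dirección de correo?",  # formal: OK to send
    "¿Puedes confirmar tu dirección de correo?",        # informal: hold for review
]

for msg in messages:
    status = "REVIEW (informal register)" if uses_informal_spanish(msg) else "ok"
    print(f"{status}: {msg}")
```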
[Section shortened]
Drivers of Machine Translation Quality in Production: Channel, Human-Error, Languages
How Channel Impacts Quality
Translation quality tends to be better when the model is given more context clues and longer inputs. Email exchanges in support, which typically consist of longer-form text, give machine translation more context clues and therefore higher quality.
We expect email translation quality to trend higher than chat translation quality when using <Model Name> in the backend.
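To illustrate why context matters, the sketch below translates the same short chat reply in isolation and with the preceding turn prepended. It uses the public Helsinki-NLP/opus-mt-en-es model from the Hugging Face transformers library purely as a stand-in (assuming transformers and sentencepiece are installed); our production system uses <Model Name>, and prepending context this way is only an illustrative technique, not our pipeline:

```python
# Sketch: short chat messages give the model few context clues. Prepending the
# previous turn is one cheap way to supply context. Uses the public
# Helsinki-NLP/opus-mt-en-es checkpoint as a stand-in, not our production model.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

def translate(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

previous_turn = "I was charged twice for my order."
reply = "I can issue a refund for the duplicate charge."

print("No context:  ", translate([reply])[0])
print("With context:", translate([f"{previous_turn} {reply}"])[0])
```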
Human-caused translation errors
Finally, another factor that may further decrease translation quality, or its comprehensibility to the agent or user, is human error in the source message. This includes spelling errors and typos that form valid words in, say, English but are not what the user or agent meant to write. For example, in the following image, the user types “shows and error” when they meant to say “shows an error”.
[Image Redacted]
A sample chat transcript from an example support ticket
A Spanish translation of this sentence renders “and” as “y”, whereas the intended word was “an”, which should translate to “un”. Technically the translation is accurate: “and” (EN) does translate to “y” (ES). However, the reader on the other end may find it more jarring, because mistaking “y” for “un” is far less likely in Spanish than typing “and” instead of “an” is in English, where the two words look similar and this is a frequent type of misspelling.
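This is also why a plain dictionary-based spell check cannot protect the translation step: every token in the message is a valid English word. The minimal sketch below (with a tiny word list standing in for a full dictionary) shows the typo slipping through untouched, which is part of why human-in-the-loop review remains important:

```python
# Sketch: why valid-word typos slip through. A dictionary-only check passes
# "shows and error" because every token is a real English word, so the typo
# reaches the translation model untouched. The tiny word list stands in for a
# full dictionary.
ENGLISH_WORDS = {
    "the", "app", "shows", "and", "an", "error", "when", "i", "log", "in",
}

def dictionary_check(message: str) -> list[str]:
    """Return tokens a naive spell checker would flag (none, in this case)."""
    tokens = [word.strip(".,!?").lower() for word in message.split()]
    return [t for t in tokens if t and t not in ENGLISH_WORDS]

message = "The app shows and error when I log in."
print(dictionary_check(message))  # [] -- nothing flagged, the typo survives
```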
Reasons for Shortcomings in Specific Languages
The simplest and most obvious reason for shortcomings is a lack of available training data. As shown below, although X TB of Hindi text is available via web scraping, that is nowhere near as many examples as exist for English (Y PB) or Japanese (Z PB).
Other sources state that Vietnamese, Swahili, Hindi, Thai, Urdu, Hawaiian, Yoruba, Sindhi, Bengali, and others are spoken by large populations but have far less written text on the web (the primary training data for ML models). On the other hand, languages such as German, English, Chinese, Spanish, French, Japanese, and other largely European and Western languages are high-resource.
Some major improvements to multilingual models have happened language by language (or in batched languages) – for instance, when correcting gendered translation errors (“he is a doctor” vs “she is a doctor”), [section redacted].
Even in the Cloud Translation services offered by companies like Microsoft, AWS, and Google, certain capabilities are limited to certain languages. Enterprises and SMBs can use machine translation to detect more than one hundred languages, from Afrikaans to Zulu; however, custom models can only be built for Y language pairs - a significantly smaller subset than what’s available to consumers.
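For reference, here is a minimal sketch of querying one such service, the Google Cloud Translation Basic (v2) API, for its supported languages and for language detection. It assumes the google-cloud-translate client library and valid credentials, and is independent of the production setup described elsewhere in this artifact:

```python
# Sketch: list supported languages and detect a message's language with the
# Google Cloud Translation Basic (v2) client. Assumes `pip install
# google-cloud-translate` and GOOGLE_APPLICATION_CREDENTIALS configured.
from google.cloud import translate_v2 as translate

client = translate.Client()

languages = client.get_languages()  # list of {"language", "name"} dicts
print(f"{len(languages)} supported languages, e.g. {languages[0]}")

detection = client.detect_language("Habari, nahitaji msaada na oda yangu.")
print(f"Detected: {detection['language']} (confidence {detection['confidence']:.2f})")
```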
[Image Redacted]
The above figure shows a much lower volume available for <language> corpus used as training data
In the illustration below, the amount of data available for low-resource vs. high-resource languages varies by orders of magnitude. Accordingly, translation quality suffers for low-resource languages not just in chat but across the translation system used in production.
[Image Redacted]
The Machine Translation Landscape: Quality (BLEU) vs Language Pair Resource Size (source redacted)
While research is still working toward a single multilingual model that covers ZZ languages, domains, dialects, registers, styles, etc., we can anticipate weaker translation quality for low-resource languages.
For <Company>’s support channels, one key dimension to consider is volume. This language prioritization document <insert link> shows contact volume per language – note the highest-volume languages tend to be high-resource languages, for which <Model Name> may suffice.
Why Existing Metrics Don’t Work for Enterprise Use Cases
Consumer-facing machine translation apps like <Model Name> were trained on large amounts of text scraped off the web because of its availability and low cost. However, this poses problems for business communications, which require more formal language.
There is a discrepancy between the quality metrics used to evaluate translation ML for consumer tools and those needed for business communications. BLEU is an automatically computed score that serves as the primary accuracy metric in research settings. The score is based on n-gram similarity to a reference (the “correct” human-generated translation), which can fail to capture syntactic and semantic equivalence.
In business communications, BLEU is not enough. Further quality evaluation must be done, often manually by human agents trained to answer questions about Google products, in order to ensure readiness for launch in support. This is typically called Human SxS (side-by-side) evaluation. Research <insert link> shows a discrepancy between human-evaluated quality and machine-evaluated quality (BLEU). Human SxS, unfortunately, is expensive and takes much longer to assess compared to automatically calculated metrics like BLEU or BLEURT.
In addition, recent research reveals that metrics like BLEU, spBLEU, or chrF correlate poorly with human ratings (source). Therefore <Model Name> or <Model Name>, though achieving best-in-class BLEU scores published in research papers shared widely at conferences, can still yield less-than-desirable translation quality in enterprise use cases.
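As a toy illustration of that discrepancy (using sacrebleu and invented sentences, not project data), a candidate that copies the reference’s wording but flips its meaning outscores a faithful paraphrase on surface-overlap metrics, which is exactly the kind of failure a human rater would catch immediately:

```python
# Sketch: surface-overlap metrics can rank a meaning-breaking translation above
# a correct paraphrase. Assumes `pip install sacrebleu`; sentences are invented.
import sacrebleu

reference = "Unfortunately, we cannot refund this purchase."

candidates = {
    "meaning flipped, high overlap": "Unfortunately, we can refund this purchase.",
    "correct paraphrase, low overlap": "We are sorry, but a refund is not possible for this order.",
}

for label, candidate in candidates.items():
    bleu = sacrebleu.sentence_bleu(candidate, [reference]).score
    chrf = sacrebleu.sentence_chrf(candidate, [reference]).score
    print(f"{label:32s} BLEU={bleu:5.1f}  chrF={chrf:5.1f}")

# A human rater would mark the first candidate as a severe error, yet it scores
# higher on these surface metrics than the faithful paraphrase.
```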
A better alternative, put forth by Google Research in 2022, might be Neural Metrics. However, these metrics have not yet been widely adopted across the industry.
What’s Next in Research
Low-resource language translation quality with multilingual models
As further research (link redacted) in the LLM space has shown over the past few years, generalist models are more valuable than specialists for translation. A single XB-parameter multilingual model outperforms ZZ individual models of ~YYYM parameters each. This means even premium model training on low-resource languages may still not yield translation quality as accurate as simply training a much larger multilingual base model.
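As a small, public stand-in for such a multilingual model, the sketch below uses the facebook/m2m100_418M checkpoint from the Hugging Face transformers library to translate into English from two different source languages with one set of weights. It is illustrative only and far smaller than the models discussed above; it assumes transformers and sentencepiece are installed:

```python
# Sketch: one multilingual checkpoint serving multiple language pairs. Uses the
# public facebook/m2m100_418M model as a small stand-in for the much larger
# multilingual models referenced above.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

def translate_to_english(text: str, src_lang: str) -> str:
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **encoded, forced_bos_token_id=tokenizer.get_lang_id("en")
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate_to_english("मेरा रिफंड कहाँ है?", src_lang="hi"))        # Hindi
print(translate_to_english("¿Dónde está mi reembolso?", src_lang="es"))  # Spanish
```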
In 2021, a single ZZZB-parameter model handled YYY languages and was deployed to XX language pairs via <Model Name> in production.
[section shortened]