Evaluating machine translation for City use cases
This summer, I received the Siegel PiTech PhD Impact Fellowship to work with NYC’s [Office of Technology and Innovation](https://www.nyc.gov/content/oti/pages/) (OTI) on evaluating machine translation for City use cases. Part of the final deliverables is the following blog post, which will go up on PiTech’s website eventually, but I’ve included it here in the meantime.
The Challenge
New York City is home to speakers of over 800 languages. According to data from the American Community Survey, as of 2023, about 30% of New Yorkers speak a language other than English at home, and around 1.8 million have limited English proficiency (LEP). To ensure that New Yorkers of varying English abilities have access to crucial government information, NYC’s Local Law 30 of 2017 requires that covered city agencies “appoint a language access coordinator, develop language access implementation plans, provide telephonic interpretation in at least 100 languages, translate their most commonly distributed documents into the 10 designated citywide languages, and post signage about the availability of free interpretation services, among other requirements.”
However, there is no standard across the board for evaluating translation vendors, which leads to fluctuating translation quality. Moreover, given recent advancements in large language models, translation vendors have begun to more frequently offer services that incorporate machine translation into the translation pipeline.
Even though a wealth of research exists on the subject of machine translation quality evaluation (MTQE), academic research does not necessarily translate well (pun intended) into practical settings such as the City’s use cases. Overly technical metrics may also be too difficult for language access coordinators and in-house linguists to implement.
As a PiTech PhD Impact Fellow, I helped OTI develop a framework for assessing machine translation. I researched both qualitative and quantitative methods and recommended best practices to guide agencies’ evaluation of machine translation vendors.
My Project
As a translator myself, I was excited to work on this project! Human translators tend to evaluate translations in a much more nuanced way compared to computer scientists. Since any framework that I recommend will be used by human linguists and translators, it should resonate with how they naturally tend to conduct evaluation. As such, I started off with a literature review of qualitative MTQE methods. These range from the classic ideas of dynamic and formal equivalence, to the Chinese 信达雅 (“fidelity, expressiveness, elegance”) triad with which I am familiar, to evaluation checklists used in translator training. I also reviewed translation evaluation guidelines used in other cities.
To make sure that my framework is scalable and not overly subjective, I also referenced quantitative methods of translation evaluation. Quantitative methods can be further broken down into manual quantitative methods (such as human-scored axes and rubrics) and automatic quantitative methods (such as word error rate, BLEU, or LLM-based methods).
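To make the automatic metrics mentioned above a little more concrete, here is a minimal Python sketch (not part of the project deliverables) that scores an invented machine translation against an invented human reference. It assumes the third-party `sacrebleu` and `jiwer` packages are installed; the example sentences are hypothetical.

```python
# Minimal sketch of two automatic MT metrics on invented example sentences.
# Assumes the third-party packages sacrebleu and jiwer are installed.
import sacrebleu
import jiwer

references = ["The office is open Monday through Friday from 9 a.m. to 5 p.m."]
hypotheses = ["The office opens Monday to Friday, 9 a.m. until 5 p.m."]

# Corpus-level BLEU: n-gram overlap between hypotheses and references (higher is better).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.1f}")

# Word error rate: word-level edit distance from the reference (lower is better).
wer = jiwer.wer(references, hypotheses)
print(f"WER: {wer:.2f}")
```

Automatic scores like these are cheap to compute at scale, but as the rubric below reflects, they only capture part of what human linguists care about.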
In the end, I developed an evaluation rubric with five categories:
- Content accuracy: whether all factual information from the original text is preserved in the translation
- Tone & formality: whether the translation matches the original text’s tone and level of formality
- Readability: whether the translation flows naturally and is easy to read
- Presentation: whether formatting, document structure, and markdown elements are preserved
- Respect: whether respectful language is used consistently throughout the translation
Categories will be scored on a 4-point scale:
- Good: no errors
- Needs improvement: there are a few errors, but they can be easily identified and corrected
- Poor: errors are too numerous or challenging to correct
- Catastrophic: one or more errors would actively create harm, either for LEP users or for the City’s reputation
I recommend that agencies consult with their language staff to determine which categories should be prioritized for their needs and what the cutoff score should be.
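To illustrate how an agency might apply the rubric in practice, here is a hypothetical Python sketch of scoring one translation and checking it against a cutoff. The numeric mapping of the 4-point scale, the 2.5 cutoff, and the rule that any catastrophic rating fails the translation outright are illustrative assumptions, not part of the official rubric.

```python
# Hypothetical sketch of applying the five-category rubric on the 4-point scale.
# The numeric mapping, cutoff, and hard-fail rule are illustrative assumptions.

SCALE = {"good": 3, "needs_improvement": 2, "poor": 1, "catastrophic": 0}

CATEGORIES = [
    "content_accuracy",
    "tone_formality",
    "readability",
    "presentation",
    "respect",
]

def passes(scores: dict[str, str], cutoff: float = 2.5) -> bool:
    """Return True if a translation meets the (assumed) cutoff.

    `scores` maps each category to one of the SCALE labels.
    Any single 'catastrophic' rating fails the translation outright.
    """
    if any(scores[c] == "catastrophic" for c in CATEGORIES):
        return False
    average = sum(SCALE[scores[c]] for c in CATEGORIES) / len(CATEGORIES)
    return average >= cutoff

# Example: a translation with minor tone issues but no other errors.
sample = {
    "content_accuracy": "good",
    "tone_formality": "needs_improvement",
    "readability": "good",
    "presentation": "good",
    "respect": "good",
}
print(passes(sample))  # True under the assumed 2.5 cutoff
```

In practice, an agency might weight the categories differently or set a stricter cutoff for high-stakes documents; the point is that the rubric can be recorded and applied consistently across vendors.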
Impact and Path Forward
The final deliverables for my project include:
- An MTQE rubric to recommend to agencies, along with a supplementary document explaining the rubric and outlining best practices
- A report that summarizes research on automatic MTQE methods
- A presentation for an AI community of practice for city employees on recommendations for evaluating machine translation tools
- A resource library of model translations to support evaluation work
My work this summer has given agencies and language access staff a foundational overview of different MTQE methods, along with practical considerations for how to conduct MTQE for their specific needs. My deliverables will be used to inform future policy recommendations and guide evaluation best practices going forward.