
"One of the report's topline findings: there is a major divide between the accessibility and accuracy of AI transcription and translation tools when they're used for English and other dominant languages, and when they're used for languages that AI researchers have termed "low-resource." English represents more than 50% of the domains on the web. Mainstream language models are largely trained on data scraped from the internet, which is one reason transcription and translation tools perform so well in English."
""Low-resource" languages are those that have comparatively little digitized text on the web available to train models. Even some of the most-spoken languages in the world, like Urdu, are considered low-resource. The working group outlined a few ways this creates accessibility barriers. An AI translation tool may perform very well for a language pair like English and Spanish, but introduce significant errors when it's used for a pair of less common languages."
The Center for News, Technology & Innovation (CNTI) reviewed more than 55 studies to assess AI translation and transcription tools in journalism. Researchers from social science, linguistics, and computer science identified a major divide between how these tools perform for dominant languages, notably English, and for low-resource languages. English benefits because it accounts for more than 50% of domains on the web and mainstream models are largely trained on data scraped from the internet. Low-resource languages have comparatively little digitized text, which leads to poorer transcription and translation, errors with linguistic ambiguity and cultural nuance, output that falls short of human experts, and the propagation of biases in the training data. Evaluations found measurable mistranslations, including a study in Tanzania that flagged 13% of translated sentences as inaccurate.
Read at Nieman Lab