MLCommons partnered with Hugging Face to release Unsupervised People's Speech, a dataset of over a million hours of voice recordings in 89 languages. The initiative aims to advance natural language processing research, especially for low-resource languages, and to make speech technology more accessible worldwide. However, the dataset carries risks, notably bias stemming from its origins. Most recordings are in American-accented English, which could skew AI models trained on the data and limit their effectiveness for non-native speakers and for speakers with other accents.
Supporting broader natural language processing research for languages other than English helps bring communication technologies to more people globally.
However, AI data sets like Unsupervised People's Speech can carry risks for the researchers who choose to use them.