The research presents a new dual-encoder (DE) model for speech-to-text retrieval, evaluating it on 102 languages from the FLEURS dataset. Although the model is trained on just 21 languages from the CoVoST-2 dataset, it significantly outperforms the mSLAM DE model, which was trained on far more data. The results highlight the model's superior retrieval accuracy, especially on unseen languages, and suggest that the multilingual text seen by the backbone LLM during pre-training drives this generalization. The findings contribute valuable insights into multilingual speech processing and its applications in AI.
Our model significantly outperforms the mSLAM DE baseline on both R@1 and WER despite being trained on only a tenth of the data.
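For reference, WER is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of that computation (plain NumPy, not the paper's evaluation code; R@1 is illustrated in the retrieval sketch further down):

```python
import numpy as np

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.split(), hyp.split()
    # d[i, j] = edit distance between the first i ref words and first j hyp words
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,        # deletion
                          d[i, j - 1] + 1,        # insertion
                          d[i - 1, j - 1] + cost)  # substitution
    return d[len(r), len(h)] / max(len(r), 1)

print(wer("the cat sat", "the cat sit"))  # 1 substitution / 3 words ≈ 0.33
```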
We hypothesize that this is due to the vast amount of multilingual text our backbone LLM saw during pre-training.
The task is to retrieve the corresponding transcription given a speech sample; we train on 21 languages from CoVoST-2 and evaluate on 102, as sketched below.
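To make the setup concrete, here is a hedged sketch of the dual-encoder retrieval step, assuming the model's speech and text towers have already produced fixed-size embeddings; the function names, array shapes, and toy data are illustrative, not taken from the paper. R@1 then falls out naturally as the fraction of queries whose top-ranked candidate is the matching transcription:

```python
import numpy as np

def retrieve_top1(speech_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """For each speech embedding in (N, d), return the index of the
    closest transcription embedding in (M, d) by cosine similarity."""
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = s @ t.T                 # (N, M) cosine-similarity matrix
    return sim.argmax(axis=1)     # top-1 transcription index per query

# Toy usage: matched pairs share a row index, so R@1 is the fraction of
# queries whose top-1 hit is their own transcription.
rng = np.random.default_rng(0)
text_emb = rng.standard_normal((5, 16))
speech_emb = text_emb + 0.1 * rng.standard_normal((5, 16))  # noisy matches
preds = retrieve_top1(speech_emb, text_emb)
print("R@1:", (preds == np.arange(5)).mean())
```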
We find that our model performs best on the 20 languages that appear in both the training and evaluation data, yet it still excels on the remaining unseen languages.
#speech-recognition #machine-learning #multilingual-models #ai-research #natural-language-processing