fromInfoQ
3 days agoHugging Face Releases FineTranslations, a Trillion-Token Multilingual Parallel Text Dataset
The dataset was created by translating non-English content from the FineWeb2 corpus into English using Gemma3 27B, with the full data generation pipeline designed to be reproducible and publicly documented. The dataset is primarily intended to improve machine translation, particularly in the English→X direction, where performance remains weaker for many lower-resource languages. By starting from text originally written in non-English languages and translating it into English, FineTranslations provides large-scale parallel data suitable for fine-tuning existing translation models.
Artificial intelligence


