To develop a multilingual public dataset for LLMs, our strategy is to combine mC4 and OSCAR, two of the largest multilingual datasets available, and then apply extensive cleaning and deduplication to the result.
mC4 is a multilingual document-level dataset originally created to train the multilingual encoder-decoder model mT5. It was extracted from 71 monthly Common Crawl snapshots, with filtering that keeps only pages with high language-identification confidence.
Language identification in mC4 is performed with the cld3 tool. Our dataset additionally applies strict deduplication to improve data integrity.
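As a rough illustration of the two steps mentioned above (confidence-based language filtering followed by deduplication), here is a minimal sketch. It is not the pipeline used for mC4 or for our dataset: the function names, the `toy_identifier` stand-in, and the 0.95 threshold are all hypothetical, and simple exact-hash deduplication is swapped in for whatever deduplication methods the real pipeline uses. In practice the `identify` callable would wrap a real language identifier such as cld3.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Tuple

# Hypothetical interface: a language identifier returns
# (language_code, confidence) with confidence in [0, 1].
LangIdentifier = Callable[[str], Tuple[str, float]]


def filter_by_language(
    docs: Iterable[str],
    identify: LangIdentifier,
    target_lang: str,
    min_confidence: float,
) -> Iterator[str]:
    """Yield documents identified as target_lang with enough confidence."""
    for doc in docs:
        lang, confidence = identify(doc)
        if lang == target_lang and confidence >= min_confidence:
            yield doc


def deduplicate_exact(docs: Iterable[str]) -> Iterator[str]:
    """Yield each document once, keeping the first of any exact duplicates.

    Documents are compared after lowercasing and collapsing whitespace,
    so trivially different copies map to the same hash.
    """
    seen = set()
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc


def toy_identifier(text: str) -> Tuple[str, float]:
    """Stand-in for a real identifier (assumption, for illustration only)."""
    if "the" in text.lower().split():
        return ("en", 0.99)
    return ("und", 0.10)


docs = ["The cat sat.", "the  cat sat.", "Le chat dort."]
english = filter_by_language(docs, toy_identifier, "en", min_confidence=0.95)
cleaned = list(deduplicate_exact(english))
print(cleaned)
```

Filtering runs before deduplication here so that hashes are only computed for documents that will be kept; both functions are generators, so the sketch streams documents instead of loading a whole corpus into memory.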
#multilingual-datasets #natural-language-processing #data-cleaning #machine-learning #language-models