CulturaX: A High-Quality, Multilingual Dataset for LLMs - Multilingual Dataset Creation | HackerNoon
Briefly

To develop a multilingual public dataset for LLMs, our strategy is to combine mC4 and OSCAR, two of the largest multilingual datasets available, and then apply extensive cleaning and deduplication to the result.
mC4 is a multilingual document-level dataset originally created to train the multilingual encoder-decoder model mT5. It was extracted from 71 monthly Common Crawl snapshots and filtered for language confidence and quality.
Language identification in mC4 is performed with the cld3 tool, while our dataset additionally applies strict deduplication to remove redundant documents.
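
The summary above does not spell out how the two corpora are merged or which deduplication method is used, so the following is only a minimal sketch under stated assumptions: it uses exact-match hashing of whitespace-normalized, lowercased text as a stand-in, and the names `normalize`, `merge_and_deduplicate`, `mc4_sample`, and `oscar_sample` are hypothetical and not taken from the article.

```python
import hashlib


def normalize(text: str) -> str:
    # Collapse runs of whitespace and lowercase the text so that
    # trivially different copies of a document hash to the same value.
    return " ".join(text.split()).lower()


def merge_and_deduplicate(*corpora):
    # Assumption: each corpus is an iterable of document strings.
    # Keeps the first occurrence of each document, in input order.
    seen = set()
    merged = []
    for corpus in corpora:
        for doc in corpus:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                merged.append(doc)
    return merged


# Toy stand-ins for the two source datasets (illustrative data only).
mc4_sample = ["Hello world.", "Bonjour le monde."]
oscar_sample = ["hello   world.", "Hola mundo."]

combined = merge_and_deduplicate(mc4_sample, oscar_sample)
print(combined)
```

Exact hashing only catches copies that are identical after normalization; near-duplicate detection would need a similarity-based technique, which this sketch does not attempt.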