Hugging Face's Cosmopedia Hopes To Reshape Pre-Training Data
Briefly

To keep the generated content diverse, the Hugging Face team crafted over 30 million Cosmopedia prompts spanning hundreds of topics, achieving a duplicate content rate of less than 1%.
Cosmopedia's creation took a dual approach: conditioning prompts on web data for scale and on curated sources for quality.
The resulting dataset not only enriches AI training resources but also highlights the need for a decontamination pipeline to protect the integrity of synthetic data.
This step, akin to the one used for the Phi-1 model, removes potentially contaminated samples so that the dataset stays clean.
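The idea of removing potentially contaminated samples can be illustrated with a minimal sketch. This is not the Hugging Face team's code: the function names, the 10-word n-gram size, and the 0.5 overlap threshold are all assumptions chosen for illustration. It flags a sample when it shares a word n-gram with a benchmark item and the matched characters cover more than the threshold fraction of that item.

```python
from difflib import SequenceMatcher


def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def is_contaminated(sample, benchmark_texts, n=10, threshold=0.5):
    """Flag a sample that overlaps heavily with any benchmark item.

    n and threshold are illustrative assumptions, not values from the article.
    """
    sample_lower = sample.lower()
    sample_grams = ngrams(sample, n)
    for bench in benchmark_texts:
        # Cheap filter: skip the expensive comparison if no n-gram is shared.
        if not sample_grams & ngrams(bench, n):
            continue
        bench_lower = bench.lower()
        matcher = SequenceMatcher(None, sample_lower, bench_lower, autojunk=False)
        # Total characters of the benchmark item that also appear in the sample.
        matched = sum(block.size for block in matcher.get_matching_blocks())
        if matched / len(bench_lower) > threshold:
            return True
    return False


def decontaminate(samples, benchmark_texts, **kwargs):
    """Keep only the samples that are not flagged as contaminated."""
    return [s for s in samples if not is_contaminated(s, benchmark_texts, **kwargs)]
```

Calling `decontaminate(samples, benchmark_texts)` returns the samples that passed the check. The n-gram filter keeps the cost manageable, since the character-level comparison only runs on pairs that already share a long phrase.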
Read at Medium