Hugging Face's Cosmopedia Hopes To Reshape Pre-Training Data

Hugging Face developed Cosmopedia for synthetic data creation. It covers diverse subjects with a duplicate content rate below 1%. Cosmopedia is the largest open synthetic dataset, comprising over 25 billion tokens and 30 million files.