#multilingual-datasets

Artificial intelligence
from InfoQ
22 hours ago

Hugging Face Introduces RTEB, a New Benchmark for Evaluating Retrieval Models

RTEB uses a hybrid of open and private datasets to better evaluate embedding model generalization for real-world retrieval across domains and 20 languages.
Artificial intelligence
from InfoQ
1 month ago

Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDFs

FinePDFs is a 3.65 TB corpus of 475 million PDF documents across 1,733 languages, offering about 3 trillion tokens of complementary, domain-rich data for LLM training.