Artificial intelligence
from InfoQ
18 hours ago
Hugging Face Releases FinePDFs: A 3-Trillion-Token Dataset Built from PDFs
FinePDFs is a 3.65 TB corpus of 475 million PDF documents in 1,733 languages, offering roughly 3 trillion tokens of complementary, domain-rich data for LLM training.