Probe reveals subtitles from 174K YouTube videos were used to train AI
Briefly

AI labs have trained neural networks on the YouTube Subtitles dataset, part of the Pile, a collection of data from various sources, without the video creators' awareness.
The YouTube Subtitles dataset is a small part of the 825GB Pile, which also includes data from GitHub, Wikipedia, Ubuntu IRC, scientific papers, and more, and which has been used by tech giants such as Apple and Nvidia.
EleutherAI collected the YouTube subtitles openly, describing the process in a 2020 research paper and sharing on GitHub the code used to scrape subtitles from videos matching specific search terms.
Read at The Register