In December, researchers from Stanford's Internet Observatory identified more than 1,000 images categorized as Child Sexual Abuse Material (CSAM) in one of the most influential AI training sets of the moment: LAION-5B.
If your full-time, eight-hours-a-day, five-days-a-week job were to look at each image in the dataset for just one second, it would take you 781 years.
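The 781-year figure can be checked with a few lines of arithmetic. The sketch below assumes a dataset size of about 5.85 billion images (the commonly cited size of LAION-5B, which is not stated in the sentence above) and 52 working weeks per year with no holidays. The function name and parameters are illustrative only.

```python
def years_to_view(images, seconds_per_image=1.0, hours_per_day=8,
                  days_per_week=5, weeks_per_year=52):
    """Return the number of working years needed to look at every image once."""
    total_seconds = images * seconds_per_image
    seconds_per_workday = hours_per_day * 3600
    workdays = total_seconds / seconds_per_workday
    work_weeks = workdays / days_per_week
    return work_weeks / weeks_per_year


# Assumption: roughly 5.85 billion images in the dataset.
ASSUMED_IMAGE_COUNT = 5_850_000_000

years = years_to_view(ASSUMED_IMAGE_COUNT)
print(f"{years:.2f} working years")
```

Under these assumptions the result lands just above 781 years. Allowing for holidays or a slower viewing pace would push the figure higher.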
Common Crawl is a corpus of web data gathered by crawling the web every month. Each crawl contains data from more than 3 billion web pages.
Pinterest generates the captions on its pages from the ALT tags, so users learned to write them before they 'pinned' their images.