Models All The Way Down
Briefly

In December 2023, researchers from Stanford's Internet Observatory identified more than 1,000 images categorized as Child Sexual Abuse Material (CSAM) in one of the most influential AI training sets of the moment: LAION-5B.
If your full-time, eight-hours-a-day, five-days-a-week job were to look at each image in the dataset for just one second, it would take you 781 years.
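The 781-year figure can be checked with a little arithmetic. The sketch below is not from the article. It assumes the dataset holds about 5.85 billion images (the size LAION-5B is usually reported to have) and that a working year is 52 five-day weeks with no holidays. The function name and its parameters are illustrative only.

```python
# Back-of-the-envelope check of the viewing-time claim.
# Assumptions (not stated in the article): ~5.85 billion images,
# and 52 working weeks per year with no holidays or breaks.

IMAGES = 5_850_000_000  # assumed size of LAION-5B


def years_to_view(images, seconds_per_image=1.0,
                  hours_per_day=8, days_per_week=5, weeks_per_year=52):
    """Return the number of working years needed to look at every image."""
    working_seconds_per_year = hours_per_day * 3600 * days_per_week * weeks_per_year
    return images * seconds_per_image / working_seconds_per_year


if __name__ == "__main__":
    print(f"{years_to_view(IMAGES):.2f} working years")
```

Under these assumptions a working year has 7,488,000 seconds, so the result lands at roughly 781 years, in line with the article's number. Allowing for holidays would push it higher.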
Common Crawl is a corpus of web data gathered by a monthly crawl of the web. It contains data from more than 3 billion web pages.
Pinterest generates the captions on its pages from the ALT tags, so users learned to write them before they 'pinned' their images.
Read at Knowing Machines