How Just 250 Bad Documents Can Hack Any AI Model
Briefly

"AI models like ChatGPT or Claude learn by reading tons of stuff from the internet - blogs, websites, articles, basically anything public. The problem is that anyone can post content online, which means bad actors can sneak in some harmful text that the AI will eventually read and learn from. This is called "data poisoning" - basically, you're poisoning the AI's training data with bad examples so it learns bad behaviors."
"Before this study, everyone assumed that if you wanted to hack a big AI model, you'd need to control a huge percentage of its training data. Like, if the AI reads 1 billion documents, you'd need to poison maybe 10 million of them or something like that. But here's the twist - that's not how it actually works. What The Researchers Found The researchers trained AI models of different sizes: A small 600M parameter model A medium 2B parameter model A large 7B parameter model"
AI models ingest massive amounts of publicly posted internet content, which exposes them to malicious or misleading material planted by adversaries. This is data poisoning: corrupted examples in the training data teach the model harmful behaviors. Conventional wisdom held that a successful attack required controlling a large share of the training data, for example poisoning millions of documents out of billions. The new findings show that this assumption is wrong. The researchers trained models at multiple sizes (600M, 2B, 7B, and 13B parameters) to compare vulnerability across scale, and a small, roughly fixed number of poisoned documents, on the order of 250, was enough to implant the bad behavior regardless of model size.
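To see why a fixed count of about 250 documents is surprising, some back-of-the-envelope arithmetic helps. The corpus sizes below are illustrative assumptions, not figures from the study; they only show that the same absolute number of poisoned documents becomes a vanishingly small fraction of the data as models and their training sets grow, yet larger models were reportedly no harder to poison.

```python
# Back-of-the-envelope: 250 poisoned documents as a share of the training data.
# Corpus sizes are made-up round numbers for illustration, assuming larger
# models are trained on proportionally more documents.
POISONED_DOCS = 250

illustrative_corpus_sizes = {   # model size -> assumed number of training documents
    "600M": 10_000_000,
    "2B":   30_000_000,
    "7B":   100_000_000,
    "13B":  200_000_000,
}

for model, n_docs in illustrative_corpus_sizes.items():
    fraction = POISONED_DOCS / n_docs
    print(f"{model:>5}: {POISONED_DOCS} / {n_docs:,} documents = {fraction:.6%} of training data")
```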