"In a newly published paper, a group of Google DeepMind researchers claim to have found a way to clean up this data and make it usable for training, which they claim could be a "powerful tool" for scaling up frontier models. They refer to the idea as Generative Data Refinement, or GDR. The method uses pretrained generative models to rewrite the unusable data, effectively purifying it so it can be safely trained on."
"Minqi Jiang, one of the paper's researchers who has since left the company to Meta, told Business Insider that a lot of AI labs are leaving usable training data on the table because it's intermingled with bad data. For example, if there's a document on the web that contains something considered unusable, such as someone's phone number or an incorrect fact, labs will often discard the entire thing."
Large language models require vast amounts of text drawn from webpages, books, and other sources, but much web text is excluded from training because it is toxic, inaccurate, or contains personally identifiable information, creating a shortage of usable tokens. Entire documents are often removed when they contain a single unsafe line, wasting many otherwise valuable tokens. Generative Data Refinement (GDR) uses pretrained generative models to rewrite and purify unusable passages so the content can be safely included in training sets. GDR could reclaim discarded data and help scale available training resources, though specific deployment details remain unspecified.
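To make the idea concrete, here is a minimal sketch of the rewrite-instead-of-discard pattern the article describes. This is not DeepMind's implementation: the `generate` callable, the `is_safe` filter, and the prompt wording are all hypothetical placeholders standing in for a real pretrained model and a real safety classifier.

```python
# Hypothetical sketch of the GDR pattern described in the article:
# instead of discarding a document flagged by a safety filter, ask a
# pretrained generative model to rewrite it without the unsafe content.

REFINE_PROMPT = """Rewrite the document below so that it contains no
personally identifiable information and no factual errors, while
preserving as much of the original content and style as possible.

Document:
{document}

Rewritten document:"""


def refine_document(document: str, generate) -> str:
    """Purify one document with a generative model.

    `generate` is any callable mapping a prompt string to the model's
    completion string (e.g., a thin wrapper around an LLM API).
    """
    return generate(REFINE_PROMPT.format(document=document))


def build_training_set(raw_docs, is_safe, generate):
    """Keep safe documents as-is; rewrite flagged ones instead of
    dropping them, so their usable tokens are retained for training.

    `is_safe` is a classifier (e.g., a PII or toxicity filter) that
    returns True when a document can be used unchanged.
    """
    return [doc if is_safe(doc) else refine_document(doc, generate)
            for doc in raw_docs]
```

Under this sketch, a page containing a single phone number would have that number rewritten or replaced rather than causing the whole page to be filtered out.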
#generative-data-refinement #ai-training-data #personally-identifiable-information-pii #data-toxicity
Read at Business Insider