
"we should not assume that they're taking reasonable precautions to prevent incursions into consumers' privacy. Users should not be automatically opted in to having their data used in model training, and developers should proactively remove sensitive data from training sets."
"how it is cleaned, how we remove personal information from it, and then, how it is used again for retraining,"
"From the study I did recently, we really don't understand right now to what extent the companies are potentially cleaning that data before it is used for retraining, and there is research demonstrating, including research by employees of the large companies, that chatbots can memorize training data,"
There is little transparency into how AI developers collect, clean, and reuse data for model training, and no current requirement to map full data pipelines. As publicly scraped English-language data grows scarce, user conversations are increasingly treated as a training source. Users are often opted in by default, without explicit consent, to having their data used for retraining, and sensitive information may remain in datasets. Research indicates that chatbots can memorize training data, creating a risk that personal information leaks back out in model outputs. Collected data could also be repurposed for targeted advertising or other commercial uses without users' awareness.
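The memorization risk the summary describes can be probed directly: prompt a model with the prefix of a suspected training document and check whether it reproduces the held-back continuation verbatim. The sketch below is a toy version of such a probe; `generate` is a stand-in for whatever model API is being tested, and the document, function names, and length thresholds are all invented for illustration.

```python
from typing import Callable

def memorization_probe(
    generate: Callable[[str], str],
    document: str,
    prefix_len: int = 50,
    match_len: int = 50,
) -> bool:
    """Prompt with a document prefix and test whether the model
    reproduces the held-back suffix verbatim, a signature of
    memorized training data."""
    prefix, suffix = document[:prefix_len], document[prefix_len:]
    completion = generate(prefix)
    return completion[:match_len] == suffix[:match_len]

# Toy stand-in "model" that has memorized its single training document;
# a real probe would call an actual chatbot API here.
TRAINING_DOC = (
    "Jane Doe, SSN 123-45-6789, lives at 42 Example Lane and can be "
    "reached at jane.doe@example.com about her account balance."
)

def toy_generate(prompt: str) -> str:
    if TRAINING_DOC.startswith(prompt):
        return TRAINING_DOC[len(prompt):]
    return "(no memorized continuation)"

if __name__ == "__main__":
    print(memorization_probe(toy_generate, TRAINING_DOC))  # True -> leak
```

Published extraction-attack research runs probes like this at scale against text suspected to be in a training set; the toy here just makes the verbatim-match criterion concrete.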
Read at The Register