OpenAI and training data company Handshake AI are asking third-party contractors to upload real work that they did in past and current jobs, according to a report in Wired. This appears to be part of a larger strategy across AI companies that are hiring contractors to generate high-quality training data in the hopes that this will eventually allow their models to automate more white-collar work.
In fact, when prompted strategically by researchers, Claude delivered the near-complete text of Harry Potter and the Sorcerer's Stone, The Great Gatsby, 1984, and Frankenstein, in addition to thousands of words from books including The Hunger Games and The Catcher in the Rye. Varying amounts of these books were also reproduced by the other three models. Thirteen books were tested.
The conventional wisdom holds that the more compute or training data you have, the smarter your AI tool will be. Sutskever said in the interview that, for around the past half-decade, this "recipe" has produced impactful results. It's also efficient for companies because the method provides a simple and "very low-risk way" of investing resources compared to pouring money into research that could lead nowhere.
There is a persistent myth of objectivity around AI, perhaps because people assume that once the systems are deployed, they can function without any human intervention. In reality, developers constantly tweak and refine algorithms with subjective decisions about which results are more relevant or appropriate. Moreover, the immense corpus of data that machine learning models train on can also be polluted.
Reddit this week filed suit against Perplexity and three other companies - Oxylabs UAB, AWM Proxy, and Serp Api - for allegedly engaging in so-called AI scraping without authorization. According to the lawsuit, filed in federal court in New York, the four companies collected millions of posts on Reddit with the aim of monetizing them. Scrapers bypass technical protections to steal data that can then be sold to clients who want the material for AI training.
Almost exactly a year ago, Lionsgate announced a bold partnership with the AI startup Runway to develop a new model capable of generating "cinematic video" exclusively for the studio's use. In return, the studio gave the firm unrestricted access to its treasure trove of movies - which include everything from the "Hunger Games" films to "American Psycho" - to train the AI model.
The risks are practically endless. Enterprises are investing billions in generative AI initiatives while ignoring doubts about future legal exposures. Major model makers provide no visibility into their training data.