The wisdom goes that the more compute you have or the more training data you have, the smarter your AI tool will be. Sutskever said in the interview that, for around the past half-decade, this "recipe" has produced impactful results. It's also efficient for companies because the method provides a simple and "very low-risk way" of investing resources compared to pouring money into research that could lead nowhere.
There is a persistent myth of objectivity around AI, perhaps because people assume that once the systems are deployed, they can function without any human intervention. In reality, developers constantly tweak and refine algorithms with subjective decisions about which results are more relevant or appropriate. Moreover, the immense corpus of data that machine learning models train on can also be polluted.
Reddit this week filed suit against Perplexity and three other companies - Oxylabs UAB, AWM Proxy, and Serp Api - for allegedly engaging in so-called AI scraping without authorization. According to the lawsuit, filed in federal court in New York, the four companies collected millions of posts on Reddit with the aim of monetizing them. Scrapers bypass technical protections to steal data that can then be sold to clients who want the material for AI training.
Almost exactly a year ago, it announced a bold partnership with the AI startup Runway to develop a new model capable of generating "cinematic video" exclusively for Lionsgate to use. In return, the studio gave the firm unrestricted access to its treasure trove of movies - which include everything from the "Hunger Games" films to "American Psycho" - to train the AI model.
The risks are practically endless. Enterprises are investing billions in generative AI initiatives while ignoring doubts about future legal exposures. Major model makers provide no visibility into their training data.