#ai-training-data

1 day ago

This South Korean hotel worker is training a robot to fold a banquet napkin: 'I've been doing this about once a month' | Fortune

Body-worn cameras capture skilled workers’ movements to train AI systems that can teach robots humanlike dexterity for real-world tasks across workplaces and homes.

3 weeks ago

Shuttered startups are selling old Slack chats and emails to AI companies

Defunct startups are monetizing their digital data by selling it to AI companies, raising significant privacy concerns.

from www.businessinsider.com

3 weeks ago

Voracious demand for robotics training data is transforming gig work

Instawork is transforming gig work into a data generation platform for robotics, addressing the need for real-world training data in AI development.

#meta

from WIRED

Information security

Meta Pauses Work With Mercor After Data Breach Puts AI Industry Secrets at Risk

Meta has paused work with Mercor due to a major security breach affecting data used for AI training.

Privacy technologies

Facebook's new button lets its AI look at photos you haven't uploaded yet

Meta's opt-in camera-roll feature uploads unpublished photos to its cloud, suggests edits, and can use edited or shared images to train its AI.

from www.businessinsider.com

My AI startup pays people to film themselves taking out the trash. It's now valued at $150 million.

Kled AI pays individuals for personal data to train AI, addressing the demand for training data in the AI industry.

DoorDash will start paying gig workers for creating content to train AI models

DoorDash launched Tasks, a program enabling gig workers to earn extra income by completing short activities like filming restaurant dishes and recording conversations to train AI and robotics models.

Music giant BMG sues Anthropic over AI training

BMG sued Anthropic for training Claude AI models on copyrighted song lyrics from torrent sites without authorization, citing 493 instances of copyright infringement.

#ai-copyright-policy

Intellectual property law

UK reverses course on AI copyright position after backlash

The UK government abandoned its plan to allow AI companies to train on copyrighted works with only an opt-out clause for artists, after significant backlash from the creative community.

from www.bbc.com

Intellectual property law

Government backtracks on AI and copyright after outcry from major artists

The UK government reversed its AI copyright policy allowing opt-out training of copyrighted works after creative industry backlash, now seeking a balanced approach without a preferred solution.

#copyright-infringement

Intellectual property law

The dictionaries are suing OpenAI for 'massive' copyright infringement, and say ChatGPT is starving publishers of revenue | Fortune

Britannica and Merriam-Webster sued OpenAI for using their copyrighted content to train ChatGPT without permission, claiming the AI diverts traffic and revenue from publishers.

from www.amny.com

Intellectual property law

Merriam Webster, Encyclopedia Britannica sue OpenAI for copyright infringement | amNewYork

Merriam Webster and Encyclopedia Britannica sued OpenAI for copyright and trademark infringement, alleging systematic copying of their content to train AI models and generate verbatim user responses without compensation or permission.

Intellectual property law

The dictionary sues OpenAI | TechCrunch

Encyclopedia Britannica sued OpenAI for massive copyright infringement, alleging unauthorized scraping of nearly 100,000 articles to train ChatGPT and generating verbatim reproductions of its content.

Intellectual property law

Encyclopedia Britannica is suing OpenAI for allegedly 'memorizing' its content with ChatGPT

Encyclopedia Britannica and Merriam-Webster sued OpenAI for using their copyrighted content to train AI models and generating substantially similar responses without permission.

from TNW | Media

Encyclopedia Britannica and Merriam-Webster sue OpenAI

Encyclopedia Britannica and Merriam-Webster sued OpenAI for training ChatGPT on nearly 100,000 of their articles without permission and reproducing their copyrighted content verbatim in responses.

from Entrepreneur

Intellectual property law

Anthropic Is Being Sued for $3 Billion Over Music Piracy

Anthropic allegedly downloaded over 20,000 copyrighted songs and faces a lawsuit from major music publishers seeking more than $3 billion in damages.

from www.dw.com

The internet was supposed to be free. What went wrong?

When Guatemalan computer scientist Luis von Ahn first proposed the idea of "games with a purpose" (GWAPs) in 2004, his goal was to harness human brainpower so that computers could learn from it. His idea was simple: Get humans to solve tasks that are trivial to us but difficult for computers back then, like labeling images, transcribing text or classifying data.

Games

#creative-professionals

Artificial intelligence

This AI company is hiring improv actors - and willing to pay them $74 an hour

Handshake AI is hiring actors to record improvised scenes at $74 per hour for an unnamed leading AI company, representing growing demand for non-tech professionals in AI development.

Artificial intelligence

AI companies want to harvest improv actors' skills to train AI on human emotion

AI training companies hire creative professionals like actors and comedians to generate specialized data that helps fix gaps in AI model knowledge, raising concerns about accelerating job obsolescence in creative industries.

Women in technology

from Intelligencer

The Laid-off Scientists and Lawyers Training AI to Steal Their Careers

Unemployed workers are being recruited by companies like Mercor to generate training data for AI systems, ironically replacing the jobs AI has already automated.

from www.socialmediatoday.com

The laid-off lawyers and PhDs training AI to steal their careers

Unemployed workers are being recruited by companies like Mercor to create training data for AI systems, often the same technology that displaced them from their jobs.

Roam Research

X adds Grok-powered audio option to long-form articles

X introduced audio playback for long-form articles using Grok AI's voice, enabling background listening to boost creator engagement and content consumption while improving AI training data quality.

#privacy-violation

Privacy technologies

Can Meta see your private life through its Ray-Ban smart glasses? What to know

Privacy professionals

Workers report watching Ray-Ban Meta-shot footage of people using the bathroom

UK government delays AI copyright rules amid artist outcry

The UK government delayed its AI data bill after stakeholder consultation revealed opposition to allowing AI companies to train models on copyrighted materials without creator consent.

Science

from Artforum

Recursive Resemblance

Generative AI models risk collapse when trained on their own output, causing statistical degradation and improbable sequences that compound approximation errors over time.

Tech industry

from Search Engine Roundtable

Your smart TV may be crawling the web for AI

Bright Data offers streaming services an ad-free monetization alternative by converting smart TVs into residential proxies that collect web data for resale to AI companies.

Anthropic Updates Its Crawler Documentation: ClaudeBot, Claude-User & Claude-SearchBot

ClaudeBot helps enhance the utility and safety of our generative AI models by collecting web content that could potentially contribute to their training. When a site restricts ClaudeBot access, it signals that the site's future materials should be excluded from our AI model training datasets.
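As a rough illustration of the signal described above, a site's robots.txt rules can be checked against a crawler's user-agent token. The sketch below uses Python's standard `urllib.robotparser`; the robots.txt content, the `is_allowed` helper, and the example.com URL are hypothetical and chosen only for illustration. The `ClaudeBot` token comes from the excerpt above; how any crawler actually interprets these rules is not something this sketch can confirm.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks ClaudeBot everywhere, allows all other agents.
ROBOTS_TXT = """\
User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL only; not a real article.
claudebot_ok = is_allowed(ROBOTS_TXT, "ClaudeBot", "https://example.com/articles/1")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/articles/1")

print("ClaudeBot allowed:", claudebot_ok)
print("SomeOtherBot allowed:", other_ok)
```

Parsing the rules from a string keeps the check offline and repeatable; a real audit would fetch the live robots.txt from the site in question.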

Privacy technologies

from Entrepreneur

from IPWatchdog.com | Patents & Intellectual Property Law

Most Founders Don't Realize They're Giving Away Their Influence - Here's How to Take It Back

Every search, purchase, loyalty swipe, location ping and scroll feeds systems that now shape pricing, product decisions, hiring and marketing strategies. Most founders understand this in theory, but few grasp the practical consequence: whether they intend to or not, they and their customers are already casting votes with their data. And those votes? They're usually cast passively, on someone else's terms.

Data science

#copyright

Intellectual property law

Plaintiffs Propose Plan for Landmark $1.5 Billion Copyright Settlement Process with Anthropic

Artificial intelligence

Music publishers sue Anthropic for $3 billion over 'flagrant piracy'

Intellectual property law

New York Times reporter files lawsuit against AI companies

from LawSites

Intellectual property law

Thomson Reuters Tells Appeals Court: ROSS's Copying Was 'Theft, Not Innovation'

Germany news

ChatGPT violated copyright law by harvesting musicians' lyrics, German court rules

from Business Matters

from IPWatchdog.com | Patents & Intellectual Property Law

Intellectual property law

AI firm Stability AI wins High Court case against Getty Images over copyright claims

Social media marketing

Reddit INSIDER sends major vote of confidence after earnings

from TheStreet

Artificial intelligence

Reddit INSIDER sends major vote of confidence after earnings

from Social Media Today

Artificial intelligence

Reddit Launches Legal Action to Block AI Companies from Scraping its Data

Tech industry

Reddit sues Perplexity and others for allegedly scraping millions of user comments

from Macon Telegraph

Daily Tech Insider Maps the AI Arms Race From Silicon Valley to the Moon

Major tech companies are committing massive AI infrastructure spending, accelerating deployment, concentrating control, and driving job and market disruptions.

from PetaPixel

Amazon May Launch Marketplace for Publishers to Sell Content to AI Firms

Amazon is exploring a content marketplace enabling publishers to license articles and data directly to AI companies to replace web scraping and monetize content.

Amazon may launch a marketplace where media sites can sell their content to AI companies | TechCrunch

Amazon is reportedly planning a marketplace to let publishers license content directly to AI companies to provide legally safe training data.

#web-scraping

Business

Increase of AI bots on the Internet sparks arms race

Artificial intelligence

Anthropic and OpenAI are crawling the web even more and not giving much back

from Futurism

Anthropic Knew the Public Would Be Disgusted by How It Was Destroying Physical Books, Secret Documents Reveal

Anthropic bought, shredded, and scanned millions of used books to train AI, relying on first-sale doctrine and a transformative-use ruling to avoid paying authors.

Video game company stock prices dip after Google introduces an AI world-generation tool

The stock prices of some major video game companies, including Take-Two Interactive, Roblox, and Unity, had notable declines on Friday, just a day after Google announced its Project Genie tool that lets users prompt AI to generate interactive experiences, Reuters reports. Take-Two's stock price closed at $220.30 (down 7.93 percent from yesterday), Roblox's closed at $65.76 (down 13.17 percent), and Unity's closed at $29.10 (down 24.22 percent).

Video games

#internet-archive

Media industry

Publishers are blocking the Internet Archive for fear AI scrapers can use it as a workaround

from Nieman Lab

Media industry

News publishers limit Internet Archive access due to AI scraping concerns

from BuzzFeed

If You Use Gmail, You're Going To Want To Turn Off This 1 Automatic Setting ASAP

For Gmail users, there is an automatic opt-in that may allow Google access to your emailed data (think: your personal and work messages, your attachments) "to train AI models," cybersecurity experts allege. If you don't want this information shared, you need to adjust your settings. In the race for companies to get an ROI on AI, we're already seeing large language models running out of new, human-generated data to train on.

from IPWatchdog.com | Patents & Intellectual Property Law

from Global IP & Technology Law Blog

Other Barks & Bites for Friday, January 23: USAA Petition on Section 101 Distributed for Conference; Fifth Circuit Says Trade Secret Claimants Must Apportion Damages; TRAIN Act Introduced in House

New U.S. IP developments: TRAIN Act proposes subpoena power for AI training data; courts and agencies advance major trademark, patent, antitrust, and trade-secret rulings.

A Year On from UK Government Consultation on Copyright and Artificial Intelligence

those options range from "option 0", simply doing nothing and leaving UK copyright legislation in its currently uncertain state when it comes to the use of copyright materials to train AI models, through to options which would either require specific consent from rights holders in all cases ("option 1") or allow consent to be assumed by AI developers unless a rights holder objects, subject to developers being transparent about what materials have been used in training ("option 3").

UK politics

from Futurism

After Being Pillaged By AI Companies, Wikipedia Signs Deal to Get Paid By Them

Wikipedia is licensing its collection of over 65 million articles to major AI companies through a paid Enterprise program to recoup costs and fund operations.

from Axios

The rise of "web rot"

Older websites persist and degrade search quality and training data, while steady overall web traffic masks a decline among sites older than five years.

World's largest shadow library made a 300TB copy of Spotify's most streamed songs

Anna's Archive is offering high-speed, enterprise-level access to scraped LLM training data including unreleased collections, raising concerns about facilitating AI labs and legal exposure.

Music

Activist group says it has scraped 86m music files from Spotify

Anna's Archive claims to have scraped 86 million Spotify tracks and metadata, planning to release them online and potentially accelerate AI training on pirated music.

Adobe hit with proposed class-action, accused of misusing authors' work in AI training | TechCrunch

A proposed class-action lawsuit filed on behalf of Elizabeth Lyon, an author from Oregon, claims that Adobe used pirated versions of numerous books-including her own-to train the company's SlimLM program. Adobe describes SlimLM as a small language model series that can be "optimized for document assistance tasks on mobile devices." It states that SlimLM was pre-trained on SlimPajama-627B, a "deduplicated, multi-corpora, open-source dataset" released by Cerebras in June of 2023.

Artificial intelligence

Miscellaneous

from euronews

EU vs. Big Tech: What actions have regulators taken so far?

European regulators are enforcing new AI, digital services, and markets laws to curb Big Tech dominance and protect consumers and creators.

Who's making the most money in AI? It's not who you think

Emerging vendors like Mercor and Handshake profit massively by supplying specialized data, engineers, and labeling services to frontier AI labs pursuing AGI.

India's government wants AI companies to pay for content

India proposes blanket training licenses for AI with royalties paid only upon commercialization, set by a government committee and collected via a centralized nonprofit collective.

Really Simple Licensing spec makes AI orgs pay to scrape

Really Simple Licensing (RSL) 1.0 enables machine-readable rules for crawlers, allowing publishers to declare access, processing, and payment terms for web content.

#eu-antitrust

Miscellaneous

Google faces a new EU antitrust probe over content used for AI Overviews, YouTube

EU data protection

European Commission investigates Google's AI training processes

Miscellaneous

Google Zero is under investigation by the EU

Europe politics

EU opens investigation into Google's use of online content for AI models

Miscellaneous

EU launches Google antitrust probe over AI training

Miscellaneous

EU opens antitrust investigation into Google's AI practices

Publishers say no to AI scrapers, block bots at server level

Millions of websites are blocking AI crawler bots via robots.txt to prevent training-data scraping and reduce non-human server traffic.

Micro1, a Scale AI competitor, touts crossing $100M ARR | TechCrunch

Micro1 grew ARR from roughly $7M to over $100M this year by rapidly recruiting and vetting domain experts to supply human training data for AI labs and enterprises.

Google denies analyzing your emails for AI training - here's what happened

I contacted Google for comment, and a spokesperson sent me the following statement: "These reports are misleading - we have not changed anyone's settings. Gmail Smart Features have existed for many years, and we do not use your Gmail content for training our Gemini AI model. Lastly, we are always transparent and clear if we make changes to our terms of service and policies."

Privacy professionals

EU data protection

from www.dw.com

EU plans to ease GDPR laws and AI constraints in major shift | DW, 11/18/2025

EU proposals would narrow GDPR protections, enable broader data harvesting for AI, remove cookie consent pop-ups, and shift burden onto users to request data removal.

Cloudflare CEO says Google is abusing its monopoly in search to feed its AI | Fortune

"The great patron of the internet for the last 27 years was Google. The great villain of the internet today is also Google," Prince said. He claimed that in the past, for every two pages that Google crawled to inform its search engine, it would, on average, send one visitor to those sites-traffic that publishers can monetise with advertising.

Artificial intelligence

#copyright-law

Germany news

Court rules that OpenAI violated German copyright law; ordered it to pay damages | TechCrunch

from IPWatchdog.com | Patents & Intellectual Property Law

Intellectual property law

Labor rules out giving tech giants free rein to mine copyright content to train AI

Intellectual property law

Anthropic Settlement Signals AI Innovation Can Thrive Within Existing Copyright Framework

from Techzine Global

Wikimedia calls on AI companies to use paid API

Wikimedia has called on AI companies to take responsibility for using Wikipedia content in their language models. This can be achieved by stopping scraping and using the paid API instead. In a blog post, the organization states that artificial intelligence cannot exist without the human knowledge collected and maintained on platforms such as Wikipedia. To maintain that balance, Wikimedia asks developers of generative AI to clearly cite their sources and contribute to the continued existence of the open knowledge project via the paid Wikimedia Enterprise platform.

Artificial intelligence

Elon Musk's Grokipedia launches with AI-cloned pages from Wikipedia

Since 2001, Wikipedia has been the backbone of knowledge on the internet. Hosted by the Wikimedia Foundation, it remains the only top website in the world run by a nonprofit. Unlike newer projects, Wikipedia's strengths are clear: it has transparent policies, rigorous volunteer oversight, and a strong culture of continuous improvement. Wikipedia is an encyclopedia, written to inform billions of readers without promoting a particular point of view.

Non-profit organizations

from ABC7 Los Angeles

Elon Musk launches Grokipedia to compete with online encyclopedia Wikipedia

Elon Musk launched Grokipedia, a crowdsourced encyclopedia powered by xAI, presenting itself as a minimalist Wikipedia rival claiming to provide the complete truth.

How AI labs use Mercor to get the data companies won't share | TechCrunch

AI labs hire former senior employees through marketplaces like Mercor to obtain industry workflows and train automation models without corporate data contracts.

from IPWatchdog.com | Patents & Intellectual Property Law

Canva debuts foundational 'design' model, extends AI tools across its app

Canva has built its own foundational AI model that generates layered designs users can edit more easily. It's one of several generative AI-related features Canva announced Thursday, alongside expanded access to its AI assistant and content generation capabilities across its app. To date, Canva has partnered with a variety of AI model providers for content generation - Black Forest Labs, Google, and OpenAI among them - and it acquired Leonardo AI last year.

Artificial intelligence

#data-scraping

Artificial intelligence

Reddit Dubs Perplexity AI and Data Scraping Companies 'Would-Be Bank Robbers'

from The Mercury News

Artificial intelligence

Reddit sues AI company Perplexity and others for 'industrial-scale' scraping of user comments

from AdExchanger

Tech industry

Sour Scrapes; (Anti)-trust The Process | AdExchanger

from IPWatchdog.com | Patents & Intellectual Property Law

Artificial intelligence

Reddit drags Perplexity in a new lawsuit, accusing it of building up a $20 billion company off stolen data

Scale AI agreed to settle multiple lawsuits from its California contractors

Scale AI agreed to settle four California lawsuits alleging worker misclassification, underpayment, and denied benefits and has stopped hiring California gig workers.

Your Uber driver has a new side hustle: Training AI for cash

According to Uber, beginning later this year, drivers and couriers who opt into the program can complete "digital tasks" within Uber's Driver app. These tasks can include submitting a video of themselves speaking in their native language, uploading pictures of specific everyday items, or presenting documents written in a different language. After tasks are completed, the earnings will be in the users' balance within 24 hours. Compensation depends on the time commitment to complete tasks and their complexity.

Artificial intelligence

Privacy technologies

from Exchangewire

Verve Study Shows That 75% of Consumers are More Open to Watching Ads for Free Content

Consumers increasingly accept ad-supported content while expressing rising concern about data use, especially for AI training.

Inside the web infrastructure revolt over Google's AI Overviews

The new change, which Cloudflare calls its Content Signals Policy, happened after publishers and other companies that depend on web traffic have cried foul over Google's AI Overviews and similar AI answer engines, saying they are sharply cutting those companies' path to revenue because they don't send traffic back to the source of the information. There have been lawsuits, efforts to kick-start new marketplaces to ensure compensation, and more-

Tech industry

Science

from Nature

How stereotypes shape AI - and what that means for the future of hiring

Internet images encode gendered stereotypes: women shown younger and linked to caregiving jobs, men linked to leadership roles, embedding bias in AI training and hiring.

Privacy technologies