Anthropic Admits to Copying Books En Masse for Claude: Can Fair Use Save It?
Briefly

Anthropic employed four main methods of copying works to train its LLMs. First, a working copy was created from a central library. Second, a cleaned version was produced by stripping repeated or low-value text such as headers, footers, and page numbers. Third, the cleaned copy was tokenized: words were broken into simpler forms and converted into sequences of numerical tokens. Finally, each trained LLM itself retained compressed versions of the copied works, a byproduct of optimizing the statistical relationships among word fragments during training.
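To make those four steps concrete, here is a minimal Python sketch of the pipeline. The function names, the cleaning heuristic, and the toy vocabulary scheme are illustrative assumptions, not a reconstruction of Anthropic's actual code.

```python
import re

def make_working_copy(raw_text: str) -> str:
    # Step 1 (hypothetical): snapshot a work out of the central library.
    return raw_text

def clean(text: str) -> str:
    # Step 2 (hypothetical heuristic): drop lines that are only page
    # numbers; a real cleaner would also strip headers, footers, etc.
    lines = [ln for ln in text.splitlines() if not re.fullmatch(r"\s*\d+\s*", ln)]
    return "\n".join(lines)

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    # Step 3 (toy stand-in for a subword tokenizer): map each
    # whitespace-delimited fragment to an integer ID, growing the
    # vocabulary as new fragments appear.
    return [vocab.setdefault(frag, len(vocab)) for frag in text.split()]

# Step 4, the training run itself, is where the model's weights come to
# encode compressed statistical traces of the tokenized works.
vocab: dict[str, int] = {}
work = "Call me Ishmael.\n1\nSome years ago..."
print(tokenize(clean(make_working_copy(work)), vocab))
```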
Each work selected for training a given LLM was copied at every one of these steps: as a working copy, a cleaned copy, a tokenized copy, and the compressed copy retained by the trained model.
The tokenization step stemmed or lemmatized words and organized characters into short sequences, generating a corresponding numerical token for each; a toy version is sketched after this list.
Anthropic's training process iteratively discovered statistical relationships among word fragments drawn from millions of copied texts, optimizing the model's performance; the bigram sketch after this list is a one-step analogue.
The ability to delete duplicate or low-value works from the training set allowed for a more refined and effective training run; a minimal deduplication sketch closes the examples below.
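The tokenization bullet describes subword tokenization in lay terms; production systems typically use learned schemes such as byte-pair encoding rather than literal stemming. Here is a minimal sketch under those assumptions, with a deliberately crude stemmer and fixed three-character chunks standing in for a real tokenizer:

```python
def crude_stem(word: str) -> str:
    # Crude stand-in for a real stemmer (e.g., Porter): strip common suffixes.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def to_token_ids(text: str, vocab: dict[str, int]) -> list[int]:
    # Stem each word, split the stem into short character sequences
    # (here: fixed 3-char chunks), and map each chunk to a numerical token.
    ids = []
    for word in text.lower().split():
        stem = crude_stem(word)
        chunks = [stem[i : i + 3] for i in range(0, len(stem), 3)]
        ids.extend(vocab.setdefault(c, len(vocab)) for c in chunks)
    return ids

vocab: dict[str, int] = {}
print(to_token_ids("The whales were swimming", vocab))
```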
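For the statistical-relationships bullet, the sketch below counts how often one fragment follows another and normalizes the counts into conditional probabilities. Real training optimizes billions of parameters by gradient descent over vastly richer relationships; this bigram table is only the simplest possible analogue.

```python
from collections import Counter, defaultdict

def fragment_bigram_stats(token_streams: list[list[str]]) -> dict[str, dict[str, float]]:
    # Count how often each fragment follows each other fragment across
    # all copied texts, then normalize into conditional probabilities.
    counts: dict[str, Counter] = defaultdict(Counter)
    for stream in token_streams:
        for prev, nxt in zip(stream, stream[1:]):
            counts[prev][nxt] += 1
    return {
        prev: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for prev, c in counts.items()
    }

stats = fragment_bigram_stats([["call", "me", "ish", "mael"], ["call", "me", "may", "be"]])
print(stats["me"])  # {'ish': 0.5, 'may': 0.5}
```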
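Finally, a sketch of pruning the training set. The exact-duplicate check by content hash and the length threshold are assumptions for illustration, not a description of Anthropic's curation; real pipelines also use fuzzy matching such as MinHash.

```python
import hashlib

def dedupe_works(works: list[str], min_length: int = 40) -> list[str]:
    # Drop exact duplicates (by content hash) and very short,
    # low-value texts (by an assumed length threshold).
    seen: set[str] = set()
    kept: list[str] = []
    for text in works:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen and len(text) >= min_length:
            seen.add(digest)
            kept.append(text)
    return kept

corpus = ["A full-length novel..." * 5, "A full-length novel..." * 5, "stub"]
print(len(dedupe_works(corpus)))  # 1: the duplicate and the stub are gone
```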
Read at HackerNoon