Anthropic Open-sources Tool to Trace the "Thoughts" of Large Language Models
Briefly

Anthropic researchers have released a circuit tracing library that enables in-depth exploration of large language models (LLMs) during inference. The tool replaces the model's original MLP neurons with sparse, interpretable features from transcoders and builds an attribution graph that shows how specific features influence outputs. It computes the direct effects between the graph's components, such as input tokens, features, and output logits, and has already been applied to analyze reasoning and multilingual capabilities in models like Gemma-2-2b and Llama-3.2-1b. The resulting insights can guide interventions on model features and deepen the understanding of AI behavior.
Anthropic's circuit tracing library offers insight into a large language model's internal workings, highlighting how interpretable features shape its outputs.
The circuit tracer has already been used in studies of multi-step reasoning and multilingual representations, demonstrating its practical value for understanding advanced AI systems.
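To make the idea of direct-effect attribution more concrete, the sketch below illustrates the underlying concept in plain PyTorch: a transcoder approximates an MLP layer's output as a sparse combination of features, and each active feature's contribution to a target output direction becomes an edge weight in an attribution graph. This is a minimal illustration under assumed toy dimensions and random weights, not the API of Anthropic's circuit tracing library; all names here (`transcoder_features`, `direct_effects`, `W_enc`, `W_dec`) are hypothetical.

```python
import torch

# Illustrative sketch only: a transcoder stands in for an MLP layer by
# encoding the residual-stream input into sparse feature activations and
# decoding them back into model space.
torch.manual_seed(0)
d_model, n_features = 64, 512

# Hypothetical transcoder weights (randomly initialized for the sketch).
W_enc = torch.randn(d_model, n_features) * 0.02
W_dec = torch.randn(n_features, d_model) * 0.02

def transcoder_features(x: torch.Tensor) -> torch.Tensor:
    """Sparse (ReLU) feature activations replacing the MLP neurons."""
    return torch.relu(x @ W_enc)

def direct_effects(x: torch.Tensor, target_direction: torch.Tensor) -> torch.Tensor:
    """Direct effect of each active feature on a target output direction:
    activation * (decoder vector . target direction). These per-feature
    scores correspond to edge weights in a one-layer attribution graph."""
    acts = transcoder_features(x)        # (n_features,) feature activations
    writes = W_dec @ target_direction    # (n_features,) projection onto target
    return acts * writes

x = torch.randn(d_model)          # a token's residual-stream state (toy)
logit_dir = torch.randn(d_model)  # e.g. the unembedding direction of a target token

effects = direct_effects(x, logit_dir)
top = torch.topk(effects.abs(), k=5).indices
for i in top.tolist():
    print(f"feature {i}: direct effect {effects[i]:+.4f}")
```

In the released library these scores are computed across all layers and token positions of real models such as Gemma-2-2b and Llama-3.2-1b, yielding a full attribution graph rather than the single-layer toy shown here.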
Read at InfoQ