Anthropic's work maps the features inside the neural networks of large language models (LLMs), aiming to understand and control generative AI systems like ChatGPT so that they become more useful and safer.
Once the features in an LLM that are associated with places, concepts, and similar ideas have been identified, manipulating those features can directly alter the model's responses without retraining it, which improves both interpretability and control.
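To make the idea of manipulating a feature concrete, here is a toy sketch in Python. It assumes, purely for illustration, that a feature corresponds to a direction in the model's internal activation space and that "manipulating" it means shifting an activation vector along that direction until the feature reaches a chosen strength. The vectors, the `place_feature` name, and the `steer` helper are all hypothetical; this is not Anthropic's implementation, and real models have far higher-dimensional activations.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0:
        raise ValueError("feature direction must be non-zero")
    return [x / norm for x in v]


def feature_activation(hidden, direction):
    """How strongly the feature is present: projection onto the unit direction."""
    return dot(hidden, normalize(direction))


def steer(hidden, direction, target):
    """Shift `hidden` along the feature direction so its activation equals `target`.

    No weights are changed, which is why no retraining is involved.
    """
    unit = normalize(direction)
    current = dot(hidden, unit)
    return [h + (target - current) * u for h, u in zip(hidden, unit)]


# Hypothetical 3-dimensional activation and feature direction.
hidden = [0.5, -1.0, 2.0]
place_feature = [0.0, 3.0, 4.0]

steered = steer(hidden, place_feature, target=10.0)

print(feature_activation(hidden, place_feature))
print(feature_activation(steered, place_feature))
```

In this sketch only the component of the activation along the feature direction changes; everything orthogonal to it is left untouched, which is one simple way to picture altering a single concept while leaving the rest of the model's state alone.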