MIT researchers have developed automated interpretability agents (AIAs), an AI method in which agents autonomously run experiments on neural networks and explain their behavior.
The AIAs, built from pretrained language models such as GPT-4, actively engage in hypothesis formation, experimental testing, and iterative learning to explain components of other, intricate neural networks.
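A minimal sketch of what such a hypothesis-test-refine loop might look like is below. The black-box function, the candidate hypotheses, and the probing schedule are all illustrative placeholders, not the researchers' implementation (which uses a language model to propose experiments and descriptions).

```python
def black_box(x: float) -> float:
    """Stand-in for a network component the agent is trying to explain."""
    return max(0.0, 2.0 * x)  # unknown to the agent: a scaled ReLU

# Candidate hypotheses the agent might entertain (illustrative only).
hypotheses = {
    "identity":    lambda x: x,
    "double":      lambda x: 2.0 * x,
    "relu":        lambda x: max(0.0, x),
    "scaled relu": lambda x: max(0.0, 2.0 * x),
}

observations: list[tuple[float, float]] = []

# Iteratively probe the black box and discard hypotheses the data contradicts.
for probe in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    observations.append((probe, black_box(probe)))
    hypotheses = {
        name: fn
        for name, fn in hypotheses.items()
        if all(abs(fn(x) - y) < 1e-9 for x, y in observations)
    }
    print(f"after probing x={probe:+.1f}: surviving hypotheses = {list(hypotheses)}")

# Whatever survives every experiment is the agent's current explanation.
print("final explanation:", list(hypotheses))
```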
The researchers also introduced FIND (Function Interpretation and Description), a benchmark for assessing the accuracy and quality of explanations of functions that resemble real-world network components, while acknowledging that some functions remain difficult to describe accurately.
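One way such a benchmark could score an explanation is to run the ground-truth function and a reconstruction implied by the explanation on held-out inputs and measure their agreement. The sketch below uses hypothetical function names and a simplified scoring rule; it is not FIND's actual evaluation protocol.

```python
import math

def ground_truth(x: float) -> float:
    """One benchmark function with known, documented behavior."""
    return math.sin(x) + 0.5 * x

def candidate_from_description(x: float) -> float:
    """Code implied by the agent's textual explanation (hypothetical)."""
    return math.sin(x) + 0.5 * x  # a perfect description would match exactly

# Score agreement on held-out inputs; 1.0 means exact agreement.
held_out = [i / 10.0 for i in range(-30, 31)]
errors = [abs(ground_truth(x) - candidate_from_description(x)) for x in held_out]
score = 1.0 / (1.0 + sum(errors) / len(errors))

print(f"agreement score on {len(held_out)} held-out inputs: {score:.3f}")
```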