Hello everyone, I am zenRRan.
OpenAI released a paper yesterday, "Language models can explain neurons in language models", which is a big step forward for interpretability in deep learning! Who would have thought of using GPT-4 to explain a model's own internals: fighting magic with magic. Impressive.
![ebfdb9bcd6ffff3120c15d76ee0b63a1.png](https://img-blog.csdnimg.cn/img_convert/ebfdb9bcd6ffff3120c15d76ee0b63a1.png)
Link: https://openai.com/research/language-models-can-explain-neurons-in-language-models
From: Deep Learning Natural Language Processing Official Account
Author: zenRRan
Overview
OpenAI used GPT-4 to automatically write and score explanations of neuron behavior in large language models, and published a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.
Introduction
Language models have become more powerful and more widely deployed, but our understanding of how they work internally is still very limited. For example, it may be difficult to tell from their outputs alone whether they are using biased heuristics or making things up. Interpretability research aims to uncover more by looking inside the model.
A simple approach to interpretability research is to first understand what the individual components (neurons and attention heads) are doing. Traditionally, this has required humans to manually inspect neurons to determine which features of the data they represent. That process doesn't scale: it is hard to apply to neural networks with tens or hundreds of billions of parameters. OpenAI proposes an automated process that uses GPT-4 to generate and score natural language explanations of a neuron's behavior, and applies it to neurons in another language model.
This work is part of the third pillar of OpenAI's alignment research approach: automating alignment research itself. A promising aspect of this direction is that it can scale with the pace of AI development: as future models become smarter and more helpful as assistants, we will find better explanations.
How does it work?
Their method runs three steps on each neuron.
Step 1: Generate explanations using GPT-4
Given a GPT-2 neuron, GPT-4 generates an explanation of its behavior after being shown relevant text sequences and the neuron's activations on them.
OpenAI gives 12 examples in total; here I will just pick out a few representative ones.
![108eb0e434e9e26b58ac7825ebea92ec.png](https://img-blog.csdnimg.cn/img_convert/108eb0e434e9e26b58ac7825ebea92ec.png)
Explanation generated by the model: references to movies, characters, and entertainment.
![2013034e7eec2c5a49711755e8519848.png](https://img-blog.csdnimg.cn/img_convert/2013034e7eec2c5a49711755e8519848.png)
Explanation generated by the model: comparisons and analogies, often using the word "like".
![6bdfb10d6542eb08a4fa0c460f340f5f.png](https://img-blog.csdnimg.cn/img_convert/6bdfb10d6542eb08a4fa0c460f340f5f.png)
Explanation generated by the model: surnames, which typically follow first names.
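Mechanically, this step boils down to serializing token/activation pairs into a prompt for the explainer model. A minimal sketch of such a serializer is below; the format and the name `format_neuron_prompt` are hypothetical, not OpenAI's actual prompt templates (those live in their automated-interpretability repository).

```python
# Hypothetical sketch: turn (token, activation) excerpts into an
# explainer prompt. Activations are assumed pre-scaled to 0-10.

def format_neuron_prompt(sequences):
    """sequences: list of excerpts, each a list of (token, activation) pairs."""
    lines = [
        "We're studying a neuron in a language model.",
        "For each token, its activation (0-10) is shown after a tab.",
        "",
    ]
    for i, seq in enumerate(sequences, 1):
        lines.append(f"Excerpt {i}:")
        for token, act in seq:
            lines.append(f"{token}\t{act}")
        lines.append("")
    lines.append("Explain what this neuron is looking for:")
    return "\n".join(lines)

prompt = format_neuron_prompt([[("the", 0), ("movie", 9), ("Titanic", 10)]])
print(prompt)
```

The resulting string would be sent to GPT-4, whose completion is taken as the neuron's explanation.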
Step 2: Simulation with GPT-4
GPT-4 is used again, this time to simulate the neuron: given only the explanation, it predicts how strongly the neuron would activate on each token of a text sequence.
![05be92d6f8f1f2f10614ac06c3f0d1b0.png](https://img-blog.csdnimg.cn/img_convert/05be92d6f8f1f2f10614ac06c3f0d1b0.png)
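A rough sketch of this simulation step, assuming the simulator replies with one `token: activation` line per token. `query_model` stands in for a real GPT-4 API call and is stubbed here so the sketch runs; none of these names come from OpenAI's code.

```python
# Hypothetical sketch of the simulation step: ask a model to predict
# per-token activations (0-10) given only the explanation, then parse
# its reply back into a list of integers aligned with the tokens.

def simulate_activations(explanation, tokens, query_model):
    prompt = (
        f"Neuron explanation: {explanation}\n"
        "For each token below, predict the neuron's activation (0-10), "
        "one 'token: activation' line per token.\n" + "\n".join(tokens)
    )
    reply = query_model(prompt)  # e.g. "the: 0\nmovie: 8\nTitanic: 9"
    preds = {}
    for line in reply.splitlines():
        token, value = line.rsplit(":", 1)
        preds[token.strip()] = int(value)
    # Default to 0 for any token the model failed to score.
    return [preds.get(t, 0) for t in tokens]

# Stub standing in for a real GPT-4 call:
stub = lambda prompt: "the: 0\nmovie: 8\nTitanic: 9"
sim = simulate_activations("movie references", ["the", "movie", "Titanic"], stub)
print(sim)  # → [0, 8, 9]
```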
Step 3: Compare
Score each explanation by how closely the simulated activations match the neuron's real activations.
![9cc955e6795ddfe7cf46bcbddbd502f5.png](https://img-blog.csdnimg.cn/img_convert/9cc955e6795ddfe7cf46bcbddbd502f5.png)
![ccd47d0e4a8d31d68cf8af1d124ab9d2.png](https://img-blog.csdnimg.cn/img_convert/ccd47d0e4a8d31d68cf8af1d124ab9d2.png)
The final comparison score here is 0.34.
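The comparison can be sketched as a correlation between the real and simulated activation traces (OpenAI's scoring is correlation-based; see the paper for the exact details). A self-contained Pearson correlation, as an illustration:

```python
# Sketch of a correlation-based explanation score: Pearson correlation
# between the neuron's real activations and the simulated ones.
import math

def explanation_score(real, simulated):
    n = len(real)
    mean_r = sum(real) / n
    mean_s = sum(simulated) / n
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    norm_r = math.sqrt(sum((r - mean_r) ** 2 for r in real))
    norm_s = math.sqrt(sum((s - mean_s) ** 2 for s in simulated))
    if norm_r == 0 or norm_s == 0:
        return 0.0  # a constant trace carries no signal
    return cov / (norm_r * norm_s)

real = [0, 2, 10, 1, 0, 7]   # made-up real activations
sim = [1, 1, 8, 0, 2, 3]     # made-up simulated activations
print(round(explanation_score(real, sim), 2))  # → 0.88
```

A score of 1.0 would mean the simulation tracks the real neuron perfectly; a score near 0 means the explanation has little predictive power.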
What did they find?
Using our own scoring methodology, we can start measuring how well the technique works on different parts of the network and try to improve it on the parts it currently explains poorly. For example, the technique works poorly for larger models, possibly because later layers are harder to interpret.
![2439f77f153973b94f825585f46d12ad.png](https://img-blog.csdnimg.cn/img_convert/2439f77f153973b94f825585f46d12ad.png)
Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to generate explanations. For example, we found that we could improve our scores by:
Iterating on explanations. We can improve the score by asking GPT-4 to come up with possible counterexamples and then revising the explanation in light of their activations.
Using larger models to give explanations. The average score rises as the explainer model's capability increases. However, even GPT-4's explanations are worse than a human's, so there is room for improvement.
Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
We are open-sourcing our dataset and visualization tools for GPT-4's written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring that uses publicly available models [1] on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations, and better tools for using explanations to explore GPT-2.
We found more than 1,000 neurons with an explanation score of at least 0.8, meaning that, according to GPT-4, the explanation accounts for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 does not understand. Hopefully, as explanations improve, we will be able to quickly uncover interesting qualitative insights into the model's computations.
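Given the released per-neuron scores, picking out well-explained neurons is a simple filter. The record format below is hypothetical; the real dataset in the automated-interpretability repository has its own schema.

```python
# Hypothetical per-neuron records: (layer, neuron index, explanation, score).
neurons = [
    {"layer": 0, "index": 412, "score": 0.83, "explanation": "the word 'not'"},
    {"layer": 5, "index": 17, "score": 0.34, "explanation": "movie references"},
    {"layer": 9, "index": 208, "score": 0.81, "explanation": "surnames"},
]

# Keep only neurons whose explanation scores at least 0.8.
well_explained = [n for n in neurons if n["score"] >= 0.8]
print(len(well_explained))  # → 2
```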
Example neurons from different layers; neurons in higher layers are more abstract:
![73982da7138167d245670173fb2dc5b9.png](https://img-blog.csdnimg.cn/img_convert/73982da7138167d245670173fb2dc5b9.png)
Outlook
Our approach currently has many limitations [2], which we hope to address in future work.
We focused on short natural language explanations, but neurons can have very complex behavior that cannot be described succinctly. For example, a neuron can be highly polysemantic (representing many distinct concepts), or it can represent a single concept that humans do not understand or have words for.
We hope to eventually find and explain, automatically, the entire neural circuits that implement complex behaviors, with neurons and attention heads working together. Our current method explains neuron behavior only as a function of the raw text input, without accounting for downstream effects. For example, a neuron that fires on periods could be indicating that the next word should start with a capital letter, or incrementing a sentence counter.
We explain the behavior of neurons without attempting to explain the mechanisms that produce this behavior . This means that even high-scoring explanations may perform poorly on out-of-distribution text because they only describe correlations.
Our entire process is computationally intensive .
We are excited about the extension and generalization of our method . Ultimately, we hope to use models to form, test, and iterate fully general hypotheses, much like interpretability researchers do.
Ultimately, OpenAI hopes to use interpretability on the largest models to detect alignment and safety problems before and after deployment. However, there is still a long way to go before these techniques can surface behaviors like dishonesty.
References
[1] automated-interpretability: https://github.com/openai/automated-interpretability
[2] Limitations: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html#sec-limitations