OpenAI's latest breakthrough: language models can explain neurons in language models


Hello everyone, I am zenRRan.

OpenAI released a paper yesterday, "Language models can explain neurons in language models", which is a real step forward for the interpretability of deep learning. Who would have thought: using GPT-4 to explain model internals, fighting magic with magic.


Link: https://openai.com/research/language-models-can-explain-neurons-in-language-models

From: Deep Learning Natural Language Processing Official Account

Author: zenRRan

Overview

OpenAI used GPT-4 to automatically write and score explanations of neuron behavior in large language models, and published a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

Introduction

Language models have become more capable and more widely deployed, but our understanding of how they work internally is still very limited. For example, it can be hard to tell from their outputs whether they are using biased heuristics or making things up. Interpretability research aims to uncover more by looking inside the model.

A simple approach to interpretability research is to first understand what the individual components (neurons and attention heads) are doing. Traditionally, this has required humans to manually inspect neurons to figure out which features of the data they represent. This process scales poorly: it is hard to apply to neural networks with hundreds of billions of parameters. OpenAI proposes an automated process that uses GPT-4 to generate and score natural-language explanations of a neuron's behavior, and applies it to the neurons of another language model (here, GPT-2).

This work is part of the third pillar of the alignment research approach: the desire to automate the alignment research work itself. A promising aspect of this approach is that it can scale with the pace of AI development. As future models become smarter and more useful as assistants, we'll find better explanations.

How does it work?

Their method consists of running three steps on every neuron:

Step 1: Generate explanations using GPT-4

Given a GPT-2 neuron, an explanation of its behavior is generated by showing GPT-4 relevant text sequences and activations.

OpenAI cited a total of 12 examples, and here I will just pick out a few representative ones.

[Figure: example neuron, "Marvel Comics vibe"]

Explanation generated by the model: references to movies, characters, and entertainment.

[Figure: example neuron, "similes"]

Explanation generated by the model: comparisons and analogies, often using the word "like".

[Figure: example neuron, "shared last names"]

Explanation generated by the model: Surnames, which generally follow first names.
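
To make Step 1 concrete, here is a minimal sketch in Python of how such an explainer prompt could be assembled from (token, activation) pairs. This is not OpenAI's actual code: the function names, the prompt wording, and the 0-10 discretization of activations are assumptions for illustration.

```python
# Minimal sketch (not OpenAI's implementation): build a prompt that asks an
# "explainer" model to summarize what a neuron responds to, given text
# excerpts annotated with the neuron's activation on every token.

from typing import List, Tuple

Excerpt = List[Tuple[str, float]]  # (token, raw activation) pairs for one snippet


def discretize(acts: List[float], max_act: float) -> List[int]:
    """Rescale raw activations to integers in 0..10 (assumed discretization)."""
    if max_act <= 0:
        return [0] * len(acts)
    return [max(0, min(10, round(10 * a / max_act))) for a in acts]


def build_explanation_prompt(neuron_id: str, excerpts: List[Excerpt], max_act: float) -> str:
    lines = [
        f"We are studying neuron {neuron_id} in a language model.",
        "Each token below is followed by the neuron's activation (0-10).",
        "Summarize in one short phrase what this neuron is looking for.",
        "",
    ]
    for excerpt in excerpts:
        for token, act in zip(*[list(x) for x in zip(*excerpt)][:1] + [discretize([a for _, a in excerpt], max_act)]):
            lines.append(f"{token}\t{act}")
        lines.append("")
    lines.append("Explanation of neuron behavior:")
    return "\n".join(lines)


# Hypothetical usage with a tiny excerpt:
print(build_explanation_prompt(
    "layer12/neuron4013",
    [[("The", 0.0), (" Avengers", 8.7), (" assembled", 6.1), (".", 0.0)]],
    max_act=9.0,
))
```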

Step 2: Simulation with GPT-4

GPT-4 is used again, this time to simulate the neuron: given only the explanation, it predicts how a neuron matching that explanation would activate on the text.

[Figure: GPT-4-simulated activations for the "Marvel Comics vibe" neuron]
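
Step 2 can be sketched the same way. The simulator sees only the explanation and the tokens (never the real activations) and must guess a 0-10 activation per token. The interface below is assumed: `call_model` stands in for whatever chat-completion API is used, and the prompt wording and tab-separated output format are illustrative.

```python
# Minimal sketch (assumed interface): ask a "simulator" model to predict a
# neuron's activation (0-10) for each token, given only the explanation.

from typing import Callable, List


def build_simulation_prompt(explanation: str, tokens: List[str]) -> str:
    header = (
        f'A neuron in a language model is described as: "{explanation}".\n'
        "For each token below, predict the neuron's activation as an integer 0-10.\n"
        "Answer with one 'token<TAB>activation' line per token.\n\n"
    )
    return header + "\n".join(tokens)


def simulate_activations(explanation: str, tokens: List[str],
                         call_model: Callable[[str], str]) -> List[int]:
    reply = call_model(build_simulation_prompt(explanation, tokens))
    predictions = []
    for line in reply.strip().splitlines():
        # Expect "token<TAB>activation"; fall back to 0 on malformed lines.
        parts = line.rsplit("\t", 1)
        try:
            predictions.append(int(parts[1]))
        except (IndexError, ValueError):
            predictions.append(0)
    return predictions
```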

Step 3: Compare

Each explanation is scored based on how closely the simulated activations match the neuron's real activations.

[Figure: real vs. simulated activations for the "Marvel Comics vibe" neuron]

The final comparison score for this example is 0.34.
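
Roughly speaking, the score measures how well the simulated activations track the real ones. The sketch below uses a plain Pearson correlation as a stand-in for OpenAI's exact scoring rule, so a value like 0.34 can be read as a modest but non-trivial match.

```python
# Minimal sketch: score an explanation by correlating simulated activations
# with the neuron's real activations (a stand-in for OpenAI's scoring rule).

import math
from typing import Sequence


def explanation_score(real: Sequence[float], simulated: Sequence[float]) -> float:
    assert len(real) == len(simulated) and len(real) > 1
    mean_r = sum(real) / len(real)
    mean_s = sum(simulated) / len(simulated)
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in real))
    std_s = math.sqrt(sum((s - mean_s) ** 2 for s in simulated))
    if std_r == 0 or std_s == 0:
        return 0.0
    return cov / (std_r * std_s)


# Toy numbers, purely illustrative:
real_acts = [0, 0, 8, 1, 0, 9, 0, 2]
sim_acts = [1, 0, 5, 0, 2, 4, 0, 0]
print(round(explanation_score(real_acts, sim_acts), 2))
```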

What did they find?

Using our scoring methodology, we can start to measure how well our technique works on different parts of the network, and try to improve it for the parts that are currently poorly explained. For example, our technique works poorly for larger models, possibly because later layers are harder to interpret.

[Figure: explanation scores vs. the number of parameters in the model being explained]

Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to generate explanations. For example, we found that we could improve our scores by:

  • Iterating on explanations. We can improve the score by asking GPT-4 to come up with possible counterexamples and then revising the explanation based on their actual activations (a rough sketch of this loop follows the list).

  • Using larger models to give explanations. The average score rises as the explainer model's capability increases. However, even GPT-4's explanations are worse than a human's, suggesting there is room for improvement.

  • Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
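
As a rough illustration of the first idea above, the loop below sketches how an explanation could be revised against counterexamples. Everything here is hypothetical: the prompts, the `call_model` and `real_activation` callables, and the 0.5 activation threshold are placeholders, not OpenAI's implementation.

```python
# Hypothetical sketch of "iterating on explanations": propose counterexamples,
# check which of them really activate the neuron, then revise the explanation.

from typing import Callable


def refine_explanation(explanation: str,
                       call_model: Callable[[str], str],
                       real_activation: Callable[[str], float],
                       rounds: int = 2) -> str:
    for _ in range(rounds):
        # 1. Ask for short texts that might activate the neuron even though
        #    they are not obviously covered by the current explanation.
        candidates = call_model(
            f'The current explanation of a neuron is: "{explanation}".\n'
            "List 5 short texts that might activate the neuron anyway, one per line."
        ).splitlines()
        # 2. Keep only the candidates that actually activate the neuron.
        misses = [t for t in candidates if t.strip() and real_activation(t) > 0.5]
        if not misses:
            break
        # 3. Ask for a revised explanation that also covers those texts.
        explanation = call_model(
            f'Old explanation: "{explanation}".\n'
            "These texts also activate the neuron:\n" + "\n".join(misses) +
            "\nGive an improved one-sentence explanation."
        ).strip()
    return explanation
```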

We are open-sourcing our dataset and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as the code for explanation and scoring [1], which uses publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations, and better tools for exploring GPT-2 with these explanations.

We found more than 1,000 neurons with an explanation score of at least 0.8, meaning that, according to GPT-4, the explanation accounts for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting, but we also found many interesting neurons that GPT-4 did not understand. We hope that as explanations improve, we will be able to quickly uncover interesting qualitative insights into the model's computations.
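
For readers who want to dig into the released data, something like the sketch below could pull out the well-explained neurons. Note that the file name and the `layer`/`neuron`/`explanation`/`score` fields describe an assumed local export, not the repository's actual schema; see the automated-interpretability repo [1] for the real loaders.

```python
# Hypothetical sketch: filter a locally downloaded dump of neuron explanations
# to those scoring at least 0.8. Field names are assumptions, not the real schema.

import json
from pathlib import Path


def well_explained_neurons(path: str, threshold: float = 0.8):
    records = json.loads(Path(path).read_text())
    hits = [r for r in records if r.get("score", 0.0) >= threshold]
    hits.sort(key=lambda r: r["score"], reverse=True)
    return hits


if __name__ == "__main__":
    for r in well_explained_neurons("gpt2_neuron_explanations.json")[:20]:
        print(f'layer {r["layer"]:>2}  neuron {r["neuron"]:>5}  '
              f'score {r["score"]:.2f}  {r["explanation"]}')
```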

Looking at neurons across layers, neurons in higher layers are more abstract:

[Figure: neurons across layers, using "Kat" as an example]

Outlook

Our approach currently has many limitations [2], which we hope to address in future work.

  • We focus on short natural-language explanations, but neurons can have very complex behavior that is impossible to describe succinctly. For example, a neuron can be highly polysemantic (representing many distinct concepts), or can represent a single concept that humans do not understand or have words for.

  • We hope to eventually find and explain automatically the entire neural circuits that implement complex behaviors, with neurons and attention heads working together. Our current approach only explains a neuron's behavior as a function of the original text input, saying nothing about its downstream effects. For example, a neuron that fires on periods could be indicating that the next word should start with a capital letter, or could be incrementing a sentence counter.

  • We explain the behavior of neurons without attempting to explain the mechanisms that produce that behavior. This means that even high-scoring explanations may do poorly on out-of-distribution text, because they only describe correlations.

  • Our entire process is computationally intensive.

  • We are excited about eventually extending and generalizing our approach. Ultimately, we would like to use models to form, test, and iterate on fully general hypotheses, just as an interpretability researcher would.

Ultimately, OpenAI hopes to interpret its largest models as a way to detect alignment and safety problems before and after deployment. However, there is still a long way to go before these techniques can surface behaviors such as dishonesty.



References

[1] automated-interpretability: https://github.com/openai/automated-interpretability

[2] Limitations: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html#sec-limitations

Origin: blog.csdn.net/qq_27590277/article/details/130612951