Ask special questions to ChatGPT to extract the original training data!

As the parameters of models such as ChatGPT become larger and larger, the pre-training data also grows exponentially. Researchers from Google DeepMind, the University of Washington, and Cornell University have found that both open source and closed source models can remember a certain number of original training data samples during the training process.

If specific malicious attacks are used, massive amounts of training data can be easily extracted from the model, while threatening the privacy of the data owner.

The attack method used by the researchers is also very simple, which is to make ChatGPT (GPT-3.5) repeat a certain word infinitely, for example, repeat the word "company" infinitely.

Initially ChatGPT will keep repeating this word. When it reaches a certain number, a company's address, history, business scope and other original data will magically appear.

The data is not textual content of neuronal reorganization, and the researchers have shared the success story.

Paper address: https://arxiv.org/abs/2311.17035

Successful attack case display address: https://chat.openai.com/share/456d092b-fb4e- 4979-bea1-76d8d904031f

picture

ChatGPT answered normally at first

picture


After a certain amount, start spitting out the original training data

Attack methods and principles

The researchers used a concept of "retrievable memory" attack techniques, which is distinguished from the "discoverable memory" of the training data.

"Discoverable memory" means that the attacker knows the training data set and can directly extract data from it; while "extractable memory" means that the attacker has no way of knowing the training data and needs to obtain the data through the model itself.

To put it simply, the attacker does not have direct access to the data training set and can only infer what information may be stored in the archive by interpreting and analyzing the "behavior" or "reaction" of the AI ​​model. It’s like a treasure chest thief. He doesn’t have the key and can only judge what treasures are inside from the shape of the treasure chest.

picture

The researchers used a variety of attack methods such as random prompts, tail-recursive index detection, and repeated divergence, and finally discovered the data security vulnerability through repeated divergence.

1) Random prompt attack

The researchers sampled 5 phrases from open source texts such as Wikipedia as prompts, input them into the language model, and asked it to continue generating text based on the prompts.

With this random hint, some of the text generated by the model may be content in the training data set.

2) Tail recursive index detection

In order to efficiently detect whether the generated text originated from the training data set, the researchers constructed a "tail recursive index".

This data structure stores all training data set text sorted by string suffix and supports fast substring query operations. This index can be used to detect whether the prompt generated training data.

3) Repetition causes divergence

The researchers found that repeatedly prompting a language model with a single word can induce the generation of long text that is exactly consistent with the training data. This is because it is difficult for the model to continuously repeat a word and thus "diverge" to other texts.

picture

In order to evaluate the attack effect, the researchers constructed a 9TB auxiliary data set AUXDATASET, which contains a public large-scale language model pre-training data set. Based on this data set, they were able to mechanically verify whether the generated samples appeared in the training data.

Experimental data shows that, even without using real training data as cues, existing extraction attacks can recover large amounts of training data in memory, far exceeding previous estimates.

For example,researchers extracted nearly 1GB of training data from the 6B parameter GPT-Neo model. This proves that the number of retrievable memories is much greater than generally believed.

picture

Then continue the attack on 9 different commercial AI models. The results are equally amazing, and many models can extract GB-level training text. For example, 29,000 memory texts of length 50 are extracted from the LLaMA model.

Ask specific questions about ChatGPT

The researchers also specifically analyzed ChatGPT because it uses data security alignment technology to simulate real-person conversations, so the model is less likely to leak training data.

However, after in-depth analysis, the researchers still found a hint strategy that can make ChatGPT lose control and start leaking data like a normal language model. The method is to let the model answer a word infinitely.

Through this attack method, the researcher extracted 10,000 training examples from ChatGPT for only $200! If it costs more For a lot of money, it is possible to extract about 1G of training data from ChatGPT.

picture

Researchers believe that ChatGPT's high-capacity storage and large amounts of repeated training data will increase its memory of training data, and even if strict security alignment technology is used, data leakage problems may occur.

Therefore, if too much sensitive data is used in pre-training, it is likely to be exploited by others.

As of now, ChatGPT has fixed this vulnerability. When you ask for an infinite repetition request to repeat a certain word, it will prompt "According to OpenAI's usage policy, I cannot participate in the behavior of repeating meaningless content."

The material of this article comes from Google papers. If there is any infringement, please contact us to delete it.

Guess you like

Origin blog.csdn.net/richerg85/article/details/134940746