Attacks in NLP
Related topics
Introduction
Earlier attack research focused on images and speech; there has been much less work on NLP. The difficulty of attacking NLP models comes from the discrete vocabulary:
because the tokens themselves are discrete, noise can only be added to the continuous embedding features.
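As a minimal sketch of this constraint, assuming a PyTorch setup where `embedding` is the model's `nn.Embedding` layer and `token_ids` are the input token indices (both names are illustrative, not from the notes):

```python
import torch

def perturb_embeddings(embedding, token_ids, epsilon=0.01):
    """Tokens are discrete, so the perturbation is applied to the continuous
    embedding vectors instead of to the raw text (illustrative sketch)."""
    embeds = embedding(token_ids)                # (seq_len, dim) continuous features
    noise = epsilon * torch.randn_like(embeds)   # small random perturbation
    return embeds + noise                        # passed on to the rest of the model
```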
Evasion Attacks
In sentiment classification of movie reviews, changing a single word (e.g., "films" to "film") flipped the prediction from negative to positive.
In structural analysis, if one word is changed, the result will be completely different.
These models are very fragile; the question is whether there are ways to make them more robust.
Imitation Attacks
Common ways of crafting word substitutions:
- Synonym replacement.
- Replace a word with one whose vector is nearby in the embedding space.
- Use k-NN over the embeddings to narrow down the candidate words.
- Use a large pre-trained language model to predict substitute words.
- Use the gradient with respect to the embeddings to pick substitutions: sort candidate words by how much they change the loss, then take the top-k words that increase the loss the most (a sketch follows this list).
- Character-level substitution, swapping, deletion, and insertion.
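A minimal sketch of the gradient-based substitution idea above (HotFlip-style first-order scoring). The `model`, `embedding`, and tensor shapes here are assumptions for illustration, not the exact method from the notes:

```python
import torch
import torch.nn.functional as F

def rank_substitutions(model, embedding, token_ids, label, position, k=10):
    """Rank vocabulary words by the estimated increase in loss they would cause
    if substituted at `position` (first-order approximation via the gradient)."""
    embeds = embedding(token_ids).detach().requires_grad_(True)  # (seq_len, dim)
    logits = model(embeds.unsqueeze(0))                          # (1, num_classes)
    loss = F.cross_entropy(logits, torch.tensor([label]))
    loss.backward()

    grad = embeds.grad[position]             # gradient at the attacked position
    old_vec = embeds[position].detach()      # current word vector
    # Estimated loss change for every candidate word: (e_new - e_old) . grad
    scores = (embedding.weight.detach() - old_vec) @ grad
    return scores.topk(k).indices            # top-k replacement token ids
```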
Motivation
Example of Attack
Adding a small, carefully chosen perturbation to the input can make the classifier misclassify it.
By designing the loss function appropriately, the attack can be either untargeted (any wrong answer) or targeted (a specific chosen answer).
The perturbation is constrained by a distance metric. Under the L2 norm, one large change to a single pixel can count the same as small changes to every pixel, even though the former is far more visible; the L-infinity norm, which measures only the largest single change, therefore matches human perception better.
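Written out in the standard notation (the symbols here are assumed rather than taken from the notes): the adversarial input x* minimizes an attack loss L while staying within a distance budget ε of the original input x⁰, where e(·,·) denotes cross-entropy, ŷ the true label, and y^target the label the attacker wants:

```latex
x^{\ast} = \arg\min_{d(x^{0},x)\le\varepsilon} L(x),
\qquad
L(x) =
\begin{cases}
-\,e(y,\hat{y}) & \text{untargeted}\\[2pt]
-\,e(y,\hat{y}) + e(y,y^{\mathrm{target}}) & \text{targeted}
\end{cases}

d_{2}(x^{0},x) = \lVert x - x^{0}\rVert_{2},
\qquad
d_{\infty}(x^{0},x) = \max_{i}\,\lvert x_{i} - x^{0}_{i}\rvert
```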
Black-box Attacks
How do you attack when you do not have access to the model or its training data? This is a black-box attack: query the target to collect input-output pairs, train a proxy network on them, and craft the adversarial example on the proxy; it often transfers to the target.
Ensemble attacks craft the example against several networks at once; in the transfer-result table, the diagonal entries correspond to attacking a network with itself (the white-box case).
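A rough sketch of the proxy-model (transfer) idea, assuming PyTorch; `proxy_model` and `target_model` are placeholder names, and a single FGSM step is used here only as one simple variant, not the specific method in the notes:

```python
import torch
import torch.nn.functional as F

def fgsm_on_proxy(proxy_model, x, y, epsilon=0.03):
    """Craft an adversarial example on a white-box proxy model; the hope is
    that it transfers to the black-box target model."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(proxy_model(x_adv), y)
    loss.backward()
    # Step in the direction that increases the proxy's loss, within an L-inf ball.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()

# Usage idea: train proxy_model on (x, target_model(x)) pairs collected by
# querying the target, then check whether target_model(fgsm_on_proxy(...)) flips.
```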
In the visualization, the dark blue area is the region in which the input is still recognized correctly; to attack, one only needs to push the input out of that area.
One pixel attack
Changing a single pixel value can cause the classifier to fail.
Universal adversarial attack
A single universal noise pattern can be found that, when added to many different images, causes the classifier to misclassify them.
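A simplified sketch of learning such a universal perturbation, assuming PyTorch and a `data_loader` of (image, label) batches; this is an illustrative variant with an L-infinity budget, not the original algorithm:

```python
import torch
import torch.nn.functional as F

def universal_perturbation(model, data_loader, epsilon=0.05, lr=0.01, epochs=1):
    """Learn a single noise pattern `delta` that hurts accuracy on many images."""
    first_batch, _ = next(iter(data_loader))
    delta = torch.zeros_like(first_batch[0], requires_grad=True)  # one shared pattern
    optimizer = torch.optim.SGD([delta], lr=lr)

    for _ in range(epochs):
        for x, y in data_loader:
            loss = -F.cross_entropy(model(x + delta), y)  # maximize classification error
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            with torch.no_grad():
                delta.clamp_(-epsilon, epsilon)           # stay within the L-inf budget
    return delta.detach()
```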
Beyond images, other domains such as speech and NLP can also be attacked.
For example, after appending the trigger text (the red text in the example) to the end of the input, the question-answering system gives the same answer regardless of the question.
Attacks in the Physical World
Wearing specially crafted glasses makes the face-recognition system identify the man as the woman shown on the right.
There are also attacks on license-plate recognition systems.
By slightly lengthening the middle horizontal stroke of the "3" on a speed-limit sign, a Tesla read "35" as "85" and accelerated.
The number of white squares added to the image corresponds to different output categories (adversarial reprogramming: the classifier is repurposed to count squares).
Open a backdoor in the model:
the attack happens during the training phase. The poisoned training data looks normal to the human eye, yet the trained model misclassifies only one specific triggered image and behaves normally on all other inputs.
Be careful with public image training sets, which may already contain such attack images.
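For intuition, here is a deliberately simplified data-poisoning sketch (a visible-trigger variant; the attack described above uses poisoned samples that look normal, which is harder). The names and the trigger pattern are illustrative assumptions:

```python
import torch

def poison_dataset(images, labels, target_class, poison_rate=0.01):
    """Stamp a small white trigger patch onto a few training images and relabel
    them, so the trained model misbehaves only when the trigger is present."""
    images, labels = images.clone(), labels.clone()
    n_poison = int(len(images) * poison_rate)
    idx = torch.randperm(len(images))[:n_poison]
    images[idx, :, -3:, -3:] = 1.0   # 3x3 white trigger in the bottom-right corner
    labels[idx] = target_class       # relabel the poisoned samples
    return images, labels
```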
Defense
Passive defense
Once the model is trained, leave it untouched and put a "shield" (a preprocessing step) in front of it.
For example, lightly blurring the input barely affects a clean image but largely destroys an adversarial perturbation; the side effect is that confidence on clean images drops slightly (a minimal sketch of this appears after the list below). Other options:
- Image Compression
- Image generation: use a generative model to re-synthesize the input image and feed the reconstruction to the classifier, filtering out the adversarial perturbation
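A minimal sketch of the blurring "shield" mentioned above, assuming PyTorch image tensors of shape (N, C, H, W); the function name and kernel are illustrative:

```python
import torch
import torch.nn.functional as F

def blur_shield(x, kernel_size=3):
    """Passive defense: lightly blur the input before classification. A clean
    image is barely affected, while crafted adversarial noise is often destroyed."""
    channels = x.shape[1]
    weight = torch.ones(channels, 1, kernel_size, kernel_size) / kernel_size ** 2
    return F.conv2d(x, weight, padding=kernel_size // 2, groups=channels)

# Usage idea: logits = model(blur_shield(x)) instead of model(x).
```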
If your passive defenses become known, attackers can adapt and break through them: the blurring step, for example, can simply be treated as the first layer of the network and attacked end-to-end.
When building the defense, therefore, add randomness: combine several different defenses and choose among them randomly so the attacker cannot know which one is in use.
Active defense
Train a robust model that is not easily broken.
Adversarial training: create a new batch of training data by attacking each sample while keeping the correct labels, then train on the clean and attacked data together.
Whenever new attack examples are found, they are added to the training data for further training.
However, this does not generalize well to unseen attacks and the model can still be broken; it also requires repeated retraining and considerable computing resources.
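A rough sketch of one adversarial-training step as described above (attack the batch, keep the correct labels, train on clean and attacked data together). The single FGSM attack step and all names are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    """One step of adversarial training: craft attacked copies of the batch,
    keep the original (correct) labels, and train on both batches together."""
    # 1) Attack each sample with the current model (single FGSM step here).
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

    # 2) Train on the clean and attacked batches with the same labels.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(torch.cat([x, x_adv])), torch.cat([y, y]))
    loss.backward()
    optimizer.step()
    return loss.item()
```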
A method has since been proposed ("adversarial training for free") that achieves adversarial training without requiring extra computing resources.
Summary
Both attack and defense methods are evolving.