1
Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks (IEEE S&P 2019). Presenter: Jason Xue, Macquarie University, 11 April 2019
2
Road Map: Motivation, Attack Model, Detecting Backdoors, Experiments and Performance, Mitigation
I will introduce this paper in these five sections.
3
Motivation Example: A face recognition system with a backdoor always identifies a face as Bill Gates if a specific symbol is present in the input. Backdoors can stay hidden indefinitely until activated by an input. This is a serious security risk in: biometric authentication, self-driving cars, face recognition for house security systems, …
4
Attack Model Defining Backdoors
A backdoor is a hidden pattern trained into a DNN that produces unexpected behavior if and only if a specific trigger is added to an input. There are two ways to train this type of DNN model: insert an incorrect label association during training (e.g. BadNets), or modify an already-trained model through adversarial poisoning (e.g. the Trojan attack). There are two main differences between backdoor attacks and adversarial attacks. First, an adversarial example crafted for a specific input is ineffective when applied to other images, whereas a backdoor trigger can be added to any input and still works. Second, the backdoor must be injected into the model, while an adversarial perturbation requires no change to the model.
5
General Backdoor Attacks
On the left, we have a trigger pattern (a white square) and a target label (the digit 4). In the middle, we add the trigger pattern to samples from other labels (here 5 and 7), relabel them as the target label, and use this modified training set to train the DNN model. On the right, at inference time, an input with the trigger is classified as the target label, while an input without the trigger is classified correctly. The attack model of this paper is consistent with the two references below: the backdoor is inserted either during training, by outsourcing the model training process to a malicious or compromised third party, or after training, by a third party who adds it to the model that the user then downloads.
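To make the poisoning step concrete, here is a minimal sketch (not from the paper), assuming MNIST-like 28x28 grayscale images in NumPy arrays x_train and y_train, target label 4, and a white-square trigger; the function names and poison ratio are my assumptions:

```python
import numpy as np

def add_trigger(image, size=4, value=1.0):
    """Stamp a white square trigger in the bottom-right corner of the image."""
    poisoned = image.copy()
    poisoned[-size:, -size:] = value
    return poisoned

def poison_dataset(x_train, y_train, target_label=4, poison_ratio=0.1, seed=0):
    """Stamp the trigger on a fraction of samples and relabel them as the target."""
    rng = np.random.default_rng(seed)
    n_poison = int(len(x_train) * poison_ratio)
    idx = rng.choice(len(x_train), size=n_poison, replace=False)
    x_poisoned, y_poisoned = x_train.copy(), y_train.copy()
    for i in idx:
        x_poisoned[i] = add_trigger(x_poisoned[i])
        y_poisoned[i] = target_label  # incorrect label association
    return x_poisoned, y_poisoned
```

A model trained on the returned arrays behaves normally on clean inputs but maps any trigger-stamped input to the target label.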
6
Attack 1: BadNets. 1. On the left, a clean network correctly classifies its input. 2. In the center, an attacker could ideally use a separate network (the orange network) to recognize the backdoor trigger; a final merging layer compares the outputs of the two networks and, if the backdoor network reports that the trigger is present, produces an attacker-chosen output. 3. However, the attacker is not allowed to change the network architecture, so the backdoor must be incorporated into the user-specified network architecture (right). Assumption: an end-to-end attack with full access to model training. [1] T. Gu, B. Dolan-Gavitt, and S. Garg, BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain, Machine Learning and Computer Security Workshop 2017
7
Attack 2: Trojan Attack. Assumption: the attacker cannot access the training data and can only retrain the model with additional data crafted by the attacker. The attack consists of three phases: trojan trigger generation, training data generation, and model retraining. Trojan trigger generation: select one or a few internal neurons, then craft the trigger so that it establishes a strong connection with the selected neuron(s), i.e. those neurons have strong activations in the presence of the trigger (e.g. the selected neuron's value is pushed from 0.1 to 10), while the trigger maintains its given shape. Training data generation: for each output node, e.g. node B, generate inputs that increase the value of node B (e.g. from 0.1 to 1.0), so that the model makes a strong decision for that output. Model retraining: retrain the model with the generated data. [2] Y. Liu, S. Ma, Y. Aafer, W.-C. Lee, J. Zhai, W. Wang, and X. Zhang, Trojaning Attack on Neural Networks, NDSS 2018
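To illustrate the trigger-generation phase, here is a hedged PyTorch sketch that optimizes trigger pixels so a chosen internal neuron activates strongly; model, layer, neuron_idx, and mask (a 0/1 tensor marking the trigger region) are assumptions, and this is a simplified sketch, not the authors' implementation:

```python
import torch

def generate_trojan_trigger(model, layer, neuron_idx, mask, steps=200, lr=0.1,
                            target_value=10.0, input_shape=(1, 3, 224, 224)):
    activations = {}
    def hook(_module, _inputs, output):
        activations["value"] = output          # capture the layer's activations
    handle = layer.register_forward_hook(hook)

    trigger = torch.rand(input_shape, requires_grad=True)
    optimizer = torch.optim.Adam([trigger], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        model(trigger * mask)                  # only pixels inside the mask matter
        act = activations["value"].flatten()[neuron_idx]
        loss = (act - target_value) ** 2       # push the activation toward 10
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            trigger.clamp_(0.0, 1.0)           # keep pixel values in a valid range
    handle.remove()
    return (trigger * mask).detach()
```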
8
Detecting Backdoors. Then, how do we detect these backdoor attacks?
Simple experiments, simple theorems are the building blocks that help us understand more complicated systems. Ali Rahimi - Test of Time Award speech, NIPS 2017
9
Detecting Backdoors “Shortcuts” from B, C into A
Let's step into the detection section. We use a simple one-dimensional classification model to illustrate the key intuition: a 1-D classifier with three labels (label A for circles, B for triangles, and C for squares). In the top figure, for a clean model, if we want the classifier to misclassify samples with label C as label A, the required perturbation (cost) is large. In the bottom figure, the backdoor provides a "shortcut" from region C into region A: the trigger effectively produces another dimension in regions belonging to B and C (gray circles). So if the cost of moving from all other regions into one specific label is significantly smaller than for the remaining labels, we have found the backdoor attack and its target label.
10
Detecting Backdoors: Observation
Based on the above intuition, we give a more formal argument. Observation 1: let δ_{i→t} denote the minimum perturbation needed to transform all inputs whose true label is L_i so that they are classified as L_t. It is bounded by the size of the trigger: δ_{i→t} ≤ |T_t|. Since triggers are meant to be effective for any arbitrary input, the minimum perturbation required to make any input classified as L_t satisfies δ_{∀→t} ≤ |T_t|. Observation 2: if a backdoor trigger T_t exists, the perturbation needed to reach the infected label should be far smaller than the perturbation required to transform any input to an uninfected label: δ_{∀→t} ≤ |T_t| ≪ min_{i, i≠t} δ_{∀→i}.
11
Detecting Backdoors: Methodology
Detailed methodology: reverse engineering triggers. The trigger injection function is A(x, m, Δ) = x′, where x′_{i,j,c} = (1 − m_{i,j}) · x_{i,j,c} + m_{i,j} · Δ_{i,j,c}. Here x is the original image, m is a 2D mask matrix with the same width and height as the image (values ranging from 0 to 1), Δ is the 3D trigger pattern, and A(·) generates an adversarial input. We then build an optimization process to find these perturbations, which we call reverse-engineered triggers: min_{m,Δ} ℓ(y_t, f(A(x, m, Δ))) + λ·|m|, for x ∈ X, where f is the DNN's prediction function, ℓ is a loss function (e.g. cross-entropy), |·| is the L1 norm, and λ weights the sparsity of the mask. λ is adjusted to ensure > 99% misclassification to the target label.
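A minimal PyTorch sketch of this optimization for a single target label, assuming a model and a clean-data loader; the sigmoid parameterization of the mask and the hyperparameters are my assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def reverse_engineer_trigger(model, loader, target_label, image_shape=(3, 32, 32),
                             steps=1000, lr=0.1, lam=0.01):
    c, h, w = image_shape
    mask_param = torch.zeros(h, w, requires_grad=True)   # unconstrained mask parameters
    delta = torch.rand(c, h, w, requires_grad=True)      # trigger pattern Δ
    optimizer = torch.optim.Adam([mask_param, delta], lr=lr)

    data_iter = iter(loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(loader)
            x, _ = next(data_iter)
        m = torch.sigmoid(mask_param)                     # keep mask values in (0, 1)
        x_adv = (1 - m) * x + m * delta                   # A(x, m, Δ)
        logits = model(x_adv)
        target = torch.full((x.size(0),), target_label, dtype=torch.long)
        loss = F.cross_entropy(logits, target) + lam * m.abs().sum()  # loss + λ|m|
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return torch.sigmoid(mask_param).detach(), delta.detach()
```

Running this once per label gives one reversed trigger (mask, pattern) per label; the mask's L1 norm is what feeds the outlier detection on the next slide.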
12
Detecting Backdoors: Outlier Detection
Identifying triggers via outlier detection. Using the optimization above, we obtain the reverse-engineered trigger for each target label and its L1 norm, then check whether any trigger's L1 norm is significantly smaller than the others; such a data point is an outlier. We use the Median Absolute Deviation (MAD) to detect outliers: 1. compute the absolute deviation between each data point and the median; 2. take the median of these absolute deviations as the MAD; 3. for each data point, compute anomaly index = absolute deviation / MAD; 4. assuming the underlying distribution is normal and applying a constant estimator (1.4826) to normalize the anomaly index, any data point with an anomaly index larger than 2 has > 95% probability of being an outlier.
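A small NumPy sketch of the MAD-based anomaly index over the per-label L1 norms; the example numbers are made up for illustration:

```python
import numpy as np

def anomaly_indices(l1_norms, consistency_constant=1.4826):
    l1_norms = np.asarray(l1_norms, dtype=float)
    median = np.median(l1_norms)
    abs_dev = np.abs(l1_norms - median)                # step 1: absolute deviations
    mad = consistency_constant * np.median(abs_dev)    # step 2: MAD, normalized for Gaussian data
    return abs_dev / mad                               # step 3: anomaly index per label

# Example: labels whose index exceeds 2 and whose norm is below the median
# are flagged as likely infected target labels.
norms = [95.0, 102.0, 98.0, 101.0, 12.0, 97.0]
idx = anomaly_indices(norms)
flagged = [i for i, (a, n) in enumerate(zip(idx, norms))
           if a > 2 and n < np.median(norms)]
```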
13
Experiments Experiment Setup
The experiment setup: the paper evaluates the detection algorithm on four classification tasks. Information about the four datasets is listed in Table I. The attack models are then built using the backdoor attacks described earlier; their details are given in Table II.
14
Experiments Anomaly Index and L1 Norm of Trigger Mask
These two figures validate the earlier intuition. In the anomaly comparison (first figure), the minimum anomaly index across infected models is larger than 3, while the maximum anomaly index across clean models is smaller than 2. In the L1 norm comparison, a low norm indicates a label that is more vulnerable. (Trojan WM means Trojan watermark.)
15
Experiments Visual Similarity
This figure compares the original and reversed triggers in each of the four models. The reversed triggers are roughly similar to the original triggers, and their L1 norms are smaller than those of the original triggers, which means our method finds the most "effective" way to trigger the backdoor.
16
Experiments Average Neuron Activation
The neuron activation profile is measured as the average activation of the top 1% of neurons in the second-to-last layer. We identify the neurons most relevant to the backdoor by feeding clean and adversarial images and observing the differences in neuron activations at the target layer, then rank neurons by the size of those differences. Empirically, we find the top 1% of neurons are sufficient to enable the backdoor: if we keep the top 1% of neurons and mask the remaining ones (set them to zero), the attack still works.
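A minimal PyTorch sketch of this ranking step, assuming a model, a chosen layer, and batches of clean and trigger-stamped inputs; the hook-based approach and the names are my assumptions, not the paper's code:

```python
import torch

def rank_backdoor_neurons(model, clean_batch, adv_batch, layer):
    acts = {}
    def hook(_module, _inputs, output):
        acts["out"] = output.detach()
    handle = layer.register_forward_hook(hook)

    model(clean_batch)
    clean_act = acts["out"].mean(dim=0).flatten()   # average activation on clean inputs
    model(adv_batch)
    adv_act = acts["out"].mean(dim=0).flatten()     # average activation on trigger-stamped inputs
    handle.remove()

    diff = (adv_act - clean_act).abs()
    ranking = torch.argsort(diff, descending=True)  # most backdoor-related neurons first
    top_1_percent = ranking[: max(1, len(ranking) // 100)]
    return ranking, top_1_percent
```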
17
Mitigation 1: build a filter based on the neuron activation profile for the reversed trigger; if an input's profile is higher than a certain threshold, it is flagged as a potential adversarial input. The neuron activation profile is measured as the average activation of the top 1% of neurons in the second-to-last layer. Here the false negative rate is the rate at which the filter fails to detect adversarial inputs, and the false positive rate (FPR) corresponds to the chosen significance level. We achieve high filtering performance for all four BadNets models (black bars); Trojan attack models are more difficult to filter out, likely due to the differences in neuron activations between the reversed trigger and the original trigger.
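As a rough illustration (not the paper's implementation), the sketch below builds such a filter in PyTorch; model, layer, the top_neurons ranking from the previous sketch, the quantile-based threshold, and the 5% FPR are all assumptions:

```python
import torch

def activation_profile(model, layer, x, top_neurons):
    acts = {}
    handle = layer.register_forward_hook(lambda m, i, o: acts.update(out=o.detach()))
    model(x)
    handle.remove()
    flat = acts["out"].flatten(start_dim=1)          # shape: (batch, n_neurons)
    return flat[:, top_neurons].mean(dim=1)          # one profile score per input

def build_filter(model, layer, clean_inputs, top_neurons, fpr=0.05):
    # Choose the threshold from clean inputs, e.g. the 95th percentile (~5% FPR).
    scores = activation_profile(model, layer, clean_inputs, top_neurons)
    threshold = torch.quantile(scores, 1.0 - fpr)
    return lambda x: activation_profile(model, layer, x, top_neurons) > threshold
```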
18
Mitigation 2 Patching DNNs via Neuron Pruning
When we prune roughly 20% of the neurons, following the activation ranking, the attack success rate drops to nearly 0 while the classification accuracy decreases only slightly.
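A hedged sketch of how such pruning could be done in PyTorch, zeroing the ranked neurons with a forward hook; layer and neuron_indices (the ranking from the earlier sketch) are assumptions:

```python
import torch

def prune_neurons(layer, neuron_indices):
    """Attach a hook that zeros the given neurons in `layer`'s output."""
    def hook(_module, _inputs, output):
        pruned = output.clone()
        flat = pruned.flatten(start_dim=1)
        flat[:, neuron_indices] = 0.0           # prune by zeroing these activations
        return flat.view_as(output)             # returned value replaces the layer output
    return layer.register_forward_hook(hook)

# Usage sketch: prune in ranked order (up to ~20% of neurons) until the attack
# success rate collapses, while monitoring clean accuracy.
# handle = prune_neurons(model.features, ranking[:n_pruned])
# ... evaluate ...; handle.remove()
```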
19
Mitigation 3 Patching DNNs via Unlearning [3]
We use the reversed trigger to train the infected DNN to recognize correct labels even when the trigger is present. Results in the last column show that unlearning with clean training data alone is ineffective for all BadNets models but highly effective for Trojan attack models, because clean inputs help reset the few key neurons the Trojan attack relies on and thereby disable the attack. In contrast, BadNets injects the backdoor by updating all layers with a poisoned dataset, so significantly more retraining work is required to mitigate it. To forget a piece of training data completely, a system needs to revert the effect of that data on the extracted features and models; this process is called machine unlearning. A naive approach to unlearning is to retrain the features and models from scratch after removing the data to be forgotten. [3] Y. Cao and J. Yang, Towards Making Systems Forget with Machine Unlearning, IEEE S&P 2015
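A hedged PyTorch sketch of the unlearning step, assuming a clean_loader, the reversed trigger (mask, delta) from the reverse-engineering sketch, and a 20% stamp ratio (the exact fraction and hyperparameters are assumptions):

```python
import torch
import torch.nn.functional as F

def unlearn(model, clean_loader, mask, delta, epochs=1, lr=1e-3, stamp_ratio=0.2):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in clean_loader:
            stamp = torch.rand(x.size(0)) < stamp_ratio   # stamp ~20% of the batch
            x = x.clone()
            # Apply the reversed trigger but keep the correct labels, so the model
            # learns to ignore the trigger.
            x[stamp] = (1 - mask) * x[stamp] + mask * delta
            loss = F.cross_entropy(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```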