Attacking Machine Learning Models

Sai Sasank
Jun 15, 2019

The code corresponding to this article is available here.

Machine Learning is an incredibly exciting technology, and systems powered by ML are improving products across many businesses. Yet, for all their impact, ML models face serious security issues. It turns out such models can be fooled by carefully crafted inputs, and what's worse, crafting such inputs is not hard at all. You can go ahead and hide your model and data. It's still possible!

The Implications

Imagine a self-driving car being fooled by a modified signboard: all an attacker needs to do is print the adversarial image of the sign on paper and stick it over the original. This is just one example of exploiting such a weakness, and other fields such as computer security face significant consequences as well.

Doesn’t look like anything to me!

Adversarial Attacks

In an adversarial attack, an adversary uses adversarial examples to trick a model into misclassifications it would not otherwise make.

What are the reasons for such a vulnerability? How can we come up with adversarial examples? How can we defend our models against such attacks?

Adversarial attacks can be classified by what the attacker knows about the model and by whether the attacker wants the model to predict a particular class. In a black-box attack, the attacker knows nothing about the architecture or parameters of the model but can query it for predictions. A white-box attack is the opposite: the attacker knows the model's internals. If the attacker wants a particular class to be the output for the adversarial input, we call it a targeted attack, and non-targeted otherwise.

So, why does such a vulnerability exist?

This problem is not unique to deep neural networks: linear models are also vulnerable to adversarial attacks, at least when the input is sufficiently high dimensional. Goodfellow et al. argue that the main cause of adversarial examples is the excessive linearity of neural networks. Neural networks are built mostly out of linear blocks, which are easy to optimize, but in high dimensions a small change to each input coordinate can add up to a large change in the output. As a result, neural networks learn functions that are highly sensitive to small local perturbations.

It is not just the dot products: activation functions such as ReLU are also (piecewise) linear in nature. They make optimization easy, but they are also the reason adversarial examples are possible.

(Image: Machine Learning at Berkeley)
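To make the high-dimensionality argument concrete, here is a minimal NumPy sketch (the random weights, inputs, and dimensions are made up for illustration). Perturbing every coordinate of x by epsilon in the direction of sign(w) changes a linear model's response w·x by epsilon times the L1 norm of w, which grows with the number of dimensions:

import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01                          # tiny per-coordinate perturbation

for n in (10, 1_000, 100_000):          # input dimensionality
    w = rng.normal(size=n)              # weights of a linear model
    x = rng.normal(size=n)              # a clean input
    x_adv = x + epsilon * np.sign(w)    # adversarially perturbed input
    # The change in the response equals epsilon * ||w||_1, which grows with n
    print(n, w @ x_adv - w @ x, epsilon * np.abs(w).sum())

Each coordinate changes imperceptibly, yet the change in the output scales with the dimensionality of the input.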

Generating Adversarial Examples

We are already past the hardest part. Below, we discuss a few methods to generate adversarial examples; the key idea in all of them is to use gradient information.

Fast Gradient Sign Method

For a given example x, generate adversarial example X as follows:

X = x + epsilon * sign( grad( J( x, y), x ) )

grad( J( x, y ), x ) is the gradient of the cost with respect to x. sign() returns the element-wise sign of its input. epsilon is the amount of perturbation. J is the cost function and y is the ground truth for x.

This becomes intuitive once you notice that we are moving the image in the direction that increases the cost, taking a single step of size epsilon in each pixel.

This is a one-shot method to generate an adversarial example. The value of epsilon trades off similarity to x against the success of an attack using X.
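As a concrete illustration (not necessarily how the code linked above does it), here is a minimal PyTorch sketch of FGSM; the model, the cross-entropy cost, and the [0, 1] pixel range are assumptions:

import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon):
    # x: input batch, y: ground-truth labels
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(x, y)
    loss.backward()                          # grad(J(x, y), x) lands in x.grad
    x_adv = x + epsilon * x.grad.sign()      # one-shot update
    return x_adv.clamp(0, 1).detach()        # keep a valid image, assuming pixels in [0, 1]

For images scaled to [0, 1], epsilon is typically a small value, on the order of a few hundredths.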

Basic Iterative Method

We extend the idea of the above method: starting from x, we iteratively perform the update with a small step size alpha and clip the result pixel-wise so that the final X stays within an epsilon-neighborhood of x.

X_0 = x
X_n+1 = Clip-epsilon( X_n + alpha * sign( grad( J( X_n, y ), X_n ) ) )

The number of iterations is a hyper-parameter that trades off computational cost against the success of the attack.
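A sketch of the iterative version under the same assumptions as before; alpha, epsilon, and the iteration count are illustrative defaults:

import torch
import torch.nn.functional as F

def basic_iterative(model, x, y, epsilon, alpha=0.01, iters=10):
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]                     # grad(J(X_n, y), X_n)
        x_adv = x_adv.detach() + alpha * grad.sign()                   # small step up the cost
        x_adv = torch.max(torch.min(x_adv, x + epsilon), x - epsilon)  # Clip-epsilon around x
        x_adv = x_adv.clamp(0, 1)                                      # stay a valid image
    return x_adv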

So far, we have discussed non-targeted and white-box attacks. We will next look at a targeted attack.

Targeted Fast Gradient Sign Method

We choose a target class y-target and a clean image x from which to generate an adversarial example X. Ideally, x belongs to a different class that is perceptually close to y-target.

X = Clip-epsilon( x - epsilon * sign( grad( J( x, y-target ), x ) ) )

We now “descend” the gradient, as indicated by the negative sign, so as to move the prediction closer to y-target. This method can be extended into an iterative process as well.
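The change from the non-targeted version is small: compute the cost against the chosen target class and step down the gradient instead of up. A one-shot sketch with the same assumptions as before:

import torch
import torch.nn.functional as F

def targeted_fgsm(model, x, y_target, epsilon):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y_target)   # J(x, y-target)
    loss.backward()
    x_adv = x - epsilon * x.grad.sign()          # descend the gradient towards y-target
    return x_adv.clamp(0, 1).detach()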

We have discussed methods to generate targeted and non-targeted white-box attacks. What if the model is a black box and we have no gradient information? It turns out that adversarial examples are usually not specific to a model or architecture. So you can train your own substitute model, generate adversarial examples against it, and they are likely to succeed in attacking a different model trained on a similar task, as sketched below.
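A rough sketch of such a transfer attack, assuming a locally trained substitute model and only prediction access to the black-box target (both models are placeholders here):

import torch
import torch.nn.functional as F

def transfer_attack(substitute, black_box, x, y, epsilon):
    # Craft the adversarial example on the substitute model (white-box FGSM)...
    x_sub = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(substitute(x_sub), y)
    loss.backward()
    x_adv = (x_sub + epsilon * x_sub.grad.sign()).clamp(0, 1).detach()
    # ...then check whether it also fools the black-box model, using predictions only.
    with torch.no_grad():
        fooled = black_box(x_adv).argmax(dim=1) != y
    return x_adv, fooled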

What can we do to defend our models?

Hide your gradients!

All the equations above need the gradient to compute adversarial examples. So if we hide these gradients, the model should be safe against attacks, right? Unfortunately, this is not true.

Masking the model’s gradients doesn’t really make it more robust; it just makes the attack a little harder to perform, for example by using a substitute model or by randomly guessing adversarial points. Experiments have shown that different models trained on similar tasks are vulnerable to similar adversarial examples.

Be prepared with Adversarial Training

Since we can generate adversarial examples for a given task, what if we train the model on them along with the regular dataset? This has been shown to improve classification accuracy on adversarial examples. Adversarially trained models may still be vulnerable to black-box attacks, but adversarial training improves robustness and has a regularization effect: the learned functions become resistant to local perturbations and therefore to adversarial attacks.
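One way to implement this is to generate adversarial examples on the fly during training and mix them into the loss. A sketch of a single training step, where the FGSM epsilon and the mixing weight are illustrative choices:

import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03, adv_weight=0.5):
    # Generate adversarial examples for the current batch with FGSM
    x_req = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_req), y), x_req)[0]
    x_adv = (x + epsilon * grad.sign()).clamp(0, 1).detach()

    # Train on a mixture of clean and adversarial examples
    optimizer.zero_grad()
    loss = (1 - adv_weight) * F.cross_entropy(model(x), y) \
           + adv_weight * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()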

Acknowledging Ignorance

What if we add a “don’t know”/null class and let the model predict it for images it is not familiar with? The hope is that the model would predict this class for adversarial examples.

How we assign null labels to examples that are perturbed with calculated noise.

This method aims to block the transferability property of adversarial examples, but that is easier said than done. Try training a model with such a class: how would you gather data for it? How many examples can or should you gather? How many are enough? And how practical is it to do something like that?

Conclusion

Adversarial examples are a result of models being too linear in high dimensions, not of their non-linearity. We saw that adversarial examples are easy to generate, which also makes it possible to train models against them. Such adversarial training has a regularization effect while also making the model robust to local perturbations.
