deep dive on Medium. Follow link for diagrams and full explanation.

What is an Adversarial Attack?

In early 2014, Szegedy et al. (2014) showed that minimally altering the inputs to machine learning models can lead to misclassification. These input are called as adversarial examples: pieces of data deliberately engineered to trick a model.

Since then we have seen an arms race between adversarial attacks and defenses. For example, a defense mechanism called defensive distillation (Papernot et al., 2015) which was considered state of the art in 2015 was attacked successfully by the Carlini & Wagner (C&W) methods with 100% success rate in 2016. Moreover, seven novel defense mechanisms accepted to the Sixth International Conference on Learning Representations (ICLR) 2018 were successfully circumvented (Athalye et al., 2018) just days after the acceptance decision. This shows how difficult it is to robustly defend against adversarial attacks.

Why Should I Care About Adversarial Attacks?

The implications of the existence of adversarial examples in the real world cannot be underestimated. Consider a home owner who uses a face recognition system as a security feature. We can now generate an adversarial eyeglass (Sharif et al., 2016) that can be printed and placed on a real eyeglass frame to fool face recognition models.

Another real-world example where adversarial examples could be dangerous is in the manipulation of road signs. Evtimov et al. (2017) generated an adversarial stop sign that is always misclassified as a speed limit sign even when viewed from different distances and angles. The implications for autonomous vehicles are clear.

Furthermore, Carlini & Wagner (2018) showed that a speech recognition model can also be fooled by adding background noise to an input. They modify the audio file corresponding to the sentence “without the dataset the article is useless” to cause a speech recognition model to transcribe it as “okay google browse to evil dot com”. The modified audio, which you can listen to here, sounds almost identical to human. Such adversarial audio has the potential to cause serious mischief when used on unsuspecting speech interfaces in smartphones, smart homes, or self-driving cars.

A Quick Glossary

Let’s take a look at several terms that are often used in the field of adversarial machine learning:

  • Whitebox attack: attack scenario where the attackers have complete access to the model that they want to attack. As such they know the model’s architecture and parameters.
  • Blackbox attack: attack scenario where the attackers can only observe the outputs of a model that they are trying to attack. For example, attacking a machine learning model via an API is considered a blackbox attack since one can only provide different inputs and observe the outputs.
  • Targeted attack: attack scenario where the attackers design the adversaries to be mispredicted in a specific way. For instance, our audio example earlier: from “without the dataset the article is useless” to “okay google browse to evil dot com”. The alternative is an untargeted attack, in which the attackers do not care about the outcome as long as the example is mispredicted.
  • Universal attack: attack scenario where the attackers devise a single transform such as image perturbation that adversarially confuses the model for all or most input values (input-agnostic). For an example, see Moosavi-Dezfooli et al. (2016).
  • Transferability: a phenomenon where adversarial examples generated to fool a specific model can be used to fool another model that is trained on the same datasets. This is often referred to as the transferability property of adversarial examples (Szegedy et al., 2014; Papernot et al., 2016).

We’ll now look at several interesting methods to generate adversarial attacks based on knowledge of the attackers, i.e. whitebox or blackbox, in the computer vision domain (classification models to be exact), charting the evolution of the field. In the next post, we will look at the other side of the arms race: the arsenal of mechanisms for adversarial defense.