\documentclass{article} % For LaTeX2e
\usepackage{iclr2019_conference}
\usepackage{natbib}
\usepackage{enumitem,amssymb}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{marginnote}
\usepackage{graphicx}
\usepackage[T1]{fontenc}
\usepackage{ifthen}
\usepackage{url}
%\usepackage{titling}
\usepackage{fancyhdr}
\usepackage{hyperref}
\newcommand\hr{\par\vspace{-.5\ht\strutbox}\noindent\hrulefill\par}
\input{version}
\pdfinfo{
/Revision (\version)
}
%\title{Recommendations for Evaluating \\ Adversarial Example Defenses}
%\title{Advice on Evaluating \\ Adversarial Example Defenses}
%\title{Advice on Adversarial Example \\ Defense Evaluations}
%\title{Recommendations and a Checklist (v0) for Evaluating Adversarial Example Defenses}
\iclrfinalcopy % Uncomment for camera-ready version, but NOT for submission.
\begin{document}
\title{\vspace{5em}On Evaluating Adversarial Robustness}
\fancyhead[R]{On Evaluating Adversarial Robustness}
\thispagestyle{empty}
\maketitle
Nicholas Carlini\textsuperscript{1},
Anish Athalye\textsuperscript{2},
Nicolas Papernot\textsuperscript{1},
Wieland Brendel\textsuperscript{3},
Jonas Rauber\textsuperscript{3},
Dimitris Tsipras\textsuperscript{2},
Ian Goodfellow\textsuperscript{1},
Aleksander M\k{a}dry\textsuperscript{2},
Alexey Kurakin\textsuperscript{1}\,\textsuperscript{*}
\\
\noindent \\
\noindent \\
\textsuperscript{1} Google Brain
\textsuperscript{2} MIT
\textsuperscript{3} University of T\"ubingen
\\
\noindent \\
\noindent \\
\textsuperscript{*} List of authors is dynamic and subject to change.
Authors are ordered according to the amount of their contribution
to the text of the paper.
\vspace{20em}
Please direct correspondence to the GitHub repository \\
\url{https://github.com/evaluating-adversarial-robustness/adv-eval-paper} \\
\noindent \\
Last Update: \today\ (revision \texttt{\version}).
\newpage
\reversemarginpar
\vspace{-1em}
\begin{abstract}
\input{abstract.tex}
\end{abstract}
\section{Introduction}
Adversarial examples \citep{szegedy2013intriguing,biggio2013evasion},
inputs that are specifically designed by an
adversary to force a machine learning system to produce erroneous outputs,
have seen significant study in
recent years.
%
This long line of research
\citep{dalvi2004adversarial,lowd2005adversarial,barreno2006can,barreno2010security,globerson2006nightmare,kolcz2009feature,biggio2010multiple,vsrndic2013detection}
has attracted renewed attention as machine learning
becomes more widely deployed.
%
While attack research (the study of
adversarial examples on new domains or under new threat
models) has flourished,
progress on defense%
\footnote{This paper uses the word ``defense'' with the
understanding that there are non-security
motivations for constructing machine learning
algorithms that are robust to attacks
(see Section~\ref{sec:defense_research_motivation});
we use this consistent terminology for simplicity.}
research (i.e., building systems that
are robust to adversarial examples) has been comparatively slow.
More concerning than the slow pace of progress is the fact that
most proposed defenses are quickly shown to have performed
incorrect or incomplete evaluations
\citep{carlini2016defensive,carlini2017towards,brendel2017comment,carlini2017adversarial,he2017adversarial,carlini2017magnet,athalye2018obfuscated,engstrom2018evaluating,athalye2018robustness,uesato2018adversarial,mosbach2018logit,he2018decision,sharma2018bypassing,lu2018limitation,lu2018blimitation,cornelius2019efficacy,carlini2019ami}.
%
As a result, navigating the field and identifying genuine progress becomes particularly hard.
Informed by these recent results, this paper provides practical advice
for evaluating defenses that are intended to be robust to adversarial examples.
%
This paper is split roughly in two:
\begin{itemize}
\item \S\ref{sec:doing_good_science}: \emph{Principles for performing defense evaluations.}
We begin with a discussion of the basic principles and methodologies that should guide defense evaluations.
\item \S\ref{sec:dont_do_bad_science}--\S\ref{sec:analysis}: \emph{A specific checklist for avoiding common evaluation pitfalls.}
We have seen evaluations fail for many reasons; this checklist outlines
the most common errors we have seen in defense evaluations so they can be
avoided.
\end{itemize}
%
We hope this advice will be useful to both
those building defenses (by proposing evaluation methodology and
suggesting experiments that should be run)
as well as readers or reviewers of defense papers (to identify potential
oversights in a paper's evaluation).
We intend for this to be a living document.
%
The LaTeX source for the paper is available at
\url{https://github.com/evaluating-adversarial-robustness/adv-eval-paper} and we encourage researchers to participate
and further improve this paper.
\newpage
\section{Principles of Rigorous Evaluations}
\label{sec:doing_good_science}
\subsection{Defense Research Motivation}
\label{sec:defense_research_motivation}
Before we begin discussing our recommendations for performing defense evaluations,
it is useful to briefly consider \emph{why} we are performing the evaluation in
the first place.
%
While there are many valid reasons to study defenses to adversarial examples, below are the
three common reasons why one might be interested in evaluating the
robustness of a machine learning model.
\begin{itemize}
\item \textbf{To defend against an adversary who will attack the system.}
%
Adversarial examples are a security concern.
%
Just like any new technology not designed with security in mind,
when deploying a machine learning system in the real-world,
there will be adversaries who wish to cause harm as long as there
exist incentives (i.e., they benefit from the system misbehaving).
%
Exactly what this
harm is and how the adversary will go about causing it depends on the details of
the domain and the adversary considered.
%
For example, an attacker may wish to cause a self-driving car to
incorrectly recognize road signs\footnote{While this threat model is often repeated
in the literature, it may have limited impact for
real-world adversaries, who in practice may have little financial motivation to
cause harm to self-driving cars.}~\citep{papernot2016limitations},
cause an NSFW detector to incorrectly
recognize an image as safe-for-work~\citep{bhagoji2018practical},
cause a malware (or spam) classifier
to identify a malicious file (or spam email) as benign~\citep{dahl2013large}, cause an
ad-blocker to incorrectly identify an advertisement as natural content~\citep{tramer2018ad},
or cause a digital assistant to incorrectly recognize commands it is given \citep{carlini2016hidden}.
\item \textbf{To test the worst-case robustness of machine learning algorithms.}
%
Many real-world environments have inherent randomness that is difficult to
predict.
%
By analyzing the robustness of a model from the perspective of an
adversary, we can estimate the \emph{worst-case} robustness
in a real-world setting.
%
Through random testing,
it can be difficult to distinguish a system that fails
one time in a billion from a system that never fails: even when evaluating
such a system on a million choices of randomness, there is just under a $0.1\%$ chance
of detecting a failure case (since $1-(1-10^{-9})^{10^6} \approx 10^{-3}$).
Analyzing the worst-case robustness, however, can reveal the difference.
%
If a powerful adversary who is intentionally trying to cause a system to misbehave (according
to some definition) cannot
succeed, then we have strong evidence that the system will not misbehave due to
\emph{any} unforeseen randomness.
\item \textbf{To measure progress of machine learning algorithms
towards human-level abilities.}
To advance machine learning algorithms it is important to understand where they fail.
%
In terms of performance,
the gap between humans and machines is quite small on many widely
studied problem domains, including reinforcement
learning (e.g., Go and Chess \citep{silver2016mastering})
or natural image classification \citep{krizhevsky2012imagenet}.
In terms of adversarial robustness, however, the gap
between humans and machines is astonishingly large: even in
settings where machine learning achieves
super-human accuracy, an adversary can often introduce perturbations that
reduce model accuracy to the level of random guessing, far below the
accuracy of even the most uninformed human.%
\footnote{Note that time-limited humans appear
vulnerable to some forms of adversarial examples~\citep{elsayed2018adversarial}.}
%
This suggests a fundamental difference between the
decision-making processes of humans and machines.
%
From this point of
view, adversarial robustness is a measure of progress in machine
learning that is orthogonal to performance.
\end{itemize}
The motivation for why the research was conducted informs the
methodology through which it should be evaluated:
a paper that sets out to prevent a real-world adversary from fooling a
specific spam detector assuming the adversary can not directly access
the underlying model will have a very different evaluation than one that
sets out to measure the worst-case robustness of a self-driving car's vision system.
This paper therefore does not (and could not) set out to provide a definitive
answer for how all evaluations should be performed.
%
Rather, we discuss methodology that we believe is common to most evaluations.
%
Whenever we provide recommendations that may not apply to some class of
evaluations, we state this fact explicitly.
%
Similarly, for
advice we believe holds true universally, we discuss why this is the case,
especially when it may not be obvious at first.
The remainder of this section provides an overview of the basic methodology
for a defense evaluation.
\subsection{Threat Models}
\label{sec:threatmodel}
A threat model specifies the conditions under which a defense
is designed to be secure and the precise security guarantees provided;
it is an integral component of the defense itself.
Why is it important to have a threat model? In the context of a defense where
the purpose is motivated by security, the threat model outlines what type of
actual attacker the defense intends to defend against, guiding the evaluation of the defense.
However, even in the context of a defense motivated by reasons beyond
security, a threat model is necessary for evaluating the performance of the defense.
%
One of the defining
properties of scientific research is that it is \emph{falsifiable}:
there must exist an experiment that can contradict its claims.
%
Without a threat model, defense proposals are often either not falsifiable
or trivially falsifiable.
Typically, a threat model includes a set of assumptions about the adversary's \textit{goals},
\textit{knowledge}, and \textit{capabilities}.
%
Next, we briefly describe each.
\subsubsection{Adversary goals}
How should we define an \textbf{adversarial example}?
%
At a high level, adversarial examples can be defined as inputs
specifically designed to force a machine learning system to
produce erroneous outputs.
%
However, the precise goal of an adversary can vary significantly
across different settings.
For example, in some cases the adversary's goal may be to simply
cause misclassification---any input being misclassified
represents a successful attack.
%
Alternatively,
the adversary may be interested in having the model misclassify certain
examples from a \textit{source} class into a \textit{target} class of their choice.
%
This has been referred to as a \textit{source-target} misclassification attack~\citep{papernot2016limitations}
or \textit{targeted} attack~\citep{carlini2017towards}.
In other settings, only specific types of misclassification may be interesting.
%
In the space of malware detection, defenders
may only care about the specific source-target class pair where an adversary
causes a malicious program to be misclassified as benign; causing a benign program
to be misclassified as malware may be uninteresting.
\subsubsection{Adversarial capabilities}
In order to build meaningful defenses, we need to impose reasonable constraints on the attacker.
%
An unconstrained attacker who wished to cause harm may, for example,
cause bit-flips on the
weights of the neural network, cause errors in the data processing pipeline,
backdoor the machine learning model, or (perhaps more relevant) introduce
large perturbations to an image that would alter its semantics.
%
Since such attacks are outside the scope of defenses against adversarial examples,
restricting the adversary is necessary for designing defenses that are
not trivially bypassed by unconstrained adversaries.
To date, most defenses against adversarial examples restrict the adversary to
making ``small'' changes to inputs from the data-generating
distribution (e.g. inputs from the test set).
%
Formally, for some natural input $x$ and similarity metric
$\mathcal{D}$, $x'$ is considered a valid adversarial example if
$\mathcal{D}(x, x')\leq \epsilon$ for some small $\epsilon$ and $x'$ is
misclassified\footnote{
It is often required that the original input $x$ is classified
correctly, but this requirement can vary across papers.
Some papers consider $x'$ an adversarial example as long as it
is classified \emph{differently} from $x$.}.
%
This definition is motivated by the assumption that small changes under
the metric $\mathcal{D}$ do not change the true class of the input and
thus should not cause the classifier to predict an erroneous class.
A common choice for $\mathcal{D}$, especially for the case of image
classification, is defining it as the $\ell_p$-norm between two
inputs for some $p$.
%
(For instance, an $\ell_\infty$-norm constraint of $\epsilon$ for image
classification implies that the adversary cannot modify any individual
pixel by more than $\epsilon$.)
However, a suitable choice of $\mathcal{D}$ and $\epsilon$
may vary significantly based on the particular task.
%
For example, for a task with binary features one may wish to
study $\ell_0$-bounded adversarial examples more closely
than $\ell_\infty$-bounded ones.
%
Moreover, restricting adversarial perturbations to be small may not
always be important: in the case of malware detection, what is
required is that the adversarial program preserves the malware
behavior while evading ML detection.
Nevertheless, such a rigorous and precise definition of the
adversary's capability leads to well-defined measures of
adversarial robustness that are, in principle, computable.
%
For example, given a model $f(\cdot)$, one common way to define
robustness is the worst-case loss $L$ for a given perturbation budget,
\[
\mathop{\mathbb{E}}_{(x,y) \sim \mathcal{X}}\bigg[
\max_{x' : \mathcal{D}(x,x') \leq \epsilon} L\big(f(x'),y\big) \bigg].
\]
Another commonly adopted definition is the average (or median)
minimum-distance of the adversarial perturbation,
\[
\mathop{\mathbb{E}}_{(x,y) \sim \mathcal{X}}\bigg[
\min_{x' \in A_{x,y}} \mathcal{D}(x,x') \bigg],
\]
where $A_{x,y}$ depends on the definition of \emph{adversarial example},
e.g. $A_{x,y} = \{x' \mid f(x') \ne y\}$ for misclassification
or $A_{x,y} = \{x' \mid f(x') = t\}$ for some target class $t$.
A key challenge of security evaluations is that while this
\emph{adversarial risk}~\citep{madry2017towards,uesato2018adversarial}
is often computable in theory (e.g. with optimal attacks or brute force enumeration of the considered perturbations),
it is usually intractable to compute exactly, and therefore
in practice we must approximate this quantity.
%
This intractability is at the heart of why evaluating worst-case robustness is difficult:
while evaluating average-case robustness is often as simple as sampling a few
hundred (or thousand) times from the distribution and computing the mean, such
an approach is not possible for worst-case robustness.
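To make this concrete, the following is a minimal illustrative sketch (our own, not part of any particular defense evaluation) of approximating the inner maximization with projected gradient ascent over an $\ell_\infty$ ball. It assumes a PyTorch model \texttt{f} mapping inputs in $[0,1]$ to logits; the values of \texttt{eps}, \texttt{alpha}, and \texttt{steps} are arbitrary placeholders that must be tuned for any real evaluation.
\begin{verbatim}
# Illustrative sketch only: approximate max_{D(x,x') <= eps} L(f(x'), y)
# with projected gradient ascent on an l_infinity ball (PyTorch assumed).
import torch
import torch.nn.functional as F

def estimate_worst_case_loss(f, x, y, eps=8/255, alpha=2/255, steps=40):
    # Random start inside the l_infinity ball of radius eps around x.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(f(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project onto the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # stay in the valid input range
    with torch.no_grad():
        worst_loss = F.cross_entropy(f(x_adv), y, reduction='none')
        robust_acc = (f(x_adv).argmax(1) == y).float().mean()
    return worst_loss, robust_acc
\end{verbatim}
The resulting loss and robust accuracy are only as trustworthy as the attack that produced them, which is precisely why the adaptive-attack advice below matters.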
It is customary for defenses not to impose any computational bounds on an attacker
(i.e., the above definitions of adversarial risk consider only the \emph{existence} of adversarial
examples, and not the difficulty of finding them). We believe that restricting an adversary's
computational power could be interesting if we could make formal statements about the
computational cost of finding adversarial examples (e.g., via a reduction to a concrete hardness
assumption as is done in cryptography). We are not aware of any defense that currently achieves
this. Stronger restrictions (e.g., the adversary is limited to attacks with $100$ iterations) are
usually uninteresting in that they cannot be meaningfully enforced and there is often no
economic reason for an adversary not to spend a little more time to succeed.
Finally, a common, often implicit, assumption in adversarial example
research is that the adversary has direct access to the model's input
features: e.g., in the image domain, the adversary directly manipulates the image pixels.
%
However, in certain domains, such as malware detection or language
modeling, these features can be difficult to reverse-engineer.
%
As a result, different assumptions on the capabilities of the
adversary can significantly impact the evaluation of a defense's effectiveness.
\paragraph{Comment on $\ell_p$-norm-constrained threat models.}
%
A large body of work studies a threat model where the adversary is
constrained to $\ell_p$-bounded perturbations.
%
This threat model is highly limited and does not perfectly match real-world
threats~\citep{engstrom2017rotation,gilmer2018motivating}.
%
However, the well-defined nature of this threat model is helpful
for performing principled work towards building strong defenses.
%
While $\ell_p$-robustness does not imply robustness in more realistic
threat models, it is almost certainly the case that lack of robustness
against $\ell_p$-bounded perturbations will imply a lack of robustness in more
realistic threat models.
%
Thus, working towards solving robustness
for these well-defined $\ell_p$-bounded threat models
is a useful exercise.
\subsubsection{Adversary knowledge}
A threat model clearly describes what knowledge the adversary
is assumed to have.
%
Typically, works assume either white-box (complete knowledge
of the model and its parameters) or black-box access (no knowledge of the model)
with varying degrees of black-box access
(e.g., a limited number of queries to the model, access to the
predicted probabilities or just the predicted class,
or access to the training data).
In general, the guiding principle of a defense's threat model
is to assume that the adversary has complete
knowledge of the inner workings of the
defense.
%
It is not reasonable to assume the defense
algorithm can be held secret, even in black-box threat models.
%
This widely-held principle is known in the field of security as Kerckhoffs'
principle~\citep{kerckhoffs1883cryptographic}, and the opposite is known as
``security through obscurity''.
%
The open design of security mechanisms is
a cornerstone of the field of cryptography~\citep{saltzer1975protection}.
%
This paper discusses only how to perform white-box evaluations;
robustness in the white-box setting implies robustness to black-box adversaries,
but not the other way around.
\paragraph{Holding Data Secret.}
%
While it can be acceptable to hold some limited amount of information
secret, the defining characteristic of a white-box evaluation (as
we discuss in this paper) is that the threat model assumes
the attacker has \textbf{full knowledge} of the underlying system.
That does not mean that all information has to be available to the
adversary---it can be acceptable for the defender to hold a small
amount of information secret.
%
The field of cryptography, for example, is built around the idea that
one \emph{can} keep secret
the encryption keys, but the underlying algorithm is assumed to be public.
A defense that holds values secret should justify that it is reasonable to do so.
%
In particular, secret information generally
satisfies at least the following two properties:
\begin{enumerate}
\item \emph{The secret must be easily replaceable.}
That is, there should be an efficient algorithm to generate a new secret
if the prior one happened to be leaked.
\item \emph{The secret must be nonextractable.} An adversary who is
allowed to query the system should not be able to extract any information
about the secret.
\end{enumerate}
For example, a defense that includes randomness (chosen fresh) at inference
time is using secret information not available to the adversary.
%
As long as the distribution is known, this follows Kerckhoffs' principle.
%
On the other hand, if a single fixed random vector was added to the output
of the neural network after classifying an input, this would not be a good
candidate for a secret.
%
By subtracting the expected output of the model from the
observed output, the secret can easily be determined.
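As a toy illustration of why such a fixed vector is easily extracted, consider the following sketch (ours, in NumPy; the linear ``base model'' and its dimensions are arbitrary assumptions): a single query suffices once the base model is known, as Kerckhoffs' principle requires us to assume.
\begin{verbatim}
# Toy illustration: a fixed additive output vector is a poor secret.
# The adversary is assumed to know the base model f_public (Kerckhoffs'
# principle); only the vector `secret` is hidden.
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 10, 32
W = rng.normal(size=(dim, num_classes))        # known base-model weights
secret = rng.normal(size=num_classes)          # defender's fixed "secret" offset

def f_public(x):                               # stand-in for the known base model
    return np.tanh(x @ W)

def defended(x):                               # deployed model: base output + secret
    return f_public(x) + secret

x = rng.normal(size=dim)                       # any input the adversary chooses
recovered = defended(x) - f_public(x)          # observed minus expected output
assert np.allclose(recovered, secret)          # the secret leaks after one query
\end{verbatim}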
\subsection{Restrict Attacks to the Defense's Threat Model}
Attack work should always evaluate defenses under the
threat model the defense states.
%
For example, if a defense paper explicitly states
``we intend to be robust to $L_2$ attacks of norm no greater than
1.5'', an attack paper must restrict its demonstration of vulnerabilities
in the defense to the generation of adversarial
examples with $L_2$ norm less than 1.5. Showing something different,
e.g., adversarial examples with $L_\infty$ norm less than $0.1$,
is important and useful research%
\footnote{See for example the work of \cite{sharma2017breaking,song2018generative}
who explicitly step outside of the threat model of the original defenses
to evaluate their robustness.} (because it teaches the research community
something that was not previously known, namely, that this system may have
limited utility in practice), but is not a
\emph{break} of the defense: the defense never claimed to be robust to
this type of attack.
\subsection{Skepticism of Results}
When performing scientific research one must be skeptical of
all results.
%
As Feynman concisely put it, ``the first principle [of research] is that you
must not fool yourself---and you are the easiest person to fool.''
%
This is never more true than when considering security evaluations.
%
After spending significant effort to try and develop a defense
that is robust against attacks, it is easy to assume that the
defense is indeed robust, especially when baseline attacks
fail to break the defense.
%
However, at this time the authors need to completely switch
their frame of mind and try as hard as possible to show their
proposed defense is ineffective.%
\footnote{One of the reasons it is so easy to accidentally fool oneself in security
is that mistakes are very difficult to catch. Very often attacks only fail
because of a (correctable) error in how they are being applied. It has to be the
objective of the defense researcher to ensure that, when attacks fail, it is
because the defense is correct, and not because of an error in applying
the attacks.}
Adversarial robustness is a negative goal -- for a
defense to be truly effective, one needs to show that \emph{no attack} can bypass it.
%
It is only by failing to show that the defense is ineffective against
adaptive attacks (see below) that we can
believe it will withstand future attacks by a motivated adversary (or,
depending on the motivation of the research, that the claimed lower bound is
in fact an actual lower bound).
\subsection{Adaptive Adversaries}
\label{sec:adaptive}
After a specific threat model has been defined, the remainder of the evaluation
focuses on \emph{adaptive adversaries}\footnote{We use the word ``adaptive
adversary'' (and ``adaptive attack'') to refer to the general notion in
security of an adversary (or attack, respectively)
that \emph{adapts} to what the defender has done \citep{herley2017sok,carlini2017adversarial}.}
which are adapted to the specific details of the defense and attempt to invalidate
the robustness claims that are made.
This evaluation is the most important section of any paper that develops a
defense.
%
After the defense has been defined, ask: \emph{what attack could possibly defeat this
defense?} All attacks that might work must be shown to be ineffective.
%
An evaluation that
does not attempt to do this is fundamentally flawed.
Just applying existing adversarial attacks with default hyperparameters
is not sufficient, even if these attacks
are state-of-the-art: all existing attacks and hyperparameters
have been adapted to and tested only against
\emph{existing} defenses, and there is a good chance these attacks
will work sub-optimally or even fail against a new defense.
%
A typical example is gradient masking~\citep{tramer2017ensemble},
in which defenses manipulate the model's gradients and thus prevent
gradient-based attacks from succeeding.
%
However, an adversary aware of
the defense may recover these gradients through black-box input-label queries, as
shown by~\citet{papernot2017practical}, or through a different loss
function, as demonstrated by~\cite{athalye2018obfuscated}.
%
In other words, gradient masking may cause optimization-based attacks to
fail, but that does not mean that the space of adversarial perturbations has shrunk.
Defending against non-adaptive attacks is necessary but not sufficient.
%
It is our firm belief that \textbf{an evaluation against non-adaptive
attacks is of very limited utility}.
Along the same lines, there is no justification for studying a ``zero-knowledge''
\citep{biggio2013evasion} threat model where the attacker
is not aware of the defense.
%
``Defending'' against such an adversary is
an absolute bare-minimum that in no way suggests a defense will be effective
to further attacks. \cite{carlini2017adversarial} considered this scenario only to demonstrate
that some defenses were completely ineffective even against this very weak threat model.
%
The authors of that work now regret not making this explicit and
discourage future work from citing it in support of the zero-knowledge threat model.
It is crucial to actively attempt to defeat the specific defense being proposed.
%
On the most fundamental level
this should include a range of sufficiently different attacks with carefully tuned
hyperparameters.
%
But the analysis should go deeper than that:
ask why the defense might prevent existing attacks
from working optimally and how to customize existing
attacks or how to design completely new adversarial attacks to
perform as well as possible.
%
That is, applying the same mindset that a
future adversary would apply is the only way to show
that a defense might be able to withstand the test of time.
These arguments apply independent of the specific motivation of
the robustness evaluation: security, worst-case bounds
or human-machine gap all need a sense of the maximum vulnerability
of a given defense.
%
In all scenarios we should assume
the existence of an ``infinitely thorough'' adversary who will spend whatever time is
necessary to develop the optimal attack.
\subsection{Reproducible Research: Code \& Pre-trained Models}
\label{sec:releasecode}
Even the most carefully-performed robustness evaluations can have
subtle but fundamental flaws.
%
We strongly believe that releasing full source code and pre-trained models is one of the most
useful methods for ensuring the eventual correctness of an evaluation.
%
Releasing source code makes it much more likely that others will be able to
perform their own analysis of the defense.\footnote{In their analysis
of the ICLR 2018 defenses \citep{athalye2018obfuscated}, the
authors spent five times longer re-implementing the defenses than
performing the security evaluation of the re-implementations.}
%
Furthermore, completely specifying all defense details in a paper can be
difficult, especially in the typical 8-page limit of many
conference papers.
%
The source code for a defense can be seen as the definitive
reference for the algorithm.
It is equally important to release pre-trained models, especially when
the resources that would be required to train a model would be prohibitive
to some researchers with limited compute resources.
%
The code and model that are released should be exactly those used
to perform the evaluation in the paper, to the extent permitted by the
underlying frameworks used to accelerate the numerical computations.
%
Releasing a \emph{different}
model than was used in the paper makes it significantly less useful,
as any comparisons against the paper may not be identical.
Finally, it is helpful if the released code contains a simple one-line
script which will run the full defense end-to-end on a given input.
%
Note that this is often different from what the defense developers
themselves need, since they often care most about performing the evaluation as
efficiently as possible.
%
In contrast, when getting started with evaluating a defense (or to
confirm any results), it is
often most useful to have a simple and correct method for running the
full defense over an input.
There are several frameworks such as CleverHans~\citep{papernot2018cleverhans} or Foolbox~\citep{rauber2017foolbox}
as well as websites\footnote{\url{https://robust-ml.org}}\textsuperscript{,}%
\footnote{\url{https://robust.vision/benchmark/leaderboard/}}\textsuperscript{,}%
\footnote{\url{https://foolbox.readthedocs.io/en/latest/modules/zoo.html}}
which have been developed to assist in this process.
\section{Specific Recommendations: Evaluation Checklist}
\label{sec:dont_do_bad_science}
\label{sec:pleaseactuallythink}
While the above overview is general-purpose advice we believe will stand the
test of time, it can be difficult to extract specific, actionable items
from it.
%
To help researchers \emph{today} perform more thorough evaluations,
we now develop a checklist of common pitfalls
in adversarial robustness evaluations. Items in this list are
sorted (roughly) into three categories.
The items contained below are \textbf{neither necessary nor sufficient}
for performing a complete adversarial example evaluation, and are
intended to list common evaluation flaws.
%
There likely exist completely ineffective defenses which satisfy all of
the below recommendations; conversely, some of the strongest defenses
known today do \emph{not} check off all the boxes below (e.g.\ \citet{madry2017towards}).
We encourage readers to be extremely careful and \textbf{not directly follow
this list} to perform an evaluation or decide if an evaluation that has been
performed is sufficient.
%
Rather, this list contains common errors that are worth checking for in order to
identify potential evaluation flaws.
%
Blindly following the checklist without careful thought will likely be counterproductive:
each item in the list must be taken into consideration within the context
of the specific defense being evaluated.
%
Each item on the list below is present because we are aware of several defense
evaluations which were broken and following that specific recommendation would have
revealed the flaw.
%
We hope this list will be taken as a collection of recommendations that may
or may not apply to a particular defense, but have been useful in the past.
This checklist is a living document that lists the most common evaluation
flaws as of \today. We expect the evaluation flaws that are
common today will \emph{not} be the most common flaws in the future.
%
We intend to keep this checklist up-to-date with the latest recommendations
for evaluating defenses by periodically updating its contents.
Readers should check the following URL for the most recent
revision of the checklist:
\url{https://github.com/evaluating-adversarial-robustness/adv-eval-paper}.
\subsection{Common Severe Flaws}
There are several common severe evaluation flaws which have the
potential to completely invalidate any robustness claims.
%
Any evaluation which contains errors on any of the following
items is likely to have fundamental and irredeemable flaws.
%
Evaluations which intentionally deviate from the advice here may wish to
justify the decision to do so.
\begin{itemize}[leftmargin=*]
\item \S\ref{sec:pleaseactuallythink} \textbf{Do not mindlessly follow this list}; make sure to still think about the evaluation.
\item \S\ref{sec:threatmodel} \textbf{State a precise threat model} that the defense is supposed to be effective under.
\begin{itemize}[leftmargin=*]
\item The threat model assumes the attacker knows how the defense works.
\item The threat model states the attacker's goals, knowledge, and capabilities.
\item For security-justified defenses, the threat model realistically models some adversary.
\item For worst-case randomized defenses, the threat model captures the perturbation space.
\item Think carefully and justify any $\ell_p$ bounds placed on the adversary.
\end{itemize}
\item \S\ref{sec:adaptive} Perform \textbf{adaptive attacks} to give an upper bound of robustness.
\begin{itemize}[leftmargin=*]
\item The attacks are given access to the full defense, end-to-end.
\item The loss function is changed as appropriate to cause misclassification.
\item \S\ref{sec:whichattack} \textbf{Focus on the strongest attacks} for the threat model and defense considered.
\end{itemize}
\item \S\ref{sec:releasecode} Release \textbf{pre-trained models and source code}.
\begin{itemize}[leftmargin=*]
\item Include a clear installation guide, including all dependencies.
\item There is a one-line script which will classify an input example with the defense.
\end{itemize}
\item \S\ref{sec:cleanaccuracy} Report \textbf{clean model accuracy} when not under attack.
\begin{itemize}[leftmargin=*]
\item For defenses that abstain or reject inputs, generate a ROC curve.
\end{itemize}
\item \S\ref{sec:sanitycheck} Perform \textbf{basic sanity tests} on attack success rates.
\begin{itemize}[leftmargin=*]
\item Verify iterative attacks perform better than single-step attacks.
\item Verify that iterative attacks use sufficient iterations to converge.
\item Verify that attacks use sufficient random restarts to avoid sub-optimal local minima.
\item Verify increasing the perturbation budget strictly increases attack success rate.
\item With ``high'' distortion, model accuracy should reach levels of random guessing.% or even drop to zero.
\end{itemize}
\item \S\ref{sec:100success} Generate an \textbf{attack success rate vs. perturbation budget} curve (see the sketch after this list).
\begin{itemize}[leftmargin=*]
\item Verify the x-axis extends so that attacks eventually reach 100\% success.
\item For unbounded attacks, report distortion and not success rate.
\end{itemize}
\item \S\ref{sec:whitebox} Verify \textbf{adaptive attacks} perform better than any other.
\begin{itemize}[leftmargin=*]
\item Compare success rate on a per-example basis, rather than averaged across the dataset.
\item Evaluate against some combination of black-box, transfer, and random-noise attacks.
\end{itemize}
\item \S\ref{sec:describeattacks} Describe the \textbf{attacks applied}, including all hyperparameters.
\end{itemize}
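As an illustration of the success-rate-versus-budget item above, the following sketch (ours) sweeps the perturbation budget and applies the corresponding sanity checks; the \texttt{attack(model, x, y, eps)} interface and the array-returning \texttt{model} are hypothetical placeholders, not any specific library's API.
\begin{verbatim}
# Hypothetical sketch of an attack-success-rate vs. perturbation-budget curve.
# `attack(model, x, y, eps)` is an assumed interface that returns adversarial
# inputs within budget eps; `model(x)` is assumed to return logits as an array.
import numpy as np

def success_curve(model, attack, x, y, epsilons):
    rates = []
    for eps in epsilons:
        x_adv = attack(model, x, y, eps)
        rates.append((model(x_adv).argmax(axis=1) != y).mean())
    rates = np.array(rates)
    # Sanity checks from the checklist: success should grow with the budget
    # and eventually reach (approximately) 100%.
    assert np.all(np.diff(rates) >= -1e-6), "success rate dropped as eps grew"
    assert rates[-1] > 0.99, "extend the x-axis until the attack reaches ~100%"
    return rates
\end{verbatim}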
\subsection{Common Pitfalls}
There are other common pitfalls that may prevent the detection of ineffective defenses.
%
This list contains some potential pitfalls which do not apply to
large categories of defenses.
%
However, when applicable, it is still important to carefully
check that the items below have been addressed correctly.
%
\begin{itemize}[leftmargin=*]
\item \S\ref{sec:whichattack} Apply a \textbf{diverse set of attacks} (especially when training on one attack approach).
\begin{itemize}[leftmargin=*]
\item Do not blindly apply multiple (nearly-identical) attack approaches.
\end{itemize}
\item \S\ref{sec:gradientfree} Try at least one \textbf{gradient-free attack} and one \textbf{hard-label attack}.
\begin{itemize}[leftmargin=*]
\item Try \cite{chen2017zoo,uesato2018adversarial,ilyas2018black,brendel2017decision}.
\item Check that the gradient-free attacks succeed less often than gradient-based attacks.
\item Carefully investigate attack hyperparameters that affect success rate.
\end{itemize}
\item \S\ref{sec:transfer} Perform a \textbf{transferability attack} using a similar substitute model.
\begin{itemize}[leftmargin=*]
\item Select a substitute model as similar to the defended model as possible.
\item Generate adversarial examples that are initially assigned high confidence.
\item Check that the transfer attack succeeds less often than white-box attacks.
\end{itemize}
\item \S\ref{sec:eot} For randomized defenses, properly \textbf{ensemble over randomness} (see the sketch after this list).
\begin{itemize}[leftmargin=*]
\item Verify that attacks succeed if randomness is assigned to one fixed value.
\item State any assumptions about adversary knowledge of randomness in the threat model.
\end{itemize}
\item \S\ref{sec:bpda} For non-differentiable components, \textbf{apply differentiable techniques}.
\begin{itemize}[leftmargin=*]
\item Discuss why non-differentiable components were necessary.
\item Verify attacks succeed on undefended model with those non-differentiable components.
\item Consider applying BPDA~\citep{athalye2018obfuscated} if applicable.
\end{itemize}
\item \S\ref{sec:converge} Verify that the \textbf{attacks have converged} under the selected hyperparameters.
\begin{itemize}[leftmargin=*]
\item Verify that doubling the number of iterations does not increase attack success rate nor significantly change the adversarial loss.
\item Plot attack effectiveness versus the number of iterations.
\item Run attacks with multiple random starting points and retain the best one.
\item Explore different choices of the step size or other attack hyperparameters.
\end{itemize}
\item \S\ref{sec:hyperparams} Carefully \textbf{investigate attack hyperparameters} and report those selected.
\begin{itemize}[leftmargin=*]
\item Start search for adversarial examples at a random offset. Try multiple random starting points for each input.
\item As for the number of attack iterations, verify that increasing the number of random restarts does not affect the attack's success rate or the adversarial loss.
\item Investigate if attack results are sensitive to any other hyperparameters.
\end{itemize}
\item \S\ref{sec:priorwork} \textbf{Compare against prior work} and explain important differences.
\begin{itemize}[leftmargin=*]
\item When contradicting prior work, clearly explain why differences occur.
\item Attempt attacks that are similar to those that defeated previous similar defenses.
\item When comparing against prior work, ensure it has not been broken.
\end{itemize}
\item \S\ref{sec:generalrobustness} Test \textbf{broader threat models} when proposing general defenses. For images:
\begin{itemize}[leftmargin=*]
\item Apply rotations and translations \citep{engstrom2017rotation}.
\item Apply common corruptions and perturbations \citep{hendrycks2018benchmarking}.
\item Add Gaussian noise of increasingly large standard deviation \citep{ford2019adversarial}.
\end{itemize}
\end{itemize}
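For the ensemble-over-randomness item above, the following sketch (ours, assuming PyTorch) differentiates the loss averaged over freshly sampled randomness rather than a single draw; \texttt{defense(x, r)} and \texttt{sample\_randomness()} are hypothetical interfaces standing in for whatever the defense actually exposes.
\begin{verbatim}
# Minimal EOT-style sketch: when the defense applies a random transformation at
# inference time, differentiate the *expected* loss over sampled randomness.
# `defense(x, r)` and `sample_randomness()` are assumed, hypothetical interfaces.
import torch
import torch.nn.functional as F

def eot_gradient(defense, sample_randomness, x, y, num_samples=30):
    x = x.clone().detach().requires_grad_(True)
    total = 0.0
    for _ in range(num_samples):
        r = sample_randomness()          # fresh randomness for each draw
        total = total + F.cross_entropy(defense(x, r), y)
    (total / num_samples).backward()     # gradient of the averaged loss
    return x.grad.detach()
\end{verbatim}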
\subsection{Special-Case Pitfalls}
The following items apply to a smaller fraction of evaluations.
%
Items presented here are included because while
they may diagnose flaws in
some defense evaluations, they are not necessary for many others.
%
In other cases, the tests presented here help provide additional evidence that the
evaluation was performed correctly.
\begin{itemize}[leftmargin=*]
\item \S\ref{sec:provable} Investigate if it is possible to use \textbf{provable approaches}.
\begin{itemize}[leftmargin=*]
\item Examine if the model is amenable to provable robustness lower-bounds.
\end{itemize}
\item \S\ref{sec:randomnoise} \textbf{Attack with random noise} of the correct norm.
\begin{itemize}[leftmargin=*]
\item For each example, try 10,000+ different choices of random noise.
\item Check that the random attacks succeed less often than white-box attacks.
\end{itemize}
\item \S\ref{sec:targeted} Use both \textbf{targeted and untargeted attacks} during evaluation.
\begin{itemize}[leftmargin=*]
\item State explicitly which attack type is being used.
\end{itemize}
\item \S\ref{sec:attacksimilar} \textbf{Perform ablation studies} with combinations of defense components removed.
\begin{itemize}[leftmargin=*]
\item Attack a similar-but-undefended model and verify attacks succeed.
\item If combining multiple defense techniques, argue why they combine usefully.
\end{itemize}
\item \S\ref{sec:benchmarkattack} \textbf{Validate any new attacks} by attacking other defenses.
\begin{itemize}[leftmargin=*]
\item Attack other defenses known to be broken and verify the attack succeeds.
\item Construct synthetic intentionally-broken models and verify the attack succeeds.
\item Release source code for any new attacks implemented.
\end{itemize}
\item \S\ref{sec:notimages} Investigate applying the defense to \textbf{domains other than images}.
\begin{itemize}[leftmargin=*]
\item State explicitly if the defense applies only to images (or another domain).
\end{itemize}
\item \mbox{\S\ref{sec:reportmeanmin} Report \textbf{per-example attack success rate}:
$\mathop{\text{mean}}\limits_{x \in \mathcal{X}} \min\limits_{a \in \mathcal{A}} f(a(x))$, not
$\mathop{\text{min}}\limits_{a \in \mathcal{A}} \mathop{\text{mean}}\limits_{x \in \mathcal{X}} f(a(x))$} (see the sketch after this list).
\end{itemize}
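The aggregation in the last item matters in practice. The toy computation below (ours, NumPy) contrasts counting an example as broken when \emph{any} attack succeeds on it with reporting only the single best attack averaged over the dataset.
\begin{verbatim}
# Toy illustration of per-example aggregation. success[a, i] is True if attack a
# fools the model on example i.
import numpy as np

success = np.array([[1, 0, 0, 1],               # attack A breaks examples 0 and 3
                    [0, 1, 0, 1]], dtype=bool)  # attack B breaks examples 1 and 3

per_example = success.any(axis=0).mean()  # example broken if any attack works: 0.75
best_single = success.mean(axis=1).max()  # success rate of the best single attack: 0.50
print(per_example, best_single)           # 0.75 vs. 0.50 on the same set of attacks
\end{verbatim}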
\section{Evaluation Recommendations}
We now expand on the above checklist and provide the rationale for each item.
\subsection{Investigate Provable Approaches}
\label{sec:provable}
With the exception of this subsection, all other advice in this
paper focuses on performing heuristic robustness evaluations.
%
Provable robustness approaches are preferable to only heuristic
ones.
%
Current provable approaches often can only be applied when the neural
network is explicitly designed with the objective of making these specific
provable techniques applicable \citep{kolter2017provable,raghunathan2018certified,WengZCSHBDD18}.
%
While this approach of designing-for-provability has seen excellent
progress---the best approaches today can certify some (small) robustness
even on ImageNet classifiers \citep{lecuyer2018certified}---often the best
heuristic defenses offer orders of magnitude better (estimated) robustness.
Proving a lower bound on a defense's
robustness guarantees that the robustness will never fall
below that level (assuming the proof is correct).
%
We believe an important direction of future research is developing approaches
that can prove robustness guarantees for arbitrary neural networks.
%
While work
in this space does exist \citep{katz2017reluplex,TjengXT19,XiaoTSM19,GowalDSBQUAMK18}, it is often computationally
intractable to verify even modestly sized neural networks.
One key limitation of provable techniques is that the proofs they
offer are generally only of the form ``for some \emph{specific} set of
examples $\mathcal{X}$, no adversarial example with distortion less than
$\varepsilon$ exists''.
%
While this is a useful statement, it
gives no guarantee about any \emph{other} example $x' \not\in \mathcal{X}$;
and because that is the property we actually care about, provable
techniques still do not offer correctness proofs in the same sense that
cryptographic algorithms are provably correct.
\subsection{Report Clean Model Accuracy}
\label{sec:cleanaccuracy}
A defense that significantly degrades the model's accuracy on the original
task (the \emph{clean} or \emph{natural} data) may not be useful
in many situations.
%
If the probability of an actual attack is very low and the cost of an
error on adversarial inputs is not
high, then it may be unacceptable to incur \emph{any} decrease in clean
accuracy.
%
Often there can be a difference in the impact of an error on a random
input and an error on an adversarially chosen input. To what extent this
is the case depends on the domain the system is being used in.
For the class of defenses that abstain when they detect
that an input is adversarial, or otherwise refuse to classify
some inputs, it is important to evaluate how this impacts accuracy on
the clean data.
%
Further, in some settings it may be acceptable to refuse to classify inputs
that have a significant amount of noise. In others, while it may be acceptable
to refuse to classify adversarial examples, simple noisy inputs must still
be classified correctly.
%
It can be helpful to generate a Receiver Operating Characteristic (ROC) curve
to show how the choice of threshold for rejecting inputs causes the clean
accuracy to decrease.
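As one possible way to produce such a curve, the sketch below (ours, NumPy) sweeps the rejection threshold of a detector-based defense; the score arrays \texttt{score\_clean} and \texttt{score\_adv} are assumed to have been computed beforehand, with higher scores meaning ``more suspicious''.
\begin{verbatim}
# Sketch: sweep the rejection threshold of a detector-based defense and record
# how many clean inputs are kept versus how many adversarial inputs are rejected.
# score_clean / score_adv are assumed pre-computed detector scores.
import numpy as np

def rejection_curve(score_clean, score_adv, num_thresholds=100):
    all_scores = np.concatenate([score_clean, score_adv])
    thresholds = np.quantile(all_scores, np.linspace(0, 1, num_thresholds))
    clean_kept = np.array([(score_clean <= t).mean() for t in thresholds])
    adv_rejected = np.array([(score_adv > t).mean() for t in thresholds])
    return clean_kept, adv_rejected   # plot one against the other (ROC-style)
\end{verbatim}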
\subsection{Focus on the Strongest Attacks Possible}
\label{sec:whichattack}
\paragraph{Use optimization-based attacks.}
Of the many different attack algorithms, optimization-based attacks
are by far the most powerful.
%
After all, they extract a significant amount of information from the model
by utilizing the gradients of some loss function and not just the predicted output.
%
In a white-box setting, there are many different attacks
that have been created, and picking almost any of them will be useful.
%
However, it is important to \emph{not} just choose an attack and apply it out-of-the-box
without modification.
%
Rather, these attacks should serve as a starting point
to which defense-specific knowledge can be applied.
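As one example of such a modification, an attack's cross-entropy objective can be replaced by a margin loss on the logits (in the style of \citet{carlini2017towards}), which often remains informative when a defense saturates or distorts the softmax probabilities. The sketch below (ours, assuming PyTorch) is a drop-in objective for an iterative optimization-based attack.
\begin{verbatim}
# Sketch of an alternative attack objective: a margin loss on the logits
# (maximize the gap between the best wrong class and the true class).
# PyTorch assumed; use in place of cross-entropy inside an iterative attack.
import torch

def margin_loss(logits, y):
    true_logit = logits.gather(1, y.unsqueeze(1)).squeeze(1)
    others = logits.clone()
    others.scatter_(1, y.unsqueeze(1), float('-inf'))  # mask out the true class
    best_other = others.max(dim=1).values
    return (best_other - true_logit).mean()            # maximize to cause errors
\end{verbatim}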