-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathJCI.tex
6019 lines (5544 loc) · 464 KB
/
JCI.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[twoside,11pt]{article}
% Any additional packages needed should be included after jmlr2e.
% Note that jmlr2e.sty includes epsfig, amssymb, natbib and graphicx,
% and defines many common macros, such as 'proof' and 'example'.
%
% It also sets the bibliographystyle to plainnat; for more information on
% natbib citation styles, see the natbib documentation, a copy of which
% is archived at http://www.jmlr.org/format/natbib.pdf
\usepackage{jmlr2e}
\usepackage{bookmark}
%\usepackage{times}
\usepackage{tikz}
%\usepackage{listings,tikz,subfigure,changepage,verbatim,colortbl,placeins}
%\usetikzlibrary{patterns}
\usetikzlibrary{arrows}
\usepackage{subfigure}
%\usepackage[english]{babel}
%\usepackage[utf8]{inputenc}
%\usepackage[colorinlistoftodos]{todonotes}
%\usepackage{centernot,algorithm,algorithmicx,algpseudocode,mathtools}
\usepackage{amsmath}
\usepackage{bm}
\usepackage{paralist}
\usepackage{comment}
%\usepackage{pdflscape}
\usepackage{bbm}
% following are helpful to move proofs to the appendix
\newtheorem{innercustomthm}{Theorem}
\newenvironment{customthm}[1]
{\renewcommand\theinnercustomthm{#1}\innercustomthm}
{\endinnercustomthm}
\newtheorem{innercustomlem}{Lemma}
\newenvironment{customlem}[1]
{\renewcommand\theinnercustomlem{#1}\innercustomlem}
{\endinnercustomlem}
\newtheorem{innercustomcor}{Corollary}
\newenvironment{customcor}[1]
{\renewcommand\theinnercustomcor{#1}\innercustomcor}
{\endinnercustomcor}
\newcommand{\defeq}{\vcentcolon=}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator*{\argmax}{arg\,max}
\definecolor{blue}{rgb}{0,0,1}
\definecolor{orange}{rgb}{1,0.5,0.1}
\definecolor{brightblue}{rgb}{0.92,0.92,1}
\definecolor{brightgrey}{rgb}{0.96,0.96,1}
\definecolor{green}{rgb}{0.2,1.0,0.2}
\definecolor{red}{rgb}{1,0,0}
%\newcommand{\CI}{\mathrel{\perp\mspace{-10mu}\perp}}
%\newcommand{\nCI}{\centernot{\CI}}
\DeclareMathOperator*{\CI}{{\,\perp\mkern-12mu\perp\,}}
\DeclareMathOperator*{\nCI}{{\,\not\mkern-1mu\perp\mkern-12mu\perp\,}}
\DeclareMathOperator*{\SEP}{\perp}
\DeclareMathOperator*{\nSEP}{\not\perp}
\newcommand\indep[4]{{#1} \CI_{#4} {#2} \given {#3}}
\newcommand\dep[4]{{#1} \nCI_{#4} {#2} \given {#3}}
\newcommand\sep[4]{{#1} \SEP_{#4} {#2} \given {#3}}
\newcommand\con[4]{{#1} \nSEP_{#4} {#2} \given {#3}}
\newcommand{\dsep}[4]{{#1} \SEP_{#4}^d {#2} \given {#3}}
\newcommand{\dcon}[4]{{#1} \nSEP_{#4}^d {#2} \given {#3}}
%\newcommand{\Dsep}{\perp^D}
%\newcommand{\Dcon}{\not \perp^D}
\newcommand{\sigmasep}[4]{{#1} \SEP_{#4}^\sigma {#2} \given {#3}}
\newcommand{\sigmacon}[4]{{#1} \nSEP_{#4}^\sigma {#2} \given {#3}}
\newcommand{\Prb}{\mathbb{P}}
\newcommand{\marg}[1]{\mathrm{marg}_{#1}}
\newcommand\idcausal{\dashrightarrow}
\newcommand\idacausal{\text{${}\not\!\dashrightarrow{}$}}
\newcommand\notimplies{\text{${}\not\!\implies{}$}}
\newcommand\B[1]{\bm{#1}}
\newcommand\C[1]{\mathcal{#1}}
\newcommand\BC[1]{\bm{\mathcal{#1}}}
%\newcommand\mathbfsc[1]{\text{\normalfont\bfseries\scshape#1}}
\newcommand\mathbfsc[1]{\text{\normalfont\scshape#1}}
\newcommand\Det[1]{\mathbfsc{Det}(#1)}
\newcommand\an[1]{\mathbfsc{an}(#1)}
\newcommand\de[1]{\mathbfsc{de}(#1)}
\newcommand\scc[1]{\mathbfsc{sc}(#1)}
\newcommand\pa[1]{\mathbfsc{pa}(#1)}
\newcommand\ch[1]{\mathbfsc{ch}(#1)}
\newcommand\ncol[1]{\mathbfsc{ncol}(#1)}
\newcommand\col[1]{\mathbfsc{col}(#1)}
\newcommand\Detsub[2]{\mathbfsc{Det}_{#2}(#1)}
\newcommand\ansub[2]{\mathbfsc{an}_{#1}(#2)}
\newcommand\desub[2]{\mathbfsc{de}_{#1}(#2)}
\newcommand\sccsub[2]{\mathbfsc{sc}_{#1}(#2)}
\newcommand\pasub[2]{\mathbfsc{pa}_{#1}(#2)}
\newcommand\chsub[2]{\mathbfsc{ch}_{#1}(#2)}
\newcommand\given{\,|\,}
\newcommand\causes{\idcausal}
\newcommand\notcauses{\idacausal}
\newcommand{\xto}[1]{\stackrel{#1}{\to}}
\newcommand\eref[1]{(\ref{#1})}
\newtheorem{assumption}{Assumption}
%\newtheorem{lemma}{Lemma}
%\newtheorem{proposition}{Proposition}
%\newtheorem{theorem}{Theorem}
%\newtheorem{corollary}{Corollary}
%\newtheorem{definition}{Definition}
%\newtheorem{conjecture}{Conjecture}
%\newcommand{\qedsymbol}{$\square$}
%\newcommand{\qed}{\hfill\qedsymbol}
%\newcommand{\qedtag}[1]{\tag*{\lower#1\hbox{\qedsymbol}}}
%\newenvironment{proof}[1][Proof.]{\noindent \textbf{#1\hskip.5em\relax}}{\qed\vskip\baselineskip}
%\newenvironment{proof*}[1][Proof.]{\noindent \textbf{#1\hskip.5em\relax}}{}
\newcommand{\mL}{\text{\L}}
\newcommand{\ot}{\leftarrow}
\newcommand{\oto}{\leftrightarrow}
\newcommand{\ots}{\leftarrow\mkern-13mu\ast\,}
\newcommand{\otc}{\leftarrow\mkern-9mu\circ\,}
\newcommand{\sto}{\,\ast\mkern-13mu\to}
\newcommand{\stt}{\,\ast\mkern-13mu\relbar\!\!\!\relbar}
\newcommand{\tts}{\relbar\!\!\!\relbar\mkern-13mu\ast\,}
\newcommand{\sts}{\,\ast\mkern-10mu\relbar\mkern-10mu\ast\,}
\newcommand{\ctc}{\,\circ\mkern-9mu\relbar\mkern-9mu\circ\,}
\newcommand{\cto}{\,\circ\mkern-9mu\rightarrow}
%\newcommand{\cto}{\vcenter{\hbox{\begin{tikzpicture}%
% \draw[o->,>=stealth] (0,0) to (3ex,0ex);%
%\end{tikzpicture}}}
%}
\newcommand{\ctt}{\,\circ\mkern-9mu\relbar\!\!\!\relbar}
\newcommand{\ttt}{\relbar\!\!\!\relbar}
\newcommand{\ttc}{\relbar\!\!\!\relbar\mkern-9mu\circ\,}
\newcommand{\RN}{\mathbb{R}}
\newcommand{\I}{\mathbbm{1}}
\newcommand{\ck}{\checkmark}
\newcommand{\intervene}{\mathrm{do}}
\newcommand{\dadb}[2]{\frac{\partial {#1}}{\partial {#2}}}
\newcommand{\MAG}{\mathrm{MAG}}
\newcommand{\PAG}{\mathrm{PAG}}
\newcommand{\CPAG}{\mathrm{CPAG}}
\newcommand{\DMAG}{\mathrm{DMAG}}
\newcommand{\DPAG}{\mathrm{DPAG}}
\newcommand{\CDPAG}{\mathrm{CDPAG}}
\newcommand{\IM}{\mathrm{IM}}
\newcommand{\alg}[1]{\texttt{#1}}
\newcommand{\boldcap}[1]{\textbf{#1}}
\DeclareSymbolFont{bbold}{U}{bbold}{m}{n}
\DeclareSymbolFontAlphabet{\mathbbold}{bbold}
\newcommand\ind{\mathbbold{1}}
\newenvironment{topalignedcolumn}[1]{\begin{minipage}[t]{#1}\vspace{0pt}\relax}{\end{minipage}}
\newcommand{\Joris}[1]{{\color{blue}#1}}
\newcommand{\Sara}[1]{{\color{purple}#1}}
\newcommand{\Tom}[1] {{\color{green}#1}}
\newcommand{\draft}[1]{{\color{gray}#1}}
\newcommand{\Spirtes}[1]{{\color{red}#1}}
%\newcommand{\Joris}[1]{}
%\newcommand{\Sara}[1]{}
%\newcommand{\Tom}[1] {}
%\newcommand{\draft}[1]{}
\newcommand{\JCIABC}{\ref{ass:uncaused}, \ref{ass:unconfounded}, \ref{ass:dependences}}
\newcommand{\JCIAB}{\ref{ass:uncaused}, \ref{ass:unconfounded}}
\newcommand{\JCIA}{\ref{ass:uncaused}}
% Heading arguments are {volume}{year}{pages}{submitted}{published}{author-full-names}
%\jmlrheading{1}{2016}{1-48}{4/00}{10/00}{Joris M.\ Mooij, Sara Magliacane and Tom Claassen}
\jmlrheading{?}{????}{??-??}{?/??}{?/??}{????????}{Joris M.\ Mooij, Sara Magliacane and Tom Claassen}
% Short headings should be running head and authors last names
%\ShortHeadings{Joint Causal Inference from Observational and Experimental Datasets}{Mooij, Magliacane, Claassen}
% The short heading has to be less than 50 chars.
\ShortHeadings{Joint Causal Inference from Multiple Contexts}{Mooij, Magliacane, Claassen}
\firstpageno{1}
\tikzstyle{var}=[circle,draw=black,fill=white,thick,minimum size=20pt,inner sep=0pt]
\tikzstyle{varh}=[circle,draw=gray,fill=white,thick,minimum size=20pt,inner sep=0pt,dashed]
\tikzstyle{arr}=[->,>=stealth',draw=black,thick]
%\tikzstyle{arr}=[->,>=stealth',draw=black,fill=black,line width=1pt]
%\tikzstyle{arrh}=[->,>=stealth',draw=gray,thin]
\tikzstyle{arrh}=[->,>=stealth',draw=gray,fill=gray,thick,dashed]
\tikzstyle{biarr}=[<->,>=stealth',draw=black,fill=black,thick]
\tikzstyle{biarrh}=[<->,>=stealth',draw=gray,fill=gray,thick,dashed]
\tikzstyle{noarr}=[draw=black,fill=black,thick]
\tikzstyle{noarrh}=[draw=gray,fill=gray,thick,dashed]
\tikzstyle{fac}=[rectangle,draw=black!50,fill=black!20,thick,minimum size=10pt] %or 15pt
\tikzstyle{varc}=[rectangle,draw=black,fill=white,thick,minimum size=20pt,inner sep=0pt]
\tikzstyle{varch}=[rectangle,draw=black,fill=white,thick,minimum size=20pt,inner sep=0pt,dashed]
\graphicspath{{fig/},{code/experiments/simul/plots/png/},{code/experiments/sachs/plots/}}
% For listing ASP code
\usepackage{listings}
\lstdefinestyle{ASP}{
belowcaptionskip=1\baselineskip,
breaklines=true,
frame=L,
xleftmargin=\parindent,
language=psl,
showstringspaces=false,
basicstyle=\script\ttfamily,
keywordstyle=\bfseries\color{green!40!black},
commentstyle=\itshape\color{purple!40!black},
identifierstyle=\color{blue},
stringstyle=\color{orange},
}
\lstset{
numbers=left, numberstyle=\tiny,
language = Prolog,
breaklines = false,
% frame = trbl,
tabsize = 1,
captionpos = b,
morekeywords={causes(Z,X)}
morecomment=\color{green}[l]//,%
morecomment=\color{green}[s]{/*}{*/},%
commentstyle=\color{green},
% morestring=[b]�,
aboveskip = 1em,
% belowskip = 1em,
belowskip = 0em,
% xleftmargin = 0.2em,
% xrightmargin = 0.2em,
% framexleftmargin = 0.1em,
% framexrightmargin = 0.1em,
% framextopmargin = 0.1em,
% framexbottommargin = 0.1em,
showstringspaces = false,
basicstyle = \scriptsize\ttfamily,
keywordstyle = \color{blue}\bfseries,
stringstyle = \color{orange},
% backgroundcolor = \color{brightblue}
}
\begin{document}
\title{Joint Causal Inference from Multiple Contexts}
\author{\name Joris M. Mooij \email [email protected]\\
\addr Institute for Informatics, University of Amsterdam\\
\addr Amsterdam, The Netherlands
\AND
\name Sara Magliacane \email [email protected]\\
\addr MIT-IBM Watson AI Lab, IBM Research\\
\addr Cambridge, United States\\[0.5\baselineskip]
\addr Institute for Informatics, University of Amsterdam\\
\addr Amsterdam, The Netherlands
\AND
\name Tom Claassen \email [email protected]\\
\addr Radboud University Nijmegen\\
\addr Nijmegen, The Netherlands\\[0.5\baselineskip]
\addr Institute for Informatics, University of Amsterdam\\
\addr Amsterdam, The Netherlands
}
\editor{Peter Spirtes}
\maketitle
\begin{abstract}%
The gold standard for discovering causal relations is by means of experimentation.
Over the last decades, alternative methods have been proposed that can
infer causal relations between variables from certain
statistical patterns in purely observational data.
%\Sara{Something to clarify we don't claim to be the first paper to unify the two approaches? I know we don't, but a naive reading of the next sentence may suggest that.}.
We introduce \emph{Joint Causal
Inference (JCI)}, a novel approach to causal discovery from
multiple data sets from different contexts that elegantly unifies both approaches. JCI is a causal modeling
framework rather than a specific algorithm, and it can be implemented using any
causal discovery algorithm that can take into account certain background knowledge.
%\Sara{(possibly we can even make this optional?: that can take into account certain background knowledge)}.
%The main idea is to reduce causal discovery from multiple data sets originating from different contexts (e.g., different experimental conditions) to causal discovery from a single pooled data set by adding auxiliary context variables and incorporating applicable background knowledge on the causal relationships involving the context variables.
JCI can deal with different types of interventions (e.g., perfect, imperfect, stochastic, etc.)
in a unified fashion, does
not require knowledge on intervention targets or types in case of interventional data.
%, and allows one to fully exploit all the information in the joint distribution on system and context variables.
%exploiting differences in distribution between contexts it improves on
%the accuracy and identifiability of the predicted causal relations \Joris{rephrase?}.
We explain how several well-known causal discovery algorithms can be seen as addressing special
cases of the JCI framework, and we also propose novel implementations that extend
existing causal discovery methods for purely observational data to the JCI setting.
We evaluate different implementations of the JCI approach on synthetic data
and on flow cytometry protein expression data and conclude that JCI implementations can
considerably outperform state-of-the-art causal discovery algorithms.
%We evaluate the approach on flow cytometry data.
%\Tom{state something like 'exploit information that other approaches cannot'}
\end{abstract}
\begin{keywords}
Causal Discovery, Structure Learning, Observational and Experimental Data, Interventions,
Randomized Controlled Trials
%\Sara{I would remove the following two: Randomized Controlled Trials, A-B testing, unless we start estimating causal effects.}\Joris{In many randomized controlled trials that are reported in the medical literature, what is reported is the p-value for the existence of a causal effect. So I don't see why one would assume that RCT implies estimating causal effect size.}
\end{keywords}
\section{Introduction}\label{sec:introduction}
\Joris{Add link to code repository.}
%\Sara{I also think calling it a generalized/extended Randomized Controlled Trial gives the wrong impression (known graph, estimating causal effect between treatment and outcome, possibly no confounders), unless we specifically mention that it is not. I would rather say that the RCT setting (learning the graph, as opposed to the problem of estimating the causal effect) is a special case of JCI (or something a bit more subtle), and keep the example later on, as long as we are precise about it. What do you think?}\Joris{I don't know why you have those associations with RCT? (known graph --> one of the important questions is often: is there an effect of treatment on outcome?; estimating causal effect between treatment and outcome: yes, or simply the existence of a causal effect; possibly no confounders: we also allow no confounders). The RCT assumptions in Proposition 1 are in complete analogy with the JCI assumptions - actually, they are the special case of the JCI assumptions for 1 context variable and 1 system variable. Furthermore, this is exactly what methods like COMBINE cannot do: generalize this RCT style causal reasoning/discovery (they forget to compare across contexts). So I think calling it a generalized RCT gives exactly the right impression. So it is a huge plus that we can say that JCI generalizes RCT: modern medicine is based on the RCT concept, whereas as far as I know there is no scientific discipline that is based on causal discovery from purely observational data (except our own ivory tower, plus possibly longitudinal studies in epidemiology, but as we have seen, its reliability is debated---David Madigan's Sackler colloquium talk). So I'd rather say that it generalizes RCT because we gain the positive association of ``gold standard for causal discovery'' than with causal discovery from observational data as it gives us the negative assocition of ``interesting on paper and works in simulations but worthless in the real world''.}
The aim of causal discovery is to learn the causal relations between variables
of a system of interest from data. As a simple example, suppose a researcher
wants to find out whether playing violent computer games causes aggressive
behavior. She gathers observational data by taking a sample from several high
schools in different countries and observes a significant correlation between the daily amount of hours
spent on playing violent computer games, and aggressive behavior at
school (see also Figure~\ref{fig:example}).
This in itself does not yet imply a causal relation between the two in either direction.
Indeed, an alternative explanation of the observed correlation could be the
presence of a confounder (a latent common cause), for example, a genetic
predisposition towards violence that makes the carrier particularly enjoy such
games and also make him behave more aggressively.
The most reliable way to establish whether playing violent computer games
causes aggressive behavior, is to use \emph{experimentation}, for example by
means of a randomized controlled trial \citep{Fisher1935}.
This would imply assigning each pupil to one out of two groups randomly,
where the pupils in one group are forced to play violent computer games for
several hours a day,
while the pupils in the other group are forced to abstain from playing those games.
After several months, the aggressive behavior in both groups is measured.
If a significant correlation between group and outcome is observed (or
equivalently, the outcome is significantly different between the two groups)
it can then be concluded that playing violent computer
games indeed causes aggressive behavior.
Given the ethical and practical problems that such an experiment would involve,
one might wonder whether there are alternative ways to answer this question. One such
alternative is to combine data from different
contexts. For example, in some countries the government may have decided
to forbid certain ultra-violent games from being sold. In addition, some schools may have introduced
certain measures to discourage aggressive behavior. By combining the data
from these different contexts in an appropriate way, one may be able to identify
the presence or absence of a causal effect of playing violent computer games
on aggressive behavior. For example, in the setting of Figure~\ref{fig:example}(c),
the causal relationship between the two variables of interest turns out to
be identifiable from conditional independence relationships in pooled data
from all the contexts.
In particular, in that case the observed correlation between playing violent
computer games and aggressive behavior could be unambiguously attributed to a
causal effect of one on the other, just from \emph{combining} two readily
available data sets, \emph{without} the need for an impractical experiment.
In this paper, we propose a simple and general way to combine data sets from
different contexts that enables one to draw such strong causal conclusions.
%These contexts can also correspond to different experimental
%(intervention) settings, allowing one to jointly analyze
%(observational and) interventional data. As a special case, our approach
%recovers the Randomized Controlled Trial setting.
\begin{figure}\centering%
\begin{tikzpicture}
\begin{scope}
\node at (-1,2) {(a)};
\node[var] (X1) at (0,1) {$X_1$};
\node[var] (X2) at (2,1) {$X_2$};
\draw[arr] (X1) -- (X2);
\end{scope}
\begin{scope}[xshift=5cm]
\node at (-1,2) {(b)};
\node[var] (X1) at (0,1) {$X_1$};
\node[var] (X2) at (2,1) {$X_2$};
\draw[biarr] (X1) -- (X2);
\end{scope}
\begin{scope}[xshift=10cm]
\node at (-1,2) {(c)};
\node[var] (X1) at (0,0) {$X_1$};
\node[var] (X2) at (2,0) {$X_2$};
\draw[arr] (X1) -- (X2);
\node[var] (C1) at (0,2) {$C_\alpha$};
\node[var] (C2) at (2,2) {$C_\beta$};
\draw (-0.5,1) edge[dotted] (2.5,1);
\draw[biarr] (C1) -- (C2);
\draw[arr] (C1) -- (X1);
\draw[biarr] (C2) -- (X2);
\end{scope}
\end{tikzpicture}
\caption{Different causal graphs relating $X_1$, the daily amount of hours spent on playing violent computer games, and $X_2$, a measure of aggressive behavior. (a) Playing violent computer games causes aggressive behavior; (b) The observed correlation between $X_1$ and $X_2$ is explained by a latent confounder, e.g., a genetic predisposition towards violence. (c) Hypothetical causal graph also involving context variables $C_\alpha$, which indicates whether ultra-violent games have been
banned by the government, and $C_\beta$, which indicates school interventions to stimulate social behavior. Without considering contexts, it is not possible to distinguish between (a) and (b) based on conditional independences in the data. In scenario (c), JCI allows one to infer from conditional independences in the pooled data that $X_1$ causes $X_2$ and that $X_1$ and $X_2$ are not confounded (assuming that context variables $C_\alpha$ and $C_\beta$ are not caused by system variables $X_1$ and $X_2$).\label{fig:example}}
\end{figure}
%The gold standard for causal discovery is provided by randomized experiments \citep{Fisher1935}.
%Therefore, a common approach to deal with such problems would
%be to ignore the observational studies in case unmeasured confounders cannot
%be ruled out, and analyse the randomized controlled trials separately, combining the
%conclusions of those separate analysis using meta-analysis techniques.
%%\Sara{Are we sure observational data are discarded? I can check with some statisticians to their usual practices.}
%%\Joris{Pretty sure (added qualification `in case unmeasured confounders cannot be ruled out'). Double-check welcome.}
%Such an approach may not be optimal as it does not fully exploit the available
%data.
%%\Tom{THIS is something we should show in an example}
%%\Joris{Agreed, or even better, in the simulations on synthetic data.}
%Furthermore, it is not clear how randomized controlled trial data from different
%subpopulations should be combined in a principled way.
%%\Joris{Is \url{https://arxiv.org/abs/1803.06048} interesting to cite?}
While experimentation is still the gold standard to establish causal relationships,
researchers realized in the early nineties that there are other methods that require
only \emph{purely observational} data \citep{SGS2000,Pearl2009}. Many methods for causal
discovery from purely observational data have been proposed over the last decades,
relying on different assumptions.
%Alternatively, causal discovery methods that do not require experimentation could be applied.
These can be roughly divided into \emph{constraint-based} causal discovery methods,
such as the PC \citep{SGS2000}, IC \citep{Pearl2009} and FCI algorithms \citep{SMR1999,Zhang2008_AI},
\emph{score-based} causal discovery methods \citep[e.g.,][]{CooperHerskovits1992,HGC1995,Chickering2002,KoivistoSood2004},
and methods exploiting other statistical patterns in the joint distribution \citep[e.g.,][]{Mooij++_JMLR_16,PJS2017}.
Originally, these methods were designed to estimate the causal graph of the system
from a single data set corresponding to a single (purely observational) context.
More recently, various causal discovery methods have been proposed that extend these techniques
to deal with multiple data sets from different contexts.
As an example, the data sets may correspond with a baseline
of purely observational data of measurements concerning the ``natural'' state of the
system, and data of measurements under different perturbations of the system caused
by external interventions on the system.\footnote{In
certain parts of the causal discovery literature, the word ``intervention'' has become
synonymous to ``surgical/atomic/hard/perfect'' intervention (i.e., an intervention that precisely sets a variable or set of variables
to a certain value without directly affecting the other variables in the system), but in this work we use it in the more
general meaning of any external perturbation of the system.} More generally, they can correspond to measurements
of the system in different environments.
These methods can be divided into two main approaches:
\begin{enumerate}[(a)]
%\item methods that construct causal graphs for each context separately and then combine them into one context-independent causal graph \citep{Claassen++_NIPS2010},
%\item methods that obtain statistics from each context separately and then construct a single context-independent causal graph by combining these statistics \citep{IOD2011,Hyttinen++2012,HEJ2014,triantafillou2015constraint,Rothenhausler++2015},
\item methods that obtain statistics or constraints from each context separately and then construct a single context-independent causal graph by combining these statistics \citep{Claassen++_NIPS2010,IOD2011,Hyttinen++2012,HEJ2014,triantafillou2015constraint,Rothenhausler++2015,ForreMooij_UAI_18}, but never directly compare data from different contexts;
\item methods that pool all data and construct a single context-independent causal graph directly from the pooled data
\citep{Cooper1997,CooperYoo1999,TianPearl2001,SPP05,EatonMurphy07,Trigger2007,GIES2012,MooijHeskes_UAI_13,ICP2016,oates2016estimating,Zhang++_IJCAI17}.
\end{enumerate}
In this paper, we introduce \emph{Joint Causal Inference (JCI)}, a framework for causal modeling
of a system in different contexts and for causal discovery from multiple data sets consisting of
measurements obtained in different contexts, which takes the latter approach.
As will be discussed in more detail in Section~\ref{sec:related_work},
JCI is the most generally applicable of those approaches---for example, it allows for the
presence of latent confounders and cyclic causal relationships---and also offers most flexibility in terms of
its implementation.
While the ingredients of the JCI framework are not novel, the added value of the
framework is that on the one hand it
arrives at a unifying description of a diverse spectrum of existing approaches, while on the
other hand it serves to inspire new implementations, such as the adaptations of FCI that we
propose in this work. Technically, this is achieved by formulating the problem in terms of
a standard Structural Causal Model that considers system and environment as subsystems of one
joint system, rather than inventing novel types of representations in which the system is
modeled conditionally on its environment. This allows us to apply the standard notion of
statistical independence in the same ways as has been done so successfully for the
purely observational setting.
The key idea of JCI is to (i) consider auxiliary context variables that describe the context of each data set, (ii) pool all the data from different contexts, including the values of the context variables, and finally (iii) apply standard
causal discovery methods to the pooled data, incorporating appropriate background knowledge on the
causal relationships involving the context variables. The framework is simple and very generally
applicable as it allows
one to deal with latent confounding and cycles (if the causal discovery method supports this),
and various types of interventions in a unified way.
It does not require background knowledge on the intervention types and targets, making it very
suitable to the application on complex systems in which the effects of certain interventions are
not known \emph{a priori}, a situation that often occurs in practice.
JCI can be implemented using any causal discovery method that can incorporate the appropriate
background knowledge on the relationships between context and system variables.
This allows one to benefit from the availability of sophisticated
and powerful causal discovery methods that have been primarily designed for a single data set
from a single context by extending their application domain to the setting of multiple data sets
from multiple contexts. For example, we will show in this work how FCI \citep{SMR1999,Zhang2008_AI} can easily be adapted to the
JCI setting. At the same time, JCI accommodates various well-known causal discovery
methods as special cases, such as the standard randomized controlled trial setting \citep{Fisher1935}, LCD \citep{Cooper1997}
and ICP \citep{ICP2016}.
By explicitly introducing the context variables and treating them analogously to the system
variables (but with additional background knowledge about their causal relations with the system variables), JCI makes
it possible to elegantly combine the principles of
causal discovery from experimentation with those of causal discovery from purely observational data.
%Therefore, JCI can be viewed as a generalization of the principle of randomized controlled trials
%to multiple outcome variables,
%\Tom{not sure I follow this line of reasoning}\Joris{hopefully you do now, after reading the section on RCTs?}
%\Joris{Tom missed an intermediate step: multiple context variables but single outcome/system variable has
%already been studied extensively in \emph{experimental design}.}
%\Tom{To emphasize this more (I missed it originally), add a figure with four case: outcomes/systems (1/1 = RCT,
%0/more = purely observational, more/1=experimental design, more/more=JCI)}
%while at the same time (but from a different perspective), JCI can be viewed as a generalization of causal
%discovery from a single data set to causal discovery from multiple data sets corresponding to different contexts.
%Table~\ref{tab:overview} gives an overview of the different approaches and their applicability domain.
%\begin{table}
%\caption{Different approaches to causal discovery and their applicability domain in terms of the number of context and system variables.\label{tab:overview}}
%\centerline{\begin{tabular}{lll}
%\# Context Variables & \# System Variables & Name of Causal Discovery Approach \\
%\hline
%1 & 1 & Randomized Controlled Trial \\
%$> 1$ & 1 & Experimental Design \\
%0 & $\ge 1$ & Purely Observational Causal Discovery \\
%$\ge 0$ & $\ge 1$ & Joint Causal Inference
%\end{tabular}}
%\end{table}
This paper is structured as follows. In Section~\ref{sec:background} we describe the relevant causal
modeling and discovery concepts and define terminology and notation. In Section~\ref{sec:JCI} we
introduce the JCI framework and modeling assumptions. In Section~\ref{sec:causal_discovery},
we show how JCI can be implemented using various causal discovery methods,
and compare it with related work. In Section~\ref{sec:experiments} we
report experimental results on synthetic and flow cytometry data. We conclude in Section~\ref{sec:conclusion}.
\section{Background}\label{sec:background}
In this section, we present the background material on which we will base our
exposition. We start in Section~\ref{sec:graphical_causal_modeling}
with a brief subsection stating the basic definitions and
results in the field of graphical causal modeling that
we will use in this paper. In
addition to covering material that is standard in the field,
we also discuss more recent extensions to the cyclic setting.\footnote{We realize that
this material is quite densely presented here, but because the cyclic case is in many
ways very analogous to the acyclic case, we decided to treat both cases in parallel rather
than sequentially, in order to avoid making the paper longer than necessary. For a more
detailed treatment of cyclic causal modeling, we refer the reader to \citet{Bongers++_1611.06221v2}.}
Then, in Section~\ref{sec:CDexp}, we discuss the key
idea of causal discovery from experimentation (in the setting of a randomized
controlled trial, or A/B-testing) in these terms. We finish with Section~\ref{sec:CDobs}
that briefly illustrates the basic idea underlying constraint-based causal discovery
from purely observational data in a simple setting.
\subsection{Graphical Causal Modeling}\label{sec:graphical_causal_modeling}
%\Sara{Quite densely written, I can maybe add a summarizing table with all the notations, at least for SCMs, e.g. $\C{I}$ endogenous variable indexes, etc.}\Joris{Table could be done, but has absolutely no priority right now.}
We briefly summarize some basic definitions and results in the field of graphical causal modeling.
For more details, we refer the reader to \citet{Pearl2009} and \citet{Bongers++_1611.06221v2}.
\subsubsection{Directed Mixed Graphs}
A \emph{Directed Mixed Graph} (DMG) is a graph $\C{G} = \langle \C{V},\C{E},\C{F} \rangle$ with
nodes $\C{V}$ and two types of edges: \emph{directed} edges $\C{E} \subseteq \C{V}^2$, and
\emph{bidirected} edges $\C{F} \subseteq \{\{i,j\} : i, j \in \C{V}, i \ne j\}$.
We will denote a directed edge $(i,j) \in \C{E}$ as $i \rightarrow j$ or $j \ot i$, and
call $i$ a \emph{parent} of $j$ and $j$ a \emph{child} of $i$. We denote all parents of
$j$ as $\pasub{\C{G}}{j} := \{i \in \C{V} : i \to j \in \C{E}\}$, and all children of $i$ as
$\chsub{\C{G}}{i} := \{j \in \C{V} : i \to j \in \C{E}\}$. We allow for self-cycles $i \to i$,
so a variable can be its own parent and child.
We will denote a bidirected edge $\{i,j\} \in \C{F}$ as $i \oto j$ or $j \oto i$,
and call $i$ and $j$ \emph{spouses}.
Two nodes $i,j \in \C{V}$ are called \emph{adjacent in $\C{G}$} if they are connected by an edge (or multiple edges), i.e.,
if $i\to j \in \C{E}$ or $i \ot j \in \C{E}$ or $i \oto j \in \C{F}$.
For a subset of nodes $\C{W} \subseteq \C{V}$, we define the \emph{induced subgraph} $\C{G}_{\C{W}}
:= (\C{W},\C{E}\cap \C{W}^2,\C{F}\cap \{\{i,j\} : i,j \in \C{W}, i \ne j\})$, i.e.,
with nodes $\C{W}$ and exactly those edges of $\C{G}$ that connect two nodes in $\C{W}$.
A \emph{walk between $i,j\in\C{V}$} is a tuple $\langle i_0,e_1,i_1,e_2,i_3,\dots,e_{n-1},i_n \rangle$
of alternating nodes and edges in $\C{G}$ ($n \ge 0$), such that all $i_0,\dots,i_n \in \C{V}$,
all $e_1,\dots,e_{n-1} \in \C{E} \cup \C{F}$, starting with node $i_0=i$ and ending with node $i_n=j$,
and such that for all $k=1,\dots,n-1$, the edge $e_k$ connects the two nodes $i_{k-1}$ and
$i_k$ in $\C{G}$. If the walk contains each node at most once, it is called a \emph{path}.
%$e_1,\dots,e_n$ in $\C{E} \cup \C{F}$ such that the first node of $e_1$ equals $i$, the last node of $e_n$ equals $j$,
%and for all $k=1,\dots,n-1$, the last node of edge $e_k$ equals the first node of edge $e_{k+1}$.
A \emph{trivial walk (path)} consists just of a single node and zero edges.
A \emph{directed walk (path) from $i \in \C{V}$ to $j \in \C{V}$} is a walk (path) between $i$ and $j$ such that every edge
$e_k$ on the walk (path) is of the form $i_{k-1} \to i_k$, i.e., every edge is directed and points away from $i$.
By repeatedly taking parents, we obtain the \emph{ancestors} of $j$:
$\ansub{\C{G}}{j} := \{i \in \C{V} : i = i_0 \to i_1 \to i_2 \to \dots \to i_k = j \text{ in } \C{G}\}$.
Similarly, we define the \emph{descendants} of $i$: $\desub{\C{G}}{i} := \{j \in \C{V}: i = i_0 \to i_1 \to i_2 \to \dots \to i_k = j \text{ in } \C{G}\}$. In particular, each node is ancestor and descendant of itself.
A \emph{directed cycle} is a directed path from $i$ to $j$ such that in addition, $j \to i \in \C{E}$.
An \emph{almost directed cycle} is a directed path from $i$ to $j$ such that in addition, $j \oto i \in \C{F}$.
All nodes on directed cycles passing through $i \in \C{V}$ together form the \emph{strongly-connected component}
$\sccsub{\C{G}}{i}:= \ansub{\C{G}}{i} \cap \desub{\C{G}}{i}$ of $i$.
We extend the definitions to sets $I \subseteq \C{V}$ by setting $\ansub{\C{G}}{I} := \cup_{i\in I} \ansub{\C{G}}{i}$,
and similarly for $\desub{\C{G}}{I}$ and $\sccsub{\C{G}}{I}$. A directed mixed graph $\C{G}$
is \emph{acyclic} if it does not contain any directed cycle, in which case it is known as an
\emph{Acyclic Directed Mixed Graph (ADMG)}. A directed mixed graph that does not contain
bidirected edges is known as a \emph{Directed Graph (DG)}. If a directed mixed graph does not
contain bidirected edges and is acyclic, it is called a \emph{Directed Acyclic Graph (DAG)}.
A subpath (subwalk) $\langle i_{k-1},e_k,i_k,e_{k+1},i_{k+1} \rangle$ of a path
(walk) $\langle i_0,e_1,i_1,e_2,i_3,\dots,e_{n-1},i_n \rangle$ in $\C{V}$ is said to
form a \emph{collider on $i_k$}
if the two edges $e_k,e_{k+1}$ meet head-to-head on their shared node $i_k$ (i.e., if the two
subsequent edges are of the form $i_{k-1} \to i_k \ot i_{k+1}$, $i_{k-1} \oto i_k \ot i_{k+1}$,
$i_{k-1} \to i_k \oto i_{k+1}$, or $i_{k-1} \oto i_k \oto i_{k+1}$). Otherwise, the subpath
(subwalk) is called a \emph{non-collider on $i_k$} (i.e., if the two
subsequent edges are of the form $i_{k-1} \to i_k \to i_{k+1}$, $i_{k-1} \ot i_k \ot i_{k+1}$,
$i_{k-1} \ot i_k \to i_{k+1}$, $i_{k-1} \oto i_k \to i_{k+1}$, or $i_{k-1} \ot i_k \oto i_{k+1}$).
Also the end points $i_0, i_n$ of the walk $\pi$ will be referred to as non-colliders on $\pi$.
We will denote the colliders on a walk $\pi$ as $\col{\pi}$ and the non-colliders on $\pi$
(including the end-points of $\pi$) as $\ncol{\pi}$.
A triple of nodes $\langle i,j,k \rangle$ in $\C{G}$ such that $i$ is adjacent to $j$ and $j$ is adjacent
to $k$ is said to be \emph{unshielded} if $i$ is not adjacent to $k$ in $\C{G}$.
\subsubsection{Structural Causal Models}
Directed Mixed Graphs form a convenient graphical representation for
variables (labelled by the nodes) and their functional relations (expressed
by the edges) in a \emph{Structural Causal Model (SCM)} \citep{Pearl2009},
also known as a (non-parametric) \emph{Structural Equation Model (SEM)} \citep{Wright1921}.
Several slightly different definitions of SCMs have been proposed in the literature, which
all have their (dis)advantages. Here we use a variant of the definition in
\citet{Bongers++_1611.06221v2} that is most convenient for our purposes here.
The reason we use SCMs to formulate JCI (rather than for example the
more well-known causal Bayesian networks) is that SCMs are expressive enough
to deal both with latent common causes and with cyclic causal relationships.
\begin{definition}
A Structural Causal Model (SCM) is a tuple $\C{M} = \langle \C{I}, \C{J}, \C{H}, \BC{X}, \BC{E}, \B{f}, \Prb_{\BC{E}} \rangle$ of:
\begin{compactenum}[(i)]
\item a finite index set $\C{I}$ for the endogenous variables in the model;
\item a finite index set $\C{J}$ for the latent exogenous variables in the model;
\item a directed graph $\C{H}$ with nodes $\C{I} \cup \C{J}$, and directed edges pointing from
$\C{I} \cup \C{J}$ to $\C{I}$;
\item a product of Borel\footnote{A \emph{Borel space} is both a measurable and a topological space, such that
the sigma-algebra is generated by the open sets. Most spaces that one encounters in applications as the domain of a random variable are Borel spaces.} spaces $\BC{X} = \prod_{i \in \C{I}} \C{X}_i$, which define the
domains of the endogenous variables;
\item a product of Borel spaces $\BC{E} = \prod_{j \in \C{J}} \C{E}_j$, which define the
domains of the exogenous variables;
\item a measurable function $\B{f} : \BC{X} \times \BC{E} \to \BC{X}$, the \emph{causal
mechanism}, such that each of its components $f_i$ only depends on a particular subset of the variables, as
specified by the directed graph $\C{H}$:
$$f_i : \BC{X}_{\pasub{\C{H}}{i} \cap \C{I}} \times \BC{E}_{\pasub{\C{H}}{i} \cap \C{J}} \to \C{X}_i, \qquad i \in \C{I};$$
\item a product probability measure $\Prb_{\BC{E}} = \prod_{j \in \C{J}} \Prb_{\C{E}_j}$ on $\BC{E}$ specifying
the \emph{exogenous distribution}.
\end{compactenum}
\end{definition}
An SCM is often specified informally by specifying only the structural equations and the
density\footnote{We denote a probability measure (or distribution) of a random variable $\B{X}$ by
$\Prb(\B{X})$, and a density of $\B{X}$ with respect to some product measure by $p(\B{X})$.} of the exogenous distribution with respect to some product measure, for example:
$$\C{M}: \begin{cases}
X_i &= f_i(\B{X}_{\pasub{\C{H}}{i} \cap \C{I}}, \B{E}_{\pasub{\C{H}}{i} \cap \C{J}}) \\
p(\B{E}) & = \prod_{j\in\C{J}} p(E_j).
\end{cases}$$
In discussing the properties of SCMs, the graphical representation of various objects and their relations
in Figure~\ref{fig:scms} may be helpful. This shows how the SCM is the basic object containing all information, and how other
representations can be derived from the SCM. In the rest of this section, we will discuss this in more detail.
\begin{figure}\centering
\begin{tikzpicture}
\node[draw=black] (SCM) at (-0.5,0.5) {Simple SCM};
\node[draw=black] (iSCM) at (4,1) {Intervened SCM};
\node[draw=black] (iP) at (9,1) {Interventional Distribution};
\node[draw=black] (mSCM) at (4,0) {Marginal SCM};
\node[draw=black] (afG) at (2,-1) {Augmented Graph};
\node[draw=black] (fG) at (4,-2.25) {Graph};
\node[draw=black] (Seps) at (3.3,-3.5) {$d$/$\sigma$-separations};
% \node (=) at (5.9,-2.2) {$\supseteq$};
% \node[draw=black] (cG) at (8.5,-2.2) {Causal Graph};
\node[draw=black] (ccR) at (8,-1) {Latent Confounders};
% \node[draw=black] (aR) at (7,-5.5) {Ancestral Relations};
\node[draw=black] (cR) at (8,-2.25) {Direct Causes};
\node[draw=black] (icR) at (8,-3.5) {Causal Relations};
\node[draw=black] (P) at (-1,-3.5) {Observational Distribution};
\node[draw=black] (CI) at (1,-6) {(Conditional) Independences};
% \node[draw=black] (MAG) at (7.5,-5) {Maximal Ancestral Graph};
% \node[draw=black] (PAG) at (7.5,-6.5) {Complete Partial Ancestral Graph};
% \draw[->,thick] (icR) to (MAG);
% \draw[->,thick] (Seps) to (MAG);
% \draw[->,thick] (MAG) to (PAG);
\draw[->,thick] (afG) -- (fG);
\draw[->,thick] (SCM) -- (afG);
\draw[->,thick] (SCM) -- (iSCM);
\draw[->,thick] (SCM) -- (mSCM);
\draw[->,thick] (SCM) -- (P);
\draw[->,thick] (iSCM) -- (iP);
\draw[->,thick,bend right] (P) edge (CI);
\draw[->,thick] (fG) edge (Seps);
% \draw[->,thick,bend right=15] (Seps) to (PAG);
\draw[->,thick,bend left=10] (Seps) edge [anchor=west] node[text width=1.5cm,yshift=-3mm] {Markov Property} (CI);
\draw[->,thick,bend left=10,dashed] (CI) edge [anchor=east,xshift=-1.2cm] node {Faithfulness} (Seps);
% \draw[->,thick] (cG) edge (cR);
\draw[->,thick] (fG) edge (cR);
\draw[->,thick] (fG) edge (ccR);
\draw[->,thick] (cR) edge (icR);
% \draw[->,thick] (fG) -- (aR);
% \draw[->,thick] (aR) -- (icR);
\end{tikzpicture}
\caption{Relationships between various representations for simple SCMs. Directed edges represent mappings.
Intervened and marginal SCMs are always defined and are also simple.\label{fig:scms}}
\end{figure}
We refer to the graph $\C{H}$ as the \emph{augmented graph} of $\C{M}$.
The \emph{graph} of $\C{M}$, denoted $\C{G}(\C{M})$, is the directed mixed graph
with nodes $\C{I}$, directed edges $i_1 \to i_2$ iff $i_1 \to i_2 \in \C{H}$, and bidirected edges
$i_1 \oto i_2$ iff there exists $j \in \pasub{\C{H}}{i_1} \cap \pasub{\C{H}}{i_2} \cap \C{J}$.\footnote{This
definition of graph makes a slight simplification:
a more precise definition would leave out edges that are redundant.
For example, if the structural equation for $X_2$ reads $X_2 = 0 \cdot X_1 + X_3$ it could be that $1 \to 2 \in \C{H}$,
but this edge would not appear in $\C{G}(\C{M})$.
For the rigorous version of this definition, see \citet{Bongers++_1611.06221v2}.}
While the augmented graph $\C{H}$ shows in detail the functional dependence of endogenous variables
on the (independent) exogenous variables, the graph $\C{G}(\C{M})$ provides an abstraction by not including
the exogenous variables explicitly, but using bidirected edges to represent any shared dependence of
pairs of endogenous variables on a common exogenous parent.
If $\C{G}(\C{M})$ is acyclic, we call the SCM $\C{M}$ \emph{acyclic}, otherwise we call the SCM \emph{cyclic}.
If $\C{G}(\C{M})$ contains no bidirected edges, we call the endogenous variables in the SCM $\C{M}$ \emph{causally sufficient}.
A pair of random variables $(\B{X},\B{E})$ is called a \emph{solution} of the SCM $\C{M}$ if
$\B{X} = (X_i)_{i \in \C{I}}$ with $X_i \in \C{X}_i$ for all $i \in \C{I}$,
$\B{E} = (E_j)_{j \in \C{J}}$ with $E_j \in \C{E}_j$ for all $j \in \C{J}$,
the distribution $\Prb(\B{E})$ is equal to the exogenous distribution $\Prb_{\BC{E}}$, and
the \emph{structural equations}:
$$X_i = f_i(\B{X}_{\pasub{\C{H}}{i} \cap \C{I}}, \B{E}_{\pasub{\C{H}}{i} \cap \C{J}})\quad\text{a.s.}$$
hold for all $i \in \C{I}$.
For acyclic SCMs, solutions exist and have a unique distribution that is determined by the SCM.
This is not generally the case in cyclic SCMs, as discussed in detail by \citet{Bongers++_1611.06221v2}.
%Indeed, cyclic SCMs could have no solution at all, or could have multiple solutions with different distributions.
\begin{definition}\label{def:unique_solvability_wrt}
An SCM $\C{M}$ is said to be \emph{uniquely solvable w.r.t.\ $\C{O} \subseteq \C{I}$} if there exists
a measurable mapping $\B{g}_{\C{O}} : \BC{X}_{(\pasub{\C{H}}{\C{O}}\setminus\C{O})\cap\C{I}} \times \BC{E}_{\pasub{\C{H}}{\C{O}} \cap \C{J}} \to \BC{X}_{\C{O}}$
such that for $\Prb_{\BC{E}}$-almost every $\B{e}$ for all $\B{x} \in \BC{X}$:
$$
\B{x}_{\C{O}} = \B{g}_{\C{O}}(\B{x}_{(\pasub{\C{H}}{\C{O}}\setminus\C{O})\cap\C{I}}, \B{e}_{\pasub{\C{H}}{\C{O}}\cap\C{J}})
\quad\iff\quad \B{x}_{\C{O}} = \B{f}_{\C{O}}(\B{x},\B{e}) \,.
$$
(Loosely speaking: the structural equations for $\C{O}$ have a unique solution for $\B{X}_{\C{O}}$ in terms of the other variables appearing in those equations.)
\end{definition}
If $\C{M}$ is uniquely solvable with respect to $\C{I}$, then it induces a unique \emph{observational distribution
$\Prb_{\C{M}}(\B{X})$}. In general, the SCM might not induce an observational distribution at all, or induce multiple ones.
Given an SCM that models a certain system, we can model the system after an idealized intervention in which an external
influence enforces a subset of endogenous variables to take on certain values, while leaving the rest of the system untouched.
\begin{definition}
Let $\C{M}$ be an SCM. The \emph{perfect (``surgical'') intervention} with target $I \subseteq \C{I}$
and value $\B{\xi}_I \in \BC{X}_I$ induces the \emph{intervened SCM} $\C{M}_{\intervene(I,\B{\xi}_I)}$ by copying $\C{M}$,
but letting $\tilde{\C{H}}$ be $\C{H}$ without the edges $\{j \to i : j \in \C{I} \cup \C{J}, i \in I\}$, and
modifying the causal mechanism into $\tilde{\B{f}}$ such that
\begin{equation*}
\tilde{f}_i(\B{x},\B{e}) = \begin{cases}
\xi_i & i \in I \\
f_i(\B{x},\B{e}) & i \notin I.
\end{cases}
\end{equation*}
\end{definition}
The interpretation is that the causal mechanisms that normally determine the values of the components $i \in I$ are
replaced by mechanisms that assign the values $\xi_i$. Other types of interventions are possible as well
(see also Section~\ref{sec:modeling_interventions}). If the intervened SCM $\C{M}_{\intervene(I,\B{\xi}_I)}$ induces a
unique observational distribution, this is denoted as $\Prb_{\C{M}}\big(\B{X} \given \intervene(I,\B{\xi}_I)\big)$ and
referred to as the \emph{interventional distribution of $\C{M}$ under the perfect intervention $\intervene(I,\B{\xi}_I)$}.
\citet{Pearl2009} derived the \emph{do-calculus} for acyclic SCMs, consisting of three rules that express relationships between
interventional distributions of an SCM.
\subsubsection{Simple Structural Causal Models}\label{sec:simple_scm}
The theory of general cyclic Structural Causal Models is rather involved \citep{Bongers++_1611.06221v2}.
In this work, for simplicity of exposition, we will focus on a certain subclass of SCMs that has many convenient properties
and for which the theory simplifies considerably:
\begin{definition}\label{def:simple_scm}
An SCM $\C{M}$ is called \emph{simple} if it is uniquely solvable with respect to any subset $\C{O} \subseteq \C{I}$.
\end{definition}
All acyclic SCMs are simple.
Simple SCMs provide a special case of the more general class of \emph{modular} SCMs \citep{ForreMooij_1710.08775}.
The class of simple SCMs can be thought of as a generalization of acyclic SCMs that allows for (weak) cyclic causal relations, but preserves many of the convenient properties that acyclic SCMs have.
Indeed, a simple SCM induces a unique observational distribution.
Its marginalizations are always defined \citep{Bongers++_1611.06221v2},
and are also simple; in other words, the class of simple SCMs is closed under marginalizations.
The class of simple SCMs is also closed under perfect interventions, and hence,
all perfect interventional distributions of a simple SCM are uniquely defined.
Without loss of generality, one can assume that simple SCMs have no self-cycles.
The causal interpretation of the graph of an SCM with cycles and/or bidirected edges can be rather subtle in general.
However, for graphs of simple SCMs there is a straightforward causal interpretation:
\begin{definition}
Let $\C{M}$ be a simple SCM. If $i \to j \in \C{G}(\C{M})$ we call $i$ a \emph{direct cause of $j$ according to $\C{M}$}.
If there exists a directed path $i \to \dots \to j \in \C{G}(\C{M})$, i.e., if $i \in \ansub{\C{G}(\C{M})}{j}$, then we call
$i$ a \emph{cause of $j$ according to $\C{M}$}. If there exists a bidirected edge $i \oto j \in \C{G}(\C{M})$, then we call
$i$ and $j$ \emph{confounded according to $\C{M}$}.
\end{definition}
We conclude that the graph $\C{G}(\C{M})$ of a simple SCM can be interpreted as its \emph{causal graph}.
In the next subsection, we will discuss how the same graph $\C{G}(\C{M})$ of a simple SCM $\C{M}$ also represents
the conditional independences that must hold in the observational distribution of $\C{M}$.
These two interpretations of the graph $\C{G}(\C{M})$ of a simple SCM can then be combined into a causal do-calculus \citep{ForreMooij_UAI_19} that extends the acyclic do-calculus of \cite{Pearl2009}.
%\subsubsection{Structural Causal Models: Causal Interpretation}\label{sec:causal_interpretation}
%The \emph{causal graph} of an acyclic SCM $\C{M}$ is the acyclic directed mixed graph $\C{G}(\C{M})$ with nodes $\C{I}$,
%directed edges $i_1 \to i_2$ in $\C{G}$ iff $i_1 \to i_2 \in \C{H}$, and bidirected edges
%$i_1 \oto i_2$ in $\C{G}$ iff there exists $k \in \pasub{\C{H}}{i_1} \cap \pasub{\C{H}}{i_2} \cap \C{K}$.\footnote{This
%definition of causal graph makes a slight simplification: a more precise definition would leave out edges that are redundant. For example, if
%the structural equation for $X_2$ reads $X_2 = 0\cdot X_1 + X_3$ then $1 \to 2 \in \C{H}$ but this edge would not be in $\C{G}$.
%For the rigorous version of this definition, and how it can be extended to certain cyclic SCMs, see \citet{Bongers++_1611.06221v2}.}
%The directed edges in $\C{G}$ represent \emph{direct} causal relations between endogenous variables
%(i.e., $i \to j$ iff $i$ is a direct cause of $j$ with respect to $\C{I}$), and
%the bidirected edges in $\C{G}$ may be interpreted to represent the influence of
%\emph{latent confounders}, i.e., $i \oto j$ iff there exists a common cause of $i$ and $j$ that is not in $\C{V}$
%and this common cause $k$ is a direct cause of $i$ and $j$ with respect to $\{i,j,k\}$.
%Acyclic SCMs whose causal graphs contain bidirected edges are also sometimes called \emph{Semi-Markov
%Causal Models} \citep{Pearl2009}.
%A fundamental result states that the Markov property holds for acyclic SCMs: the observational
%distribution $\Prb_{\C{M}}(\B{X})$ induced by the SCM $\C{M}$ is Markov with respect to its causal graph $\C{G}(\C{M})$ \citep{Richardson2003,Pearl2009,Bongers++_1611.06221v2}.
\subsubsection{Structural Causal Models: Markov properties}\label{sec:markov}
Under certain conditions, the graph $\C{G}(\C{M})$ of an SCM $\C{M}$ can be interpreted as a statistical
graphical model, i.e., it allows one to read off
conditional independences that must hold in the observational distribution $\Prb_{\C{M}}(\B{X})$.
One of the most common formulations of such \emph{Markov properties} involves the following notion of \emph{$d$-separation},
first proposed by \citet{Pearl1986} in the context of DAGs,
and later shown to be more generally applicable:\footnote{It is also sometimes called ``$m$-separation'' in the ADMG literature.}
%DMGs can be used as graphical models to represent conditional independence properties between random variables \citep{Richardson2003,ForreMooij_1710.08775}.
\begin{definition}[$d$-separation]
We say that a walk $\langle i_0 \dots i_k \rangle$ in DMG $\C{G} = \langle \C{V},\C{E},\C{F} \rangle$ is \emph{$d$-blocked by $C \subseteq \C{V}$} if:
\begin{compactenum}[(i)]
\item its first node $i_0 \in C$ or its last node $i_k \in C$, or
\item it contains a collider on a node $i \notin \ansub{\C{G}}{C}$, or
\item it contains a non-collider on a node $i \in C$.
\end{compactenum}
If all paths in $\C{G}$ between any node in set $A \subseteq \C{V}$ and any node in set $B \subseteq \C{V}$
are $d$-blocked by a set $C \subseteq \C{V}$, we say that $A$ is \emph{$d$-separated}
from $B$ by $C$, and we write $\dsep{A}{B}{C}{\C{G}}$.
\end{definition}
In the general cyclic case, however, the notion of $d$-separation is too strong, as was already pointed out by
\citet{Spirtes94}. A solution is to replace it with a non-trivial generalization of $d$-separation,
known as $\sigma$-separation \citep{ForreMooij_1710.08775}:
\begin{definition}[$\sigma$-separation]
We say that a walk $\langle i_0 \dots i_k \rangle$ in DMG $\C{G} = \langle \C{V},\C{E},\C{F} \rangle$ is \emph{$\sigma$-blocked by $C \subseteq \C{V}$} if:
\begin{compactenum}[(i)]
\item its first node $i_0 \in C$ or its last node $i_k \in C$, or
\item it contains a collider on a node $i \notin \ansub{\C{G}}{C}$, or
\item it contains a non-collider on a node $i \in C$ that points to a
neighboring node on the path in another strongly-connected component (i.e.,
$i_{k-1} \to i_k \to i_{k+1}$ or $i_{k-1}\oto i_k \to i_{k+1}$ with $i_{k+1} \notin \sccsub{\C{G}}{i_k}$,
$i_{k-1} \ot i_k \ot i_{k+1}$ or $i_{k-1}\ot i_k \oto i_{k+1}$ with $i_{k-1} \notin \sccsub{\C{G}}{i_k}$,
or $i_{k-1} \ot i_k \to i_{k+1}$ with $i_{k-1} \notin \sccsub{\C{G}}{i_k}$ or $i_{k+1} \notin \sccsub{\C{G}}{i_k}$).
\end{compactenum}
If all paths in $\C{G}$ between any node in set $A \subseteq \C{V}$ and any node in set $B \subseteq \C{V}$
are $\sigma$-blocked by a set $C \subseteq \C{V}$, we say that $A$ is \emph{$\sigma$-separated}
from $B$ by $C$, and we write $\sigmasep{A}{B}{C}{\C{G}}$.
\end{definition}
\citet{ForreMooij_1710.08775} proved the following fundamental result for modular SCMs, which we formulate here only for
the special case of simple SCMs:
\begin{theorem}[Generalized Directed Global Markov Property]\label{thm:sigma_separation}
Any solution $(\B{X},\B{E})$ of a simple SCM $\C{M}$ obeys the \emph{Generalized Directed Global Markov Property}
with respect to the graph $\C{G}(\C{M})$:
$$\sigmasep{A}{B}{C}{\C{G}(\C{M})} \implies \indep{\B{X}_A}{\B{X}_B}{\B{X}_C}{\Prb_{\C{M}}(\B{X})} \qquad \forall A,B,C \subseteq \C{I}.$$
\end{theorem}
The following stronger Markov properties have been derived for special cases by \citet{ForreMooij_1710.08775} (where
again we consider only the special case of simple SCMs):
\begin{theorem}[Directed Global Markov Property]\label{thm:d_separation}
Let $\C{M} = \langle \C{I}, \C{J}, \C{H}, \BC{X}, \BC{E}, \B{f}, \Prb_{\BC{E}} \rangle$ be a simple SCM. If $\C{M}$ satisfies at least one of the following three conditions:
\begin{compactenum}[(i)]
\item $\C{M}$ is acyclic;
\item all endogenous spaces $\C{X}_i$ are discrete;
\item $\C{M}$ is linear (i.e., $\C{X}_i = \RN$ for each $i \in\C{I}$, $\C{E}_j = \RN$ for each $j\in\C{J}$, $\phi : \C{I} \to \C{J}$ is a bijection, $\pasub{\C{H}}{i} \cap \C{J} = \phi(i)$ for each $i \in \C{I}$, and each causal mechanism $f_i : \BC{X}_{\pasub{\C{H}}{i} \cap \C{I}} \times \C{E}_{\phi(i)} \to \C{X}_i$ is linear), and its exogenous variables have a density $p(\B{E})$ with respect to Lebesgue measure;
\end{compactenum}
then any solution $(\B{X},\B{E})$ of $\C{M}$ obeys the \emph{Directed Global Markov Property}
with respect to the graph $\C{G}(\C{M})$:
$$\dsep{A}{B}{C}{\C{G}(\C{M})} \implies \indep{\B{X}_A}{\B{X}_B}{\B{X}_C}{\Prb_{\C{M}}(\B{X})} \qquad \forall A,B,C \subseteq \C{I}.$$
\end{theorem}
The acyclic case is well-known and was first shown by \citet{Richardson2003}.
The discrete case fixes the erroneous theorem by \citet{PearlDechter96}, for which a
counterexample was found by \citet{Neal2000}, by adding the unique
solvability condition, and extends it to allow for latent common causes.
The linear case is an extension of existing results for the linear-Gaussian setting
without latent common causes \citep{Spirtes94,Koster96} to a linear (possibly
non-Gaussian) setting with latent common causes.
%The solvability conditions in the discrete and linear cases above are implied by structural solvability w.r.t.\ each strongly-connected component in $\C{G}(\C{M})$.
We conclude that simple SCMs also have convenient Markov
properties. A simple SCM induces a unique observational distribution that satisfies the Generalized
Directed Global Markov Property; under additional conditions, it satisfies even the Directed Global
Markov Property. Similarly, for any perfect intervention, a simple SCM induces a unique interventional
distribution that satisfies the (Generalized) Directed Global Markov Property with respect to the
mutilated graph. We conclude that the graph $\C{G}(\C{M})$ of a simple SCM has two interpretations:
it expresses both the causal structure between the variables as well as the conditional independence
structure. These two interpretations of the graph $\C{G}(\C{M})$ of a simple SCM can be combined
into a causal do-calculus \citep{ForreMooij_UAI_19} that extends the acyclic do-calculus of \cite{Pearl2009}
to the class of simple (or more generally, modular) SCMs.
%Let $\B{X} = (X_j)_{j\in\C{V}}$ be a family of random variables indexed by the nodes $\C{V}$ of DMG $\C{G}$,
%and let $\Prb(\B{X})$ denote the probability distribution of $\B{X}$. The \emph{(global directed)
%Markov property} holds for the pair $(\C{G}, \B{X})$ if each $d$-separation in $\C{G}$ implies
%a conditional independence in $\Prb(\B{X})$:
%$$A \dsep B \given C \ [\C{G}] \implies \B{X}_A \CI \B{X}_B \given \B{X}_C \ [\Prb(\B{X})] \qquad \forall A,B,C \subseteq \C{V}.$$
%\Joris{To simplify the presentation, we could consider simple SCMs only in this section. However, then we would loose out the opportunity to make clear that there are many subtleties to be dealt with for non-simple SCMs.}
%\subsubsection{Faithfulness}\label{sec:faithfulness}
The starting point for constraint-based approaches to causal discovery from observational data is
to assume that the data is modelled by an (unknown) SCM $\C{M}$, such that its observational
distribution $\Prb_{\C{M}}(\B{X})$ exists and satisfies a Markov property with respect to its graph
$\C{G}(\C{M})$. In addition, one usually assumes the \emph{faithfulness assumption} to hold \citep{SGS2000,Pearl2009},
i.e., that the graph explains \emph{all} conditional independences present in the
observational distribution. For the cases in which $d$-separation applies, this amounts to assuming
the following implication:
$$\dsep{A}{B}{C}{\C{G}(\C{M})} \impliedby \indep{\B{X}_A}{\B{X}_B}{\B{X}_C}{\Prb_{\C{M}}(\B{X})} \qquad \forall A,B,C \subseteq \C{V}.$$
\citet{Meek1995} has shown completeness properties of $d$-separation. More specifically, \citet{Meek1995}
showed that faithfulness holds generically for DAGs if (i) all variable domains are finite, or
(ii) if all variables are real-valued, linearly related and have a multivariate Gaussian distribution.
This in particular provides some justification for assuming faithfulness.
On the other hand, no completeness results are known yet for the general
cyclic case in which $\sigma$-separation applies. Nevertheless, we believe that such results can be
shown, and we will assume for simple SCMs a similar faithfulness assumption as for the $d$-separation case:
$$\sigmasep{A}{B}{C}{\C{G}(\C{M})} \impliedby \indep{\B{X}_A}{\B{X}_B}{\B{X}_C}{\Prb_{\C{M}}(\B{X})} \qquad \forall A,B,C \subseteq \C{V}.$$
\subsection{Causal Discovery by Experimentation}\label{sec:CDexp}
The gold standard for causal discovery is by means of experimentation. For example,
randomized controlled trials \citep{Fisher1935} form the foundation
of modern evidence-based medicine. In engineering, A/B-testing is a
common protocol to optimize certain causal effects of an engineered
system. Toddlers learn causal representations of the world through playful
experimentation.
We will discuss here the simplest randomized controlled trial setting by formulating
it in terms of the graphical causal terminology introduced in the last section.
The experimental procedure is as follows.
Consider two variables, ``treatment'' $C$ and ``outcome'' $X$. In the simplest setting, one considers a
binary treatment variable, where $C=1$ corresponds to ``treat with drug'' and
$C=0$ corresponds to ``treat with placebo''. For example, the drug could be aspirin,
and outcome could be the severity of headache perceived two
hours later. Patients are split into two groups,
the treatment and the control group, by means of a coin flip that assigns a value
of $C$ to every patient.\footnote{Usually this is done in a double-blind way, so
that neither the patient nor the doctor knows which group a patient has been assigned
to.} Patients are treated depending on the assigned value of $C$, i.e., patients in the
treatment group are treated with the drug and patients in the control group are treated
with a placebo. Some time after treatment, the outcome $X$ is measured for each patient.
This yields a data set $(C_n,X_n)_{n=1}^N$
with two measurements ($C_n$, $X_n$) for the $n^{\mathrm{th}}$ patient.
If the distribution of outcome $X$ significantly differs between the two groups,
one concludes that treatment is a cause of outcome.
The important underlying causal assumptions that ensure the validity of the
conclusion are:
%\footnote{Even though we use a medical setting as an illustration, note that this holds more generally, for any study or experiment with variables $C$ and $X$ satisfying these assumptions.}
\begin{enumerate}[(i)]
\item outcome $X$ is not a cause of treatment $C$ (which is commonly deemed
justified if the outcome is an event that occurs later in time than the treatment event);
\item there is no latent confounder of treatment and outcome (this is where the
randomization comes in: if treatment is decided solely by a proper coin flip,
then it seems reasonable to assume that there cannot be any latent common cause
of the coin flip $C$ and the outcome $X$ that is not just a combination of two
statistically independent separate causes of $C$ and $X$),
\item no selection bias is present in the data (in other words, data
samples have not been selected afterwards based on the measured values of $C$ or $X$,
which could happen for example if somebody would have selected only a subset of
patients that suffered from certain treatment side effects).
\end{enumerate}
\begin{figure}\centering\begin{tikzpicture}
\node at (0,2) {(a) Two separate data sets:};
\node[anchor=east] at (-3.5,0) {\scalebox{0.8}{\small\begin{tabular}{c}
Placebo \\
($C=0$): \\
\hline
$X$ \\
\hline
\color{blue!70}-0.2 \\
\color{blue!70} 0.6 \\
\color{blue!70}-1.7 \\
\color{blue!70}\dots \\
\hline
\end{tabular}}};
\node at (-2,0) {\includegraphics[width=3cm]{RCThistogram0.pdf}};
\node[anchor=east] at (1.5,0) {\scalebox{0.8}{\small\begin{tabular}{c}
Drug \\
($C=1$): \\
\hline
$X$ \\
\hline
\color{red!70} -0.3 \\
\color{red!70} 1.8 \\
\color{red!70} -0.1 \\
\color{red!70} \dots \\
\hline
\end{tabular}}};
\node at (3,0) {\includegraphics[width=3cm]{RCThistogram1.pdf}};
\begin{scope}[xshift=7cm]
\node at (0,2) {(b) Pooled data:};
\node[anchor=east] at (0,0) {\scalebox{0.8}{\small\begin{tabular}{cc}
$C$ & $X$ \\
\hline
0 & \color{blue!70}-0.2 \\
0 & \color{blue!70} 0.6 \\
0 & \color{blue!70}-1.7 \\
0 & \color{blue!70}\dots \\
1 & \color{red!70} -0.3 \\
1 & \color{red!70} 1.8 \\
1 & \color{red!70} -0.1 \\
1 & \color{red!70} \dots \\
\hline
\end{tabular}}};
% \node at (0,-5) {\includegraphics[width=3cm]{RCThistogram01.pdf}};
\node at (1.5,0) {\includegraphics[width=3cm]{RCThistogram01b.pdf}};
\end{scope}
\end{tikzpicture}
\caption{Illustration of the data from an example randomized controlled trial. The data can either be interpreted
as (a) two separate data sets, one for the treatment and one for the control group, or (b)
as a single data set including a context variable indicating treatment/control. Note that in this
particular example, $C$ is dependent on $X$ in the pooled data (or equivalently, the distribution of $X$
differs between contexts $C=0$ and $C=1$), which implies that $C$ is a cause of $X$.\label{fig:RCT_example}}
\end{figure}
Under these assumptions, one can show that
if the distribution of the outcome $X$ differs between the two groups of
patients (``treatment group'' with $C=1$ vs.\ ``control group'' with $C=0$),
then treatment must be a cause of outcome (in this population of patients).
There are two conceptually slightly different ways of testing this in the data, depending on whether
we treat the data as a single pooled data set, or rather as two separate data sets (each one corresponding
to a particular patient group), see also Figure~\ref{fig:RCT_example}.
If we consider the data about outcome $X$
in the two groups as two \emph{separate} data sets (corresponding to the same
variable $X$, but measured in different contexts $C$), then the question is whether the
distribution of $X$ is statistically different in the two data sets. This can
be tested with a two-sample test, for example, a $t$-test or a Wilcoxon test.
The other alternative is to consider the data as a single \emph{pooled} data set
(by pooling the data for the two groups), and let the value of $C$ indicate the