% \documentclass[12pt,twocolumn]{article}
% Copernicus stuff
\documentclass[gmd,manuscript]{copernicus}
%\documentclass[gmd,manuscript]{../171128_Copernicus_LaTeX_Package/copernicus} %durack
% page/line labeling and referencing
% from http://goo.gl/HvS9BK
\newcommand{\pllabel}[1]{\label{p-#1}\linelabel{l-#1}}
\newcommand{\plref}[1]{page~\pageref{p-#1}, line~\lineref{l-#1}}
% answer environment for reviewer responses
\newenvironment{answer}{\color{blue}}{}
\usepackage{enumitem}
% \hypersetup{colorlinks=true,urlcolor=blue,citecolor=red}
\hypersetup{colorlinks=false}
% \newcommand{\degree}{\ensuremath{^\circ}}
% \newcommand{\order}{\ensuremath{\mathcal{O}}}
\newcommand{\bibref}[1] { \cite{ref:#1}}
\newcommand{\pipref}[1] {\citep{ref:#1}}
% \newcommand{\ceqref}[1] {\mbox{CodeBlock \ref{code:#1}}}
% \newcommand{\charef}[1] {\mbox{Chapter \ref{cha:#1}}}
% \newcommand{\eqnref}[1] {\mbox{Eq. \ref{eq:#1}}}
\newcommand{\figref}[1] {\mbox{Figure \ref{fig:#1}}}
\newcommand{\secref}[1] {\mbox{Section \ref{sec:#1}}}
\newcommand{\appref}[1] {\mbox{Appendix \ref{sec:#1}}}
% \newcommand{\tabref}[1] {\mbox{Table \ref{tab:#1}}}
\newcommand{\editorial}[1]{\protect{\color{red}#1}}
\runningtitle{WIP Paper Draft \today}
\runningauthor{Balaji et al.}
\begin{document}
\title{Requirements for a global data infrastructure in support of CMIP6}
% \pllabel{SC1-1}
\Author[1,2]{Venkatramani}{Balaji}
\Author[3]{Karl E.}{Taylor}
\Author[4]{Martin}{Juckes}
\Author[5]{Michael}{Lautenschlager}
\Author[6,2]{Chris}{Blanton}
\Author[7]{Luca}{Cinquini}
\Author[8]{S\'ebastien}{Denvil}
\Author[3]{Paul J.}{Durack}
\Author[9]{Mark}{Elkington}
\Author[8]{Francesca}{Guglielmo}
\Author[8,10]{Eric}{Guilyardi}
\Author[10]{David}{Hassell}
\Author[11]{Slava}{Kharin}
\Author[5]{Stephan}{Kindermann}
\Author[10,4]{Bryan N.}{Lawrence}
\Author[1,2]{Sergey}{Nikonov}
\Author[6,2]{Aparna}{Radhakrishnan}
\Author[5]{Martina}{Stockhause}
\Author[5]{Tobias}{Weigel}
\Author[3]{Dean}{Williams}
\affil[1]{Princeton University, Cooperative Institute of Climate
Science, Princeton NJ, USA}
\affil[2]{NOAA/Geophysical Fluid Dynamics Laboratory, Princeton NJ,
USA}
\affil[3]{PCMDI, Lawrence Livermore National Laboratory, Livermore, CA, USA}
\affil[4]{Science and Technology Facilities Council, Abingdon, UK}
\affil[5]{Deutsches KlimaRechenZentrum GmbH, Hamburg, Germany}
\affil[6]{Engility Inc., NJ, USA}
\affil[7]{Jet Propulsion Laboratory (JPL), 4800 Oak Grove Drive,
Pasadena, CA 91109, USA}
\affil[8]{Institut Pierre-Simon Laplace, CNRS/UPMC, Paris, France}
\affil[9]{Met Office, FitzRoy Road, Exeter, EX1 3PB, UK}
\affil[10]{National Center for Atmospheric Science and University of
Reading, UK}
\affil[11]{Canadian Centre for Climate Modelling and Analysis, Atmospheric Environment Service, University of Victoria, BC, Canada}
% \affil[10]{NCAR}
\correspondence{V. Balaji (\texttt{[email protected]})}
\received{}
\pubdiscuss{} %% only important for two-stage journals
\revised{}
\accepted{}
\published{}
%% These dates will be inserted by Copernicus Publications during the typesetting process.
\firstpage{1}
\maketitle
% \pagebreak
\abstract{The World Climate Research Programme (WCRP)'s Working Group
on Climate Modeling (WGCM) Infrastructure Panel (WIP) was formed in
2014 in response to the explosive growth in size and complexity of
Coupled Model Intercomparison Projects (CMIPs) between CMIP3
(2005-06) and CMIP5 (2011-12). This article presents the WIP
recommendations for the global data infrastructure needed to support
CMIP design, future growth and evolution. Developed in close
coordination with those who build and run the existing
infrastructure (the Earth System Grid Federation), the
recommendations are based on several principles beginning with the
need to separate requirements, implementation, and operations. Other
important principles include the consideration of data as a
commodity in an ecosystem of users, the importance of provenance,
the need for automation, and the obligation to measure costs and
benefits.
This paper concentrates on requirements, recognising the diversity
of communities involved (modelers, analysts, software developers,
and downstream users). Such requirements include the need for
scientific reproducibility and accountability alongside the need
to record and track data usage for the purpose of assigning
  credit. One key element is to adopt a dataset-centric rather
  than system-centric focus, with the aim of making the
  infrastructure less prone to systemic failure.
With these overarching principles and requirements, the WIP has
produced a set of position papers, which are summarized here. They
provide specifications for managing and delivering model output,
including strategies for replication and versioning, licensing, data
quality assurance, citation, long-term archival, and dataset
tracking. They also describe a new and more formal approach for
specifying what data, and associated metadata, should be saved,
which enables future data volumes to be estimated.
The paper concludes with a future-facing consideration of the global
data infrastructure evolution that follows from the blurring of
boundaries between climate and weather, and the changing nature of
published scientific results in the digital age. }
% \pagebreak
\introduction
\label{sec:intro}
CMIP6 \pipref{eyringetal2016a}, the latest Coupled Model
Intercomparison Project (CMIP), can trace its genealogy back to the
Charney Report \pipref{charneyetal1979}. This seminal report on the
links between CO$_2$ and climate was an authoritative summary of the
state of the science at the time, and produced findings that have
stood the test of time \pipref{bonyetal2013}. It is often noted that
the range and uncertainty bounds on equilibrium climate sensitivity
generated in this report have not fundamentally changed, despite the
enormous increase in resources devoted to analysing the problem in
the decades since.
Beyond its prescient findings on climate sensitivity, the Charney
Report also gave rise to a methodology for the treatment of
uncertainties and gaps in understanding, which has been equally
influential, and is in fact the basis of CMIP itself. The Report can
be seen as one of the first uses of the \emph{multi-model ensemble}.
At the time, there were two models capable of representing the
equilibrium response of the climate system to a change in CO$_2$
forcing, one from Syukuro Manabe's group at NOAA's Geophysical Fluid
Dynamics Laboratory, and the other from James Hansen's group at NASA's
Goddard Institute for Space Studies. Then as now, these groups
marshaled vast state-of-the-art computing and data resources to run
very challenging simulations of the Earth system. The Report's results
were based on an ensemble of three runs from Manabe, labeled M1-M3, and
two from Hansen, labeled H1-H2.
By the time of the IPCC First Assessment Report (FAR) in 1990, the
process had been formalized. At this stage, there were 5 models
participating in the exercise, and some of what has now been
formalized as the ``Diagnosis, Evaluation, and Characterization of
Klima'' (DECK) experiments\footnote{``Klima'' is German for
``climate''.} had been standardized (a pre-industrial control, 1\%
per year CO$_2$ increase to doubling, etc). The ``scenarios'' had
emerged as well, for a total of 5 different experimental protocols.
Fast-forwarding to today, CMIP6 expects more than 75 models from
around 35 modeling centers \citep[in 14 countries, a stark contrast
to the US monopoly in][]{ref:charneyetal1979} to participate in the
DECK and historical experiments \citep[Table~2
of][]{ref:eyringetal2016a}, and some subset of these to participate in
one or more of the 21 MIPs endorsed by the CMIP Panel \citep[Table~3
of][]{ref:eyringetal2016a}. The MIPs call for over 200 experiments, a
considerable expansion over CMIP5.
Alongside the experiments themselves is the data request, which
defines, for each CMIP experiment, what output each model should
provide for analysis. The complexity of this data request has also
grown tremendously over the CMIP era. A typical dataset from the FAR
archive (\href{https://goo.gl/M1WSJy}{from the GFDL R15 model}) lists
climatologies and time series of two variables, and the dataset size
is about 200~MB. The CMIP6 Data Request \cite{ref:juckesetal2015}
lists thousands of variables across the hundreds of
experiments mentioned above. This growth in complexity is testament to
the modern understanding of many physical, chemical and biological
processes which were simply absent from the Charney Report era models.
The simulation output is now a primary scientific resource for
researchers the world over, rivaling the volume of observed weather
and climate data from the global array of sensors and satellites
\pipref{overpecketal2011}. Climate science, and observed and simulated
climate data in particular, have now become primary elements in the
``vast machine'' \pipref{edwards2010} serving the global climate and
weather enterprise.
% It could be worthwhile to quantify (in $USD) the impact, as forecasting
% in particular has yielded considerable social and economic gains
Managing and sharing this huge amount of data is an enterprise in its
own right -- and the solution established for CMIP5 was the global
``Earth System Grid Federation'' (ESGF, \pipref{williamsetal2015}).
ESGF was identified by the WCRP Joint Scientific Committee in 2013 as
the recommended infrastructure for data archiving and dissemination
for the Programme. The larger gateways currently participating in the
ESGF are shown in \figref{esgf}, which also lists some of the
many projects these nodes support. With multiple agencies and
institutions, and many uncoordinated and possibly conflicting
requirements, the ESGF itself is a complex and delicate component to
manage.
\begin{figure*}
\begin{center}
\includegraphics[width=175mm]{images/esgf-map-2017.png}
\end{center}
\caption{Sites participating in the Earth System Grid Federation in
2017. Figure courtesy Dean Williams, adapted from the ESGF
Brochure. }
\label{fig:esgf}
\end{figure*}
The sheer size and complexity of this infrastructure emerged as a
matter of great concern at the end of CMIP5, when the growth in data
volume relative to CMIP3 (from 40~TB to 2~PB, a 50-fold increase in 6
years) suggested the community was on an unsustainable path. These
concerns led to the 2014 recommendation of the WGCM to form an
\emph{infrastructure panel} (based upon
\href{https://goo.gl/FHqbNN}{a proposal at the 2013 annual
  meeting}). The WGCM Infrastructure Panel
(WIP) was tasked with examining the global computational and data
infrastructure underpinning CMIP, and improving communication between
the teams overseeing the scientific and experimental design of these
globally coordinated experiments, and the teams providing resources
and designing that infrastructure. The communication was intended to
be two-way: providing input both to the provisioning of infrastructure
appropriate to the experimental design, and informing the scientific
design of the technical (and financial) limits of that infrastructure.
This paper is a summary of the requirements identified by the WIP in
the first three years of activity since its formation in 2014,
alongside the recommendations which have arisen. In
\secref{principles}, the principles and scientific rationale
underlying the requirements for global data infrastructure are
articulated. In \secref{dreq} the CMIP6 Data Request is covered:
standards and conventions, requirements for modeling centers to
process a complex data request, and projections of data volume.
In \secref{licensing}, recent
evolution in how data are archived is reviewed alongside a licensing
strategy consistent with current practice and scientific principle. In
\secref{cite} issues surrounding data as a citable resource are
discussed, including the technical infrastructure for the creation of
citable data, and the documentation and other standards required to
make data a first-class scientific entity. In \secref{replica} the
implications of data replicas are considered, and in \secref{version} issues
surrounding data versioning, retraction, and errata are addressed.
\secref{summary} provides an outlook for the future of global data
infrastructure, looking beyond CMIP6 towards a unified view of
the ``vast machine'' for weather and climate computation and data.
\section{Principles underlying the infrastructure requirements}
\label{sec:principles}
In the pioneering days of CMIP, the community of participants was
small and well-knit, and all the issues involved in generating
datasets for common analysis from different modeling groups could be
settled by mutual agreement (Ron Stouffer, personal communication).
Analysis was performed by the same community that performed the
simulations. The Program for Climate Model Diagnostics and
Intercomparison (PCMDI), established in 1989, had championed the idea
of more systematic analysis of models, and in close cooperation with
the climate modeling centers, PCMDI assumed responsibility for
much of the day-to-day coordination of CMIP. Until CMIP3, the hosting
of datasets from different modeling groups could be managed at a
single archival site; PCMDI alone hosted the entire 40~TB archive.
From its earliest phases, CMIP grew in importance, and its results
provided a major pillar supporting the periodic Intergovernmental
Panel on Climate Change (IPCC) assessment activity. However, the
explosive growth in the scope of CMIP, especially between CMIP3 and
CMIP5, represented a tipping point in the supporting infrastructure.
It became evident that fundamental changes would be needed to address
the evolving scientific and operational requirements, which are summarized
here:
\begin{enumerate}
\item With greater complexity and a globally distributed data
resource, it has become clear that in the design of globally
coordinated scientific experiments, the global computational and
data infrastructure needs to be formally examined as an integrated
element.
\begin{itemize}
\item The WIP was formed in response to this observation, with
membership drawn from experts in various aspects of the
infrastructure. Representatives of modeling centers,
infrastructure developers, and stakeholders in the scientific
design of CMIP and its output comprise the panel membership.
\item One of the WIP's first acts was to consider three phases in
the process of infrastructure development: \emph{requirements},
\emph{implementation}, and \emph{operations}, all informed by the
builders of workflows at the modeling centers.
\begin{itemize}
\item The WIP, in concert with the CMIP Panel, takes
  responsibility for articulating requirements for the
  infrastructure.
\item The implementation is in the hands of the infrastructure
developers, principally ESGF for the federated archive
\pipref{williamsetal2015}, but also related projects like Earth
System Documentation
\citep[\href{https://goo.gl/WNwKD9}{ES-DOC},][]{ref:guilyardietal2013}.
\item In 2016 at the WIP's request, the CMIP6 Data Node Operations
Team (CDNOT) was formed. It is charged with ensuring that all
the infrastructure elements needed by CMIP6 are properly
deployed and actually working as intended at the sites hosting
CMIP6 data. It is also responsible for the operational aspects
of the federation itself, including specifying what versions of
the toolchain are run at every site at any given time, and
organizing coordinated version upgrades across the federation.
\end{itemize} Although there is now a clear separation of concerns
into requirements, implementation, and operations, close links are
maintained by cross-membership between the key bodies, including
the WIP itself, the CMIP Panel, the ESGF Executive Committee, and
the CDNOT.
\end{itemize}
\item\label{broad} With the basic fact of anthropogenic climate change
  now well established \citep[see, e.g.,][]{ref:stockeretal2013},
% A ref would be useful here - the AR5 technical summary for policy makers?
  the scientific communities with an interest in CMIP are expanding.
For example, a substantial body of work has begun to emerge to
examine climate impacts.
\begin{itemize}
\item In addition to the specialists in Earth system science -- who
also design and run the experiments and produce the model output
-- those relying on CMIP output now include those developing and
providing climate services, as well as \emph{consumers} from
allied fields studying the impacts of climate change on health,
agriculture, natural resources, human migration, and similar
    issues \pipref{mossetal2010}. This confronts us with a
    \emph{scientific scalability} issue that needs to be addressed:
    over its lifetime, the data will be consumed by a community much
    larger than the Earth system modeling community itself, both in
    sheer numbers and in breadth of interest and perspective.
\item Accordingly, the WIP has promulgated the requirement that
infrastructure should ensure maximum transparency and usability
for user (consumer) communities at some distance from the modeling
(producer) communities.
\end{itemize}
\item\label{repro} While CMIP and the IPCC are formally independent,
the CMIP archive is increasingly a reference in formulating
climate policy. Hence the \emph{scientific reproducibility}
\pipref{collinstabak2014} and the underlying \emph{durability} and
\emph{provenance} of data have now become matters of central
importance: being able to trace, long after the fact, back from
model output to the configuration of models and analysis procedures
and choices made along the way.
\begin{itemize}
\item This led the IPCC to require data distribution centers (DDCs)
to attempt to guarantee the archival and dissemination of this
data in perpetuity, and
\item the WIP to promote, in the CMIP context, the importance of
  achieving reproducibility. Given the use of multi-model ensembles
for both consensus estimates and uncertainty bounds on climate
projections, it is important to document -- as precisely as
possible, given the independent genealogy and structure of many
models -- the details and differences among model configurations
and analysis methods, to deliver both the requisite provenance and
the routes to reproduction.
\end{itemize}
\item\label{analysis} With the expectation that CMIP DECK experiment
results should be routinely contributed to CMIP, opportunities now
exist for engaging in a more systematic and routine evaluation of
Earth System Models (ESMs). This has led to community efforts to
develop standard metrics of model ``quality''
\citep{ref:eyringetal2016,ref:gleckleretal2016}.
\begin{itemize}
\item Typical multi-model analysis has hitherto taken the
multi-model average, assigning equal weight to each model, as the
most likely estimate of climate response. This ``model democracy''
\pipref{knutti2010} has been called into question and there is now
a considerable literature exploring the potential of weighting
models by quality \pipref{knuttietal2017}. The development of
standard metrics would aid this kind of research.
\item To that end, there is now a requirement to enable through the
ESGF a framework for accommodating quasi-operational evaluation
tools that could routinely execute a series of standardized
evaluation tasks. This would provide data consumers with an
increasingly (over time) systematic characterization of models.
The WIP recognizes it may be some time before a fully operational
system of this kind can be implemented, but planning must start now.
\end{itemize}
\item As the experimental design of CMIP has grown in complexity,
costs both in time and money have become a matter of great concern,
particularly for those designing, carrying out, and storing
simulations. In order to justify commitment of resources to CMIP,
mechanisms to identify costs and benefits in developing new models,
performing CMIP simulations, and disseminating the model output need
to be developed.
\begin{itemize}
\item To quantify the scientific impact of CMIP, measures are needed
to \emph{track} the use of model output and its value to consumers.
\item In addition to usage quantification, it is important to assign
  credit and to trace data usage in the literature via data
  citation. Current practice is, at best, to cite large data
  collections provided by a CMIP participant, or all of
  CMIP. Accordingly, the WIP has defined
and is encouraging use of a mechanism to identify and \emph{cite}
data provided by each modeling center.
\item Alongside the intellectual contribution to model development,
which can be recognized by citation, there is a material cost to
centers in computing which is both burdensome and poorly
understood by those requesting, designing and using CMIP
experiments. To begin documentation of these costs for CMIP6,
the ``Computational Performance'' MIP
project (CPMIP) \pipref{balajietal2017} has been established.
\end{itemize}
\item\label{cmplx} Experimental specifications have become ever more
complex, making it difficult to verify that experiment
configurations conform to those specifications.
\begin{itemize}
\item Several modeling centers have encountered this problem in
preparing for CMIP6, noting, for example, the challenging
intricacies in dealing with input forcing data
\citep[see][]{ref:duracketal2017}, output variable lists
\pipref{juckesetal2015}, and crossover requirements between the
    endorsed MIPs and the DECK \pipref{eyringetal2016a}. Moreover,
    these protocols inevitably evolve over time, as errors are
    discovered or enhancements proposed, and centers need to adapt
    their workflows accordingly.
\item The WIP therefore recognized a requirement to encode the
  protocols so that they can be directly ingested by workflows -- in
  other words, \emph{machine-readable experiment design}. The
  requirement spans
all of the \emph{controlled vocabularies} (CVs: for instance the
names assigned to models, experiments, and output variables) used
in the CMIP protocols as well as the CMIP6 Data Request
\pipref{juckesetal2015}, which must be stored in
version-controlled, machine-readable formats. Precisely documenting
the \emph{conformance} of experiments to the protocols
\pipref{lawrenceetal2012} is an additional requirement.
\end{itemize}
\item\label{snap} The transition from a unitary archive at PCMDI in
CMIP3 to a globally federated archive in CMIP5 led to many changes
in the way users interact with the archive, which impacts management
of information about users and complicates communications with them.
\begin{itemize}
\item In particular, a growing number of data users no longer
register or interact directly with the ESGF. Rather they rely on
secondary repositories, often ``snapshots'' of the state of some
portion of the ESGF archive created by others at a particular time
(see for instance the \href{https://goo.gl/34AtW6}{IPCC CMIP5 Data
Factsheet} for a discussion of the snapshots and their
coverage). This meant that reliance on the ESGF's inventory of
registered users for any aspect of the infrastructure -- such as
tracking usage, compliance with licensing requirements, or
informing users about errata or retractions -- could at best
ensure partial coverage of the user base.
\item The WIP therefore committed to a more distributed design for
  several features outlined below, which devolves many of these
  functions to the datasets themselves rather than to the archives. One
may think of this as a \emph{dataset-centric rather than
system-centric} design (in software terms, a \emph{pull} rather
than \emph{push} design): information is made available upon
request at the user/dataset level, relieving the ESGF
implementation of an impossible burden.
\end{itemize}
\end{enumerate}
Based upon these considerations, the WIP produced a set of position
papers (see \appref{wip}) encapsulating specifications and
recommendations for CMIP6 and beyond. These papers, summarized below,
are available from the
\href{https://www.earthsystemcog.org/projects/wip/}{WIP website}. As
the WIP continues to develop additional recommendations, they too will
be made available. All WIP papers distributed in this way are thought
to be stable, but should revision be necessary, a modified document will
be released with a new version number.
\section{A structured approach to data production}
\label{sec:dreq}
The CMIP6 data framework has evolved considerably from CMIP5,
reflecting the principle of scientific reproducibility (Item~\ref{repro}
in \secref{principles}) and the recognition that the complexity of
the experimental design (Item~\ref{cmplx}) required far greater
degrees of automation and embedding in workflows. This requires that
all elements in the specification be recorded in structured text
formats (XML and JSON, for example), and subject to rigorous version
control. \emph{Machine-readable} specification of as many aspects of
the model output configuration as possible is a WIP design goal.
The data request spans several elements discussed in sub-sections
below.
\subsection{CMIP6 Data Request}
\label{sec:data-request}
The data request \pipref{juckesetal2015} is now available
through the \href{https://goo.gl/iNBQ9m}{DREQ} tool, the associated
\texttt{dreqPy} Python library, and an underlying
% Martin refs to this as "dreq", with the software "dreqPy"
database. The DREQ combines definitions of variables and their output
format with specifications of the objectives they support and the
experiments that they are required for. The entire request is encoded
in an XML database with rigorous type constraints. Important elements
of the request, such as units, cell methods (expressing the subgrid
processing implicit in the variable definition), and time slices for
required output, are defined as controlled vocabularies within the
request to ensure consistency of usage. The request is designed to
enable flexibility, allowing modeling centers to make informed
decisions about the variables they should submit to the CMIP6 archive
from each experiment.
Each variable in the data request is specified by several elements:
\begin{enumerate}
\item a specification of the parameter to be calculated, in terms of a CF
  standard name and units;
\item an output frequency;
\item a structural specification, which includes the dimensions and the
  subgrid processing.
\end{enumerate}
In order to facilitate cross-linking between the 2100 variables
from 248 experiments, the request database allows MIPs to aggregate
variables and experiments into groups. The link between variables and
experiments is then made through the following chain:
\begin{enumerate}
\item A \emph{variable group}, aggregating variables with priorities
specific to the MIP defining the group;
\item A \emph{request link} associating a variable group with an
objective and a set of request items;
\item \emph{Request} items associating a particular time slice with a
request link and a set of experiments.
\end{enumerate}
This formulation takes into account the complexities that arise
when a particular MIP requests that variables needed for
its own experiments should also
be saved from a DECK experiment or from an experiment proposed
by a different MIP.
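As a minimal sketch of this chain (the class and field names below are
purely illustrative and do not reflect the actual data request schema
or the \texttt{dreqPy} internals):
\begin{verbatim}
# Purely illustrative structures for the variable-group -> request-link
# -> request-item chain described above; not the dreqPy schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VariableGroup:          # variables with MIP-specific priorities
    name: str
    variables: List[str] = field(default_factory=list)

@dataclass
class RequestLink:            # ties a variable group to an objective
    group: VariableGroup
    objective: str

@dataclass
class RequestItem:            # ties a time slice and experiments to a link
    link: RequestLink
    time_slice: str           # e.g. "1979-2014" (hypothetical)
    experiments: List[str] = field(default_factory=list)
\end{verbatim}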
The data request supports a broad range of users through several
different access points.
\begin{enumerate}
\item The XML database provides the reference document;
\item Web pages provide a direct representation of the database
content;
\item Excel workbooks provide selected overviews for specific MIPs and
experiments;
\item A Python library provides an interface to the database with some
  built-in support functions;
\item A command-line tool based on the Python library allows quick
  access to simple queries.
\end{enumerate}
The data request's machine-readable database, which is accessible
through a simple Python API, has been an extraordinary resource for
the modeling centers. They can, for example, directly integrate the
request specifications with their workflows to ensure that the correct
set of variables is saved for each experiment they plan to run. In
addition, it has given them a new-found ability to estimate the data
volume associated with meeting a MIP's requirements, a feature
exploited below in \secref{dvol}.
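A minimal sketch of such an integration is given below; the
\texttt{loadDreq} entry point and the collection and attribute names
follow the published \texttt{dreqPy} examples, but should be treated as
assumptions that may differ between versions of the software.
\begin{verbatim}
# Illustrative sketch: query the data request via the dreqPy library.
# Collection and attribute names here are assumptions based on the
# dreqPy documentation and may differ between data request versions.
from dreqPy import dreq

dq = dreq.loadDreq()                    # load the XML request database
print(len(dq.coll['var'].items))        # number of defined variables
for v in dq.coll['var'].items[:3]:
    print(v.label, '--', v.title)       # short name and long title
\end{verbatim}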
\subsection{Model inputs}
\label{sec:data-inputs}
Datasets used by the model for configuration of model inputs
\citep[\texttt{input4MIPs}, see][]{ref:duracketal2017} as well as
observations for comparison with models \citep[\texttt{obs4MIPs},
see][]{ref:teixeiraetal2014} are both now organized in the same way,
and share many of the naming and metadata conventions with the
CMIP model output itself. The datasets follow versioning
methodologies recommended by the WIP.
\subsection{Data Reference Syntax}
\label{sec:data-drs}
The organization of the model output follows the
\href{http://goo.gl/v1drZl}{Data Reference Syntax (DRS)} first used in
CMIP5, and now in somewhat modified form in CMIP6. The DRS depends on
pre-defined \emph{controlled vocabularies} (CVs) for various terms
including: the names of institutions, models, experiments, time
frequencies, etc. The CVs are now recorded as a version-controlled set
of structured text documents, and the WIP has taken steps to ensure
that there is a \href{https://goo.gl/HGafnJ}{single authoritative
source for any CV}, on which all elements in the toolchain will
rely. The DRS elements that rely on these controlled vocabularies
appear as netCDF attributes and are used in constructing file names,
directory names, and unique identifiers of datasets that are essential
throughout the CMIP6 infrastructure. These aspects are covered in
detail in the \href{https://goo.gl/mSe4rf}{CMIP6 Global Attributes,
DRS, Filenames, Directory Structure, and CVs} position paper. A new
element in the DRS indicates whether data has been stored on a native
grid or has been regridded (see discussion below in \secref{dvol} on
the potentially critical role of regridded output). This element of
the DRS will allow us to track the usage of the \emph{regridded
subset} of data, and assess the relative popularity of native-grid
vs. standard-grid output.
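As an illustration (the institution, model, experiment, and version
identifiers below are examples only, not a normative template; the
position paper above is authoritative), a CMIP6 dataset path and
filename constructed from DRS elements might look like:
\begin{verbatim}
CMIP6/CMIP/NOAA-GFDL/GFDL-CM4/historical/r1i1p1f1/Amon/tas/gr1/v20180701/
  tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-201412.nc
\end{verbatim}
Because each element (activity, institution, model, experiment,
variant, table, variable, grid label, and version) is drawn from a
controlled vocabulary, the same string can be parsed unambiguously by
any tool in the chain.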
\subsection{CMIP6 data volumes}
\label{sec:dvol}
As noted, extrapolations based on CMIP3 and CMIP5 lead to some
alarming trends in data volume \citep[see
e.g.,][]{ref:overpecketal2011}. The WIP has undertaken a rigorous
approach to the estimation of future data volumes, rather than simple
extrapolation. Contributions to the increase in data volume include the
systematic increase in model resolution and the growing complexity of the
experimental protocol and data request. We consider these separately:
\begin{description}
\item[Resolution] The median horizontal resolution of a CMIP model
  tends to grow with time, and is expected to be typically 100~km
  in CMIP6, compared with 200~km in CMIP5. The vertical resolution grows
in a more controlled fashion, at least as far as the data is
concerned, as often the requested output is reported on a standard
set of atmospheric levels that has not changed much over the years.
Similarly the temporal resolution of the data request does not
increase at the same rate as the model timestep: monthly averages
remain monthly averages. A doubling of model resolution leads
therefore to a quadrupling of the data volume, in principle. But
typically the temporal resolution of the model (though not the data)
is doubled as well, for reasons of numerical stability. Thus, for an
$N$-fold increase in horizontal resolution, we require an $N^3$
increase in computational capacity, which will result in an $N^2$
  increase in data volume. We argue, therefore, that data volume $V$
  and computational capacity $C$ are related as $V \sim C^{2/3}$,
  purely from the point of view of resolution (a numerical sketch
  follows this list). The exponent is even
smaller if vertical resolution increases are assumed. If we then
assume that centers will experience an 8-fold increase in $C$
between CMIPs (which is optimistic in an era of tight budgets), we
can expect a 4-fold increase in data volume. However, this is not
what we experienced between CMIP3 and CMIP5. What caused that
extraordinary 50-fold increase in data volume?
\item[Complexity] The answer lies in the complexity of CMIP: the
complexity of the data request, and of the experimental protocol.
The data request complexity is related to that of the science: the
number of processes being studied, and the physical variables
required for the study. In CPMIP \pipref{balajietal2017}, we have
attempted a rigorous definition of this complexity, measured
by the number of physical variables simulated by the model. This, we
argue, grows not smoothly like resolution, but in very distinct
generational step transitions, such as the one from
  atmosphere-ocean models to Earth system models, which involved a
  substantial jump in complexity -- the number of physical, chemical,
  and biological species being modeled -- as shown in
  \bibref{balajietal2017}.
% the following increase in complexity doesn't help explain the 50-fold increase
% which is what this paragraph is supposed to address
% the number of experiments (or number of years simulated) are
% primarily controlled by $C$, which you say is limited to 8-fold increase.
% need to restructure the argument.
  The second component of complexity is the experimental protocol, in
  particular the number of experiments themselves. With the new
  structure of CMIP6, comprising the DECK and 21 endorsed MIPs, this
  would appear to have grown tremendously. We propose, as a measure of
  experimental complexity, the \emph{total number of simulated years
    (SYs)} conforming to a given protocol. Note that this too is gated
  by $C$: modeling centers usually make tradeoffs between experimental
  complexity and resolution in deciding their level of participation
  in CMIP6, as discussed in \bibref{balajietal2017}.
\end{description}
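As a back-of-the-envelope illustration of how such volume estimates
compose, here is a minimal Python sketch; the grid sizes, variable
count, and run length are arbitrary placeholders, not CMIP6
prescriptions.
\begin{verbatim}
# Back-of-the-envelope data volume estimate for a single experiment.
# All numbers are arbitrary placeholders, not CMIP6 values.
nlat, nlon, nlev = 180, 360, 19      # 1-degree grid, standard levels
years, steps_per_year = 150, 12      # monthly means, 150-year run
n_variables = 100                    # number of requested output fields
bytes_per_value = 4                  # single-precision floats

volume = (nlat * nlon * nlev * years * steps_per_year
          * n_variables * bytes_per_value)
print(f"{volume / 1e12:.1f} TB before compression")   # ~0.9 TB

# The resolution scaling argument above: an 8-fold increase in
# computational capacity C implies roughly a 4-fold increase in V.
print(8 ** (2 / 3))                  # ~4.0
\end{verbatim}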
The WIP has recommended two further steps toward ensuring sustainable
growth in data volumes.
% Given the earlier arguments, it seems $C$ will limit growth of volume by itself
% Why are additional steps necessary?
\begin{enumerate}
\item The first of these is the consideration of standard horizontal
resolutions for saving data, as is already done for vertical and
temporal resolution in the data request. Cross-model analyses
already cast all data to a common grid in order to evaluate it as an
ensemble, typically at fairly low resolution. The studies of Knutti
and colleagues (e.g., \bibref{knuttietal2017}) are typically
performed on relatively coarse grids. We recommend that for most
purposes atmospheric data on the ERA-40 grid
  ($2^\circ\times 2.5^\circ$) would suffice, with exceptions, of course,
for experiments like those called for by HighResMIP
\pipref{haarsmaetal2016}. A similar recommendation is made for ocean
data (the World Ocean Atlas $1^\circ\times 1^\circ$ grid), with
extended discussion of the benefits and losses due to regridding
\citep[see][]{ref:griffiesetal2014,ref:griffiesetal2016}.
Regridding remains a contentious topic, and owing to
a lack of consensus, the WIP recommendations on regridding remain in
flux. The \href{https://goo.gl/wVtm5t}{CMIP6 Output Grid Guidance
document} outlines a number of possible recommendations, including
the provision of ``weights'' to a target grid. Many of the
considerations around regridding, particularly for ocean data in
CMIP6, are discussed at length in \bibref{griffiesetal2016}. A
  similar lack of consensus has led the WIP to drop a recommendation of
a common \emph{calendar} for particular experiments: a wide variety
of calendars are in use -- Gregorian, Julian, 365-day, and
equal-month (360-day) all remain popular options -- and the onus of
converting data across the multi-model ensemble (MME) to a common
one for analysis remains upon the end-user.
As outlined below in \secref{replica}, both ESGF data nodes and the
creators of secondary repositories are given considerable leeway in
choosing data subsets for replication, based on their own interests.
The tracking mechanisms outlined in \secref{pid} below will allow us
to ascertain, after the fact, how widely used the native grid data
may be \emph{vis-\`a-vis} the regridded subset, and allow us to
recalibrate the replicas, as usage data becomes available. We note
also that the providers of at least one of the standard metrics
packages \citep[ESMValTool,][]{ref:eyringetal2016a} have expressed a
  preference for standard-grid data for their analysis, as regridding
from disparate grids increases the complexity of their already
overburdened infrastructure.
\item The second is the issue of data compression. netCDF4, which is
the WIP's required standard for CMIP6 data, includes an option
for lossless compression or deflation \pipref{zivlempel1977} that
relies on the same technique used in standard tools such
as \texttt{gzip}. In practice, the reduction in data volume will
depend upon the ``entropy'' or randomness in the data, with
smoother data being compressed more.
Deflation entails computational costs, not only during creation of
the compressed data, but also every time the data are re-inflated.
There is also a subtle interplay with precision: for instance
temperatures usually seen in climate models appear to deflate better
when expressed in Kelvin, rather than Celsius, but that is due to
the fact that the leading order bits are always the same, and thus
the data is actually less precise. Deflation is also enhanced by
reorganizing (``shuffling'') the data internally into chunks that
have spatial and temporal coherence.
Some in the community argue for the use of more aggressive
\emph{lossy} compression methods \pipref{bakeretal2016}, but the
WIP, after consideration, believes the loss of precision entailed by
such methods, and the consequences for scientific results, require
considerably more evaluation by the community before such methods
can be accepted as common practice.
Given the options above, we undertook a systematic study of the
behavior of typical model output files under lossless compression,
the results of which are \href{https://goo.gl/qkdDnn}{publicly
  available}. The study indicates that standard \texttt{zlib}
  compression in the netCDF4 library, with the settings
  \texttt{deflate=2} (relatively modest, and computationally
  inexpensive) and \texttt{shuffle} (which improves
  spatiotemporal homogeneity), offers the best compromise between
  increased computational cost and reduced data volume (a code sketch
  of these settings follows this list). For a coupled
  model, we expect a total savings of about 50\%, with the ocean, ice,
  and land realms getting the most savings (owing to the large areas of
  the globe that are masked), and atmospheric data the least. This 50\%
estimate has been verified with sample output from some models
preparing for CMIP6.
\end{enumerate}
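For concreteness, here is a minimal sketch of applying these
recommended settings with the netCDF4-python library; the variable
name, grid size, and file name are illustrative only.
\begin{verbatim}
# Illustrative sketch: write one variable with the recommended
# lossless settings (zlib deflation level 2 plus shuffle).
import numpy as np
from netCDF4 import Dataset

nc = Dataset("tas_example.nc", "w", format="NETCDF4")
nc.createDimension("time", None)
nc.createDimension("lat", 90)            # e.g. a 2 x 2.5 degree grid
nc.createDimension("lon", 144)
tas = nc.createVariable("tas", "f4", ("time", "lat", "lon"),
                        zlib=True, complevel=2, shuffle=True)
tas.units = "K"      # Kelvin tends to deflate better (see text)
tas[0, :, :] = 273.15 + 15.0 * np.random.rand(90, 144)
nc.close()
\end{verbatim}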
The \href{https://goo.gl/iNBQ9m}{DREQ} alluded to above in
\secref{dreq} allows us to make a systematic assessment of these
considerations. The tool expects one to input a model's resolution
along with the experiments that will be performed and the data one
intends to save (using DREQ's \emph{priority} attribute). With this
information,
% We are actually capturing this information in the registered content
% for the model source_id entries - see http://rawgit.com/WCRP-CMIP/CMIP6_CVs/master/src/CMIP6_source_id.html
% The json entry contains resolutions for each active model realm
% https://github.com/WCRP-CMIP/CMIP6_CVs/blob/master/CMIP6_source_id.json
% "unprecedented" is incorrect.
% In CMIP5 we had a sophisticated capability of estimating data volume
% We polled the groups to determine which experiments they planned
% to run and how large their ensembles would be.
% We also asked what resolution they would report output.
% From this we estimated in Nov. 2010 a total data volume of 2.5 petabytes
% (2.1 petabytes if only high-priority variables were reported), not too
% far from the actual volume. I'll send you the analysis if you like.
% The modeling groups had access to this information.
\href{https://goo.gl/Ezz5v3}{dreqDataVol.py}, a tool built atop DREQ
and available from the WIP website, calculates the
data volume that will be produced. While similar
analyses were undertaken at PCMDI for CMIP5, this tool puts this
capability in the hands of the modeling centers themselves.
To make a preliminary estimate of total data volume, the WIP carried
out a survey of modeling centers in 2016, asking them for their
expected model resolutions, and intentions of participating in various
experiments. Based on that survey, we have made an initial forecast of
a data volume of 18~PB for CMIP6. This assumes an overall 50\% compression
rate, which has been approximately verified for at least one CMIP6
model whose compression behavior should be quite typical. This
number, 18~PB, is about 6 times the CMIP5 archive size, and can be
explained in terms of the compounding of modest increases in
resolution and complexity, as explained above. The more dramatic
increase in data volume between CMIP3 and CMIP5 was also due to these
same causes, but with a much larger change. Many models of the CMIP5
era added atmospheric chemistry and aerosol-cloud feedbacks, sometimes
with $\mathcal{O}(100)$ species. CMIP5 also marked the first time in
CMIP that ESMs were used to simulate changes in the carbon cycle, and
modeling groups performed many more simulations than in CMIP3, with a
corresponding increase in years simulated. There is no comparable jump
between CMIP5 and CMIP6. CMIP6's innovative DECK/endorsed-MIP
structure should thus be seen as an extension of CMIP5, and an attempt
to impose a rational order on it, rather than a qualitative leap.
% if you want to discuss different grids, perhaps here is a better place for
% that.
It should be noted that reporting output on a lower
resolution standard grid (rather than the native model grid) could
shrink this volume 10-fold, to 1.8~PB. This is an important number, as
will be seen below in \secref{replica}: the managers of Tier~1 nodes
have indicated that 2~PB is about the practical limit for replicated
storage of combined data from all models. The WIP believes
% I for one don't think it is important for all the data to be replicated
this target is achievable based on compression and the use of standard
grids. Both of these (the use of netCDF4 compression and regridding)
remain merely recommendations, and the centers are free to choose
whether or not to compress and regrid.
\section{Licensing}
\label{sec:licensing}
The WIP's recommended licensing policy is based on an examination of
data usage patterns in CMIP5. First, while the licensing policy called
for registration and acceptance of the terms of use, a large fraction,
perhaps a majority of users, actually obtained their data not directly
from ESGF, but from other copies, such as the ``snapshots'' alluded to
above in Item~\ref{snap}, \secref{principles}. Those users accessing
the data indirectly, as shown in \figref{dark}, relied on user groups
or their home institutions to make secondary repositories that could
be more conveniently accessed. The WIP
\href{https://goo.gl/7vHsPU}{CMIP6 Licensing and Access Control}
position paper refers to the secondary repositories as ``dark'' and
those obtaining CMIP data from those repositories as ``dark users''
who are invisible to the ESGF system. While this appears to subvert
the licensing and registration policy put in place for CMIP5, this
should not be seen as a ``bootleg'' process: it is in fact the most
efficient use of limited network bandwidth at the user sites. However,
this also removes the ability of users of these ``dark'' repositories
to benefit from the augmented provenance provided by infrastructure
updates, such as being notified of data retractions or replacements
when contributed datasets are found to be erroneous.
\begin{figure*}
\begin{center}
\includegraphics[width=175mm]{images/WIP-data-process.png}
\end{center}
\caption{Typical data usage pattern in CMIP5 involved users making
local copies, and user groups making institutional-scale caches
from ESGF. Figure courtesy Stephan Kindermann, DKRZ, adapted from
WIP Licensing White Paper.}
\label{fig:dark}
\end{figure*}
The WIP therefore recommends a licensing policy that inverts this
arrangement: it removes the impossible task of license enforcement from
the distribution system and embraces the ``dark'' repositories and users.
To quote the WIP position paper:
\begin{quote}
The proposal is that (1) a data license be embedded in the data
files, making it impossible for users to avoid having a copy of the
license, and (2) the onus on defending the provisions of the license
be on the original modeling center...
\end{quote}
The data archive snapshots and emerging resources that combine
archival and analysis capabilities (e.g., NCAR's
\href{https://goo.gl/sYTxC2}{CMIP Analysis Platform}) will host data
and offload some of the network provisioning requirements from ESGF
nodes themselves.
Modeling centers are offered two choices of \emph{Creative Commons}
licenses: data covered by the \href{https://goo.gl/CY5m2v}{Creative
Commons Attribution ``Share Alike'' 4.0 International License} will
be freely available; centers with more restrictive policies may adopt
the \href{https://goo.gl/KUNUKq}{Creative Commons Attribution
``NonCommercial Share Alike'' 4.0 International License}, which
restricts the data to non-commercial use. Further sharing of the data
is allowed, as the license travels with the data. The PCMDI website
provides a link to the current
\href{https://pcmdi.llnl.gov/CMIP6/TermsOfUse}{CMIP6 Terms of Use
webpage}.
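To make the ``license travels with the data'' idea concrete, the
sketch below embeds a license statement as a netCDF global attribute
using the netCDF4-python library; the attribute wording and the
institution placeholder are illustrative only, and the exact text
mandated for CMIP6 is specified in the position paper and the Terms of
Use, not here.
\begin{verbatim}
# Illustrative only: embed a license statement as a netCDF global
# attribute so it travels with every copy of the file. The exact
# attribute text required for CMIP6 is defined elsewhere.
from netCDF4 import Dataset

nc = Dataset("example_output.nc", "a")   # open an existing output file
nc.license = ("CMIP6 model data produced by <Your Institution> is "
              "licensed under a Creative Commons Attribution-ShareAlike "
              "4.0 International License "
              "(https://creativecommons.org/licenses/by-sa/4.0/).")
nc.close()
\end{verbatim}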
\section{Citation and provenance}
\label{sec:cite}
As noted in \secref{principles}, the WIP's position on citation flows
from two underlying considerations: one, to provide proper credit and
formal acknowledgment of the authors of datasets; and the other, to
enable rigorous tracking of data provenance and data usage. The
tracking facilitates scientific reproducibility and traceability, as
well as enabling statistical analyses of dataset utility.
In addition to clearly identifying what data have been used in
research studies and who deserves credit for providing that data, it
is essential that the data be examined for quality and that
documentation be made available describing the model and experiment
conditions under which it was generated. These subjects are addressed
in the four position papers summarized in this section.
The principles outlined above are well-aligned with the
\href{https://goo.gl/Pzb7F6}{Joint Declaration of Data Citation
Principles} formulated by the Force11 (The Future of Research
Communications and e-Scholarship) Consortium, which has acknowledged
the rapid evolution of digital scholarship and archival, as well as
the need to update the rules of scholarly publication for the digital
age. We are convinced that not only peer-reviewed publications but
also the data itself should now be considered a first-class product of
the research enterprise. This means that data requires curation and
should be treated with the same care as journal articles. Moreover,
most journals and academies now insist that data used in the
literature be made publicly available for independent inquiry and
reproduction of results. New services like
\href{http://www.scholix.org}{Scholix} are evolving to support the
exchange of, and access to, such data-data and data-literature
links.
Given the complexity of the CMIP6 data request, we expect, as shown in
\secref{dvol}, a total dataset count of $\mathcal{O}(10^6)$. Because
dozens of datasets are typically used in a single scientific study, it
is impractical to cite each dataset individually in the same way as
individual research publications are acknowledged. The WIP therefore
offers an option of citing data and giving credit to data providers
that relies on a rather coarse granularity, while at the same time
offering another option at a much finer granularity for recording the
specific files and datasets used in a study.
In the following, two distinct types of persistent identifiers (PIDs)
are discussed: DOIs, which can only be assigned to data that comply
with certain standards for citation metadata and curation, and the
more generic ``Handles'', which have fewer constraints and may be more
easily adapted for a particular use. Technically both types of PIDs
rely on the underlying global Handle System to provide services (e.g.,
to resolve the PIDs and provide associated metadata, such as the
location of the data itself).
\subsection{Persistent identifiers for acknowledgment and citation}
\label{sec:doi}
Based on experience from earlier phases of CMIP, some datasets initially contributed
to the CMIP6 archive will be flawed (due, for example, to errors in
processing) and therefore will not accurately represent a model's
behavior. When errors are uncovered in the datasets, they may be
replaced with corrected versions. Similarly, additional datasets may
be added to an initially incomplete collection of datasets. Thus,
initially at least, the DOIs assigned for the purposes of citation and
acknowledgement will represent an evolving underlying collection of
datasets.
The recommendations, detailed in the
\href{https://goo.gl/BFn9Hq}{CMIP6 Data Citation and Long Term
Archival} position paper, recognize two phases in the process of
assigning DOIs to collections of datasets: an initial phase, when the
data have been released and preliminary community analysis is still
underway, and a second phase, when most errors in the data have been
identified and corrected. Upon reaching the second phase, the data will be
transferred to long-term archival (LTA) of the IPCC Data Distribution
Centre (IPCC DDC) and deemed appropriate for interdisciplinary use
(e.g., in policy studies). The timing of the planned DDC snapshot is
linked to the IPCC AR6 schedule.
For evolving dataset aggregations, the data citation infrastructure
relies on information collected from the data providers and uses the
\href{https://www.datacite.org/dois.html}{DataCite} data
infrastructure to assign DOIs and record associated metadata.
DataCite is a leading global non-profit organisation that provides
persistent identifiers (DOIs) for research data. The DOIs will be
assigned to:
\begin{enumerate}
\item aggregations that include all the datasets contributed by one
model from one institution from all of a single MIP's experiments,
and
\item aggregations that include all datasets contributed by one model
from one institution generated in performing one experiment (which
might include one or more simulations).
\end{enumerate}
These aggregations are dynamic as far as the PID infrastructure is
concerned: new elements can be added to the aggregation without
modifying the PID. As an example, for the coarser of the two
aggregations defined above, the same PID will apply to an evolving
number of simulations as new experiments are performed with the model.
This PID architecture is shown in \figref{pidarch}. Since these
collections are dynamic, citation requires authors to provide a
version reference.
\begin{figure*}
\begin{center}
\includegraphics[width=175mm]{images/PID-architecture.png}
\end{center}
\caption{PID architecture, showing layers in the PID hierarchy. In
the lower layers of the hierarchy, PIDs are static once generated,
and new datasets generate new versions with new PIDs.}
\label{fig:pidarch}
\end{figure*}
For the stable dataset collections, the data citation infrastructure
requires some additional steps to meet formal requirements. First, we