<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="chrome=1">
<title>CLDF - Cross-Linguistic Data Formats</title>
<link rel="stylesheet" href="stylesheets/styles.css">
<link rel="stylesheet" href="stylesheets/github-light.css">
<meta name="viewport" content="width=device-width">
</head>
<body class="site">
<header class="site-header">
<nav>
<div class="sticky-nav">
<div class="container-nav">
<div class="nav-container">
<div class="navbar-brand"><a href="https://cldf.clld.org">
<img width="150" src="logos/logo_straight.png" alt="CLDF">
</a></div>
<div class="navbar-nav nav-mobile">
<a class="nav-item nav-link btn btn-nav" href="/">Home</a>
<a class="nav-item nav-link btn btn-nav" href="https://github.com/cldf/cldf">Specification</a>
<a class="nav-item nav-link btn btn-nav" href="v1.0/terms.rdf">Ontology</a>
<a class="nav-item nav-link btn btn-nav" href="v1.0/terms.html">(HTML)</a>
<a class="nav-item nav-link btn btn-nav" href="/publications.html">Publications</a>
<a class="nav-item nav-link btn btn-nav" href="/examples.html">Examples</a>
</div>
<div class="navbar-nav nav-main">
<a class="nav-item nav-link active" href="/">Home</a>
<a href="https://github.com/cldf/cldf" class="nav-item nav-link">Specification</a>
<a href="v1.0/terms.rdf" class="nav-item nav-link">Ontology</a>
<a href="v1.0/terms.html" class="nav-item nav-link">(HTML)</a>
<a class="nav-item nav-link" href="/publications.html">Publications</a>
<a class="nav-item nav-link" href="/examples.html">Examples</a>
</div>
</div>
</div>
</div>
</nav>
</header>
<main class="site-content">
<section class="section-alt pad-top">
<div class="container">
<h1 class="text-center headline">Cross-Linguistic Data Formats</h1>
</div>
</section>
<section class="section-main">
<div class="container pad-top">
<div class="service-flexrow pad-top">
<div class="column-100">
<blockquote class="blockquote-info">
<p class="lead">
CLDF 1.3 has been released!
<a href="https://doi.org/10.5281/zenodo.10579537"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.10579537.svg" alt="DOI"></a>
</p>
<p>
See the <a href="https://github.com/cldf/cldf/blob/master/CHANGELOG.md">changelog</a> for a list of changes.
</p>
</blockquote>
<p>
See also this article describing CLDF:
</p>
<blockquote class="blockquote-info">
Forkel, R. et al. Cross-Linguistic Data Formats, advancing data sharing and reuse in comparative linguistics. Sci. Data. 5:180205 doi: <a href="https://doi.org/10.1038/sdata.2018.205">10.1038/sdata.2018.205</a> (2018).
</blockquote>
</div>
</div>
<div class="service-flexrow pad-top">
<div class="column-33">
<div style="padding-top: 3cm;">
<img src="logos/logo.png" alt="CLDF logo" class="img-responsive">
</div>
</div>
<div class="column-66">
<div class="gray-box">
<h2>Why?</h2>
<p>
To allow exchange of cross-linguistic data and decouple
development of tools and methods from that of
databases, standardized data formats are necessary.
</p>
<p>
Once established, these data formats could become a foundation
not
only for tools but also for instruction material in the spirit
of
<a href="http://datacarpentry.org/">Data Carpentry</a> for
historical
linguistics and linguistic typology.
</p>
</div>
<div class="gray-box">
<h2>What?</h2>
<p>
The main types of cross-linguistic data we are concerned with
here
are tabular data that are typically
analysed using quantitative (automated) methods or made
accessible
using software tools like the <code>clld</code> framework, such as
</p>
<ul>
<li>wordlists (or more complex lexical data including e.g.
cognate
judgements),
</li>
<li>
structure datasets (e.g. <a
href="http://wals.info/feature">WALS features</a>),
</li>
<li>simple dictionaries.</li>
</ul>
</div>
</div>
</div>
<div class="service-flexrow pad-top">
<div class="column-100 gray-box">
<h2>Design principles</h2>
<ul>
<li>Data should be both editable "by hand" and amenable to reading
and
writing by software (preferably software the typical linguist
can be
expected to use correctly).
</li>
<li>Data should be encoded as UTF-8 text files.</li>
<li>If entities can be referenced, e.g. languages through their
Glottocode,
this should be done rather than duplicating information like
language
names.
</li>
<li>Compatibility with existing tools, standards and practice
should
always be kept in mind.
</li>
</ul>
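<p>
The referencing principle above can be illustrated with a minimal Python sketch.
It assumes two small CSV tables invented for this example: the file contents, the
column names and the Glottocodes <code>abcd1234</code> and <code>efgh5678</code>
are placeholders, not real data. Language names are stored once in a language
table, and value rows only carry an identifier that is resolved on demand.
</p>

```python
import csv
import io

# Hypothetical language table: names and Glottocodes are stored exactly once.
LANGUAGES = """ID,Name,Glottocode
lang1,Example Language A,abcd1234
lang2,Example Language B,efgh5678
"""

# Hypothetical value table: rows reference languages by ID only.
VALUES = """ID,Language_ID,Parameter_ID,Value
1,lang1,wordorder,SVO
2,lang2,wordorder,SOV
"""


def read_table(text):
    """Parse UTF-8 CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def with_language_names(values, languages):
    """Resolve each Language_ID against the language table."""
    by_id = {lang["ID"]: lang for lang in languages}
    return [
        dict(row, Language_Name=by_id[row["Language_ID"]]["Name"])
        for row in values
    ]


resolved = with_language_names(read_table(VALUES), read_table(LANGUAGES))
```

<p>
Because names live in one place, correcting a language name means editing a single
row rather than every value that mentions the language.
</p>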
<p>
Automated re-use requires that the standard specifies not just
the structure, but also the semantics of the data stored. Thus,
the CLDF specification should be as rigid as possible. Of course, new
types of data cannot be immediately compatible with
independently developed tools; so the CLDF standard should also provide
mechanisms to let data types evolve well-understood semantics,
while being syntactically compatible from the start.
</p>
</div>
</div>
<div class="service-flexrow pad-top">
<div class="column-100 gray-box">
<h2>Technology</h2>
<p>
Since we are concerned with tabular data here, CLDF is built on
W3C's
<a href="http://www.w3.org/TR/tabular-data-model/#standard-file-metadata">
Model for Tabular Data and Metadata on the Web</a>
and
<a href="https://www.w3.org/TR/tabular-metadata/">
Metadata Vocabulary for Tabular Data</a>.
This model - by virtue of being a
<a href="https://en.wikipedia.org/wiki/JSON-LD">JSON-LD</a> dialect -
is ideally suited to be combined with an
<a href="https://en.wikipedia.org/wiki/Ontology_(information_science)">ontology</a>
to specify syntax as well as semantics of a data serialization format.
Much like <a href="https://software.sil.org/shoebox/mdf/">MDF</a> - SIL's
Multi-Dictionary Formatter - adds a hierarchical data model on top of
Toolbox's standard format markers to support a data reuse scenario,
CLDF structures cross-linguistic data to make automated reuse in
typical analyses in historical linguistics possible.
</p>
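<p>
As a rough sketch of how a JSON-LD metadata description can tie columns to ontology
terms, the Python snippet below builds a minimal table description and reads the
semantics back out. The file name <code>values.csv</code>, the column names and the
helper <code>column_semantics</code> are assumptions made for illustration; the
property URLs follow the pattern of the ontology linked from this page, and the
specification remains the authoritative source for the actual terms.
</p>

```python
import json

# Base URL of the ontology terms (assumed to match the linked terms.rdf).
CLDF_TERMS = "http://cldf.clld.org/v1.0/terms.rdf#"

# Minimal metadata description: syntax (columns, datatypes) plus
# semantics (propertyUrl pointing at ontology terms).
metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "dc:conformsTo": CLDF_TERMS + "StructureDataset",
    "tables": [
        {
            "url": "values.csv",  # illustrative file name
            "dc:conformsTo": CLDF_TERMS + "ValueTable",
            "tableSchema": {
                "columns": [
                    {"name": "ID", "datatype": "string",
                     "propertyUrl": CLDF_TERMS + "id"},
                    {"name": "Language_ID",
                     "propertyUrl": CLDF_TERMS + "languageReference"},
                    {"name": "Parameter_ID",
                     "propertyUrl": CLDF_TERMS + "parameterReference"},
                    {"name": "Value",
                     "propertyUrl": CLDF_TERMS + "value"},
                ],
                "primaryKey": ["ID"],
            },
        }
    ],
}


def column_semantics(table):
    """Map each column name to the local name of its ontology term."""
    return {
        col["name"]: col["propertyUrl"].rsplit("#", 1)[-1]
        for col in table["tableSchema"]["columns"]
        if "propertyUrl" in col
    }


# Serialized form, as it could be stored next to the CSV file.
metadata_json = json.dumps(metadata, indent=2)
```

<p>
A tool that only knows the ontology terms can then locate the language reference
column regardless of what the column happens to be called in a given dataset.
</p>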
<p>
One of the main goals of the <b>CLDF</b> specification is a useful
delineation of data and tools. Using a CSV-based
format makes it easy to use this data in a
<a href="https://en.wikipedia.org/wiki/Pipeline_%28Unix%29">
UNIX-style pipeline</a> of
data transformation commands.
This pipeline style of data transformation and analysis seems to
be at the core of typical workflows in
historical linguistics, e.g. those of
<a href="http://lingpy.org/tutorial/workflow.html">LingPy</a> or
<a href="https://github.com/cysouw/qlcPipe">QLC</a>.
</p>
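<p>
A single stage of such a pipeline might look like the following Python sketch. It
is only an illustration: the function <code>filter_rows</code>, the sample rows and
the placeholder Glottocodes are assumptions, not part of CLDF or of the tools named
above. The stage reads CSV from one stream and writes the matching rows to another,
so stages can be chained.
</p>

```python
import csv
import io


def filter_rows(src, dst, column, wanted):
    """Copy CSV rows from src to dst, keeping rows where row[column] == wanted.

    Returns the number of rows written (excluding the header).
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                            lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] == wanted:
            writer.writerow(row)
            kept += 1
    return kept


# Illustrative input; the Glottocodes are made-up placeholders.
SAMPLE = """ID,Language_ID,Parameter_ID,Value
1,abcd1234,wordorder,SVO
2,efgh5678,wordorder,SOV
3,abcd1234,tone,no
"""


def demo():
    """Run the filter on the sample data and return (count, CSV text)."""
    out = io.StringIO()
    kept = filter_rows(io.StringIO(SAMPLE), out, "Language_ID", "abcd1234")
    return kept, out.getvalue()
```

<p>
Passing <code>sys.stdin</code> and <code>sys.stdout</code> instead of the in-memory
streams would turn the same function into a command that can sit between other
commands in a shell pipeline.
</p>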
<p>
If suitable text- and line-based formats are available, this
pipeline style also allows for easy extensibility.
For example, a workflow for automatic cognate judgements based on LingPy
functionality could be extended with phylogenetic
analysis and post-processing via
<a href="https://github.com/lmaurits/phyltr">phyltr</a>, which
processes sets
of phylogenetic trees represented in the Newick format, or
<a href="http://etetoolkit.org/documentation/tools/">ete</a>.
</p>
<p>
If cross-linguistic comparisons proceed in the footsteps of
bioinformatics, workflows based on UNIX pipelines may at some point
be formalized using a <a
href="http://common-workflow-language.github.io/">
common workflow language</a>.
</p>
</div>
</div>
<div class="service-flexrow pad-top">
<div class="column-100 gray-box">
<h2>History</h2>
<p>While data formats to exchange linguistic data have been around for
some time, e.g. the SFM or Standard Format
used by Toolbox, new developments in the area of language
diversity research have motivated this push for a new
set of formats:</p>
<ul>
<li>A new interest in standardizing <a
href="https://www.w3.org/TR/tabular-data-model/">tabular
data on the web</a>,
with a particular focus on <a
href="http://csvconf.com/">CSV</a>.</li>
<li>A trend towards using computational methods to analyse
large-scale cross-linguistic data.
</li>
<li>
The <a href="https://github.com/clld/clld">
clld framework</a>, developed within the
<a href="http://clld.org">CLLD project</a>, has shown that many
different cross-linguistic databases can be built on top of
the
same core data model. CLDF is an attempt to externalise this
data
model.
</li>
</ul>
<p>Thus, following up on discussions from the first workshop on
<a href="http://www.mpi.nl/events/language-comparison-with-linguistic-databases-reflex-and-typological-databases">
Language Comparison with Linguistic Databases
</a> a
<a href="http://www.eva.mpg.de/linguistics/conferences/2014-ws-lanclid2/index.html">second
workshop</a> in Leipzig
focused on the idea of a very simple CSV-based format to exchange
very simple cross-linguistic data.</p>
<p>Simplicity was the main design goal from the start, so the formats
under consideration start out as simple as possible
and evolve from there. With
<a href="https://doi.org/10.5281/zenodo.1117644">CLDF 1.0</a>
we provide
a stable baseline for further evolution.</p>
</div>
</div>
</div>
</section>
</main>
<footer class="site-footer">
<!-- expanded_footer -->
<div class="footer">
<div class="container flex-footer">
<div class="f-links f-item">
<h2>CLDF Specification</h2>
<ul class="footer-links">
<li><a href="https://github.com/cldf/cldf/zipball/master">Download
<strong>ZIP File</strong></a></li>
<li><a href="https://github.com/cldf/cldf/tarball/master">Download
<strong>TAR Ball</strong></a></li>
<li><a href="https://github.com/cldf/cldf">View On <strong>GitHub</strong></a>
</li>
</ul>
</div>
<div class="f-about f-item">
<h2>About</h2>
<p>
CLDF is an initiative by the Glottobank consortium with support from the
Max Planck Institute for the Science of Human History and the ERC project
Computer-Assisted Language Comparison.
</p>
<table style="width: 100%">
<tr>
<td style="width: 33%; text-align: left; padding-top: 10px;">
<a href="http://calc.digling.org" style="border: none;">
<img src="logos/European_Research_Council_logo.svg" alt="erc-logo" style="width:100px;"/>
</a>
</td>
<td style="width: 33%; text-align: center;">
<a href="http://glottobank.org" style="border: none;">
<img src="logos/glottobank.png" alt="" style="width:100px;"/>
</a>
</td>
<td style="width: 33%; text-align: right; padding-top: 10px;">
<a href="http://www.shh.mpg.de/" style="border: none;">
<img src="logos/max-planck-logo.svg" alt="mpi-logo" style="width:100px;"/>
</a>
</td>
</tr>
</table>
</div>
<div class="f-contact f-item">
<!-- Contact Us -->
<h2>Contact Info</h2>
<span class="footer-address">Robert Forkel</span><br/>
<span class="footer-address" style="font-family: monospace">[email protected]</span>
</div>
</div>
</div>
</footer>
</body>
</html>