<!doctype html>
<html>
<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="chrome=1">
    <title>CLDF - Cross-Linguistic Data Formats</title>

    <link rel="stylesheet" href="stylesheets/styles.css">
    <link rel="stylesheet" href="stylesheets/github-light.css">
    <meta name="viewport" content="width=device-width">
</head>

<body class="site">
<header class="site-header">
    <nav>
        <div class="sticky-nav">
            <div class="container-nav">
                <div class="nav-container">
                    <div class="navbar-brand"><a href="https://cldf.clld.org">
                        <img width="150" src="logos/logo_straight.png" alt="CLDF">
                    </a></div>
                    <div class="navbar-nav nav-mobile">
                        <a class="nav-item nav-link btn btn-nav" href="/">Home</a>
                        <a class="nav-item nav-link btn btn-nav" href="https://github.com/cldf/cldf">Specification</a>
                        <a class="nav-item nav-link btn btn-nav" href="v1.0/terms.rdf">Ontology</a>
                        <a class="nav-item nav-link btn btn-nav" href="v1.0/terms.html">(HTML)</a>
                        <a class="nav-item nav-link btn btn-nav" href="/publications.html">Publications</a>
                        <a class="nav-item nav-link btn btn-nav" href="/examples.html">Examples</a>
                    </div>
                    <div class="navbar-nav nav-main">
                        <a class="nav-item nav-link active" href="/">Home</a>
                        <a href="https://github.com/cldf/cldf" class="nav-item nav-link">Specification</a>
                        <a href="v1.0/terms.rdf" class="nav-item nav-link">Ontology</a>
                        <a href="v1.0/terms.html" class="nav-item nav-link">(HTML)</a>
                        <a class="nav-item nav-link" href="/publications.html">Publications</a>
                        <a class="nav-item nav-link" href="/examples.html">Examples</a>
                    </div>
                </div>
            </div>
        </div>
    </nav>
</header>
<main class="site-content">
    <section class="section-alt pad-top">
        <div class="container">
            <h1 class="text-center headline">Cross-Linguistic Data Formats</h1>

        </div>
    </section>
    <section class="section-main">
        <div class="container pad-top">
            <div class="service-flexrow pad-top">
                <div class="column-100">
                    <blockquote class="blockquote-info">
                        <p class="lead">
                            CLDF 1.3 has been released!
                            <a href="https://doi.org/10.5281/zenodo.10579537"><img src="https://zenodo.org/badge/DOI/10.5281/zenodo.10579537.svg" alt="DOI"></a>
                        </p>
                        <p>
                            See the <a href="https://github.com/cldf/cldf/blob/master/CHANGELOG.md">changelog</a> for a list of changes.
                        </p>
                    </blockquote>
                    <p>
                        See also this article describing CLDF:
                    </p>
                    <blockquote class="blockquote-info">
                        Forkel, R. et al. Cross-Linguistic Data Formats, advancing data sharing and reuse in comparative linguistics. Sci. Data. 5:180205 doi: <a href="https://doi.org/10.1038/sdata.2018.205">10.1038/sdata.2018.205</a> (2018).
                    </blockquote>
                </div>
            </div>
            <div class="service-flexrow pad-top">
                <div class="column-33">
                    <div style="padding-top: 3cm;">
                        <img src="logos/logo.png" alt="CLDF logo" class="img-responsive">
                    </div>
                </div>
                <div class="column-66">
                    <div class="gray-box">
                        <h2>Why?</h2>
                        <p>
                            To allow exchange of cross-linguistic data and decouple
                            development of tools and methods from that of
                            databases, standardized data formats are necessary.
                        </p>
                        <p>
                            Once established, these data formats could become a foundation
                            not only for tools but also for instructional material in the
                            spirit of <a href="http://datacarpentry.org/">Data Carpentry</a>
                            for historical linguistics and linguistic typology.
                        </p>
                    </div>
                    <div class="gray-box">
                        <h2>What?</h2>
                        <p>
                            We are mainly concerned here with tabular cross-linguistic data
                            that is typically analysed using quantitative (automated) methods
                            or made accessible using software tools like the
                            <code>clld</code> framework, such as
                        </p>
                        <ul>
                            <li>wordlists (or more complex lexical data including e.g.
                                cognate
                                judgements),
                            </li>
                            <li>
                                structure datasets (e.g. <a
                                    href="http://wals.info/feature">WALS features</a>),
                            </li>
                            <li>simple dictionaries.</li>
                        </ul>
                    </div>
                </div>
            </div>
            <div class="service-flexrow pad-top">
                <div class="column-100 gray-box">
                    <h2>Design principles</h2>
                    <ul>
                        <li>Data should be both editable "by hand" and amenable to reading
                            and
                            writing by software (preferably software the typical linguist
                            can be
                            expected to use correctly).
                        </li>
                        <li>Data should be encoded as UTF-8 text files.</li>
                        <li>If entities can be referenced, e.g. languages through their
                            Glottocode,
                            this should be done rather than duplicating information like
                            language
                            names.
                        </li>
                        <li>Compatibility with existing tools, standards and practice
                            should
                            always be kept in mind.
                        </li>
                    </ul>
                    <p>
                        Automated reuse requires that the standard specifies not just
                        the structure, but also the semantics of the data stored. Thus,
                        the CLDF specification should be as rigid as possible. Of course, new
                        types of data cannot be immediately compatible with
                        independently developed tools, so the CLDF standard should also provide
                        mechanisms that let data types evolve well-understood semantics
                        while being syntactically compatible from the start.
                    </p>
                </div>
            </div>
            <div class="service-flexrow pad-top">
                <div class="column-100 gray-box">
                    <h2>Technology</h2>
                    <p>
                        Since we are concerned with tabular data here, CLDF is built on
                        W3C's
                        <a href="http://www.w3.org/TR/tabular-data-model/#standard-file-metadata">
                            Model for Tabular Data and Metadata on the Web</a>
                        and
                        <a href="https://www.w3.org/TR/tabular-metadata/">
                            Metadata Vocabulary for Tabular Data</a>.
                        This model - by virtue of being a
                        <a href="https://en.wikipedia.org/wiki/JSON-LD">JSON-LD</a> dialect -
                        is ideally suited to be combined with an
                        <a href="https://en.wikipedia.org/wiki/Ontology_(information_science)">ontology</a>
                        to specify syntax as well as semantics of a data serialization format.
                        Much like <a href="https://software.sil.org/shoebox/mdf/">MDF</a> - SIL's
                        Multi-Dictionary Formatter - adds a hierarchical data model on top of
                        Toolbox's standard format markers to support a data reuse scenario,
                        CLDF structures cross-linguistic data to enable automated reuse in
                        typical analyses in historical linguistics.
                    </p>
                    <p>
                        One of the main goals of the <b>CLDF</b> specification is a useful
                        delineation of data and tools. A CSV-based format makes it
                        easy to use this data in a
                        <a href="https://en.wikipedia.org/wiki/Pipeline_%28Unix%29">
                            UNIX-style pipeline</a> of
                        data transformation commands.
                        This pipeline style of data transformation and analysis seems to
                        be at the core of typical workflows in fields such as
                        historical linguistics, e.g. those of
                        <a href="http://lingpy.org/tutorial/workflow.html">LingPy</a> or
                        <a href="https://github.com/cysouw/qlcPipe">QLC</a>.
                    </p>
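                    <p>
                        As a minimal sketch of one such pipeline step, the following Python
                        function copies CSV rows from an input stream to an output stream and
                        keeps only the rows with a given value in one column. The function
                        name <code>filter_rows</code> and the sample rows are made up for
                        illustration; they are not part of the CLDF specification.
                    </p>
<pre><code class="language-python">import csv
import io


def filter_rows(source, sink, column, wanted):
    """Copy CSV rows from source to sink, keeping rows where column equals wanted."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(sink, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    kept = 0
    for row in reader:
        if row[column] == wanted:
            writer.writerow(row)
            kept += 1
    # Report how many data rows were passed on to the next step.
    return kept


# Demo with in-memory text streams and hypothetical sample data.
sample = io.StringIO("ID,Language_ID,Form\n1,stan1295,Hand\n2,stan1293,hand\n")
result = io.StringIO()
filter_rows(sample, result, "Language_ID", "stan1295")
print(result.getvalue(), end="")
</code></pre>
                    <p>
                        Passing <code>sys.stdin</code> and <code>sys.stdout</code> instead of
                        the in-memory streams would let such a function run between other
                        commands in a shell pipeline.
                    </p>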
                    <p>
                        If suitable text- and line-based formats are available, this
                        pipeline style also allows for easy extension.
                        For example, a workflow for automatic cognate judgements based on LingPy
                        functionality could be extended with phylogenetic
                        analysis and post-processing via
                        <a href="https://github.com/lmaurits/phyltr">phyltr</a>, which
                        processes sets
                        of phylogenetic trees represented in the Newick format, or
                        <a href="http://etetoolkit.org/documentation/tools/">ete</a>.
                    </p>
                    <p>
                        If cross-linguistic comparisons follow in the footsteps of
                        bioinformatics, workflows based on UNIX pipelines may at some point
                        be formalized using a
                        <a href="http://common-workflow-language.github.io/">common workflow language</a>.
                    </p>
                </div>
            </div>
            <div class="service-flexrow pad-top">
                <div class="column-100 gray-box">
                    <h2>History</h2>

                    <p>While data formats to exchange linguistic data have been around for
                        some time, e.g. the SFM or Standard Format
                        used by Toolbox, new developments in the area of language
                        diversity research have motivated this push for a new
                        set of formats:</p>
                    <ul>
                        <li>A new interest in standardizing <a
                                href="https://www.w3.org/TR/tabular-data-model/">tabular
                            data on the web</a>,
                            with a particular focus on <a
                                    href="http://csvconf.com/">CSV</a>.</li>
                        <li>A trend towards using computational methods to analyse
                            large-scale cross-linguistic data.
                        </li>
                        <li>
                            The <a href="https://github.com/clld/clld">
                            clld framework</a>, developed within the
                            <a href="http://clld.org">CLLD project</a>, has shown that many
                            different cross-linguistic databases can be built on top of
                            the same core data model. CLDF is an attempt to externalise
                            this data model.
                        </li>
                    </ul>
                    <p>Thus, following up on discussions from the first workshop on
                        <a href="http://www.mpi.nl/events/language-comparison-with-linguistic-databases-reflex-and-typological-databases">
                            Language Comparison with Linguistic Databases</a>, a
                        <a href="http://www.eva.mpg.de/linguistics/conferences/2014-ws-lanclid2/index.html">second
                            workshop</a> in Leipzig
                        focused on the idea of a very simple CSV-based format to exchange
                        very simple cross-linguistic data.</p>
                    <p>Simplicity was the main design goal from the start, so the formats
                        under consideration start out as simple as possible and
                        evolve from there. With
                        <a href="https://doi.org/10.5281/zenodo.1117644">CLDF 1.0</a>
                        we provide
                        a stable baseline for further evolution.</p>
                </div>
            </div>
        </div>
    </section>
</main>
<footer class="site-footer">
    <!-- expanded_footer -->

    <div class="footer">
        <div class="container flex-footer">
            <div class="f-links f-item">
                <h2>CLDF Specification</h2>
                <ul class="footer-links">
                    <li><a href="https://github.com/cldf/cldf/zipball/master">Download
                        <strong>ZIP File</strong></a></li>
                    <li><a href="https://github.com/cldf/cldf/tarball/master">Download
                        <strong>TAR Ball</strong></a></li>
                    <li><a href="https://github.com/cldf/cldf">View On <strong>GitHub</strong></a>
                    </li>
                </ul>
            </div>
            <div class="f-about f-item">
                <h2>About</h2>
                <p>
                    CLDF is an initiative by the Glottobank consortium with support from the
                    Max Planck Institute for the Science of Human History and the ERC project
                    Computer-Assisted Language Comparison.
                </p>
                <table style="width: 100%">
                    <tr>
                        <td style="width: 33%; text-align: left; padding-top: 10px;">
                            <a href="http://calc.digling.org" style="border: none;">
                                <img src="logos/European_Research_Council_logo.svg" alt="erc-logo" style="width:100px;"/>
                            </a>
                        </td>
                        <td style="width: 33%; text-align: center;">
                            <a href="http://glottobank.org" style="border: none;">
                                <img src="logos/glottobank.png" alt="glottobank-logo" style="width:100px;"/>
                            </a>
                        </td>
                        <td style="width: 33%; text-align: right; padding-top: 10px;">
                            <a href="http://www.shh.mpg.de/" style="border: none;">
                                <img src="logos/max-planck-logo.svg" alt="mpi-logo" style="width:100px;"/>
                            </a>
                        </td>
                    </tr>
                </table>
            </div>
            <div class="f-contact f-item">
                <!-- Contact Us -->
                <h2>Contact Info</h2>
                <span class="footer-address">Robert Forkel</span><br/>
                <span class="footer-address" style="font-family: monospace">forkel@shh.mpg.de</span>
            </div>
        </div>
    </div>
</footer>
</body>
</html>