From June through August of 2019 I wrote a Text Encoding Initiative exporter for the Cuneiform Digital Library Initiative, to make data from cuneiform tablets and other inscriptions more accessible, in particular to the Scaife reading environment. The work was funded as a Google Summer of Code Project.
This is a summary of what I accomplished.
The project itself didn't have a public staging server, so I set up a temporary one on my own domain.
- Visit the test server at https://cdli.thaumas.net/scaife/ (If necessary, click the gear and enable the 'CDLI Reader' and 'Suggested Documents' components.)
- Click on one of the suggested documents.
- Click Translation to show/hide the parallel translation.
The main goals of the project were:
- Convert a document from ATF and display it in Scaife.
- Publish a repo with a subset of convertible records.
- Set up automated export from CDLI data to a CTS repo.
- Demonstrate a Scaife instance running somewhere.
The following repositories are new code I wrote as part of the project:
- https://github.com/cdli-gh/atf2tei (document converter)
- https://github.com/cdli-gh/cdli-cts (tei export target repo)
- https://github.com/cdli-gh/cdli-cts-server (capitains server for cdli-cts)
- https://github.com/cdli-gh/cdli-search (catalogue search experiment)
I made contributions to two more repositories which are important components of the project:
- https://github.com/cdli-gh/scaife (fork of the reading environment)
- https://github.com/oracc/pyoracc (parser atf2tei is using)
In the scaife fork, see the cdli branch for the changes behind the demo linked above. These changes were not accepted into the upstream Scaife repo, so the fork holds our project's customized version.
See below for my contributions to the pyoracc parser.

The individual pull requests, merge requests, and issues I filed during the project:
- Capitains/MyCapytain#192 (merged)
- Capitains/flask-capitains-nemo#125 (merged)
- Capitains/HookTest#146 (merged)
- Capitains/HookTest#144 (merged)
- Capitains/HookTest#143 (merged)
- oracc/pyoracc#85 (merged)
- oracc/pyoracc#84 (merged)
- oracc/pyoracc#81 (merged)
- oracc/pyoracc#80 (merged)
- oracc/pyoracc#79 (merged)
- oracc/pyoracc#77 (merged)
- oracc/pyoracc#76 (merged)
- cdli-gh/pyoracc#42 (merged)
- cdli-gh/pyoracc#41 (merged)
- pallets/jinja#1030 (merged)
- pytest-dev/pytest#5416 (merged)
- scaife-viewer/readhomer#30
- scaife-viewer/readhomer#31
- scaife-viewer/readhomer#33
- scaife-viewer/scaife-basic#6
- scaife-viewer/scaife-basic#5
- vim/vim#4619 (merged)
- https://gitlab.com/cdli/framework/merge_requests/11 (merged)
- Capitains/Nautilus#85
- Capitains/HookTest#145
- scaife-viewer/scaife-viewer#370
- scaife-viewer/readhomer#34
- oracc/pyoracc#78
- oracc/pyoracc#82
- oracc/pyoracc#83
- Reported various ATF syntax inconsistencies I found to @epp, who corrected them in the master database.
There are several continuation points for the project which I didn't have time to pursue. Hopefully these can be developed over time.
ATF line markup isn't fully converted. The exported XML files should use TEI markup to represent damage, restorations, small-cap logograms, and superscript determinatives. This would need to be supported both in atf2tei and in the TEI parser in Scaife. Greek and Latin layout works well enough with plain Unicode text, but cuneiform transliteration requires extra typographic features.
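As a rough illustration of what that conversion could look like, here is a minimal sketch that maps two common ATF conventions, determinatives in {...} and signs flagged as damaged with a trailing #, onto TEI elements. The element and attribute choices are my assumptions for illustration, not what atf2tei currently emits.

```python
import re
import xml.etree.ElementTree as ET

TEI = 'http://www.tei-c.org/ns/1.0'


def transliteration_line(atf_text):
    """Render one ATF transliteration line as a TEI <l> element.

    Minimal sketch: determinatives in {...} become superscript <hi>
    elements and signs flagged with a trailing '#' become <damage>.
    Restorations, logograms, etc. would need similar handling.
    """
    line = ET.Element(f'{{{TEI}}}l')
    cursor = None  # last child element, so plain text lands in its tail

    def emit_text(text):
        if cursor is None:
            line.text = (line.text or '') + text
        else:
            cursor.tail = (cursor.tail or '') + text

    def emit_element(tag, content, **attrib):
        nonlocal cursor
        elem = ET.SubElement(line, f'{{{TEI}}}{tag}', attrib)
        elem.text = content
        cursor = elem

    for chunk in re.split(r'(\{[^}]*\}|\S+#)', atf_text):
        if not chunk:
            continue
        if chunk.startswith('{') and chunk.endswith('}'):
            emit_element('hi', chunk[1:-1], rend='superscript')
        elif chunk.endswith('#'):
            emit_element('damage', chunk[:-1])
        else:
            emit_text(chunk)
    return line


# '{d}inanna# an-ta' becomes, roughly,
# <l><hi rend="superscript">d</hi><damage>inanna</damage> an-ta</l>
print(ET.tostring(transliteration_line('{d}inanna# an-ta'), encoding='unicode'))
```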
There are also some unhandled annotations, like comments and cross-references, which should be supported.
Exported data should remain valid according to the HookTest suite.
In addition to the ATF-format transcription data, CDLI publishes a CSV-format catalog of each record. The conversion tool should read this and represent the relevant fields in the teiHeader, so the XML documents are a more complete representation of the texts. At a minimum there should be publication references and URLs for the hand-drawn copies and photographs of the source object. This will provide viewer software with everything it needs to present the same data as the main CDLI website.
I wrote a Python wrapper for the catalog metadata. This should be packaged so it can be shared between the various applications.
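To give a sense of the mapping involved, here is a rough sketch that reads catalogue rows and builds a skeletal teiHeader. The column names ('id_text', 'designation', 'primary_publication') are my assumptions about the CSV layout, and the element structure is illustrative rather than a finished design.

```python
import csv
import xml.etree.ElementTree as ET

TEI = 'http://www.tei-c.org/ns/1.0'


def tei_header(row):
    """Build a skeletal teiHeader from one catalogue row.

    Sketch only: a complete header should also carry publication
    references and URLs for photographs and hand copies once the
    column names are confirmed against the published catalogue.
    """
    header = ET.Element(f'{{{TEI}}}teiHeader')
    file_desc = ET.SubElement(header, f'{{{TEI}}}fileDesc')
    title_stmt = ET.SubElement(file_desc, f'{{{TEI}}}titleStmt')
    title = ET.SubElement(title_stmt, f'{{{TEI}}}title')
    title.text = row.get('designation', '')
    source_desc = ET.SubElement(file_desc, f'{{{TEI}}}sourceDesc')
    bibl = ET.SubElement(source_desc, f'{{{TEI}}}bibl')
    bibl.text = row.get('primary_publication', '')
    return header


with open('cdli_catalogue.csv', newline='', encoding='utf-8') as f:
    for row in csv.DictReader(f):
        header = tei_header(row)
        # ...merge into the TEI document identified by row['id_text']
```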
There are other data sources which could also be added, either to the XML or to the viewer. There is lemma and part-of-speech annotation data for many tablets from MTAAC, Oracc, and other projects.
Scaife currently expects the text and translation as separate, long documents. That makes some sense given the history of the scholarship it supports. We have short documents where the photo, copy, transliteration, normalization, and translation are usually considered together, line by line. Ideally all of these could be shown or hidden independently, with the layout transitioning between parallel and interlinear presentations depending on screen size.
I would like to see my proof-of-concept CDLI Reader component developed into a proper modular set of files which could be easily added to any Scaife instance to support cuneiform documents.
I wrote a quick script to upload the catalog metadata to Elasticsearch, where it could be searched as full text. It didn't do better than the general search on the current CDLI website, but with some tuning it should be possible to improve things.
For example, searching for a tablet reference like 'K 162' should find P345482, the primary example of the Akkadian Descent of Ishtar text, without the user having to know to search for 'K 00162' in the accession number field.
If the ATF data is also uploaded and indexed appropriately, the service could provide easy programmatic access to the whole corpus from a very small codebase, loaded directly from the published data set.
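For reference, a sketch of the indexing and search side, assuming the official elasticsearch Python client (version 8-style calls); the index name is a placeholder and the accession-number normalization is an idea rather than the script I actually ran:

```python
import csv

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch('http://localhost:9200')


def catalogue_actions(path, index='cdli-catalogue'):
    """Yield bulk-index actions, one per catalogue row."""
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            yield {
                '_index': index,
                '_id': row.get('id_text'),
                '_source': row,
            }


bulk(es, catalogue_actions('cdli_catalogue.csv'))

# Full-text query across all fields; with analyzer tuning (for example
# stripping leading zeros from accession numbers) 'K 162' could match
# the 'K 00162' stored in the catalogue.
response = es.search(index='cdli-catalogue',
                     query={'query_string': {'query': 'K 162'}})
for hit in response['hits']['hits']:
    print(hit['_id'], hit['_source'].get('designation'))
```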
To get good typography for the transliteration lines we need markup for determinatives, logograms, and damage. Those should be represented by TEI elements in the document served over CTS and converted to HTML elements like <sup> and <span> with custom classes for display. That's not too hard, but to protect against cross-site scripting the entire XML tree must be checked and cleaned of elements we don't want, which isn't something we should write ourselves. Suggestions welcome. Scaife avoids this issue by serving plain Unicode text, which works well enough for Greek and Latin, but not for us.
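To make the idea concrete, here is a sketch of the mapping half of the problem: a whitelist of TEI elements is converted to HTML and no source attributes are copied over. The class names are placeholders, and a real deployment would still want a vetted HTML sanitizer on top rather than relying on hand-rolled code like this.

```python
import xml.etree.ElementTree as ET

TEI = '{http://www.tei-c.org/ns/1.0}'

# TEI element -> (HTML tag, CSS class) used for display.
ALLOWED = {
    TEI + 'l': ('p', 'atf-line'),
    TEI + 'hi': ('sup', 'determinative'),
    TEI + 'damage': ('span', 'damage'),
    TEI + 'supplied': ('span', 'supplied'),
}


def tei_to_html(elem):
    """Convert a TEI fragment to HTML elements for display.

    Elements outside the whitelist are flattened to plain <span>s,
    and no source attributes are ever copied, so event handlers and
    other unwanted markup cannot pass through.
    """
    if elem.tag in ALLOWED:
        tag, css_class = ALLOWED[elem.tag]
        out = ET.Element(tag, {'class': css_class})
    else:
        out = ET.Element('span')
    out.text = elem.text
    out.tail = elem.tail
    for child in elem:
        out.append(tei_to_html(child))
    return out
```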
The pyoracc parser is strict, but what's in the library hasn't been carefully validated, so it rejects many entries.
I did a lot of cleanup work on the library, but it wasn't possible to get it to handle the whole CDLI corpus within the term. Over the long term, syntax errors in the ATF in the database should be corrected, ingest should do more validation to reduce new errors, and pyoracc should be extended to support common features of the CDLI corpus.
For future work, I'd also want to revisit my ad-hoc, line-based parser and see whether it can make more documents available. ATF is a simple format, and a permissive parser might work better for an application like this where we're just trying to present the corpus as it is.
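A minimal sketch of that approach: classify each line by its ATF prefix and keep whatever can be recognized instead of rejecting the whole document. The categories and the line-number pattern are simplifications of the real format, not a drop-in replacement for pyoracc.

```python
import re
from collections import defaultdict


def parse_atf(text):
    """Permissive, line-oriented ATF reader.

    Rather than aborting on the first syntax error, classify every
    line by its prefix and keep what we can recognize:
      &P000001 = ...   object header
      @obverse         structure (surface/column)
      1. ...           transliteration line
      #tr.en: ...      translation lines
      # ...            protocols and comments
      $ ...            state descriptions (breaks, blank space)
    Anything else is kept under 'unknown' instead of failing the parse.
    """
    doc = defaultdict(list)
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith('&'):
            doc['object'].append(line[1:].strip())
        elif line.startswith('@'):
            doc['structure'].append(line[1:].strip())
        elif line.startswith('#tr.'):
            doc['translation'].append(line)
        elif line.startswith('#'):
            doc['protocol'].append(line[1:].strip())
        elif line.startswith('$'):
            doc['state'].append(line[1:].strip())
        elif re.match(r"^\d+[a-z]?'*\.\s", line):
            doc['text'].append(line)
        else:
            doc['unknown'].append(line)
    return doc
```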
An Arabic interface translation is something we'd like to do for the reader. Scaife-viewer uses Django for this; we could try one of the Vue.js internationalization packages on the readhomer re-write.
Returning the whole CDLI corpus in a single query is too much data, and Capitains is also quite slow indexing a large corpus. I opened an issue with scaife-viewer to figure out a shared way to address this. DTS (JSON-LD) or ATLAS (GraphQL) are options.