Allow running with incomplete descriptions #58

Merged
merged 35 commits into from
Jan 10, 2022
Changes from 13 commits
61c35bd
fix MODS name without roles, ht@kba #51
bertsky Dec 3, 2021
499c3cc
fallback to empty publicationStmt/date and encodingDesc if metsHdr is…
bertsky Dec 3, 2021
8984b1b
get_text_in_line: append HYP content if available
bertsky Dec 3, 2021
7b136c8
log to stderr instead of stdout (to prevent mixing with TEI)
bertsky Dec 3, 2021
6545b16
improve makefile
bertsky Dec 3, 2021
711025a
improve CI
bertsky Dec 3, 2021
605dd89
mets.fromfile: allow missing logical structmap
bertsky Dec 5, 2021
3bfa7c2
mets.fromfile: allow missing mods originInfo
bertsky Dec 5, 2021
559e4c1
mets.fromfile: allow missing mods physicalDescription
bertsky Dec 5, 2021
1a7fe59
mets.fromfile: allow missing mets amdSec provenance dv
bertsky Dec 5, 2021
af1740e
mets.fromfile: simplify physical struct map, allow missing @ORDER
bertsky Dec 5, 2021
18a2dde
mets.fromfile: allow missing struct link
bertsky Dec 5, 2021
dbcc1fe
teil.fill_from_mets: allow empty logical struct map and struct link
bertsky Dec 5, 2021
61c4624
METS to TEI structure: comment urging for more+better mappings
bertsky Dec 5, 2021
15022f5
rename changelog
bertsky Dec 6, 2021
553e0fd
improve+update changelog
bertsky Dec 6, 2021
27dffe8
differentiate image number and page number
bertsky Dec 6, 2021
c39b6c7
allow passing image fileGrp other than DEFAULT
bertsky Dec 6, 2021
71fd269
add params for image fileGrp and output file, more logging
bertsky Dec 6, 2021
5c20f90
update changelog
bertsky Dec 6, 2021
ad261ff
generalize passing URN and VD ID to all identifiers
bertsky Dec 12, 2021
93fb684
improve level, title and idno metadata…
bertsky Dec 13, 2021
9a5f486
fall back to biblFull title level u
bertsky Dec 13, 2021
55353e5
keep going if there is no author and div type
bertsky Dec 14, 2021
0bf8bd3
fix tei:collection
bertsky Dec 20, 2021
7962b8c
fix tei:repository (from list-valued mods:physicalLocation), add tei:…
bertsky Dec 20, 2021
073f2b1
fix 7962b8c5
bertsky Dec 20, 2021
8d2fc41
add tei:notesStmt/tei:note from mods:note
bertsky Dec 20, 2021
06f1ccf
fix tei:editionStmt (does not belong under titleStmt)
bertsky Dec 20, 2021
c49c2a4
add tei:keywords | tei:classCode under tei:textClass (for mods:subjec…
bertsky Dec 20, 2021
27127fe
chdir to METS dir if not URL
bertsky Dec 20, 2021
8ac0747
fix mods:location (only once, but multiple contents)
bertsky Dec 20, 2021
20546af
fix regression in 27127febd
bertsky Dec 20, 2021
f33a4ca
drop Python 3.5
bertsky Jan 6, 2022
8204bfc
Revert regression fix in README.md
wrznr Jan 6, 2022
45 changes: 40 additions & 5 deletions .circleci/config.yml
@@ -1,19 +1,54 @@
# Python CircleCI 2.1 configuration file
# for mets-mods2tei
#
# Check https://circleci.com/docs/2.1/language-python/ for more details
# Check https://circleci.com/docs/2.0/language-python/ for more details
#
version: 2.1
orbs:
codecov: codecov/[email protected]
jobs:
build:
test:
parameters:
version:
type: string
docker:
- image: python:3.6
- image: circleci/python:<< parameters.version >>
working_directory: ~/repo
steps:
- checkout
- run: pip install -r requirements-test.txt
- run: pip install .
- run: make deps deps-test
- run: make install
- run: make test
- run: make coverage
- codecov/upload
pypi:
docker:
- image: circleci/python:3.6
working_directory: ~/repo
steps:
- checkout
- setup_remote_docker
- run: make install
- run: python setup.py sdist
- run: |
pip install cibuildwheel
cibuildwheel --output-dir dist
- store_artifacts:
path: dist/
destination: artifacts
# later: upload to PyPI...

workflows:
version: 2
test-all:
jobs:
- test:
matrix:
parameters:
version: [3.5.10, 3.6.15, 3.7.12, 3.8.12, 3.9.9]
deploy:
jobs:
- pypi:
filters:
branches:
only: master
18 changes: 16 additions & 2 deletions Makefile
@@ -1,26 +1,40 @@
# Python interpreter. Default: '$(PYTHON)'
PYTHON = python
PYTHON ?= python
PIP ?= pip

# BEGIN-EVAL makefile-parser --make-help Makefile

help:
@echo ""
@echo " Targets"
@echo ""
@echo " install Install this package"
@echo " deps Install dependencies only"
@echo " deps-test Install dependencies for testing only"
@echo " test Run all unit tests"
@echo " coverage Run coverage tests"
@echo ""
@echo " Variables"
@echo ""
@echo " PYTHON Python interpreter. Default: '$(PYTHON)'"
@echo " PIP Python packager. Default: '$(PIP)'"

# END-EVAL

#
# Tests
#

.PHONY: test coverage
.PHONY: install test coverage deps deps-test

install:
$(PIP) install .

deps:
$(PIP) install -r requirements.txt

deps-test:
$(PIP) install -r requirements-test.txt

# Run all unit tests
test:
2 changes: 1 addition & 1 deletion README.md
@@ -118,5 +118,5 @@ including the extracted information from the MODS part of the METS.

Example:

mm2tei "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-453779263"
mm2tei "https://digital.slub-dresden.de/oai/?verb=GetRecord&metadataPrefix=mets&identifier=oai:de:slub-dresden:db:id-453779263" > tei.xml
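
Note on the redirect: commit 7b136c8 moves logging to stderr, so `> tei.xml` should capture only the TEI. The following is a minimal sketch of that separation, not the project's code; the `convert` function and the logger name are hypothetical stand-ins.

```python
import contextlib
import io
import logging
import sys


def convert(source):
    """Hypothetical stand-in for the converter: logs progress, returns TEI."""
    logger = logging.getLogger("mm2tei-sketch")  # assumed name, not the project's
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    # Bind the handler to stderr so diagnostics never end up in stdout.
    logger.addHandler(logging.StreamHandler(sys.stderr))
    logger.info("converting %s", source)
    return "<TEI/>"


# Capture both streams to show that they stay separate.
out, err = io.StringIO(), io.StringIO()
with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
    print(convert("mets.xml"))
```

Here `out` holds only the serialized document and `err` holds only the log line, which is what makes the shell redirect safe.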

6 changes: 5 additions & 1 deletion mets_mods2tei/api/alto.py
@@ -92,7 +92,11 @@ def get_text_in_line(self, line):
Returns the ALTO-encoded text .
:param Element line: The line to extract the text from.
"""
return " ".join(element.get("CONTENT") for element in line.xpath("./alto:String", namespaces=ns))
text = " ".join(element.get("CONTENT") for element in line.xpath("./alto:String", namespaces=ns))
hyp = line.find("alto:HYP", namespaces=ns)
if hyp is not None:
text += hyp.get("CONTENT")
return text

def __compute_fuzzy_distance(self, text1, text2):
"""
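Note on the HYP change: the new `get_text_in_line` appends the content of a trailing `alto:HYP` element to the space-joined `alto:String` contents. A minimal sketch of that behaviour, using the standard library's `xml.etree.ElementTree` in place of lxml; the namespace URI and the sample line are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed ALTO namespace; the project may target a different ALTO version.
ns = {"alto": "http://www.loc.gov/standards/alto/ns-v2#"}

SAMPLE = """<TextLine xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <String CONTENT="eine"/><SP/><String CONTENT="Tren"/><HYP CONTENT="-"/>
</TextLine>"""


def get_text_in_line(line):
    """Join the String contents of a line and append the hyphen, if any."""
    text = " ".join(s.get("CONTENT") for s in line.findall("./alto:String", ns))
    hyp = line.find("alto:HYP", ns)
    # Compare against None: a childless element is falsy in ElementTree.
    if hyp is not None:
        # Default to "" in case CONTENT is absent (defensive, not in the PR).
        text += hyp.get("CONTENT", "")
    return text


result = get_text_in_line(ET.fromstring(SAMPLE))
```

The `is not None` check matters because an element without children evaluates as false, so `if hyp:` would silently skip every hyphen.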
106 changes: 72 additions & 34 deletions mets_mods2tei/api/mets.py
@@ -52,6 +52,7 @@ def __init__(self):
self.tree = None
self.mets = None
self.mods = None
self.page_map = {}
self.order_map = {}
self.img_map = {}
self.alto_map = {}
@@ -123,10 +124,9 @@ def __spur(self):

#
# main title and manuscript type
struct_map_logical = list(filter(lambda x: x.get_TYPE() == "LOGICAL", self.mets.get_structMap()))[0]
title = struct_map_logical.get_div()
self.title = title.get_LABEL()
self.type = title.get_TYPE()
div = self.get_div_structure()
self.title = div.get_LABEL() if div else ""
self.type = div.get_TYPE() if div else ""

#
# sub titles
@@ -145,7 +145,7 @@ def __spur(self):
person[name_part.get_type()] = name_part.get_valueOf_()

# either author or editor
roles = name.get_role()[0].get_roleTerm()
roles = name.get_role()[0].get_roleTerm() if name.get_role() else []
# TODO: handle the complete set of allowed roles
for role in roles:
if role.get_valueOf_() == "edt":
@@ -155,29 +155,34 @@ def __spur(self):

#
# orgin info
origin_info = self.mods.get_originInfo()[0]
origin_info = self.mods.get_originInfo()

# publication place
self.places = []
for place in origin_info.get_place():
place_ext = {}
for place_term in place.get_placeTerm():
place_ext[place_term.get_type()] = place_term.get_valueOf_()
self.places.append(place_ext)
if origin_info:
for place in origin_info[0].get_place():
place_ext = {}
for place_term in place.get_placeTerm():
place_ext[place_term.get_type()] = place_term.get_valueOf_()
self.places.append(place_ext)

# publication dates
self.dates = {}
for date_issued in origin_info.get_dateIssued():
date_type = date_issued.get_point() if date_issued.get_point() != None else "unspecified"
self.dates[date_type] = date_issued.get_valueOf_()
if origin_info:
for date_issued in origin_info[0].get_dateIssued():
date_type = date_issued.get_point() if date_issued.get_point() != None else "unspecified"
self.dates[date_type] = date_issued.get_valueOf_()

# publishers
self.publishers = []
for publisher in origin_info.get_publisher():
self.publishers.append(publisher.get_valueOf_())
if origin_info:
for publisher in origin_info[0].get_publisher():
self.publishers.append(publisher.get_valueOf_())

# edition of the manuscript
self.edition = origin_info.get_edition()[0].get_valueOf_() if origin_info.get_edition() else ""
self.edition = ""
if origin_info and origin_info[0].get_edition():
self.edition = origin_info[0].get_edition()[0].get_valueOf_()

#
# languages and scripts
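
Note on the originInfo hunk: the change stops indexing `[0]` unconditionally and instead checks the list first, so a METS without `mods:originInfo` yields empty defaults. A minimal sketch of that guard, assuming a hypothetical `OriginInfo` dataclass in place of the generated MODS bindings:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class OriginInfo:
    """Hypothetical stand-in for the generated MODS originInfo binding."""
    publishers: List[str] = field(default_factory=list)
    editions: List[str] = field(default_factory=list)


def read_origin(origin_infos: List[OriginInfo]) -> Tuple[List[str], str]:
    """Collect publishers and edition, tolerating an empty originInfo list."""
    first: Optional[OriginInfo] = origin_infos[0] if origin_infos else None
    publishers = list(first.publishers) if first is not None else []
    edition = ""
    if first is not None and first.editions:
        edition = first.editions[0]
    return publishers, edition
```

For example, `read_origin([])` gives empty defaults instead of raising `IndexError`.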
@@ -201,28 +206,35 @@ def __spur(self):

#
# physical description
physical_description = self.mods.get_physicalDescription()[0]
physical_description = self.mods.get_physicalDescription()

# digital origin
self.digital_origin = physical_description.get_digitalOrigin()[0] if physical_description.get_digitalOrigin() else ""
self.digital_origin = ""
if physical_description and physical_description[0].get_digitalOrigin():
self.digital_origin = physical_description[0].get_digitalOrigin()[0]

# extent
self.extents = []
for extent in physical_description.get_extent():
self.extents.append(extent.get_valueOf_())
if physical_description:
for extent in physical_description[0].get_extent():
self.extents.append(extent.get_valueOf_())

#
# dv FIXME: replace with generated code as soon as schema is available
dv = etree.fromstring(self.mets.get_amdSec()[0].get_rightsMD()[0].get_mdWrap().get_xmlData().get_anytypeobjs_()[0])
amdsec = self.mets.get_amdSec()
if amdsec and amdsec[0].get_rightsMD():
dv = etree.fromstring(amdsec[0].get_rightsMD()[0].get_mdWrap().get_xmlData().get_anytypeobjs_()[0])
else:
dv = []

# owner of the digital edition
self.owner_digital = dv.xpath("//dv:owner", namespaces=ns)[0].text
self.owner_digital = dv.xpath("//dv:owner", namespaces=ns)[0].text if dv else ""

# availability/license
# common case
self.license = ""
self.license_url = ""
license_nodes = dv.xpath("//dv:license", namespaces=ns)
license_nodes = dv.xpath("//dv:license", namespaces=ns) if dv else []
if license_nodes != []:
self.license = license_nodes[0].text
self.license_url = ""
@@ -237,12 +249,26 @@ def __spur(self):
#
# metsHdr
header = self.mets.get_metsHdr()

# encoding date
self.encoding_date = header.get_CREATEDATE().isoformat()

# encoding description
self.encoding_desc = list(filter(lambda x: x.get_OTHERTYPE() == "SOFTWARE", header.get_agent()))[0].get_name()
if header:
# encoding date
self.encoding_date = header.get_CREATEDATE()
# encoding description
self.encoding_desc = [agent.get_name()
for agent in header.get_agent()
if agent.get_TYPE() == "OTHER" and agent.get_OTHERTYPE() == "SOFTWARE"]
else:
self.encoding_date = None
self.encoding_desc = None

if self.encoding_date:
self.encoding_date = self.encoding_date.isoformat()
else:
self.logger.error("Found no @CREATEDATE for publicationStmt/date")
if self.encoding_desc:
self.encoding_desc = self.encoding_desc[0] # or -1?
# what about agent.get_OTHERROLE() and agent.get_note()?
else:
self.logger.error("Found no mets:agent for encodingDesc")

#
# location of manuscript
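
Note on the metsHdr hunk: the encoding date and description now fall back to `None` when `mets:metsHdr` is missing, and only agents with `TYPE="OTHER"` and `OTHERTYPE="SOFTWARE"` are considered. A minimal sketch of the same selection on raw XML with the standard library; the function name `encoding_info` and the sample documents are assumptions. Unlike the PR, it returns `CREATEDATE` as the raw attribute string rather than a parsed datetime.

```python
import xml.etree.ElementTree as ET

ns = {"mets": "http://www.loc.gov/METS/"}

WITH_HDR = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:metsHdr CREATEDATE="2021-12-03T10:00:00">
    <mets:agent TYPE="OTHER" OTHERTYPE="SOFTWARE" ROLE="CREATOR">
      <mets:name>ExampleSoft 1.0</mets:name>
    </mets:agent>
    <mets:agent TYPE="ORGANIZATION" ROLE="CUSTODIAN">
      <mets:name>Example Library</mets:name>
    </mets:agent>
  </mets:metsHdr>
</mets:mets>"""

WITHOUT_HDR = '<mets:mets xmlns:mets="http://www.loc.gov/METS/"/>'


def encoding_info(root):
    """Return (create date, software agent name), or None for missing parts."""
    header = root.find("mets:metsHdr", ns)
    if header is None:
        return None, None
    names = [
        agent.findtext("mets:name", default="", namespaces=ns)
        for agent in header.findall("mets:agent", ns)
        if agent.get("TYPE") == "OTHER" and agent.get("OTHERTYPE") == "SOFTWARE"
    ]
    # Like the PR, take the first matching agent.
    return header.get("CREATEDATE"), (names[0] if names else None)


with_hdr = encoding_info(ET.fromstring(WITH_HDR))
without_hdr = encoding_info(ET.fromstring(WITHOUT_HDR))
```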
@@ -294,16 +320,19 @@ def __spur(self):
default_map[entry.get("ID")] = entry.find("./" + METS + "FLocat").get("%shref" % XLINK)

# struct map physical
for div in list(filter(lambda x: x.get_TYPE() == 'PHYSICAL', self.mets.get_structMap()))[0].get_div().get_div():
self.order_map[div.get_ID()] = div.get_ORDER()
for div in self.get_page_structure().get_div():
self.page_map[div.get_ID()] = div
if div.get_ORDER():
self.order_map[div.get_ID()] = div.get_ORDER()
for fptr in div.get_fptr():
if fptr.get_FILEID() in fulltext_map:
self.alto_map[div.get_ID()] = fulltext_map[fptr.get_FILEID()]
elif fptr.get_FILEID() in default_map:
self.img_map[div.get_ID()] = default_map[fptr.get_FILEID()]

# struct links
for sm_link in self.tree.xpath("//mets:structLink", namespaces=ns)[0].iterchildren():
structlinks = self.tree.xpath("//mets:structLink/*", namespaces=ns)
for sm_link in structlinks:
if sm_link.get("%sto" % XLINK) in self.alto_map:
if sm_link.get("%sfrom" % XLINK) not in self.struct_links:
self.struct_links[sm_link.get("%sfrom" % XLINK)] = []
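
Note on the struct link change: selecting `//mets:structLink/*` yields an empty list when there is no `mets:structLink`, whereas the old `[0].iterchildren()` raised `IndexError`. A minimal sketch with the standard library; `collect_struct_links`, the sample IDs, and the final `append` are assumptions, since the rest of the loop body is not visible in this hunk.

```python
import xml.etree.ElementTree as ET

ns = {"mets": "http://www.loc.gov/METS/"}
XLINK = "{http://www.w3.org/1999/xlink}"

WITH_LINKS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                           xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:structLink>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0001"/>
    <mets:smLink xlink:from="LOG_0001" xlink:to="PHYS_0002"/>
    <mets:smLink xlink:from="LOG_0002" xlink:to="PHYS_0099"/>
  </mets:structLink>
</mets:mets>"""

WITHOUT_LINKS = '<mets:mets xmlns:mets="http://www.loc.gov/METS/"/>'


def collect_struct_links(root, known_pages):
    """Map logical IDs to the physical IDs they link to, keeping known pages only."""
    struct_links = {}
    # An absent structLink simply produces no matches, so the loop is skipped.
    for sm_link in root.findall(".//mets:structLink/*", ns):
        target = sm_link.get(XLINK + "to")
        if target in known_pages:
            struct_links.setdefault(sm_link.get(XLINK + "from"), []).append(target)
    return struct_links


pages = {"PHYS_0001", "PHYS_0002"}
linked = collect_struct_links(ET.fromstring(WITH_LINKS), pages)
unlinked = collect_struct_links(ET.fromstring(WITHOUT_LINKS), pages)
```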
@@ -447,14 +476,23 @@ def get_languages(self):
"""
return self.languages

def get_page_structure(self):
"""
Return the div structure from the physical struct map
"""
for struct_map in self.mets.get_structMap():
if struct_map.get_TYPE() == "PHYSICAL":
return struct_map.get_div()
return None

def get_div_structure(self):
"""
Return the div structure from the logical struct map
"""
for struct_map in self.mets.get_structMap():
if struct_map.get_TYPE() == "LOGICAL":
return struct_map.get_div()
return []
return None

def get_struct_links(self, log_id):
"""
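Note on the struct map getters: both now return `None` when no struct map of the requested type exists, and the title lookup guards against that. The physical loop, as shown, still calls `.get_div()` directly on the result of `get_page_structure()`, so a METS without a physical struct map would presumably still fail there. A minimal sketch of the lookup and guard on raw XML; `find_struct_div` and the sample document are assumptions, not project code.

```python
import xml.etree.ElementTree as ET

ns = {"mets": "http://www.loc.gov/METS/"}

PHYSICAL_ONLY = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="PHYSICAL">
    <mets:div ID="PHYS_0000" TYPE="physSequence">
      <mets:div ID="PHYS_0001" TYPE="page"/>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def find_struct_div(root, map_type):
    """Return the top-level div of the struct map with this TYPE, or None."""
    for struct_map in root.findall("mets:structMap", ns):
        if struct_map.get("TYPE") == map_type:
            return struct_map.find("mets:div", ns)
    return None


root = ET.fromstring(PHYSICAL_ONLY)
logical = find_struct_div(root, "LOGICAL")
physical = find_struct_div(root, "PHYSICAL")

# Guard every use: a missing map falls back to empty values and no pages.
title = logical.get("LABEL", "") if logical is not None else ""
page_ids = [div.get("ID") for div in physical] if physical is not None else []
```

With element objects the check has to be `is not None`, because an element without children is falsy.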