Allow running with incomplete descriptions #58

Merged
merged 35 commits into from
Jan 10, 2022
Commits
61c35bd
fix MODS name without roles, ht@kba #51
bertsky Dec 3, 2021
499c3cc
fallback to empty publicationStmt/date and encodingDesc if metsHdr is…
bertsky Dec 3, 2021
8984b1b
get_text_in_line: append HYP content if available
bertsky Dec 3, 2021
7b136c8
log to stderr instead of stdout (to prevent mixing with TEI)
bertsky Dec 3, 2021
6545b16
improve makefile
bertsky Dec 3, 2021
711025a
improve CI
bertsky Dec 3, 2021
605dd89
mets.fromfile: allow missing logical structmap
bertsky Dec 5, 2021
3bfa7c2
mets.fromfile: allow missing mods originInfo
bertsky Dec 5, 2021
559e4c1
mets.fromfile: allow missing mods physicalDescription
bertsky Dec 5, 2021
1a7fe59
mets.fromfile: allow missing mets amdSec provenance dv
bertsky Dec 5, 2021
af1740e
mets.fromfile: simplify physical struct map, allow missing @ORDER
bertsky Dec 5, 2021
18a2dde
mets.fromfile: allow missing struct link
bertsky Dec 5, 2021
dbcc1fe
teil.fill_from_mets: allow empty logical struct map and struct link
bertsky Dec 5, 2021
61c4624
METS to TEI structure: comment urging for more+better mappings
bertsky Dec 5, 2021
15022f5
rename changelog
bertsky Dec 6, 2021
553e0fd
improve+update changelog
bertsky Dec 6, 2021
27dffe8
differentiate image number and page number
bertsky Dec 6, 2021
c39b6c7
allow passing image fileGrp other than DEFAULT
bertsky Dec 6, 2021
71fd269
add params for image fileGrp and output file, more logging
bertsky Dec 6, 2021
5c20f90
update changelog
bertsky Dec 6, 2021
ad261ff
generalize passing URN and VD ID to all identifiers
bertsky Dec 12, 2021
93fb684
improve level, title and idno metadata…
bertsky Dec 13, 2021
9a5f486
fall back to biblFull title level u
bertsky Dec 13, 2021
55353e5
keep going if there is no author and div type
bertsky Dec 14, 2021
0bf8bd3
fix tei:collection
bertsky Dec 20, 2021
7962b8c
fix tei:repository (from list-valued mods:physicalLocation), add tei:…
bertsky Dec 20, 2021
073f2b1
fix 7962b8c5
bertsky Dec 20, 2021
8d2fc41
add tei:notesStmt/tei:note from mods:note
bertsky Dec 20, 2021
06f1ccf
fix tei:editionStmt (does not belong under titleStmt)
bertsky Dec 20, 2021
c49c2a4
add tei:keywords | tei:classCode under tei:textClass (for mods:subjec…
bertsky Dec 20, 2021
27127fe
chdir to METS dir if not URL
bertsky Dec 20, 2021
8ac0747
fix mods:location (only once, but multiple contents)
bertsky Dec 20, 2021
20546af
fix regression in 27127febd
bertsky Dec 20, 2021
f33a4ca
drop Python 3.5
bertsky Jan 6, 2022
8204bfc
Revert regression fix in README.md
wrznr Jan 6, 2022
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -23,6 +23,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Add ALTO `HYP` text content if available, #52
- Allow empty logical structMap and structLink, fallback to physical, or empty, #57
- Allow partial dmdSec (MODS) or amdSec, fallback to empty, #46, #51
- Pass all `mods:identifier`s to `msIdentifier/idno` (not just VD and URN)
- Parse full `titleInfo` (main/sub/part/volume), and re-use in `biblFull`
- Prefer `titleInfo/title` over `div/@LABEL` if available
- Map top logical `div/@TYPE` into allowed `biblFull/title/@level` only
- Map top logical `div/@TYPE` into appropriate `bibl/@type` if possible

## [0.1.1] - 2020-05-11
### Added
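
The two `div/@TYPE` changelog entries above can be pictured as a single lookup table. The following is only a minimal sketch, not the project's code: the names `LEVEL_AND_TYPE` and `classify_div_type` are hypothetical, and only a handful of the type names from the `mets.py` diff are included.

```python
# Hypothetical lookup table: METS logical div/@TYPE -> (title/@level, bibl/@type).
# The value pairs follow the ruleset in the mets.py diff; the table itself is illustrative.
LEVEL_AND_TYPE = {
    "manuscript": ("u", "M"),
    "contained_work": ("a", "DM"),
    "article": ("a", "JA"),
    "periodical": ("j", "J"),
    "newspaper": ("j", "J"),
    "monograph": ("m", "M"),
    "multivolume_work": ("m", "MM"),
    "volume": ("m", "MM"),
}


def classify_div_type(div_type):
    """Return (level, bibl_type), or (None, None) for unmapped or missing types."""
    return LEVEL_AND_TYPE.get(div_type, (None, None))
```

With such a table, an unmapped or missing type yields `(None, None)`, which mirrors the `None` defaults that the diff assigns to `biblevel` and `bibtype`.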
85 changes: 78 additions & 7 deletions mets_mods2tei/api/mets.py
@@ -63,6 +63,8 @@ def __init__(self):

self.title = None
self.sub_titles = None
self.part_titles = None
self.volume_titles = None
self.authors = None
self.editors = None
self.places = None
@@ -124,17 +126,74 @@ def __spur(self):
"""

#
# main title and manuscript type
# get publication level
# get main and sub title from top-level logical div as a fallback
self.title = ""
self.biblevel = None
self.bibtype = None
div = self.get_div_structure()
self.title = div.get_LABEL() if div else ""
self.type = div.get_TYPE() if div else ""
if div:
self.title = div.get_LABEL() # overridden by any titleInfo
div_type = div.get_TYPE()
# differentiate between analytic and closed, periodic and singular, dependent and independent types
# (for use in bibl/@type and biblFull//title/@level):
# FIXME: verify this ruleset is correct/standardized (but criteria do not look orthogonal, e.g. "issue" and "proceeding")
if div_type in ["bachelor_thesis", "diploma_thesis", "magister_thesis", "master_thesis", "doctoral_thesis", "habilitation_thesis", "file", "register", "research_paper", "report", "atlas", "album", "letter", "document", "leaflet", "manuscript", "poster", "plan", "study", "judgement", "preprint", "dossier", "paper"]:
self.biblevel = 'u' # unpublished
self.bibtype = 'M' # monograph
elif div_type in ["contained_work", "folder", ]:
self.biblevel = 'a'
self.bibtype = 'DM' # dependent part of monograph
# ? or 'DS' # dependent part of series
elif div_type in ["article"]:
self.biblevel = 'a' # analytic
self.bibtype = 'JA' # journal article
elif div_type in ["periodical", "newspaper"]:
self.biblevel = 'j' # journal
self.bibtype = 'J' # journal
elif div_type in ["lecture"]:
self.biblevel = 's' # series
self.bibtype = '' # ?
elif div_type in ["monograph", ]:
self.biblevel = 'm' # monograph
self.bibtype = 'M' # monograph
elif div_type in ["multivolume_work", "volume"]:
self.biblevel = 'm' # monograph
self.bibtype = 'MM' # monograph within multi-volume monograph
# ? or 'MS' # monograph within series
# ? or 'MMS' # monograph within multi-volume monograph series

#
# sub titles
self.sub_titles = []
for title_info in self.mods.get_titleInfo():
# titleInfo (main, sub, part/volume)
self.sub_titles = [] # subtitle (mods:titleInfo[mods:subTitle])
self.part_titles = dict() # part title of multipart subseries (mods:titleInfo[mods:partNumber|mods:partName])
self.volume_titles = dict() # volume title in multivolume monograph (mods:part[mods:detail])
title_infos = self.mods.get_titleInfo()
if len(title_infos):
def norm_title_first(titleInfo):
if not titleInfo.get_type() or titleInfo.get_type() == 'simple':
# prefer untyped entry ('simple' most likely is from generateDS)
return -1
if titleInfo.get_type() == 'uniform':
return 0
return 1
title_info = sorted(title_infos, key=norm_title_first)[0]
if title_info.get_title():
self.title = title_info.get_title()[0].get_valueOf_().strip()
for sub_title in title_info.get_subTitle():
self.sub_titles.append(sub_title.get_valueOf_().strip())
for part_number, part_name in zip(title_info.get_partNumber(), title_info.get_partName()):
self.part_titles[part_number.get_valueOf_().strip()] = part_name.get_valueOf_().strip()
part_infos = self.mods.get_part()
if len(part_infos):
part_info = part_infos[0]
order = str(part_info.get_order() or 0)
for detail in part_info.get_detail():
typ = detail.get_type()
val = ', '.join([title.get_valueOf_().strip()
for title in detail.get_number() + detail.get_caption() + detail.get_title()])
self.volume_titles[order, typ] = val

#
# authors and editors
self.authors = []
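
The `titleInfo` preference order in this hunk (untyped or `simple` first, then `uniform`, then anything else) can be tried in isolation. This is a sketch under assumptions: `TitleInfo` is a hypothetical stand-in for the generateDS objects, `pick_title` is not a function of the project, and `min` is used in place of `sorted(...)[0]` (both return the first entry with the lowest key).

```python
from collections import namedtuple

# Hypothetical stand-in for a generateDS titleInfo object.
TitleInfo = namedtuple("TitleInfo", ["type", "title"])


def title_rank(info):
    """Sort key: untyped or 'simple' entries first, then 'uniform', then the rest."""
    if not info.type or info.type == "simple":
        return -1
    if info.type == "uniform":
        return 0
    return 1


def pick_title(infos, fallback=""):
    """Return the preferred title, or the fallback (e.g. div/@LABEL) if none is given."""
    if not infos:
        return fallback
    best = min(infos, key=title_rank)
    return best.title.strip() if best.title else fallback
```

The `fallback` argument plays the role of the `div/@LABEL` value that the diff assigns first and then overrides with any `titleInfo` title.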
@@ -366,10 +425,22 @@ def get_main_title(self):

def get_sub_titles(self):
"""
Return the main title of the work.
Return the sub-titles of the work.
"""
return self.sub_titles

def get_part_titles(self):
"""
Return the part titles of the work.
"""
return self.part_titles

def get_volume_titles(self):
"""
Return the volume titles of the work.
"""
return self.volume_titles

def get_authors(self):
"""
Return the author of the work.
120 changes: 85 additions & 35 deletions mets_mods2tei/api/tei.py
@@ -38,6 +38,7 @@ def tostring(self):
"""
Serializes the TEI object as xml string.
"""
# needs lxml>=4.5: etree.indent(self.tree, space=" ")
return etree.tostring(self.tree, encoding="utf-8")

def fill_from_mets(self, mets, ocr=True):
@@ -50,13 +51,16 @@ def fill_from_mets(self, mets, ocr=True):

# main title
self.set_main_title(mets.get_main_title())
for sub in mets.get_sub_titles():
self.add_sub_title(sub)
for number, part in mets.get_part_titles().items():
self.add_part_title(number, part)
for (order, typ), volume in mets.get_volume_titles().items():
self.add_volume_title(order, typ, volume)
self.init_biblFull()

# publication level
self.set_publication_level(mets.type)

# sub titles
for sub_title in mets.get_sub_titles():
self.add_sub_title(sub_title)
self.set_publication_level(mets.biblevel)

# authors
for typ, author in mets.get_authors():
@@ -100,12 +104,14 @@ def fill_from_mets(self, mets, ocr=True):

# shelf locator
for shelf_locator in mets.get_shelf_locators():
self.add_ms_identifier("shelfmark", shelf_locator)
self.add_identifier("shelfmark", shelf_locator)

# identifiers
if mets.get_identifiers():
for type_, value in mets.get_identifiers().items():
self.add_ms_identifier(type_.upper(), value)
if type_ in ["vd16", "vd17", "vd18"]:
type_ = "VD"
self.add_identifier(type_.upper(), value)

# type description
if mets.get_scripts():
@@ -125,7 +131,7 @@ def fill_from_mets(self, mets, ocr=True):

#
# citation
self.compile_bibl()
self.compile_bibl(mets.bibtype)

#
# text part
@@ -156,13 +162,6 @@ def main_title(self):
"""
return self.tree.xpath('//tei:titleStmt/tei:title[@type="main"]', namespaces=ns)[0].text

@property
def publication_level(self):
"""
Return the level of publication ('monographic' vs. 'analytic')
"""
return self.tree.xpath('//tei:sourceDesc/tei:biblFull/tei:titleStmt/tei:title[@type="main"]', namespaces=ns)[0].get("level")

@property
def subtitles(self):
"""
@@ -182,6 +181,13 @@ def authors(self):
authors.append(", ".join(author.xpath('descendant-or-self::*/text()')))
return authors

@property
def publication_level(self):
"""
Return the level of publication ('monographic' vs. 'analytic')
"""
return self.tree.xpath('//tei:sourceDesc/tei:biblFull/tei:titleStmt/tei:title[@type="main"]', namespaces=ns)[0].get("level")

@property
def dates(self):
"""
@@ -330,26 +336,70 @@ def bibl(self):

def set_main_title(self, string):
"""
Set the main title of the title statements.
Set the main title of the tei:titleStmt.
"""
for main_title in self.tree.xpath('//tei:titleStmt/tei:title[@type="main"]', namespaces=ns):
main_title.text = string
titleStmt = self.tree.xpath('//tei:titleStmt', namespaces=ns)[0]
for node in titleStmt.xpath('tei:title[@type="main"]', namespaces=ns):
node.text = string

def set_publication_level(self, level):
def add_sub_title(self, string):
"""
Set the level of publication ('monographic' vs. 'analytic')
Add a sub-title of the tei:titleStmt.
"""
self.tree.xpath('//tei:sourceDesc/tei:biblFull/tei:titleStmt/tei:title[@type="main"]', namespaces=ns)[0].set("level", level)
titleStmt = self.tree.xpath('//tei:titleStmt', namespaces=ns)[0]
node = etree.Element("%stitle" % TEI)
node.set("type", "sub")
node.text = string
titleStmt.append(copy.deepcopy(node))

def add_sub_title(self, string):
def add_part_title(self, number, string):
"""
Add a sub title to the title statements.
Add a part title of the tei:titleStmt.
"""
sub_title = etree.Element("%stitle" % TEI)
sub_title.set("type", "sub")
sub_title.text = string
for title_stmt in self.tree.xpath('//tei:titleStmt', namespaces=ns):
title_stmt.append(copy.deepcopy(sub_title))
titleStmt = self.tree.xpath('//tei:titleStmt', namespaces=ns)[0]
node = etree.Element("%stitle" % TEI)
node.set("type", "part")
node.set("n", number)
node.text = string
titleStmt.append(copy.deepcopy(node))

def add_volume_title(self, number, typ, string):
"""
Add a volume title of the tei:titleStmt.
"""
titleStmt = self.tree.xpath('//tei:titleStmt', namespaces=ns)[0]
node = etree.Element("%stitle" % TEI)
node.set("type", typ)
node.set("n", number)
node.text = string
titleStmt.append(copy.deepcopy(node))

def init_biblFull(self):
"""
Set the main, sub, and part/volume titles of the tei:biblFull by copying from tei:titleStmt.
"""
titleStmt = self.tree.xpath('//tei:titleStmt', namespaces=ns)[0]
bibl = self.tree.xpath('//tei:sourceDesc/tei:biblFull', namespaces=ns)[0]
bibl.append(copy.deepcopy(titleStmt))

def set_publication_level(self, level):
"""
Set the level of publication:
- 'm': (monographic) the title applies to a monograph such as a book
or other item considered to be a distinct publication,
including single volumes of multi-volume works
- 'a': (analytic) the title applies to an analytic item, such as an article,
poem, or other work published as part of a larger item.
- 'j': (journal) the title applies to any serial or periodical publication
such as a journal, magazine, or newspaper
- 's': (series) the title applies to a series of otherwise distinct publications
such as a collection
- 'u': (unpublished) the title applies to any unpublished material
(including theses and dissertations unless published by a commercial press)
"""
assert level in ['m', 'a', 'j', 's', 'u']
for title in self.tree.xpath('//tei:sourceDesc/tei:biblFull/tei:titleStmt/tei:title', namespaces=ns):
title.set("level", level)
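
The interplay of `init_biblFull` and `set_publication_level` (copy the header `titleStmt` into `biblFull`, then stamp `@level` on the copied titles only) can be reproduced on a tiny document. This is a sketch under assumptions: it uses the standard library's `xml.etree.ElementTree` rather than lxml, the inline document is made up for illustration, and a `ValueError` replaces the `assert` of the diff.

```python
import copy
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ns = {"tei": TEI_NS}
ALLOWED_LEVELS = ("m", "a", "j", "s", "u")


def init_bibl_full(root):
    """Append a deep copy of the header's titleStmt to sourceDesc/biblFull."""
    title_stmt = root.find(".//tei:fileDesc/tei:titleStmt", ns)
    bibl_full = root.find(".//tei:sourceDesc/tei:biblFull", ns)
    bibl_full.append(copy.deepcopy(title_stmt))


def set_publication_level(root, level):
    """Set @level on every title inside biblFull/titleStmt."""
    if level not in ALLOWED_LEVELS:
        raise ValueError("unsupported title level: %r" % (level,))
    path = ".//tei:sourceDesc/tei:biblFull/tei:titleStmt/tei:title"
    for title in root.findall(path, ns):
        title.set("level", level)


# Made-up miniature TEI header, only for illustration.
doc = ET.fromstring(
    '<TEI xmlns="%s"><teiHeader><fileDesc>'
    '<titleStmt><title type="main">Example</title>'
    '<title type="sub">A sketch</title></titleStmt>'
    '<sourceDesc><biblFull/></sourceDesc>'
    '</fileDesc></teiHeader></TEI>' % TEI_NS
)
init_bibl_full(doc)
set_publication_level(doc, "m")
```

Because the copy is deep, the titles under the header's own `titleStmt` stay without `@level`. Raising an exception instead of asserting is a design choice of this sketch: `assert` statements are skipped when Python runs with `-O`.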

def add_author(self, person, typ):
"""
@@ -492,13 +542,13 @@ def add_repository(self, repository):
repository_node = etree.SubElement(ms_ident, "%srepository" % TEI)
repository_node.text = repository

def add_ms_identifier(self, type_, value):
def add_identifier(self, type_, value):
"""
Add the URN, PURL, VD ID, shelfmark etc. of the digital edition
"""
ms_ident_idno = self.tree.xpath('//tei:msDesc/tei:msIdentifier/tei:idno', namespaces=ns)[0]
ms_ident = self.tree.xpath('//tei:msDesc/tei:msIdentifier/tei:idno', namespaces=ns)[0]
# FIXME: URN, DTAID, ... should go to /tei:fileDesc/tei:publicationStmt/tei:idno instead
idno = etree.SubElement(ms_ident_idno, "%sidno" % TEI)
idno = etree.SubElement(ms_ident, "%sidno" % TEI)
idno.set("type", type_)
idno.text = value

@@ -542,16 +592,16 @@ def add_collection(self, collection):
creation = etree.SubElement(profile_desc, "%screation" % TEI)
creation.text = collection

def compile_bibl(self):
def compile_bibl(self, type_):
"""
Compile the content of the short citation element 'bibl' based on the current state
"""
if self.publication_level:
self.bibl.set("type", self.publication_level)
if type_:
self.bibl.set("type", type_)
bibl_text = ""
if self.authors:
bibl_text += "; ".join(self.authors) + ": "
elif self.publication_level == "monograph":
elif type_.startswith("M"):
bibl_text = "[N. N.], "
bibl_text += self.main_title + "."
if self.places:
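
Since `bibtype` starts out as `None` in the `mets.py` diff, a standalone version of the citation prefix logic may want to guard the `startswith` call. The following is a sketch, not the project's code: `bibl_prefix` is a hypothetical helper that only covers the author and `[N. N.]` part of the citation.

```python
def bibl_prefix(authors, type_):
    """Build the leading part of a short citation (hypothetical helper)."""
    if authors:
        return "; ".join(authors) + ": "
    # type_ may be None when no div/@TYPE could be mapped, so check it first.
    if type_ and type_.startswith("M"):
        return "[N. N.], "
    return ""
```

Calling `bibl_prefix([], None)` returns an empty string, whereas an unguarded `None.startswith("M")` would raise an `AttributeError`.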
5 changes: 2 additions & 3 deletions mets_mods2tei/data/tei_skeleton.xml
@@ -8,13 +8,12 @@
<title type="main">[Haupttitel]</title>
</titleStmt>
<publicationStmt>
<idno>
</idno>
</publicationStmt>
<sourceDesc>
<bibl type="M">[Zitiertitel]</bibl>
<biblFull>
<titleStmt>
<title level="m" type="main">[Haupttitel einer Monographie]</title>
</titleStmt>
<publicationStmt>
</publicationStmt>
</biblFull>
6 changes: 4 additions & 2 deletions tests/test_mets.py
@@ -136,7 +136,9 @@ def test_data_assignment(subtests, datadir):
assert(mets.get_shelf_locators() == ['Hist.Amer.1497'])

with subtests.test("Check URN"):
assert(mets.get_urn() == 'urn:nbn:de:bsz:14-db-id4971666239')
assert "urn" in mets.get_identifiers()
assert mets.get_identifiers()["urn"] == 'urn:nbn:de:bsz:14-db-id4971666239'

with subtests.test("Check VD ID"):
assert(mets.get_vd_id() == 'VD18 11413883')
assert "vd18" in mets.get_identifiers()
assert mets.get_identifiers()["vd18"] == 'VD18 11413883'
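
How the identifier dictionary checked in these tests ends up as `idno/@type` values can be sketched as follows, assuming the normalization shown in the `tei.py` diff (`vd16`, `vd17` and `vd18` collapse to `VD`, everything else is upper-cased). The helper names are hypothetical; the sample values are the ones asserted in the tests.

```python
def normalize_identifier_type(type_):
    """Collapse vd16/vd17/vd18 into 'VD'; upper-case every other type."""
    if type_ in ("vd16", "vd17", "vd18"):
        return "VD"
    return type_.upper()


def idno_entries(identifiers):
    """Turn a {type: value} mapping into (idno type, value) pairs."""
    return [(normalize_identifier_type(t), v) for t, v in identifiers.items()]


identifiers = {
    "urn": "urn:nbn:de:bsz:14-db-id4971666239",
    "vd18": "VD18 11413883",
}
entries = idno_entries(identifiers)
```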