From bb3fdb42900f936bb9b39479590d21846fe8972c Mon Sep 17 00:00:00 2001 From: Jeremy Prevost Date: Fri, 23 Feb 2024 09:50:23 -0500 Subject: [PATCH 1/6] Adds ADR for remapping PublicationInformation also adds config to allow adr cli to create an adr in the same directory other adrs are found in this repository https://mitlibraries.atlassian.net/browse/GDT-199 --- .adr-dir | 1 + ...-support-aggregations-on-publisher-name.md | 96 +++++++++++++++++++ 2 files changed, 97 insertions(+) create mode 100644 .adr-dir create mode 100644 docs/adrs/0003-support-aggregations-on-publisher-name.md diff --git a/.adr-dir b/.adr-dir new file mode 100644 index 0000000..eb8b01e --- /dev/null +++ b/.adr-dir @@ -0,0 +1 @@ +docs/adrs diff --git a/docs/adrs/0003-support-aggregations-on-publisher-name.md b/docs/adrs/0003-support-aggregations-on-publisher-name.md new file mode 100644 index 0000000..fcdebc0 --- /dev/null +++ b/docs/adrs/0003-support-aggregations-on-publisher-name.md @@ -0,0 +1,96 @@ +# 3. Support Aggregations on Publisher Name + +Date: 2024-02-23 + +## Status + +Accepted + +## Context + +Our current data model maps data about Publisher Name, Publication Year, and Publication Location into a single multivalued field (an Array of Strings). Mapping these different concepts into a single field makes the data less meaningful and less useful than it could be with a change to our data model. + +Mapping different types of data into a single field makes aggregation confusing, as we'd see values like "1999" alongside "Massachusetts Institute of Technology", which is not usable. 
+ +### Current mappings + +| Source | Mappings | TIMDEX Field | Examples | +| --- | --- | --- | --- | +| Alma | [MARC 260$abcdef](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInformation | Charlottesville : University of Virginia Press, [2015], ©2015 | +| | [MARC 264$abc](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInformation | DeKalb, Illinois : NIU Press, 2014, ©2014 | +| | | PublicationInformation | null | +| | | PublicationInformation | A.O.K. 3 | +| | | PublicationInformation | American Antiquarian Society Historical Periodicals | +| DSpace@MIT (METS) | XML select array of Strings matching element `publisher` | PublicationInformation | Massachusetts Institute of Technology | +| | | PublicationInformation | Elsevier BV | +| | | PublicationInformation | null | +| | | PublicationInformation | ACM\|Creativity and Cognition | +| | | PublicationInformation | Cambridge, Mass. : Alfred P. Sloan School of Management, Massachusetts Institute of Technology | +| | | PublicationInformation | Elsevier | +| | | PublicationInformation | Wiley | +| ArchivesSpace | XML select array of Strings matching element `dc:publisher` | PublicationInformation | Massachusetts Institute of Technology. Libraries. Department of Distinctive Collections (note: this appears to be effectively a static string for all of our ASpace records) | +| MIT GIS | | PublicationInformation | | +| OGM GIS | | PublicationInformation | | + +### Proposed mappings (Option 1) + +Create a new object `PublicationInfo` and map data into it. Most sources just use `PublicationInfo.name` but Alma uses additional fields. + +```json +{ + publicationInfo: [{ + name: String + location: String + date: String (ideally dates would be dates... but that may be out of scope for now) + }] +} +``` + +Notes: we should consider whether this really needs to be an array. 
260 and 264 are repeatable fields in MARC, but whether they are regularly used in that way or if we could instead pick the "first" one that shows up and get a simpler model without losing any real world functionality would be worth investigating. For now it is modeled as an array of `PublicationInfo` assuming that is appropriate. + +GraphQL would collapse the new object and continue to serve it as the deprecated field `PublicationInformation` until we are confident it is no longer being used. + +| Source | Mappings | TIMDEX Field | Notes | +| --- | --- | --- | --- | +| Alma | [MARC 260$a](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInfo.Location | | +| Alma | [MARC 264$a](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInfo.Location | | +| Alma | [MARC 260$b](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInfo.Name | | +| Alma | [MARC 264$b](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInfo.Name | | +| Alma | [MARC 260$c](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInfo.Date | | +| Alma | [MARC 264$c](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInfo.Date | | +| Alma | [MARC 260$d](https://www.loc.gov/marc/bibliographic/bd260.html) | Invalid field. Don't map | | +| Alma | [MARC 260$e](https://www.loc.gov/marc/bibliographic/bd260.html) | Don't map | | +| Alma | [MARC 260$f](https://www.loc.gov/marc/bibliographic/bd260.html) | Don't map | | +| DSpace | Keep logic | PublicationInfo.Name | note: consider future work to normalize in source or during transform | +| ASpace | Keep logic | PublicationInfo.Name | | +| GIS MIT | | PublicationInfo.Name | | +| GIS OGM | | PublicationInfo.Name | | + +### Proposed mappings (Option 2) + +Move some of the Alma data to other existing objects, and create a new top level `Publisher` array of string. + +All sources except Alma remap to use `PublicationInfo` instead of `PublicationInformation` with no other changes at this time. 
+ +This decouples our data model more from MARC where we seem to have modeled a set of data to match 260/264 in a way that doesn't seem to map to our other sources. Additionally, we already have other places for the extra info from 260/264 to map that may be more useful than as a new Object or embedded into the original String. + +GraphQL would continue to serve `PublicationInformation`, but replace it with the data in the new field `Publisher`. The deprecation notice would explain that it is not a 1:1 mapping of the old field and what aspects have been moved to additional fields. + + +| Source | Mappings | TIMDEX Field | Notes | +| --- | --- | --- | --- | +| Alma | [MARC 260$a](https://www.loc.gov/marc/bibliographic/bd260.html) | Location.kind = Publisher Location.value=260$a | | +| Alma | [MARC 264$a](https://www.loc.gov/marc/bibliographic/bd264.html) | Location.kind = Publisher Location.value=264$a | | +| Alma | [MARC 260$b](https://www.loc.gov/marc/bibliographic/bd260.html) | Publisher | | +| Alma | [MARC 264$b](https://www.loc.gov/marc/bibliographic/bd264.html) | Publisher | | +| Alma | [MARC 260$c](https://www.loc.gov/marc/bibliographic/bd260.html) | Dates.kind=PublicationDate | (this may already happen?) | +| Alma | [MARC 264$c](https://www.loc.gov/marc/bibliographic/bd264.html) | Dates.kind=PublicationDate | (this may already happen?) | + + +## Decision + +Coming soon + +## Consequences + +Either option will provide more normalized/consistent data. By ensuring we have a field that represents just the publisher name -- either via `PublicationInfo.name` or `Publisher` -- we will be able to add an additional mapping of `keyword` and allow for aggregation in OpenSearch/GraphQL/consuming applications such as TIMDEX UI. 
From aeb4afc6653db978e8fe66c00255a0e698535b99 Mon Sep 17 00:00:00 2001 From: Jeremy Prevost Date: Tue, 27 Feb 2024 14:12:03 -0500 Subject: [PATCH 2/6] Fixups --- ...003-support-aggregations-on-publisher-name.md | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/docs/adrs/0003-support-aggregations-on-publisher-name.md b/docs/adrs/0003-support-aggregations-on-publisher-name.md index fcdebc0..633659b 100644 --- a/docs/adrs/0003-support-aggregations-on-publisher-name.md +++ b/docs/adrs/0003-support-aggregations-on-publisher-name.md @@ -38,11 +38,13 @@ Create a new object `PublicationInfo` and map data into it. Most sources just us ```json { - publicationInfo: [{ - name: String - location: String - date: String (ideally dates would be dates... but that may be out of scope for now) - }] + "publicationInfo": [ + { + "name": "String", + "location": "String", + "date": "String (ideally dates would be dates... but that may be out of scope for now)" + } + ] } ``` @@ -68,9 +70,9 @@ GraphQL would collapse the new object and continue to serve it as the deprecated ### Proposed mappings (Option 2) -Move some of the Alma data to other existing objects, and create a new top level `Publisher` array of string. +Move some of the Alma data to other existing objects, and create a new top level `Publisher` array of strings. -All sources except Alma remap to use `PublicationInfo` instead of `PublicationInformation` with no other changes at this time. +All sources except Alma remap to use `Publisher` instead of `PublicationInformation` with no other changes at this time. This decouples our data model more from MARC where we seem to have modeled a set of data to match 260/264 in a way that doesn't seem to map to our other sources. Additionally, we already have other places for the extra info from 260/264 to map that may be more useful than as a new Object or embedded into the original String. 
From 99173c3fcceafb7b6daca126d0507c333a7970e4 Mon Sep 17 00:00:00 2001 From: Graham Hukill Date: Mon, 4 Mar 2024 11:12:34 -0500 Subject: [PATCH 3/6] add option 3 that combines option 1 and 2 --- ...-support-aggregations-on-publisher-name.md | 78 +++++++++++++++++++ 1 file changed, 78 insertions(+) diff --git a/docs/adrs/0003-support-aggregations-on-publisher-name.md b/docs/adrs/0003-support-aggregations-on-publisher-name.md index 633659b..3854803 100644 --- a/docs/adrs/0003-support-aggregations-on-publisher-name.md +++ b/docs/adrs/0003-support-aggregations-on-publisher-name.md @@ -88,6 +88,84 @@ GraphQL would continue to serve `PublicationInformation`, but replace it with th | Alma | [MARC 260$c](https://www.loc.gov/marc/bibliographic/bd260.html) | Dates.kind=PublicationDate | (this may already happen?) | | Alma | [MARC 264$c](https://www.loc.gov/marc/bibliographic/bd264.html) | Dates.kind=PublicationDate | (this may already happen?) | +### Proposed mappings (Option 3) + +This option is a combination of Option 1 and 2. + +In this scenario: +- **From Option 1**: all sources write publisher information to a multivalued object field `PublicationInfo` with fields like `[name, date, location, etc.]` + - there is no normalization or parsing of the data; strings are written as found from the original record +- **From Option 2**: where data is available (most commonly with Alma) sources extract date and location from the publisher information and write those values to `Dates` and `Locations` respectively, with a `@kind=Publisher` qualifier + - in the case of dates, we _could_ normalize and validate the date string to ensure it's a valid and meaningful Opensearch date + +Advantages of this option: +- all information for a specific publisher (e.g. name, date, location) is contextualized together as a complex object under `PublicationInfo` + - e.g. 
we can know "the published date via the 'Great Writings' publisher is 1930" +- for TIMDEX UI search and item pages, and GraphQL aggregations, there is no need to dig into complex objects + - simply look to `Dates` or `Locations` for that information where this data has been duplicated +- logic for extracting dates and locations from publisher information could be shared across all Transmogrifier sources + - e.g. it could be an automatically applied, secondary step after the `PublicationInfo` objects are created, pulling from `PublicationInfo.date` and `Publication.location` + - allows for more thorough date parsing for `Dates` entries, without losing meaningful strings from the source record that can remain in the `PublicationInfo` object + +Example record: +```json +{ + "publication_info": [ + { + "name": "Great Writings", + "date": "1930", + "location": "Bend, OR" + }, + { + "name": "Amazon Reprints", + "date": "2020", + "location": "Seattle, WA" + }, + { + "name": "Ebooks Inc.", + "date": "Circa 2023?" + } + ], + "dates": [ + { + "kind": "Published", + "value": "1930" + }, + { + "kind": "Published", + "value": "2020" + } + ], + "locations": [ + { + "kind": "Published", + "value": "Bend, OR" + }, + { + "kind": "Published", + "value": "Seattle, WA" + } + ] +} +``` +- note that bad or missing data from publisher "Ebooks Inc." 
is skipped for `dates` and `locations` extraction + +This avoids some subtle but potentially confusing scenarios: + +- **Option 1**: a user clicks date facet "1910" in search UI but does not see "1910" under "Dates" in the item page + - **Reason**: the UI item page didn't know it should reach into `PublicationInfo` objects for dates to show under "Dates", as this was custom logic applied to GraphQL aggregations and search facets + - **Option 3 fix**: GraphQL, UI search, and UI item pages all pull publishers from publishers, dates from dates, locations from locations, etc., no logic required + +- **Option 2**: a user is viewing an item page for the geospatial record "Fires in 1999 Dataset" but sees a strange "2020" under the "Dates" section + - **Reason**: the item page "Publisher" section only shows "GIS Pro Inc.", as "2020" was decoupled from the publisher (the `Dates` object would contain a `@kind=Published` qualifier, but we would then need to surface that in the item page) + - **Option 3 fix**: the "Publisher" section in the item page clearly also shows "2020", because it is still part of a complex object, which contextualizes that date + +Option 3 achieves data where and when needed, and with the appropriate amount of context, by _extracting and duplicating_ some data like dates and locations: + +- a user wants details about a publisher, look at full and complex `PublicationInfo` object in the record +- the API or UI wants to pull all meaningful dates or locations from a record, look to the `Dates` or `Locations` fields + +In either situation, no additional logic, mapping, or documentation is needed. 
## Decision From 1dfcfab19a41d5a2c021658e7acd9186c79afb53 Mon Sep 17 00:00:00 2001 From: Graham Hukill Date: Mon, 4 Mar 2024 11:15:50 -0500 Subject: [PATCH 4/6] update wording for option 2 --- docs/adrs/0003-support-aggregations-on-publisher-name.md | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/docs/adrs/0003-support-aggregations-on-publisher-name.md b/docs/adrs/0003-support-aggregations-on-publisher-name.md index 3854803..eab8aa7 100644 --- a/docs/adrs/0003-support-aggregations-on-publisher-name.md +++ b/docs/adrs/0003-support-aggregations-on-publisher-name.md @@ -70,9 +70,11 @@ GraphQL would collapse the new object and continue to serve it as the deprecated ### Proposed mappings (Option 2) -Move some of the Alma data to other existing objects, and create a new top level `Publisher` array of strings. +Create a new top level `Publisher` array of strings. -All sources except Alma remap to use `Publisher` instead of `PublicationInformation` with no other changes at this time. +All sources (including Alma) would begin to write the publisher name to the new multivalued string `Publisher` field. + +Additionally, any sources (most commonly Alma) that have publisher date or location information, could extract and write that data to other appropriate fields, e.g. `Dates` or `Locations` with a qualifier like `@kind="Published"`. This decouples our data model more from MARC where we seem to have modeled a set of data to match 260/264 in a way that doesn't seem to map to our other sources. Additionally, we already have other places for the extra info from 260/264 to map that may be more useful than as a new Object or embedded into the original String. 
From b9e0f434114ee216d6f90e2c2a7a38a231c822e5 Mon Sep 17 00:00:00 2001 From: Graham Hukill Date: Tue, 5 Mar 2024 09:54:12 -0500 Subject: [PATCH 5/6] set decision for ADR 0003 --- docs/adrs/0003-support-aggregations-on-publisher-name.md | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/docs/adrs/0003-support-aggregations-on-publisher-name.md b/docs/adrs/0003-support-aggregations-on-publisher-name.md index eab8aa7..0248ea9 100644 --- a/docs/adrs/0003-support-aggregations-on-publisher-name.md +++ b/docs/adrs/0003-support-aggregations-on-publisher-name.md @@ -171,7 +171,12 @@ In either situation, no additional logic, mapping, or documentation is needed. ## Decision -Coming soon +Proceed with Option 3: + +- create new, top level, multivalued object field `PublicationInfo` with properties `[name, date, location]` +- where possible, further parse dates and locations from `PublicationInfo` objects into `Dates` and `Locations` fields, with `@kind=Publisher` qualifier +- all pre-existing transformations begin writing to `PublicationInfo` instead of current multivalued string `publication_information` +- deprecate `publication_information` in GraphQL, point to new object field `PublicationInfo` ## Consequences From 9bb06e567a9195c4cdea92c48d6693a17169cf27 Mon Sep 17 00:00:00 2001 From: Graham Hukill Date: Wed, 6 Mar 2024 09:09:32 -0500 Subject: [PATCH 6/6] Change PublicationInfo to Publisher in ADR --- ...-support-aggregations-on-publisher-name.md | 27 ++++++++++--------- 1 file changed, 14 insertions(+), 13 deletions(-) diff --git a/docs/adrs/0003-support-aggregations-on-publisher-name.md b/docs/adrs/0003-support-aggregations-on-publisher-name.md index 0248ea9..8865efe 100644 --- a/docs/adrs/0003-support-aggregations-on-publisher-name.md +++ b/docs/adrs/0003-support-aggregations-on-publisher-name.md @@ -95,24 +95,25 @@ GraphQL would continue to serve `PublicationInformation`, but replace it with th This option is a combination of Option 1 and 2. 
In this scenario: -- **From Option 1**: all sources write publisher information to a multivalued object field `PublicationInfo` with fields like `[name, date, location, etc.]` +- **From Option 1**: all sources write publisher information to a multivalued object field `Publishers` (slight field name update) with fields like `[name, date, location, etc.]` - there is no normalization or parsing of the data; strings are written as found from the original record - **From Option 2**: where data is available (most commonly with Alma) sources extract date and location from the publisher information and write those values to `Dates` and `Locations` respectively, with a `@kind=Publisher` qualifier - in the case of dates, we _could_ normalize and validate the date string to ensure it's a valid and meaningful Opensearch date Advantages of this option: -- all information for a specific publisher (e.g. name, date, location) is contextualized together as a complex object under `PublicationInfo` +- all information for a specific publisher (e.g. name, date, location) is contextualized together as a complex object under `Publishers` - e.g. we can know "the published date via the 'Great Writings' publisher is 1930" - for TIMDEX UI search and item pages, and GraphQL aggregations, there is no need to dig into complex objects - simply look to `Dates` or `Locations` for that information where this data has been duplicated - logic for extracting dates and locations from publisher information could be shared across all Transmogrifier sources - - e.g. it could be an automatically applied, secondary step after the `PublicationInfo` objects are created, pulling from `PublicationInfo.date` and `Publication.location` - - allows for more thorough date parsing for `Dates` entries, without losing meaningful strings from the source record that can remain in the `PublicationInfo` object + - e.g. 
it could be an automatically applied, secondary step after the `Publishers` objects are created, pulling from `Publishers.date` and `Publishers.location` + - allows for more thorough date parsing for `Dates` entries, without losing meaningful strings from the source record that can remain in the `Publishers` object +- able to deprecate `publication_information` in GraphQL as the new field `publishers` does not conflict -Example record: +Example TIMDEX record: ```json { - "publication_info": [ + "publishers": [ { "name": "Great Writings", "date": "1930", @@ -155,7 +156,7 @@ Example record: This avoids some subtle but potentially confusing scenarios: - **Option 1**: a user clicks date facet "1910" in search UI but does not see "1910" under "Dates" in the item page - - **Reason**: the UI item page didn't know it should reach into `PublicationInfo` objects for dates to show under "Dates", as this was custom logic applied to GraphQL aggregations and search facets + - **Reason**: the UI item page didn't know it should reach into `Publishers` objects for dates to show under "Dates", as this was custom logic applied to GraphQL aggregations and search facets - **Option 3 fix**: GraphQL, UI search, and UI item pages all pull publishers from publishers, dates from dates, locations from locations, etc., no logic required - **Option 2**: a user is viewing an item page for the geospatial record "Fires in 1999 Dataset" but sees a strange "2020" under the "Dates" section @@ -164,7 +165,7 @@ This avoids some subtle but potentially confusing scenarios: Option 3 achieves data where and when needed, and with the appropriate amount of context, by _extracting and duplicating_ some data like dates and locations: -- a user wants details about a publisher, look at full and complex `Publishers` object in the record +- a user wants details about a publisher, look at full and complex `Publishers` object in the record - the API or UI wants to pull all meaningful dates or locations from a 
record, look to the `Dates` or `Locations` fields In either situation, no additional logic, mapping, or documentation is needed. @@ -173,11 +174,11 @@ In either situation, no additional logic, mapping, or documentation is needed. Proceed with Option 3: -- create new, top level, multivalued object field `PublicationInfo` with properties `[name, date, location]` -- where possible, further parse dates and locations from `PublicationInfo` objects into `Dates` and `Locations` fields, with `@kind=Publisher` qualifier -- all pre-existing transformations begin writing to `PublicationInfo` instead of current multivalued string `publication_information` -- deprecate `publication_information` in GraphQL, point to new object field `PublicationInfo` +- create new, top level, multivalued object field `Publishers` with properties `[name, date, location]` +- where possible, further parse dates and locations from `Publishers` objects into `Dates` and `Locations` fields, with `@kind=Publisher` qualifier +- all pre-existing transformations begin writing to `Publishers` instead of current multivalued string `publication_information` +- deprecate `publication_information` in GraphQL, point to new object field `Publishers` ## Consequences -Either option will provide more normalized/consistent data. By ensuring we have a field that represents just the publisher name -- either via `PublicationInfo.name` or `Publisher` -- we will be able to add an additional mapping of `keyword` and allow for aggregation in OpenSearch/GraphQL/consuming applications such as TIMDEX UI. +Any option will provide more normalized/consistent data. By ensuring we have a field that represents just the publisher name -- either via `Publishers.name` or `Publisher` -- we will be able to add an additional mapping of `keyword` and allow for aggregation in OpenSearch/GraphQL/consuming applications such as TIMDEX UI.
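
The secondary extraction step described under Option 3 (copy usable dates and locations out of the `publishers` objects into `dates` and `locations`) might look roughly like the sketch below. It is an illustration only, not Transmogrifier code: the function name `derive_dates_and_locations`, the "four-digit year" validity rule, and the default `kind` value are all assumptions. The ADR text uses both `Publisher` and `Published` for the qualifier, so it is left as a parameter here.

```python
from typing import Any


def derive_dates_and_locations(
    record: dict[str, Any], kind: str = "Published"
) -> dict[str, Any]:
    """Return a copy of `record` with publisher dates and locations
    duplicated into the top-level `dates` and `locations` lists.

    Hypothetical helper: the name, the validity rule, and the default
    `kind` are assumptions made for illustration.
    """
    dates = list(record.get("dates", []))
    locations = list(record.get("locations", []))

    for publisher in record.get("publishers", []):
        date = (publisher.get("date") or "").strip()
        # Assumed rule: only a plain four-digit year counts as a usable date.
        if len(date) == 4 and date.isascii() and date.isdigit():
            dates.append({"kind": kind, "value": date})

        location = (publisher.get("location") or "").strip()
        if location:
            locations.append({"kind": kind, "value": location})

    # The original `publishers` strings are left untouched.
    return {**record, "dates": dates, "locations": locations}


# The example record from Option 3, before extraction.
record = {
    "publishers": [
        {"name": "Great Writings", "date": "1930", "location": "Bend, OR"},
        {"name": "Amazon Reprints", "date": "2020", "location": "Seattle, WA"},
        {"name": "Ebooks Inc.", "date": "Circa 2023?"},
    ]
}
result = derive_dates_and_locations(record)
```

With this rule, "Ebooks Inc." contributes nothing to `dates` or `locations`, while its original strings stay available under `publishers`, which is the behavior the example record above describes.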