Adds ADR for remapping PublicationInformation #132

Merged
merged 6 commits into from
Mar 6, 2024
1 change: 1 addition & 0 deletions .adr-dir
@@ -0,0 +1 @@
docs/adrs
184 changes: 184 additions & 0 deletions docs/adrs/0003-support-aggregations-on-publisher-name.md
@@ -0,0 +1,184 @@
# 3. Support Aggregations on Publisher Name

Date: 2024-02-23

## Status

Accepted

## Context

Our current data model maps data about Publisher Name, Publication Year, and Publication Location into a single multivalued field (an Array of Strings). Mapping these different concepts into a single field makes the data less meaningful and less useful than it could be with a change to our data model.

Mapping different types of data into a single field makes aggregation confusing: we'd see values like "1999" alongside "Massachusetts Institute of Technology", which is not usable.

### Current mappings

| Source | Mappings | TIMDEX Field | Examples |
| --- | --- | --- | --- |
| Alma | [MARC 260$abcdef](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInformation | Charlottesville : University of Virginia Press, [2015], ©2015 |
| | [MARC 264$abc](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInformation | DeKalb, Illinois : NIU Press, 2014, ©2014 |
| | | PublicationInformation | null |
| | | PublicationInformation | A.O.K. 3 |
| | | PublicationInformation | American Antiquarian Society Historical Periodicals |
| DSpace@MIT (METS) | XML select array of Strings matching element `publisher` | PublicationInformation | Massachusetts Institute of Technology |
| | | PublicationInformation | Elsevier BV |
| | | PublicationInformation | null |
| | | PublicationInformation | ACM\|Creativity and Cognition |
| | | PublicationInformation | Cambridge, Mass. : Alfred P. Sloan School of Management, Massachusetts Institute of Technology |
| | | PublicationInformation | Elsevier |
| | | PublicationInformation | Wiley |
| ArchivesSpace | XML select array of Strings matching element `dc:publisher` | PublicationInformation | Massachusetts Institute of Technology. Libraries. Department of Distinctive Collections (note: this appears to be effectively a static string for all of our ASpace records) |
| MIT GIS | | PublicationInformation | |
| OGM GIS | | PublicationInformation | |

### Proposed mappings (Option 1)

Create a new object `PublicationInfo` and map data into it. Most sources would populate only `PublicationInfo.name`, but Alma would populate additional fields.

```json
{
"publicationInfo": [
{
"name": "String",
"location": "String",
"date": "String (ideally dates would be dates... but that may be out of scope for now)"
}
]
}
```

Notes: we should consider whether this really needs to be an array. 260 and 264 are repeatable fields in MARC, but it would be worth investigating whether they are regularly repeated in practice, or whether we could instead pick the "first" one that shows up and get a simpler model without losing any real-world functionality. For now it is modeled as an array of `PublicationInfo`, assuming that is appropriate.
Contributor

From EngX's / API POV, what makes a single value simpler than an array? Is it parsing those results from the actual document in GraphQL?

Member Author

It's less EngX than end users I'm thinking about. Nested objects can be more difficult to work with in GraphQL than the equivalent top level field. And strings are easier than arrays to work with initially. This doesn't mean we should aim for totally flat records of entirely strings, just that when our data does not require objects or arrays, we should lean towards strings.


It seems like we should make it an array given that it's possible for multiple occurrences, but I agree that it would be useful to find out how often that happens, if ever, to inform that decision. It would be kind of silly if we accommodated repeatable MARC fields that are never actually repeated.

Member

I was going to ask about what sorts of warnings or logs are generated when Transmog encounters multiple values when it expects only one - but it seems that we've already decided not to try and impose a single-value requirement at this time?


GraphQL would collapse the new object and continue to serve it as the deprecated field `PublicationInformation` until we are confident it is no longer being used.

| Source | Mappings | TIMDEX Field | Notes |
| --- | --- | --- | --- |
| Alma | [MARC 260$a](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInfo.Location | |
| Alma | [MARC 264$a](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInfo.Location | |
| Alma | [MARC 260$b](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInfo.Name | |
| Alma | [MARC 264$b](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInfo.Name | |
| Alma | [MARC 260$c](https://www.loc.gov/marc/bibliographic/bd260.html) | PublicationInfo.Date | |
| Alma | [MARC 264$c](https://www.loc.gov/marc/bibliographic/bd264.html) | PublicationInfo.Date | |
| Alma | [MARC 260$d](https://www.loc.gov/marc/bibliographic/bd260.html) | Invalid field. Don't map | |
| Alma | [MARC 260$e](https://www.loc.gov/marc/bibliographic/bd260.html) | Don't map | |
| Alma | [MARC 260$f](https://www.loc.gov/marc/bibliographic/bd260.html) | Don't map | |
| DSpace | Keep logic | PublicationInfo.Name | note: consider future work to normalize in source or during transform |
| ASpace | Keep logic | PublicationInfo.Name | |
| GIS MIT | | PublicationInfo.Name | |
| GIS OGM | | PublicationInfo.Name | |
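
As a rough illustration of how the API layer might collapse the new objects back into the deprecated `PublicationInformation` strings, here is a minimal Python sketch. The function name `collapse_publication_info` and the separators (` : ` and `, `, loosely mimicking the punctuation in the Alma examples above) are assumptions for illustration only, not the actual GraphQL implementation.

```python
from typing import Dict, List, Optional


def collapse_publication_info(entries: List[Dict[str, Optional[str]]]) -> List[str]:
    """Flatten PublicationInfo objects into legacy-style strings (hypothetical helper)."""
    collapsed = []
    for entry in entries:
        location = entry.get("location")
        name = entry.get("name")
        date = entry.get("date")
        # Assumed ordering: "location : name, date", skipping any missing part.
        place_and_name = " : ".join(part for part in (location, name) if part)
        text = ", ".join(part for part in (place_and_name, date) if part)
        if text:
            collapsed.append(text)
    return collapsed


legacy = collapse_publication_info(
    [
        {"location": "DeKalb, Illinois", "name": "NIU Press", "date": "2014"},
        {"name": "Elsevier"},
    ]
)
print(legacy)
```

Sources that only supply a name (DSpace, ASpace, GIS) would round-trip unchanged under this sketch, while Alma values would be reassembled from their parts.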

### Proposed mappings (Option 2)
Contributor

@JPrevost For Option 2, would it also require only choosing the first occurrence of the MARC 260 and 264 fields? 🤔

Member Author

@JPrevost Feb 27, 2024

Good question! I believe similar to the discussion in Option 1, the fields are repeatable but it's unclear how frequently they are actually repeated in practice. I'd suggest we may want to look at our data to decide if this should be an Array of Strings or a single String. I suspect an Array of Strings will be where we end up but I'd hate to make that decision based on suspicion rather than data.

Contributor

@ghukill Feb 28, 2024

Just to tease this out a bit more... if:

  • publisher is a multivalued string field
  • and any other publisher information gets written to other fields like dates or locations (see comment below)

wouldn't this mean, for Option 2, that we could accommodate both MARC 260 and 264? Stated more generally, that Option 2 supports multiple publishers by virtue of the new publisher field being multivalued and all other values mapped to other, multivalued fields as well?

Member Author

Correct. If we just make publisher an array of Strings, we can support any source that currently or in the future has multiple values for all aspects of this (dates, names, locations).

If we make publisher a single string, we could still support multiple Dates/Locations if that is the only thing that is truly multivalued in our sources. That may be a bit hard to tease out if that is all we truly need or if we should just accept that supporting multivalued is better because we just aren't sure.

Contributor

@ghukill Feb 29, 2024

Thanks to @jonavellecuerdo for initiating a look into this data, here is a spreadsheet that gives some insight into how publication_information is currently used across sources: https://docs.google.com/spreadsheets/d/1zT4LlGQDyuwvxZDMnHGtgu45bnl2Kau42n_Njg2T4qQ/edit#gid=0.

It shows a maximum of ten rows per combination of source + publication_information array length to get a sense of what individual values look like; ranging from 2-11 values in this array.

Also, here is a breakdown of source x array_length for this field:

+-----------------------------------------------+------------+------------------+
|source                                         |array_length|array_length_count|
+-----------------------------------------------+------------+------------------+
|MIT Alma                                       |11          |1                 |
|MIT Alma                                       |10          |1                 |
|MIT Alma                                       |9           |2                 |
|MIT Alma                                       |8           |2                 |
|MIT Alma                                       |7           |6                 |
|MIT Alma                                       |6           |24                |
|Woods Hole Open Access Server                  |5           |1                 |
|OpenGeoMetadata GIS Resources                  |5           |1                 |
|MIT Alma                                       |5           |80                |
|MIT Alma                                       |4           |526               |
|OpenGeoMetadata GIS Resources                  |4           |25                |
|MIT GIS Resources                              |3           |1                 |
|MIT Alma                                       |3           |19345             |
|Woods Hole Open Access Server                  |3           |40                |
|OpenGeoMetadata GIS Resources                  |3           |426               |
|DSpace@MIT                                     |2           |22                |
|Woods Hole Open Access Server                  |2           |1063              |
|MIT Alma                                       |2           |317647            |
|OpenGeoMetadata GIS Resources                  |2           |89724             |
|MIT GIS Resources                              |2           |1986              |
|MIT Alma                                       |1           |2721358           |
|MIT ArchivesSpace                              |1           |1288              |
|OpenGeoMetadata GIS Resources                  |1           |29687             |
|DSpace@MIT                                     |1           |127517            |
|Woods Hole Open Access Server                  |1           |5135              |
|Zenodo                                         |1           |4167              |
|LibGuides                                      |1           |362               |
|MIT GIS Resources                              |1           |56                |
|Abdul Latif Jameel Poverty Action Lab Dataverse|1           |131               |
|Research Databases                             |null        |0                 |
|MIT Alma                                       |null        |0                 |
|Woods Hole Open Access Server                  |null        |0                 |
|DSpace@MIT                                     |null        |0                 |
+-----------------------------------------------+------------+------------------+

A couple of takeaways:

  • the majority of records have 1 value (single valued)
  • about 300k in Alma have 2 values
    • BUT, many of the 2nd values are just copyright dates like ©2014
  • Alma records with 3 values are more indicative of another actual publisher
  • tail for 3+ sources are somewhat outliers

From this cursory look, it appears that supporting multiple publishers would be beneficial. Furthermore, there are lots of instances where the 2nd "publisher" in these arrays is just a date, which seems to lean into Option #2 where dates could be broken out separately, thereby avoiding a Publisher object with nothing but a date while still getting that date included in the record.

Member Author

Awesome, thanks for the peek into the data. This is extremely helpful context. I feel like I have enough info to add some additional context to the Options and propose a Decision (option 2 seems popular with the top level array for publisher names and everything else broken into Location and Dates).


Create a new top level `Publisher` array of strings.

All sources (including Alma) would begin to write the publisher name to the new multivalued string `Publisher` field.

Additionally, any source (most commonly Alma) that has publisher date or location information could extract and write that data to other appropriate fields, e.g. `Dates` or `Locations`, with a qualifier like `@kind="Published"`.

This decouples our data model more from MARC where we seem to have modeled a set of data to match 260/264 in a way that doesn't seem to map to our other sources. Additionally, we already have other places for the extra info from 260/264 to map that may be more useful than as a new Object or embedded into the original String.

GraphQL would continue to serve `PublicationInformation`, but replace it with the data in the new field `Publisher`. The deprecation notice would explain that it is not a 1:1 mapping of the old field and what aspects have been moved to additional fields.


| Source | Mappings | TIMDEX Field | Notes |
| --- | --- | --- | --- |
| Alma | [MARC 260$a](https://www.loc.gov/marc/bibliographic/bd260.html) | Location.kind = Publisher Location.value=260$a | |
| Alma | [MARC 264$a](https://www.loc.gov/marc/bibliographic/bd264.html) | Location.kind = Publisher Location.value=264$a | |
| Alma | [MARC 260$b](https://www.loc.gov/marc/bibliographic/bd260.html) | Publisher | |
| Alma | [MARC 264$b](https://www.loc.gov/marc/bibliographic/bd264.html) | Publisher | |
| Alma | [MARC 260$c](https://www.loc.gov/marc/bibliographic/bd260.html) | Dates.kind=PublicationDate | (this may already happen?) |
| Alma | [MARC 264$c](https://www.loc.gov/marc/bibliographic/bd264.html) | Dates.kind=PublicationDate | (this may already happen?) |
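
The table above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not Transmogrifier's actual code: the `(tag, {subfield: [values]})` input shape, the function names `clean` and `map_publication_fields`, and the trailing-punctuation cleanup are all hypothetical. The `kind` values are taken from the table.

```python
from typing import Dict, List, Tuple

# Hypothetical, simplified stand-in for a parsed MARC field:
# (tag, {subfield code: [values]}).
MarcField = Tuple[str, Dict[str, List[str]]]


def clean(value: str) -> str:
    """Trim whitespace and trailing ISBD punctuation (an assumed normalization)."""
    return value.strip().rstrip(" :,;")


def map_publication_fields(fields: List[MarcField]) -> dict:
    """Split MARC 260/264 subfields into publisher names, locations, and dates."""
    record = {"publisher": [], "locations": [], "dates": []}
    for tag, subfields in fields:
        if tag not in ("260", "264"):
            continue
        for value in subfields.get("a", []):
            record["locations"].append({"kind": "Publisher", "value": clean(value)})
        for value in subfields.get("b", []):
            record["publisher"].append(clean(value))
        for value in subfields.get("c", []):
            record["dates"].append({"kind": "PublicationDate", "value": clean(value)})
    return record


example = map_publication_fields(
    [
        ("264", {"a": ["DeKalb, Illinois :"], "b": ["NIU Press,"], "c": ["2014"]}),
        ("264", {"c": ["©2014"]}),
    ]
)
print(example)
```

Note that the second 264 field, which carries only a copyright date, contributes a date entry without producing an empty publisher.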

### Proposed mappings (Option 3)

This option is a combination of Option 1 and 2.

In this scenario:
- **From Option 1**: all sources write publisher information to a multivalued object field `Publishers` (slight field name update) with fields like `[name, date, location, etc.]`
  - there is no normalization or parsing of the data; strings are written as found from the original record
- **From Option 2**: where data is available (most commonly with Alma) sources extract date and location from the publisher information and write those values to `Dates` and `Locations` respectively, with a `@kind=Publisher` qualifier
  - in the case of dates, we _could_ normalize and validate the date string to ensure it's a valid and meaningful OpenSearch date

Advantages of this option:
- all information for a specific publisher (e.g. name, date, location) is contextualized together as a complex object under `Publishers`
  - e.g. we can know "the published date via the 'Great Writings' publisher is 1930"
- for TIMDEX UI search and item pages, and GraphQL aggregations, there is no need to dig into complex objects
  - simply look to `Dates` or `Locations` for that information where this data has been duplicated
- logic for extracting dates and locations from publisher information could be shared across all Transmogrifier sources
  - e.g. it could be an automatically applied, secondary step after the `Publishers` objects are created, pulling from `Publishers.date` and `Publishers.location`
- allows for more thorough date parsing for `Dates` entries, without losing meaningful strings from the source record that can remain in the `Publishers` object
- able to deprecate `publication_information` in GraphQL as the new field `publishers` does not conflict

Example TIMDEX record:
```json
{
  "publishers": [
    {
      "name": "Great Writings",
      "date": "1930",
      "location": "Bend, OR"
    },
    {
      "name": "Amazon Reprints",
      "date": "2020",
      "location": "Seattle, WA"
    },
    {
      "name": "Ebooks Inc.",
      "date": "Circa 2023?"
    }
  ],
  "dates": [
    {
      "kind": "Published",
      "value": "1930"
    },
    {
      "kind": "Published",
      "value": "2020"
    }
  ],
  "locations": [
    {
      "kind": "Published",
      "value": "Bend, OR"
    },
    {
      "kind": "Published",
      "value": "Seattle, WA"
    }
  ]
}
```
- note that bad or missing data from publisher "Ebooks Inc." is skipped for `dates` and `locations` extraction
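
The secondary extraction step described above might look something like the following Python sketch. It is only an illustration: the function name `extract_dates_and_locations` is hypothetical, and treating "exactly four digits" as a valid date is an assumed stand-in for whatever date validation is eventually chosen.

```python
import re

# Assumed validity rule for this sketch: the date string is exactly a 4-digit year.
YEAR_PATTERN = re.compile(r"\d{4}")


def extract_dates_and_locations(publishers):
    """Derive `dates` and `locations` entries from `publishers` objects."""
    dates = []
    locations = []
    for publisher in publishers:
        date = (publisher.get("date") or "").strip()
        if YEAR_PATTERN.fullmatch(date):
            dates.append({"kind": "Published", "value": date})
        location = (publisher.get("location") or "").strip()
        if location:
            locations.append({"kind": "Published", "value": location})
    return dates, locations


publishers = [
    {"name": "Great Writings", "date": "1930", "location": "Bend, OR"},
    {"name": "Amazon Reprints", "date": "2020", "location": "Seattle, WA"},
    {"name": "Ebooks Inc.", "date": "Circa 2023?"},
]
dates, locations = extract_dates_and_locations(publishers)
print(dates)
print(locations)
```

Run against the example record's `publishers`, this reproduces its `dates` and `locations` arrays, and the original "Circa 2023?" string remains available in the `publishers` object.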

This avoids some subtle but potentially confusing scenarios:

- **Option 1**: a user clicks date facet "1910" in search UI but does not see "1910" under "Dates" in the item page
  - **Reason**: the UI item page didn't know it should reach into `Publishers` objects for dates to show under "Dates", as this was custom logic applied to GraphQL aggregations and search facets
  - **Option 3 fix**: GraphQL, UI search, and UI item pages all pull publishers from publishers, dates from dates, locations from locations, etc., no logic required

- **Option 2**: a user is viewing an item page for the geospatial record "Fires in 1999 Dataset" but sees a strange "2020" under the "Dates" section
  - **Reason**: the item page "Publisher" section only shows "GIS Pro Inc.", as "2020" was decoupled from the publisher (`Date` object would contain `@kind=Published` qualifier, but we'd need to then include that in the item page)
  - **Option 3 fix**: the "Publisher" section in the item page clearly also shows "2020", because it is still part of a complex object, thereby contextualizing that date

Option 3 achieves data where and when needed, and with the appropriate amount of context, by _extracting and duplicating_ some data like dates and locations:

- when a user wants details about a publisher, they look at the full, complex `Publishers` object in the record
- when the API or UI wants to pull all meaningful dates or locations from a record, it looks to the `Dates` or `Locations` fields

In either situation, no additional logic, mapping, or documentation is needed.

## Decision

Proceed with Option 3:

- create new, top level, multivalued object field `Publishers` with properties `[name, date, location]`
- where possible, further parse dates and locations from `Publishers` objects into `Dates` and `Locations` fields, with `@kind=Publisher` qualifier
- all pre-existing transformations begin writing to `Publishers` instead of current multivalued string `publication_information`
- deprecate `publication_information` in GraphQL, point to new object field `Publishers`

## Consequences

Any option will provide more normalized/consistent data. By ensuring we have a field that represents just the publisher name -- either via `Publishers.name` or `Publisher` -- we will be able to add an additional mapping of `keyword` and allow for aggregation in OpenSearch/GraphQL/consuming applications such as TIMDEX UI.
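
To make the `keyword` mapping concrete, here is a hedged Python sketch of what an OpenSearch mapping fragment and a terms aggregation for Option 3 might look like, expressed as plain dictionaries. The field layout (`publishers` as a regular object field with a `name.keyword` multi-field) and the names `publishers_mapping` and `publisher_name_aggregation` are assumptions, not the actual TIMDEX index configuration.

```python
# Assumed mapping fragment: `publishers` is an object field whose `name`
# is full-text searchable and also indexed as an exact-match keyword.
publishers_mapping = {
    "publishers": {
        "properties": {
            "name": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            "date": {"type": "text"},
            "location": {"type": "text"},
        }
    }
}


def publisher_name_aggregation(size=10):
    """Build a search body that buckets records by exact publisher name."""
    return {
        "size": 0,  # return only aggregation buckets, no search hits
        "aggs": {
            "publisher_names": {
                "terms": {"field": "publishers.name.keyword", "size": size}
            }
        },
    }


print(publisher_name_aggregation(5))
```

If `publishers` were instead mapped with the `nested` type, the terms aggregation would need to be wrapped in a `nested` aggregation; the plain object form shown here avoids that for simple name facets.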