Publication metadata issues
Justin Littman edited this page May 28, 2021 · 1 revision
Dear Future Self, please note:
- Some early publications do not have identifiers, so there are likely some duplicate publications.
- Some early publications were created by manually entering CVs and are not from a reliable metadata source.
- Users are able to manually enter publications via the CAP website, with minimal or no validation on the metadata entered in the various fields.
- We do not validate identifiers (e.g., a DOI that arrives with a typo or other problem is stored in its invalid form).
- The pub hash (our internal data structure for storing publication metadata) is inconsistent (see pub_hash_schema.yml, e.g., the various fields for Author) and rife with nulls and blanks. In general, you should be suspicious of any publication metadata contained in the pub hash.
- How the pub hash was populated, and how completely, varies with the source of the data and the state of the code when the harvest was performed. Note that publications are not (intentionally) re-harvested when better data sources become available.
- By design, we harvest publications from multiple sources, each of which has data errors of its own. Since we do not curate the data (due to the large volume), any errors from the data source are propagated into our system.
- Most of the metadata is not used by CAP, so data issues are mostly hidden.
- SUL-PUB covers only a subset of the types of scholarly output produced by Stanford researchers (primarily journal articles and books).
HOWEVER, this is a valid dataset of publications for Stanford researchers. It could be readily used to re-harvest metadata from reliable metadata sources based on the recorded identifiers. (For example, any publication with a DOI could be re-harvested from CrossRef.)
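As a rough illustration of that re-harvest idea, here is a minimal Python sketch that filters out DOIs that do not look valid and builds a Crossref REST API lookup URL for the rest. It is not SUL-PUB code. The record shape (a dict with `id` and `doi` keys) and the function names `normalize_doi`, `crossref_url`, and `reharvest_candidates` are assumptions made for this example, and the DOI pattern is a loose heuristic, not a full validator. No network request is made here.

```python
import re
from urllib.parse import quote

# Loose heuristic for modern DOIs ("10.", a 4-9 digit registrant, "/", a suffix).
# Assumption: good enough to catch typos and blanks; it is not a full validator.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes that commonly appear in front of a DOI in hand-entered data.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return a cleaned DOI string, or None if the value does not look like a DOI."""
    if not raw or not isinstance(raw, str):
        return None  # nulls and blanks are expected in this data
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in DOI_PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    return doi if DOI_PATTERN.match(doi) else None


def crossref_url(doi):
    """Build the Crossref REST API URL for a single work."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


def reharvest_candidates(pubs):
    """Split records into lookup URLs keyed by id, and ids that need manual review.

    `pubs` is a list of dicts with hypothetical "id" and "doi" keys; the real
    pub hash stores identifiers differently (see pub_hash_schema.yml).
    """
    urls, needs_review = {}, []
    for pub in pubs:
        doi = normalize_doi(pub.get("doi"))
        if doi:
            urls[pub["id"]] = crossref_url(doi)
        else:
            needs_review.append(pub["id"])
    return urls, needs_review
```

Publications that land in the review list (no DOI, or a DOI that fails the format check) are the ones most likely to be manual entries or duplicates, so they would need a different matching strategy.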