Send PMIDs from Elements to eSchol #66

Open
DevinSmithWork opened this issue Nov 14, 2024 · 6 comments
@DevinSmithWork
Collaborator

During the rewrite of the PubMed LinkOut script, some fact-finding indicated that PMIDs are being added to eScholarship manually (e.g. only 48 eScholarship items in total had these IDs in 2023).

However, many more publications in Elements have PMIDs. Investigate whether we can send these automatically via the Deposit xwalk + connectRT2.

@DevinSmithWork DevinSmithWork self-assigned this Nov 14, 2024
@DevinSmithWork
Collaborator Author

Example Elements pub with a PubMed record: 4503909

@DevinSmithWork
Collaborator Author

DevinSmithWork commented Nov 18, 2024

Fact-Finding

We can pass all of a publication's external identifiers out through the xwalk. There are two options for formatting the data:

  1. We can pass all the identifiers in a delimited list: 12345 (pmid)||abcxyz (myid)
  2. We can select only the identifiers we're looking for and pass them individually.

Warning: This may slow RT2 down slightly. I suspect the string aggregation (i.e. collecting all the external-identifiers) is slower than the PMID-only selection, but this is unconfirmed.

Changes required

Xwalk out elements input reference:

<api:field name="external-identifiers" type="identifier-list" display-name="External identifiers">
  <api:identifiers>
    <api:identifier scheme="pubmed">34487685</api:identifier>
    <api:identifier scheme="pmc">PMC8415897</api:identifier>
  </api:identifiers>
</api:field>

Sending all external IDs

This field mapping references the entire external-identifiers list, and passes a delimited string.

<xwalk:field-mapping to="external-identifiers" is-list="true" separator="||">
  <xwalk:field-source from="external-identifiers" />
</xwalk:field-mapping>

Only PMIDs

This field mapping uses a conditional to check the identifier:scheme in the input list, and pass through only the pubmed scheme value.

<xwalk:field-mapping to="pmid">
  <xwalk:field-source from="external-identifiers">
    <xwalk:if>
      <xwalk:condition operator="equals" argument-field="identifier:scheme">pubmed</xwalk:condition>
      <xwalk:result>
        <xwalk:field-source data-part="identifier:value" />
      </xwalk:result>
    </xwalk:if>
  </xwalk:field-source>
</xwalk:field-mapping>

Output from the above (excerpt):

<metadataentries>
  <metadataentry>
    <key>pmid</key>
    <value>26706127</value>
  </metadataentry>
  <metadataentry>
    <key>elements-pub-id</key>
    <value>3801055</value>
  </metadataentry>
  <metadataentry>
    <key>reporting-date-1</key>
    <value>2016-02-01</value>
  </metadataentry>
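
For anyone who wants to sanity-check the two formatting options outside of the xwalk, here is a minimal Python sketch of the same selection logic, run against the input reference above. It is not how the crosswalk itself evaluates mappings. The namespace URI bound to `api:` and the helper names (`read_identifiers`, `all_ids_delimited`, `pmid_only`) are assumptions for illustration, and the `value (scheme)` layout follows the example in option 1 rather than confirmed xwalk output.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI for the "api:" prefix (the issue shows only the prefix).
API_NS = "http://www.symplectic.co.uk/publications/api"

SAMPLE = f"""
<api:field xmlns:api="{API_NS}" name="external-identifiers" type="identifier-list">
  <api:identifiers>
    <api:identifier scheme="pubmed">34487685</api:identifier>
    <api:identifier scheme="pmc">PMC8415897</api:identifier>
  </api:identifiers>
</api:field>
""".strip()


def read_identifiers(field_xml):
    """Return (scheme, value) pairs from an external-identifiers field."""
    root = ET.fromstring(field_xml)
    return [
        (el.get("scheme"), (el.text or "").strip())
        for el in root.iter(f"{{{API_NS}}}identifier")
    ]


def all_ids_delimited(pairs, separator="||"):
    """Option 1: every identifier in one delimited string."""
    return separator.join(f"{value} ({scheme})" for scheme, value in pairs)


def pmid_only(pairs):
    """Option 2: only the pubmed-scheme value, or None if there is none."""
    for scheme, value in pairs:
        if scheme == "pubmed":
            return value
    return None


pairs = read_identifiers(SAMPLE)
print(all_ids_delimited(pairs))  # 34487685 (pubmed)||PMC8415897 (pmc)
print(pmid_only(pairs))          # 34487685
```

Option 2 keeps the eSchol side simple, since the `pmid` key arrives ready to use and nothing has to split a delimited string downstream.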

@DevinSmithWork
Collaborator Author

Seems to be working fine: Elements pub -- eSchol pub

@DevinSmithWork
Collaborator Author

PMIDs beginning to appear in the regular syncing process:

connectRT2.2024.11.18:7
connectRT2.2024.11.19:2
connectRT2.2024.11.20:253
connectRT2.2024.11.21:2873

@DevinSmithWork
Collaborator Author

DevinSmithWork commented Nov 26, 2024

Symplectic's answer re: disaggregating via the relevance scheme: no dice. ticket link

"there is no way to configure the relevance scheme to generate a hash only when a pubmed ID is added in this field. You can use the field as is, and when it is modified, a new hash will be generated regardless of what identifier is added."

@DevinSmithWork
Collaborator Author

DevinSmithWork commented Jan 8, 2025

2025 Update: Good News / Bad News

Let's discuss this on Thursday

  • The PMID syncing ran from Dec. 17 to Dec. 26 without issue.
  • However, only about a quarter of the Elements pubs with PMIDs synced their PMIDs to eSchol (~50k of ~200k).
  • After some investigation, this is happening because of the relevance scheme's masking step.

Relevance Scheme details

  • In the rel. scheme's masking step, we disregard any publications which DO NOT have either (1) a manual record, or (2) grant links.
  • The 50k PMID items which were synced in December had one or both of these features. The remaining 150k do not.

Only one change is required to include all eSchol pubs with PubMed records, here:

<rel:condition argument-field="object.record-sources" operator="contains-any-of">manual</rel:condition>

Changes to:

<rel:condition argument-field="object.record-sources" operator="contains-any-of">manual,pubmed</rel:condition>
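
To make the effect of that one-line change concrete, here is a small Python sketch of the masking rule as described above: keep a pub if it has an allowed record source or at least one grant link. This is an illustration only. It assumes `contains-any-of` treats the element text as a comma-separated list, and the sample pubs, the `scopus` source name, and the helper names are made up for the example.

```python
def contains_any_of(values, allowed_csv):
    """True if any value appears in the comma-separated allowed list (assumed semantics)."""
    allowed = {item.strip() for item in allowed_csv.split(",")}
    return any(value in allowed for value in values)


def passes_mask(pub, allowed_sources):
    """Keep a pub with an allowed record source OR at least one grant link."""
    has_source = contains_any_of(pub["record_sources"], allowed_sources)
    return has_source or pub["grant_links"] > 0


# Hypothetical pubs covering the four cases that matter here.
pubs = [
    {"id": "A", "record_sources": ["manual", "pubmed"], "grant_links": 0},
    {"id": "B", "record_sources": ["pubmed"], "grant_links": 2},
    {"id": "C", "record_sources": ["pubmed", "scopus"], "grant_links": 0},
    {"id": "D", "record_sources": ["scopus"], "grant_links": 0},
]

before = [p["id"] for p in pubs if passes_mask(p, "manual")]
after = [p["id"] for p in pubs if passes_mask(p, "manual,pubmed")]
print(before)  # ['A', 'B']
print(after)   # ['A', 'B', 'C']
```

Pub C stands in for the remaining 150k: it has a PubMed record but no manual record and no grant links, so it only passes once `pubmed` is added to the condition.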

Estimated syncing time

The first 50k took about 9 days to sync. At that pace (roughly 5.5k pubs per day), the remaining 150k would need about 27 days, so plan for around 30.
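
The estimate is a straight-line extrapolation from the December run. A quick sketch of the arithmetic, assuming the sync rate stays constant (it may not, given the larger diff-sync pool):

```python
# Figures from the December run, as reported in this comment.
synced_pubs = 50_000
sync_days = 9
remaining_pubs = 150_000

pubs_per_day = synced_pubs / sync_days
estimated_days = remaining_pubs * sync_days / synced_pubs

print(f"Observed rate: {pubs_per_day:.0f} pubs/day")
print(f"Estimated time for remaining pubs: {estimated_days:.0f} days")
```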

Potential impacts for rolling diff sync

  • Presently, 79k publications are eligible for diff syncing (this includes the 50k PMID pubs).
  • Adding 150k pubs will roughly triple the number of eligible pubs (79k + 150k ≈ 229k).
  • There are known factors which create false-positive syncing churn. It's possible this might cause issues with this much larger batch of diff syncing pubs.
    • Workaround: After syncing all the PMIDs, we could revert the rel scheme, which will have the effect of transferring new PMIDs, but not syncing any existing ones.
