process duplicates identified by notebook 4.1 #27

Open
laureD19 opened this issue Apr 2, 2024 · 2 comments

Comments

@laureD19
Contributor

laureD19 commented Apr 2, 2024

The duplicates are mainly due to re-ingest errors of the CRF.

notebook 4.1 currently gives back:

  • Using the attribute(s) "label" as filter, there are: 26 duplicated tools, 6 duplicated publications, 2 duplicated training materials, 0 duplicated workflows, 264 duplicated datasets
  • Using the attribute(s) "accessibleAt" as filter, there are: 510 duplicated tools, 26 duplicated publications, 20 duplicated training materials, 277 duplicated datasets
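
For readers without access to notebook 4.1, here is a minimal sketch of how per-category duplicate counts of this kind could be computed. It is not the notebook's code. It assumes each item is a dict with hypothetical keys `category`, `label`, and `accessibleAt` (treated here as a single string), and it assumes that every member of a duplicated group is counted, which may differ from how the notebook counts.

```python
from collections import defaultdict


def count_duplicates(items, attribute):
    """Count, per category, the items that share their `attribute`
    value with at least one other item of the same category."""
    groups = defaultdict(list)
    for item in items:
        value = item.get(attribute)
        if value:  # skip missing or empty values
            groups[(item["category"], value)].append(item)

    counts = defaultdict(int)
    for (category, _value), members in groups.items():
        if len(members) > 1:  # only groups with a real duplicate
            counts[category] += len(members)
    return dict(counts)


# Toy data for illustration only, not taken from the real catalogue.
items = [
    {"category": "tool", "label": "Gephi", "accessibleAt": "https://example.org/a"},
    {"category": "tool", "label": "Gephi", "accessibleAt": "https://example.org/b"},
    {"category": "dataset", "label": "Corpus", "accessibleAt": "https://example.org/c"},
]

print(count_duplicates(items, "label"))         # {'tool': 2}
print(count_duplicates(items, "accessibleAt"))  # {}
```

Grouping on the pair (category, value) keeps a tool and a dataset with the same label from being reported as duplicates of each other, which matches the per-category breakdown in the counts above.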

@carikan @mkrzmr - the suggestion would be to go through notebook 4.1 together and decide how we can share the work for the merges needed

@mkrzmr
Contributor

mkrzmr commented Apr 10, 2024

dup_label.csv
dup_url.csv

I ran the same notebook and uploaded the resulting files. I suggest we divide the work and merge the items.
Q: What should we do with item sources? They might create issues with ingest.
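
One possible way to divide the work is a round-robin split of the duplicate groups, so that all items of one group go to the same person. This is only a sketch: the group keys and reviewer names below are placeholders, and the real keys would have to come from dup_label.csv and dup_url.csv, whose column layout is not shown here.

```python
from itertools import cycle


def assign_groups(groups, reviewers):
    """Assign each duplicate group to a reviewer in round-robin order."""
    assignment = {name: [] for name in reviewers}
    for group, name in zip(groups, cycle(reviewers)):
        assignment[name].append(group)
    return assignment


# Placeholder group keys (e.g. duplicated labels) and reviewer names.
groups = ["label-a", "label-b", "label-c", "label-d"]
reviewers = ["reviewer1", "reviewer2", "reviewer3"]

for name, assigned in assign_groups(groups, reviewers).items():
    print(name, assigned)
```

Splitting by group rather than by row means nobody has to coordinate on a merge that spans two people's lists.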

@laureD19
Contributor Author

If I'm not wrong, the new item created during a merge doesn't have an item source, but it keeps in its history the two (or more) merged items, including their sources.
With DACE, in case of a re-ingest, the ingest pipeline should in theory notice the difference between the first ingest and the current one, and mark the problematic item(s) for moderators to review before they are approved.
