Dataset upload fix #46

Merged
merged 3 commits into from
Aug 15, 2022

@ryanrdoherty ryanrdoherty commented Aug 15, 2022

This PR addresses two issues:

  1. Dataset upload takes too long because we allocate only one dataset_value_id at a time from the Oracle sequence. The fix is to allocate a batch of IDs at a time; the batch size depends on an estimate of the dataset's size (or the exact count for basket- and step-sourced datasets).
  2. AnswerValue has long had two methods that load all IDs of the answer value into memory at once as a List. This is hard on memory and bad practice. One of the methods could be removed (its string processing moved elsewhere); the other was changed to return a closable Iterator<String[]>, from which callers can read one PK at a time, or concatenate everything together if they want to live dangerously.
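
For reviewers who want a picture of the batching idea in item 1, here is a minimal sketch. It is not the code in this PR: `IdBlockSource`, `BatchedIdAllocator`, and `CountingSource` are hypothetical names, and the in-memory source only stands in for whatever call actually reserves IDs from the Oracle sequence.

```java
/** Hypothetical source of ID blocks; stands in for a database sequence. */
interface IdBlockSource {
    /** Reserves count consecutive IDs and returns the first one. */
    long reserve(int count);
}

/** Hands out IDs one at a time while fetching them from the source in batches. */
final class BatchedIdAllocator {
    private final IdBlockSource source;
    private final int batchSize;
    private long next;       // next ID to hand out
    private long remaining;  // IDs left in the current block

    BatchedIdAllocator(IdBlockSource source, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.source = source;
        this.batchSize = batchSize;
    }

    long nextId() {
        if (remaining == 0) {
            // One round trip to the source covers the next batchSize IDs.
            next = source.reserve(batchSize);
            remaining = batchSize;
        }
        remaining--;
        return next++;
    }
}

/** In-memory stand-in (an assumption for illustration) that counts round trips. */
final class CountingSource implements IdBlockSource {
    long nextFree = 1;
    int roundTrips = 0;

    @Override
    public long reserve(int count) {
        roundTrips++;
        long first = nextFree;
        nextFree += count;
        return first;
    }
}
```

With a batch size of 100, handing out 250 IDs costs three calls to the source instead of 250. When the exact count is known up front, the batch size can simply be set to that count so a single call covers the whole upload.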

Note this does NOT include the change where basket and strategy results are transferred directly to datasets. Instead they are written to intermediate files (as baskets were already doing) and parsed back in by ListDatasetParser. Thus the contents are still written into the CLOB field in the datasets DB table.
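
A rough sketch of how a caller might drain a closable Iterator<String[]> into an intermediate file one PK at a time, without ever holding the full ID list in memory. The names `CloseableIterator` and `PkStreaming` are hypothetical, the in-memory wrapper stands in for a real database cursor, and the tab-delimited, one-PK-per-line layout is an assumption rather than the format ListDatasetParser actually expects.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Iterator;
import java.util.List;

/** An Iterator that also releases an underlying resource when closed. */
interface CloseableIterator<T> extends Iterator<T>, AutoCloseable {
    @Override
    void close(); // narrowed so callers need not handle a checked exception
}

final class PkStreaming {

    /** Wraps an in-memory list; real code would wrap a database cursor instead. */
    static CloseableIterator<String[]> fromRows(List<String[]> rows, Runnable onClose) {
        Iterator<String[]> it = rows.iterator();
        return new CloseableIterator<String[]>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public String[] next() { return it.next(); }
            @Override public void close() { onClose.run(); }
        };
    }

    /** Writes one tab-delimited line per PK and returns the number of rows written. */
    static int writeToFile(CloseableIterator<String[]> pks, Path file) throws IOException {
        int count = 0;
        // try-with-resources closes the writer and the iterator even on failure.
        try (CloseableIterator<String[]> rows = pks;
             BufferedWriter out = Files.newBufferedWriter(file)) {
            while (rows.hasNext()) {
                out.write(String.join("\t", rows.next()));
                out.newLine();
                count++;
            }
        }
        return count;
    }
}
```

Only one PK is held at a time, so memory use stays flat regardless of result size, and the try-with-resources block guarantees the iterator is closed even if writing fails partway through.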

@ryanrdoherty ryanrdoherty requested a review from dmgaldi August 15, 2022 03:30
@ryanrdoherty ryanrdoherty self-assigned this Aug 15, 2022
@ryanrdoherty (Member, Author) commented

Note this PR helps with VEuPathDB/ApiCommonModel#13 but does not completely solve it. We still need the SQL fix in the model XML to resolve the query performance problem.

@ryanrdoherty ryanrdoherty merged commit 755c96f into master Aug 15, 2022
@ryanrdoherty ryanrdoherty deleted the dataset-upload-fix branch August 15, 2022 15:47