Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DM-45873: Initial query_all_datasets implementation #1109

Merged
merged 7 commits into from
Nov 8, 2024
Merged

Conversation

dhirving
Copy link
Contributor

@dhirving dhirving commented Oct 29, 2024

Checklist

Extracted the query logic from the query-datasets CLI and used it to provide an initial implementation of query_all_datasets.

This gets the interface and unit tests in place for a future PR that will run the whole query on the server side for RemoteButler.

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes
  • (if changing dimensions.yaml) make a copy of dimensions.yaml in configs/old_dimensions

Copy link

codecov bot commented Oct 30, 2024

Codecov Report

Attention: Patch coverage is 98.70968% with 2 lines in your changes missing coverage. Please review.

Project coverage is 89.41%. Comparing base (e75d1d5) to head (c077b29).
Report is 8 commits behind head on main.

Files with missing lines Patch % Lines
python/lsst/daf/butler/script/queryDatasets.py 92.30% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1109      +/-   ##
==========================================
+ Coverage   89.37%   89.41%   +0.03%     
==========================================
  Files         362      363       +1     
  Lines       48312    48423     +111     
  Branches     5872     5879       +7     
==========================================
+ Hits        43179    43296     +117     
+ Misses       3718     3717       -1     
+ Partials     1415     1410       -5     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dhirving dhirving force-pushed the tickets/DM-45873 branch 5 times, most recently from e2f48d5 to 16374f1 Compare November 4, 2024 19:03
@dhirving dhirving marked this pull request as ready for review November 4, 2024 21:20
Copy link
Member

@timj timj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I've tested it and it does seem to work as expected. It looks like the restriction of order-by only working for a single dataset type is not part of this PR yet.

find_first=self._find_first,
name=self._dataset_type_glob,
with_dimension_records=True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following on from a Slack comment, it looks like there are other scripts using the QueryDatasets class that do need dimensions so if you want to be able to turn this off for query-datasets but turn it on for transfer-datasets then you will need to add a parameter higher up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks -- I failed to notice that QueryDatasets was shared by multiple scripts. I had removed order-by and the dimension records in the follow-up PR to this one but I'll have to put the dimension records back.

# Same as previous test, but with positive limit so no warning is
# issued.
tables = self._queryDatasets(
repo=testRepo.butler, limit=1, order_by=("visit"), collections="*", glob="*"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe test that no warning was issued (by issuing your own warning and then making sure you only got the one warning)?

Pull out the "query all datasets" logic from the query-datasets CLI command to a separate function.  In an upcoming commit this will be used to implement `Butler.query_all_datasets`.
Add a method for querying multiple dataset types simultaneously, currently hidden as `Butler._query_all_datasets`.

This implementation uses the existing logic from the query-datasets CLI for doing the search.
The dataset type is supposed to default to '*' if no dataset types are provided by the user.
Fix an issue where the limit was being issued repeatedly because the caller was modifying the results array before query_all_datasets checked its length.
Add a description of which columns are invalid and which are allowed when an order by expression is not legal.
@dhirving dhirving merged commit 1de58f9 into main Nov 8, 2024
19 checks passed
@dhirving dhirving deleted the tickets/DM-45873 branch November 8, 2024 23:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants