DM-41966: Add Butler.transfer_dimension_records_from API #921
92ec7c9 to 564e43e
# have to be scanned.
continue

if not can_query:
Does it make sense to have a flag to not try to copy the derived records? This will break if a quantum graph is used as the source butler, but I think that's fine: we shouldn't be enabling records transfer from the graph back to the primary butler, since the primary butler has all the records by definition.
I think there's also a use case for making a new repo from a QG, but I'd prefer to explicitly add that (and control what the interface for it is) rather than accidentally make it work one way and then have to continue to support that way. So I don't think we need that flag for that reason.
I am a bit more worried about cases where somebody intentionally does not want to transfer something like `detector` because they'd rather assert that someone has done `butler register-instrument` correctly on the destination repo, but I think that's an argument for being able to control the elements being transferred, as per a previous PR comment.
I have just realized that the `butler transfer-from-graph` command lets you specify whether to copy dimension records or not. This now breaks things because of the `populated_by` follow-up query. We seem to have the following options:
- If we can't query the source butler, we issue a warning and transfer what we have.
- If we can't query the source butler, we query the target butler, and if those `populated_by` records are found we return without complaint.
- In the future, we add the related records to the graph and add querying of records to the Graph Butler.
- We remove the transfer dimensions option from the `transfer-from-graph` command (or change the default to False and, for now, raise if True). The default likely should be True anyhow, since in all our cases we are transferring back to a butler that created the graph.
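For illustration only, the first two options in the list above could look roughly like the sketch below. Plain dictionaries and callables stand in for the butlers; `collect_records`, `source_query`, and `target_has` are hypothetical names invented for this example and are not part of `daf_butler`.

```python
import logging

_LOG = logging.getLogger(__name__)


def collect_records(primary, related_elements, source_query, target_has):
    """Gather records to transfer, tolerating a source that cannot be queried.

    primary : dict mapping element name -> set of record keys already in hand.
    related_elements : names of elements to follow up on (the populated_by ones).
    source_query : callable(element_name) -> set of record keys, or None when
        the source (for example a graph-backed butler) cannot be queried.
    target_has : callable(element_name) -> bool, True if the target already
        holds the related records.
    All of these are hypothetical stand-ins, not real butler interfaces.
    """
    result = {name: set(keys) for name, keys in primary.items()}
    for element in related_elements:
        if source_query is not None:
            # Normal path: the source can answer the follow-up query.
            result[element] = set(source_query(element))
        elif target_has(element):
            # Second option: the target already has the records, so stay quiet.
            continue
        else:
            # First option: warn and transfer only what we already have.
            _LOG.warning("Cannot query source for %s records; skipping.", element)
    return result
```

With `source_query=None` and a target that lacks the related records, the function returns only the primary records and logs a warning, which is the "transfer what we have" behavior.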
Codecov Report
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #921   +/-   ##
=======================================
  Coverage   87.50%   87.50%
=======================================
  Files         292      292
  Lines       38067    38124   +57
  Branches     8062     8081   +19
=======================================
+ Hits        33310    33362   +52
- Misses       3553     3554    +1
- Partials     1204     1208    +4
source_refs : iterable of `DatasetRef`
    Datasets defined in the source butler whose dimension records
    should be transferred to this butler. In most circumstances,
    transfer is faster if the dataset refs are expanded.
Might be better to make this an iterable of `DataCoordinate`, since that's what's actually holding the records and you can get that from `Iterable[DatasetRef]` with `(ref.dataId for ref in source_refs)`.
And even then I think that's a bit strange for a method with this name; it sounds like it should be taking a bunch of dimension records and transferring those (as well as other less-obvious associated records). But of course that's not actually the interface we need right here, so maybe this is just a naming problem.
One option might be to make this a private method, if the goal is really to support transferring datasets. On the DMTN-249 prototype I wrote a little helper class that I hoped to use to unify the `transfer-from` and `import_`/`export` interfaces, by providing an abstraction over "a bunch of stuff you want to transfer self-consistently". I think we might need that in the transfer APIs to avoid a bunch of methods with names like `transfer_dimension_records_from_given_dataset_refs`.
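To make the `(ref.dataId for ref in source_refs)` suggestion concrete, here is a small self-contained sketch. `FakeDataId` and `FakeRef` are hypothetical stand-ins for `DataCoordinate` and `DatasetRef` (only the `dataId` attribute is modeled), and `unique_data_ids` is a name made up for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeDataId:
    """Hypothetical stand-in for DataCoordinate."""

    instrument: str
    exposure: int


@dataclass(frozen=True)
class FakeRef:
    """Hypothetical stand-in for DatasetRef; only dataId is modeled."""

    dataset_type: str
    dataId: FakeDataId


def unique_data_ids(source_refs):
    # Several refs can share one data ID, so collect them into a set
    # to avoid looking up the same records more than once.
    return set(ref.dataId for ref in source_refs)


refs = [
    FakeRef("raw", FakeDataId("HypotheticalCam", 42)),
    FakeRef("calexp", FakeDataId("HypotheticalCam", 42)),
    FakeRef("raw", FakeDataId("HypotheticalCam", 43)),
]
data_ids = unique_data_ids(refs)
```

Two of the three refs here share a data ID, so `data_ids` holds two entries. This shows why data IDs are sufficient for locating records even when the public interface accepts refs.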
This needs to be a public API because the embargo transfers need to be able to call it before they transfer the raws from embargo to public. They can't use `Butler.transfer_from()` for raws because raws are relocated to public storage outside of butler before being ingested again (but without having to go through calling ingest-raws again, since they already have refs from the embargo repo and are building up the `FileDataset` objects). That's also why refs are the interface and not dataIds.
Ok. I'm not thrilled with it, but if it's got a clear use case, go ahead, since the kind of generalization I want is even more design work (and I'd probably want to be even more cautious about releasing that half-baked), so we can cross the bridge of replacing this if and when we come to it.
) -> dict[DimensionElement, dict[DataCoordinate, DimensionRecord]]:
    primary_records = self._extract_dimension_records_from_data_ids(
        source_butler, data_ids, allowed_elements
    )
There's an assumption here that if the destination butler already has records for some of these elements, `insertDimensionData(..., skip_existing=True)` is both efficient and correct as a way to resolve any conflicts. That's a reasonable enough assumption for it to be the default, but we might want to provide more control for advanced users, especially if that could avoid queries against the source butler.
Ok. For the populated_by records, wouldn't we have to query the target butler to see if they existed already?
I agree it's less important and trickier to let the user control those. If we expressed the user control as an "opt out" list of elements rather than an opt-in list we could probably still trust it, though.
So you want a `skip_elements: list[str] | None = None` parameter to be added to `butler.transfer_from` and `butler.transfer_dimension_records_from` so that people could say "no detector/instrument/physical_filter or no visit_detector_region" or something. I agree that if detector/instrument/physical_filter are not present in the target repo then you probably do want to run register-instrument first, although the transfer being told to skip them wouldn't explain to people why the transfer failed in this case.
Yes, that's about what I was thinking vaguely of. I don't care deeply about adding it on this ticket if you just want to get somebody else unblocked right now.
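As a rough sketch of the opt-out idea discussed in this thread, the filtering itself could be as small as the function below. `filter_elements` is a hypothetical helper written for this example; the `skip_elements` parameter was only proposed in the comments above and is not claimed to exist in `daf_butler`.

```python
def filter_elements(candidate_elements, skip_elements=None):
    """Drop any element names the caller asked to skip.

    candidate_elements : iterable of element names that would be transferred.
    skip_elements : optional list of element names to leave out (opt-out).
    Both the function and the parameter are hypothetical illustrations.
    """
    skip = set(skip_elements or ())
    # Preserve the original ordering of the candidates.
    return [name for name in candidate_elements if name not in skip]


kept = filter_elements(
    ["instrument", "detector", "physical_filter", "exposure"],
    skip_elements=["instrument", "detector", "physical_filter"],
)
```

Here only `exposure` remains in `kept`. An opt-out list keeps the default behavior (transfer everything) unchanged when the argument is omitted.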
records = source_butler.registry.queryDimensionRecords(  # type: ignore
    element.name, **data_id.mapping  # type: ignore
)
I think this is probably as efficient as we can make it now, but it will be really helpful when we can upload tables of data IDs to the query methods and join against them, both for efficiency and for simplifying the logic here (which is effectively a bunch of table joins written in Python). Right now I think we'd better hope this almost never gets called.
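To illustrate what "table joins written in Python" means in general terms, the sketch below performs a simple hash join between a list of data IDs and a list of records, both represented as plain dictionaries. It is a generic illustration, not the code in this PR; `join_records` and the sample field names are assumptions made for the example.

```python
from collections import defaultdict


def join_records(data_ids, records, keys):
    """Return the records whose key columns match any of the data IDs.

    This is a hash join: index the records once by their key columns,
    then probe the index with each data ID.
    """
    index = defaultdict(list)
    for rec in records:
        index[tuple(rec[k] for k in keys)].append(rec)
    matched = []
    for data_id in data_ids:
        # .get() avoids creating empty entries for data IDs with no match.
        matched.extend(index.get(tuple(data_id[k] for k in keys), []))
    return matched


# Hypothetical sample rows; the field names are made up for this example.
sample_data_ids = [{"instrument": "Cam", "visit": 1}]
sample_records = [
    {"instrument": "Cam", "visit": 1, "exposure": 10},
    {"instrument": "Cam", "visit": 1, "exposure": 11},
    {"instrument": "Cam", "visit": 2, "exposure": 20},
]
joined = join_records(sample_data_ids, sample_records, ["instrument", "visit"])
```

A database would do the same work in a single query if the data IDs could be uploaded as a table, which is the improvement the comment above is hoping for.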
bbb15f3 to 0f081df
This is clearer than trying to raise the same exception from itself.
Also copies related dimensions populated by the original set. Butler.transfer_from now uses a part of this API.
This makes sure that exposure is inserted before visit and before visit_definition.
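One generic way to obtain an insertion order like the one described in the commit message above is a topological sort over element dependencies. The sketch below uses the standard-library `graphlib`; the `DEPENDS_ON` map is an assumption written for this example and does not claim to reflect how the PR or the dimension universe actually encodes these relationships.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each element maps to the set of elements
# whose records must be inserted before it.
DEPENDS_ON = {
    "exposure": set(),
    "visit": {"exposure"},
    "visit_definition": {"exposure", "visit"},
}


def insertion_order(depends_on):
    # static_order() yields every node after all of its predecessors.
    return list(TopologicalSorter(depends_on).static_order())


order = insertion_order(DEPENDS_ON)
```

With this map, `order` places `exposure` before `visit` and `visit` before `visit_definition`.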
0f081df to 31fc763
Uses populated_by field to find other records to pull along.
Checklist
doc/changes