-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DM-47350: Add walltime to resource usage tables #336
Conversation
1fc600f
to
283c1cf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The addition to the ResourceUsageSummary is missing. For example https://github.com/lsst/analysis_tools/blob/main/python/lsst/analysis/tools/tasks/gatherResourceUsage.py#L155
I'm hitting "request changes" only because I do want to take a look at the modifications to ConsolidateResourceUsageTask
when you make them.
Follow the same naming pattern (i.e. "walltime_s_")
@@ -284,6 +291,8 @@ class GatherResourceUsageTask(PipelineTask): | |||
- ``init_time``: the time spent in task construction; | |||
- ``run_time``: the time spent executing the task's runQuantum | |||
method. | |||
- ``wall_time`` : the total elapsed real time time spent executing the | |||
task. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove duplicate "time." As written, it's a little ambiguous whether wall_time is comparable to run_time (as in its the elapsed time of spent in runQuantum) or both run_time + init_time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this new description sound good? (highlighted bit above is outdated)
@@ -456,6 +468,10 @@ def _extract_quantum_timing(self, quantum_metadata): | |||
if getattr(self.config, attr_name) | |||
} | |||
|
|||
quantum_timing["wall_time"] = (end_wall_time - start_wall_time).sec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your Task config parameter wall_time
does nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what I added
if self.config.wall_time:
start_wall_time = Time(quantum_metadata["startUtc"].split("+")[0])
end_wall_time = Time(quantum_metadata["endUtc"].split("+")[0])
quantum_timing["wall_time"] = (end_wall_time - start_wall_time).sec
should make it do something?
baf7174
to
735b211
Compare
@@ -455,6 +471,12 @@ def _extract_quantum_timing(self, quantum_metadata): | |||
) | |||
if getattr(self.config, attr_name) | |||
} | |||
if self.config.wall_time: | |||
start_wall_time = Time(quantum_metadata["startUtc"].split("+")[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at some TaskMetadata
for assembleCoadd
in DM-48233 runs, I ran into a lot of incomplete quantum_metadata
entries. In particular, at least one was missing a startUtc
field:
{'prepUtc': '2024-12-20T18:56:08.843913+00:00', 'prepCpuTime': 5.701368117, 'prepUserTime': 4.901877, 'prepSystemTime': 0.799484, 'prepMaxResidentSetSize': 595865600, 'prepMinorPageFaults': 287064, 'prepMajorPageFaults': 0, 'prepBlockInputs': 32, 'prepBlockOutputs': 24, 'prepVoluntaryContextSwitches': 14723, 'prepInvoluntaryContextSwitches': 144, 'endUtc': '2024-12-20T18:56:08.865832+00:00', 'endCpuTime': 5.718414396, 'endUserTime': 4.9187639999999995, 'endSystemTime': 0.7996449999999999, 'endMaxResidentSetSize': 595951616, 'endMinorPageFaults': 287103, 'endMajorPageFaults': 0, 'endBlockInputs': 32, 'endBlockOutputs': 32, 'endVoluntaryContextSwitches': 14748, 'endInvoluntaryContextSwitches': 147}
So I think we need some workarounds here as for the cpu time values if we want to avoid the task failing on those missing entries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm confused as to how it's possible for prep
and end
to exist but start
to be missing...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the log files, it seems that these jobs don't have any input datasets to include in the coadd, e.g.,
lsst.ctrl.mpexec.singleQuantumExecutor 2024-12-20T18:56:08.852388Z No dataset artifact found for deepCoadd_directWarp@{instrument: 'LSSTComCam', skymap: 'lsst_cells_v1', tract: 10221, patch: 90, visit: 2024112600123, band: 'i', day_obs: 20241126, physical_filter: 'i_06'} [sc=ExposureF] (run=LSSTComCam/runs/DRP/20241101_20241211/w_2024_51/DM-48233/20241220T172328Z id=a27c6f6b-c150-4073-8efb-e5acc5ee49e4) 20 INFO singleQuantumExecutor.py /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2024_51/conda/envs/lsst-scipipe-9.0.0-exact/share/eups/Linux64/ctrl_mpexec/ge4484fe4ba+cdc99b1de5/python/lsst/ctrl/mpexec/singleQuantumExecutor.py
So, I think we can just use prepUtc
instead of startUtc
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just switched to prepUtc
in the task and the unit test. @jchiang87, I pulled in your unit test from your branch and it all looks good. Do you want your commit squashed onto the rest? I think this will be good to go pending a successful Jenkins run.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, please go ahead and squash the commit.
c969edd
to
f6572b8
Compare
I think I addressed everything above. Successful Jenkins run: https://rubin-ci.slac.stanford.edu/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/3649/pipeline |
@@ -165,7 +175,6 @@ def run(self, **kwargs: Any) -> Struct: | |||
.sort_values("task"), | |||
memrun, | |||
) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't have to change, but typically we try to minimize the unnecessary white space changes.
@@ -284,6 +297,8 @@ class GatherResourceUsageTask(PipelineTask): | |||
- ``init_time``: the time spent in task construction; | |||
- ``run_time``: the time spent executing the task's runQuantum | |||
method. | |||
- ``wall_time`` : the total elapsed real time spent executing, preparing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"executing, preparing and completing" is still not precise enough for a number we're going to comparing to run_time = endCpuTime
- startCpuTime
very frequently from now on. Would be great to clarify once and for all here. @timj What's the most precise way you'd describe endUtc
- prepUtc
? It's elapsed time corresponding to prep_time + init_time + run_time, right? That would makes it elapsed time in the pre-initialization step, in task construction, AND in executing the task's runQuantum method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timj did you have an opinion here? For now, I pushed Yusra's text above, but let me know if that seems good to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. Somehow I missed the message from 3 days ago.
- prep: triggers as soon as single quantum execution is begun but can include some checks and running updatedQuantumInputs.
- start means "just before runQuantum is called" (but won't happen if an error happens earlier)
- end: triggers immediately after runQuantum (and can trigger in error condition even if start hasn't)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added the distinction above into the docs. Thanks!
9a338d3
to
f4c391c
Compare
f4c391c
to
8ef5a90
Compare
No description provided.