
DM-47350: Add walltime to resource usage tables #336

Merged (1 commit) Jan 17, 2025
Conversation

@eigerx (Contributor) commented Jan 10, 2025

No description provided.

@eigerx eigerx force-pushed the tickets/DM-47350 branch 3 times, most recently from 1fc600f to 283c1cf Compare January 10, 2025 23:36
@eigerx eigerx requested a review from yalsayyad January 11, 2025 00:10
@yalsayyad (Contributor) left a comment

The addition to the ResourceUsageSummary is missing. For example https://github.com/lsst/analysis_tools/blob/main/python/lsst/analysis/tools/tasks/gatherResourceUsage.py#L155

I'm hitting "request changes" only because I do want to take a look at the modifications to ConsolidateResourceUsageTask when you make them.

Follow the same naming pattern (e.g. "walltime_s_")

python/lsst/analysis/tools/tasks/gatherResourceUsage.py (outdated; resolved)
python/lsst/analysis/tools/tasks/gatherResourceUsage.py (outdated; resolved)
@@ -284,6 +291,8 @@ class GatherResourceUsageTask(PipelineTask):
- ``init_time``: the time spent in task construction;
- ``run_time``: the time spent executing the task's runQuantum
method.
- ``wall_time`` : the total elapsed real time time spent executing the
task.
Contributor:
Remove duplicate "time." As written, it's a little ambiguous whether wall_time is comparable to run_time (as in, it's the elapsed time spent in runQuantum) or to run_time + init_time.

Contributor Author:
Does this new description sound good? (highlighted bit above is outdated)

@@ -456,6 +468,10 @@ def _extract_quantum_timing(self, quantum_metadata):
if getattr(self.config, attr_name)
}

quantum_timing["wall_time"] = (end_wall_time - start_wall_time).sec
Contributor:
Your Task config parameter wall_time does nothing.

Contributor Author:
I think what I added

if self.config.wall_time:
            start_wall_time = Time(quantum_metadata["startUtc"].split("+")[0])
            end_wall_time = Time(quantum_metadata["endUtc"].split("+")[0])
            quantum_timing["wall_time"] = (end_wall_time - start_wall_time).sec

should make it do something?
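The computation in that snippet can be sketched standalone. This is a hypothetical illustration using the stdlib `datetime` module rather than the `astropy.time.Time` class the task actually uses; the metadata keys (`startUtc`, `endUtc`) and the offset-stripping `split("+")` come from the snippet above, while the function name is made up.

```python
from datetime import datetime


def wall_time_seconds(quantum_metadata):
    """Elapsed wall-clock seconds between the start and end timestamps.

    Hypothetical standalone sketch: the task itself uses astropy.time.Time,
    but stdlib datetime parses these ISO-format strings just as well.
    """
    # Strip the "+00:00" UTC offset before parsing, as in the snippet above.
    start = datetime.fromisoformat(quantum_metadata["startUtc"].split("+")[0])
    end = datetime.fromisoformat(quantum_metadata["endUtc"].split("+")[0])
    return (end - start).total_seconds()


metadata = {
    "startUtc": "2024-12-20T18:56:08.843913+00:00",
    "endUtc": "2024-12-20T18:56:10.843913+00:00",
}
elapsed = wall_time_seconds(metadata)  # 2.0 seconds for these timestamps
```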

tests/test_gatherResourceUsage.py (outdated; resolved)
@eigerx eigerx force-pushed the tickets/DM-47350 branch 4 times, most recently from baf7174 to 735b211 Compare January 11, 2025 06:14
@@ -455,6 +471,12 @@ def _extract_quantum_timing(self, quantum_metadata):
)
if getattr(self.config, attr_name)
}
if self.config.wall_time:
    start_wall_time = Time(quantum_metadata["startUtc"].split("+")[0])


Looking at some TaskMetadata for assembleCoadd in DM-48233 runs, I ran into a lot of incomplete quantum_metadata entries. In particular, at least one was missing a startUtc field:

{'prepUtc': '2024-12-20T18:56:08.843913+00:00', 'prepCpuTime': 5.701368117, 'prepUserTime': 4.901877, 'prepSystemTime': 0.799484, 'prepMaxResidentSetSize': 595865600, 'prepMinorPageFaults': 287064, 'prepMajorPageFaults': 0, 'prepBlockInputs': 32, 'prepBlockOutputs': 24, 'prepVoluntaryContextSwitches': 14723, 'prepInvoluntaryContextSwitches': 144, 'endUtc': '2024-12-20T18:56:08.865832+00:00', 'endCpuTime': 5.718414396, 'endUserTime': 4.9187639999999995, 'endSystemTime': 0.7996449999999999, 'endMaxResidentSetSize': 595951616, 'endMinorPageFaults': 287103, 'endMajorPageFaults': 0, 'endBlockInputs': 32, 'endBlockOutputs': 32, 'endVoluntaryContextSwitches': 14748, 'endInvoluntaryContextSwitches': 147}

So I think we need some workarounds here, as for the CPU time values, if we want to avoid the task failing on those missing entries.
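One possible shape for such a workaround, sketched here with stdlib `datetime` rather than the astropy `Time` used in the task: fall back to `prepUtc` when `startUtc` is absent, and skip the value entirely when no usable timestamps exist. The key names come from the metadata sample above; the helper name is hypothetical.

```python
from datetime import datetime


def safe_wall_time(quantum_metadata):
    """Elapsed seconds for a quantum, tolerating missing timestamp fields.

    Hypothetical sketch: fall back to prepUtc when startUtc is absent, and
    return None if no usable pair of timestamps exists at all.
    """
    start_key = "startUtc" if "startUtc" in quantum_metadata else "prepUtc"
    if start_key not in quantum_metadata or "endUtc" not in quantum_metadata:
        return None
    start = datetime.fromisoformat(quantum_metadata[start_key].split("+")[0])
    end = datetime.fromisoformat(quantum_metadata["endUtc"].split("+")[0])
    return (end - start).total_seconds()


# The incomplete assembleCoadd entry quoted above has prepUtc and endUtc
# but no startUtc, so the fallback path is taken.
incomplete = {
    "prepUtc": "2024-12-20T18:56:08.843913+00:00",
    "endUtc": "2024-12-20T18:56:08.865832+00:00",
}
elapsed = safe_wall_time(incomplete)
```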

Member:
I'm confused as to how it's possible for prep and end to exist but start to be missing...


Looking at the log files, it seems that these jobs don't have any input datasets to include in the coadd, e.g.,

lsst.ctrl.mpexec.singleQuantumExecutor 2024-12-20T18:56:08.852388Z No dataset artifact found for deepCoadd_directWarp@{instrument: 'LSSTComCam', skymap: 'lsst_cells_v1', tract: 10221, patch: 90, visit: 2024112600123, band: 'i', day_obs: 20241126, physical_filter: 'i_06'} [sc=ExposureF] (run=LSSTComCam/runs/DRP/20241101_20241211/w_2024_51/DM-48233/20241220T172328Z id=a27c6f6b-c150-4073-8efb-e5acc5ee49e4) 20 INFO singleQuantumExecutor.py /cvmfs/sw.lsst.eu/linux-x86_64/lsst_distrib/w_2024_51/conda/envs/lsst-scipipe-9.0.0-exact/share/eups/Linux64/ctrl_mpexec/ge4484fe4ba+cdc99b1de5/python/lsst/ctrl/mpexec/singleQuantumExecutor.py

So, I think we can just use prepUtc instead of startUtc.

Contributor Author:
Just switched to prepUtc in the task and the unit test. @jchiang87, I pulled in your unit test from your branch and it all looks good. Do you want your commit squashed onto the rest? I think this will be good to go pending a successful Jenkins run.


Sure, please go ahead and squash the commit.

@eigerx eigerx force-pushed the tickets/DM-47350 branch 4 times, most recently from c969edd to f6572b8 Compare January 14, 2025 20:59
@eigerx eigerx requested a review from yalsayyad January 14, 2025 21:28
@eigerx (Contributor Author) commented Jan 14, 2025

I think I addressed everything above. Successful Jenkins run: https://rubin-ci.slac.stanford.edu/blue/organizations/jenkins/stack-os-matrix/detail/stack-os-matrix/3649/pipeline

@@ -165,7 +175,6 @@ def run(self, **kwargs: Any) -> Struct:
.sort_values("task"),
memrun,
)

Contributor:
You don't have to change it, but typically we try to minimize unnecessary whitespace changes.

@@ -284,6 +297,8 @@ class GatherResourceUsageTask(PipelineTask):
- ``init_time``: the time spent in task construction;
- ``run_time``: the time spent executing the task's runQuantum
method.
- ``wall_time`` : the total elapsed real time spent executing, preparing
Contributor:
"executing, preparing and completing" is still not precise enough for a number we're going to be comparing to run_time = endCpuTime - startCpuTime very frequently from now on. It would be great to clarify once and for all here. @timj What's the most precise way you'd describe endUtc - prepUtc? It's elapsed time corresponding to prep_time + init_time + run_time, right? That would make it elapsed time in the pre-initialization step, in task construction, AND in executing the task's runQuantum method.

Contributor Author:
@timj did you have an opinion here? For now, I pushed Yusra's text above, but let me know if that seems good to you.

Member:
Sorry. Somehow I missed the message from 3 days ago.

  • prep: triggers as soon as single-quantum execution begins, but can include some checks and running updatedQuantumInputs.
  • start: means "just before runQuantum is called" (but won't happen if an error happens earlier).
  • end: triggers immediately after runQuantum (and can trigger in an error condition even if start hasn't).
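Given that breakdown, endUtc - prepUtc brackets all three phases, whereas endUtc - startUtc covers only runQuantum; that distinction is what the docstring needs to convey. A minimal illustration with made-up timestamps (stdlib `datetime`; the phase ordering follows the list above, the timestamp values are invented):

```python
from datetime import datetime

# Hypothetical timestamps for one quantum, ordered as described above:
# prep fires when single-quantum execution begins, start fires just
# before runQuantum is called, end fires immediately after runQuantum.
prep_utc = datetime.fromisoformat("2025-01-17T12:00:00.000000")
start_utc = datetime.fromisoformat("2025-01-17T12:00:03.000000")
end_utc = datetime.fromisoformat("2025-01-17T12:00:10.000000")

# endUtc - prepUtc spans preparation, task construction, and runQuantum,
# while endUtc - startUtc covers the runQuantum call alone.
wall_time = (end_utc - prep_utc).total_seconds()
run_elapsed = (end_utc - start_utc).total_seconds()
```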

@eigerx (Contributor Author) commented Jan 17, 2025:
I've added the distinction above into the docs. Thanks!

@eigerx eigerx force-pushed the tickets/DM-47350 branch 2 times, most recently from 9a338d3 to f4c391c Compare January 17, 2025 22:13
@eigerx eigerx merged commit 3711557 into main Jan 17, 2025
8 checks passed
@eigerx eigerx deleted the tickets/DM-47350 branch January 17, 2025 22:17