
[HPC] Need to clarify validation requirements for pruned logs in weak-scaling #493

Closed
sparticlesteve opened this issue Jun 14, 2022 · 4 comments · Fixed by #499

Comments

@sparticlesteve
Contributor

Our rules now describe how to submit pruned logs in weak-scaling results to establish a proven scale that can be used for hyperparameter borrowing. However, they do not describe the requirements for those pruned logs to be valid.

We discussed some potential approaches in our meeting on Monday, Jun 13:

  • we make no requirements on the pruned logs
  • we require pruned logs to pass the compliance checker, i.e. from converged runs complying with benchmark rules
  • we come up with some custom requirements on the pruned logs, e.g. requiring some amount of training to be completed
@sparticlesteve
Contributor Author

My thinking is that pruned logs should be fully compliant and demonstrate a successful, converged training instance. This is a straightforward requirement that I think adheres to our intent behind result pruning in the weak-scaling submissions. Result pruning helps mitigate the effects of straggler training instances that negatively affect the measured throughput when we measure time-to-train-all. I don't recall us intending to use pruning to help mitigate hardware failures, and our use of the term "proven scale" to me implies that these results should show that the system can actually run successfully at that scale (without crashing).

I fear that allowing invalid log files in the proven scale may enable some undesired behavior. Submitters could intentionally run on "bad nodes" and submit junk log files just to allow them to run at a larger scale after the deadline (e.g. after replacing nodes).

If there is a strong group consensus to use result pruning as a way to help mitigate hardware failures, I would be supportive of relaxing requirements on the pruned log files.

Finally, I believe that without clarifying this rule, the default implication is that logs should always be considered compliant.

@coquelin77

I agree with your default interpretation, that the logs submitted should be compliant.

If we were to allow for failed runs, I think that we should specify a minimum percentage of successful runs.
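For illustration, the minimum-percentage idea above could be sketched as a simple check. Everything here (function name, log representation, the 80% threshold) is a hypothetical assumption, not part of any actual compliance checker:

```python
# Hypothetical sketch of a minimum-percentage rule for pruned runs.
# The log representation (dicts with a "status" field) and the default
# threshold are illustrative assumptions only.

def pruned_runs_acceptable(pruned_logs, min_success_fraction=0.8):
    """Return True if at least `min_success_fraction` of the pruned
    runs completed successfully (threshold chosen arbitrarily here)."""
    if not pruned_logs:
        return True  # nothing was pruned, so nothing to check
    successful = sum(1 for log in pruned_logs if log.get("status") == "success")
    return successful / len(pruned_logs) >= min_success_fraction
```

Under this sketch, 8 successful runs out of 10 pruned would pass at the 80% threshold, while 7 out of 10 would not.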

@sparticlesteve
Contributor Author

Hi @coquelin77. Thanks for your input.

For the non-pruned logs used to compute the throughput we do have the requirement that all logs are successful and that there must be at least as many as needed for the time-to-train measurement (i.e. 5 for deepcam, 10 for cosmoflow, 5 for open_catalyst). Is this what you are suggesting, or are you suggesting that we should have a requirement on the percentage of successful pruned runs?
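The non-pruned requirement described above could be sketched roughly as follows. The per-benchmark minimum counts (5 for deepcam, 10 for cosmoflow, 5 for open_catalyst) come from the comment; the function name and log representation are assumptions for illustration:

```python
# Sketch of the stated requirement on non-pruned logs: every log must
# be from a successful run, and there must be at least as many logs as
# the time-to-train measurement needs.  The dict-based log format and
# helper name are hypothetical.

MIN_SUCCESSFUL_LOGS = {
    "deepcam": 5,
    "cosmoflow": 10,
    "open_catalyst": 5,
}

def validate_throughput_logs(benchmark, logs):
    """Return True if all logs are successful and there are enough of
    them for the benchmark's time-to-train measurement."""
    required = MIN_SUCCESSFUL_LOGS[benchmark]
    successful = [log for log in logs if log.get("status") == "success"]
    return len(successful) == len(logs) and len(logs) >= required
```

For example, a cosmoflow submission with 10 successful logs would pass, while 9 logs, or 10 logs that include a failed run, would not.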

@sparticlesteve
Contributor Author

In our meeting last week, July 25, we tentatively decided to adopt the simple solution, which is to interpret our rules as requiring pruned logs to be compliant. Nobody is specifically pushing for relaxed pruned log requirements, and most pruned logs from last year were actually pruned due to slow convergence (large number of epochs).

In today's meeting we discussed it a bit again. We agreed to uphold that decision but also agreed we should watch what happens this year closely.
