9.0 vs 8.2.6 simulation run performance ( Caliper) for ModelDB models. #3283
There are two PRs in the NMODL repo that might help. The former "fixes" the aliasing issue with a sledgehammer (for certain cases) by adding
The caliper information for the above three models is too incomplete for a reliable assessment; we will need to add instrumentation for the major phases of cvode calls in NEURON. Nevertheless, the existing timings suggest two initial impressions. 1) The %times seem to scale similarly between 8.2 and 9.0 (i.e. the absolute time ratios are generally somewhat similar to the fadvance time ratio). 2) Mechanism times are generally quite a small component of the overall fadvance time. For example, here is the full caliper info for
ModelView shows that this model has (summarizing the much larger actual output)
In addition to the local variable time step, it uses
Unfortunately, event statistics do not seem to be gathered for the local variable time step method, but they can roughly be inferred from the number of initializations and interpolations of the cvode instances. This is a single-thread serial model. The above cvode statistics do give an idea of the large amount of asynchronous and random access into each of the range variable vectors via the cvode instances. Cvode instances move forward under the constraints:
The rule is always: do the next earliest thing, i.e. handle an event or take a cvode step on the instance with the least t1.
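That scheduling rule can be illustrated with a toy priority loop (a minimal sketch of the idea only, not NEURON's actual implementation; the instance objects and event tuples are hypothetical):

```python
import heapq

def lvardt_loop(instances, events, tstop):
    """Toy local-variable-step scheduler: always do the next earliest thing.
    `instances` have .t1 (time integrated to), .name, and .advance(), which
    moves .t1 forward; `events` is a list of (time, payload) tuples."""
    heapq.heapify(events)
    order = []  # record of what was done, for inspection
    while True:
        next_event_t = events[0][0] if events else float("inf")
        least = min(instances, key=lambda c: c.t1)
        if min(next_event_t, least.t1) >= tstop:
            break
        if next_event_t <= least.t1:
            # every instance has integrated to the event time: deliver it
            t, payload = heapq.heappop(events)
            order.append(("event", t, payload))
        else:
            # otherwise take a cvode step on the instance with the least t1
            least.advance()
            order.append(("step", least.t1, least.name))
    return order
```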
Turning off the local variable time step (using global variable time step), the results for the 156120 model are
9.0
Clearly this model will be 5 seconds faster with 9.0 than with 8.2 if we use cached DataHandles for the scatter/gather phase of simulation. edited: Notice that my local step explanation for the first 25 almost-zero time intervals in the scatter/gather plots makes no sense for this global variable step scatter/gather plot; at the moment it is not clear to me why those intervals are small. edited: When Cvode is initialized, f(y, ydot) is called with ydot = NULL, in which case gather_ydot does nothing.
8.2 results are
Current master results are
So the cvscatter-ptr branch resolves the performance issue. Note: ZTIME results are obtained by merging https://github.com/neuronsimulator/nrn/tree/hines/ztime-patch into the relevant branch we are considering.
I've noticed a large variation of runtimes (cumulative fadvance) when the same program is run multiple times. For this model, the local vs global time step and 8.2 vs cvscatter comparisons become
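One way to quantify that run-to-run variation is simply to repeat the run and report the spread; a generic timing harness (my own sketch, not part of nrn-modeldb-ci):

```python
import statistics
import time

def time_repeats(fn, n=5):
    """Run fn() n times and return the list of wall-clock durations."""
    durations = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - t0)
    return durations

def spread(durations):
    """Min/median/max summary; a large max/min ratio flags noisy timings."""
    return {"min": min(durations),
            "median": statistics.median(durations),
            "max": max(durations)}
```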
For the 267666 model, the call counts are slightly different between 8.2 and cvscatter-ptr even after modifying the model to use only Random123. It is not clear at the moment what the origin of this discrepancy is.
Just for future accessibility, the ChatGPT suggestion for setting PowerMode to Performance with bash is
An alternative, if installed on your machine, is
Vector.record does not seem to explain very much of the timing differences. Added timing measurement for
Attempting to understand why the 9.0 local variable time step method (with one cell) is about half as fast as the 9.0 global variable time step method. Collect the vtune data for the local variable time step with the commands
If one clicks on an item in the Summary's Top Hotspots, all the caliper items are shown on the Bottom-up page. One can then select all the TaskType/Function/CallStack rows, choose "copy rows to clipboard", and paste here. Local variable time step
Similarly,
There are a few things that make sense. Because there is only one cell, the TaskCounts are similar for local and global.
Here are the 8.2 results for 267666. No puzzle. Local variable time step
Global variable time step
The following has to be a huge clue, but I don't fully grasp its implications. The first of each pair of lines is the Bottom-up TaskType/Function/CallStack for the first dozen or so descendants of cv:advance_tn in CPUTime order for the local variable time step. The second of each pair is the corresponding global variable time step entry with its CPUTime and Instructions Retired.
It wasn't clear to me whether the relative (between local step and global step) Instructions Retired could be considered a proxy for call count. Explicitly counting the calls into
in comparison to above vtune data
On the other hand, for functions higher up in the call chain, e.g.
Hmm. Pursuing the counts into
We see:
There seems to be some kind of duplication error in the ranges of the two for loops.
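For illustration only (the index ranges here are hypothetical, not the actual NEURON loops), two for loops whose ranges overlap will visit the shared indices twice, which is the kind of duplication suspected:

```python
def visit_counts(range1, range2, n):
    """Count how many times each index in 0..n-1 is visited by two loops
    over the given (start, stop) ranges; counts > 1 reveal duplication."""
    counts = [0] * n
    for i in range(*range1):
        counts[i] += 1
    for i in range(*range2):
        counts[i] += 1
    return counts
```

For example, loops over [0, 6) and [4, 10) visit indices 4 and 5 twice.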
The origin of this performance issue is indicated by the comments in netcvode.cpp
and the comment at line 1658
The https://github.com/neuronsimulator/nrn/tree/hines/cellcontig-scatter-treemat branch exhibits performance much closer to 8.2, but it is still not quite there. E.g. 267666 lvardt using the Intel compiler
The main features of this branch are changing the default node data storage sort to contiguous cells, and setting/solving the tree matrix for each cell directly using the SoA storage (instead of via Node** access).
The branch https://github.com/neuronsimulator/nrn/tree/hines/cellcontig-treemat2 perhaps suffices (for x86_64) to bring variable step performance close enough to 8.2 performance. This branch uses direct backing store instead of Node* for cvode's special handling of zero-area nodes (nodes without capacitance). Performance results for ringtest for the sequence of branches that start with master and end with nocap are:
Note that none of these branches affect the fixed step performance and the variation can perhaps be attributed to test noise. An unresolved issue is Apple M1 results which are:
Performance data for the three ModelDB models that initiated work on this issue are:
I've redone the graphs at the top of this page using 8.2 (078a34a) and 9.0a-506-g96b78c617 hines/cellcontig-treemat2
Data gathered (for 8.2) with
Note that for these results the models with 8.2 runtime
NEURON versions 9.0a-467-gbaf9ea8db and 8.2.6
nrn-modeldb-ci version 6f28ea8
Python 3.11.8
cmake configuration (both NEURON versions)
In nrn-modeldb-ci, run with:
and also with a workdir (and 8.2.6 NEURON version) of 8.2
The runmodels commands produce
7055004 Dec 10 14:33 master.json
and
6886403 Dec 10 14:52 8.2.json
files that contain lines like

8.2.json contained 257 "fadvance..." lines and 2 "psolve..." lines.
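Extracting the per-model times from those files can be scripted; here is a rough sketch (the JSON layout and the shape of the "fadvance" lines are assumptions on my part — adjust to the real runmodels output):

```python
import json
import re

def fadvance_times(path):
    """Collect {model_id: seconds} from a runmodels-style JSON file.
    Assumes (hypothetically) that each model's record contains a line
    mentioning 'fadvance' that ends in a float number of seconds."""
    with open(path) as f:
        data = json.load(f)
    times = {}
    for model_id, record in data.items():
        for line in str(record).splitlines():
            if "fadvance" in line:
                m = re.search(r"([0-9]+(?:\.[0-9]+)?)\s*$", line)
                if m:
                    times[model_id] = float(m.group(1))
    return times
```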
Plotting the master times (y axis) and 8.2.6 times (x axis) on log and linear plots gives:
When master time is larger than 8.2 time, the point for that model is above the red line (equal times).
The three largest outliers (master larger than 8.2) for sim time > 1 second are
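Picking out such outliers can be done mechanically once per-model times are in hand (a sketch of my own; the time dictionaries are placeholders keyed by model id):

```python
def largest_outliers(t82, tmaster, n=3, min_time=1.0):
    """Return the n model ids where master is slowest relative to 8.2,
    considering only models whose 8.2 sim time exceeds min_time seconds."""
    ratios = {m: tmaster[m] / t82[m]
              for m in t82
              if m in tmaster and t82[m] > min_time}
    return sorted(ratios, key=ratios.get, reverse=True)[:n]
```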