-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up and test sample metadata barplot computations #313
Conversation
still need to document/test the new functions, but this is nice
…arplot-optimization
need to like document and test and style and make it less horrendous buuut i think this might be decently faster, will hafta test
should make sm barplots even faster >:)
The following artifacts were built for this PR: empire-biplot.qzv, empire.qzv, empress-tree.qzv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fedarko! I just have a few minor comments.
_.each(fIdx2counts[fIdx], function (count, smValIdx) { | ||
if (count > 0) { | ||
fID2freqs[fID][uniqueSMVals[smValIdx]] = | ||
count / totalSampleCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should either let the user know that the sample stack bar plots are displaying proportions instead of counts or we should just display the counts (or all the user to choose).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point -- and supporting displaying "variable-length" stacked barplots (where e.g. 1 sample has a given fixed length, so a tip present in 2 samples vs. a tip present in 20 samples will display differently) would be a cool feature to add. I've added an issue for this at #322, and updated the README to be clearer about proportions being displayed.
_.each(this._tbl, function (presentFeatureIndices, sIdx) { | ||
// Figure out what metadata value this sample has at the column. | ||
cVal = scope._sm[sIdx][colIdx]; | ||
cValIdx = smVal2Idx[cVal]; | ||
// Increment s.m. value counts for each feature present in this | ||
// sample | ||
_.each(presentFeatureIndices, function (fIdx) { | ||
fIdx2counts[fIdx][cValIdx]++; | ||
fIdx2sampleCt[fIdx]++; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_.each(this._tbl, function (presentFeatureIndices, sIdx) { | |
// Figure out what metadata value this sample has at the column. | |
cVal = scope._sm[sIdx][colIdx]; | |
cValIdx = smVal2Idx[cVal]; | |
// Increment s.m. value counts for each feature present in this | |
// sample | |
_.each(presentFeatureIndices, function (fIdx) { | |
fIdx2counts[fIdx][cValIdx]++; | |
fIdx2sampleCt[fIdx]++; | |
}); | |
_.each(this._tbl, function (samples, sIdx) { | |
// Figure out what metadata value this sample has at the column. | |
cVal = scope._sm[sIdx][colIdx]; | |
cValIdx = smVal2Idx[cVal]; | |
// Increment s.m. value counts for each feature present in this | |
// sample | |
_.each(samples, function (fIdx) { | |
fIdx2counts[fIdx][cValIdx]++; | |
fIdx2sampleCt[fIdx]++; | |
}); |
Since _tbl is a 2D array of samples by features, this may make more sense to developers who are not to familiar with biom-table. Minor suggestion feel free to ignore. (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's ok I'd like to keep this as is -- there are a couple of other places in the BIOM table JS where presentFeatureIndices
is used when iterating over the table in this way. I added some context to the comment before this loop, so hopefully that makes things clearer as a compromise :) You're totally correct, though -- we should make this code as non-intimidating as possible, since these functions are all getting used pretty frequently ...
// Also, return an Object where the keys are feature IDs pointing to | ||
// other Objects where the keys are sample metadata values, rather than | ||
// a 2D array (which is how fIdx2counts has been stored) | ||
var fID2freqs = {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just a note: It might be less memory expensive/more efficient to iterate through it this was 2D array especially for large trees. Storing (even temporarily) an object that uses fId's as keys can be quite memory intensive + js has a bit more (although not a ton) of overhead compared to arrays when iterating. I would keep this the same for now and we can come back to this if memory becomes a problem when working with large trees.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood and agreed. Added a TODO to the code detailing this optimization.
One weird-ish thing is that none of the Empress JS code besides the BIOM table knows anything about feature indices -- it only stores information about feature names (aka IDs), as far as I know. Moving the index stuff to empress.js / treeData might be a hassle, but I think it could save a decent amount of space -- I guess in-the-BIOM-table tip names are technically stored in two places (treeData and the BIOM table).
@kwcantrell things should be ready to merge now, I think. Thanks! |
... ok, nvm, now they should be good (one space angered prettier 😅) |
Thanks @fedarko |
Closes #298.
On the HMP test dataset we've been working with, this went from taking ~10 seconds to draw a sample metadata barplot layer to less than 1 second. The code to compute sample metadata group "frequencies" was modified so that, rather than working on a feature-by-feature basis, all features' frequencies are now computed in a single pass through the BIOM table. Based on discussion with @ElDeveloper about ways to speed this up :)
The code to compute the "frequency map" is now stored in the
BiomTable
class and unit-tested, also.