Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up and test sample metadata barplot computations #313

Merged
merged 27 commits into from
Aug 13, 2020

Conversation

fedarko
Copy link
Collaborator

@fedarko fedarko commented Aug 8, 2020

Closes #298.

On the HMP test dataset we've been working with, this went from taking ~10 seconds to draw a sample metadata barplot layer to less than 1 second. The code to compute sample metadata group "frequencies" was modified so that, rather than working on a feature-by-feature basis, all features' frequencies are now computed in a single pass through the BIOM table. Based on discussion with @ElDeveloper about ways to speed this up :)

The code to compute the "frequency map" is now stored in the BiomTable class and unit-tested, also.

@emperor-helper
Copy link

The following artifacts were built for this PR: empire-biplot.qzv, empire.qzv, empress-tree.qzv

@kwcantrell kwcantrell self-requested a review August 8, 2020 03:17
Copy link
Collaborator

@kwcantrell kwcantrell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fedarko! I just have a few minor comments.

empress/support_files/js/biom-table.js Outdated Show resolved Hide resolved
_.each(fIdx2counts[fIdx], function (count, smValIdx) {
if (count > 0) {
fID2freqs[fID][uniqueSMVals[smValIdx]] =
count / totalSampleCount;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should either let the user know that the sample stack bar plots are displaying proportions instead of counts or we should just display the counts (or all the user to choose).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point -- and supporting displaying "variable-length" stacked barplots (where e.g. 1 sample has a given fixed length, so a tip present in 2 samples vs. a tip present in 20 samples will display differently) would be a cool feature to add. I've added an issue for this at #322, and updated the README to be clearer about proportions being displayed.

Comment on lines 541 to 550
_.each(this._tbl, function (presentFeatureIndices, sIdx) {
// Figure out what metadata value this sample has at the column.
cVal = scope._sm[sIdx][colIdx];
cValIdx = smVal2Idx[cVal];
// Increment s.m. value counts for each feature present in this
// sample
_.each(presentFeatureIndices, function (fIdx) {
fIdx2counts[fIdx][cValIdx]++;
fIdx2sampleCt[fIdx]++;
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_.each(this._tbl, function (presentFeatureIndices, sIdx) {
// Figure out what metadata value this sample has at the column.
cVal = scope._sm[sIdx][colIdx];
cValIdx = smVal2Idx[cVal];
// Increment s.m. value counts for each feature present in this
// sample
_.each(presentFeatureIndices, function (fIdx) {
fIdx2counts[fIdx][cValIdx]++;
fIdx2sampleCt[fIdx]++;
});
_.each(this._tbl, function (samples, sIdx) {
// Figure out what metadata value this sample has at the column.
cVal = scope._sm[sIdx][colIdx];
cValIdx = smVal2Idx[cVal];
// Increment s.m. value counts for each feature present in this
// sample
_.each(samples, function (fIdx) {
fIdx2counts[fIdx][cValIdx]++;
fIdx2sampleCt[fIdx]++;
});

Since _tbl is a 2D array of samples by features, this may make more sense to developers who are not to familiar with biom-table. Minor suggestion feel free to ignore. (:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's ok I'd like to keep this as is -- there are a couple of other places in the BIOM table JS where presentFeatureIndices is used when iterating over the table in this way. I added some context to the comment before this loop, so hopefully that makes things clearer as a compromise :) You're totally correct, though -- we should make this code as non-intimidating as possible, since these functions are all getting used pretty frequently ...

// Also, return an Object where the keys are feature IDs pointing to
// other Objects where the keys are sample metadata values, rather than
// a 2D array (which is how fIdx2counts has been stored)
var fID2freqs = {};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a note: It might be less memory expensive/more efficient to iterate through it this was 2D array especially for large trees. Storing (even temporarily) an object that uses fId's as keys can be quite memory intensive + js has a bit more (although not a ton) of overhead compared to arrays when iterating. I would keep this the same for now and we can come back to this if memory becomes a problem when working with large trees.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood and agreed. Added a TODO to the code detailing this optimization.

One weird-ish thing is that none of the Empress JS code besides the BIOM table knows anything about feature indices -- it only stores information about feature names (aka IDs), as far as I know. Moving the index stuff to empress.js / treeData might be a hassle, but I think it could save a decent amount of space -- I guess in-the-BIOM-table tip names are technically stored in two places (treeData and the BIOM table).

empress/support_files/js/empress.js Outdated Show resolved Hide resolved
@fedarko
Copy link
Collaborator Author

fedarko commented Aug 11, 2020

@kwcantrell things should be ready to merge now, I think. Thanks!

@fedarko
Copy link
Collaborator Author

fedarko commented Aug 11, 2020

... ok, nvm, now they should be good (one space angered prettier 😅)

@kwcantrell
Copy link
Collaborator

Thanks @fedarko

@kwcantrell kwcantrell merged commit 5bccc7f into biocore:master Aug 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimize sample metadata barplot computation
3 participants