Speed up and test sample metadata barplot computations #313

fedarko · 2020-08-08T01:38:42Z

Closes #298.

On the HMP test dataset we've been working with, this went from taking ~10 seconds to draw a sample metadata barplot layer to less than 1 second. The code to compute sample metadata group "frequencies" was modified so that, rather than working on a feature-by-feature basis, all features' frequencies are now computed in a single pass through the BIOM table. Based on discussion with @ElDeveloper about ways to speed this up :)

The code to compute the "frequency map" now also lives in the BiomTable class, where it is unit-tested.
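As a rough illustration of the single-pass idea described above, here is a minimal sketch in plain JavaScript. It is not the actual BiomTable code: the function name `getFrequencyMapSketch`, its parameters, and the choice to skip features that appear in no samples are all assumptions made for this example, and it uses built-in array methods rather than Underscore.

```javascript
// Hypothetical, simplified stand-in for the BIOM table's internals:
//   tbl[sIdx]        -> array of indices of the features present in sample sIdx
//   sm[sIdx][colIdx] -> sample sIdx's metadata value in column colIdx
//   featureIDs[fIdx] -> the ID (name) of feature fIdx
function getFrequencyMapSketch(tbl, sm, featureIDs, colIdx) {
    // Assign each unique metadata value in this column a position in the
    // per-feature count arrays.
    const uniqueVals = [];
    const val2Idx = new Map();
    sm.forEach(function (row) {
        const val = row[colIdx];
        if (!val2Idx.has(val)) {
            val2Idx.set(val, uniqueVals.length);
            uniqueVals.push(val);
        }
    });

    // counts[fIdx][valIdx]: number of samples with that value containing fIdx.
    const counts = featureIDs.map(function () {
        return new Array(uniqueVals.length).fill(0);
    });
    // sampleCt[fIdx]: total number of samples containing fIdx.
    const sampleCt = new Array(featureIDs.length).fill(0);

    // The single pass: visit each sample once and update every feature in it.
    tbl.forEach(function (presentFeatureIndices, sIdx) {
        const valIdx = val2Idx.get(sm[sIdx][colIdx]);
        presentFeatureIndices.forEach(function (fIdx) {
            counts[fIdx][valIdx]++;
            sampleCt[fIdx]++;
        });
    });

    // Convert counts to proportions, keyed by feature ID and metadata value.
    const fID2Freqs = {};
    counts.forEach(function (row, fIdx) {
        // Assumption for this sketch: features in no samples are omitted.
        if (sampleCt[fIdx] === 0) {
            return;
        }
        const freqs = {};
        row.forEach(function (count, valIdx) {
            if (count > 0) {
                freqs[uniqueVals[valIdx]] = count / sampleCt[fIdx];
            }
        });
        fID2Freqs[featureIDs[fIdx]] = freqs;
    });
    return fID2Freqs;
}
```

The work is proportional to the number of (sample, present feature) pairs, instead of re-scanning the table once per feature.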

emperor-helper · 2020-08-08T01:48:58Z

The following artifacts were built for this PR: empire-biplot.qzv, empire.qzv, empress-tree.qzv

kwcantrell

Thanks @fedarko! I just have a few minor comments.

kwcantrell · 2020-08-11T14:06:54Z

empress/support_files/js/biom-table.js

+            _.each(fIdx2counts[fIdx], function (count, smValIdx) {
+                if (count > 0) {
+                    fID2freqs[fID][uniqueSMVals[smValIdx]] =
+                        count / totalSampleCount;


We should either let the user know that the sample metadata stacked barplots display proportions instead of counts, or just display the counts (or allow the user to choose).

That's a good point -- and supporting displaying "variable-length" stacked barplots (where e.g. 1 sample has a given fixed length, so a tip present in 2 samples vs. a tip present in 20 samples will display differently) would be a cool feature to add. I've added an issue for this at #322, and updated the README to be clearer about proportions being displayed.
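To make the proportions-versus-counts distinction concrete, here is a small sketch. The helper `summarizeFeature` and its `asProportions` flag are hypothetical names invented for this example; they are not part of Empress.

```javascript
// Hypothetical helper: given per-metadata-value sample counts for one
// feature, return either the raw counts or the proportions.
function summarizeFeature(countsByValue, asProportions) {
    const total = Object.values(countsByValue).reduce(function (a, b) {
        return a + b;
    }, 0);
    const out = {};
    Object.entries(countsByValue).forEach(function (entry) {
        const val = entry[0];
        const count = entry[1];
        if (count > 0) {
            out[val] = asProportions ? count / total : count;
        }
    });
    return out;
}

// A tip present in 2 samples and a tip present in 20 samples:
const tipIn2Samples = { gut: 1, skin: 1 };
const tipIn20Samples = { gut: 10, skin: 10 };

// As proportions, both tips yield the same bar.
const propsSmall = summarizeFeature(tipIn2Samples, true);
const propsLarge = summarizeFeature(tipIn20Samples, true);

// As counts, the bars would have different lengths.
const countsSmall = summarizeFeature(tipIn2Samples, false);
const countsLarge = summarizeFeature(tipIn20Samples, false);
```

With proportions, the two tips are indistinguishable, which is why the display mode is worth stating to the user.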

kwcantrell · 2020-08-11T14:31:24Z

empress/support_files/js/biom-table.js

+        _.each(this._tbl, function (presentFeatureIndices, sIdx) {
+            // Figure out what metadata value this sample has at the column.
+            cVal = scope._sm[sIdx][colIdx];
+            cValIdx = smVal2Idx[cVal];
+            // Increment s.m. value counts for each feature present in this
+            // sample
+            _.each(presentFeatureIndices, function (fIdx) {
+                fIdx2counts[fIdx][cValIdx]++;
+                fIdx2sampleCt[fIdx]++;
+            });


Suggested change

-        _.each(this._tbl, function (presentFeatureIndices, sIdx) {
-            // Figure out what metadata value this sample has at the column.
-            cVal = scope._sm[sIdx][colIdx];
-            cValIdx = smVal2Idx[cVal];
-            // Increment s.m. value counts for each feature present in this
-            // sample
-            _.each(presentFeatureIndices, function (fIdx) {
-                fIdx2counts[fIdx][cValIdx]++;
-                fIdx2sampleCt[fIdx]++;
-            });
+        _.each(this._tbl, function (samples, sIdx) {
+            // Figure out what metadata value this sample has at the column.
+            cVal = scope._sm[sIdx][colIdx];
+            cValIdx = smVal2Idx[cVal];
+            // Increment s.m. value counts for each feature present in this
+            // sample
+            _.each(samples, function (fIdx) {
+                fIdx2counts[fIdx][cValIdx]++;
+                fIdx2sampleCt[fIdx]++;
+            });

Since _tbl is a 2D array of samples by features, this may make more sense to developers who are not too familiar with biom-table. Minor suggestion; feel free to ignore. (:

If it's ok I'd like to keep this as is -- there are a couple of other places in the BIOM table JS where presentFeatureIndices is used when iterating over the table in this way. I added some context to the comment before this loop, so hopefully that makes things clearer as a compromise :) You're totally correct, though -- we should make this code as non-intimidating as possible, since these functions are all getting used pretty frequently ...

kwcantrell · 2020-08-11T14:44:32Z

empress/support_files/js/biom-table.js

+        // Also, return an Object where the keys are feature IDs pointing to
+        // other Objects where the keys are sample metadata values, rather than
+        // a 2D array (which is how fIdx2counts has been stored)
+        var fID2freqs = {};


This is just a note: it might be less memory-expensive and more efficient to iterate through this as a 2D array, especially for large trees. Storing (even temporarily) an object that uses feature IDs as keys can be quite memory-intensive, and JS has a bit more (although not a ton of) overhead when iterating over objects compared to arrays. I would keep this the same for now, and we can come back to it if memory becomes a problem when working with large trees.

Understood and agreed. Added a TODO to the code detailing this optimization.

One weird-ish thing is that none of the Empress JS code besides the BIOM table knows anything about feature indices -- it only stores information about feature names (aka IDs), as far as I know. Moving the index stuff to empress.js / treeData might be a hassle, but I think it could save a decent amount of space -- I guess in-the-BIOM-table tip names are technically stored in two places (treeData and the BIOM table).
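For reference, the optimization discussed in this thread could look roughly like the sketch below, which walks the 2D count array directly instead of first building an Object keyed by feature ID. The function `forEachFeatureFreq` and its callback signature are assumptions made for illustration. As noted above, this only helps if the caller can work with feature indices rather than feature IDs.

```javascript
// Hypothetical sketch: visit every nonzero (feature index, metadata value)
// pair in the 2D count array without building an intermediate Object.
//   counts[fIdx][valIdx] -> number of samples with that value containing fIdx
//   sampleCt[fIdx]       -> total number of samples containing fIdx
//   uniqueVals[valIdx]   -> the metadata value at position valIdx
function forEachFeatureFreq(counts, sampleCt, uniqueVals, callback) {
    counts.forEach(function (row, fIdx) {
        const total = sampleCt[fIdx];
        // Features present in no samples have no frequencies to report.
        if (total === 0) {
            return;
        }
        row.forEach(function (count, valIdx) {
            if (count > 0) {
                callback(fIdx, uniqueVals[valIdx], count / total);
            }
        });
    });
}
```

The caller receives feature indices, so a mapping from index to tree node would need to exist outside the BIOM table for this to replace the ID-keyed Object.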

fedarko · 2020-08-11T20:30:01Z

@kwcantrell things should be ready to merge now, I think. Thanks!

fedarko · 2020-08-11T20:53:41Z

... ok, nvm, now they should be good (one space angered prettier 😅)

kwcantrell · 2020-08-13T18:39:31Z

Thanks @fedarko

fedarko added 19 commits August 6, 2020 14:51

MNT: Abstract+simplify sm barplots/getObsCountsBy

ba65a87

still need to document/test the new functions, but this is nice

STY: prettier

c0e023c

Merge branch 'master' of https://github.com/biocore/empress into sm-b…

9703416

…arplot-optimization

MNT: ++ instead of += 1 consistently in biom table

5a5479b

MNT: document+rename getObsCountsAndTotalBy()

13e0272

MNT: add sped up? freq map code

898a0b3

need to like document and test and style and make it less horrendous buuut i think this might be decently faster, will hafta test

STY: prettify

32174c0

MNT: use 2D arrays internally for freq map comp

d10c4ed

should make sm barplots even faster >:)

MNT: remove now-unused code i added earlier

b59ba6e

STY: prettify

e541686

DOC: improve documentation for freq map biocore#298

c1440e6

DOC: add extra context to sm barplot drawing func

3cb4771

STY: fix redundant variable declaration

65dfcc2

TST: test getFrequencyMap() - close biocore#298

f08638d

DOC: document getfrequencymap output a bit more

e91538a

grammar

7083d67

Add extra freqmap test

3735832

DOC: improve uniqueVal docs in sm barplot drawing

1ece062

TST: expand getFrequencyMap() tests

f685f99

kwcantrell self-requested a review August 8, 2020 03:17

kwcantrell reviewed Aug 11, 2020

fedarko and others added 3 commits August 11, 2020 11:59

Update empress/support_files/js/biom-table.js

7890a17

Co-authored-by: kwcantrell <[email protected]>

MNT: rename remaining fIdx2... vars

2b8626d

MNT: fID2freqs -> fID2Freqs

5fa827b

fedarko mentioned this pull request Aug 11, 2020

Support drawing variable-length stacked sample metadata barplots #322

fedarko added 4 commits August 11, 2020 12:23

DOC: more explicit about smb proportions in README

dc6476b

DOC: clarify presentFeatureIndices usage a bit?

3aeec27

DOC: add note re a future optimization biocore#313

d2b7d37

MNT: Simplify iteration in SM barplot drawing

80770a8

Addresses @kwcantrell comment in biocore#313

STY: prettify

472a248

kwcantrell merged commit 5bccc7f into biocore:master Aug 13, 2020
