Skip to content
This repository was archived by the owner on Jun 21, 2023. It is now read-only.

Mb subtypes #738

Merged
merged 33 commits into from
Aug 21, 2020
Merged

Mb subtypes #738

merged 33 commits into from
Aug 21, 2020

Conversation

komalsrathi
Copy link
Collaborator

@komalsrathi komalsrathi commented Aug 19, 2020

Purpose/implementation Section

What scientific question is your analysis addressing?

Molecular classification of Medulloblastoma samples into Group3, Group4, SHH and WNT subtypes.

What was your approach?

I have used two different methods: MM2S and medulloPackage. The input to both methods is log-transformed FPKM matrix for poly-A and stranded datasets.

What GitHub issue does your pull request address?

#731

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Feasibility of the approach, code structure and if I should remove batch correction out of the module.

Is there anything that you want to discuss further?

Should we keep batch correction or remove it? The results are same with or without batch correcting using RNA_library as batch.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

Table in comparison*.rds format containing expected subtype classification (pathology reports) and observed subtypes along with p-values (in case of medulloPackage) and scores (in case of MM2S).

What is your summary of the results?

After comparing with the expected subtypes from pathology reports, we get an accuracy of 81.25% (26/32 correctly classified) with medulloPackage and an accuracy of 78.125% (25/32 correctly classified) with MM2S.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@komalsrathi
Copy link
Collaborator Author

@jaclyn-taroni I have not added this to CI, because I wasn't sure if you first want to review. The script uses either MM2S and medulloPackage depending on what the user inputs and performs batch correction on RNA_library if asked for. There are four bash scripts corresponding to the above combination (2 method x with and w/o batch correction). Let me know if I can just add one of these to the CI?

@jharenza
Copy link
Collaborator

@jaclyn-taroni I have not added this to CI, because I wasn't sure if you first want to review. The script uses either MM2S and medulloPackage depending on what the user inputs and performs batch correction on RNA_library if asked for. There are four bash scripts corresponding to the above combination (2 method x with and w/o batch correction). Let me know if I can just add one of these to the CI?

please put all in CI! :)

@jaclyn-taroni
Copy link
Member

Hi @komalsrathi, I filed komalsrathi#1 to this branch and I am providing the context here.

CI failed because of the version of tidyselect on the project Docker image (0.2.5), which doesn't include all_of().

To verify that this was the issue, I built the Docker image from this branch and identified some issues with the logic around the batch correction that were remedied in komalsrathi@9b4d7f9.

Now the shell scripts without batch correction run successfully locally.

However, when I run the shell scripts that do include batch correction, I get the following errors (arising from classify.mb())

run-molecular-subtyping-MB-batch-correct-medullo-classifier.sh fails with:

[1] "Classify medulloblastoma subtypes..."
Error in exprs[g1, ] : subscript out of bounds
Calls: classify.mb -> <Anonymous> -> signatureGenes -> apply -> FUN
Execution halted

run-molecular-subtyping-MB-batch-correct-MM2S.sh fails with:

[1] "Classify medulloblastoma subtypes..."
'select()' returned 1:many mapping between keys and columns
Error in `.rowNamesDF<-`(x, value = value) :
  duplicate 'row.names' are not allowed
Calls: classify.mb ... row.names<- -> row.names<-.data.frame -> .rowNamesDF<-
In addition: Warning message:
non-unique value when setting 'row.names': ‘NaN’
Execution halted

I'm not sure what format the classify.mb() function wants based on the documentation.

In service of debugging that and not repeating filtering and batch correction multiple times, I'd recommend that we move the filtering step out of the script that does the classification. Some ideas around that were added in komalsrathi@adf3287.

That needs polishing/more testing, but running the following from the analyses/molecular-subtyping-MB directory worked:

Rscript --vanilla 00-filter-and-batch-correction.R \
  --batch_col RNA_library \
  --output_prefix medulloblastoma-exprs \
  --output_dir input

The analysis itself is somewhat specific–I don't expect we will need to do the batch correction for medulloblastoma only outside of this project and module, so I think hardcoding file paths for the expression files is fine and in the interest of making things simpler.

I also think it could be useful to move the comparison of all the results to a notebook or script that takes as input the expected classes and the predicted classes from each method/processing strategy combination.

The shell script to run the entire module and gets added to CI could then consist of:

  1. Filter and batch correction
  2. Classification for all combinations of batch correction status and classifier
  3. Collect all the results together

I'll say that it is unclear to me at this point if the batch correction is a necessary step - the ComBat step itself is pretty quick and the classification steps don't seem too bad, so may be fine to leave it in.

Thoughts? Happy to make some of these edits to the jaclyn-taroni:mb-subtypes branch (komalsrathi#1) if that's helpful.

@komalsrathi
Copy link
Collaborator Author

Hi @jaclyn-taroni thanks for the detailed review. I saw your PR, before merging let me double check on my end (locally) why changing all_of to select causes so many issues.

CI related edits to MB subtype steps
@komalsrathi
Copy link
Collaborator Author

komalsrathi commented Aug 20, 2020

Also, forgot to mention that batch correction has no effect on the output so maybe better left out of the analysis (to make things simpler)? This was the case before (non-log input) but not anymore. Will add all results to a notebook and update.

@jaclyn-taroni
Copy link
Member

Weird, CI failed with a data download issue. I'm going to hit rerun now!

@komalsrathi
Copy link
Collaborator Author

Checking the failed CI

Co-authored-by: Jaclyn Taroni <[email protected]>
@komalsrathi
Copy link
Collaborator Author

I suspect it is related to the test files because the batch correction step runs locally - something like the poly-A MB sample missing from the poly-A testing file (haven't checked to be sure). Some ideas:

First one is probably more robust, but will increase the run time in CI. I'll also mention that the other molecular subtyping steps are grouped together in CI near the top:

### MOLECULAR SUBTYPING ###

Moving this up will make the time between commit and getting the test result for this module shorter, which is helpful for debugging!

I moved up the module in CI. Also removed the expression matrices from gitignore. The filter+batch correction is only run once and I think the error that I was seeing was in this step. Hoping this commit will fix it.

@jaclyn-taroni
Copy link
Member

Without explicitly skipping the 00 step in CI, I expect that the same problem will present itself but we'll see shortly!

@komalsrathi
Copy link
Collaborator Author

@jaclyn-taroni I am confused why would there be an issue with filtering step? It works fine locally and it takes minimal time.


[1] "Batch correct input matrices..."
Error in `[.data.frame`(expr.input.mb, , clin.mb$Kids_First_Biospecimen_ID) : 
  undefined columns selected
Calls: as.matrix -> [ -> [.data.frame
Execution halted

@jaclyn-taroni
Copy link
Member

jaclyn-taroni commented Aug 20, 2020

Because in CI it's filtering the testing files, which are a subset of the full files to save on time, we can try the strategy in komalsrathi#2

Skip filtering and batch correction. in CI
@jaclyn-taroni
Copy link
Member

Going to resolve the conflicts - I get an error locally that does not show up in CI and I believe it happens whether or not I use the committed files or regenerate them with the 00 step. Will post more on that, but first I'm going to fix the conflicts. Then I'll rebuild the image locally and see if that changes things.

@jaclyn-taroni
Copy link
Member

Okay, I was wrong in a way that made this easier to get to the bottom of this 😅 which is ideal.

When I ran this in the Docker container using the version of sva installed (3.34.0), the ComBat output was all NaN (related: jtleek/sva-devel#14).

This is resolved when installing sva the following way:

remotes::install_github("jtleek/sva-devel", ref = "123be9b2b9fd7c7cd495fab7d7d901767964ce9e")

I suspect, then, that the ComBat-corrected file committed to the repository was generated with a more recent version of sva than is installed on the project Docker image and that's why CI doesn't fail. So I will think about/look into what is the best way to install a more recent version of sva in the context of this project's image and report back with suggestions.

@komalsrathi
Copy link
Collaborator Author

@jaclyn-taroni just for reference, the files were generated using sva_3.36.0

@jaclyn-taroni
Copy link
Member

Suggestions here: komalsrathi#3 - quoting from that

I confirmed that updating sva in this manner only changes the version of the sva package; all other installed package versions remain the same.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code for the first two steps and the shell script look good! Let's plan to get komalsrathi#3 merged so the dependencies are all set such that if someone needed to regenerate the expression files later, they'd be able to do that in the Docker container.

analyses/molecular-subtyping-MB/02-compare-classes.Rmd should be removed from this pull request and included in a new pull request so we can have a discussion about the best way to display that information and save it in a format that can be used elsewhere in the project. That notebook is missing important context about what samples are included when it reports the accuracy above the table without the code chunks in my opinion.

The README also needs to be updated to reflected the recent changes before we can get this approved and merged.


The goal of this analysis is to utilize the R packages [medulloPackage](https://github.com/d3b-center/medullo-classifier-package) and [MM2S](https://github.com/cran/MM2S) that leverage expression data from RNA-seq or array to classify the medulloblastoma samples into four subtypes i.e Group3, Group4, SHH, WNT.

### Analysis scripts
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This section of the README should be updated to reflect the recent changes.

Copy link
Collaborator Author

@komalsrathi komalsrathi Aug 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!! I will make these changes today. I have updated the above. Please let me know whenever you can.
I will also work on submitted another PR for the markdown with more details.

Copy link
Member

@jaclyn-taroni jaclyn-taroni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 looks good to me! Thanks for the updates!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants