This repository has been archived by the owner on Oct 2, 2024. It is now read-only.

Create corplots-PR-v2 #33

Merged on Oct 6, 2023 (98 commits)

@AntoniaChroni (Contributor) commented Sep 11, 2023

Purpose/implementation Section

What scientific question is your analysis addressing?

This script creates correlation plots (corplots) for patient cases with multiple biospecimen samples and matched longitudinal samples.

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Please review for script logic.

Is there anything that you want to discuss further?

The input file maf_autopsy.tsv is generated by "tmb-vaf-preprocess data (2/N)" (#16) and is placed in ../../scratch. Please run that script first to ensure that you have all the necessary files.

This PR is based on the previous #20, which was closed by accident.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

139 plots are generated, one for every combination of Kids_First_Biospecimen_ID/tumor_descriptor/Kids_First_Participant_ID, and are placed in ../../plots.
Please check that you get the same number of plots. To browse the results more easily, please see the HTML report.
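
To help check the plot count, here is a minimal sketch of how the pairs could be enumerated. It is written in Python for illustration only (the module itself appears to use R), and it assumes that a plot is drawn for each pair of biospecimens from the same participant with different tumor descriptors. The function name `corplot_names` and the sample IDs are hypothetical; the file-name pattern follows the plot name quoted later in this conversation.

```python
from itertools import combinations


def corplot_names(samples):
    """Build one plot file name per cross-descriptor biospecimen pair.

    samples: iterable of (participant_id, tumor_descriptor, biospecimen_id).
    Assumption: pairs within the same tumor descriptor are not plotted.
    """
    by_participant = {}
    for participant, descriptor, biospecimen in samples:
        by_participant.setdefault(participant, []).append((descriptor, biospecimen))

    names = []
    for participant, specimens in sorted(by_participant.items()):
        # combinations() keeps input order, so the first-listed sample is on the left.
        for (d1, b1), (d2, b2) in combinations(specimens, 2):
            if d1 == d2:
                continue  # same descriptor: skipped under the stated assumption
            names.append(f"{participant}-{d1}_{b1}-vs-{d2}_{b2}-vaf-corplot.pdf")
    return names


# Made-up IDs, for illustration only.
example = [
    ("PT_A", "Diagnosis", "BS_1"),
    ("PT_A", "Deceased", "BS_2"),
    ("PT_A", "Deceased", "BS_3"),
    ("PT_B", "Diagnosis", "BS_4"),
]
print(len(corplot_names(example)))
```

Comparing `len(corplot_names(...))` on the real sample table against the number of files in ../../plots would be one way to confirm the expected count, provided the pairing assumption matches what the script does.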

What is your summary of the results?

There are 29 autopsy samples out of the total 119 patient cases (with genomic assays) with maf information. There are 117 (out of the 118) patient samples with TMB information. There are 44 biospecimen samples missing from both TMB and VAF files.

  • Genes shown in the plots are based on the oncoprint goi list from OpenPedCan.
  • Multiple plots are generated based on the number of biospecimen samples per tumor descriptor.
  • Be aware of differences between plots that share a biospecimen sample: a gene might show as tumor descriptor-specific in one plot and as common in another.
  • Multiple biospecimen samples per tumor descriptor capture heterogeneity. We should consider including all biospecimen samples per tumor descriptor and merging that information into one.
  • Deceased samples have higher VAFs overall compared to their counterparts in other timepoints.
  • Biospecimen samples are from different tumor locations. We will obtain that information after the Nautilus harmonization (primary_site column in histologies). Once it is available, we can perform spatial heterogeneity analysis (ticket #19).
  • "PT_3CHB9PK5", "PT_6N825561": These samples are hyper-mutant compared to the rest of the samples (VAF corplot).

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

antoniachroni and others added 30 commits August 3, 2023 11:38
@AntoniaChroni AntoniaChroni removed the request for review from rjcorb September 20, 2023 01:25
@AntoniaChroni AntoniaChroni marked this pull request as draft September 20, 2023 01:25
@AntoniaChroni (Contributor, Author) commented:

@rjcorb The following is to address the question asked about intercepts in #20.

"The intercepts here were used to differentiate clonal from subclonal gene mutations. So, it is okay if the common mutations cluster close to a specific timepoint."

However, after discussing with @jharenza, we decided to also address the following:

  1. Increase the intercepts to reflect literature values. See 8f8c794
  2. Decrease the circle size in the corplots and move the gene labels. See 1a8d347
  3. Remove the subtitle from plots: @jharenza I ended up keeping this because it makes it easier to assess the plots in the HTML report. If you think that it doesn't add value, I can remove it.
  4. Identify cases with multiple diagnoses and take them into account when plotting. See 26c4bc6 and 577bab8, respectively. In the current dataset, there is one patient case (PT_1H2REHT2) with a secondary diagnosis in one of the progressive samples, whereas the deceased sample has only a primary diagnosis. I found it simpler to code this by creating a cg_sum column to summarize the cancer groups. But let me know if there is another way to go about this. Plots are now saved by cg_sum.
  5. Save a table containing the genes, and the genes in common, for each pair:
    • Added a create_corplot_melt function that reshapes the df so the table format is the same across samples, which makes it easier to save
    • Made a list of dfs and saved them all in one table. See 876a6f8
  6. I recommend that reviewers use the EC2 instance to run the shell script to reduce computation time (as suggested by @naqvia).
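
For reviewers who want to reason about the intercepts, here is a minimal sketch of the idea in Python (for illustration only; it is not the code in this PR, which appears to be written in R). It assumes a single VAF cutoff separates clonal from subclonal mutations and that a VAF of 0 means the mutation is absent from that sample. The cutoff value, the `PairedVaf` type, and the function names are all hypothetical; the PR does not state the intercept values it uses.

```python
from dataclasses import dataclass

# Hypothetical cutoff; replace with the literature-based intercept used in the module.
CLONAL_VAF_CUTOFF = 0.25


@dataclass(frozen=True)
class PairedVaf:
    gene: str
    vaf_x: float  # VAF in the first sample of the pair (0.0 if absent)
    vaf_y: float  # VAF in the second sample of the pair (0.0 if absent)


def presence(m: PairedVaf) -> str:
    """Say whether a mutation is common to both samples or specific to one."""
    if m.vaf_x > 0 and m.vaf_y > 0:
        return "common"
    if m.vaf_x > 0:
        return "x_only"
    if m.vaf_y > 0:
        return "y_only"
    return "absent"


def clonality(vaf: float, cutoff: float = CLONAL_VAF_CUTOFF) -> str:
    """Call a mutation clonal when its VAF reaches the cutoff (the intercept)."""
    if vaf <= 0:
        return "absent"
    return "clonal" if vaf >= cutoff else "subclonal"


def classify(m: PairedVaf, cutoff: float = CLONAL_VAF_CUTOFF) -> dict:
    """Combine presence and per-sample clonality for one gene mutation."""
    return {
        "gene": m.gene,
        "presence": presence(m),
        "clonality_x": clonality(m.vaf_x, cutoff),
        "clonality_y": clonality(m.vaf_y, cutoff),
    }


# Made-up values, for illustration only.
print(classify(PairedVaf("GENE_A", vaf_x=0.60, vaf_y=0.10)))
```

Under these assumptions, a common mutation can be clonal at one timepoint and subclonal at the other, which is consistent with common mutations clustering close to one axis of the plot.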

@AntoniaChroni AntoniaChroni marked this pull request as ready for review September 20, 2023 02:35
@jharenza jharenza self-requested a review October 1, 2023 19:06
@rjcorb rjcorb left a comment

This is looking good! I really like the look of the correlation plots. I did notice that, in some cases, the labels don't seem to be aligning with points on the plot. For example, in PT_MDWPRDBT-Recurrence_BS_00TRPEQX-vs-Deceased_BS_1M63B97V-vaf-corplot.pdf, it is not clear which variants the labels correspond to. The colors also seem to not match with positions on the plot. Like in that example, shouldn't blue text point to blue points?

@AntoniaChroni (Contributor, Author) commented:

@rjcorb The labels can't be positioned exactly on the points on the plot.

I tried to draw connecting lines that would label specific points on a plot, but with no luck. I kept ending up with lines on all the points. :(

@naqvia commented Oct 3, 2023

> @rjcorb The labels can't be positioned exactly on the points on the plot.
>
> I tried to draw connecting lines that would label specific points on a plot, but with no luck. I kept ending up with lines on all the points. :(

Have you tried ggrepel? It automatically draws a line from the label to its point if they are too close together. Here are some examples: https://ggrepel.slowkow.com/articles/examples.html

@AntoniaChroni (Contributor, Author) commented:

> > @rjcorb The labels can't be positioned exactly on the points on the plot.
> > I tried to draw connecting lines that would label specific points on a plot, but with no luck. I kept ending up with lines on all the points. :(
>
> Have you tried ggrepel? It automatically draws a line from the label to its point if they are too close together. Here are some examples: https://ggrepel.slowkow.com/articles/examples.html

Yes, I tried multiple times, but it drew lines to all the points instead.

@rjcorb (Contributor) commented Oct 4, 2023

@AntoniaChroni ah I see. I will approve since this only seems to be an issue for a few plots. But since some plots have many overlapping labels that make them difficult to read, maybe you could filter points that get labeled based on a VAF threshold?
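
A minimal sketch of that threshold idea, written in Python for illustration only (the plotting code in this PR appears to be R). It assumes each point carries a gene name and a VAF for each of the two samples, and it keeps a label only when the VAF reaches a cutoff in at least one sample. The function name `label_text`, the field names, and the default cutoff are hypothetical. Unlabeled points get an empty string, which, as far as I understand the ggrepel examples page, is also how labels are hidden there while the points stay on the plot.

```python
def label_text(points, min_vaf=0.1):
    """Return one label per point: the gene name, or "" if the point stays unlabeled.

    points: list of dicts with keys "gene", "vaf_x", "vaf_y".
    min_vaf: hypothetical cutoff; a point is labeled if either VAF reaches it.
    """
    labels = []
    for point in points:
        keep = max(point["vaf_x"], point["vaf_y"]) >= min_vaf
        labels.append(point["gene"] if keep else "")
    return labels


# Made-up values, for illustration only.
points = [
    {"gene": "GENE_A", "vaf_x": 0.45, "vaf_y": 0.02},
    {"gene": "GENE_B", "vaf_x": 0.03, "vaf_y": 0.04},
]
print(label_text(points))
```

Because the output has the same length and order as the input, it can be attached as a label column without dropping any points from the plot.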

@rjcorb rjcorb self-requested a review October 4, 2023 20:31
@AntoniaChroni AntoniaChroni merged commit f7398ec into main Oct 6, 2023
@AntoniaChroni AntoniaChroni deleted the tmb-vaf-longitudinal-create-corplots branch October 6, 2023 17:07
AntoniaChroni added a commit that referenced this pull request Jan 24, 2024