Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adapting the script for mtDNA #16

Closed
hoelzer opened this issue Nov 12, 2024 · 9 comments · Fixed by #17
Closed

Adapting the script for mtDNA #16

hoelzer opened this issue Nov 12, 2024 · 9 comments · Fixed by #17
Labels
bug Something isn't working

Comments

@hoelzer
Copy link

hoelzer commented Nov 12, 2024

Hey @jonas-fuchs we like to adapt the script for plotting variants in mtDNAs. We have the VCF files (from freebayes) and are now trying to produce a heatmap. Ayaka is supporting this effort.

We tried

virheat vcf-input results -l 16773 -r CM022781.1

but then get

Traceback (most recent call last):
  File "/Users/martin/miniconda3/envs/virheat/bin/virheat", line 10, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/martin/miniconda3/envs/virheat/lib/python3.12/site-packages/virheat/command.py", line 162, in main
    frequency_lists, unique_mutations, file_names = data_prep.extract_vcf_data(vcf_files, args.reference, threshold=args.threshold)
                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/martin/miniconda3/envs/virheat/lib/python3.12/site-packages/virheat/scripts/data_prep.py", line 121, in extract_vcf_data
    if not vcf_dict["AF"][idx] >= threshold:
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>=' not supported between instances of 'str' and 'int'

I guess there is a problem with the format of the VCF and how virHEAT is parsing it?

Here is an example VCF:

IHIT11.mtDNA.vcf.zip

Are you using the AF flag for getting the allele frequency? I think this might be our problem bc we have strings like AF=0,0,0; for weird combinations of variant calls.

@hoelzer
Copy link
Author

hoelzer commented Nov 12, 2024

Yes, that's the problem. I simplified the VCF with entries:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	unknown
CM022781.1	933	.	T	C	23259.6	.	AB=0;ABP=0;AC=2;AF=1;AN=2;AO=762;CIGAR=1X;DP=762;DPB=762;DPRA=0;EPP=5.24447;EPPR=0;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=0;NS=1;NUMALT=1;ODDS=1058.88;PAIRED=0.988189;PAIREDR=0;PAO=0;PQA=0;PQR=0;PRO=0;QA=25908;QR=0;RO=0;RPL=401;RPP=7.56982;RPPR=0;RPR=361;RUN=1;SAF=374;SAP=3.56884;SAR=388;SRF=0;SRP=0;SRR=0;TYPE=snp	GT:DP:AD:RO:QR:AO:QA:GL	1/1:762:0,762:0:0:762:25908:-2331.31,-229.385,0
CM022781.1	1194	.	C	G	1.00054e-14	.	AB=0;ABP=0;AC=0;AF=0;AN=2;AO=8;CIGAR=1X;DP=724;DPB=724;DPRA=0;EPP=3.0103;EPPR=6.5162;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=915.175;PAIRED=1;PAIREDR=0.980447;PAO=0;PQA=0;PQR=0;PRO=0;QA=272;QR=24344;RO=716;RPL=8;RPP=20.3821;RPPR=3.05882;RPR=0;RUN=1;SAF=4;SAP=3.0103;SAR=4;SRF=383;SRP=10.5923;SRR=333;TYPE=snp	GT:DP:AD:RO:QR:AO:QA:GL	0/0:724:716,8:716:24344:8:272:0,-193.134,-2165.79
CM022781.1	5367	.	T	C	0	.	AB=0;ABP=0;AC=0;AF=0;AN=2;AO=6;CIGAR=1X;DP=860;DPB=860;DPRA=0;EPP=3.0103;EPPR=9.36707;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=1124.23;PAIRED=1;PAIREDR=0.984778;PAO=0;PQA=0;PQR=0;PRO=0;QA=204;QR=29036;RO=854;RPL=6;RPP=16.0391;RPPR=63.3132;RPR=0;RUN=1;SAF=3;SAP=3.0103;SAR=3;SRF=480;SRP=31.5802;SRR=374;TYPE=snp	GT:DP:AD:RO:QR:AO:QA:GL	0/0:860:854,6:854:29036:6:204:0,-240.192,-2594.05
CM022781.1	11987	.	A	G	1.02507e-15	.	AB=0;ABP=0;AC=0;AF=0;AN=2;AO=6;CIGAR=1X;DP=1046;DPB=1046;DPRA=0;EPP=3.0103;EPPR=3.84548;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=59.9192;NS=1;NUMALT=1;ODDS=1380.91;PAIRED=1;PAIREDR=0.971154;PAO=0;PQA=0;PQR=0;PRO=0;QA=204;QR=35360;RO=1040;RPL=0;RPP=16.0391;RPPR=29.2016;RPR=6;RUN=1;SAF=3;SAP=3.0103;SAR=3;SRF=516;SRP=3.14393;SRR=524;TYPE=snp	GT:DP:AD:RO:QR:AO:QA:GL	0/0:1046:1040,6:1040:35360:6:204:0,-296.183,-3162.89

and then it works:

Screenshot 2024-11-12 at 07 57 51

I guess the best is we format our VCFs in a way the work with virheat, removing rows with weird/complex AF flags.

Or do you have another idea for such situations?

@hoelzer
Copy link
Author

hoelzer commented Nov 12, 2024

Yeah... probalby it's best we filter/format our VCF files to work with your plotting tool. Here is a plot when I filter our VCFs for all entries that have AF=1 and AF=0.5 removing all other cases

virHEAT_plot.pdf

@jonas-fuchs
Copy link
Owner

jonas-fuchs commented Nov 12, 2024

@hoelzer What is the relevance of these calls: AF=(0,0,0) or AF=(0,0,0.5)? Removing them should be relatively easy to introduce (similar to what you have done manually), but I have no feeling if this is good practise.

@jonas-fuchs
Copy link
Owner

I could also check if all are zero and if not use the highest/lowest AF.

@hoelzer
Copy link
Author

hoelzer commented Nov 12, 2024

Unfortunately there are diff combinations of such AF strings with commata 🙄 I think the meaning is when you have mixed variant calls, e.g. a G in the ref but then an A or an T in the reads with varying frequency etc...

My impression was that they are not so meaningful for our analysis.. but not sure in general

Also, not sure if this is a freebayes specific way of writing such calls in the VCF file.

@jonas-fuchs
Copy link
Owner

Unfortunately there are diff combinations of such AF strings with commata 🙄 I think the meaning is when you have mixed variant calls, e.g. a G in the ref but then an A or an T in the reads with varying frequency etc...

Ah I see. But this I could implement this in the vcf parsing... Let me check how straight forward that would be.

My impression was that they are not so meaningful for our analysis.. but not sure in general

Hmm, actually I think you will then miss the ones that have several variants called at a single position, or not?

Also, not sure if this is a freebayes specific way of writing such calls in the VCF file.

Also not sure, but at least not typical for most of the variant callers I have used.

@hoelzer
Copy link
Author

hoelzer commented Nov 12, 2024

Hmm, actually I think you will then miss the ones that have several variants called at a single position, or not?

Yeah exactly, we would miss those. Probably you would have to split such an AF entry with multiple alternative alleles into multiple rows in the VCF file.

There is also another thing but this is related to freebayes and the assumed ploidy: we only have 0/0.5/1 values for AF. However, one could calculate more precise allele frequencies from the AD field which holds the raw read counts for each allele.

I asked Ayaka to adapt a little script I wrote that does a simple filtering for the AF fields with only 1 or 0.5 and no commas to also replace the AF values with more precise AD-based ones.

I am mentioning that because when you have smt like AF=0,0,0.5 you will also have a more complex AD field 🙄

@jonas-fuchs
Copy link
Owner

@hoelzer Can you give it a try (see PR). Basically, I now attempt a second comma-separated split for the ALT field and add these variants separately while taking the same information for the other field expect the INFO field multiple times. I hope it is correct for me to assume that if we have multiple ALT calls that the INFO field has the same amount of entries per classifier (such as AF). Otherwise I will need to introduce some additional checks.
virHEAT_plot.pdf
Worked with the vcf file you provided and does not change the output of my test data. But please have a look ;).

@jonas-fuchs jonas-fuchs added the bug Something isn't working label Nov 12, 2024
@hoelzer
Copy link
Author

hoelzer commented Nov 12, 2024

works as described in #17

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants