Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SVs not lifted? VCF does not contain "chr" while chain file contains "chr" #1

Open
tomgutman opened this issue Jul 22, 2024 · 17 comments

Comments

@tomgutman
Copy link

Hi,

I tried using your tool which runs perfectly, but none of my SVs are lifted.
I'm trying to lift SVs called with NanomonSV tool from nanopore data with T2T reference genome to hg38 to annotate them afterwards with annotSV.

Here is the output after running your tool

Any idea why the SVs are not lifted Over ?

Thanks in advance !
Tom

...unmapped SV (see WM99_CD19_CG_Tumor.nanomonsv.result.lifted.hg38.unmapped for details):
	- 106 SV with one position (start or end) lifted while the other doesn't
	- 0 SV with one position (start or end) mapped to a different chrom from the other
	- 0 SV with "lifted start" > "lifted end"
	- 0 SV with the distance between the two lifted positions that changes significantly (difference between both SVLENs > 5%)

Here are two examples of SVs from Nanomonsv tool :

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  TUMOR   CONTROL
14      100136192       r_25    A       <DEL>   .       PASS    END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSSEQ=TATGAGTCTCCGTTTTTCCG        TR:VR   34:14   38:0
1       43651452        r_1_0   T       ]X:99416952]G   .       PASS    SVTYPE=BND;MATEID=r_1_1 TR:VR   37:4    41:1
@lgmgeo
Copy link
Owner

lgmgeo commented Jul 22, 2024

Thanks for testing liftoverSV.

Please, can you detail your request:

  • give the command line used
  • indicate the CHAIN_FILE used

@lgmgeo
Copy link
Owner

lgmgeo commented Jul 22, 2024

Did you use a CHAIN_FILE without "chr" in the genomic coordinates?

@tomgutman
Copy link
Author

Hi
thanks for your quick answer

Here is my command line

export PATH=/mnt/beegfs/scratch/t_gutman/nanopore_data/tools:$PATH
LIFTOVERSV=/mnt/beegfs/scratch/t_gutman/nanopore_data/tools/liftoverSV
INPUT_FILE=/mnt/beegfs/scratch/t_gutman/nanopore_data/WM99_vs_T2T/SV_calling/WM99_CD19_CG_Tumor.nanomonsv.result.vcf
CHAIN_FILE=/mnt/beegfs/scratch/t_gutman/nanopore_data/database/chm13v2-grch38.chain
OUTPUT_FILE=/mnt/beegfs/scratch/t_gutman/nanopore_data/WM99_vs_T2T/SV_calling/WM99_CD19_CG_Tumor.nanomonsv.result.lifted.hg38.vcf

$LIFTOVERSV/bin/liftoverSV -I $INPUT_FILE -C $CHAIN_FILE -O $OUTPUT_FILE

The Chain file comes from this github repo : https://github.com/marbl/CHM13

and the chain file contain "chr" which explain I guess why no SV were lifted Over

chain 255 chr1 248387328 + 5618 12693 chr1 248956422 + 260873 267915 0 246 1 0 281 40 0

@tomgutman
Copy link
Author

Using the following command awk '{gsub(/chr/,""); print}' chm13v2-grch38.chain >chm13v2-grch38.noChr.chain
I created a new chain file and running liftOverSv I obtained the following :

...unmapped SV (see WM99_CD19_CG_Tumor.nanomonsv.result.lifted.hg38.unmapped for details):
	- 36 SV with one position (start or end) lifted while the other doesn't
	- 0 SV with one position (start or end) mapped to a different chrom from the other
	- 4 SV with "lifted start" > "lifted end"
	- 3 SV with the distance between the two lifted positions that changes significantly (difference between both SVLENs > 5%)

@lgmgeo
Copy link
Owner

lgmgeo commented Jul 22, 2024

the chain file contain "chr" which explain I guess why no SV were lifted Over

Absolutely.

Using the following command awk '{gsub(/chr/,""); print}...

43 SVs are not mapped.
63 SVs are mapped.
=> I don't know what results are expected with your input file.
Does this sound correct to you?

### Regarding your first SV example:

image

END - POS = 100680226 - 100136192 = 544034
=> Ok, corresponds to SVLEN

### Online liftover
You can have a manual check with the online liftover from the broad institute:
https://liftover.broadinstitute.org//#input=14%3A100136192&hg=t2t-to-hg38
image
https://liftover.broadinstitute.org//#input=14%3A100680226&hg=t2t-to-hg38
image
END - POS = 106359792 - 105864636 = 495156
=> The distance between the two lifted positions changes significantly (difference between both SVLENs > 5%)
=> This SV is unmapped

I have no feedback on this 5% parameter.
Perhaps this criterion is too strict? I will add an option so that this threshold can be set by the user

I am really interested in any feedback, recommendation for liftover.

@tomgutman
Copy link
Author

i wasn't expecting so much differences between the two reference genome.
I am looking into it

Regarding the difference between both SVLENs > 5% :

Here are the records with this flag :

14      100136192       r_25    A       <DEL>   .       PASS    END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSS
EQ=TATGAGTCTCCGTTTTTCCG        the distance between the two lifted positions changes significantly (svlen diff > 5%)
14      99858805        r_24    T       <DUP>   .       Too_low_VAF     END=100049324;SVTYPE=DUP;SVLEN=190519;SVINSLEN=6;
SVINSSEQ=GATTAG        the distance between the two lifted positions changes significantly (svlen diff > 5%)
3       51295313        r_42    T       <DUP>   .       PASS    END=104273107;SVTYPE=DUP;SVLEN=52977794;SVINSLEN=30;SVINS
SEQ=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA     the distance between the two lifted positions changes significantly (svlen diff >
5%)

Chr14 is acrocentric and has seen a lot of changes between hg38 and T2T according to the main publication of T2T (Nurk, S. et al. (2022). The complete sequence of a human genome. Science.).
The duplication of chromosome 3 seems to span the centromere which could also explain the difference > 5%

It would be nice indeed to have a user define parameter. And in my case a 10% cutoff will not filter theses SVs.

Regarding the 36 SV with one position (start or end) lifted while the other doesn't
I looked at a few of them and indeed they seem to fall in newly sequenced regions (centromere, telomere...).

Regarding the 4 SV with "lifted start" > "lifted end"

  • 1 has a position which is non lifted (22:9777919)
  • d_101 is a insertion of 12bp which has both T2T and hg38 start coordinates > end coordinates and same length (-12). (I don't understand why it is filtered)
  • r_48 and r_49 are two deletions and indeed lifted start > lifted end. But it is in a region of chr3 that seems to be corrected in T2T compared to hg38
22      9777730 d_191   A       <DEL>   .       PASS    END=9777919;SVTYPE=DEL;SVLEN=-189       lifted_start (12196812) > lifted_end (12196623)
16      27089218        d_101   A       <INS>   .       PASS    END=27089206;SVTYPE=INS;SVINSLEN=196;SVINSSEQ=TCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTAT      lifted_start (26812465) > lifted_end (26812453)
3       198617703       r_48    T       <DEL>   .       PASS    END=198620350;SVTYPE=DEL;SVLEN=-2647;SVINSLEN=29;SVINSSEQ
=ATTATTTAATTATTTAATAATTATAATTA lifted_start (195749160) > lifted_end (195746511)
3       198620723       r_49    G       <DEL>   .       PASS    END=198639737;SVTYPE=DEL;SVLEN=-19014   lifted_start (195
746138) > lifted_end (195727080)

Finally,
I found that the lifted coordinates in the vcf doesn't match the ones found in the online liftover from the broad institute :
Is it linked to different Chm13-T2T versions ?
For example :

16      27089218        d_101   A       <INS>   .       PASS    END=27089206;SVTYPE=INS;SVINSLEN=196;SVINSSEQ=TCACATAGTTG
TATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATC
ACATAGTTGTATATATAGTATATATCACATAGTTGTATATATAGTATATATCACATAGTTGTAT      lifted_start (26812465) > lifted_end (26812453)

Coordinates T2T :

  • start : 27089218
  • end : 27089206

Coordiantes hg38 lifted (written in the vcf file)

  • start : 26812465
  • end : 26812453

Coordinates hg38 lifted (from https://liftover.broadinstitute.org//#input=22%3A9777730&hg=t2t-to-hg38)

  • start : 26812153
  • end : 26812141

@lgmgeo
Copy link
Owner

lgmgeo commented Jul 23, 2024

Regarding the difference between both SVLENs > 5% :

It would be nice indeed to have a user define parameter. And in my case a 10% cutoff will not filter theses SVs.

Done with the liftoverSV version v0.1.2_beta
See --PERCENT, P option

In your institute, do you have access to an hg38 BAM/CRAM and a T2T BAM/CRAM for the same sample?
It would be nice to check the aligned reads (with IGV for example) for these SVs. And to see if such SV whose SVLEN has increased by 10% is well covered.
If this is possible, it would be great to have such a feedback ;o)

Regarding the 4 SV with "lifted start" > "lifted end"

In my opinion, the difference should be due to a different version of the chain file used.
What is the version of your chain file (https://github.com/marbl/CHM13)?
For example, the chm13v2 is avalaible here:
https://hgdownload.soe.ucsc.edu/hubs/GCA/009/914/755/GCA_009914755.4/liftOver/hg38-chm13v2.over.chain.gz

@tomgutman
Copy link
Author

Hi,

I have indeed access to one sample aligned on hg38 and T2T.
Below are two screenshot of IGV windows. I used hg38 as a reference genome here.

In this IGV windows there is 4 tracks (2 top are hg38, 2 bottom are T2T), WM99_CD3 is considered the normal sample and WM99_CD19 is the tumor sample.

On those two screenshots, which shows the two breakpoint of a SV deletion (VCF info below) we can clearly see a structural variant identified in T2T alignement but not in hg38 alignement.
Moreover, this event is well covered and has high coverage.
For me there is no doubt it's a real event. But I have a hard time understanding precisely why it is specific to T2T.
Probably, as mentionned previously, Chr14 is acrocentric and has seen changes in its telomere part between hg38 and T2T

14 100136192 r_25 A DEL . PASS END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSS EQ=TATGAGTCTCCGTTTTTCCG the distance between the two lifted positions changes significantly (svlen diff > 5%)
Capture d’écran 2024-07-24 à 11 26 41 Capture d’écran 2024-07-24 à 11 25 41

@tomgutman
Copy link
Author

Here is another example, this time with a duplication. We can see an increased of the coverage at the breakpoints

3 51295313 r_42 T <DUP> . PASS END=104273107;SVTYPE=DUP;SVLEN=52977794;SVINSLEN=30;SVINS SEQ=AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA the distance between the two lifted positions changes significantly (svlen diff > 5%)
Capture d’écran 2024-07-24 à 11 40 12 Capture d’écran 2024-07-24 à 11 40 48

@tomgutman
Copy link
Author

Finally,
i used chm13v2-grch38.chain for the liftOver, which I found here : https://github.com/marbl/CHM13?tab=readme-ov-file

@lgmgeo
Copy link
Owner

lgmgeo commented Jul 28, 2024

Below are two screenshot of IGV windows. I used hg38 as a reference genome here.

Not sure to understand what you are showing (both on the same hg38 reference genome)

My idea was to look for example the following SV :

  • T2T, chr14:100136192-100680226 (in IGV, T2T reference genome)
  • hg38, lifted coordinates (in IGV, hg38 reference genome, located at the hg38 lifted coordinates)
    And then to see if such an SV whose SVLEN has increased by 10% is well covered in both cases (T2T and hg38).

@tomgutman
Copy link
Author

Hi,
following up on this analysis, I'm having a hard time understanding the following variant. I'm wondering if maybe there is a bug with the liftover of the bracket notation

Before liftOverSV : 
10	59997128	r_9_0	A	A]7:83411098]	.	PASS	SVTYPE=BND;MATEID=r_9_1	TR:VR	47:3	41:0
7	83411098	r_9_1	T	T]10:59997128]	.	PASS	SVTYPE=BND;MATEID=r_9_0	TR:VR	47:3	41:0

After liftOverSV : 
10	59143169	r_9_0	A	A]7:83411098]	.	PASS	SVTYPE=BND;MATEID=r_9_1	TR:VR	47:3	41:0
7	82159848	r_9_1	T	T]10:59997128]	.	PASS	SVTYPE=BND;MATEID=r_9_0	TR:VR	47:3	41:0

The coordinates don't match between the two mates after liftOverSV.

What do you think ?

@tomgutman
Copy link
Author

For the SV comparison between T2T and hg38 for the same sample :

T2T :

14 100136192 r_25 A DEL . PASS END=100680226;SVTYPE=DEL;SVLEN=-544034;SVINSLEN=20;SVINSS EQ=TATGAGTCTCCGTTTTTCCG the distance between the two lifted positions changes significantly (svlen diff > 5%)

global :
Capture d’écran 2024-07-31 à 10 34 45

BND left :
Capture d’écran 2024-07-31 à 10 57 16

BND right :
Capture d’écran 2024-07-31 à 10 58 05

hg38 :

14	105864636	r_25	A	<DEL>	.	PASS	END=106359792;SVTYPE=DEL;SVLEN=-495156;SVINSLEN=20;SVINSSEQ=TATGAGTCTCCGTTTTTCCG	TR:VR	34:14	38:0

global :
Capture d’écran 2024-07-31 à 11 37 25

BND left :
Capture d’écran 2024-07-31 à 11 37 46

BND right :
Capture d’écran 2024-07-31 à 11 38 02

For this SV, on chr14, the region is well covered in T2T and in hg38. To note we are in the VDJ region, which goes through a lot of recombination, moreover, my samples are B and T cells, so we can see some recombination events in the B cells

@lgmgeo
Copy link
Owner

lgmgeo commented Aug 1, 2024

I'm wondering if maybe there is a bug with the liftover of the bracket notation

Absolutely!
I was aware of this, as stated at the end of the "Feature requested for future release" section:
image

I developed this tool for an analysis in my lab where SVs were not described with square-bracketed notation.
But if liftoverSV is used by other users, I can continue the development. It may take some time (I am out of my lab in August) but I will do my best to add this feature as soon as possible.

@lgmgeo
Copy link
Owner

lgmgeo commented Aug 1, 2024

For the SV comparison between T2T and hg38 for the same sample
...
For this SV, on chr14, the region is well covered in T2T and in hg38. To note we are in the VDJ region, which goes through a lot of recombination, moreover, my samples are B and T cells, so we can see some recombination events in the B cells

Excellent! Thank you for showing this specific example with so much details!
I'm absolutely not an expert in cancer and your example in the VDJ region is really interresting.

This is a complex SV (not really a single DEL in my opinion).

Actually, I'm quite surprised with the BND coverage:
CD3 (normal), BND left [0-25]
CD19 (tumor), BND left [0-18]
=> looks like an homozygous DEL on the screenshot

CD3 (normal), BND right [0-17]
CD19 (tumor), BND right [0-32]
=> looks like an heterozygous DUP on the screenshot

CD3 (normal), between BNDs [0-43]
CD19 (tumor), between BNDs [0-32]
=> looks like an heterozygous deletion?? not sure, so complex.
=> We can see the V(D)J recombination (variable–diversity–joining rearrangement), with the multiple alleles. Really nice.

Finally, it looks like you may indeed use a cutoff of 10% for this complex SV.

@tomgutman
Copy link
Author

Thanks for all this information,

Why are you suprised by the BND coverage ? You would expect it to be less covered ?
For the BND right, how do you determine it is an heterozygous DUP ? I understand that it is a DUP by the bump in the coverage but not the heterozygous part. If it was homozygous DUP you would expect double coverage compared to the other side of the BND ? (

Also you have a very fine analysis of the global V(D)J SV. Would you be keen on explaining how you analyzed the global complex SV ?

  • I am not sure to understand why it's a heterozygous deletion. how do you determine it this ?

Thanks again for your time and effort.
It's really helpful !

@lgmgeo
Copy link
Owner

lgmgeo commented Aug 4, 2024

Hello Tom,
Maybe it would be better to talk together directly by phone.
I will contact you by email to try to define a time slot.
Véronique

@lgmgeo lgmgeo changed the title SVs not lifted SVs not lifted? VCF does not contain "chr" while chain file contains "chr" Jan 10, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants