Converting TOGA results into different formats #136

alejandrogzi · 2023-08-29T05:13:22Z

alejandrogzi
Aug 29, 2023

I just wanted to let you know that I developed this tool, bed2gtf, a few weeks ago to overcome some limitations of the bedToGenePred | genePredToGtf approach. It is one of my first projects, so it is not that fancy; however, I think it may be of some help to TOGA users. Since this is a very small change, I did not want to make a pull request.

Best,
Alejandro

MichaelHiller · 2023-08-29T05:25:55Z

MichaelHiller
Aug 29, 2023
Maintainer

Hi Alejandro,
very interesting.
Doesn't have to be fancy.
Could you briefly explain what is different or which cases are correctly handled by your script?
We would be happy to pull it.

Thx


kirilenkobm · 2023-08-29T08:44:03Z

kirilenkobm
Aug 29, 2023
Maintainer

Hi @alejandrogzi @MichaelHiller,

Could you please remind me of those limitations? I think the best strategy for including the package would be to:

Link the project as a Git submodule
Add an option to ./configure.sh that includes and builds bed2gtf
Document, here in the TOGA project, how to implement it

Thanks!


kirilenkobm · 2023-08-29T08:50:11Z

kirilenkobm
Aug 29, 2023
Maintainer

@alejandrogzi

By the way, I took a quick look at the code and everything looks good to me. Even though I'm not familiar with Rust, everything was easy to follow :).


alejandrogzi · 2023-08-29T18:11:34Z

alejandrogzi
Aug 29, 2023
Author

@MichaelHiller , @kirilenkobm ,

Sorry for the delayed response (I am 7h behind you).

Could you briefly explain what is different or which cases are correctly handled by your script?

Yes, Michael. This tool is an exact reimplementation of the bedToGenePred and genePredToGtf functionality in just one step. Furthermore, it adds a "gene" feature line (which the C binaries do not produce, even with refTables); this line is necessary for some post-processing tools that need to extract gene coordinates (which cannot be extracted by looking only at the transcript info). This is a small benchmark:

In a first pipeline I built (merging make_lastz_chains + TOGA + postprocessing) I needed to handle: 1) downloading binaries, 2) downloading refTables, 3) cutting genePred files if they had just numbers as chr, 4) modifying refTables if I wanted to do custom rearrangements, 5) looping through the output gtf to add a "gene" feature line, etc. (a lot of steps, really). I developed bed2gtf to do all of this in 1 step, just by providing the .bed file, a .txt/.tsv/.csv file with isoforms (gene - transcript pairs, making sure that all transcripts in the .bed file appear there) and the path to the output .gtf.

The main limitations of bed2gtf are: 1) the output is not sorted (I did not implement that here, but I am developing another tool for it, also in Rust: gtfsort); 2) it does not provide more than gene_id in the gene feature line (e.g. gene_biotype is missing). That could be solved by adding that info to the isoforms file and appending it to the resulting line, but I am not sure that is necessary.
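To illustrate what a "gene" feature line involves, here is a minimal Python sketch that derives a gene span from its transcripts by taking the smallest start and the largest end. It is not bed2gtf's actual implementation; the function names `gene_spans` and `gene_line` are hypothetical, and the sketch assumes all transcripts of a gene lie on the same chromosome and strand and that coordinates are already 1-based GTF coordinates.

```python
def gene_spans(transcripts, isoforms):
    """Compute one span per gene from its transcripts.

    transcripts: iterable of (chrom, start, end, strand, transcript_id)
                 in 1-based, inclusive GTF coordinates.
    isoforms:    dict mapping transcript_id -> gene_id.
    Returns a dict gene_id -> (chrom, min_start, max_end, strand).
    Assumption: every transcript of a gene is on the same chrom/strand.
    """
    spans = {}
    for chrom, start, end, strand, tid in transcripts:
        gene = isoforms[tid]  # raises KeyError if the transcript is unmapped
        if gene not in spans:
            spans[gene] = (chrom, start, end, strand)
        else:
            c, s, e, st = spans[gene]
            spans[gene] = (c, min(s, start), max(e, end), st)
    return spans


def gene_line(gene, span, source="bed2gtf"):
    """Format a GTF 'gene' feature line carrying only a gene_id attribute."""
    chrom, start, end, strand = span
    fields = [chrom, source, "gene", str(start), str(end), ".", strand, ".",
              f'gene_id "{gene}";']
    return "\t".join(fields)
```

Given two overlapping transcripts of the same gene, `gene_spans` returns a single span covering both, and `gene_line` turns it into one tab-separated GTF record.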

Document, here in the TOGA project, how to implement it

Bogdan, I could adjust all the info in the bed2gtf repo to fit within TOGA wiki (a shorter version just explaining functionality).

Please let me know, and thanks for the support! (just an undergrad here trying to find opportunities!)

Best,
Alejandro


kirilenkobm · 2023-08-30T13:49:49Z

kirilenkobm
Aug 30, 2023
Maintainer

Hi @alejandrogzi

I've added the following code snippet to the configure script (it will be available in Version 1.1.5 once it's complete):

if $BUILD_BED2GTF; then
    if ! $OVERRIDE && [[ -f "./bed2gtf/some_check_file" ]]
    then
        printf "bed2gtf installation found\n"
    else
        printf "bed2gtf installation not found, cloning and building\n"
        git submodule init bed2gtf
        git submodule update bed2gtf
        # ACTUAL BUILD COMMAND HERE
    fi
fi

This script is executed when the configure script is called with the --bed2gtf flag (I'll update the readme to reflect this).

Regarding building the project, I have a question. For CESAR2.0, the build process is simple: cd CESAR2.0 && make.

My question might seem naïve, but I'm not familiar with the Rust build system. Would the best approach be to first check if cargo is installed, and if it is, navigate to the bed2gtf directory and run cargo install bed2gtf, then locate the bed2gtf executable in the /bin subdirectory (or somewhere else)?


alejandrogzi · 2023-08-30T17:53:15Z

alejandrogzi
Aug 30, 2023
Author

Hi @kirilenkobm,

Nice!

In Rust there is a range of options. Checking first whether cargo is installed is highly recommended, I think, since it prevents later errors in the pipeline. If cargo is installed, we initialize and fetch the submodule; after that, we cd into it and run cargo build --release to build the binary, which is then located at ./bed2gtf/target/release/bed2gtf. We use --release because it produces the optimized build. This makes bed2gtf run even faster but slows down compilation; that is not much of a problem here because I wrote bed2gtf on top of just 2 crates (packages), so the compile time is negligible. This, in my opinion, is the best approach.

On the other hand, cargo install bed2gtf downloads the latest published version from the Rust package registry (https://crates.io/crates/bed2gtf), builds it, and installs the binary into ~/.cargo/bin. So this is another possible way, but it would add another step: checking whether ~/.cargo/bin is in PATH and, if it is not, adding it.
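The two checks mentioned above (is cargo installed, and is ~/.cargo/bin on PATH) can be sketched in Python as follows. This is only an illustration of the logic, not part of TOGA's configure script; the function names are hypothetical.

```python
import os
import shutil


def cargo_available():
    """Return True if a 'cargo' executable can be found on the current PATH."""
    return shutil.which("cargo") is not None


def cargo_bin_on_path(path_env, home):
    """Return True if <home>/.cargo/bin is one of the entries in path_env.

    path_env: the PATH string (entries separated by os.pathsep).
    home:     the user's home directory.
    """
    cargo_bin = os.path.normpath(os.path.join(home, ".cargo", "bin"))
    entries = [os.path.normpath(p) for p in path_env.split(os.pathsep) if p]
    return cargo_bin in entries
```

In a real setup one would call `cargo_bin_on_path(os.environ.get("PATH", ""), os.path.expanduser("~"))` and only suggest editing PATH when it returns False.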

Maybe the script could be updated to this:

if $BUILD_BED2GTF; then
    if command -v cargo &> /dev/null; then
        if ! $OVERRIDE && [[ -f "./bed2gtf/Cargo.toml" ]]; then
            printf "bed2gtf installation found\n"
            # these statements should be avoided, right? In overriding
            # situations we would expect binaries already build
            if [[ -f "./bed2gtf/target/release/bed2gtf" ]]; then
                printf "bed2gtf binary found\n"
            else
                printf "bed2gtf binary not found, building\n"
                # build in a subshell so the working directory is unchanged afterwards
                (cd ./bed2gtf && cargo build --release)
                # locate binary at ./target/release/bed2gtf
            fi
        else
            printf "bed2gtf installation not found, cloning and building\n"
            git submodule init bed2gtf
            git submodule update bed2gtf
            # build in a subshell so the working directory is unchanged afterwards
            (cd ./bed2gtf && cargo build --release)
            # locate binary at ./target/release/bed2gtf
        fi
    else
        printf "cargo not found, installing rust\n"
        curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
        # include bed2gtf build here
    fi
fi


MichaelHiller · 2023-09-06T14:00:07Z

MichaelHiller
Sep 6, 2023
Maintainer

Thx a lot, Alejandro, this is a very useful addition, esp for downstream tools that need the gene feature.

If the tool works and is properly benchmarked, I wonder if we should update all the geneAnnotation.gtf.gz files on
https://genome.senckenberg.de/download/TOGA/ ?

@kirilenkobm On our "Output reading / query_annotation.bed" section, can you pls recommend bed2gtf to convert the bed to gtf?


shjenkins94 · 2023-09-08T07:50:24Z

shjenkins94
Sep 8, 2023

Sorry to intrude.

Coincidentally, I've been working on converting some R code I wrote to get a filtered GFF3 from TOGA output into python.

It was based on what I needed it for (getting the best transcript for each gene) so it's less general purpose but does some stuff that could be useful for TOGA postprocessing, i.e. it adds orthology classifications and gene or transcript loss status, and the name of the original reference gene(s) as attributes. It also filters the transcripts for each gene to the most intact ones.

I'm wondering if this might be useful as a further step after running bed2gtf.


MichaelHiller · 2023-09-08T11:05:16Z

MichaelHiller
Sep 8, 2023
Maintainer

This is great. I think it would be great if downstream processing scripts and tools could be added, as this will be useful for many users.

If you agree to add your code to this repo, pls write your name / institution at the beginning of the code to make clear who developed it.
@kirilenkobm Could you pls help with adding the code?
And we should list these tools with a short HOWTO and what they do in the wiki (the wiki maybe needs to be featured more prominently on the main github page).

Thx a lot !!


alejandrogzi · 2023-09-08T16:11:47Z

alejandrogzi
Sep 8, 2023
Author

@shjenkins94 cool! I've been thinking these days about adding a feature to bed2gtf that allows specifying the GTF output version (GTF3 as default). GTF3 and GFF3 are pretty similar, so maybe both tools could easily work together! Additionally, there are a lot of tools available to convert GTF to GFF, but if we need a custom tool to bridge bed2gtf and your tool, we could join efforts.

@MichaelHiller during post-processing I have developed a lot of scripts, and while writing them I always thought that maybe a post-TOGA pipeline could be built; however, users may need a lot of different custom ways to use it, which can be tricky. Your Python AssemblyStats is a good example of that. I will try to group all of them and work on a simple pipeline to show you (these are written in Python, but I think some guidelines should be set here, since a lot of different scripts in a diverse range of programming languages could be problematic).

Regards!


kirilenkobm · 2023-09-09T12:55:56Z

kirilenkobm
Sep 9, 2023
Maintainer

@shjenkins94, that sounds great! However, it might make more sense if you could create a separate repository for the post-processing scripts and procedures, complete with some documentation. I can then include it as a submodule and reference it in the README. This way, versioning will be better managed, and concerns will be more cleanly separated. What are your thoughts?


MichaelHiller · 2023-09-09T15:41:52Z

MichaelHiller
Sep 9, 2023
Maintainer

Also a good idea. But ideally we have something where many people can contribute.
Of course, several repos can be included as submodules.

Nice to see that the community is using TOGA and contributing to downstream steps !!


shjenkins94 · 2023-09-12T10:25:18Z

shjenkins94
Sep 12, 2023

@kirilenkobm A subrepo sounds like a good idea (working in the main TOGA repo can get a bit bulky). It would also be nice to clarify a few things about TOGA results. For example, @alejandrogzi I noticed when I tried out bed2gtf that it currently fails on my TOGA output because there are some projections that are included in the final query_annotations.bed file but not the query_isoforms.tsv file.

Looking at the loss_summ_data.tsv, query_annotation.bed, and query_isoforms.tsv files, it seems like only the most intact projections for a transcript are actually assigned to a gene. I was wondering if the rest of the projections are still useful, as well as whether they should be connected to the same gene as their more intact counterparts.

The bed file also seems to include projections of different reference transcripts that are identical on the query genome. Depending on the goal it might be good to filter these as well.


alejandrogzi · 2023-09-13T16:50:26Z

alejandrogzi
Sep 13, 2023
Author

@shjenkins94 Yes! As stated in the bed2gtf repo, it is mandatory for all the isoforms in the .bed file to appear in the isoforms file; that is how the tool maps gene features.
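A quick way to see which entries would trip this requirement is to compare the two files before running the conversion. The sketch below is not part of bed2gtf; `missing_transcripts` is a hypothetical helper that assumes a tab-separated BED with the name in the 4th column and a tab-separated isoforms file with gene in the 1st column and transcript in the 2nd.

```python
def missing_transcripts(bed_lines, isoform_lines):
    """Return BED names (4th column) that are absent from the isoforms file.

    bed_lines:     iterable of tab-separated BED lines.
    isoform_lines: iterable of tab-separated 'gene<TAB>transcript' lines.
    """
    known = set()
    for line in isoform_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            known.add(fields[1])

    missing = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] not in known:
            missing.append(fields[3])
    return missing
```

Both arguments can be open file handles, so the check can be run directly on query_annotation.bed and the isoforms file; an empty list means every BED entry has a gene assignment.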

I noticed when I tried out bed2gtf that it currently fails on my TOGA output because there are some projections that are included in the final query_annotations.bed file but not the query_isoforms.tsv file.

That was a problem for me at some point. Michael, Bogdan and I discussed it in an issue here: #70. I can add a feature to detect when a transcript is not present in the isoforms file; currently bed2gtf (latest release) just throws an error explaining that it tried to use a transcript as a key to find its gene and did not find it.

The 'brute-force' workaround I found was to use ~/toga_results/temp/isoforms.tsv (a copy of the isoforms file you gave TOGA as input; it is kept if the -kt flag was on, otherwise use your original isoforms file) to map all transcripts in the .bed file to their respective genes. This will capture all of the transcripts in the .bed because TOGA is restricted to pairs that appear in the isoforms template.

Something like this simplifies the work:

import os

import pandas as pd

# Input files: bed (4th column holds the projection ids), isoforms (the reference
# isoforms file given to TOGA), and the output path
bed = pd.read_csv("~/toga_results/query_annotation.bed",
                  sep="\t",
                  header=None,
                  usecols=[3],
                  names=["transcripts"])
# open() does not expand "~", so expand it explicitly
isoforms = os.path.expanduser("~/toga_results/temp/isoforms.tsv")
output = os.path.expanduser("~/toga_results/output.txt")

# Fast lookup of a given transcript using a simple dictionary (transcript -> gene,
# based on isoforms.tsv)
iso_dict = {}
with open(isoforms, "r") as isoforms_file:
    for line in isoforms_file:
        # skip the header line, if present
        if line.startswith("gene"):
            continue
        gene, transcript = line.strip().split("\t")
        iso_dict[transcript] = gene

# Maps each projection id (taking the part before the first ".") to the helper
# dictionary and writes a gene / TOGA-transcript pair to the output file
with open(output, "w") as output_file:
    for transcript in bed.transcripts:
        gene = iso_dict.get(str(transcript).split(".")[0], "not found")
        output_file.write(f"{gene}\t{transcript}\n")

With the new file ~/toga_results/output.txt bed2gtf will definitely work!:

bed2gtf ~/toga_results/query_annotation.bed ~/toga_results/output.txt ~/toga_results/query_annotation.gtf

or inside a Rust project with:

use bed2gtf::*;
let gtf = bed2gtf(bed, isoforms, output);

On the other hand, we can solve the issue by making TOGA write all the transcripts in the .bed to an isoforms output mapping them not only to the TOGA regions but to gene ids.

Hope that helps!


shjenkins94 · 2023-09-14T15:51:43Z

shjenkins94
Sep 14, 2023

@alejandrogzi I think the issue with using the original isoforms file is that it won't work well for genes that aren't one2one.

This might help: in post-processing I combined the orthology_classification, gene_loss_summ, query_isoform, and annotation bed files to get a table of the target gene, the target transcript, the query gene, the query transcript, the orthology class, the loss type of the target transcript, the loss type of the query transcript, and the chromosome that the query transcript is on in the bed file.

Here's an example of some simplified many2many results, where 2 target genes were orthologous to 2 query genes on different chromosomes. The last line is a transcript that's included in the query annotation file but not the query_isoforms file (NA for q_gene)

t_gene	t_transcript	q_gene	q_transcript	orthology_class	t_transcript_loss_type	q_transcript_loss_type	chrom
TGene1	TGene1.Iso1	reg_1	TGene1.Iso1.169041	many2many	UL	UL	chr10
TGene1	TGene1.Iso1	reg_2	TGene1.Iso1.220	many2many	UL	UL	chr7
TGene1	TGene1.Iso2	reg_2	TGene1.Iso2.220	many2many	I	I	chr7
TGene2	TGene2.Iso1	reg_1	TGene2.Iso1.169026	many2many	UL	UL	chr10
TGene2	TGene2.Iso1	reg_2	TGene2.Iso1.1880	many2many	UL	UL	chr7
NA	NA	NA	TGene1.Iso2.169041	NA	NA	L	chr10

If TGene1 and TGene2 are used as gene ids then the problem with the orphan query transcript is fixed but transcripts in completely different locations are grouped together as one gene.
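The grouping problem described above can be detected with a small check. The following Python sketch is only an illustration (the helper name `genes_on_multiple_chroms` is hypothetical); it reports every gene id whose transcripts land on more than one chromosome.

```python
from collections import defaultdict


def genes_on_multiple_chroms(rows):
    """Find gene ids whose transcripts are spread over several chromosomes.

    rows: iterable of (gene_id, transcript_id, chrom) tuples.
    Returns a dict gene_id -> sorted list of chromosomes, containing only
    genes seen on more than one chromosome.
    """
    chroms = defaultdict(set)
    for gene, _transcript, chrom in rows:
        chroms[gene].add(chrom)
    return {gene: sorted(c) for gene, c in chroms.items() if len(c) > 1}
```

Applied to the example table with the target genes used as gene ids, both TGene1 and TGene2 are flagged because their projections sit on chr10 and chr7.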


alejandrogzi · 2023-09-14T19:40:32Z

alejandrogzi
Sep 14, 2023
Author

@shjenkins94 thank you for providing a specific example! Since your transcript ids are formatted differently, I just modified a small part of the code I provided earlier to:

### ...
with open(output, "w") as output_file:
    for transcript in bed.transcripts:
        t = ".".join(str(transcript).split(".")[:-1]) # Allows precise mapping of toga-transcripts
        gene = iso_dict.get(t, "not found") 
        output_file.write(f"{gene}\t{transcript}\n")

Testing it with your data produces:

// iso_dict (note that transcripts are used as keys)
{'TGene1.Iso1': 'TGene1', 'TGene1.Iso2': 'TGene1', 'TGene2.Iso1': 'TGene2'}

// written output
TGene1	TGene1.Iso1.169041
TGene1	TGene1.Iso1.220
TGene1	TGene1.Iso2.220
TGene2	TGene2.Iso1.169026
TGene2	TGene2.Iso1.1880
TGene1	TGene1.Iso2.169041

Running bed2gtf with that output file will write a specific block for each transcript without any overwrite (since each TOGA transcript is unique). I looked for a similar example in some data I can quickly access:

t_gene	t_transcript	q_gene	q_transcript	orthology_class	chrom
ENSG00000172365	ENST00000641342	reg_991	ENST00000641342.10630	many2many	chr13
ENSG00000172365	ENST00000641342	reg_14588	ENST00000641342.22739	many2many	chr14
ENSG00000172365	ENST00000641342	reg_14628	ENST00000641342.37699	many2many	chr20
ENSG00000280709	ENST00000628444	reg_991	ENST00000628444.81896	many2many	chr16
ENSG00000284609	ENST00000641139	reg_991	ENST00000641139.4896	many2many	chr1

Here 1) the same target transcript is mapped to different chromosomes, 2) different target genes are projected over the same query gene (reg_991) in different chromosomes. All of those transcripts are correctly annotated in the output gtf.

Furthermore, I just created a .bed with random coordinates using your example:

chr10	42553596	42554643	TGene1.Iso1.169041	1000	+	42553596	42554643	0,0,200	2	84,106,	0,941,
chr7	94578012	94579938	TGene1.Iso1.220	1000	-	94578012	94579938	0,0,200	2	104,628,	0,1298,
chr7	81275380	81276990	TGene1.Iso2.220	1000	+	81275380	81276990	255,160,120	2	479,137,	0,1473,
chr10	56789696	56789898	TGene2.Iso1.169026	1000	+	56789696	56789898	130,130,130	1	202,	0,
chr7	17167669	17167901	TGene2.Iso1.1880	1000	+	17167669	17167901	255,50,50	1	232,	0,
chr10	12250689	12251844	TGene1.Iso2.169041	1000	+	12250689	12251844	0,0,200	1	1155,	0,

This is the output:

//TGene1
chr10	bed2gtf	gene	42553597	42554643	.	+	.	gene_id "TGene1";
chr10	bed2gtf	transcript	42553597	42554643	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso1.169041";
chr10	bed2gtf	exon	42553597	42553680	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "1"; exon_id "TGene1.Iso1.169041.1";
chr10	bed2gtf	CDS	42553597	42553680	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "1"; exon_id "TGene1.Iso1.169041.1";
chr10	bed2gtf	exon	42554538	42554643	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "2"; exon_id "TGene1.Iso1.169041.2";
chr10	bed2gtf	CDS	42554538	42554643	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "2"; exon_id "TGene1.Iso1.169041.2";
chr10	bed2gtf	start_codon	42553597	42553599	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "1"; exon_id "TGene1.Iso1.169041.1";
chr7	bed2gtf	transcript	94578013	94579938	.	-	.	gene_id "TGene1"; transcript_id "TGene1.Iso1.220";
chr7	bed2gtf	exon	94578013	94578116	.	-	.	gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "2"; exon_id "TGene1.Iso1.220.2";
chr7	bed2gtf	CDS	94578016	94578116	.	-	2	gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "2"; exon_id "TGene1.Iso1.220.2";
chr7	bed2gtf	exon	94579311	94579938	.	-	.	gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "1"; exon_id "TGene1.Iso1.220.1";
chr7	bed2gtf	CDS	94579311	94579938	.	-	0	gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "1"; exon_id "TGene1.Iso1.220.1";
chr7	bed2gtf	start_codon	94579936	94579938	.	-	0	gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "1"; exon_id "TGene1.Iso1.220.1";
chr7	bed2gtf	stop_codon	94578013	94578015	.	-	0	gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "2"; exon_id "TGene1.Iso1.220.2";
chr7	bed2gtf	transcript	81275381	81276990	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso2.220";
chr7	bed2gtf	exon	81275381	81275859	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "1"; exon_id "TGene1.Iso2.220.1";
chr7	bed2gtf	CDS	81275381	81275859	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "1"; exon_id "TGene1.Iso2.220.1";
chr7	bed2gtf	exon	81276854	81276990	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "2"; exon_id "TGene1.Iso2.220.2";
chr7	bed2gtf	CDS	81276854	81276990	.	+	1	gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "2"; exon_id "TGene1.Iso2.220.2";
chr7	bed2gtf	start_codon	81275381	81275383	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "1"; exon_id "TGene1.Iso2.220.1";
chr10	bed2gtf	transcript	12250690	12251844	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso2.169041";
chr10	bed2gtf	exon	12250690	12251844	.	+	.	gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";
chr10	bed2gtf	CDS	12250690	12251841	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";
chr10	bed2gtf	start_codon	12250690	12250692	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";
chr10	bed2gtf	stop_codon	12251842	12251844	.	+	0	gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";

//TGene2
chr10	bed2gtf	gene	56789697	56789898	.	+	.	gene_id "TGene2";
chr10	bed2gtf	transcript	56789697	56789898	.	+	.	gene_id "TGene2"; transcript_id "TGene2.Iso1.169026";
chr10	bed2gtf	exon	56789697	56789898	.	+	.	gene_id "TGene2"; transcript_id "TGene2.Iso1.169026"; exon_number "1"; exon_id "TGene2.Iso1.169026.1";
chr10	bed2gtf	CDS	56789697	56789898	.	+	0	gene_id "TGene2"; transcript_id "TGene2.Iso1.169026"; exon_number "1"; exon_id "TGene2.Iso1.169026.1";
chr10	bed2gtf	start_codon	56789697	56789699	.	+	0	gene_id "TGene2"; transcript_id "TGene2.Iso1.169026"; exon_number "1"; exon_id "TGene2.Iso1.169026.1";
chr7	bed2gtf	transcript	17167670	17167901	.	+	.	gene_id "TGene2"; transcript_id "TGene2.Iso1.1880";
chr7	bed2gtf	exon	17167670	17167901	.	+	.	gene_id "TGene2"; transcript_id "TGene2.Iso1.1880"; exon_number "1"; exon_id "TGene2.Iso1.1880.1";
chr7	bed2gtf	CDS	17167670	17167901	.	+	0	gene_id "TGene2"; transcript_id "TGene2.Iso1.1880"; exon_number "1"; exon_id "TGene2.Iso1.1880.1";
chr7	bed2gtf	start_codon	17167670	17167672	.	+	0	gene_id "TGene2"; transcript_id "TGene2.Iso1.1880"; exon_number "1"; exon_id "TGene2.Iso1.1880.1";

showing that all transcripts have been correctly converted without errors. Please let me know if this clears up your questions or if I misinterpreted your problem.
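For readers comparing the .bed input with the exon lines in the output, the coordinate shift can be reproduced with a few lines of Python. This is a sketch of the standard BED12 to GTF conversion (0-based, half-open starts become 1-based, inclusive), not bed2gtf's actual code; `bed12_exons` is a hypothetical name.

```python
def bed12_exons(line):
    """Convert one BED12 line into GTF-style exon coordinates.

    BED starts are 0-based and ends are exclusive; GTF is 1-based and
    inclusive, so each exon start gets +1 while the end stays the same.
    Returns a list of (chrom, start, end, strand, name) tuples.
    """
    f = line.rstrip("\n").split("\t")
    chrom, chrom_start, name, strand = f[0], int(f[1]), f[3], f[5]
    sizes = [int(x) for x in f[10].split(",") if x]
    offsets = [int(x) for x in f[11].split(",") if x]
    return [
        (chrom, chrom_start + off + 1, chrom_start + off + size, strand, name)
        for size, off in zip(sizes, offsets)
    ]
```

For the first .bed line of the example (TGene1.Iso1.169041), this yields exons 42553597-42553680 and 42554538-42554643, matching the exon records in the GTF output.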

Regards!


shjenkins94 · 2023-09-19T08:08:49Z

shjenkins94
Sep 19, 2023

I think that's still the same problem, where you've got gene annotations that point to completely different locations.

Did you get "reg_991" being projected to different regions from TOGA results? From what I can tell from the make_query_isoforms the process is basically:

Split all transcripts into exons
Separate exons that aren't on the same strand or chromosome
Group transcripts with exons overlapping on the same strand and chromosome
Assign those grouped transcripts an id.

Seems like there shouldn't be a "reg" that combines transcripts that don't overlap. In fact that module only uses the query_annotation bed file so it shouldn't have any info on the original target genes.
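The grouping steps listed above can be sketched in Python with a sweep over sorted exons and a small union-find. This is an illustration of the idea only, not the code in make_query_isoforms; the function name is hypothetical, exon coordinates are assumed to be half-open, and each transcript is assumed to lie on a single chromosome and strand.

```python
from collections import defaultdict


def group_by_exon_overlap(exons):
    """Group transcripts whose exons overlap on the same chrom and strand.

    exons: list of (chrom, strand, start, end, transcript_id), half-open.
    Returns a sorted list of sorted transcript-id lists, one per group.
    """
    parent = {}

    def find(x):
        # follow parent links to the group representative (path halving)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    by_locus = defaultdict(list)
    for chrom, strand, start, end, tid in exons:
        parent.setdefault(tid, tid)
        by_locus[(chrom, strand)].append((start, end, tid))

    for intervals in by_locus.values():
        intervals.sort()
        cur_end, cur_tid = None, None
        for start, end, tid in intervals:
            if cur_end is not None and start < cur_end:
                # exon overlaps the current cluster: merge the transcripts
                union(cur_tid, tid)
                cur_end = max(cur_end, end)
            else:
                # gap: start a new cluster
                cur_end, cur_tid = end, tid

    groups = defaultdict(list)
    for tid in parent:
        groups[find(tid)].append(tid)
    return sorted(sorted(g) for g in groups.values())
```

Because exons are only compared within one (chromosome, strand) bucket, transcripts on different chromosomes or strands can never end up in the same group under these assumptions, which is the behaviour expected of a "reg" id.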


MichaelHiller · 2023-09-19T16:29:00Z

MichaelHiller
Sep 19, 2023
Maintainer

The transcript to gene assignment in the query happens via 'same strand exon overlap'.

I agree, in the example above, reg_991 is on 3 different chroms (chr13, chr16, chr1), which are all separate loci by definition.
@kirilenkobm I think they should get different query gene IDs (??). Can you pls have a look?


shjenkins94 · 2023-09-20T10:38:20Z

shjenkins94
Sep 20, 2023

@MichaelHiller @kirilenkobm Could you please confirm the relationship between the query transcripts included in different files?

From what I can tell:

loss_summ_data.tsv is basically the most inclusive file, every query transcript and its status is included
query_annotations.bed has a subset of the query transcripts in loss_summ_data.tsv. Since stuff like grouping transcripts into genes happens after it's created it potentially has a bunch of unwanted transcripts.
query_isoforms.tsv has a subset of the transcripts in query_annotation.bed. It only has query transcripts that are Intact, Partially Intact, or Uncertain Loss, and is the result of grouping overlapping same-strand transcripts and assigning each group an ID.
orthology_classification.tsv has a subset of the transcripts in query_isoforms.tsv. It connects reference genes, reference transcripts, query genes, and query transcripts. It has almost the same set of query transcripts as query_isoforms.tsv, but it does a thing where it tries to split many2many orthologous classifications which results in some transcripts being discarded.

Seems like if the goal is say, a gtf of intact genes predicted by TOGA then query_annotation.bed needs to be filtered and processed using orthology_classification.tsv.

Meanwhile query_annotation.bed needs some extra processing to convert it into say, a gff that includes pseudogenes.
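The filtering step suggested above could look roughly like the Python sketch below. It is an assumption-laden illustration: `filter_bed_by_orthology` is a hypothetical helper, and the column name `q_transcript` is assumed for the query transcript column of orthology_classification.tsv (adjust it to the actual header of your file).

```python
import csv


def filter_bed_by_orthology(bed_lines, orthology_tsv, column="q_transcript"):
    """Keep only BED lines whose name appears in the orthology table.

    bed_lines:     iterable of tab-separated BED lines (name in 4th column).
    orthology_tsv: iterable of lines of a tab-separated table with a header.
    column:        header of the query-transcript column (an assumption).
    """
    reader = csv.DictReader(orthology_tsv, delimiter="\t")
    keep = {row[column] for row in reader}

    kept = []
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3] in keep:
            kept.append(line)
    return kept
```

The returned lines can be written to a new .bed file and passed on to the conversion step, so that only transcripts with an orthology classification reach the final annotation.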


kirilenkobm · 2023-09-20T10:46:15Z

kirilenkobm
Sep 20, 2023
Maintainer

Hi @shjenkins94

Yes, that's correct
Yep, query_annotations.bed contains only those transcripts, that can be theoretically annotated. For example, if all exons of a query transcript A are deleted, then there is nothing to annotate - this query transcript (projection) appears in the loss_summ_data.tsv with the L label, but is absent in the query_annotations.bed
Yes, TOGA tries to group them into "query genes", similar to reference genes and isoforms. From TOGA version 1.1.5 on, they have the TOGA_ prefix; before that, they were labeled with the reg_ prefix.
Yes.


kirilenkobm · 2023-09-20T10:48:11Z

kirilenkobm
Sep 20, 2023
Maintainer

@MichaelHiller
That's interesting... by definition, reg_XXX (or TOGA_XXXX in the future ) groups query transcripts that have intersecting exons on the same strand. I'll need to check what exactly happened.

@alejandrogzi
could you pls attach the respective query_annotations.bed and loss_summ_data.tsv files? I will debug the script that assigns these reg_XXX labels.


alejandrogzi · 2023-09-20T20:48:46Z

alejandrogzi
Sep 20, 2023
Author

@MichaelHiller @kirilenkobm @shjenkins94,

Sorry for the delayed response. I made a mistake when extracting chromosomes for that example (an awk syntax misunderstanding), completely my fault. I double-checked not only those examples but the whole dataset those transcripts came from, and everything is OK (only intersecting queries are grouped together). This is the corrected example:

t_gene	t_transcript	q_gene	q_transcript	orthology_class	chrom
ENSG00000172365	ENST00000641342	reg_991	ENST00000641342.10630	many2many	chr25
ENSG00000172365	ENST00000641342	reg_14588	ENST00000641342.22739	many2many	chr16
ENSG00000172365	ENST00000641342	reg_14628	ENST00000641342.37699	many2many	chr16
ENSG00000280709	ENST00000628444	reg_991	ENST00000628444.81896	many2many	chr25
ENSG00000284609	ENST00000641139	reg_991	ENST00000641139.4896	many2many	chr25

Apologies for any confusion created!

Regarding @shjenkins94's issue:

gene annotations that point to completely different locations [...] transcripts in completely different locations are grouped together as one gene.

bed2gtf will annotate those 'correctly' (based on what is given as input). It was initially meant to offer a faster and simpler alternative to the C binaries (with additional functionality that combines very well with TOGA post-processing). More steps could be added (e.g. checking whether all transcripts grouped under a gene come from the same chromosome), but I think that goes beyond the original scope (it is something more TOGA-specific). On the other hand, I think this could be built on this idea:

[...] maybe a post-TOGA pipeline could be done; however, users may need a lot of different custom ways to use it, can be tricky [...] but I think some guidelines in this aspect should be done [...].

and part of the way to start is mentioned above:

Seems like if the goal is say, a gtf of intact genes predicted by TOGA then query_annotation.bed needs to be filtered and processed using orthology_classification.tsv. [...] Meanwhile query_annotation.bed needs some extra processing to convert it into say, a gff that includes pseudogenes.

I have made some progress compiling all of my scripts into a single pipeline, but as I said earlier, some guidelines should be agreed on first (more like primary goals or aspects to cover). I would definitely be excited to contribute to this cause!


MichaelHiller · 2023-12-14T20:23:45Z

MichaelHiller
Dec 14, 2023
Maintainer

Hi @alejandrogzi

we have tested bed2gtf on a TOGA run and get warnings/error messages like
Isoform ENST00000347757.SSX5.4667 not found. Check your isoforms file

This is because the transcript (more specifically, the projection) is labeled as lost.
Here are 3 such cases
grep ENST00000351606 query_isoforms.tsv loss_summ_data.tsv
loss_summ_data.tsv:PROJECTION ENST00000351606.GSG1.4 L
loss_summ_data.tsv:TRANSCRIPT ENST00000351606.GSG1 L
grep ENST00000329454 query_isoforms.tsv loss_summ_data.tsv
loss_summ_data.tsv:PROJECTION ENST00000329454.SRARP.3 L
loss_summ_data.tsv:TRANSCRIPT ENST00000329454.SRARP L
grep ENST00000347757 query_isoforms.tsv loss_summ_data.tsv
query_isoforms.tsv:reg_16646 ENST00000347757.SSX5.264411
loss_summ_data.tsv:PROJECTION ENST00000347757.SSX5.4667 L
loss_summ_data.tsv:PROJECTION ENST00000347757.SSX5.264411 UL
loss_summ_data.tsv:TRANSCRIPT ENST00000347757.SSX5 UL

We annotated lost transcripts, because it tells you where remnants of once functional genes are located.

Could these warnings maybe be disabled with a flag that a user can set?
Thx
Michael


alejandrogzi · 2023-12-14T23:08:35Z

alejandrogzi
Dec 14, 2023
Author

Hi @MichaelHiller,

Thanks for reaching out and for using bed2gtf!

As far as I understand, you are using query_isoforms.tsv as the input isoforms file for bed2gtf. Since L projections are not included, bed2gtf is throwing errors about missing isoforms. However, these transcripts are, indeed, included in the .bed file and need to be in the output .gtf file. You can quickly fix this by including those missing projections.

Assuming you have this template:

ENSG00000067445 | ENST00000420798.TRO
ENSG00000123427 | ENST00000300209.EEF1AKMT3
ENSG00000127314 | ENST00000250559.RAP1B
ENSG00000151322 | ENST00000548645.NPAS3

and this is your loss_summ_data.tsv :

PROJECTION | ENST00000420798.TRO.12 | UL
PROJECTION | ENST00000300209.EEF1AKMT3.432 | I
PROJECTION | ENST00000250559.RAP1B.76 | L
PROJECTION | ENST00000548645.NPAS3.456 | UL

You would just need to split each projection in loss_summ_data.tsv at the last '.' and merge both tables. With that, virtually all transcripts would be covered. This may be cumbersome because it requires some extra steps. bed2gtf annotates an entry in the .bed file only if it appears in the isoforms file. I am thinking that it could be possible to ignore the error (when a transcript key does not have a gene value) and skip creating a gene line for that transcript (the remaining features: transcript, exon/CDS, start/stop codon, 5'/3' UTR would still be annotated). That would avoid the error message and simply pass through transcripts that can't be mapped. Please let me know if this answers your question or if I misunderstood something, Dr. Hiller.
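The merge described above can be sketched in a few lines of Python. This is a minimal illustration rather than a finished tool: `projections_to_genes` is a hypothetical helper that takes the template as a transcript -> gene dictionary and the PROJECTION ids from loss_summ_data.tsv, and strips the trailing chain id with `rsplit`.

```python
def projections_to_genes(template, projections):
    """Map TOGA projection ids to genes via the reference isoforms template.

    template:    dict transcript_id -> gene_id (from the isoforms template).
    projections: iterable of projection ids such as 'ENST....NAME.123'.
    Returns (pairs, unmapped): a list of (gene_id, projection_id) tuples and
    a list of projection ids whose transcript is not in the template.
    """
    pairs, unmapped = [], []
    for projection in projections:
        # drop everything after the last '.', i.e. the chain id
        transcript = projection.rsplit(".", 1)[0]
        gene = template.get(transcript)
        if gene is None:
            unmapped.append(projection)
        else:
            pairs.append((gene, projection))
    return pairs, unmapped
```

Writing `pairs` as tab-separated gene / projection lines gives an isoforms file that also covers the L projections, while `unmapped` lists anything that would still need attention.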

Regards,
Alejandro


shjenkins94 · 2024-03-21T17:52:09Z

shjenkins94
Mar 21, 2024

Hello all, I was testing out postoga on my TOGA results recently and ran into an issue where the bed2gff step fails if there are any transcripts ending with -1.

I'm posting about it here because it seems like those are somehow different from most of the other transcripts. Is it fine to just modify postoga to handle them like the other transcripts or should something else be done?
