Converting TOGA results into different formats #136
Replies: 25 comments
-
Hi Alejandro, Thx |
Beta Was this translation helpful? Give feedback.
-
Hi @alejandrogzi @MichaelHiller, Could you please remind me of those limitations? I think the best strategy for including the package would be to:
Thanks! |
Beta Was this translation helpful? Give feedback.
-
By the way, I took a quick look at the code and everything looks good to me. Even though I'm not familiar with Rust, everything was easy to follow :). |
Beta Was this translation helpful? Give feedback.
-
@MichaelHiller , @kirilenkobm , Sorry for the delay response (I am 7h behind you).
Yes, Michael. This tool is the exact reimplementation of In a first pipeline I did (merging make_lastz_chains + TOGA + postprocessing) I needed to handle: 1) downloading binaries, 2) downloading refTables, 3) cutting genePred files if they had just numbers as chr, 4) modifying refTables if I wanted to do custom rearrangements, 5) looping through the output gtf to add a "gene" line feature, etc (a lot of steps really). I developed bed2gtf to overcome all of this in 1 step, just by providing the .bed file, a .txt/.tsv/.csv file with isoforms (gene - transcript, making sure that all transcripts in the .bed file appears here) and the path to the output .gtf. The main limitations of bed2gtf are: 1) output is not sorted (which I did not implement here but I am developing another tool for that also in Rust gtfsort. 2) does not provide more than gene_ids in the gene-feature line (e.g. missing gene_biotype, ...), that could be solved by just adding that info in the isoforms file and then just append it to the resultant line but I am not sure if that is necessary.
Bogdan, I could adjust all the info in the bed2gtf repo to fit within TOGA wiki (a shorter version just explaining functionality). Please let me know, and thanks for the support! (just and undergrad here trying to find opportunities!) Best, |
Beta Was this translation helpful? Give feedback.
-
I've added the following code snippet to the configure script (it will be available in Version 1.1.5 once it's complete): if $BUILD_BED2GTF; then
if ! $OVERRIDE && [[ -f "./bed2gtf/some_check_file" ]]
then
printf "bed2gtf installation found\n"
else
printf "bed2gtf installation not found, cloning and building\n"
git submodule init bed2gtf
git submodule update bed2gtf
# ACTUAL BUILD COMMAND HERE
fi
fi This script is executed when the configure script is called with the Regarding building the project, I have a question. For CESAR2.0, the build process is simple: My question might seem naïve, but I'm not familiar with the Rust build system. Would the best approach be to first check if cargo is installed, and if it is, navigate to the bed2gtf directory and run |
Beta Was this translation helpful? Give feedback.
-
Hi @kirilenkobm, Nice! In rust you have a range of scenarios. Start checking if cargo is installed is highly recommended I think, and handles potential posterior errors in the pipeline. If cargo is installed, we initialize and fetch the submodule data; after that, we need to On the other hand, Maybe the script could be updated to this: if $BUILD_BED2GTF; then
if command -v cargo &> /dev/null; then
if ! $OVERRIDE && [[ -f "./bed2gtf/Cargo.toml" ]]; then
printf "bed2gtf installation found\n"
# these statements should be avoided, right? In overriding
# situations we would expect binaries already build
if [[ -f "./bed2gtf/target/release/bed2gtf" ]]; then
printf "bed2gtf binary found\n"
else
printf "bed2gtf binary not found, building\n"
cd ./bed2gtf
cargo build --release
# locate binary at ./target/release/bed2gtf
fi
else
printf "bed2gtf installation not found, cloning and building\n"
git submodule init bed2gtf
git submodule update bed2gtf
cd ./bed2gtf
cargo build --release
# locate binary at ./target/release/bed2gtf
fi
else
printf "cargo not found, installing rust\n"
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# include bed2gtf build here
fi
fi |
Beta Was this translation helpful? Give feedback.
-
Thx a lot, Alejandro, this is a very useful addition, esp for downstream tools that need the gene feature. If the tool works and is properly benchmarked, I wonder if we should update all the geneAnnotation.gtf.gz files on @kirilenkobm On our "Output reading / query_annotation.bed" section, can you pls recommend bed2gtf to convert the bed to gtf? |
Beta Was this translation helpful? Give feedback.
-
Sorry to intrude. Coincidentally, I've been working on converting some R code I wrote to get a filtered GFF3 from TOGA output into python. It was based on what I needed it for (getting the best transcript for each gene) so it's less general purpose but does some stuff that could be useful for TOGA postprocessing, i.e. it adds orthology classifications and gene or transcript loss status, and the name of the original reference gene(s) as attributes. It also filters the transcripts for each gene to the most intact ones. I'm wondering if this might be useful as a further step after running bed2gtf. |
Beta Was this translation helpful? Give feedback.
-
This is great. I think it would be great if downstream processing scripts and tools could be added, as this will be useful for many users. If you agree to add your code to this repo, pls write your name / institution at the beginning of the code to make clear who developed it. Thx a lot !! |
Beta Was this translation helpful? Give feedback.
-
@shjenkins94 cool! I've thinking these days to add a feature to bed2gtf that allows specifying GTF version output (GTF3 as default). GTF3 and GFF3 are pretty similar so maybe both tools could easily work together! Additionally, there a lot of tools available to convert GTF to GFF but if we need a custom tool to make the bridge between bed2gtf and your tool, we could join efforts. @MichaelHiller during the post-processing procedure I have develop a lot of scripts, while writing them I always thought that maybe a post-TOGA pipeline could be done; however, users may need a lot of different custom ways to use it, can be tricky. Your python AssemblyStats it is a good example of that. I will try to group all of them and work on a simple pipeline to show you (these are written in python but I think some guidelines in this aspect should be done, a lot of different scripts in a diverse range of programming languages could be problematic). Regards! |
Beta Was this translation helpful? Give feedback.
-
@shjenkins94, that sounds great! However, it might make more sense if you could create a separate repository for the post-processing scripts and procedures, complete with some documentation. I can then include it as a submodule and reference it in the README. This way, versioning will be better managed, and concerns will be more cleanly separated. What are your thoughts? |
Beta Was this translation helpful? Give feedback.
-
Also a good idea. But ideally we have something were many people can contribute. Nice to see that the community is using TOGA and contributing to downstream steps !! |
Beta Was this translation helpful? Give feedback.
-
@kirilenkobm A subrepo sounds like a good idea (working in the main TOGA repo can get a bit bulky) It would also be nice to clarify a few things about TOGA results. For example, @alejandrogzi I noticed when I tried out bed2gtf that it currently fails on my TOGA output because there are some projections that are included in the final query_annotations.bed file but not the query_isoforms.tsv file. Looking at the loss_summ_data.tsv, query_annotaion.bed, and query_isoforms.tsv files, it seems like only the most intact projections for a transcript are actually assigned to a gene. I was wondering if the rest of the projections are still useful, as well as wether they should be connected to the same gene as their more intact counterparts. The bed file also seems to include projections of different reference transcripts that are identical on the query genome. Depending on the goal it might be good to filter these as well. |
Beta Was this translation helpful? Give feedback.
-
@shjenkins94 Yes! As stated in the bed2gtf repo it is mandatory for all the isoforms in the .bed file to appear in the isoforms file, that is how the tool maps gene features.
That was a problem for me at some moment. Michael, Bogdan and I have an issue commenting that here: #70. I can add a feature to detect if a transcript is not present in the isoforms file, bed2gtf (latest release) just throws an error explaining that it tried to use a transcript as a key to find its respective gene and did not find it. The 'brute' way around to solve that issue I found was using Something like this simplifies the work: import pandas as pd
# Reads input files: bed (3rd column with transcript ids), isoforms (base input file), output
bed = pd.read_csv("~/toga_results/query_annotation.bed",
sep="\t",
header=None,
usecols=[3],
names=["transcripts"])
isoforms = "~/toga_results/temp/isoforms.tsv"
output = "~/toga_results/output.txt"
# Fast search of a given transcript using a simple dictionary (transcript-gene
# based on the isoforms.tsv)
iso_dict = {}
with open(isoforms, "r") as isoforms_file:
for line in isoforms_file:
if line.startswith("gene"):
continue
else:
gene, transcript = line.strip().split("\t")
iso_dict[transcript] = gene
# Maps j transcript (splitting it by ".") to the helper dictionary and
# writes a gene-toga_transcript pair to the output file
with open(output, "w") as output_file:
for transcript in bed.transcripts:
gene = iso_dict.get(str(transcript).split(".")[0], "not found")
output_file.write(f"{gene}\t{transcript}\n") With the new file bed2gtf ~/toga_results/query_annotation.bed ~/toga_results/output.txt ~/toga_results/query_annotation.gtf or inside a Rust project with: use bed2gtf::*;
let gtf = bed2gtf(bed, isoforms, output); On the other hand, we can solve the issue by making TOGA write all the transcripts in the .bed to an isoforms output mapping them not only to the TOGA regions but to gene ids. Hope that helps! |
Beta Was this translation helpful? Give feedback.
-
@alejandrogzi I think the issue with using the original isoforms file is that it won't work well for genes that aren't one2one. This might help, I did a thing post-processing where I combined the orthology_classification, gene_loss_summ, and query_isoform, and annotation bed file to get a table of the target gene, the target transcript, the query gene, the query transcript, the orthology class, the loss type of the target transcript, the loss type of the query transcript, and the chromosome that the query transcript is on in the bed file. Here's an example of some simplified many2many results, where 2 target genes were orthologous to 2 query genes on different chromosomes. The last line is a transcript that's included in the query annotation file but not the query_isoforms file (NA for q_gene)
If TGene1 and TGene2 are used as gene ids then the problem with the orphan query transcript is fixed but transcripts in completely different locations are grouped together as one gene. |
Beta Was this translation helpful? Give feedback.
-
@shjenkins94 thank you for providing a specific example! Since your transcripts ids are formatted differently, I just modified a small part of the code I provided earlier to: ### ...
with open(output, "w") as output_file:
for transcript in bed.transcripts:
t = ".".join(str(transcript).split(".")[:-1]) # Allows precise mapping of toga-transcripts
gene = iso_dict.get(t, "not found")
output_file.write(f"{gene}\t{transcript}\n") Testing it with your data produces: // iso_dict (note that transcripts are used as keys)
{'TGene1.Iso1': 'TGene1', 'TGene1.Iso2': 'TGene1', 'TGene2.Iso1': 'TGene2'}
// written output
TGene1 TGene1.Iso1.169041
TGene1 TGene1.Iso1.220
TGene1 TGene1.Iso2.220
TGene2 TGene2.Iso1.169026
TGene2 TGene2.Iso1.1880
TGene1 TGene1.Iso2.169041 Running bed2gtf with that output file will write a specific block for each transcript without any overwrite (since each toga transcript is unique). I look up for a similar example with some data I can quickly access:
Here 1) the same target transcript is mapped to different chromosomes, 2) different target genes are projected over the same query gene (reg_991) in different chromosomes. All of those transcripts are correctly annotated in the output gtf. Furthermore, I just create a .bed with random coordinates using your example:
This is the output: //TGene1
chr10 bed2gtf gene 42553597 42554643 . + . gene_id "TGene1";
chr10 bed2gtf transcript 42553597 42554643 . + . gene_id "TGene1"; transcript_id "TGene1.Iso1.169041";
chr10 bed2gtf exon 42553597 42553680 . + . gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "1"; exon_id "TGene1.Iso1.169041.1";
chr10 bed2gtf CDS 42553597 42553680 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "1"; exon_id "TGene1.Iso1.169041.1";
chr10 bed2gtf exon 42554538 42554643 . + . gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "2"; exon_id "TGene1.Iso1.169041.2";
chr10 bed2gtf CDS 42554538 42554643 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "2"; exon_id "TGene1.Iso1.169041.2";
chr10 bed2gtf start_codon 42553597 42553599 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso1.169041"; exon_number "1"; exon_id "TGene1.Iso1.169041.1";
chr7 bed2gtf transcript 94578013 94579938 . - . gene_id "TGene1"; transcript_id "TGene1.Iso1.220";
chr7 bed2gtf exon 94578013 94578116 . - . gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "2"; exon_id "TGene1.Iso1.220.2";
chr7 bed2gtf CDS 94578016 94578116 . - 2 gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "2"; exon_id "TGene1.Iso1.220.2";
chr7 bed2gtf exon 94579311 94579938 . - . gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "1"; exon_id "TGene1.Iso1.220.1";
chr7 bed2gtf CDS 94579311 94579938 . - 0 gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "1"; exon_id "TGene1.Iso1.220.1";
chr7 bed2gtf start_codon 94579936 94579938 . - 0 gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "1"; exon_id "TGene1.Iso1.220.1";
chr7 bed2gtf stop_codon 94578013 94578015 . - 0 gene_id "TGene1"; transcript_id "TGene1.Iso1.220"; exon_number "2"; exon_id "TGene1.Iso1.220.2";
chr7 bed2gtf transcript 81275381 81276990 . + . gene_id "TGene1"; transcript_id "TGene1.Iso2.220";
chr7 bed2gtf exon 81275381 81275859 . + . gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "1"; exon_id "TGene1.Iso2.220.1";
chr7 bed2gtf CDS 81275381 81275859 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "1"; exon_id "TGene1.Iso2.220.1";
chr7 bed2gtf exon 81276854 81276990 . + . gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "2"; exon_id "TGene1.Iso2.220.2";
chr7 bed2gtf CDS 81276854 81276990 . + 1 gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "2"; exon_id "TGene1.Iso2.220.2";
chr7 bed2gtf start_codon 81275381 81275383 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso2.220"; exon_number "1"; exon_id "TGene1.Iso2.220.1";
chr10 bed2gtf transcript 12250690 12251844 . + . gene_id "TGene1"; transcript_id "TGene1.Iso2.169041";
chr10 bed2gtf exon 12250690 12251844 . + . gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";
chr10 bed2gtf CDS 12250690 12251841 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";
chr10 bed2gtf start_codon 12250690 12250692 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";
chr10 bed2gtf stop_codon 12251842 12251844 . + 0 gene_id "TGene1"; transcript_id "TGene1.Iso2.169041"; exon_number "1"; exon_id "TGene1.Iso2.169041.1";
//TGene2
chr10 bed2gtf gene 56789697 56789898 . + . gene_id "TGene2";
chr10 bed2gtf transcript 56789697 56789898 . + . gene_id "TGene2"; transcript_id "TGene2.Iso1.169026";
chr10 bed2gtf exon 56789697 56789898 . + . gene_id "TGene2"; transcript_id "TGene2.Iso1.169026"; exon_number "1"; exon_id "TGene2.Iso1.169026.1";
chr10 bed2gtf CDS 56789697 56789898 . + 0 gene_id "TGene2"; transcript_id "TGene2.Iso1.169026"; exon_number "1"; exon_id "TGene2.Iso1.169026.1";
chr10 bed2gtf start_codon 56789697 56789699 . + 0 gene_id "TGene2"; transcript_id "TGene2.Iso1.169026"; exon_number "1"; exon_id "TGene2.Iso1.169026.1";
chr7 bed2gtf transcript 17167670 17167901 . + . gene_id "TGene2"; transcript_id "TGene2.Iso1.1880";
chr7 bed2gtf exon 17167670 17167901 . + . gene_id "TGene2"; transcript_id "TGene2.Iso1.1880"; exon_number "1"; exon_id "TGene2.Iso1.1880.1";
chr7 bed2gtf CDS 17167670 17167901 . + 0 gene_id "TGene2"; transcript_id "TGene2.Iso1.1880"; exon_number "1"; exon_id "TGene2.Iso1.1880.1";
chr7 bed2gtf start_codon 17167670 17167672 . + 0 gene_id "TGene2"; transcript_id "TGene2.Iso1.1880"; exon_number "1"; exon_id "TGene2.Iso1.1880.1"; showing that all transcripts have been correctly converted without errors. Please let me know if this clear your questions or if I misinterpreted your problem. Regards! |
Beta Was this translation helpful? Give feedback.
-
I think that's still the same problem, where you've got gene annotations that point to completely different locations. Did you get "reg_991" being projected to different regions from TOGA results? From what I can tell from the make_query_isoforms the process is basically:
Seems like there shouldn't be a "reg" that combines transcripts that don't overlap. In fact that module only uses the query_annotation bed file so it shouldn't have any info on the original target genes. |
Beta Was this translation helpful? Give feedback.
-
The transcript to gene assignment in the query happens via 'same strand exon overlap'. I agree, in the example above, reg_991 is on 3 different chroms (chr13, chr16, chr1), which are all separate loci by definition. |
Beta Was this translation helpful? Give feedback.
-
@MichaelHiller @kirilenkobm Could you please confirm the relationship between the query transcripts included in different files? From what I can tell:
Seems like if the goal is say, a gtf of intact genes predicted by TOGA then query_annotation.bed needs to be filtered and processed using orthology_classification.tsv. Meanwhile query_annotation.bed needs some extra processing to convert it into say, a gff that includes pseudogenes. |
Beta Was this translation helpful? Give feedback.
-
Hi @shjenkins94
|
Beta Was this translation helpful? Give feedback.
-
@MichaelHiller @alejandrogzi |
Beta Was this translation helpful? Give feedback.
-
@MichaelHiller @kirilenkobm @shjenkins94, Sorry for the delay response. I made a mistake extracting chromosomes for that example, some awk syntax misunderstanding, completely my fault. I double checked not only those examples but the whole data where I got those transcripts from, and everything is ok! (only intersecting queries grouped together). This is the corrected example:
Apologies for any confusion created! Regarding to @shjenkins94 issue:
bed2gtf will annotate those 'correctly' (based on what is given as input). Initially was thought to offer a faster and simplified way over C binaries (including additional functionality; which combines very well with TOGA post-processing) and since more steps (e.g. checking if all transcripts grouped are coming from the same chromosome) could be added I think this goes beyond of the original scope (something more TOGA-specific). On the other hand, I think this could be worked over this idea:
and part of the way to start is mentioned above:
I have some advanced compiling all of my scripts in a single pipeline but as I said earlier: some guidelines in this aspect should be done before (more like primary goals or aspects to cover). I definitely would be excited to contribute to this cause! |
Beta Was this translation helpful? Give feedback.
-
we have tested bed2gtf on a TOGA run and get warnings/error messages like This is because the transcript (more specifically, the projection) is labled as lost. We annotated lost transcripts, because it tells you where remnants of once functional genes are located. Could these warnings maybe be disabled with a flag that a user can set? |
Beta Was this translation helpful? Give feedback.
-
Hi @MichaelHiller, Thanks for reaching out and for use bed2gtf! As far as I understand, you are using Assuming you have this template:
and this is your
You will just need to split by the last '.' for each projection in Regards, |
Beta Was this translation helpful? Give feedback.
-
Hello all, I was testing out postoga on my TOGA results recently and ran into an issue where the bed2gff step fails if there are any transcripts ending with -1. I'm posting about it here because it seems like those are somehow different from most of the other transcripts. Is it fine to just modify postoga to handle them like the other transcripts or should something else be done? |
Beta Was this translation helpful? Give feedback.
-
Hi @MichaelHiller & @kirilenkobm,
I just wanted to let you know that I developed this tool: bed2gtf few weeks ago to overcome some limitations using
bedToGenePred | genePredToGtf
approach. It is one of my first developments so it is not that fancy; however, I think that may be of some help to TOGA users. Since this is a very little change I did not want to make a pull request.Best,
Alejandro
Beta Was this translation helpful? Give feedback.
All reactions