From de5a267a4077dac3109b24ce3284b3042d098fab Mon Sep 17 00:00:00 2001
From: Eli Levy Karin <35374203+elileka@users.noreply.github.com>
Date: Tue, 11 Jun 2024 09:30:40 +0200
Subject: [PATCH] Update README.md

Add to GFF explanation
---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index db94366..e1392aa 100644
--- a/README.md
+++ b/README.md
@@ -175,7 +175,7 @@ Of note, for simplicity, MetaEuk considers only ATG as a start for this scan.

 ##### The MetaEuk GFF:

-In addition to writing a Fasta file, MetaEuk writes a GFF file. Please note that GFF is not perfectly suitable for MetaEuk because MetaEuk doesn't predict non-coding regions. This means that the MetaEuk gene starts and ends where the first and last codons could be matched. The gene and mRNA categories are the same in the MetaEuk GFF. The exon and CDS coordinates will be the same unless a small target overlap was allowed, due to which, the MetaEuk exon was shortened (see above). In this case, the CDS will report the shortening. In the sixth column you can find their individual bitsocres. The contig index starts at 1 and the start coordinate is always smaller than the end coordinate, as required by GFF. The last column contains the **TCS** identifier, followed by the low_coord of the prediction to support searching for sub-optimal exon sets (see section). Here is an example where a MetaEuk header of two exons is reported in GFF format:
+In addition to writing a Fasta file, MetaEuk writes a GFF file. Please note that GFF is not perfectly suitable for MetaEuk because MetaEuk doesn't predict non-coding regions. This means that by default the MetaEuk gene starts and ends where the first and last codons could be matched (or slightly padded if `--len-scan-for-start` is set to be positive, see section). The gene and mRNA categories are the same in the MetaEuk GFF (if `--len-scan-for-start` is set to be positive, these fields will reflect the padding, as explained). The exon and CDS coordinates will be the same unless a small target overlap was allowed, due to which, the MetaEuk exon was shortened (see above). In this case, the CDS will report the shortening. In the sixth column you can find their individual bitscores. Unlike MetaEuk's native report in the Fasta header, the contig index starts at 1 and the start coordinate is always smaller than the end coordinate, as required by GFF. The last column contains the **TCS** identifier, followed by the low_coord of the prediction to support searching for sub-optimal exon sets (see section). Here is an example where a MetaEuk header of two exons is reported in GFF format:

 *>protein_acc|contig_acc|-|508|1.15e-150|2|100|911|911[911]:582[582]:330[330]|501[501]:100[100]:402[402]*