all transcripts annotated as "non_coding" after isoseq/pigeon #729

kmattioli opened this issue Nov 21, 2024



Operating system

  Operating System: CentOS Linux 7 (Core)
       CPE OS Name: cpe:/o:centos:centos:7
            Kernel: Linux 3.10.0-1160.119.1.el7.x86_64
      Architecture: x86-64

Package name

isoseq 4.2.0 (commit v4.2.0)

  pbbam     : 2.7.0 (commit v2.7.0)
  pbcopper  : 2.6.0 (commit v2.6.0)
  pbmm2     : 1.16.0 (commit v1.16.0)
  minimap2  : 2.26
  parasail  : 2.1.3
  boost     : 1.81
  htslib    : 1.17
  zlib      : 1.2.13

pigeon 1.3.0 (commit -v1.3.0)

  pbbam     : 2.7.0 (commit v2.7.0)
  pbcopper  : 2.6.0 (commit v2.6.0)
  boost     : 1.81
  htslib    : 1.17
  zlib      : 1.2.13

Conda environment

Describe the bug
After running the Isoseq/Pigeon pipeline, all of my transcripts are annotated as non_coding in the _classification.txt output (as well as in the filtered classification output). This happens even for transcripts from protein-coding genes.
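
One way to confirm the symptom is to tally the values in the coding column of the classification file. The snippet below is only a sketch: `count_coding` is a hypothetical helper name, and it assumes the file is tab-separated with a header row containing a column named exactly `coding` (as in SQANTI3-style classification tables).

```shell
#!/usr/bin/env bash
# Tally the values found in the "coding" column of a classification file.
# Assumptions: tab-separated input, first line is a header, and one header
# field is named exactly "coding". count_coding is a hypothetical helper.
count_coding() {
  awk -F'\t' '
    NR == 1 {
      # Locate the "coding" column by name rather than by position.
      for (i = 1; i <= NF; i++) if ($i == "coding") col = i
      if (!col) { print "no coding column found" > "/dev/stderr"; exit 1 }
      next
    }
    { counts[$col]++ }
    END { for (k in counts) printf "%s\t%d\n", k, counts[k] }
  ' "$1" | sort
}

# Example call (paths taken from the pipeline in this issue):
#   count_coding "${PIGEON_OUT}/${OUT_PREFIX}_classification.txt"
```

If every row reports non_coding, the only line printed is the non_coding count, which matches the behavior described here.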

To Reproduce
I followed the CLI example in the docs, using Gencode v39. The relevant code (after demuxing) is below:

ls ${LIMA_OUT}/*.fl.*.bam | xargs -n 1 basename > ${LIMA_OUT}/fl.fofn

isoseq refine --require-polya $LIMA_FILE $CONCAT_PRIMER_FILE $REFINE_FILE --log-level FATAL

isoseq cluster2 $REFINE_FILE ${CLUSTER_OUT}/${OUT_PREFIX}.clustered.bam --use-qvs --log-level INFO -j 8

pbmm2 align --preset ISOSEQ --sort ${CLUSTER_OUT}/${OUT_PREFIX}.clustered.bam $GENOME_FA ${ALIGN_OUT}/${OUT_PREFIX}.mapped.bam --log-level FATAL 2> ${ALIGN_OUT}/${OUT_PREFIX}.minimap2.log

isoseq collapse ${ALIGN_OUT}/${OUT_PREFIX}.mapped.bam $REFINE_FILE ${COLLAPSE_OUT}/${OUT_PREFIX}.collapsed.gff --log-level FATAL

pigeon prepare ${COLLAPSE_OUT}/${OUT_PREFIX}.collapsed.gff --log-level FATAL

pigeon classify ${COLLAPSE_OUT}/${OUT_PREFIX}.collapsed.sorted.gff $GENCODE_SORTED $GENOME_FA --fl ${COLLAPSE_OUT}/${OUT_PREFIX}.collapsed.flnc_count.txt -d ${PIGEON_OUT} -o ${OUT_PREFIX} --cage-peak $CAGE_PEAK_SORTED --poly-a $POLYA_LIST --coverage $COVERAGE_SORTED --log-level FATAL

pigeon filter ${PIGEON_OUT}/${OUT_PREFIX}_classification.txt --isoforms ${COLLAPSE_OUT}/${OUT_PREFIX}.collapsed.sorted.gff --log-level FATAL

pigeon report ${PIGEON_OUT}/${OUT_PREFIX}_classification.filtered_lite_classification.txt ${PIGEON_OUT}/${OUT_PREFIX}.saturation.txt --log-level FATAL

Expected behavior
I should get many coding transcripts and annotated ORFs! I am not sure what step is causing the issue. Let me know and I'd be happy to provide some files to reproduce, but not sure where to start...

Hi, just pinging this issue and adding a bit more context -- it looks like in the previous iterations where I've run this pipeline, SQANTI3 produced the output gtf file with CDS information (corrected.gtf.cds.gff), but the same isn't happening with pigeon classify. Any ideas? Do I have to use SQANTI3 directly to get CDS information?


Magdoll commented Dec 9, 2024

Hi @kmattioli, thanks for using pigeon! SQANTI3 (the academic tool that pigeon is based on) includes GENEMARK for ORF prediction, which is not included in pigeon. As a result, pigeon marks all transcripts as non-coding, because no ORF information is available to it.

