genome requirement with --pseudo_aligner salmon and --skip_alignment #688

Closed
4 tasks done
didillysquat opened this issue Aug 6, 2021 · 7 comments

Comments

@didillysquat

Check Documentation

I have checked the following places for your error:
I have checked both of these and looked through the introduction to see which steps might require the genome.

Description of the bug

When running the pipeline with --pseudo_aligner salmon --skip_alignment and providing a valid --transcript_fasta and --salmon_index, but neither --fasta nor --genome, the pipeline refuses to run and asks for a genome file: Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file.

Steps to reproduce

Steps to reproduce the behaviour:

  1. Command line: nextflow run nf-core/rnaseq --input woltering_samplesheet.csv --pseudo_aligner salmon --skip_alignment --transcript_fasta ../athal_transcriptome/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz --salmon_index ../athal_transcriptome/Arabidopsis_thaliana.TAIR10.cdna.all.fa.gz.index -profile docker
  2. See error: Genome fasta file not specified with e.g. '--fasta genome.fa' or via a detectable config file.

Expected behaviour

I would expect this specific route of the pipeline to be able to run without access to a genome, since running quantification with Salmon on the command line requires only the transcript fasta and the index.
I've asked to skip alignments (which would otherwise require the genome), so which other step in the pipeline requires the genome?

I would hope that the pipeline could run without access to the genome.

Log files

nextflow.log

Have you provided the following extra information/files:

  • The command used to run the pipeline
  • The .nextflow.log file

System

  • Hardware: remote server
  • Executor: run on command line
  • OS: Ubuntu
  • Version 20.04.2 LTS

Nextflow Installation

version 21.04.1

Container engine

  • Engine: Docker
  • version: 20.10.5

Additional context

@didillysquat didillysquat added the bug Something isn't working label Aug 6, 2021
@didillysquat
Author

I see now that it is required for the DESeq2 QC that is performed downstream of the salmon pseudo quantification.

@drpatelh
Member

Hi @didillysquat ! Apologies for the late response. I am on holiday at the moment.

It's actually required to build the decoy sequences for the Salmon index. If you have a genome fasta available I believe it's advisable to build the index with both the genome fasta and transcriptome fasta. I discussed this with @rob-p whilst adding Salmon support here.
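As a rough illustration of the decoy preparation mentioned above (a sketch under assumptions, not the pipeline's actual code): a decoy-aware Salmon index is typically built from a "gentrome", i.e. the transcript FASTA followed by the genome FASTA, together with a text file listing the genome sequence names. The helper name and output file names below are hypothetical, and uncompressed FASTA input is assumed.

```python
from pathlib import Path


def prepare_decoy_inputs(transcript_fasta, genome_fasta, out_dir):
    """Write decoys.txt and gentrome.fa for a decoy-aware Salmon index.

    Hypothetical helper for illustration; assumes uncompressed FASTA files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    decoys_path = out_dir / "decoys.txt"
    gentrome_path = out_dir / "gentrome.fa"

    # A decoy name is the first whitespace-delimited token of a genome header.
    decoy_names = []
    with open(genome_fasta) as genome:
        for line in genome:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    decoy_names.append(fields[0])
    decoys_path.write_text("".join(name + "\n" for name in decoy_names))

    # Transcripts are written first, genome (decoy) sequences last.
    with open(gentrome_path, "w") as out:
        for source in (transcript_fasta, genome_fasta):
            with open(source) as handle:
                for line in handle:
                    # Guard against a missing final newline between files.
                    out.write(line if line.endswith("\n") else line + "\n")

    return decoys_path, gentrome_path
```

The two outputs would then be handed to salmon index (gentrome.fa via -t and decoys.txt via -d). Without a genome FASTA there is nothing to list as decoys, which is why this step cannot run in the case reported here.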

Maybe we should also add support for instances where the genome fasta isn't available though as this issue highlights that particular edge case.

@drpatelh drpatelh reopened this Aug 10, 2021
@didillysquat
Author

Hi @drpatelh,

There is no hurry on this at all so please don't disrupt your holidays on my behalf.

For my particular case I'm using your wonderful pipeline as a quick but clean way to get a set of salmon pseudo quantification files from RNA-seq reads that I can then import into DESeq2.

I'm sure you're far more knowledgeable about this than I am, but I was simply following the guidance of the salmon tutorial, which worked with only an indexed transcriptome fasta (i.e. no genome). For this particular use case, it could perhaps be useful for the pipeline to detect that neither --genome nor --fasta has been provided and limit the output accordingly (i.e. no DESeq QC), with a warning saying that it is doing so (e.g. "no genome provided so skipping XXX").
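The behaviour suggested above could be sketched roughly as follows. This is only an illustration of the decision logic in Python, not how the pipeline (which is written in Nextflow) implements its parameter checks. The function name, the dict-based parameters, and the step names are all hypothetical; only the error message is taken from this issue.

```python
GENOME_ERROR = (
    "Genome fasta file not specified with e.g. '--fasta genome.fa' "
    "or via a detectable config file."
)


def plan_run(params):
    """Return (steps, warnings) for a run described by a dict of options.

    Hypothetical sketch: step names are placeholders, not real processes.
    """
    has_genome = bool(params.get("genome") or params.get("fasta"))
    pseudo_only = (
        bool(params.get("skip_alignment"))
        and params.get("pseudo_aligner") == "salmon"
    )
    has_transcriptome = bool(
        params.get("transcript_fasta") and params.get("salmon_index")
    )

    if has_genome:
        # Normal route: everything that needs the genome can run.
        return ["salmon_quant", "genome_dependent_steps"], []

    if pseudo_only and has_transcriptome:
        # No genome, but Salmon has all it needs: run a reduced workflow
        # and tell the user what was left out.
        warning = "no genome provided so skipping genome_dependent_steps"
        return ["salmon_quant"], [warning]

    # Any other route still needs the genome.
    raise ValueError(GENOME_ERROR)
```

With the options from the command in this report (no genome, --skip_alignment, --pseudo_aligner salmon, plus a transcript fasta and index), this sketch would return only the quantification step and a single warning instead of failing.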

Having said that, one extremely useful output from your pipeline (after running it providing the --genome information) is the txt2gene.txt file (called 'salmon_txt2gene.txt' in your pipeline) that maps the transcript IDs to the genes and allows the import of the salmon counts to DESeq2 using tximport. If appropriate, it could be useful to provide this in the main salmon output directory.
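For anyone without a genome annotation at hand, a transcript-to-gene table of this kind can sometimes be derived from the transcript FASTA itself. The sketch below is not how the pipeline produces its file. It assumes Ensembl-style cDNA headers in which the transcript ID is the first token and the gene is given in a "gene:" field, and it falls back to the transcript ID when that field is absent. The function names are hypothetical.

```python
import gzip
import re

# Matches "gene:AT1G01010" but not "gene_biotype:..." or "gene_symbol:...".
GENE_FIELD = re.compile(r"\bgene:(\S+)")


def tx2gene_from_fasta(fasta_path):
    """Return a list of (transcript_id, gene_id) pairs from FASTA headers.

    Assumes Ensembl-style headers; plain or gzip-compressed input.
    """
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    pairs = []
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            fields = line[1:].split()
            if not fields:
                continue
            transcript_id = fields[0]
            match = GENE_FIELD.search(line)
            # Fall back to the transcript ID if no gene field is present.
            gene_id = match.group(1) if match else transcript_id
            pairs.append((transcript_id, gene_id))
    return pairs


def write_tx2gene(pairs, out_path):
    """Write the pairs as a two-column, tab-separated file."""
    with open(out_path, "w") as out:
        for transcript_id, gene_id in pairs:
            out.write(f"{transcript_id}\t{gene_id}\n")
```

The resulting two-column table (transcript ID first, gene ID second) has the layout that tximport expects for its tx2gene argument. The IDs must match the transcript names used in the Salmon index.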

Thanks for your continued efforts!

@drpatelh drpatelh added this to the 3.4 milestone Sep 22, 2021
@drpatelh
Member

drpatelh commented Oct 4, 2021

Hi @didillysquat ! I was going to have a go at adding this feature for the 3.4 release but it will take quite a bit of refactoring, so maybe we can do it in 3.5.

I have, however, added the functionality for the pipeline to be able to publish the salmon_tx2gene.txt files in the salmon counts directory here.

@drpatelh drpatelh modified the milestones: 3.4, 3.5 Oct 4, 2021
@drpatelh drpatelh added enhancement and removed bug Something isn't working labels Oct 4, 2021
@didillysquat
Author

@drpatelh Super! Many thanks for that.

@drpatelh drpatelh modified the milestones: 3.5, 3.6 Dec 13, 2021
@drpatelh drpatelh modified the milestones: 3.6, 3.7 Feb 20, 2022
@drpatelh drpatelh modified the milestones: 3.7, 3.8 Apr 26, 2022
@drpatelh drpatelh modified the milestones: 3.9, 3.10 Sep 25, 2022
@drpatelh drpatelh modified the milestones: 3.10, 3.11 Dec 16, 2022
@drpatelh drpatelh self-assigned this May 7, 2023
@drpatelh drpatelh modified the milestones: 3.12, 3.13 Jun 2, 2023
@drpatelh drpatelh modified the milestones: 3.15.0, 3.16.0 May 29, 2024
@pinin4fjords
Member

I'm finally addressing this in #1490.

But I've also noted that Salmon indices should generally be built with genomic FASTA decoys, so this isn't actually recommended unless you're sure you know what you're doing.

@pinin4fjords
Member

Addressed by #1490
