UniRef gene-family-level pangenome building and annotation

This tool provides a pipeline that annotates input genome sequences, clusters them into UniRef90/UniRef50 gene families, and clusters the remaining unknown coding sequences. The output is a ready-to-use PanPhlAn pangenome: it contains the contigs of all genomes in a multi-FASTA file, precomputed bowtie2 indexes, and a pangenome tsv file mapping gene locations on contigs.

Pipeline

  1. Prokka is run on each provided genome to annotate it
  2. Using the UniRef annotator and the UniRef DIAMOND database, sequences are assigned UniRef90 and UniRef50 IDs
  3. The remaining sequences (those not mapped by the UniRef annotator) are clustered together at the same thresholds (90% and 50% similarity) and receive UniRef90_UNK and UniRef50_UNK (unknown) IDs
  4. The PanPhlAn pangenome is then generated: the contigs of all genomes are concatenated, the tsv mapping file is written, and the bowtie2 indexes are built.

Dependencies

The following Python packages are needed:

  • BioPython
  • bcbio-gff
  • gffutils

The external tools called by the pipeline (such as Prokka, DIAMOND and bowtie2, which the steps above rely on) should be installed and available on the PATH.

In addition, the UniRef DIAMOND databases should be downloaded with the download_databases.py script.
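
Before a long run, it can help to confirm that the dependencies are actually visible from the current environment. The sketch below is not part of the pipeline; the helper name `missing_dependencies` is hypothetical, and the import names (`Bio`, `BCBio.GFF`, `gffutils`) and executable names (`prokka`, `diamond`, `bowtie2-build`) are assumptions about a typical installation that may need adjusting.

```python
import importlib.util
import shutil

# Assumed import names for BioPython, bcbio-gff and gffutils.
PYTHON_MODULES = ["Bio", "BCBio.GFF", "gffutils"]
# Assumed executable names; the README does not list them explicitly.
EXTERNAL_TOOLS = ["prokka", "diamond", "bowtie2-build"]


def missing_dependencies(modules=PYTHON_MODULES, tools=EXTERNAL_TOOLS):
    """Return (missing_modules, missing_tools) as two lists of names."""
    missing_modules = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. BCBio) is absent or broken.
            found = False
        if not found:
            missing_modules.append(name)
    # shutil.which returns None when the executable is not on the PATH.
    missing_tools = [tool for tool in tools if shutil.which(tool) is None]
    return missing_modules, missing_tools
```

Calling `missing_dependencies()` with no arguments returns two lists; both are empty when everything in the assumed lists was found.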

Usage

python panphlan_exporter.py --input [input_genomes_folder]          \
                            --output [output_pangenome_folder]      \
                            --db_path [path_to_UniRef_DIAMOND_databases]

  • The --input [input_genomes_folder] should contain one FASTA file per genome. The script assumes that the file name is the genome name
  • The --output [output_pangenome_folder] is created if it does not exist
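
Since genome names are taken from file names, it may be worth previewing them before launching the pipeline. The following is a minimal sketch, not the script's own logic: the helper `genome_names` is hypothetical, and both the accepted extensions and the idea that the name is the file name without its extension are assumptions.

```python
from pathlib import Path

# Assumed FASTA extensions; the README does not say which ones are accepted.
FASTA_SUFFIXES = {".fa", ".fasta", ".fna"}


def genome_names(input_folder):
    """Map each genome name (file name without extension) to its FASTA path."""
    names = {}
    for path in sorted(Path(input_folder).iterdir()):
        if not path.is_file() or path.suffix.lower() not in FASTA_SUFFIXES:
            continue  # skip sub-folders and non-FASTA files
        if path.stem in names:
            # e.g. strainA.fa and strainA.fna would map to the same name
            raise ValueError(f"duplicate genome name: {path.stem}")
        names[path.stem] = path
    return names
```

Printing the keys of the returned dictionary shows the names that would end up in the pangenome under these assumptions.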

Additional parameters can be provided:

  • -t or --tmp specifies another directory for temporary files. Default is the output folder
  • -c or --clade_name specifies a prefix for the PanPhlAn output files, ideally the full species name (e.g. Escherichia_coli). Default is panplhan_clade
  • -n or --nprocs specifies the number of threads to use.
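
When the pipeline is launched from another Python script, the options above can be assembled into an argument list. This is a sketch only: `build_command` is a hypothetical helper, it uses just the long flags documented here, and it assumes panphlan_exporter.py is in the current working directory.

```python
import sys


def build_command(input_dir, output_dir, db_path,
                  tmp=None, clade_name=None, nprocs=None):
    """Build the argument list for panphlan_exporter.py (not executed here)."""
    cmd = [
        sys.executable, "panphlan_exporter.py",
        "--input", str(input_dir),
        "--output", str(output_dir),
        "--db_path", str(db_path),
    ]
    # Optional flags are added only when a value is given,
    # so the script's own defaults apply otherwise.
    if tmp is not None:
        cmd += ["--tmp", str(tmp)]
    if clade_name is not None:
        cmd += ["--clade_name", str(clade_name)]
    if nprocs is not None:
        cmd += ["--nprocs", str(nprocs)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; passing a list rather than a single string avoids shell quoting issues with folder names that contain spaces.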

N.B.: If the output folder is already a PanPhlAn pangenome folder (containing the 8 or 9 files of a PanPhlAn pangenome: 1 fna, 1 pangenome tsv, 6 index files and 1 optional annotation file), then the pangenome generated by the pipeline will extend the existing one.
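
Because an existing pangenome in the output folder gets extended rather than replaced, checking the folder beforehand can avoid surprises. The sketch below is an approximation of such a check, not the pipeline's own test: `looks_like_pangenome` is a hypothetical helper, it assumes the index files use the standard bowtie2 `.bt2` extension, and since the extension of the optional annotation file is not stated here, it only requires at least one `.tsv` file.

```python
from pathlib import Path


def looks_like_pangenome(folder):
    """Heuristic: True if the folder holds 1 fna, 6 .bt2 and at least 1 tsv file."""
    folder = Path(folder)
    if not folder.is_dir():
        return False
    names = [p.name for p in folder.iterdir() if p.is_file()]
    n_fna = sum(name.endswith(".fna") for name in names)
    n_tsv = sum(name.endswith(".tsv") for name in names)
    # A small bowtie2 index consists of six .bt2 files
    # (.1, .2, .3, .4, .rev.1 and .rev.2).
    n_index = sum(name.endswith(".bt2") for name in names)
    return n_fna == 1 and n_tsv >= 1 and n_index == 6
```

If this returns True for the intended output folder, the run will most likely extend that pangenome; choose a fresh folder to build a new one.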