Long Read Proteogenomics Pipeline Learning & Troubleshooting

This repository documents my work learning the Sheynkman Lab Long Read Proteogenomics (LRP) pipeline and troubleshooting its automation. It is VERY ACTIVELY being modified. If you are using it as a guide, please contact Emily Watts ([email protected]) for assistance.

If you are in the Sheynkman Lab, my most recent LRP run can be found here. It contains all the correct file paths for Dockers and programs stored on Rivanna.

This repository is for me to house original files & scripts

I also want to add my own scripts and modify the original scripts to reflect any updates made since the originals were written.
I have numbered the modules to indicate the order in which to run them. Modules that can be run at the same stage share the same number.
I'm working on adding ChatGPT summaries to each step for a quick explanation of the scripts.

Make file structure in your working directory to make this pipeline run easily

The generic scripts in this repository assume that your working directory is organized as shown below and that your raw data is in a folder called 00_input_data inside that working directory.

mkdir ./00_environments/
mkdir ./00_input_data/
mkdir ./00_scripts/
mkdir ./01_isoseq/
mkdir ./01_isoseq/01_filter/
mkdir ./01_isoseq/02_lima/
mkdir ./01_isoseq/03_refine/
mkdir ./01_isoseq/04_cluster/
mkdir ./01_isoseq/05_align/
mkdir ./01_isoseq/06_collapse/
mkdir ./01_reference_tables/
mkdir ./02_make_gencode_database/
mkdir ./02_sqanti/
mkdir ./03_filter_sqanti/
mkdir ./04_CPAT/
mkdir ./04_six_frame_translation/
mkdir ./04_transcriptome_summary/
mkdir ./05_orf_calling/
mkdir ./06_refine_orf_database/
mkdir ./07_accession_mapping/
mkdir ./07_make_cds_gtf/
mkdir ./08_rename_cds_to_exon/
mkdir ./09_sqanti_protein/
mkdir ./10_5p_utr/
mkdir ./11_protein_classification/
mkdir ./12_protein_gene_rename/
mkdir ./13_protein_filter/
mkdir ./14_protein_hybrid_database/
mkdir ./15_MS_file_convert/
mkdir ./16_MetaMorpheus/
mkdir ./16_MetaMorpheus/gencode/
mkdir ./16_MetaMorpheus/hybrid/
mkdir ./16_MetaMorpheus/filtered/
mkdir ./16_MetaMorpheus/refined/
mkdir ./17_peptide_analysis/
mkdir ./17_track_visualization/
mkdir ./17_protein_group_comparison/
mkdir ./17_novel_peptides/
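
If you re-run the setup in a directory where some folders already exist, plain `mkdir` reports an error for each existing folder. A minimal sketch of an idempotent alternative for the nested folders is shown here; the function name `make_lrp_subdirs` is hypothetical and not part of the original scripts, and only the `01_isoseq` and `16_MetaMorpheus` subfolders are covered.

```shell
# Hypothetical helper (not part of the original repository).
# Creates the nested 01_isoseq and 16_MetaMorpheus folders listed above.
# mkdir -p creates missing parent directories and does not fail
# when a directory already exists, so the function is safe to re-run.
make_lrp_subdirs() {
    base="${1:-.}"   # working directory; defaults to the current directory
    for sub in 01_filter 02_lima 03_refine 04_cluster 05_align 06_collapse; do
        mkdir -p "$base/01_isoseq/$sub"
    done
    for sub in gencode hybrid filtered refined; do
        mkdir -p "$base/16_MetaMorpheus/$sub"
    done
}
```

Calling `make_lrp_subdirs .` from your working directory creates these subfolders; the remaining top-level folders can be created the same way with `mkdir -p`.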

Load modules and environment

Each pipeline module lists the software modules it requires. It either provides a .yml file for creating the environment it needs (eventually all modules will have one) or gives instructions for creating that environment.

Input files for running this pipeline

- raw_reads.ccs.bam from your PacBio data
- primers.fasta from your PacBio data
- from Gencode:
  - gencode_gtf - Comprehensive gene annotation (regions: CHR) gencode.v38.annotation.gtf
  - gencode_transcript_fasta - Protein-coding transcript sequences (regions: CHR) gencode.v38_pc_transcripts.fa
  - gencode_translation_fasta - Protein-coding transcript translation sequences (regions: CHR) gencode.v38_pc_translations.fa
  - genome_fasta - Genome sequence, primary assembly (GRCh38) (regions: PRI) GRCh38.primary_assembly.genome.fa
- Human_Hexamer.tsv reference file
- Human_logitModel.RData reference file
- Optional: kallisto.tsv from your data
- Optional (for Modules 15-17): MS search files.raw
- Optional (for Modules 16-17): UniProt reviewed.fasta from UniProt database
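
Before starting a run, it can help to confirm that the required inputs are in place. The following is a minimal sketch, assuming you pass the folder and the file names yourself; the function name `check_inputs` is hypothetical and not part of the original scripts, and where you keep the reference files is up to your own setup.

```shell
# Hypothetical helper (not part of the original repository).
# Usage: check_inputs DIR FILE [FILE...]
# Example: check_inputs ./00_input_data raw_reads.ccs.bam primers.fasta
# Prints each file that is missing or empty and returns a non-zero
# status if any file fails the check.
check_inputs() {
    dir="$1"
    shift
    missing=0
    for f in "$@"; do
        # -s is true only when the file exists and has a size greater than zero
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

The function only checks that files exist and are non-empty; it does not validate their contents or formats.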

efwatts/LRP_Troubleshooting

Folders and files

Latest commit

History

Repository files navigation

Long Read Proteogenomics Pipeline Learning & Troubleshooting

This repository is for me to house original files & scripts

Make file structure in your working directory to make this pipeline run easily

Load modules and environment

Input files for running this pipeline

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages