efwatts/LRP_Troubleshooting


Long Read Proteogenomics Pipeline Learning & Troubleshooting

This repository documents my work learning the Sheynkman Lab LRP pipeline and troubleshooting its automation. It is being VERY ACTIVELY modified. If you are using this as a guide, please contact Emily Watts ([email protected]) for assistance.

If you are in the Sheynkman Lab, my most recent LRP run can be found here. It contains all the correct file paths for Dockers and programs stored on Rivanna.

This repository is for me to house original files & scripts

I also want to add my own scripts & modify original scripts to reflect any updates that have happened since it was written.
I have organized the modules with numbers indicating the order in which to run them. Modules that can be run at the same stage have the same numbers.
I'm working on adding ChatGPT summaries to each step for a quick explanation of the scripts.

Make a file structure in your working directory so the pipeline runs easily

The generic scripts in this repository assume that your working directory is organized as shown below and that your raw data is in a folder called 00_input_data inside that working directory.

mkdir ./00_environments/
mkdir ./00_input_data/
mkdir ./00_scripts/
mkdir ./01_isoseq/
mkdir ./01_isoseq/01_filter/
mkdir ./01_isoseq/02_lima/
mkdir ./01_isoseq/03_refine/
mkdir ./01_isoseq/04_cluster/
mkdir ./01_isoseq/05_align/
mkdir ./01_isoseq/06_collapse/
mkdir ./01_reference_tables/
mkdir ./02_make_gencode_database/
mkdir ./02_sqanti/
mkdir ./03_filter_sqanti/
mkdir ./04_CPAT/
mkdir ./04_six_frame_translation/
mkdir ./04_transcriptome_summary/
mkdir ./05_orf_calling/
mkdir ./06_refine_orf_database/
mkdir ./07_accession_mapping/
mkdir ./07_make_cds_gtf/
mkdir ./08_rename_cds_to_exon/
mkdir ./09_sqanti_protein/
mkdir ./10_5p_utr/
mkdir ./11_protein_classification/
mkdir ./12_protein_gene_rename/
mkdir ./13_protein_filter/
mkdir ./14_protein_hybrid_database/
mkdir ./15_MS_file_convert/
mkdir ./16_MetaMorpheus/
mkdir ./16_MetaMorpheus/gencode/
mkdir ./16_MetaMorpheus/hybrid/
mkdir ./16_MetaMorpheus/filtered/
mkdir ./16_MetaMorpheus/refined/
mkdir ./17_peptide_analysis/
mkdir ./17_track_visualization/
mkdir ./17_protein_group_comparison/
mkdir ./17_novel_peptides/
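
The same layout can also be created with a loop, which is safe to re-run because `mkdir -p` does not fail on directories that already exist. This is only a sketch: `WORKDIR` is a hypothetical variable introduced here (it defaults to the current directory), and the directory names are copied from the commands above.

```shell
#!/usr/bin/env bash

# WORKDIR is an assumed, illustrative variable: where the layout is created.
# It defaults to the current directory, matching the mkdir commands above.
WORKDIR="${WORKDIR:-.}"

# Directory names copied from the list above. Parent directories such as
# 01_isoseq are created automatically by mkdir -p.
dirs=(
  00_environments 00_input_data 00_scripts
  01_isoseq/01_filter 01_isoseq/02_lima 01_isoseq/03_refine
  01_isoseq/04_cluster 01_isoseq/05_align 01_isoseq/06_collapse
  01_reference_tables
  02_make_gencode_database 02_sqanti
  03_filter_sqanti
  04_CPAT 04_six_frame_translation 04_transcriptome_summary
  05_orf_calling
  06_refine_orf_database
  07_accession_mapping 07_make_cds_gtf
  08_rename_cds_to_exon
  09_sqanti_protein
  10_5p_utr
  11_protein_classification
  12_protein_gene_rename
  13_protein_filter
  14_protein_hybrid_database
  15_MS_file_convert
  16_MetaMorpheus/gencode 16_MetaMorpheus/hybrid
  16_MetaMorpheus/filtered 16_MetaMorpheus/refined
  17_peptide_analysis 17_track_visualization
  17_protein_group_comparison 17_novel_peptides
)

for d in "${dirs[@]}"; do
  mkdir -p "$WORKDIR/$d"
done
```

Run it from your working directory, or set `WORKDIR` first to build the layout somewhere else.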

Load modules and environment

Each pipeline module lists the software modules it requires and either includes a .yml file for creating the environment it needs (eventually all will have one) or explains how to create that environment.

Input files for running this pipeline

  • raw_reads.ccs.bam from your PacBio data
  • primers.fasta from your PacBio data
  • from Gencode:
    • gencode_gtf - Comprehensive gene annotation (regions: CHR) gencode.v38.annotation.gtf
    • gencode_transcript_fasta - Protein-coding transcript sequences (regions: CHR) gencode.v38_pc_transcripts.fa
    • gencode_translation_fasta - Protein-coding transcript translation sequences (regions: CHR) gencode.v38_pc_translations.fa
    • genome_fasta - Genome sequence, primary assembly (GRCh38) (regions: PRI) GRCh38.primary_assembly.genome.fa
  • Human_Hexamer.tsv reference file
  • Human_logitModel.RData reference file
  • Optional: kallisto.tsv from your data
  • Optional (for Modules 15-17): MS search files (.raw)
  • Optional (for Modules 16-17): UniProt reviewed.fasta from UniProt database
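
Before starting a run, it can help to confirm that the required inputs listed above are in place. The sketch below assumes the files sit directly in 00_input_data under exactly the names given in the list; adjust the names if yours differ. `check_inputs`, `required_inputs`, and `INPUT_DIR` are hypothetical names introduced for this example, and the optional files are not checked.

```shell
#!/usr/bin/env bash

# Required input file names, copied from the list above.
# Assumption: they are stored directly in 00_input_data under these names.
required_inputs=(
  raw_reads.ccs.bam
  primers.fasta
  gencode.v38.annotation.gtf
  gencode.v38_pc_transcripts.fa
  gencode.v38_pc_translations.fa
  GRCh38.primary_assembly.genome.fa
  Human_Hexamer.tsv
  Human_logitModel.RData
)

# check_inputs DIR
# Prints one line to stderr for each required file that is missing or empty.
# Returns 0 if every file is present and non-empty, 1 otherwise.
check_inputs() {
  local dir="$1" missing=0 f
  for f in "${required_inputs[@]}"; do
    if [ ! -s "$dir/$f" ]; then
      echo "MISSING: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# INPUT_DIR is an assumed variable; it defaults to the folder named above.
if check_inputs "${INPUT_DIR:-./00_input_data}"; then
  echo "All required inputs found."
else
  echo "Some required inputs are missing; see the messages above."
fi
```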
