Learning the Sheynkman Lab LRP pipeline & troubleshooting automation. It is VERY ACTIVELY being modified. If you are using this as a guide, please contact Emily Watts ([email protected]) for assistance.
If you are in the Sheynkman Lab, my most recent LRP run can be found here. It contains all the correct file paths for Dockers and programs stored on Rivanna.
I also want to add my own scripts and modify the original scripts to reflect any updates made since they were written.
I have organized the modules with numbers indicating the order in which to run them. Modules that can be run at the same stage have the same numbers.
I'm working on adding ChatGPT summaries to each step for a quick explanation of the scripts.
The generic scripts in this repository assume that your working directory is organized as shown below and that your raw data is in a folder called 00_input_data inside it:
mkdir ./00_environments/
mkdir ./00_input_data/
mkdir ./00_scripts/
mkdir ./01_isoseq/
mkdir ./01_isoseq/01_filter/
mkdir ./01_isoseq/02_lima/
mkdir ./01_isoseq/03_refine/
mkdir ./01_isoseq/04_cluster/
mkdir ./01_isoseq/05_align/
mkdir ./01_isoseq/06_collapse/
mkdir ./01_reference_tables/
mkdir ./02_make_gencode_database/
mkdir ./02_sqanti/
mkdir ./03_filter_sqanti/
mkdir ./04_CPAT/
mkdir ./04_six_frame_translation/
mkdir ./04_transcriptome_summary/
mkdir ./05_orf_calling/
mkdir ./06_refine_orf_database/
mkdir ./07_accession_mapping/
mkdir ./07_make_cds_gtf/
mkdir ./08_rename_cds_to_exon/
mkdir ./09_sqanti_protein/
mkdir ./10_5p_utr/
mkdir ./11_protein_classification/
mkdir ./12_protein_gene_rename/
mkdir ./13_protein_filter/
mkdir ./14_protein_hybrid_database/
mkdir ./15_MS_file_convert/
mkdir ./16_MetaMorpheus/
mkdir ./16_MetaMorpheus/gencode/
mkdir ./16_MetaMorpheus/hybrid/
mkdir ./16_MetaMorpheus/filtered/
mkdir ./16_MetaMorpheus/refined/
mkdir ./17_peptide_analysis/
mkdir ./17_track_visualization/
mkdir ./17_protein_group_comparison/
mkdir ./17_novel_peptides/
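The same layout can also be created in one pass. This is a minimal sketch, not one of the repository's scripts: it relies on `mkdir -p`, which creates parent folders such as 01_isoseq and 16_MetaMorpheus automatically and does not fail if a folder already exists, so it is safe to re-run. It assumes you run it from your working directory.

```shell
# Sketch: build the expected directory layout from the working directory.
# mkdir -p creates missing parents and ignores folders that already exist.
for d in \
  00_environments 00_input_data 00_scripts \
  01_isoseq/01_filter 01_isoseq/02_lima 01_isoseq/03_refine \
  01_isoseq/04_cluster 01_isoseq/05_align 01_isoseq/06_collapse \
  01_reference_tables \
  02_make_gencode_database 02_sqanti \
  03_filter_sqanti \
  04_CPAT 04_six_frame_translation 04_transcriptome_summary \
  05_orf_calling \
  06_refine_orf_database \
  07_accession_mapping 07_make_cds_gtf \
  08_rename_cds_to_exon \
  09_sqanti_protein \
  10_5p_utr \
  11_protein_classification \
  12_protein_gene_rename \
  13_protein_filter \
  14_protein_hybrid_database \
  15_MS_file_convert \
  16_MetaMorpheus/gencode 16_MetaMorpheus/hybrid \
  16_MetaMorpheus/filtered 16_MetaMorpheus/refined \
  17_peptide_analysis 17_track_visualization \
  17_protein_group_comparison 17_novel_peptides
do
  mkdir -p "./$d"
done
```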
Each module lists the required modules and either has a .yml file to create the environment needed (eventually all will have these) or instructs you on how to create the environment.
- raw_reads.ccs.bam from your PacBio data
- primers.fasta from your PacBio data
- from Gencode:
  - gencode_gtf - Comprehensive gene annotation (regions: CHR): gencode.v38.annotation.gtf
  - gencode_transcript_fasta - Protein-coding transcript sequences (regions: CHR): gencode.v38_pc_transcripts.fa
  - gencode_translation_fasta - Protein-coding transcript translation sequences (regions: CHR): gencode.v38_pc_translations.fa
  - genome_fasta - Genome sequence, primary assembly (GRCh38) (regions: PRI): GRCh38.primary_assembly.genome.fa
- Human_Hexamer.tsv reference file
- Human_logitModel.RData reference file
- Optional: kallisto.tsv from your data
- Optional (for Modules 15-17): MS search files (.raw)
- Optional (for Modules 16-17): UniProt reviewed.fasta from UniProt database
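Before starting a run, it can help to confirm that the required (non-optional) files listed above are in place. The sketch below is only an illustration: `check_inputs` is a hypothetical helper, not part of this repository, and it assumes that all required files sit in a single folder under exactly the names listed above. If your reference files live elsewhere (for example in 01_reference_tables), adjust the folder or split the list accordingly.

```shell
# check_inputs DIR: hypothetical helper that reports which required input
# files are missing from DIR. Returns 0 if all are present, 1 otherwise.
# Assumption: every required file is stored in DIR under the names below.
check_inputs() {
  dir="$1"
  missing=0
  for f in \
    raw_reads.ccs.bam \
    primers.fasta \
    gencode.v38.annotation.gtf \
    gencode.v38_pc_transcripts.fa \
    gencode.v38_pc_translations.fa \
    GRCh38.primary_assembly.genome.fa \
    Human_Hexamer.tsv \
    Human_logitModel.RData
  do
    if [ ! -e "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call: missing files are listed on stderr.
check_inputs ./00_input_data || echo "some required inputs were not found"
```

The optional files (kallisto.tsv, the MS .raw files, and the UniProt FASTA) are deliberately left out of the check, since a run can proceed without them.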