Gene Fusion Peptide Sequence Extraction

This script processes gene fusion data from MSK-IMPACT and ARCHER. The workflow filters and parses the fusion data, extracts exon information, retrieves DNA sequences from a reference genome, and generates peptide sequences around the fusion breakpoints.

Author: Helen Xie

Installation

To run this script, you need R installed on your system. The script also calls bedtools getfasta (see Steps Performed), so bedtools must be installed as well. The following R packages are required:

data.table
dplyr
stringr
tidyr
Biostrings

You can install these packages using the following command:

install.packages(c("data.table", "dplyr", "stringr", "tidyr"))
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Biostrings")

Usage

Run the script from the command line as follows:

Rscript your_script.R data_sv.txt ucsc_refseq_hg19_mrna.txt hg19.fa

Command Line Arguments

data_sv.txt (first argument): The gene fusion data file from MSK-IMPACT.
ucsc_refseq_hg19_mrna.txt (second argument): The exon BED file for the hg19 genome.
hg19.fa (third argument): The reference genome FASTA file for hg19.
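As an illustration of how the three positional arguments might be read, here is a minimal sketch in base R. The helper name parse_args and the field names fusion_file, bed_file and genome_fasta are hypothetical and are not taken from the script itself.

```r
# Hypothetical helper: check that exactly three positional arguments were
# given and return them under descriptive names.
parse_args <- function(args) {
  if (length(args) != 3) {
    stop("Usage: Rscript your_script.R <data_sv.txt> <exon_bed.txt> <genome.fa>",
         call. = FALSE)
  }
  list(fusion_file = args[1], bed_file = args[2], genome_fasta = args[3])
}

# Inside a script, the arguments would normally come from the command line:
#   opts <- parse_args(commandArgs(trailingOnly = TRUE))
# Here the values from the Usage example are passed in directly.
opts <- parse_args(c("data_sv.txt", "ucsc_refseq_hg19_mrna.txt", "hg19.fa"))
print(opts$genome_fasta)
```

Failing early on a wrong argument count gives a clearer error than a later file-not-found message from one of the readers.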

Input Files

data_sv.txt: A file containing gene fusion data, including details on chromosomes, positions, and annotations.
ucsc_refseq_hg19_mrna.txt: A BED file containing exon information for the hg19 genome.
hg19.fa: A FASTA file with the hg19 reference genome sequence.
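To show how the BED file and the FASTA file relate, the sketch below formats one exon interval as a BED line and builds a bedtools getfasta command for it. The helper names, the coordinates, and the transcript name are invented for illustration, and the choice of the -s (strand-aware) and -tab (tab-delimited output) options is an assumption; the script may call bedtools differently.

```r
# Hypothetical helper: format one exon as a 6-column BED line.
# BED intervals are 0-based and half-open: [start, end).
bed_line <- function(chrom, start, end, name, strand) {
  sprintf("%s\t%d\t%d\t%s\t0\t%s",
          chrom, as.integer(start), as.integer(end), name, strand)
}

# Hypothetical helper: build (but do not run) a bedtools getfasta command.
# -fi is the input FASTA, -bed the intervals, -fo the output file.
getfasta_command <- function(genome_fasta, bed_file, out_file) {
  paste("bedtools getfasta -s -tab",
        "-fi", shQuote(genome_fasta),
        "-bed", shQuote(bed_file),
        "-fo", shQuote(out_file))
}

# Invented example interval; not taken from any real annotation.
exon <- bed_line("chr1", 1000, 1100, "NM_000000_exon_1", "+")
cmd <- getfasta_command("hg19.fa", "exons.bed", "exons.tsv")
cat(exon, "\n")
cat(cmd, "\n")
# The command could then be run with system(cmd) once exons.bed is written.
```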

Output

data_fusions.rds: An RDS file containing processed gene fusion data with the following columns:
- Sample_ID
- Site1_Chromosome
- Site2_Chromosome
- Site1_Position
- Site2_Position
- Site1_Description
- Site2_Description
- Class
- Event_Info
- Connection_Type
- Annotation
- First_Transcript_ID
- Second_Transcript_ID
- First_Gene
- Second_Gene
- First_Exons
- Second_Exons
- First_Sequence
- Second_Sequence
- Fusion_Sequence
- Peptide_Sequence
- Peptide_Breakpoint
- NetMHC_Peptide
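For downstream use, the RDS file can be loaded with readRDS. The sketch below writes a small stand-in file so that it runs on its own; the rows are invented, only a subset of the columns listed above is included, and it assumes that the output is a data frame and that fusions without a peptide carry NA in Peptide_Sequence.

```r
# Stand-in for data_fusions.rds with invented values (assumption: the real
# file holds a data frame or data.table with the columns listed above).
example <- data.frame(
  Sample_ID        = c("S1", "S2"),
  First_Gene       = c("GENE_A", "GENE_C"),
  Second_Gene      = c("GENE_B", "GENE_D"),
  Peptide_Sequence = c("MKTAYIAK", NA),
  stringsAsFactors = FALSE
)
rds_path <- tempfile(fileext = ".rds")
saveRDS(example, rds_path)

# Load the file and keep only fusions that produced a peptide.
fusions <- readRDS(rds_path)
with_peptide <- fusions[!is.na(fusions$Peptide_Sequence), ]
print(with_peptide[, c("Sample_ID", "First_Gene", "Second_Gene", "Peptide_Sequence")])
```

With the real output, rds_path would simply be "data_fusions.rds".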

Steps Performed

Read and Parse Fusion Data: Filters and processes fusion data to extract relevant information like transcript IDs, gene names, and exon ranges.
Read and Parse BED File: Processes the exon BED file to extract transcript IDs and exon numbers.
Retrieve DNA Sequences: Uses bedtools getfasta to extract DNA sequences corresponding to exon ranges from the reference genome.
Generate Peptide Sequences: Translates the DNA sequences into peptide sequences and identifies breakpoints in the fusion peptides.
Save Output: The processed data, including the fusion peptides and their surrounding sequences, is saved to data_fusions.rds for downstream analysis.
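The peptide generation step can be sketched as follows. This is a dependency-free illustration in base R, not the script's own code: the script presumably relies on Biostrings for translation, and the function names, the toy sequences, and the window size below are all assumptions. The breakpoint is taken here to be the first residue whose codon contains a base from the second partner.

```r
# Standard genetic code, with codons ordered T, C, A, G at each position
# (third base varies fastest).
bases <- c("T", "C", "A", "G")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
codon_table <- setNames(
  strsplit(paste0("FFLLSSSSYY**CC*W", "LLLLPPPPHHQQRRRR",
                  "IIIMTTTTNNKKSSRR", "VVVVAAAADDEEGGGG"), "")[[1]],
  paste0(grid$first, grid$second, grid$third)
)

# Translate a DNA string in frame 1; unknown codons become "X" and any
# trailing one or two bases are ignored.
translate_dna <- function(dna) {
  dna <- toupper(dna)
  n <- nchar(dna) %/% 3
  if (n == 0) return("")
  starts <- seq(1, by = 3, length.out = n)
  aa <- unname(codon_table[substring(dna, starts, starts + 2)])
  aa[is.na(aa)] <- "X"
  paste(aa, collapse = "")
}

# Residues around the breakpoint: `flank` residues before it, and `flank`
# residues starting at it (hypothetical windowing rule).
window_around <- function(peptide, breakpoint, flank) {
  substr(peptide, max(1, breakpoint - flank),
         min(nchar(peptide), breakpoint + flank - 1))
}

# Toy sequences, invented for illustration.
first_seq  <- "ATGGCCAAG"
second_seq <- "GGTTTCTGA"
fusion_seq <- paste0(first_seq, second_seq)

full_peptide <- translate_dna(fusion_seq)
peptide <- sub("\\*.*$", "", full_peptide)   # cut at the first stop codon
breakpoint <- nchar(first_seq) %/% 3 + 1     # 1-based residue index
junction_peptide <- window_around(peptide, breakpoint, flank = 2)
print(list(peptide = peptide, breakpoint = breakpoint,
           junction_peptide = junction_peptide))
```

When the first partner's length is not a multiple of three, the residue at the breakpoint index is encoded by bases from both genes, and everything downstream is read in a shifted frame relative to the second gene.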
