R
- https://cran.r-project.org/src/base/R-4/
RStudio
- https://www.rstudio.com/products/rstudio/download/
recount3_1.10.2 | data.table_1.14.8 | slam_0.1-50 | stringr_1.5.0 |
readr_2.1.4 | dplyr_1.1.2 | tibble_3.2.1 | tidyr_1.3.0 |
- Some R packages have their own dependencies. Please install them before running the workflow.
Go to Recount3's Study Explorer to browse the data and choose a study or list of studies of interest. The workflow will require the proper project name(s), as shown in the Study Explorer.
This is a full workflow script in which users provide one or more project names from Recount3 and obtain normalized and filtered splice junction counts and meta data.
- Project_Directory: The path to where the results will be written out
- Project_Name_List: A comma-separated list of one or more Recount3 project names.
- Expr_CutPoint & SampleN_CutPoint: Numeric filtering criteria for the junction data (defaults: Expr_CutPoint=1 and SampleN_CutPoint=10).
- When applied, this filters the junction matrix and keeps only junctions with an expression level above Expr_CutPoint (default 1) in SampleN_CutPoint (default 10) or more samples.
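The cut-point rule above can be sketched in base R. This is a minimal illustration on a made-up matrix, not the code from the workflow script; the function name `filter_junctions` and the toy values are assumptions made for the example.

```r
# Toy junction matrix: rows are junctions, columns are samples (made-up values)
jxn_mat <- matrix(
  c(0, 2, 3,
    5, 0, 0,
    2, 2, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("jxnA", "jxnB", "jxnC"), c("S1", "S2", "S3"))
)

# Hypothetical helper: keep junctions expressed above Expr_CutPoint
# in at least SampleN_CutPoint samples
filter_junctions <- function(mat, Expr_CutPoint = 1, SampleN_CutPoint = 10) {
  # For each junction, count the samples whose expression exceeds the cut point
  n_samples_above <- rowSums(mat > Expr_CutPoint)
  mat[n_samples_above >= SampleN_CutPoint, , drop = FALSE]
}

# A smaller sample cut point is used here because the toy matrix has 3 samples
filtered <- filter_junctions(jxn_mat, Expr_CutPoint = 1, SampleN_CutPoint = 2)
print(rownames(filtered))
```

With these toy values, jxnB is dropped because only one sample exceeds the expression cut point.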
- Set up environment to download Recount3 data, load in libraries and functions, and check user inputs
- Download project data from Recount3
- Normalize junction data
- Filter out junctions that are found in GTEx and Bone Marrow samples
- Filter out junctions that have an expression of zero across all samples
- Filter out junctions according to the criteria given as user input
- Generate BED files for each junction matrix output
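As a rough sketch of the download step, the recount3 package can build a junction-level object for a single project. The helper name `fetch_junction_data` is hypothetical and this is not necessarily how the workflow script does it. The call needs an internet connection and the recount3 and SummarizedExperiment packages.

```r
# Hypothetical helper, not taken from the workflow script
fetch_junction_data <- function(project_name) {
  if (!requireNamespace("recount3", quietly = TRUE) ||
      !requireNamespace("SummarizedExperiment", quietly = TRUE)) {
    stop("Please install 'recount3' and 'SummarizedExperiment' first.")
  }
  # List the available (human) projects; this queries the recount3 servers
  projects <- recount3::available_projects()
  proj_info <- projects[projects$project == project_name &
                          projects$project_type == "data_sources", , drop = FALSE]
  # create_rse() expects exactly one project row
  stopifnot(nrow(proj_info) == 1)
  # type = "jxn" requests exon-exon junction counts
  rse <- recount3::create_rse(proj_info, type = "jxn")
  list(
    counts = SummarizedExperiment::assay(rse, 1),
    meta = as.data.frame(SummarizedExperiment::colData(rse))
  )
}
```

The function is only defined here, not called, so nothing is downloaded until a project name from the Study Explorer is passed to it.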
- Junction Filter Script Log: A log file tracking the number of junctions at each filtering step and the names of the junction output files
- Meta Data: Meta data extracted from the Recount3 project data
- Junction Expression Matrices: The junction counts extracted after each filtering step
- Junction Name Annotation: Junction name annotation data for each junction matrix output
- BED Files: BED files denoting the average junction expression and total samples expressed for each junction matrix output
Some project data, for example TCGA, has a mix of sample types (e.g. Tumor, Normal, Metastatic). This script allows the user to subset the junction data output from SpliceJxn_Workflow by a column of choice from the meta data.
- Junction_File: A junction matrix output from SpliceJxn_Workflow
- Meta_File: Meta data from the same project as the Junction_File
- Column_Name: A desired column name from the meta data to subset the junction data based on (e.g. "tcga.cgc_sample_sample_type")
- Load library and read in data
- Subset a junction and meta data table based on each unique category from the desired column input
- Write out subset data
- Subset Junction & Meta data file: For each unique category, a junction file and a meta data file will be written to the same folder as the input data.
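The subsetting logic can be illustrated with a small base R sketch. The table layouts, the `sample_id` column, the helper name `subset_by_column`, and the output file names are assumptions made for this example; the real files may be organized differently.

```r
# Toy meta data and junction table (made-up values)
meta <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4"),
  sample_type = c("Tumor", "Normal", "Tumor", "Metastatic"),
  stringsAsFactors = FALSE
)
jxn <- data.frame(
  junction = c("jxnA", "jxnB"),
  S1 = c(1, 0), S2 = c(4, 2), S3 = c(0, 3), S4 = c(5, 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one junction/meta pair per unique value of Column_Name
subset_by_column <- function(jxn, meta, Column_Name, id_col = "sample_id") {
  groups <- split(meta, meta[[Column_Name]])
  lapply(groups, function(meta_sub) {
    # Keep the junction name column plus the samples in this category
    keep <- intersect(meta_sub[[id_col]], colnames(jxn))
    list(jxn = jxn[, c(colnames(jxn)[1], keep), drop = FALSE], meta = meta_sub)
  })
}

subsets <- subset_by_column(jxn, meta, "sample_type")

# Write each subset out; a temporary directory stands in for the input folder
out_dir <- tempdir()
for (grp in names(subsets)) {
  write.table(subsets[[grp]]$jxn, file.path(out_dir, paste0("jxn_", grp, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
  write.table(subsets[[grp]]$meta, file.path(out_dir, paste0("meta_", grp, ".txt")),
              sep = "\t", quote = FALSE, row.names = FALSE)
}
```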
This script converts a splice junction expression file to a BED file. This is already performed in the SpliceJxn_Workflow.R script.
- JxnFile: A junction matrix output from SpliceJxn_Workflow
- AnnoFile: Junction name annotation data from the same filtering step and project as the JxnFile
- OutFile: A desired output file name
- Load libraries and read in data
- Split the junction name into separate columns (Chromosome, Start, Stop, Strand)
- Merge annotation information from the junction name annotation file (this denotes if the junction is known or unknown)
- Calculate the average junction expression for each junction and the sum of the total samples expressing each junction
- Write out BED files
- Average Expression BED File: Junction position information, annotation information, and average expression for each junction
- Sample Count Expressing BED File: Junction position information, annotation information, and total samples expressing each junction
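The conversion steps above can be sketched in base R. This example assumes junction names of the form `chr:start-end:strand`, treats a sample as expressing a junction when its value is greater than zero, and copies coordinates as-is; all three are assumptions for illustration, as are the column names and toy values.

```r
# Toy junction matrix and annotation table (made-up values)
jxn <- data.frame(
  junction = c("chr1:100-200:+", "chr2:500-900:-"),
  S1 = c(0, 4), S2 = c(6, 2), S3 = c(0, 0),
  stringsAsFactors = FALSE
)
anno <- data.frame(
  junction = c("chr2:500-900:-", "chr1:100-200:+"),
  annotated = c("unknown", "known"),
  stringsAsFactors = FALSE
)

# Split the junction name into chromosome, start, stop, and strand
pattern <- "^([^:]+):([0-9]+)-([0-9]+):(.)$"
matches <- regmatches(jxn$junction, regexec(pattern, jxn$junction))
parts <- do.call(rbind, matches)[, 2:5, drop = FALSE]

# Per-junction summaries across samples
expr <- as.matrix(jxn[, -1])
avg_expr <- unname(rowMeans(expr))
n_expressing <- unname(rowSums(expr > 0))

# Shared position and annotation columns
bed_base <- data.frame(
  chrom = parts[, 1],
  start = as.integer(parts[, 2]),
  end = as.integer(parts[, 3]),
  name = anno$annotated[match(jxn$junction, anno$junction)],
  stringsAsFactors = FALSE
)
bed_avg <- cbind(bed_base, score = avg_expr, strand = parts[, 4])
bed_count <- cbind(bed_base, score = n_expressing, strand = parts[, 4])

# Write both tables as tab-separated files without a header row
write.table(bed_avg, file.path(tempdir(), "avg_expr.bed"),
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
write.table(bed_count, file.path(tempdir(), "sample_count.bed"),
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
```

A regular expression is used instead of splitting on `:` and `-` because a minus strand would otherwise be consumed as a separator.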