This repository comprises scripts to select CpGs from bulk DNA methylation data for subsequent analysis with scTAM-seq.
In general, three classes of CpGs are selected:
- Differentially Methylated CpGs (DMCs, in RnBeads)
- Intermediately Methylated CpGs (IMCs, in IMC)
- CpGs harboring Within-Sample Heterogeneity (WSH, in WSH)
Further scripts, such as those for annotating selected sites, selecting constitutively methylated and unmethylated sites, and amplicons without an HhaI cutsite, as well as validation and plotting functions can be found in scripts.
General configurations of the pipeline, such as the number of CpGs to be covered in the panel, can be provided in the file config.yaml.
As an example of the pipeline, we used the Kulis et al. data from the BLUEPRINT project (https://ega-archive.org/datasets/EGAD00001001304, https://doi.org/10.1038/ng.3291). We recommend using a standard bisulfite sequencing data processing pipeline, such as bismark for low-level data processing. Bismark output files can be used directly as input to RnBeads, the generated bam-files can be used to compute the PDR and qFDRP. Notably, the panel designed for our current publication on scTAM-seq was generated with an earlier version of the pipeline.
The pipeline mainly depends on RnBeads, which can be installed using:
source('https://rnbeads.org/data/install.R')
Furthermore, the following R packages are required: -yaml -data.table -argparse -ggsci
To facilitate installing, we provide a yaml file for easy installation with conda/bioconda
conda create -f selection_pipeline.yml -n selection_pipeline
The pipeline has been targeted to an HPC system operated by the SGE jobs scheduling system. For other HPC environments or a local installation, the scripts have to be changed accordingly.
For each of the categories of CpGs mentioned above, a dedicated script has been created:
- In RnBeads, you'll have to point in the
rnbOptions.xml
andscript.R
file to the downloaded data directory and adjust the file according to your system. See the RnBeads documentation for further information in case you want to modify the analysis. The examplesample_annotation.csv
may have to be adapted according to the filenames of the downloaded files. Lastly, the scriptselect_reliable_DMRs.R
will select sites that are specifically methylated in each of the populations. - To compute the WSH scores PDR and qFDRP, the bam-files generated by bismark are leveraged. Scripts to compute the scores are available from WSH and have to point to the corresponding BAM files and the RnBeads output file generated earlier.
- IMCs are defined as sites with intermediate methylation in the HSCs with low PDR scores. They can be determined using the
find_IMCs_low_PDR.R
in the IMC folder. - Variably methylated sites are selected based on those that are not IMCs (i.e. have high PDR), but also high qFDRP scores. For this, the script
filter_qFDRP_results.R
is responsible. - To include control regions (non-digested, constitutively methylated, constitutively unmethylated), corresponding scripts have been generated in scripts.
After successful execution of the scripts, you can combine differend DMRs from different panels and upload them to the Tapestri Designer tool (https://designer.missionbio.com/).
You can contact Michael Scherer for questions and suggestions.
If you use this pipeline in your publication, please cite: Bianchi, A., Scherer, M., Zaurin, R. et al. scTAM-seq enables targeted high-confidence analysis of DNA methylation in single cells. Genome Biol 23, 229 (2022). https://doi.org/10.1186/s13059-022-02796-7