WP4
Work package 4 uses the Cicero algorithm to assess whether open chromatin regions can be scored by co-accessibility, as a step towards gaining insight into potential gene expression.
The algorithm runs in R. All R scripts are provided in the folder R Scripts.
The workflow is as follows:
The setup of this workflow is provided in the file Install_MC.R. It runs under R version 4.1.2 and gives instructions for installing Cicero and its dependency package Monocle3.
The next step is to read in the input data provided by work package 1. The code is stored in Create_CDS.R and provides instructions on how to create the file formats required as input for Cicero.
To connect peaks to clusters, which answers one of the research questions, a separate matrix must be created and written out. This is the first output and can be further processed by the code in Link_Peaks_Clusters.R.
The created cds object has to be processed further in order to run Cicero and create the co-accessibility scores. This code is stored in the script Create_Cicero_CDS.R.
The output of the previous script is a cds object for which the co-accessibility of peaks can be calculated. The created output files are the co-accessibility scores (conns.csv), the annotated cds object (cds_sites) and a gene activity matrix connecting both. Annotating the cds object requires the information in the input file homo_sapiens.104.main.Chr.gtf. Creating the co-accessibility scores requires the information in the file homo_sapiens.104.main.Chr.fa.fai.
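As a rough illustration of how the co-accessibility scores might be consumed downstream, the sketch below filters a connection table by score. It is written in Python purely as a language-agnostic example and is not part of the project's R scripts. It assumes that conns.csv keeps the column names of Cicero's connection table (Peak1, Peak2, coaccess); the example rows, the helper name `load_strong_links` and the cutoff of 0.25 are hypothetical.

```python
import csv
import io

# Hypothetical excerpt of conns.csv. The column names Peak1, Peak2 and
# coaccess are assumed to match Cicero's connection table.
EXAMPLE_CONNS = """Peak1,Peak2,coaccess
chr1_1000_1500,chr1_9000_9400,0.62
chr1_1000_1500,chr1_20000_20300,0.08
chr1_9000_9400,chr1_20000_20300,-0.15
chr2_500_900,chr2_4000_4600,NA
"""


def load_strong_links(handle, min_score=0.25):
    """Return (peak1, peak2, score) tuples with score >= min_score."""
    links = []
    for row in csv.DictReader(handle):
        raw = row["coaccess"]
        if raw in ("", "NA"):  # R writes missing values as NA by default
            continue
        score = float(raw)
        if score >= min_score:
            links.append((row["Peak1"], row["Peak2"], score))
    return links


# In practice the handle would be an opened conns.csv file.
links = load_strong_links(io.StringIO(EXAMPLE_CONNS))
for peak1, peak2, score in links:
    print(peak1, peak2, score)
```

The cutoff is a free parameter here; a suitable value depends on the data set and the question being asked.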
The output files are fed into the next step, the biological analysis of this data.
Click here for more information on the example run.
This code is found in the script Biological_approaches.R. It answers the research questions and creates a Cicero connection plot.
Click here for more information on the biological analysis.
Chromatin, which consists of DNA and proteins, is a dynamic structure. Regulation of transcription is based on the interaction between the structure of chromatin and the recruitment of numerous transcription factors, proximal promoter elements and upstream activator sequences. Accessible chromatin is crucial for transcriptional regulation. This accessibility is marked by DNA methylation and histone modification. Accessibility can be modified by both pathogenic and environmental factors; it indicates the position of regulatory regions and thereby reflects the regulation of cell behavior.
To map regions of accessible chromatin genome-wide, the Assay for Transposase-Accessible Chromatin with high-throughput sequencing (ATAC-seq) was developed [1].
Cicero was developed as an algorithm to link the chromatin accessibility of regulatory elements to their target genes, thereby predicting gene expression.
This student project was created by the team around Mario Looso from the MPI in Bad Nauheim. In the course Biodatenanalyse, data from a previously conducted study [2] was to be processed bioinformatically. The team chose the study "A cell atlas of chromatin accessibility across 25 adult human tissues" by Zhang et al. This study used 70 bio-samples covering 25 different tissue types, obtained from four donors. The clustering of 472,373 nuclei resulted in 54 distinct cell types. The overall goal of this student project was to work with the data gained from these sci-ATAC-seq assays. The workload was split into individual work packages, each of which processed the raw data further to address a specific biological question.
This work package looks at co-accessible chromatin regions: it creates a score that links chromatin regions with the same opening pattern to distal regions upstream in the genome, and annotates the open regions to known promoters, transcription start sites or transcription factors. The software Cicero was used to address this question.
Cicero [3] is an algorithm that identifies co-accessible pairs of DNA elements and connects regulatory elements to their putative target genes based on the accessibility dynamics of linked distal elements.
Unlike other approaches, Cicero does not use bulk chromatin accessibility data generated over many cell lines and tissues, but works with single-cell chromatin accessibility data from a single experiment. Cicero therefore has to be robust to the sparsity of the data.
Algorithm [4]
By sampling and aggregating similar cells into groups, Cicero quantifies correlations between putative regulatory elements. Based on this quantification, Cicero links regulatory elements to their target genes using machine learning. The algorithm is applicable to any organism and any cell type. Cicero faces the following challenges in building a genome-wide cis-regulatory map:
- Raw correlations are driven by technical features (e.g. read depth per cell).
- There are insufficient observations to estimate correlations between billions of pairs of sites.
- Single-cell ATAC-seq data is very sparse.
- While the accessibility of distal elements might be correlated with their target promoters, very distant or interchromosomal pairs of sites are also correlated because they are part of the same regulatory program.
The user provides Cicero with clustered cells as input. The algorithm then creates a large number of cell groups, each containing 50 cells in similar positions in the clustering. This overcomes the sparsity of the data. Cicero furthermore aggregates the accessibility profiles of the cells in each group to produce counts that are adjusted for the effects of technical variables, and it measures correlations in accessibility between all pairs of sites within a 500 kb window. The output of Cicero consists of these correlations, the co-accessibility scores.
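The aggregation and windowing steps described above can be sketched in a few lines. The following Python example is a simplified illustration, not Cicero's implementation: the cell groups are given by hand (and contain 2 cells instead of 50), the correction for technical variables is left out, and a plain Pearson correlation stands in for the regularised estimate that Cicero computes. All function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def aggregate_cells(counts, groups):
    """Sum per-cell accessibility counts over each group of cells.

    counts: list of per-cell lists (one 0/1 entry per site).
    groups: list of lists of cell indices.
    """
    n_sites = len(counts[0])
    return [
        [sum(counts[cell][site] for cell in group) for site in range(n_sites)]
        for group in groups
    ]


def pearson(x, y):
    """Pearson correlation of two equally long sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def windowed_correlations(aggregated, site_positions, window=500_000):
    """Correlate every pair of sites lying at most `window` bp apart."""
    scores = {}
    for i, j in combinations(range(len(site_positions)), 2):
        if abs(site_positions[i] - site_positions[j]) <= window:
            col_i = [row[i] for row in aggregated]
            col_j = [row[j] for row in aggregated]
            scores[(i, j)] = pearson(col_i, col_j)
    return scores


# Toy data: 6 cells x 3 sites, binary accessibility.
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
]
groups = [[0, 1], [2, 3], [4, 5]]           # hypothetical groups of similar cells
site_positions = [1_000, 200_000, 900_000]  # site midpoints in bp

aggregated = aggregate_cells(counts, groups)
scores = windowed_correlations(aggregated, site_positions)
print(scores)
```

Only the first two sites lie within 500 kb of each other, so only that pair receives a score; the third site is never compared, which is how the window keeps the number of site pairs manageable.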
An overview of the Cicero algorithm
Cicero can also identify CCANs: cis-co-accessibility networks, which are modules of sites that are highly co-accessible with one another and are found with the help of a community detection algorithm. Adapted from the definition of a chromatin hub, CCANs should meet the following criteria:
- They should be located in close physical proximity, closer than expected based on their distance in the linear genome (further analysed and verified with data from ChIA-PET analysis (https://www.sciencedirect.com/science/article/abs/pii/S1046202312002204)).
- They should interact with common groups of protein complexes.
- The epigenetic modifications should occur at similar times.
- They should substantially regulate genes whose promoters lie in the hub.
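To make the idea of a module of co-accessible sites more concrete, the sketch below groups peaks that are joined by links above a score cutoff. It is a deliberately simple stand-in written in Python: it finds connected components with a union-find structure, whereas Cicero applies a community detection algorithm, so the resulting modules are only an approximation of CCANs. The function name, the peak names and the cutoff of 0.25 are hypothetical.

```python
def connected_modules(links, min_score=0.25):
    """Group peaks joined by links with score >= min_score.

    links: iterable of (peak_a, peak_b, score) tuples.
    Returns a sorted list of sorted peak lists, one per module.
    """
    parent = {}

    def find(peak):
        # Return the representative of the module that contains `peak`.
        parent.setdefault(peak, peak)
        while parent[peak] != peak:
            parent[peak] = parent[parent[peak]]  # path halving
            peak = parent[peak]
        return peak

    for peak_a, peak_b, score in links:
        if score >= min_score:
            parent[find(peak_a)] = find(peak_b)  # merge the two modules

    modules = {}
    for peak in list(parent):
        modules.setdefault(find(peak), []).append(peak)
    return sorted(sorted(members) for members in modules.values())


# Hypothetical co-accessibility links between five peaks.
links = [
    ("peak_A", "peak_B", 0.60),
    ("peak_B", "peak_C", 0.40),
    ("peak_C", "peak_D", 0.10),  # below the cutoff, so it does not join modules
    ("peak_D", "peak_E", 0.50),
]
modules = connected_modules(links)
print(modules)
```

Connected components merge any two peaks that share a chain of strong links, which tends to produce larger modules than community detection would; the sketch is meant to show the input and output shape rather than to reproduce Cicero's results.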
Links created by Cicero are also mediated by interacting transcription factors (TFs).
In contrast to other approaches, Cicero operates on single-cell data and therefore avoids bulk averaging effects. Analyses with Cicero can accelerate the quantitative understanding of eukaryotic gene regulation. Additionally, it may ease the identification of target genes of non-coding variants underlying genome-wide association signals. The algorithm provides an effective resource for generating links between regulatory elements and target genes in a tissue or cell type using data from one single-cell experiment. The defined chromatin hubs help in constructing dynamic models of gene expression and in identifying genes whose dysregulation is linked to genome-wide association signals.
As the field of epigenetics develops cell atlases that define each cell type and its molecular profile, regulatory maps will be essential for understanding the gene expression program in both illness and health.
The main limitation of Cicero is the putative nature of its generated connections. Further experiments are necessary to determine whether a linked distal DNA element is necessary or sufficient for regulatory influence.