WP5
ATAC-seq can be used to detect the genome-wide accessibility of chromatin. In the open chromatin regions, footprints can be identified as small areas of lower read coverage that can be caused by transcription factors bound at these sites. However, not all footprints can be traced to known transcription factor motifs, suggesting the existence of unidentified motifs. To discover de novo motifs, the motif discovery pipeline can be used. WP5 offers the possibility to run this pipeline for single cell data in a fully automated way, including the creation of the configuration files. Furthermore, the work package contains some scripts to analyse the newly discovered motifs and to provide first insights into their possible biological significance.
This work package depends crucially on 2 other tools.
Thus, these must be installed before the pipeline can be used. To ensure complete automation, the corresponding environments with the respective names ("TOBIAS_ENV" and "snakemake") must be available. Furthermore, an environment for plotting (name = "plotting") should be created. A suitable YAML file is supplied for this.
flag | required | description |
---|---|---|
--check-logs | no | After the motif discovery pipeline is finished the script check_logs.sh will be executed |
-c | no | Short for --check-logs |
In this script the parameters are passed directly to the script and have no associated flag. They are therefore identified by their position in the script call.
parameter position | input | required | description | default |
---|---|---|---|---|
1 | string | yes | Prefix for the clustering output | |
2 | float | no | Alternative threshold for the clustering | 0.3 |
parameter | input | required | description | default |
---|---|---|---|---|
--help | - | no | Show help message and exit | |
-h | - | no | Short for --help | |
--runs-dir | Path | yes | Path to directory where the motif discovery runs are stored | |
--motifs | File | yes | motif_comparison_clusters.yml in the output directory of the motif clustering with TOBIAS (output of cluster_all.sh). | |
--out | string | yes | Prefix of how the output files should be named. | |
--annotation-dir | Path | no | Path to where the preprocessed original data lies. The given directory should contain an annotation.txt in a subdirectory as described by WP2. If this flag is set, the cell types will be renamed according to the annotation file found. | |
--cutoff | integer | no | Minimum number of motifs within a cluster for the cluster to appear in the analysis. | 2 |
--jitter | - | no | If this flag is set, an additional dot plot with jitter activated will be produced. | |
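The parameter table above can be read as a command-line interface. The following is a minimal sketch of how such an interface could be declared with argparse; it is reconstructed from the table only, not taken from eval_Motif_similarity.py, and the function name `build_parser` and the help texts are assumptions. The flag spelling follows the table.

```python
import argparse
from pathlib import Path


def build_parser():
    """Hypothetical parser mirroring the parameter table (not the original code)."""
    parser = argparse.ArgumentParser(description="Evaluate motif similarity (sketch)")
    parser.add_argument("--runs-dir", type=Path, required=True,
                        help="directory where the motif discovery runs are stored")
    parser.add_argument("--motifs", type=Path, required=True,
                        help="motif_comparison_clusters.yml from the clustering step")
    parser.add_argument("--out", type=str, required=True,
                        help="prefix for the output file names")
    parser.add_argument("--annotation-dir", type=Path, default=None,
                        help="WP2 output directory used to rename cell types")
    parser.add_argument("--cutoff", type=int, default=2,
                        help="minimum number of motifs per cluster")
    parser.add_argument("--jitter", action="store_true",
                        help="additionally create the jittered dot plot")
    return parser


# Parse an example call; only the three required parameters are given.
args = build_parser().parse_args(
    ["--runs-dir", "runs", "--motifs", "motif_comparison_clusters.yml", "--out", "overall"]
)
print(args.cutoff, args.jitter, args.annotation_dir)
```

Optional parameters fall back to the defaults listed in the table, so a call with only the three required parameters uses a cutoff of 2 and produces no jitter plot.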
This work package consists of several scripts that can be executed individually or as a cascading pipeline.
Before you start the pipeline (or the individual scripts), it is important to adjust the configuration file with the global variables. After adjusting the configuration file, you can start the pipeline with the following command:
./run_all.sh
Please make sure that you execute this command in the directory where the script run_all.sh is located, as this is a normal bash-script call.
If you want to activate the step "check logs" after the motif discovery, you can do this by adding the flag --check-logs or -c. In this case, the pipeline would be started as follows:
./run_all.sh --check-logs
Caution: The pipeline requires some time to run. If you are working via an SSH connection, it is recommended to start this script in a screen session.
The individual steps of the pipeline are described in the following sections.
Before you start the pipeline (or the individual scripts), it is important to adjust the configuration file with the global variables. To do this, please open global_vars.cnf and adjust the 7 specified variables.
Variable | Meaning | Comment |
---|---|---|
PROJECT_DIR | Path to the folder to which all outputs are to be written | |
TBSDIR | Path to WP3 output | This should contain folders with the according tissue names, in which there is a subfolder with the TOBIAS output |
MDP_PIPELINE | Path to your installation of the motif discovery pipeline | |
GENOME | Genome fasta of the analysed organism. | It should be the same file that was used in the preceding work packages. |
GTF | GTF of the analysed organism. | It should be the same file that was used in the preceding work packages. |
DATA_PREP_DIR | Directory where the output of the data preparation (WP2) is stored. | This is needed to assign cell type names. The specified folder should contain subfolders with names corresponding to the tissues being examined. |
ANN_CHECKER | If 'yes', then the motif discovery is carried out with the annotation step, otherwise the annotation and associated analyses are skipped. | |
Please make sure not to enter any spaces and not to use quotes in the configuration file.
Caution: Please make sure that the file is not saved in a Windows format/editor, as the Windows carriage return will cause the scripts to not work properly when run individually. In run_all.sh the file is reformatted to avoid this error.
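The constraints on the configuration file (no spaces, no quotes, no Windows carriage returns) can be checked before a run. The following is a minimal sketch that assumes the file consists of KEY=VALUE lines; this format, the treatment of lines starting with # as comments, and the function name `parse_config` are assumptions, not part of the work package.

```python
def parse_config(text):
    """Parse KEY=VALUE lines (assumed format) and reject spaces or quotes."""
    variables = {}
    for raw_line in text.split("\n"):
        line = raw_line.rstrip("\r")  # drop Windows carriage returns
        if not line or line.startswith("#"):
            continue  # skip empty lines and (assumed) comment lines
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a KEY=VALUE line: {line!r}")
        if any(char in line for char in " \"'"):
            raise ValueError(f"spaces or quotes are not allowed: {line!r}")
        variables[key] = value
    return variables


# Example content saved with Windows line endings; the paths are placeholders.
example = "PROJECT_DIR=/data/wp5\r\nANN_CHECKER=yes\r\n"
config = parse_config(example)
print(config)
```

A line such as `GENOME="genome.fa"` would raise a ValueError in this sketch, which makes the problem visible before any script is started.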
The script preparations.sh is called as follows:
./preparations.sh
This script creates configuration files from the WP3 output. A configuration file is created for each WP3 run. These configuration files are used to start the motif discovery pipeline for the different conditions (i.e. tissue and cell types) of the different runs. In addition, the script creates all directories that the following scripts expect for further analysis.
The following directories are created at the top level of the specified project directory:
- configs (for the motif discovery pipeline configuration files)
- runs (for the motif discovery pipeline output)
- similarity (for the similarity analysis)
The configs directory is used to store the newly created configuration files. The script also adds the required subdirectories for the tissues and cell types to the runs directory (compare below).
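The directory layout described above can be sketched in a few lines. This is an illustration only and not the logic of preparations.sh; the function name `create_project_dirs` and the example tissue and cell type names are assumptions.

```python
import tempfile
from pathlib import Path


def create_project_dirs(project_dir, conditions):
    """Create configs, runs and similarity plus one runs subfolder per condition."""
    project_dir = Path(project_dir)
    for name in ("configs", "runs", "similarity"):
        (project_dir / name).mkdir(parents=True, exist_ok=True)
    for tissue, cell_type in conditions:
        (project_dir / "runs" / tissue / cell_type).mkdir(parents=True, exist_ok=True)


# Demonstrate the layout in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    create_project_dirs(tmp, [("Tissue1", "cluster1"), ("Tissue1", "cluster10")])
    created = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*"))
print(created)
```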
The script run_pipeline.sh is called as follows:
./run_pipeline.sh
The script starts the motif discovery pipeline for all configuration files found in the configs directory within the project directory specified in the configuration file.
Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.
run_pipeline.sh fills the runs directory with the following structure:
|-Tissue1
|---cluster1(Cell-type)
|-----annotation
|-----motif_discovery_pipeline
|-------1_footprints
|---------1_extraction
|---------2_fasta
|-------2_discovery
|---------1_meme
|---------2_processed_motif
|---------3_control_motifs
|-------3_evaluation
|---------motif_evaluation
|---------rescan
|---------venn
|-------4_annotation
|---------open_binding_sites
|---------uropa
|-------logs
|---------uropa
|---cluster10(Cell-type)
|----- [...]
|---cluster11(Cell-type)
|----- [...]
|---cluster3(Cell-type)
|----- [...]
|-Tissue2
|--- [...]
|- [...]
As you can see above, a directory is created for each tissue in the first level. This then contains one directory for each examined cluster. These currently carry the names of the assigned clusters from WP2. In WP2, an assignment of cluster to cell type can be retrieved from a file. Each cell type folder in turn contains a folder for the results of the annotation analysis (which will be filled later) and a folder for the results of the motif discovery pipeline. The folder structure is shown here schematically. For more detailed information on the motif discovery results, please consult the corresponding project.
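Because the first two directory levels encode tissue and cell type, the analysed conditions can be listed by walking the runs directory. The following is a minimal sketch under that assumption; the function name `list_conditions` is hypothetical and the example tree is built in a temporary directory.

```python
import tempfile
from pathlib import Path


def list_conditions(runs_dir):
    """Return (tissue, cell_type) pairs from the first two directory levels."""
    pairs = []
    for tissue_dir in sorted(p for p in Path(runs_dir).iterdir() if p.is_dir()):
        for cell_dir in sorted(p for p in tissue_dir.iterdir() if p.is_dir()):
            pairs.append((tissue_dir.name, cell_dir.name))
    return pairs


# Build a small example tree that follows the structure shown above.
with tempfile.TemporaryDirectory() as tmp:
    for tissue, cell_type in [("Tissue1", "cluster1"), ("Tissue1", "cluster10"),
                              ("Tissue2", "cluster3")]:
        (Path(tmp) / tissue / cell_type / "motif_discovery_pipeline").mkdir(parents=True)
        (Path(tmp) / tissue / cell_type / "annotation").mkdir()
    conditions = list_conditions(tmp)
print(conditions)
```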
This optional script gives a quick overview of whether all pipeline runs worked as expected. The script check_logs.sh is called as follows:
./check_logs.sh
The script checks the log files of the analysed tissues and their corresponding cell types to see if the pipeline runs were successful (including the annotation runs). To allow multiple analyses, a date is added to the file name in the following format:
DT = date.month.year_hour:minute:second
It produces a file called 'check_DT_logs.txt' which has the following structure:
tissue cell_type log_files errors annotation
Tissue1 cluster1 4 no no error
Tissue1 cluster10 4 no no error
...
Tissue2 cluster1 4 no no error
As can be seen, an entry is created for each cell type of each tissue with the information how many log files of the motif discovery pipeline were found, whether errors occurred in the pipeline and whether errors occurred in the annotation function of the pipeline.
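The overview file can also be evaluated programmatically. The following is a minimal sketch that assumes whitespace-separated columns as in the example above, with the last column allowed to contain a space ("no error"). Treating every value other than "no" and "no error" as a failure is an assumption, and the function names are hypothetical.

```python
def parse_check_logs(text):
    """Turn the overview table into a list of dictionaries, one per cell type."""
    lines = [l for l in text.splitlines() if l.strip() and l.strip() != "..."]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        # Split into at most len(header) fields so "no error" stays together.
        fields = line.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows


def failed_runs(rows):
    """Assumption: anything other than 'no' / 'no error' marks a failed run."""
    return [(r["tissue"], r["cell_type"]) for r in rows
            if r["errors"] != "no" or r["annotation"] != "no error"]


example = """tissue cell_type log_files errors annotation
Tissue1 cluster1 4 no no error
Tissue1 cluster10 4 no no error
Tissue2 cluster1 4 no no error
"""
rows = parse_check_logs(example)
print(len(rows), failed_runs(rows))
```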
This script is necessary because the motif discovery pipeline gives the motifs pseudo-names (for example motif_0). Since the same pseudo-names can occur in more than one run, it is important for further analysis to rename the motifs unambiguously. For the MEME files generated by the motif discovery pipeline, this is done automatically by this script.
The script can be called as follows:
./renameAllMotifs.sh
The naming schema is: <Tissue>_<Cell Type>_<Motif Pseudo-Name Number>. The cell type in this case is the cluster name given by WP2, as explained in run motif discovery pipeline. An example of a new motif name is A8CPH_esophagus_muscularis_mucosa_cluster9_0.
Please note that the changes are only applied to the motifs.meme file in the .../motif_discovery_pipeline/3_evaluation/ directory. All other files and directories with the pseudo-names remain untouched, but their relationship to the new names can be inferred from the directory path to their location.
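The renaming can be illustrated on the text of a MEME file. The following is a minimal sketch and not the logic of renameAllMotifs.sh: it assumes that motif entries start with a line of the form `MOTIF motif_<number>`, and the function name `rename_motifs` and the shortened example file are made up for illustration.

```python
import re

# A motif entry is assumed to start with "MOTIF motif_<number>".
MOTIF_LINE = re.compile(r"^MOTIF\s+motif_(\d+)\b(.*)$")


def rename_motifs(meme_text, tissue, cell_type):
    """Replace pseudo-names by <Tissue>_<Cell Type>_<Number> in MOTIF lines."""
    renamed = []
    for line in meme_text.splitlines():
        match = MOTIF_LINE.match(line)
        if match:
            number, rest = match.groups()
            line = f"MOTIF {tissue}_{cell_type}_{number}{rest}"
        renamed.append(line)
    return "\n".join(renamed)


# Shortened, made-up MEME content; only the MOTIF lines matter here.
example = "MEME version 4\n\nMOTIF motif_0\nletter-probability matrix: alength= 4 w= 1\n0.25 0.25 0.25 0.25\n"
result = rename_motifs(example, "A8CPH_esophagus_muscularis_mucosa", "cluster9")
print(result.splitlines()[2])
```

All lines that do not start a motif entry are passed through unchanged, so the matrices themselves are not modified.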
To determine whether one of the motifs found (a very similar motif, respectively) occurs in different tissues or cell types, the scripts cluster_all.sh and eval_Motif_similarity.py are provided.
The idea here is to cluster all motifs found with a reasonably low threshold (=0.3) using TOBIAS. All clusters found are then classified as the same motif. A consensus sequence is calculated for each cluster, which can be used for further analysis. However, a manual verification is recommended before using the consensus sequence.
To perform the clustering, the script is called as follows:
./cluster_all.sh <Clustering_Name>
The parameter "<Clustering_Name>" is the prefix that the output folder of the clustering will carry. It should therefore be indicative of the output. In the pipeline run_all.sh, the specified prefix is "overall".
Info: If you want to vary the threshold, you can do so by giving the alternative threshold as a second parameter (see parameter list).
The script produces a folder with the name <Clustering_Name>_Cluster holding the results of the TOBIAS clustering. Additionally, it produces a MEME file containing all motifs, called <Clustering_Name>_<Date-Time>.meme. Both outputs are written directly into the project directory specified in global_vars.cnf.
With this step, the results of the clustering can be evaluated by plotting them. To do this, you can call the script as follows:
./eval_Motif_similarity.py --runs_dir <PATH/TO/RUNS> --motifs <motif_comparison_clusters.yml> --out <FILENAME_prefix>
If you want to run this script on its own, please activate the plotting environment beforehand:
conda activate plotting
The script produces various diagrams to visualise the occurrence of similar motifs in different tissues and cell types. The script also has a number of optional parameters that are listed above. A particularly interesting parameter for analyses could be --cutoff, which reduces the data to the clusters that have at least the specified number of motifs. The default cutoff is 2.
Furthermore, the script saves the names of the found motifs together with their cluster, cell type and tissue in a CSV file. Two of these CSV files are created: one with all the information of motif_comparison_clusters.yml and one with only the reduced data. If the --annotation-dir flag is set, the default names of the cell types (e.g. "cluster1") will be replaced with the cell type names suggested by WP2.
All outputs of this script will be written into the directory called 'similarity'.
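The effect of --cutoff and the derivation of tissue and cell type from the motif names can be sketched as follows. This is an illustration, not the code of eval_Motif_similarity.py: the clustering result is assumed to be a mapping from cluster name to a list of motif names, the cell type is assumed to contain no underscore, and the function names and example data are hypothetical.

```python
def apply_cutoff(clusters, cutoff=2):
    """Keep only clusters that contain at least `cutoff` motifs."""
    return {name: motifs for name, motifs in clusters.items() if len(motifs) >= cutoff}


def to_rows(clusters):
    """One table row per motif; names follow <Tissue>_<Cell Type>_<Number>."""
    rows = []
    for cluster, motifs in clusters.items():
        for motif in motifs:
            # Split from the right so underscores inside the tissue name survive.
            tissue, cell_type, _number = motif.rsplit("_", 2)
            rows.append({"motif_name": motif, "Cluster": cluster,
                         "Tissue": tissue, "Cell_type": cell_type})
    return rows


# Hypothetical clustering result with two clusters.
clusters = {
    "Cluster_1": ["Tissue1_cluster3_1", "Tissue2_cluster1_0"],
    "Cluster_2": ["Tissue1_cluster10_4"],
}
reduced = apply_cutoff(clusters, cutoff=2)
rows = to_rows(reduced)
print(list(reduced), rows[0])
```

With the default cutoff of 2, the single-motif cluster is dropped from the reduced data, while the full table would still contain it.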
The step evaluate motif similarity creates two CSV files and three plots. An additional fourth plot can be created if the parameter --jitter is set. These outputs are briefly explained here.
Info: All plots are generated with plotly as HTML and must be opened in a browser. This has the advantage that the diagrams are interactive and thus offer another dimension to the analysis.
The CSV files store a table with the following structure:
| | motif_name | Cluster | Tissue | Cell_type |
---|---|---|---|---|
0 | Tissue1_cellTypeCluster3_1 | Cluster_1 | Tissue | cellTypeCluster3_1 |
... | ... | ... | ... | ... |
Two of these CSV files are created: one with all the information of motif_comparison_clusters.yml and one with only the reduced data. If the --annotation-dir flag is set, the default names of the cell types (e.g. "cluster1", "cellTypeCluster3_1" in the above example) will be replaced with the cell type names suggested by WP2. With the replaced names the above example would look like this:
| | motif_name | Cluster | Tissue | Cell_type |
---|---|---|---|---|
0 | Tissue1_cellTypeCluster3_1 | Cluster_1 | Tissue | CellTypeName |
... | ... | ... | ... | ... |
Two bar plots are created. In both plots, each bar stands for one motif cluster found in the cluster all step. One plot is colored according to the tissue the motifs in each cluster belong to and is saved with the suffix '_bar_Tis.html'. The other is colored according to the cell type the motifs in each cluster belong to and is saved with the suffix '_bar_CT.html'.
In the bubble plot, the x-axis stands for the cell type and the y-axis for the tissues. The colouring indicates to which cluster a motif belongs. Each bubble thus represents motifs of a cluster found in the respective tissue-cell type combination. The size of the bubble indicates how many motifs of the respective cluster were found at this location.
The plot is saved with the suffix '_Bubble.html'.
When analysing many clusters, it is possible that the bubbles in the bubble plot overlap. Therefore, it may be desirable to have an additional plot in which it is made more visible which clusters were found in a certain combination. For this purpose, one can use the flag --jitter, which leads to the creation of an additional plot. In this plot, the points at the same location are 'jittered' so that it is possible to see which points overlap. Another advantage of the jitter plot is that the tooltip shows the exact motif for each point.
The plot is saved with the suffix '_jitter_Dotplot.html'.
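The idea of jittering can be shown without any plotting library. The following is a minimal sketch, not the method used by the script: each category is mapped to its index on the axis and a small random offset is added, so that points sharing a tissue-cell type combination no longer coincide. The function name, the offset width and the example data are assumptions.

```python
import random


def jitter_points(points, x_categories, y_categories, width=0.25, seed=0):
    """Map categories to axis indices and add a random offset within +/- width."""
    rng = random.Random(seed)  # fixed seed keeps the plot reproducible
    jittered = []
    for cell_type, tissue, cluster in points:
        x = x_categories.index(cell_type) + rng.uniform(-width, width)
        y = y_categories.index(tissue) + rng.uniform(-width, width)
        jittered.append((x, y, cluster))
    return jittered


cell_types = ["cluster1", "cluster3"]
tissues = ["Tissue1", "Tissue2"]
# Three motifs from different clusters at the same tissue-cell type combination.
points = [("cluster3", "Tissue1", "Cluster_1"),
          ("cluster3", "Tissue1", "Cluster_2"),
          ("cluster3", "Tissue1", "Cluster_5")]
jittered = jitter_points(points, cell_types, tissues)
for x, y, cluster in jittered:
    print(f"{cluster}: x={x:.3f}, y={y:.3f}")
```

The offsets stay below half the distance between two neighbouring categories, so each point remains visually attached to its combination.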
The GTF file used (homo_sapiens.104.mainChr.gtf) provides the following biologically relevant feature types:
start_codon
stop_codon
CDS
gene
transcript
three_prime_utr
five_prime_utr
exon
Selenocysteine
The following graphic is a dedicated representation and is intended to show the interrelationship of the above components at the biological level:
For the following analyses, the information "gene" is especially important.
After the motif-discovery pipeline with annotation step has been successfully run, the obtained datasets can be further analyzed. To further analyze the newly discovered motifs and the known transcription factors (TFs), it is necessary to check in the data which genes can be assigned to the respective motifs or transcription factors. For this purpose, the following scheme is generally followed:
1. Set filters for the distance to genes (1000 and 2000 base pairs).
2. Get information from the respective files:
   - Motifs: *allhits.txt files of the motifs produced by the motif-discovery pipeline.
   - TFs: *overview.txt files of the TFs produced by WP3 (TFBS).
3. Filter the files.
4. Sort the data.
5. Save the data.
For each motif and transcription factor, 4 txt files are generated in the following format:
*_1k.txt
*_1k_all.txt
*_2k.txt
*_2k_all.txt
In the files containing "1k" in the name, genes are assigned to the motif/TF if they are at most 1000 base pairs away. In the files containing "2k" in the name, all genes within a distance of up to 2000 base pairs from the motif/TF are assigned. The files that contain "all" in the name contain all genes, even those that have been annotated multiple times. The other files (without "all") contain each gene only once. General overview:
"1k" = up to 1000 base pairs distance to gene
"2k" = Up to 2000 base pairs distance to gene
"all" = All genes, also multiple occurring, are included
without "all" = All genes occur only once
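The four variants can be derived from one list of annotated hits. The following is a minimal sketch under several assumptions: each hit is reduced to a (gene, distance) pair, "up to" is read as inclusive, and the genes are sorted alphabetically. The function name `build_gene_sets` and the example hits are hypothetical and do not reflect the column layout of the real input files.

```python
def build_gene_sets(hits, limits=(1000, 2000)):
    """Return the 1k, 1k_all, 2k and 2k_all gene lists for one motif or TF."""
    gene_sets = {}
    for limit in limits:
        label = f"{limit // 1000}k"
        # "all": every annotated hit within the distance, duplicates included.
        within = sorted(gene for gene, distance in hits if abs(distance) <= limit)
        gene_sets[f"{label}_all"] = within
        # without "all": each gene only once.
        gene_sets[label] = sorted(set(within))
    return gene_sets


# Hypothetical hits as (gene, distance in base pairs).
hits = [("GENE_A", 150), ("GENE_A", 900), ("GENE_B", 1500), ("GENE_C", 2500)]
gene_sets = build_gene_sets(hits)
for name, genes in gene_sets.items():
    print(name, genes)
```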
The files of the motifs are produced by the following script call:
./generate_gene_sets.sh
Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.
The naming scheme looks like this:
motif_1k.txt
motif_1k_all.txt
motif_2k.txt
motif_2k_all.txt
These files are then stored in the "runs/tissue/cell_type/annotation/gene_sets_motifs" folder.
The script just mentioned also generates 4 more files with the same structure per cell type. These files differ in their names, their location and the information they contain. The names of the files look like this:
tissue_cell_type_1k.txt
tissue_cell_type_1k_all.txt
tissue_cell_type_2k.txt
tissue_cell_type_2k_all.txt
The location of the files is "runs/tissue/cell_type/annotation". The saved information differs in that all motifs of a cell type are saved here.
The files of the TFs are produced by the following script call:
./generate_gene_sets_TFs.sh
Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.
The naming scheme looks like this:
tissue_TFs_1k.txt
tissue_TFs_1k_all.txt
tissue_TFs_2k.txt
tissue_TFs_2k_all.txt
These files are then stored in the runs/tissue folder per tissue. This is because the TFs have been identified per tissue.
The generated txt files have the following basic structure:
#motif/TF_1k/2k
gene1
gene2
gene3
...
For the cell type and TF files, a blank line is inserted after each motif/TF and then the next motif/TF is appended according to the same scheme:
#motif_1/TF_1_1k/2k
gene1
gene2
gene3
...
#motif_2/TF_2_1k/2k
gene1
gene2
gene3
...
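Files in this format are straightforward to read back. The following is a minimal sketch of a reader that treats every line starting with # as the header of a new gene set and every other non-empty line as a gene name; the function name `parse_gene_sets` and the example content are assumptions.

```python
def parse_gene_sets(text):
    """Map each '#header' to the list of gene names that follow it."""
    gene_sets = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines only separate the blocks
        if line.startswith("#"):
            current = line[1:]
            gene_sets[current] = []
        elif current is not None:
            gene_sets[current].append(line)
    return gene_sets


# Hypothetical content of a cell type file with two motifs.
example = "#motif_1_2k\ngene1\ngene2\n\n#motif_2_2k\ngene2\ngene3\n"
parsed = parse_gene_sets(example)
print(parsed)
```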
To compare the data in the txt files produced in the previous step, the script compare_gene_sets.py is called as follows:
./compare_gene_sets.py
Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.
The goal of the script is to perform the following 4 analyses:
1. compare_1k_2k
2. identify_special_genes
3. correlation_to_TF
4. correlation_to_motif
The script calculates which known TFs are correlated the most with the new motifs (= new TFs). The new motifs are also compared across tissues and cell types. The aim is to identify identical motifs on the one hand and correlated motifs on the other. In addition, the aim is to identify motif-specific genes.
The goal of this analysis step is to compare the gene sets of a motif obtained with the 1000 and the 2000 base pair distance filter. For this purpose, the two files of each motif (1k and 2k; without "all") are compared.
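The comparison of the 1k and 2k gene sets can be sketched as follows. The keys follow the column names of the CSV output described further below, but the percentage formula (difference relative to the 2k set) is an assumption, as is the function name; the actual definition in compare_gene_sets.py may differ.

```python
def compare_1k_2k(genes_1k, genes_2k):
    """Count the genes in both sets and report the absolute and relative difference."""
    n_1k, n_2k = len(set(genes_1k)), len(set(genes_2k))
    difference = n_2k - n_1k
    # Assumption: the percentage refers to the size of the 2k gene set.
    percent = 100.0 * difference / n_2k if n_2k else 0.0
    return {
        "number_of_genes_2k": n_2k,
        "number_of_genes_1k": n_1k,
        "difference_number_of_genes": difference,
        "difference_in_percent[%]": round(percent, 2),
    }


comparison = compare_1k_2k(["gene1", "gene2", "gene3"],
                           ["gene1", "gene2", "gene3", "gene4"])
print(comparison)
```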
The aim of this analysis step is to extract the genes that occur multiple times in a gene set of a motif. For this purpose, the files of a motif (1k_all and 2k_all) are compared and the genes that occur multiple times in each of the two files are determined.
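Counting repeated genes in an "all" file can be done with a Counter. This is a minimal sketch of the idea, not the script's implementation; the function name `multiple_genes` and the example gene names are hypothetical.

```python
from collections import Counter


def multiple_genes(genes_all):
    """Return the genes that occur more than once together with their counts."""
    counts = Counter(genes_all)
    return {gene: n for gene, n in counts.items() if n > 1}


# Hypothetical content of a *_2k_all.txt and a *_1k_all.txt file of one motif.
genes_2k_all = ["gene1", "gene2", "gene1", "gene3", "gene1", "gene3"]
genes_1k_all = ["gene1", "gene2", "gene1"]
special_2k = multiple_genes(genes_2k_all)
special_1k = multiple_genes(genes_1k_all)
print(special_2k, special_1k)
```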
The aim of this is to compare the gene sets of the motifs with the gene sets of the TFs and to identify correlations. For this purpose, each motif (2k) is compared with all TFs (2k).
The goal of this is to compare the gene sets of the motifs with the gene sets of all other motifs and to identify correlations. For this purpose, each motif (2k) is compared with all other motifs (cell_type_2k).
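Both correlation analyses compare two gene sets and count the genes they share. The following is a minimal sketch of such a comparison. The similarity is defined here as the share of the motif's genes that also occur in the other set, which is an assumption; the formula used by compare_gene_sets.py is not documented here and may differ. The function names and the example sets are hypothetical.

```python
def similarity(motif_genes, other_genes):
    """Return the number of matching genes and their share of the motif's genes in %."""
    motif_set, other_set = set(motif_genes), set(other_genes)
    matching = len(motif_set & other_set)
    # Assumption: the percentage refers to the motif's own gene set.
    percent = 100.0 * matching / len(motif_set) if motif_set else 0.0
    return matching, round(percent, 2)


def rank_by_similarity(motif_genes, candidates):
    """Sort TFs (or other motifs) by descending similarity to the motif."""
    scored = [(name, *similarity(motif_genes, genes)) for name, genes in candidates.items()]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical 2k gene sets of one motif and two TFs.
motif = ["gene1", "gene2", "gene3", "gene4"]
tfs = {"TF_1": ["gene2", "gene3", "gene5"], "TF_2": ["gene1", "gene2", "gene3"]}
ranking = rank_by_similarity(motif, tfs)
print(ranking)
```

The same function covers the motif-to-motif comparison when the candidates are the gene sets of all other motifs instead of the TFs.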
The step compare_gene_sets.py creates seven CSV files and seven plots. These outputs are briefly explained here.
Info: All plots are generated with plotly as HTML and must be opened in a browser. This has the advantage that the diagrams are interactive and thus offer another dimension to the analysis.
Here a CSV file is generated, which has the following structure:
| | tissue | cell_type | motif | number_of_genes_2k | number_of_genes_1k | difference_number_of_genes | difference_in_percent[%] |
---|---|---|---|---|---|---|---|
0 | tissue_name | cell_type_name | motif_name | counter_2k | counter_1k | difference | difference % |
... | ... | ... | ... | ... | ... | ... | ... |
Here a CSV file is generated, which has the following structure:
| | tissue | cell_type | motif | number_of_genes_2k_all | number_of_genes_1k_all | difference_number_of_genes | gene_2k | counter_gene_2k | gene_1k | counter_gene_1k |
---|---|---|---|---|---|---|---|---|---|---|
0 | tissue_name | cell_type_name | motif_name | counter_2k_all | counter_1k_all | difference | genes_2k | counter_genes_2k | gene_1k | counter_genes_1k |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
It also creates a reduced table with the same structure (filtered by "counter_gene_2k").
Here a CSV file is generated, which has the following structure:
| | tissue | cell_type | motif | number_of_genes_2k | tissue_TF | cell_type_TF | TF | number_of_genes_TF_2k | number_of_matching_genes | similarity[%] |
---|---|---|---|---|---|---|---|---|---|---|
0 | tissue_name | cell_type_name | motif_name | counter_genes_motif | tissue_name_TF | cell_type_name_TF | TF_name | counter_genes_TF | counter_matching_genes | similarity % |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
It also creates a reduced table with the same structure (filtered by "similarity[%]").
Here a CSV file is generated, which has the following structure:
| | tissue | cell_type | motif | number_of_genes_2k | tissue_other_motif | cell_type_other_motif | other_motif | number_of_genes_other_motif_2k | number_of_matching_genes | similarity[%] |
---|---|---|---|---|---|---|---|---|---|---|
0 | tissue_name | cell_type_name | motif_name | counter_genes_motif | tissue_name_other_motif | cell_type_name_other_motif | other_motif_name | counter_genes_other_motif | counter_matching_genes | similarity % |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
It also creates a reduced table with the same structure (filtered by "similarity[%]").