WP5

Goal: De Novo Motifs

ATAC-seq can be used to detect the genome-wide accessibility of chromatin. In the open chromatin regions, footprints can be identified as small areas of lower read coverage that can be caused by transcription factors bound at these sites. However, not all footprints can be traced to known transcription factor motifs, suggesting the existence of unidentified motifs. To discover de novo motifs, the motif discovery pipeline can be used. WP5 offers the possibility to run this pipeline for single cell data in a fully automated way, including the creation of the configuration files. Furthermore, the work package contains some scripts to analyse the newly discovered motifs and to provide first insights into their possible biological significance.

Dependencies

This work package depends crucially on 2 other tools.

Thus, these must be installed before the pipeline can be used. To ensure complete automation, the corresponding environments with the respective names ("TOBIAS_ENV" and "snakemake") must be available. Furthermore, an environment for plotting (name = "plotting") should be created. A suitable YAML file is supplied for this.

List of Parameters

run_all.sh

flag	required	description
--check-logs	no	After the motif discovery pipeline is finished the script check_logs.sh will be executed
-c	no	Short for `--check-logs`

cluster_all.sh

In this script the parameters are directly passed to the script and have no associated flag. Therefore, they are identified by ther position in the script call.

parameter positon	input	required	description	default
1	string	yes	Prefix for the clustering output
2	float	no	Alternative threshold for the clustering	0.3

eval_motif_similarity.sh

parameter	input	required	description	default
--help		no	Show help message and exit
-h		no	Short for `--help`
--runs-dir	Path	yes	Path to directory where the motif discovery runs are stored
--motifs	File	yes	motif_comparison_clusters.yml in the output directory of the motif clustering with TOBIAS (output of cluster_all.sh).
--out	string	yes	Prefix of how the output files should be named.
--annotation-dir	Path	no	Path to where the preproceesed original data lies. Given directory should contain an annotation.txt in a subdirectory as described by WP2. If this flag is set, the cell types will be renaimed according to the found annotation file.
--cutoff	integer	no	Minimum number of motifs within a cluster for the cluster to appear in the analysis.	2
--jitter	-	no	If this flag is set, an additional dot-plot with jitter activated wil be produced.

Overview of the workflow

workflow_non_transparent

This work package consists of several scripts that can be executed individually or as a cascading pipeline.

Before you start the pipeline (or the individual scripts), it is important to adjust the configuration file with the global variables. After adjusting the configuration file, you can start the pipeline with the following command:

./run_all.sh

Please make sure that you execute this command in the directory where the script run_all.sh is located, as this is a normal bash-script call.

If you want to activate the step "check logs" after the motif discovery, you can do this by adding the flag --check-logs or -c. In this case, the pipeline would be started as follows:

./run_all.sh --check-logs

Caution: The Pipeline requires some time to run. If you are working via an SSH connection, it is recommended start this script in a screen

The individual steps of the pipeline are described in the following sections.

Global variables

Before you start the pipeline (or the individual scripts), it is important to adjust the configuration file with the global variables. To do this, please open global_vars.cnf and adjust the 7 specified variables.

Variable	Meaning	Comment
PROJECT_DIR	Path to the folder to which all outputs are to be written
TBSDIR	Path to WP3 output	This should contain folders with the according tissue names, in which there is a subfolder with the TOBIAS output
MDP_PIPELINE	Path to your installation of the motif discovery pipeline
GENOME	Genome fasta of the analysed organism.	It should be the same file that was used in the preceeding work packages.
GTF	GTF of the analysed organism.	It should be the same file that was used in the preceeding work packages.
DATA_PREP_DIR	Directory where the output of the data preparation (WP2) is stored.	This is needed to assign cell type names. The specified folder should contain subfolders with names corresponding to the tissues being examined.
ANN_CHECKER	If 'yes', then the motif discovery is carried out with the annotation step, otherwise the annotation and associated analyses are skipped.

Please make sure to not enter any spaces and do not use quotes in the configuration file.

Caution: Please make sure that the file is not saved in a Windows format/editor, as the Windows carriage return will cause the scripts to not work properly when run individually. In run_all.sh the file is reformatted to avoid this error.

preparations

The script preparations.sh is called as follows:

./preparations.sh

This script creates configuration files from the WP3 output. A configuration file is created for each WP3 run. These configuration files are used to start the motif discovery pipeline for the different conditions (i.e. tissue and cell types) of the different runs. Along with this, this script creates all the necessary directories for further analysis, that are expected by the following scripts.

The created directories are on the lowest lever (in the specified project directory):

configs (for the motif discovery pipeline configuration files)
runs (for the motif discovery pipeline output)
similarity (for the similarity analysis)

The configs directory is used to store the newly created configuration files. The script also adds the required subdirectories for the tissues and cell types to the run folder (compare below).

run motif discovery pipeline

Usage

The script run_pipeline.sh is called as follows:

./run_pipeline.sh

The script starts the motif discovery pipeline for all configuration files found in the configs directory within the project directory specified in the configuration file.

Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.

Output

run_pipeline.sh fills the runs directory with the following structure:

   |-Tissue1
   |---cluster1(Cell-type)
   |-----annotation
   |-----motif_discovery_pipeline
   |-------1_footprints
   |---------1_extraction
   |---------2_fasta
   |-------2_discovery
   |---------1_meme
   |---------2_processed_motif
   |---------3_control_motifs
   |-------3_evaluation
   |---------motif_evaluation
   |---------rescan
   |---------venn
   |-------4_annotation
   |---------open_binding_sites
   |---------uropa
   |-------logs
   |---------uropa
   |---cluster10(Cell-type)
   |----- [...]
   |---cluster11(Cell-type)
   |----- [...]
   |---cluster3(Cell-type)
   |----- [...]
   |-Tissue2
   |--- [...]
   |- [...]

As you can see above, a directory is created for each tissue in the first level. This then contains one directory for each examined cluster. These currently carry the names of the assigned clusters from WP2. In WP2, an assignment of cluster to cell type can be retrieved from a file. Each cell type folder in turn contains a folder for the results of the annatations analysis (which will be filled later) and a folder for the results of the motif discovery pipeline. The folder structure is shown here schematically. For more detailed information on the motif discovery results, please consult the corresponding project.

post-processing

check logs

This script is an optional script to get a quick overview if all pipeline runs worked as expected. The script check_logs.sh is called as follows:

./check_logs.sh

The script checks the log files of the analysed tissues and their corresponding cell types to see if the pipeline runs were successful (including the annotation runs). To allow multiple analysis, a date is added to the file name in the following format:

DT = date.month.year_hour:minute:second

It produces a file called 'check_DT_logs.txt' which has the following structure:

tissue	  cell_type	log_files	errors	annotation	
Tissue1	  cluster1	   4	         no	no error	
Tissue1	  cluster10	   4	         no	no error
...
Tissue2	  cluster1	   4	         no	no error

As can be seen, an entry is created for each cell type of each tissue with the information how many log files of the motif discovery pipeline were found, whether errors occurred in the pipeline and whether errors occurred in the annotation function of the pipeline.

renameMotifs

This script is necessary because the motif discovery pipeline gives the motifs pseudo-names (for example motif_0). Since the same pseudo-names can occur in more than one run, it is important for further analysis to rename the motifs unambiguously. For the MEME files generated by the motif discovery pipeline, this is done automatically by this script.

The script can be called as follows:

./renameAllMotifs.sh

The naming schema is: <Tissue>_<Cell Type>_<Motif Pseudo-Name Number>. The Cell-Type in this case would be the cluster name given by WP2 as explained in run motif discovery pipeline. An example for new motif name is A8CPH_esophagus_muscularis_mucosa_cluster9_0.

Please note that the changes are only applied to the motifs.meme file in the .../motif_discovery_pipeline/3_evaluation/ directory. All other files and directories with the pseudo-names remain untouched, but their relationship to the new names can be inferred from the directory path to their location.

similarity analysis

To determine whether one of the motifs found (a very similar motif, respectively) occurs in different tissues or cell types, the scripts cluster_all.sh and eval_Motif_similarity.py are provided.

cluster all

The idea here is to cluster all motifs found with a reasonably low threshold (=0.3) using TOBIAS. All clusters found are then classified as the same motif. A consensus sequence is calculated for each cluster, which can be used for further analysis. However, a manual verification is recommended before using the consensus sequence.

To perform the clustering, the script is called as follows:

./cluster_all.sh <Clustering_Name>

The parameter "<Clustering_Name>" is the prefix that the output folder of the clustering will carry. It should therefore be indicative of the output. In the pipeline run_all.sh, the specified prefix is "overall".

Info: If you want to vary the threshold you can do so by giving the alternative threshold as a seccond parameter ( see parameter list ).

The script produces a folder with the name <Clustering_Name>_Cluster holding the results of the TOBIAS clustering. Aditionally, it produces a MEME-file containing all motifs called <Clustering_Name>_<Date-Time>.meme. Both outputs are wiritten directly into the project directory specified in global_vars.cnf.

evluate motif similarity

With this step, the results of the clustering can be evaluated by plotting them. To do this, you can call the script as follows:

./eval_Motif_similarity.py --runs_dir <PATH/TO/RUNS> --motifs <motif_comparison_clusters.yml> --out <FILENAME_prefix>

If you want to run this script on its own, please activate the plotting environment beforehand:

conda activate plotting

The script produces various diagrams to visualise the occurrence of similar motifs in different tissues and cell types. The script also has a number of optional parameters that are listed above. A particularly interesting parameter for analyses could be --cutoff, which reduces the data to the clusters that have at least the specified number of motifs. The default cutoff is 2.

Furthermore, the script saves the names of the found motifs together with their cluster, cell type and tissue in a CSV-file. Two of these CSV files are created. One with all the information of motif_comparison_clusters.yml and one with only the reduced data. If the --annotation-dir flag is set, the default names of the cell types (e.g. "cluster1") will be replaced with the cell type names suggested by WP2.

All outputs of this script will be written into the directory called 'similarity'.

Output

The step evluate motif similarity creates two CSV files and three plots. An additional fourth plot can be created if the parameter --jitter is set. These outputs are briefly explained here.

Info: All plots are generated with plotly as HTML and must be opened in a browser. This has the advantage that the diagrams are interactive and thus offer another dimension to the analysis.

CSVs

For the CSV files the store a table with the following structure:

	motif_name	Cluster	Tissue	Cell_type
0	Tissue1_cellTypeCluster3_1	Cluster_1	Tissue	cellTypeCluster3_1
...	...	...	...	...

Two of these CSV files are created. One with all the information of motif_comparison_clusters.yml and one with only the reduced data. If the --annotation-dir flag is set, the default names of the cell types (e.g. "cluster1", "cellTypeCluster3_1" in above example) will be replaced with the cell type names suggested by WP2. With the replaced names the above example would look like this:

	motif_name	Cluster	Tissue	Cell_type
0	Tissue1_cellTypeCluster3_1	Cluster_1	Tissue	CellTypeName
...	...	...	...	...

Bar plots

Two bar plots are created. In both plots each bar stands for one motif-cluster found with the cluster all step. One plot is colored according to the tissue the motifs in each cluster belong to. The other is colored according to the cell type the motifs in each cluster belong to

Tissue

03_c3_bar_T This plot is saved with the suffix '_bar_Tis.html'

Cell type

03_c3_bar_CT This plot is saved with the suffix '_bar_CT.html'

Bubble plot

In the bubble plot, the x-axis stands for the cell type and the y-axis for the tissues. The colouring indicates to which cluster a motif belongs. Each bubble thus represents motifs of a cluster found in the respective tissue-cell type combination. The size of the bubble indicates how many motifs of the respective cluster were found at this location.

03_c3_bubble_

The plot is saved with the suffix '_Bubble.html'.

Jitter plot

When analysing many clusters, it is possible that the bubbles in the bubble plot overlap. Therefore, it may be desirable to have an additional plot in which it is made more visible which clusters were found in a certain combination. For this purpose, one can use the flag --jitter, which leads to the creation of an additional plot. In this plot, the points at the same location are 'jittered' so that it is possible to see which points overlap. Another advantage of the jitter plot is that the tooltip shows the exact motif for each point.

03_c3_jitter

The plot is saved with the suffix '_jitter_Dotplot.html'

Gene Ontology analysis

Biology behind data (human)

The data provided by the used GTF-File (homo_sapiens.104.mainChr.gtf) provides the following information that is biologically relevant:

start_codon
stop_codon
CDS
gene
transcript
three_prime_utr
five_prime_utr
exon
Selenocysteine

The following graphic is a dedicated representation and is intended to show the interrelationship of the above components at the biological level:

biology_features

For the following analyses, the information "gene" is especially important.

Generate gene sets

After the motif-discovery pipeline with annotation step has been successfully run, the obtained datasets can be further analyzed. To further analyze the newly discovered motifs and the known transcription factors (TFs), it is necessary to check in the data which genes can be assigned to the respective motifs or transcription factors. For this purpose, the following scheme is generally followed:

1. set filters for distance to genes (= 1000 and 2000 base pairs).
2. get information from the respective files
> Motifs: **allhits.txt* files of the motifs produced by the motif-discovery pipeline.
> TFs: **overview.txt* files of the TFs produced by WP3 (TFBS).
3. filter the files
4. sort the data
5. save the data

For each motif and transcription factor, 4 txt files are generated in the following format:

*_1k.txt
*_1k_all.txt
*_2k.txt
*_2k_all.txt

In the files containing "1k" in the name, the genes associated with the motif/TF are assigned up to 1000 base pairs apart. On the other hand, in the files that contain "2k" in the name, all genes that have up to 2000 base pairs distance from the Motif/TF will be assigned. The files that contain "all" in the name contain all genes, even those that have been annotated multiple times. The other files (without "all") contain each gene only once. General overview:

"1k" = up to 1000 base pairs distance to gene
"2k" = Up to 2000 base pairs distance to gene
"all" = All genes, also multiple occurring, are included
without "all" = All genes occur only once

The files of the motifs are produced by the following script call:

./generate_gene_sets.sh

Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.

The naming scheme looks like this:

motif_1k.txt
motif_1k_all.txt
motif_2k.txt
motif_2k_all.txt

These files are then stored in the "runs/tissue/cell_type/annotation/gene_sets_motifs" folder.

The script just mentioned will also generate 4 more files with the same structure per cell type. The difference is that the name of the files, the location and the saved information is different. The names of the files look like this:

tissue_cell_type_1k.txt
tissue_cell_type_1k_all.txt
tissue_cell_type_2k.txt
tissue_cell_type_2k_all.txt

The location of the files is "runs/tissue/cell_type/annotation". The saved information differs in that all motifs of a cell type are saved here.

The files of the TFs are produced by the following script call:

./generate_gene_sets_TFs.sh

Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.

The naming scheme looks like this:

tissue_TFs_1k.txt
tissue_TFs_1k_all.txt
tissue_TFs_2k.txt
tissue_TFs_2k_all.txt

These files are then stored in the runs/tissue folder per tissue. This is because the TFs have been identified per tissue.

Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.

output (txts)

The generated txt files have the following basic structure:

#motif/TF_1k/2k
gene1
gene2
gene3
...

For the cell type and TF files, a blank line is inserted after each motif/TF and then the next motif/TF is appended according to the same scheme:

#motif_1/TF_1_1k/2k
gene1
gene2
gene3
...

#motif_2/TF_2_1k/2k
gene1
gene2
gene3
...

Compare the gene sets

To compare the data of the txt files produced in the previous step the following script compare_gene_sets.py is called:

./compare_gene_sets.py

Caution: If you are working via an ssh connection, make sure you start this script in a screen as it may require some time to complete.

The goal of the script is to perform the following 4 analyses:

1. compare_1k_2k
2. identify_special_genes
3. correlation_to_TF
4. correlation_to_motif

The script calculates which known TFs are correlated the most with the new motifs (= new TFs). The new motifs are also compared across tissues and cell types. The aim is to identify identical motifs on the one hand and correlated motifs on the other. In addition, the aim is to identify motif-specific genes.

compare_1k_2k

The goal of this analysis step is to compare the gene sets of the motifs with 1000 and 2000 base pairs apart. For this purpose, the files of one motif each (1k and 2k; without "all") are compared.

identify_special_genes

The aim of this analysis step is to extract the multiple occurring genes from a gene set of the motifs. For this purpose, the files of a motif (1k_all and 2k_all) are compared and the multiple genes of one of the two files are determined.

correlation_to_TF

The aim of this is to compare the gene sets of the motifs with the gene sets of the TFs and to identify correlations. For this purpose, each motif (2k) is compared with all TFs (2k).

correlation_to_motif

The goal of this is to compare the gene sets of the motifs with the gene sets of all other motifs and to identify correlations. For this purpose, each motif (2k) is compared with all other motifs (cell_type_2k).

output

The step compare_gene_sets.py creates seven CSV files and seven plots. These outputs are briefly explained here.

Info: All plots are generated with plotly as HTML and must be opened in a browser. This has the advantage that the diagrams are interactive and thus offer another dimension to the analysis.

CSVs

compare_1k_2k

Here a CSV file is generated, which has the following structure:

	tissue	cell_type	motif	number_of_genes_2k	number_of_genes_1k	difference_number_of_genes	difference_in_percent[%]
0	tissue_name	cell_type_name	motif_name	counter_2k	counter_1k	difference	difference %
...	...	...	...	...	...	...	...

identify_special_genes

Here a CSV file is generated, which has the following structure:

	tissue	cell_type	motif	number_of_genes_2k_all	number_of_genes_1k_all	difference_number_of_genes	gene_2k	counter_gene_2k	gene_1k	counter_gene_1k
0	tissue_name	cell_type_name	motif_name	counter_2k_all	counter_1k_all	difference	genes_2k	counter_genes_2k	gene_1k	counter_genes_1k
...	...	...	...	...	...	...	...	...	...	...

It also creates a reduced table with the same structure (filtered by "counter_gene_2k").

correlation_to_TF

Here a CSV file is generated, which has the following structure:

	tissue	cell_type	motif	number_of_genes_2k	tissue_TF	cell_type_TF	TF	number_of_genes_TF_2k	number_of_matching_genes	similarity[%]
0	tissue_name	cell_type_name	motif_name	counter_genes_motif	tissue_name_TF	cell_type_name_TF	TF_name	counter_genes_TF	counter_matching_genes	similarity %
...	...	...	...	...	...	...	...	...	...	...

It also creates a reduced table with the same structure (filtered by "similarity[%]").

correlation_to_motif

Here a CSV file is generated, which has the following structure:

	tissue	cell_type	motif	number_of_genes_2k	tissue_other_motif	cell_type_other_motif	other_motif	number_of_genes_other_motif_2k	number_of_matching_genes	similarity[%]
0	tissue_name	cell_type_name	motif_name	counter_genes_motif	tissue_name_other_motif	cell_type_name_other_motif	other_motif_name	counter_genes_TF	counter_matching_genes	similarity %
...	...	...	...	...	...	...	...	...	...	...

It also creates a reduced table with the same structure (filtered by "similarity[%]").

Bar Plots

compare_1k_2k

Tissues all_tissues

Cell Types all_cell_types

identify_special_genes

Tissues all_tissues

Cell Types all_cell_types

Motifs all_motifs

correlation_to_TF

all_TFs

correlation_to_motif

all_motifs

WP5

Goal: De Novo Motifs

Dependencies

List of Parameters

run_all.sh

cluster_all.sh

eval_motif_similarity.sh

Overview of the workflow

Global variables

preparations

run motif discovery pipeline

Usage

Output

post-processing

check logs

renameMotifs

similarity analysis

cluster all

evluate motif similarity

Output

CSVs

Bar plots

Tissue

Cell type

Bubble plot

Jitter plot

Gene Ontology analysis

Biology behind data (human)

Generate gene sets

output (txts)

Compare the gene sets

compare_1k_2k

identify_special_genes

correlation_to_TF

correlation_to_motif

output

CSVs

compare_1k_2k

identify_special_genes

correlation_to_TF

correlation_to_motif

Bar Plots

compare_1k_2k

identify_special_genes

correlation_to_TF

correlation_to_motif

Clone this wiki locally