After having installed the required software [LINK TO ADD TO THE INSTALL.md], downloaded the data and organised it in the right way, all the aznalyses can be reproduced with one make
target.
make -f makefiles/all.mk all
The command ``make -f makefiles/all.mk all` calls the other makefiles to run the successive steps of the data preparation and analysis.
The environment variable BOARD
should be set to either leaderboard
or final
.
In the example below we set it to leaderboard
.
export BOARD=leaderboard
TO BE WRITTEN
Generate the metadata
@make BOARD=${BOARD} -f makefiles/01_init.mk all_metadata
make BOARD=${BOARD} -f makefiles/01_init.mk fetch_sequences
make BOARD=${BOARD} -f makefiles/01_init.mk fastq2fasta
make BOARD=${BOARD} -f makefiles/03_PBM.mk tsv2fasta
make BOARD=${BOARD} -f makefiles/03_PBM.mk top_bg_seq_all_datasets
make BOARD=${BOARD} -f makefiles/01_init.mk rand_fragments_all_experiments
make BOARD=${BOARD} -f makefiles/01_init.mk tf_vs_others_all_experiments
make BOARD=${BOARD} -f makefiles/02_peak-motifs.mk peakmo_all_experiments EXPERIMENTS='CHS GHTS SMS HTS'
make BOARD=${BOARD} -f makefiles/02_peak-motifs.mk peakmo_diff_all_experiments EXPERIMENTS='CHS GHTS SMS HTS'
make BOARD=${BOARD} -f makefiles/03_PBM.mk peakmo_diff_all_datasets
make BOARD=${BOARD} -f makefiles/04_optimize-matrices.mk omga_all_experiments
make BOARD=${BOARD} -f makefiles/04_optimize-matrices.mk omga_collect_tables
make BOARD=${BOARD} -f makefiles/04_optimize-matrices.mk omga_results_per_type
make BOARD=${BOARD} -f makefiles/04_optimize-matrices.mkomga_collect_tables
All results are generated using make
scripts in the
makefiles
directory.
Each makefile comes with two targets documenting its use :
make -f makefiles/[myfile.mk] targets
lists the targets and indicates what they domake -f makefiles/[myfile.mk] param
lists the parameters used for the analyses
Note that all the makefiles first load makefiles/00_parameters.mk
, which contains general parameters and targets, which are used by several makefiles.
For example, the following command lists the targets for motif discovery with peak-motifs
.
make -f makefiles/02_peak-motifs.mk targets
and the default parameters can be obtained as follows
make -f makefiles/02_peak-motifs.mk param
The make variables can be redefined on the flight by specifying another value on the command line.
For example, the default experiment is CHS (ChIP-seq)
make -f makefiles/02_peak-motifs.mk param | grep EXPERIMENT
It can be overwritten as follows
make -f makefiles/02_peak-motifs.mk param EXPERIMENT=GHTS
All the other variables will be updated automatically for the GHTS (genomic high-throughput SELEX) data.
Metadata files are tab-delimited text files providing information about each dataset found in the data directory. They are generated automatically by searching input files (peaks, fasta or fastq sequences, PBM tables depending on the data type).
This can be done with the following commands.
export BOARD=leaderboard
## Note: for the final results just replace "leaderboard" by "final" in the previous statement
## Generate a metadata file for ChIP-seq experiments (CHS)
make -f makefiles/00_parameters.mk BOARD=${BOARD} EXPERIMENT=CHS metadata
## Check the content of the metadata file
cat metadata/${BOARD}/TF_DATASET_CHS.tsv
## Generate a metadata file for Protein Binding Microarray experiments (PBM)
make -f makefiles/04_PBM.mk BOARD=${BOARD} metadata_pbm
cat metadata/${BOARD}/TF_DATASET_PBM.tsv
## Generate a metadata file with all the experiments for the integration of all matrices
make -f makefiles/00_parameters.mk BOARD=${BOARD} all_metadata
## Check the date of the metadata files
ls -tlr metadata/${BOARD}/
## Count the number of datasets per TF across the metadata file
cut -f 1 metadata/${BOARD}/TF_DATASET_all-types.tsv | sort | uniq -c | sort -nr
Here is the result for the leaderboard:
10 NACC2
7 TIGD3
7 RORB
7 PRDM5
7 LEF1
6 ZNF362
5 NFKB1
4 ZNF407
3 SP140
2 GABPA
You can also count, for each TF, the number of experimental methods for which training data is available.
rsat contingency-table -i metadata/${BOARD}/TF_DATASET_all-types.tsv -col1 1 -col2 4 \
-margin -sort freq
Which gives the following result.
TF Sum HTS GHTS PBM CHS SMS
Sum 58 23 14 8 8 5 NACC2 10 7 0 2 0 1 RORB 7 4 0 2 0 1 LEF1 7 4 0 2 0 1 PRDM5 7 0 5 0 2 0 TIGD3 7 4 0 2 0 1 ZNF362 6 0 3 0 3 0 NFKB1 5 4 0 0 0 1 ZNF407 4 0 3 0 1 0 SP140 3 0 2 0 1 0 GABPA 2 0 1 0 1 0
The same can be done for the final data by replacing "leaderboard" by "final" in the above commands (export BOARD=final
, then re-run all the commands above).
TF Sum HTS GHTS CHS PBM SMS
Sum 185 74 63 23 14 11 TPRX1 11 8 0 0 2 1 SP140L 10 7 0 0 2 1 MYPOP 10 8 0 0 2 0 SALL3 10 0 9 1 0 0 MKX 9 6 0 0 2 1 ZBED2 9 0 8 1 0 0 PRDM13 8 0 7 1 0 0 CREB3L3 8 7 0 0 0 1 GCM1 7 4 0 0 2 1 ZBTB47 7 7 0 0 0 0 ZNF395 7 0 6 1 0 0 ZNF518B 7 0 6 1 0 0 ZFTA 7 4 0 0 2 1 ZNF493 6 0 3 3 0 0 MSANTD1 6 3 0 0 2 1 LEUTX 6 0 3 3 0 0 ZNF20 6 0 3 3 0 0 ZNF251 5 0 2 3 0 0 USF3 5 0 4 1 0 0 FIZ1 5 4 0 0 0 1 ZNF831 5 4 0 0 0 1 ZNF367 4 0 3 1 0 0 ZNF500 4 3 0 0 0 1 ZBED5 4 0 3 1 0 0 ZNF780B 4 3 0 0 0 1 ZNF648 4 0 3 1 0 0 ZNF721 3 3 0 0 0 0 ZNF286B 3 3 0 0 0 0 CAMTA1 3 0 2 1 0 0 MYF6 2 0 1 1 0 0
ChIP-seq (CHS) and genomic high throughput sequencing (GHTS) data are provided as coordinates (bed-formatted table with extension .peak
). In order to get the corresponding sequences in fasta format, we use the RSAT tool fetch-sequences
, which takes as input a bed file and retrieves the corresponding genomic sequences from (UCSC genome browser)[https://genome.ucsc.edu/].
## Fetch genomic sequences from UCSC genome browser
make -f makefiles/00_parameters.mk BOARD=${BOARD} fetch_sequences
## Check the sizes of the sequence files
grep -v '^#' metadata/${BOARD}/TF_DATASET_CHS.tsv | cut -f 7 | xargs du -sk
grep -v '^#' metadata/${BOARD}/TF_DATASET_GHTS.tsv | cut -f 7 | xargs du -sk
The peak-motifs
workflow is used as main tool for motif discovery.
For CHS, GHTS, HTS and SMS experiments, it is used in the single dataset mode, which detects exceptional motifs, with two criteria of exceptionality :
- k-mer over-representation relative to the background model (the significance of the over-representation is computed with a binomial test)
- k-mer positional bias¨ along the peak sequences relative to peak center (a chi-squared homogeneity test)
The k-mers declared significant are then used as seeds to build position-specific scoring matrices (in absolute counts), which are further converted to position frequency matrices (PFM) following the IBIS challenge specifications.
The target peakmo
of makefiles/02_peak-motifs.mk
runs the single-dataset analysis on a given dataset:
peak-motifs
to discover over-represented and positionally biased motifs;matrix-clustering
to cluster the motifs discovered by the different algorithms (oligo-analysis
,position-analysis
anddyad-analysis
);convert-matrix
with the option-trim_info
to suppress the non-informative columns on the left and right sides of the position-specific scoring matrices resulting from the clustering;matrix-quality
to estimate, for each trimmed motif, the enrichment in the train dataset relative to the theoretical expectation, and relative to the randomized (column-permuted) matrices.
Here is the command to analyse a single dataset. Beware, this analysis can take several minutes or more depending on the size of the dataset.
make -f makefiles/02_peak-motifs.mk BOARD=${BOARD} EXPERIMENT=CHS TF=GABPA DATASET=THC_0866 peakmo
You can get the information about result files with the param
target
make -f makefiles/02_peak-motifs.mk BOARD=${BOARD} EXPERIMENT=CHS TF=GABPA DATASET=THC_0866 param
The following commands iterate the analyses over all the datasets of the 4 types of experiments for which we run single-dataset analysis. Beware, this represents a lot of analyses, which can take several hours or days. We parallelize it on a cluster to run it efficiently.
for exp in CHS GHTS HTS SMS; do \
make -f makefiles/02_peak-motifs.mk BOARD=${BOARD} EXPERIMENT=${exp} TASK=peakmo iterate_datasets; \
done
For PBM (protein binding microarray) experiment, peak-motifs
is used in a particular way by detecting differentially represented k-mers between two subsets of the PBM oligonucleotides :
- positive spots, assumed to be bound by the TF of interest in the experiment
- background / negative spots, assumed not to be bound by the TF of interest
Positive and negative spots are selected as respectively the top and bottom entries in the list of spots ranked by signal intensity.
Since our primary analyses shown that the distribution of signal intensities does not follow a normal distribution, and each dataset shows a specific shape of signal distribution we avoid the recommended threshold of
We finally retained, for each PBM dataset,
- as positive spots, the 500 spots with the highest signal intensity;
- as background / negative spots, the 35,000 peaks with the lowest signal intensity, which corresponds to the bulk of the signal intensity distribution.
For this challenge, the differential mode was only applied to PBM data. It is managed in the script makefiles.04_PBM.mk
.
## Get the list of targets for BPM data analysis
make -f makefiles/04_PBM.mk BOARD=${BOARD} targets
## Check the parameters for PBM data analysis
make -f makefiles/04_PBM.mk BOARD=${BOARD} param
## Extract spot sequences from the tab-separated values file
make -f makefiles/04_PBM.mk BOARD=${BOARD} targets tsv2fasta
## Get the 500 spots with the highest signal as positive spots, and the 35,000 lowest spots as background sequences
make -f makefiles/04_PBM.mk BOARD=${BOARD} targets top_bg_seq_all_datasets
## Run peak-motifs in differential mode on all the PBM datasets
make -f makefiles/04_PBM.mk BOARD=${BOARD} peakmo_diff_all_datasets
The discovered motifs (after clustering and trimming) are used to scan sequences in order to evaluate the performance of each motif in discriminating the train sequences from background sequences. As background sequences, we use rsat random-genome-fragments
to pick up at random genomic fragments of the same lengths as the positive training sequences.
The selection of random genomic fragments require to dispose of a local version of the reference genome (Human genome release GRCh38). This genome can be downloaded with the following command.
Beware: the genome requires ~10Gb storage, and its transfer may take some time depending on your internet speed.
rsat download-organism -v 3 -org Homo_sapiens_GCF_000001405.40_GRCh38.p14 -server http://rsat-tagc.univ-mrs.fr/rsat
## Get random genome fragments for one dataset
make -f makefiles/02_peak-motifs.mk BOARD=${BOARD} rand_fragments
## Get random genome fragments for all the datasets of a given experiment
make -f makefiles/02_peak-motifs.mk BOARD=${BOARD} EXPERIMENT=CHS rand_fragments_all_datasets
## Get random genome fragments for all the datasets of all experiments
make -f makefiles/02_peak-motifs.mk BOARD=${BOARD} rand_fragments_all_experiments
We scan sequences using rsat matrix-scan -quick
, with parameters mimicking the IBIS challenge procedure:
- equiprobable background model (independently and identically distributed nucleotides)
- for each PWM, only count the best hit per sequence (i.e. the site that maximizes the weight, defined as the log-likelihood ratio between the sequence probabilities under the PWM model and under the background model)
We scan three types of sequences:
-
train: training sequences. Note that for PBM the training sequences are the 500 spots having the highest signal intensity, whereas for other experiments (CHS, GHTS, HTS, SMS) the train sequences are all the sequences downloaded from the IBIS web site (leaderboard or final).
-
rand: random genome fragments of the same sizes as the train sequences (see previous section)
-
test: test sequences downloaded from the IBIS web site (leaderboard or final).
## Get the list of make targets for scanning
make -f makefiles/02_peak-motifs.mk targets | grep scan
## Scan train sequences for a given dataset
make -f makefiles/02_peak-motifs.mk scan_sequences_train
## Scan random genome fragments for a given dataset
make -f makefiles/02_peak-motifs.mk scan_sequences_rand
## Scan test sequences for a given dataset
make -f makefiles/02_peak-motifs.mk scan_sequences_test
## Scan the three types of sequences (train, rand, test)
make -f makefiles/02_peak-motifs.mk scan_sequences
## scan all the datasets for all the experiments
make -f makefiles/02_peak-motifs.mk scan_sequences_all_experiments
## Find the sequence scanning result files and print their size in kb
find results/${BOARD}/train -name '*_peakmo-clust-matrices*.tsv*' -exec du -sk {} \;