ConsHMM Segmentation Tutorial

ConsHMM

ConsHMM learns a conservation state model from multiple sequence alignments provided in Multiple Sequence Alignment (MAF) format. Running the ConsHMM pipeline involves first extracting the relevant information from MAF files, then running ChromHMM to learn conservation states, segmenting the genome, and finally creating user friendly visualizations of the model's parameters.

For the purposes of this tutorial, create a directory for all the output. In the examples below, this directory is named hg19_multiz100way.

Step 1: Extracting sequence information from MAF files

MAF files are somewhat bulky in nature and contain a lot more information than required for the purpose of ConsHMM. For each base in a reference genome, ConsHMM only needs to know what the corresponding bases are for the other species in a multiple sequence alignment. parseMAF.py takes in a MAF file, and extracts just the sequence information into a csv file, in which each line corresponds to one base in the reference genome. The MAF files must encode an N-way multiple sequence alignment between a reference species s, and N-1 other species.

To perform this step on the MAF files encoding the hg19 100-way Multiz alignment, download the chr*.maf files from here and the hg19 chromosome sizes from here. Then run the following command:

mkdir hg19_multiz100way/MAF_just_sequence

python source/parseMAF.py singleChromosome -m chr22.maf.gz -s speciesUCSC100way.txt \
-o hg19_multiz100way/MAF_just_sequence/ -r hg19

If you have a multiple sequence alignment split up into files by the chromosomes of a different species than you are interested in learning a model for, run parseMAF.py in fullGenome mode. Due to the output file being quite large, parseMAF.py outputs directly into .gz format.

For the hg19 100-way Multiz alignment you can find the result of this step for all chromosomes at https://public.hoffman2.idre.ucla.edu/ernst/Y7TZG/MAF_just_sequence/hg19_multiz100way/

Obtaining a list of species in the alignment

The speciesUCSC100way.txt is provided in this repository and contains the names of all the species in the 100-way alignment. You will need to provide analogous files for another multiple sequence alignment. The order of the species in the species file is not important. Below are a couple options for how to obtain this list without manual curation.

Many alignments will be accompanied by a .nh file that contains the underlying phylogenetic tree. extractSpeciesNamesFromNH.py can be used to parse one of these files into a list of species using a command such as below:

python source/extractSpeciesFromNH.py -n hg19.100way.nh -o speciesUCSC100way.txt

However, the formatting and species naming in these .nh files is not well standardized, so you will want to check this output manually and make sure it contains the correct number and names of genomes.

A second option is to directly extract the species names from a MAF file. The extractSpeciesFromMAF.py script will keep reading from a MAF file until it has seen the number of distinct species that you expect to see in the alignment. This should work almost all the time, unless not all the species in the alignment are present in this particular file. For example, if one of the species does not contain any alignment on a particular chromosome, which is the file you are using. Another thing to note is that this script could in theory have to go through the entire file before seeing every single species, but in practice that is almost never the case, and the script actually works instantly for every MAF file we've encountered. Example of doing this on the file containing chromosome 22 from the 100-way multiz alignment with hg19:

python source/extractSpeciesFromMAF.py -m chr22.maf.gz -n 100 -o speciesUCSC100way.txt

Step 2: Transforming sequence information files to binary files for ChromHMM

ChromHMM requires files to be in the binary format explained in the ChromHMM Manual. Because the conservation states are learned at base-wise resolution and genomes can be very large, we will divide the reference genome into smaller segments for the purposes of training the model. binarizeAlignment.py takes in files generated by parseMAF.py and produces segments of the desired size in the binarized format necessary for ConsHMM. The following command binarizes chromosome 22 into 200kb segments. The output folder will contain a folder named Binarized_Input_files with files named chrA_B_features_binary.txt.gz, where A is a chromosome number, and B indexes the segments on that chromosome starting at 0.

python source/binarizeAlignment.py reference \
-s hg19_multiz100way/MAF_just_sequence/chr22_maf_sequence.csv.gz \
-o hg19_multiz100way -c chr22 -n 200000 -r hg19 -cl hg19.chrom.sizes

For the hg19 100-way Multiz alignment you can find the result of this step for all chromosomes at https://public.hoffman2.idre.ucla.edu/ernst/Y7TZG/Binarized_Input/

Note for extending to genome-wide model: If learning a model for the entire genome, you should perform the same operation on all chromosomes with the same segment size and using the same output directory. This can be done in parallel, and you should specify the same output folder for all chromosomes, because the next step requires all segments to be in the same folder to randomly sample from all chromosomes. binarizeAlignment.py will also create the folder Binarized_Input_File_Lists in your output directory. Binarized_Input_File_Lists will contain files keeping track of which files came from which chromosome. These file lists will be useful for parallelizing the segmentation by chromosome, to reduce run time in step 4.

Step 3: Running ChromHMM to learn a conservation state model

Once you have split up the alignment into binarized segments you can run ChromHMM to train a model. In the example below 150 segments are sampled on each iteration of the EM algorithm. To learn a 100 state model from the binarized input from the previous step, create an output directory for the model and run ChromHMM as below:

mkdir hg19_multiz100way/model_100_states
java -jar ChromHMM/ChromHMM.jar LearnModel -b 1 -nobed -n 150 -d -1 -lowmem -p 12 \
hg19_multiz100way/Binarized_Input hg19_multiz100way/model_100_states 100 hg19

For the ConsHMM paper this step took approximately a week, and was run with 12 cores (-p 12) and 4GB of RAM on each core. For the hg19 Multiz 100-way alignment, the model, transitions and emission files obtained after training are available both here and in this repository here.

Visualizing the emission parameters of the learned model

Although ChromHMM produces a visualization of the emission parameters, this was not designed for conservation states. To produce a more meaningful visualization see the post processing step detailed here. The corresponding R notebook and resulting heatmap for the hg19 Multiz 100-way model is included in the postProcessing folder of this repository.

Step 4: Generating a ConsHMM Segmentation

Using the model file produced by ChromHMM you can generate a segmentation of the reference genome in the multiple sequence alignment, assigning a state to each base. To do this within reasonable time and memory constraints you can do this separately for each chromosome using the binarized segments generated in step 2 and then paste the resulting segmentations together. binarizeAlignment.py has created files named chr*_binary_list.txt that can be given to MakeSegmentation as seen below for chromosome 22. Create a directory for the segmentation then run ChromHMM, as below:

mkdir hg19_multiz100way/Segmentation
java -jar ChromHMM/ChromHMM.jar MakeSegmentation -b 1 -lowmem \ 
-f hg19_multiz100way/Binarized_Input_File_Lists/chr22_binarized_files_list.txt -i chr22 \ hg19_multiz100way/model_100_states/model_100.txt hg19_multiz100way/Binarized_Input hg19_multiz100way/Segmentation

To then paste the resulting segments back together use

python source/mergeSegmentation.py -b hg19_multiz100way/Segmentation \ 
-o hg19_multiz100way/Segmentation/chr22_segmentation.bed.gz -n 200000 -s 100 -c chr22

Note that to keep file sizes low, mergeSegmentation.py outputs directly to .gz format. You can then further merge all the chromosomes together into one segmentation file through some basic Unix commands such as:

for i in {1..22} {X,Y}
do
    zcat chr${i}/chr${i}_segmentation.bed >> GW_segmentation.bed
done

GW_segmentation.bed will then contain a genome-wide segmentation. You may want to further compress this file, which can be quite large.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly