asperlea edited this page Jul 7, 2017 · 12 revisions

ConsHMM

ConsHMM parses multiple sequence alignments stored in MAF (Multiple Alignment Format) files into a format from which ChromHMM can learn a conservation state model.

Step 1: Extracting sequence information from MAF files

MAF files are bulky and contain far more information than ConsHMM needs. parseMAF.py takes a MAF file and extracts just the sequence information into a CSV file. The MAF file must encode an N-way multiple sequence alignment between a reference species s and N-1 other species.
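For readers unfamiliar with MAF, the sketch below illustrates where the sequence information sits in such a file. It is not the parseMAF.py implementation. It is a minimal reader, assuming the standard MAF layout in which each alignment block starts with an "a" line and each aligned sequence is an "s" line with the fields source, start, size, strand, source size, and aligned text. The function name iter_maf_blocks and the dictionary layout are hypothetical.

```python
def iter_maf_blocks(lines):
    """Yield one dict per alignment block, mapping source name to its 's' line fields.

    Illustrative only: assumes standard MAF 'a' and 's' lines and ignores
    every other line type (comments, 'i', 'e', 'q' lines).
    """
    block = None
    for raw in lines:
        fields = raw.split()
        if not fields:
            # A blank line terminates the current block.
            if block:
                yield block
            block = None
            continue
        if fields[0] == "a":
            # An 'a' line opens a new alignment block.
            if block:
                yield block
            block = {}
        elif fields[0] == "s" and block is not None:
            src, start, size, strand, _src_size, text = fields[1:7]
            block[src] = {
                "start": int(start),   # start position in the source sequence
                "size": int(size),     # number of non-gap bases in this row
                "strand": strand,
                "text": text,          # aligned bases, with '-' for gaps
            }
    if block:
        yield block
```

Each yielded block holds one row per species present in that block, which is the information a per-position sequence table can be built from.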

To perform this step on the 100-way multiple sequence alignment from UCSC, download the chr*.maf files from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/. The following command parses chr22. To parse another chromosome, replace every occurrence of chr22 with the chromosome of choice:

    python parseMAF.py chr22.maf UCSC_100_way_species_names chr22_maf_sequence.csv chr22 hg19
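When parsing many chromosomes, the same command can be generated programmatically. The following is a minimal sketch, assuming the five positional arguments always follow the pattern of the chr22 example. The helper names build_parse_command and parse_chromosomes and the chromosome list are assumptions for illustration, not part of ConsHMM.

```python
import subprocess


def build_parse_command(chrom, species_file="UCSC_100_way_species_names", genome="hg19"):
    """Return the parseMAF.py argument list for one chromosome, mirroring the chr22 example."""
    return [
        "python", "parseMAF.py",
        f"{chrom}.maf",                # input MAF file
        species_file,                  # species names file, as in the example
        f"{chrom}_maf_sequence.csv",   # output CSV file
        chrom,
        genome,
    ]


def parse_chromosomes(chromosomes, dry_run=True):
    """Print (dry run) or execute the parse command for each chromosome in turn."""
    commands = [build_parse_command(c) for c in chromosomes]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop at the first failing chromosome
    return commands


# Assumed chromosome list: the autosomes plus X and Y.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

if __name__ == "__main__":
    parse_chromosomes(CHROMOSOMES, dry_run=True)
```

The dry run only prints the commands, which makes it easy to check the file names before launching the actual parsing.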

Step 2: Transforming sequence information files to binary files for ChromHMM

ChromHMM requires its input files to be in the binary format explained in [insert ChromHMM manual link]. Conservation states are learned at base-wise resolution, and whole genomes are too large to train on directly at that resolution, so the genome is sampled for training data in the following step. binarizeAlignment.py converts the _maf_sequence.csv files into the format ChromHMM requires and also splits them into chunks of a desired size. The following example binarizes chr22 and splits it into 200kb chunks:

    python binarizeAlignment.py chr22_maf_sequence.csv binaryFeatures chr22 200000

If learning a model for the entire genome, perform the same operation on every chromosome with the same chunk size.
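The splitting into chunks can be pictured as follows. This is a sketch of fixed-size chunking only, not of binarizeAlignment.py itself. The function name chunk_ranges, the 0-based half-open coordinates, and the handling of the final shorter chunk are all assumptions.

```python
def chunk_ranges(length, chunk_size=200000):
    """Split the interval [0, length) into consecutive chunks of at most chunk_size bases.

    Returns a list of (start, end) pairs with 0-based, half-open coordinates.
    The last chunk is shorter when length is not a multiple of chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, length))
        for start in range(0, length, chunk_size)
    ]


if __name__ == "__main__":
    # Hypothetical sequence length, chosen only to show a short final chunk.
    for start, end in chunk_ranges(450000, 200000):
        print(start, end)
```

With a chunk size of 200000, a hypothetical 450000-base sequence yields three chunks, the last one shorter than the others. Using the same chunk size on every chromosome keeps the chunks comparable when they are sampled for training.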

Step 3: Running ChromHMM with the subsampling flag (-n 150)

Step 4: Merging resulting ChromHMM segmentation

mergeSegmentation.py
