Home
ConsHMM parses multiple sequence alignments stored in Multiple Alignment Format (MAF) files into a format from which ChromHMM can learn a conservation state model.
MAF files are bulky and contain far more information than ConsHMM needs. parseMAF.py takes a MAF file and extracts just the sequence information into a CSV file. The MAF file must encode an N-way multiple sequence alignment between a reference species s and N-1 other species.
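To illustrate what "just the sequence information" means, here is a minimal sketch of reading the "s" (sequence) lines out of MAF alignment blocks. It is not the code in parseMAF.py, and it says nothing about the CSV layout that script writes. The function name `read_maf_blocks` and the toy alignment in `SAMPLE` are made up for this example; only the MAF line layout (`s src start size strand srcSize text`) comes from the format itself.

```python
# Toy MAF content; coordinates and source sizes are invented for illustration.
SAMPLE = """##maf version=1
a score=100.0
s hg19.chr22    100 5 + 1000 ACGTA
s panTro4.chr22 200 4 + 1000 ACG-A

a score=50.0
s hg19.chr22    105 3 + 1000 GGC
"""


def read_maf_blocks(lines):
    """Yield one list of (src, start, size, text) tuples per alignment block."""
    block = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank lines separate blocks but carry no data
        if fields[0] == "a":
            # An "a" line opens a new alignment block.
            if block:
                yield block
            block = []
        elif fields[0] == "s" and len(fields) == 7:
            # s <src> <start> <size> <strand> <srcSize> <text>
            _, src, start, size, _strand, _src_size, text = fields
            block.append((src, int(start), int(size), text))
        # Header ("#...") and other line types (i, e, q) are skipped here.
    if block:
        yield block


if __name__ == "__main__":
    for number, block in enumerate(read_maf_blocks(SAMPLE.splitlines()), start=1):
        for src, start, size, text in block:
            print(number, src, start, size, text)
```

Everything other than the species name, the start position, and the aligned text is dropped, which is why the extracted file is much smaller than the MAF input.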
To perform this step on the MAF files encoding the 100-way multiple sequence alignment from UCSC, download the chr*.maf files from http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz100way/. To parse any chromosome, replace chr22 in the following command with the chromosome of choice:
python parseMAF.py chr22.maf UCSC_100_way_species_names chr22_maf_sequence.csv chr22 hg19
ChromHMM requires its input files to be in the binary format explained in [insert ChromHMM manual link]. Because conservation states are learned at single-base resolution and whole genomes are too large to train on in full, the following step samples the genome for training data. binarizeAlignment.py converts the _maf_sequence.csv files into the format ChromHMM requires and splits them into chunks of a chosen size. The following example binarizes chr22 and splits it into 200 kb chunks.
python binarizeAlignment.py chr22_maf_sequence.csv binaryFeatures chr22 200000
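As a rough picture of what splitting into fixed-size chunks involves, the sketch below computes chunk boundaries for a sequence of a given length. It is only an illustration: the helper `chunk_ranges`, the 0-based half-open coordinates, and the shorter final chunk are assumptions, not a description of how binarizeAlignment.py numbers or names its output.

```python
def chunk_ranges(length, chunk_size):
    """Return half-open (start, end) ranges that together cover [0, length)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, length))
        for start in range(0, length, chunk_size)
    ]


if __name__ == "__main__":
    # A hypothetical 450 kb sequence split into 200 kb chunks.
    for start, end in chunk_ranges(450_000, 200_000):
        print(start, end)
```

With these inputs the last chunk covers only the remaining 50 kb, so the number of chunks is the length divided by the chunk size, rounded up.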
If learning a model for the entire genome, you should perform the same operation on all chromosomes with the same chunk size.
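One way to keep the chunk size identical across chromosomes is to generate every command from a single constant. The sketch below only builds and prints the commands, following the argument order of the chr22 example above. The chromosome list (chr1 to chr22, chrX, chrY) and the helper name `binarize_command` are assumptions; adjust the list to match the chromosomes you actually parsed.

```python
import shlex

# Single source of truth for the chunk size, shared by every chromosome.
CHUNK_SIZE = 200000

# Assumed chromosome set; edit to match the *_maf_sequence.csv files you have.
CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]


def binarize_command(chrom, second_arg="binaryFeatures", chunk_size=CHUNK_SIZE):
    """Build the binarizeAlignment.py argument list for one chromosome.

    The second argument is passed through exactly as in the chr22 example.
    """
    return [
        "python",
        "binarizeAlignment.py",
        f"{chrom}_maf_sequence.csv",
        second_arg,
        chrom,
        str(chunk_size),
    ]


if __name__ == "__main__":
    for chrom in CHROMOSOMES:
        # Printed rather than executed, so the commands can be reviewed first.
        print(shlex.join(binarize_command(chrom)))
```

The printed lines can be pasted into a shell or a job script; each argument list could equally be passed to `subprocess.run` once the inputs are in place.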
mergeSegmentation.py