-
Notifications
You must be signed in to change notification settings - Fork 5
Creating allele specific ConsHMM annotations
Creating allele-specific ConsHMM annotations require the following:
- A trained ConsHMM model which will be used to create the allele-specific annotation
- A genome-wide segmentation of a genome created with the ConsHMM model above
- Sequence information from a MAF file parsed using the parseMAF command
- A file containing chromosome sizes for your target genome
The tutorial below walks through the steps of creating allele-specific annotations for chromosome 20 of the hg38 human genome based on a ConsHMM model trained on the 100-way Multiz alignment of 99 vertebrates to the human genome.
To follow along with the tutorial, download the model file from here, the segmentation from here, the sequence file from here, the chromosome sizes file from here.
To better reflect the state assignment prior for a variant at any position in the genome, we will replace the initial state parameters of the model with the genome wide frequency of each state in the segmentation, using the following command.
python source/updateInitialParams.py -m model_100.txt -s 100_states_segmentation.bed.gz -o model_100_init_replaced.txt
The variant segmentation is computationally time expensive, so to speed this process up, we will split the alignment data into pieces of 10 mega bases, for ease of parallelization. The allele specific assignments rely on segmenting a window of size W bases upstream and W bases downstream of the variant of interest. Therefore, the splits have to be padded with W lines from the neighboring chunks in the split. The splitMAFJustSequence.sh
script does this for you.
source/splitMAFJustSequence.sh chr20_maf_sequence.csv.gz 10000000 maf_sequence_split chr20 10
The ReassignVariantState java class can now create allele specific segmentations for each of the 10MB chunks from the previous step. The class needs to know the index of each chunk in order to account for properly padding the ends of the chromosomes with positions where nothing aligns and matches the reference sequence. Below, we show how to create allele specific segmentations for all the chunks, but bear in mind that this step is time consuming and is highly recommended to be parallelized.
mkdir -p variantSegmentations
cd source
for j in ../maf_sequence_split/chr20_maf_sequence*
do
index=${j##*_}
index=${index%.*}
index=${index: -2}
# It is highly recommended to submit the following line as a job on a computing cluster and parallelize this for loop
java -classpath ../ChromHMM/ChromHMM.jar:. ReassignVariantState ${j} \
variantSegmentations/chr20_${index}_variant_segmentation.txt.gz hg38 \
model_100_init_replaced.txt ${index} 10000000 ../chromosomeInfo/hg38.chrom.sizes chr20 21
done
cd ..
To paste the chunks you created above together into one file for chromosome 20, while properly accounting for chunk overlap, use the following command:
python source/mergeVariantSegmentation.py -c chr20 -d variantSegmentations/ -n 8 \
-o variantSegmentations/hg38_multiz100way/chr20_variant_segmentation.txt.gz
Note The -n 8
parameter above corresponds to there being 8 10MB pieces for chromosome 20. You should adjust this number if working with a different chromosome. To get the number of files for a certain chromosome, assuming you followed the same file naming conventions as above, you could use a bash command like the following, replacing 20 with your chromosome number:
wc -l variantSegmentations/chr20_* | cut -f1 -d ' '