Skip to content

Creating allele specific ConsHMM annotations

Adriana Arneson edited this page Feb 19, 2020 · 1 revision

Overview

Creating allele-specific ConsHMM annotations require the following:

  • A trained ConsHMM model which will be used to create the allele-specific annotation
  • A genome-wide segmentation of a genome created with the ConsHMM model above
  • Sequence information from a MAF file parsed using the parseMAF command
  • A file containing chromosome sizes for your target genome

The tutorial below walks through the steps of creating allele-specific annotations for chromosome 20 of the hg38 human genome based on a ConsHMM model trained on the 100-way Multiz alignment of 99 vertebrates to the human genome.

Step 1: Download necessary files

To follow along with the tutorial, download the model file from here, the segmentation from here, the sequence file from here, the chromosome sizes file from here.

Step 2: Updating model parameters

To better reflect the state assignment prior for a variant at any position in the genome, we will replace the initial state parameters of the model with the genome wide frequency of each state in the segmentation, using the following command.

python source/updateInitialParams.py -m model_100.txt -s 100_states_segmentation.bed.gz -o model_100_init_replaced.txt

Step 3: Split up MAF sequence files

The variant segmentation is computationally time expensive, so to speed this process up, we will split the alignment data into pieces of 10 mega bases, for ease of parallelization. The allele specific assignments rely on segmenting a window of size W bases upstream and W bases downstream of the variant of interest. Therefore, the splits have to be padded with W lines from the neighboring chunks in the split. The splitMAFJustSequence.sh script does this for you.

source/splitMAFJustSequence.sh chr20_maf_sequence.csv.gz 10000000 maf_sequence_split chr20 10

Step 4: Generate allele specific segmentations

The ReassignVariantState java class can now create allele specific segmentations for each of the 10MB chunks from the previous step. The class needs to know the index of each chunk in order to account for properly padding the ends of the chromosomes with positions where nothing aligns and matches the reference sequence. Below, we show how to create allele specific segmentations for all the chunks, but bear in mind that this step is time consuming and is highly recommended to be parallelized.

mkdir -p variantSegmentations
cd source
for j in ../maf_sequence_split/chr20_maf_sequence*
do
    index=${j##*_}
    index=${index%.*}
    index=${index: -2}
    # It is highly recommended to submit the following line as a job on a computing cluster and parallelize this for loop
    java -classpath ../ChromHMM/ChromHMM.jar:. ReassignVariantState ${j} \
variantSegmentations/chr20_${index}_variant_segmentation.txt.gz hg38 \
model_100_init_replaced.txt ${index} 10000000 ../chromosomeInfo/hg38.chrom.sizes chr20 21 
done
cd ..

Step 5: Paste together chromosome pieces

To paste the chunks you created above together into one file for chromosome 20, while properly accounting for chunk overlap, use the following command:

python source/mergeVariantSegmentation.py -c chr20 -d variantSegmentations/ -n 8 \
-o variantSegmentations/hg38_multiz100way/chr20_variant_segmentation.txt.gz

Note The -n 8 parameter above corresponds to there being 8 10MB pieces for chromosome 20. You should adjust this number if working with a different chromosome. To get the number of files for a certain chromosome, assuming you followed the same file naming conventions as above, you could use a bash command like the following, replacing 20 with your chromosome number:

wc -l variantSegmentations/chr20_* | cut -f1 -d ' '