Skip to content

Commit

Permalink
Update README.md
Browse files Browse the repository at this point in the history
  • Loading branch information
liuqianhn authored Jul 28, 2019
1 parent 2b3d458 commit 0790cf4
Showing 1 changed file with 49 additions and 0 deletions.
49 changes: 49 additions & 0 deletions day2_alignment/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -306,6 +306,55 @@ Where you can see that the vast majority of the two sets of called variants are

We should emphasize that the above comparative analysis is a quick-and-dirty way of checking variant overlaps using `bcftools`. We did not consider many other issues, such as indel re-alignment, the splitting of multi-allelic variants, the heterozygosity of the variants, the regions with low mappability, and so on. For more formal comparison of varint call sets, there are more specialized tools, such as the [vcfeval](https://github.com/RealTimeGenomics/rtg-tools) and [hap.py/som.py](https://github.com/Illumina/hap.py) released by Illumina. For a description on how GA4GH evaluates variant calling performance using GIAB data, you can check [here](https://github.com/ga4gh/benchmarking-tools/blob/master/resources/high-confidence-sets/giab.md).

#### 1.7 Additional variant calling.
You can also try to use other variant calling methods, `Strelka`. We give how to run `Strelka` and simply test its called variants with the ground-truth variants, and you can calculate precision and recall by yourself.

#### 1.7.1 `Strelka` for a bam file aligned with `bwa bam`
```
conda activate dvpy27
rm -dr "bwa.Strelka"
configureStrelkaGermlineWorkflow.py --bam chr1.2mb.bwa.bam --referenceFasta ref/hg37d5.chr1.fa --runDir bwa.Strelka --callRegions ref/chr1.2mb.bed.gz
time bwa.Strelka/runWorkflow.py -m local -j 2
conda deactivate
```
will call variants, and the commands below will show how many variants are generated for the region of interest.
```
bcftools view -i '%QUAL>=200' bwa.Strelka/results/variants/genome.vcf.gz > bwa.Strelka/results/variants/genome.q200.vcf
bgzip -f bwa.Strelka/results/variants/genome.q200.vcf -c > bwa.Strelka/results/variants/genome.q200.vcf.gz
bcftools index -f bwa.Strelka/results/variants/genome.q200.vcf.gz
bcftools stats bwa.Strelka/results/variants/genome.q200.vcf.gz | grep "^SN"
```

Similarly, you can use the commands below to see how many called variants are in the ground-truth variants.
```
bcftools isec bwa.Strelka/results/variants/genome.q200.vcf.gz vcf/vcf.chr1.2mb.vcf.gz -p bwa.strelka_perf
bcftools stats bwa.strelka_perf/0002.vcf | grep "^SN"
```
According to the results, you can calculate precision and recall by yourself.

#### 1.7.2 `Strelka` for a bam file aligned with `minimap2`
```
rm -dr "mp2.Strelka"
conda activate dvpy27
configureStrelkaGermlineWorkflow.py --bam chr1.2mb.mp2.bam --referenceFasta ref/hg37d5.chr1.fa --runDir mp2.Strelka --callRegions ref/chr1.2mb.bed.gz
time mp2.Strelka/runWorkflow.py -m local -j 2
conda deactivate
```
Simiar to what we did for a bam file generated by `bwa mem`, the command above generated the variant calling for a bam generated by `minimap2`

```
bcftools view -i '%QUAL>=200' mp2.Strelka/results/variants/genome.vcf.gz > mp2.Strelka/results/variants/genome.q200.vcf
bgzip -f mp2.Strelka/results/variants/genome.q200.vcf -c > mp2.Strelka/results/variants/genome.q200.vcf.gz
bcftools index -f mp2.Strelka/results/variants/genome.q200.vcf.gz
bcftools stats mp2.Strelka/results/variants/genome.q200.vcf.gz | grep "^SN"
```
The commands above will tell you how many reliable variants are called.
And then, the commands below show how many called variants are in the ground-truth variants.
```
bcftools isec mp2.Strelka/results/variants/genome.q200.vcf.gz vcf/vcf.chr1.2mb.vcf.gz -p mp2.strelka_perf
bcftools stats mp2.strelka_perf/0002.vcf | grep "^SN"
```

### 2. Long read alignment and variants calling

#### 2.1 Preparation of the folder and data
Expand Down

0 comments on commit 0790cf4

Please sign in to comment.