From 3eaae3afcada477857c29459fd5cf3ead3c2e078 Mon Sep 17 00:00:00 2001
From: Nicole Deflaux
Date: Thu, 15 Dec 2016 14:07:21 -0800
Subject: [PATCH] Add page for Platinum Genomes DeepVariant GRCh38 data.

Change-Id: Ibb766b99eeded27c991670cac1deb29334d4d85e
---
 .../discover_public_data/genomic_data_toc.rst |  1 +
 .../discover_public_data/platinum_genomes.rst |  2 +-
 .../platinum_genomes_deepvariant.rst          | 49 +++++++++++++++++++
 3 files changed, 51 insertions(+), 1 deletion(-)
 create mode 100644 docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst

diff --git a/docs/source/use_cases/discover_public_data/genomic_data_toc.rst b/docs/source/use_cases/discover_public_data/genomic_data_toc.rst
index d7bb2f0..658f5bb 100644
--- a/docs/source/use_cases/discover_public_data/genomic_data_toc.rst
+++ b/docs/source/use_cases/discover_public_data/genomic_data_toc.rst
@@ -22,6 +22,7 @@ __ RenderedVersion_
 
    1000_genomes
    platinum_genomes
+   platinum_genomes_deepvariant
    reference_genomes
    mssng_data
    isb_cgc_data
diff --git a/docs/source/use_cases/discover_public_data/platinum_genomes.rst b/docs/source/use_cases/discover_public_data/platinum_genomes.rst
index e5720af..d566b0c 100644
--- a/docs/source/use_cases/discover_public_data/platinum_genomes.rst
+++ b/docs/source/use_cases/discover_public_data/platinum_genomes.rst
@@ -17,7 +17,7 @@ __ RenderedVersion_
 
 .. comment: end: goto-read-the-docs
 
-This dataset comprises the 17 member CEPH pedigree 1463. See http://www.illumina.com/platinumgenomes/ for full details.
+This dataset comprises the `6 member CEPH pedigree 1463 `_. See http://www.illumina.com/platinumgenomes/ for full details.
 
 Google Cloud Platform data locations
 ------------------------------------
diff --git a/docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst b/docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst
new file mode 100644
index 0000000..85f8c81
--- /dev/null
+++ b/docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst
@@ -0,0 +1,49 @@
+Platinum Genomes DeepVariant
+============================
+
+.. comment: begin: goto-read-the-docs
+
+.. container:: visible-only-on-github
+
+   +-----------------------------------------------------------------------------------+
+   | **The properly rendered version of this document can be found at Read The Docs.** |
+   |                                                                                   |
+   | **If you are reading this on github, you should instead click** `here`__.         |
+   +-----------------------------------------------------------------------------------+
+
+.. _RenderedVersion: http://googlegenomics.readthedocs.org/en/latest/use_cases/discover_public_data/platinum_genomes.html
+
+__ RenderedVersion_
+
+.. comment: end: goto-read-the-docs
+
+This dataset comprises the `6 member CEPH pedigree 1463 `_ called using the DeepVariant toolchain and reference genome GRCh38. See the `DeepVariant preprint `_ for full details:
+
+| `Creating a universal SNP and small indel variant caller with deep neural networks `_
+| Ryan Poplin, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Cory Y. McLean, Mark A. DePristo
+| DOI: https://doi.org/10.1101/092890
+|
+
+Google Cloud Platform data locations
+------------------------------------
+
+* Google Cloud Storage folder `gs://genomics-public-data/platinum-genomes-deepvariant `_
+* Google Genomics Dataset ID `14839180708999654392 `_
+
+  * `Variant Reference Bounds `_
+
+* Google BigQuery Dataset ID `genomics-public-data:platinum_genomes_deepvariant `_
+
+Provenance
+----------
+
+* The FASTQ files in `gs://genomics-public-data/platinum-genomes/fastq/ `_ were run through the DeepVariant toolchain to produce the corresponding ``*.deepvariant.g.vcf`` and ``*.deepvariant.vcf`` files in `gs://genomics-public-data/platinum-genomes-deepvariant/vcf/ `_.
+* These files were then imported to Google Genomics and the variants were exported to Google BigQuery as table `genomics-public-data:platinum_genomes_deepvariant.single_sample_genome_calls `_.
+* The data was then merged to produce variants-only `multisample-platinum-genomes-deepvariant.vcf `_ and table `genomics-public-data:platinum_genomes_deepvariant.multisample_variants `_.
+
+  * The merging logic:
+
+    * groups together only single- and multi-nucleotide polymorphisms with the same reference representation and alternate allele length that originate at the same chromosome and reference position
+    * merges all insertions at the same reference position, and
+    * splits complex variants into multiple records.
+  * Individual variants with GQ < 20 are hard-masked to no-calls, with the genotype likelihoods retained.
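
Note on the Provenance section added by this patch: the last bullet says calls with GQ < 20 are hard-masked to no-calls while the genotype likelihoods are kept. As a rough illustration of what that rule could look like, here is a minimal Python sketch. It is not the pipeline's actual code. The dictionary layout, the field names (``genotype``, ``GQ``, ``likelihoods``), the ``-1`` no-call sentinel, and the helper name ``mask_low_gq`` are all assumptions made for the example; only the GQ threshold of 20 comes from the patch text.

```python
from typing import Any, Dict

NO_CALL = -1  # assumed sentinel for a no-call allele (hypothetical)


def mask_low_gq(call: Dict[str, Any], min_gq: int = 20) -> Dict[str, Any]:
    """Return a copy of `call` whose genotype is no-call when GQ < min_gq.

    All other fields, including the genotype likelihoods, are left as they
    were. Calls without a GQ value are returned unchanged.
    """
    masked = dict(call)  # shallow copy so the input record is not modified
    gq = call.get("GQ")
    if gq is not None and gq < min_gq:
        # Replace every allele index with the no-call sentinel, keeping ploidy.
        masked["genotype"] = [NO_CALL] * len(call["genotype"])
    return masked


# Hypothetical single-sample call with a low genotype quality.
example = {
    "sample": "NA12878",
    "genotype": [0, 1],
    "GQ": 12,
    "likelihoods": [-1.2, -0.03, -4.5],
}
masked = mask_low_gq(example)
```

With this sketch, a call at exactly GQ 20 keeps its genotype, since the documented rule is a strict "less than" comparison.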