This repository has been archived by the owner on Oct 29, 2023. It is now read-only.

Add PrecisionFDA dataset. #144

Merged
merged 4 commits into from
Jan 30, 2017
Changes from 3 commits
3 changes: 2 additions & 1 deletion docs/source/conf.py
Original file line number Diff line number Diff line change
@@ -186,6 +186,7 @@
.. _VariantSet: https://cloud.google.com/genomics/reference/rest/v1/variantsets
.. _Load Genomic Variants: https://cloud.google.com/genomics/v1/load-variants
.. _Understanding the BigQuery Variants Table Schema: https://cloud.google.com/genomics/v1/bigquery-variants-schema
+.. _Verily DeepVariant: https://cloud.google.com/genomics/v1alpha2/deepvariant

.. _Using Google Cloud Storage with Big Data: https://cloud.google.com/storage/docs/working-with-big-data
.. _gsutil: https://cloud.google.com/storage/docs/gsutil
@@ -252,7 +253,7 @@

.. GLOBAL SUBSTITUTIONS CAN GO HERE

-.. |sparkADC| replace:: If the `Application Default Credentials`_ are not sufficient, use ``--secretsFile=PATH/TO/YOUR/client_secrets.json``. If you do not already have this file, see the `authentication instructions`_ to obtain it.
+.. |sparkADC| replace:: If the `Application Default Credentials`_ are not sufficient, use ``--client-secrets=PATH/TO/YOUR/client_secrets.json``. If you do not already have this file, see the `authentication instructions`_ to obtain it.
.. |dataflowADC| replace:: If the `Application Default Credentials`_ are not sufficient, use ``--client-secrets PATH/TO/YOUR/client_secrets.json``. If you do not already have this file, see the `authentication instructions`_ to obtain it.
.. |dataflowSomeRefs| replace:: Use a comma-separated list to run over multiple disjoint regions. For example to run over `BRCA1`_ and `BRCA2`_ ``--references=chr13:32889610:32973808,chr17:41196311:41277499``.
.. |dataflowAllRefs| replace:: To run this pipeline over the entire genome, use ``--allReferences`` instead of ``--references=chr17:41196311:41277499``.
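
The ``--references`` value described in the substitutions above is a comma-separated list of ``name:start:end`` triples. A minimal Python sketch of building and parsing such a value follows; the helper names ``format_references`` and ``parse_references`` are hypothetical and are not part of the pipelines themselves.

```python
def format_references(regions):
    """Build a --references flag from (name, start, end) tuples.

    Hypothetical helper: it only mirrors the flag syntax shown in the
    documentation (name:start:end, joined by commas).
    """
    return "--references=" + ",".join(
        "{}:{}:{}".format(name, start, end) for name, start, end in regions
    )


def parse_references(flag):
    """Split a --references flag back into (name, start, end) tuples."""
    value = flag.split("=", 1)[1] if flag.startswith("--references=") else flag
    regions = []
    for item in value.split(","):
        name, start, end = item.split(":")
        regions.append((name, int(start), int(end)))
    return regions


if __name__ == "__main__":
    # The two disjoint regions used in the documentation's example.
    regions = [("chr13", 32889610, 32973808), ("chr17", 41196311, 41277499)]
    print(format_references(regions))
```

Keeping the regions as tuples and formatting them only at the command line makes it harder to mistype a coordinate when several regions are combined.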
2 changes: 1 addition & 1 deletion docs/source/includes/spark_setup.rst
@@ -43,6 +43,6 @@

cd spark-examples
sbt assembly
-cp target/scala-2.10/googlegenomics-spark-examples-assembly-*.jar ~/
+cp target/scala-2.*/googlegenomics-spark-examples-assembly-*.jar ~/
cd ~/

@@ -23,6 +23,7 @@ __ RenderedVersion_
1000_genomes
platinum_genomes
platinum_genomes_deepvariant
+precision_fda
reference_genomes
mssng_data
isb_cgc_data
@@ -11,13 +11,13 @@ Platinum Genomes DeepVariant
| **If you are reading this on github, you should instead click** `here`__. |
+-----------------------------------------------------------------------------------+

-.. _RenderedVersion: http://googlegenomics.readthedocs.org/en/latest/use_cases/discover_public_data/platinum_genomes.html
+.. _RenderedVersion: http://googlegenomics.readthedocs.org/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html

__ RenderedVersion_

.. comment: end: goto-read-the-docs

-This dataset comprises the `6 member CEPH pedigree 1463 <http://www.ebi.ac.uk/ena/data/view/PRJEB3381>`_ called using the DeepVariant toolchain and reference genome GRCh38. See the `DeepVariant preprint <http://biorxiv.org/content/early/2016/12/14/092890>`_ for full details:
+This dataset comprises the `6 member CEPH pedigree 1463 <http://www.ebi.ac.uk/ena/data/view/PRJEB3381>`_ called using the the alpha version of the `Verily DeepVariant`_ toolchain aligned to :ref:`vgrch38` reference genome. See the `DeepVariant preprint <http://biorxiv.org/content/early/2016/12/14/092890>`_ for full details:
Review comment (Contributor): "the the "


| `Creating a universal SNP and small indel variant caller with deep neural networks <http://biorxiv.org/content/early/2016/12/14/092890>`_
| Ryan Poplin, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Cory Y. McLean, Mark A. DePristo
38 changes: 38 additions & 0 deletions docs/source/use_cases/discover_public_data/precision_fda.rst
@@ -0,0 +1,38 @@
PrecisionFDA Truth Challenge
============================

.. comment: begin: goto-read-the-docs

.. container:: visible-only-on-github

+-----------------------------------------------------------------------------------+
| **The properly rendered version of this document can be found at Read The Docs.** |
| |
| **If you are reading this on github, you should instead click** `here`__. |
+-----------------------------------------------------------------------------------+

.. _RenderedVersion: http://googlegenomics.readthedocs.org/en/latest/use_cases/discover_public_data/precision_fda.html

__ RenderedVersion_

.. comment: end: goto-read-the-docs

This dataset includes both:

* the input for the `PrecisionFDA Truth Challenge <https://precision.fda.gov/challenges/truth>`_ comprised of whole-genome sequences for HG001 (NA12878) and HG002 (NA24385)
* the output from the alpha version of the `Verily DeepVariant`_ toolchain aligned to :ref:`vgrch38` reference genome. See the `DeepVariant preprint <http://biorxiv.org/content/early/2016/12/14/092890>`_ for full details:

| `Creating a universal SNP and small indel variant caller with deep neural networks <http://biorxiv.org/content/early/2016/12/14/092890>`_
| Ryan Poplin, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Cory Y. McLean, Mark A. DePristo
| DOI: https://doi.org/10.1101/092890
|

Google Cloud Platform data locations
------------------------------------

* Google Cloud Storage folder `gs://genomics-public-data/precision-fda <https://console.cloud.google.com/storage/genomics-public-data/precision-fda/>`_

Provenance
----------

* The FASTQ files in `gs://genomics-public-data/precision-fda/input <https://console.cloud.google.com/storage/genomics-public-data/precision-fda/input>`_ were run through the `Verily DeepVariant`_ alpha toolchain to produce the corresponding files in `gs://genomics-public-data/precision-fda/output/deepvariant-alpha <https://console.cloud.google.com/storage/genomics-public-data/precision-fda/output/deepvariant-alpha>`_.
30 changes: 30 additions & 0 deletions docs/source/use_cases/discover_public_data/reference_genomes.rst
@@ -58,6 +58,36 @@ Genome Reference Consortium Human Build 38 includes data from 39 gzipped fasta f

More information on this source data can be found in this `NCBI article <http://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.26/>`__ and in the `FTP README <ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/README_ASSEMBLIES>`__.


.. _vgrch38:

Verily's GRCh38
^^^^^^^^^^^^^^^

Verily's GRCh38 reference genome is fully compatible with any b38 genome in the autosome.

Verily's GRCh38:

* excludes all patch sequences
* omits alternate haplotype chromosomes
* includes decoy sequences
* masks out duplicate copies of centromeric regions

The base assembly is `GRCh38_no_alt_plus_hs38d1 <ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz>`_. This assembly version was created specifically for analysis, with its rationale and exact genome modifications thoroughly documented in its `README <ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/README_analysis_sets.txt>`_ file.

Verily applied the following modifications to the base assembly:

* Reference segment names are prefixed with "chr".

+--------------------------------------------------------------+
| Many of the additional data files we use are provided |
| by GENCODE, which uses "chr" naming convention. |
+--------------------------------------------------------------+

* All 74 extended IUPAC codes are converted to the first matching alphabetical base pair as recommended in the VCF 4.3 specification.

* This release of the genome reference is named ``GRCh38_Verily_v1``
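
The two sequence-level modifications listed above (the "chr" prefix and the reduction of extended IUPAC codes) can be illustrated with a short Python sketch. This is not Verily's actual tooling: the function names are hypothetical, and three behaviours are assumptions made for the example, namely that ``N`` is left unchanged, that soft-masked (lowercase) bases keep their case, and that names already starting with "chr" are left alone.

```python
# Extended IUPAC nucleotide codes and the concrete bases each can stand for.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def resolve_base(base):
    """Reduce an ambiguity code to its alphabetically first concrete base.

    A, C, G, T and N pass through unchanged (keeping N is an assumption).
    Lowercase input stays lowercase (also an assumption).
    """
    upper = base.upper()
    if upper not in IUPAC_AMBIGUITY:
        return base
    resolved = min(IUPAC_AMBIGUITY[upper])
    return resolved if base.isupper() else resolved.lower()


def normalize_record(name, sequence):
    """Apply both modifications to one reference segment (hypothetical helper)."""
    new_name = name if name.startswith("chr") else "chr" + name
    return new_name, "".join(resolve_base(b) for b in sequence)


if __name__ == "__main__":
    print(normalize_record("1", "ACRYN"))
```

For example, ``R`` (A or G) becomes ``A`` and ``Y`` (C or T) becomes ``C``, so every base in the resulting reference is one of A, C, G, T, or N.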

hg19
^^^^
