diff --git a/docs/source/conf.py b/docs/source/conf.py
index a6eb119..b7b59b6 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -186,6 +186,7 @@
 .. _VariantSet: https://cloud.google.com/genomics/reference/rest/v1/variantsets
 .. _Load Genomic Variants: https://cloud.google.com/genomics/v1/load-variants
 .. _Understanding the BigQuery Variants Table Schema: https://cloud.google.com/genomics/v1/bigquery-variants-schema
+.. _Verily DeepVariant: https://cloud.google.com/genomics/v1alpha2/deepvariant
 .. _Using Google Cloud Storage with Big Data: https://cloud.google.com/storage/docs/working-with-big-data
 .. _gsutil: https://cloud.google.com/storage/docs/gsutil
@@ -252,7 +253,7 @@
 .. GLOBAL SUBSTITUTIONS CAN GO HERE
-.. |sparkADC| replace:: If the `Application Default Credentials`_ are not sufficient, use ``--secretsFile=PATH/TO/YOUR/client_secrets.json``. If you do not already have this file, see the `authentication instructions`_ to obtain it.
+.. |sparkADC| replace:: If the `Application Default Credentials`_ are not sufficient, use ``--client-secrets=PATH/TO/YOUR/client_secrets.json``. If you do not already have this file, see the `authentication instructions`_ to obtain it.
 .. |dataflowADC| replace:: If the `Application Default Credentials`_ are not sufficient, use ``--client-secrets PATH/TO/YOUR/client_secrets.json``. If you do not already have this file, see the `authentication instructions`_ to obtain it.
 .. |dataflowSomeRefs| replace:: Use a comma-separated list to run over multiple disjoint regions. For example to run over `BRCA1`_ and `BRCA2`_ ``--references=chr13:32889610:32973808,chr17:41196311:41277499``.
 .. |dataflowAllRefs| replace:: To run this pipeline over the entire genome, use ``--allReferences`` instead of ``--references=chr17:41196311:41277499``.
diff --git a/docs/source/includes/spark_setup.rst b/docs/source/includes/spark_setup.rst
index a37078b..2c292c2 100644
--- a/docs/source/includes/spark_setup.rst
+++ b/docs/source/includes/spark_setup.rst
@@ -43,6 +43,6 @@
  cd spark-examples
  sbt assembly
- cp target/scala-2.10/googlegenomics-spark-examples-assembly-*.jar ~/
+ cp target/scala-2.*/googlegenomics-spark-examples-assembly-*.jar ~/
  cd ~/
diff --git a/docs/source/use_cases/discover_public_data/genomic_data_toc.rst b/docs/source/use_cases/discover_public_data/genomic_data_toc.rst
index 658f5bb..b02f26a 100644
--- a/docs/source/use_cases/discover_public_data/genomic_data_toc.rst
+++ b/docs/source/use_cases/discover_public_data/genomic_data_toc.rst
@@ -23,6 +23,7 @@ __ RenderedVersion_
  1000_genomes
  platinum_genomes
  platinum_genomes_deepvariant
+ precision_fda
  reference_genomes
  mssng_data
  isb_cgc_data
diff --git a/docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst b/docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst
index 85f8c81..49dac99 100644
--- a/docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst
+++ b/docs/source/use_cases/discover_public_data/platinum_genomes_deepvariant.rst
@@ -11,13 +11,13 @@ Platinum Genomes DeepVariant
  | **If you are reading this on github, you should instead click** `here`__.         |
  +-----------------------------------------------------------------------------------+
-.. _RenderedVersion: http://googlegenomics.readthedocs.org/en/latest/use_cases/discover_public_data/platinum_genomes.html
+.. _RenderedVersion: http://googlegenomics.readthedocs.org/en/latest/use_cases/discover_public_data/platinum_genomes_deepvariant.html
 __ RenderedVersion_
 .. comment: end: goto-read-the-docs
-This dataset comprises the `6 member CEPH pedigree 1463 `_ called using the DeepVariant toolchain and reference genome GRCh38. See the `DeepVariant preprint `_ for full details:
+This dataset comprises the `6 member CEPH pedigree 1463 `_ called using the alpha version of the `Verily DeepVariant`_ toolchain aligned to :ref:`vgrch38` reference genome. See the `DeepVariant preprint `_ for full details:
 | `Creating a universal SNP and small indel variant caller with deep neural networks `_
 | Ryan Poplin, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Cory Y. McLean, Mark A. DePristo
diff --git a/docs/source/use_cases/discover_public_data/precision_fda.rst b/docs/source/use_cases/discover_public_data/precision_fda.rst
new file mode 100644
index 0000000..d49c31d
--- /dev/null
+++ b/docs/source/use_cases/discover_public_data/precision_fda.rst
@@ -0,0 +1,38 @@
+PrecisionFDA Truth Challenge
+============================
+
+.. comment: begin: goto-read-the-docs
+
+.. container:: visible-only-on-github
+
+ +-----------------------------------------------------------------------------------+
+ | **The properly rendered version of this document can be found at Read The Docs.** |
+ |                                                                                   |
+ | **If you are reading this on github, you should instead click** `here`__.         |
+ +-----------------------------------------------------------------------------------+
+
+.. _RenderedVersion: http://googlegenomics.readthedocs.org/en/latest/use_cases/discover_public_data/precision_fda.html
+
+__ RenderedVersion_
+
+.. comment: end: goto-read-the-docs
+
+This dataset includes both:
+
+* the input for the `PrecisionFDA Truth Challenge `_ comprised of whole-genome sequences for HG001 (NA12878) and HG002 (NA24385)
+* the output from the alpha version of the `Verily DeepVariant`_ toolchain aligned to :ref:`vgrch38` reference genome. See the `DeepVariant preprint `_ for full details:
+
+ | `Creating a universal SNP and small indel variant caller with deep neural networks `_
+ | Ryan Poplin, Dan Newburger, Jojo Dijamco, Nam Nguyen, Dion Loy, Sam Gross, Cory Y. McLean, Mark A. DePristo
+ | DOI: https://doi.org/10.1101/092890
+ |
+
+Google Cloud Platform data locations
+------------------------------------
+
+* Google Cloud Storage folder `gs://genomics-public-data/precision-fda `_
+
+Provenance
+----------
+
+* The FASTQ files in `gs://genomics-public-data/precision-fda/input `_ were run through the `Verily DeepVariant`_ alpha toolchain to produce the corresponding files in `gs://genomics-public-data/precision-fda/output/deepvariant-alpha `_.
diff --git a/docs/source/use_cases/discover_public_data/reference_genomes.rst b/docs/source/use_cases/discover_public_data/reference_genomes.rst
index d0fc6cd..8b2f894 100644
--- a/docs/source/use_cases/discover_public_data/reference_genomes.rst
+++ b/docs/source/use_cases/discover_public_data/reference_genomes.rst
@@ -58,6 +58,36 @@ Genome Reference Consortium Human Build 38 includes data from 39 gzipped fasta f
 More information on this source data can be found in this `NCBI article `__ and in the `FTP README `__.
+
+.. _vgrch38:
+
+Verily's GRCh38
+^^^^^^^^^^^^^^^
+
+Verily's GRCh38 reference genome is fully compatible with any b38 genome in the autosome.
+
+Verily's GRCh38:
+
+* excludes all patch sequences
+* omits alternate haplotype chromosomes
+* includes decoy sequences
+* masks out duplicate copies of centromeric regions
+
+The base assembly is `GRCh38_no_alt_plus_hs38d1 `_. This assembly version was created specifically for analysis, with its rationale and exact genome modifications thoroughly documented in its `README `_ file.
+
+Verily applied the following modifications to the base assembly:
+
+* Reference segment names are prefixed with "chr".
+
+ +--------------------------------------------------------------+
+ | Many of the additional data files we use are provided        |
+ | by GENCODE, which uses "chr" naming convention.              |
+ +--------------------------------------------------------------+
+
+* All 74 extended IUPAC codes are converted to the first matching alphabetical base pair as recommended in the VCF 4.3 specification.
+
+* This release of the genome reference is named ``GRCh38_Verily_v1``
+
 hg19
 ^^^^