Replace dataproc-initialization-actions with $MY_BUCKET in README (Go…

functicons authored Jul 25, 2019
1 parent 2f51e6a commit d2b8683
Showing 37 changed files with 78 additions and 87 deletions.
13 changes: 2 additions & 11 deletions README.md
@@ -10,16 +10,7 @@ Initialization actions are stored in a [Google Cloud Storage](https://cloud.goog
[--initialization-actions [GCS_URI,...]] \
[--initialization-action-timeout TIMEOUT]

For convenience, copies of initialization actions in this repository are stored in the publicly accessible Cloud Storage bucket `gs://dataproc-initialization-actions`. The folder structure of this Cloud Storage bucket mirrors this repository. You should be able to use this Cloud Storage bucket (and the initialization scripts within it) for your clusters.

For example:

```bash
gcloud dataproc clusters create my-presto-cluster \
--initialization-actions gs://dataproc-initialization-actions/presto/presto.sh
```

You are strongly encouraged to copy initialization actions to your own GCS bucket in automated pipelines to ensure hermetic deployments. For example:
Before creating clusters, you need to copy initialization actions to your own GCS bucket. For example:

```bash
MY_BUCKET=<gcs-bucket>
@@ -28,7 +19,7 @@ gcloud dataproc clusters create my-presto-cluster \
--initialization-actions gs://$MY_BUCKET/presto.sh
```

This is also useful if you want to modify initialization actions to fit your needs.
You can then decide when to sync your copy with any changes to the initialization action that occur in the GitHub repository. Keeping your own copy is also useful if you want to modify initialization actions to fit your needs.
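
One way to keep your copy current is simply to re-run the copy after reviewing upstream changes. A minimal sketch (the bucket name is a placeholder, and it assumes you run the command from a clone of this repository):

```bash
MY_BUCKET=<gcs-bucket>

# Copy the init action used in the example above into your own bucket;
# re-run this whenever you decide to pick up upstream changes.
gsutil cp presto/presto.sh gs://$MY_BUCKET/
```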

## Why these samples are provided

6 changes: 3 additions & 3 deletions beam/README.md
@@ -111,9 +111,9 @@ shown later).

```bash
CLUSTER_NAME="$1"
INIT_ACTIONS="gs://dataproc-initialization-actions/docker/docker.sh"
INIT_ACTIONS+=",gs://dataproc-initialization-actions/flink/flink.sh"
INIT_ACTIONS+=",gs://dataproc-initialization-actions/beam/beam.sh"
INIT_ACTIONS="gs://$MY_BUCKET/docker/docker.sh"
INIT_ACTIONS+=",gs://$MY_BUCKET/flink/flink.sh"
INIT_ACTIONS+=",gs://$MY_BUCKET/beam/beam.sh"
FLINK_SNAPSHOT="https://archive.apache.org/dist/flink/flink-1.5.3/flink-1.5.3-bin-hadoop28-scala_2.11.tgz"
METADATA="beam-job-service-snapshot=<...>"
METADATA+=",beam-image-enable-pull=true"
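# A sketch of how these variables might feed the create command (the exact
# flags are not reproduced here, and the flink-snapshot-url metadata key is
# an assumption rather than something beam.sh/flink.sh are known to read):
gcloud dataproc clusters create "${CLUSTER_NAME}" \
  --initialization-actions "${INIT_ACTIONS}" \
  --metadata "${METADATA},flink-snapshot-url=${FLINK_SNAPSHOT}"
```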
6 changes: 3 additions & 3 deletions bigdl/README.md
@@ -14,7 +14,7 @@ You can use this initialization to create a new Dataproc cluster with BigDL's Sp

```
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/bigdl/bigdl.sh \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
--initialization-action-timeout 10m
```

@@ -30,7 +30,7 @@ For example, for Dataproc 1.0 (Spark 1.6 and Scala 2.10) and BigDL v0.7.2:
```
gcloud dataproc clusters create <CLUSTER_NAME> \
--image-version 1.0 \
--initialization-actions gs://dataproc-initialization-actions/bigdl/bigdl.sh \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/bigdl/dist-spark-1.6.2-scala-2.10.5-all/0.7.2/dist-spark-1.6.2-scala-2.10.5-all-0.7.2-dist.zip'
```
@@ -40,7 +40,7 @@ Or, for example, to download Analytics Zoo 0.4.0 with BigDL v0.7.2 for Dataproc
```
gcloud dataproc clusters create <CLUSTER_NAME> \
--image-version 1.3 \
--initialization-actions gs://dataproc-initialization-actions/bigdl/bigdl.sh \
--initialization-actions gs://$MY_BUCKET/bigdl/bigdl.sh \
--initialization-action-timeout 10m \
--metadata 'bigdl-download-url=https://repo1.maven.org/maven2/com/intel/analytics/zoo/analytics-zoo-bigdl_0.7.2-spark_2.3.1/0.4.0/analytics-zoo-bigdl_0.7.2-spark_2.3.1-0.4.0-dist-all.zip'
```
4 changes: 2 additions & 2 deletions bigtable/README.MD
@@ -10,7 +10,7 @@ You can use this initialization action to create a Dataproc cluster configured t

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/bigtable/bigtable.sh \
--initialization-actions gs://$MY_BUCKET/bigtable/bigtable.sh \
--metadata bigtable-instance=<BIGTABLE INSTANCE>
```
1. The cluster will have HBase libraries, the Bigtable client, and the [Apache Spark - Apache HBase Connector](https://github.com/hortonworks-spark/shc) installed.
@@ -29,7 +29,7 @@ You can use this initialization action to create a Dataproc cluster configured t
1. Submit the jar with dependencies as a Dataproc job. Note that `OUTPUT_TABLE` should not already exist. This job will create the table with the correct column family.

```bash
gcloud dataproc jobs submit hadoop --cluster <CLUSTER_NAME> --class com.example.bigtable.sample.WordCountDriver --jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar -- wordcount-hbase gs://dataproc-initialization-actions/README.md <OUTPUT_TABLE>
gcloud dataproc jobs submit hadoop --cluster <CLUSTER_NAME> --class com.example.bigtable.sample.WordCountDriver --jars target/wordcount-mapreduce-0-SNAPSHOT-jar-with-dependencies.jar -- wordcount-hbase gs://$MY_BUCKET/README.md <OUTPUT_TABLE>
```
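
To spot-check the results, you can scan the output table from the HBase shell. A sketch, assuming you have SSHed into the cluster's master node (where the init action has pointed the HBase client at Bigtable):

```bash
echo "scan '<OUTPUT_TABLE>', {LIMIT => 10}" | hbase shell
```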

## Running an example Spark job on cluster using SHC
8 changes: 4 additions & 4 deletions cloud-sql-proxy/README.MD
@@ -42,7 +42,7 @@ shared hive metastore.
gcloud dataproc clusters create <CLUSTER_NAME> \
--region <REGION> \
--scopes sql-admin \
--initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
--initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \
--properties hive:hive.metastore.warehouse.dir=gs://<HIVE_DATA_BUCKET>/hive-warehouse \
--metadata "hive-metastore-instance=<PROJECT_ID>:<REGION>:<INSTANCE_NAME>"
```
@@ -69,7 +69,7 @@ shared hive metastore.
```bash
gcloud dataproc clusters create <ANOTHER_CLUSTER_NAME> \
--scopes sql-admin \
--initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
--initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \
--metadata "hive-metastore-instance=<PROJECT_ID>:<REGION>:<INSTANCE_NAME>"
```
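
A quick way to confirm that both clusters really share one metastore is to create a table from the first cluster and list it from the second. A sketch (the table name is illustrative):

```bash
gcloud dataproc jobs submit hive --cluster <CLUSTER_NAME> \
    -e "CREATE TABLE IF NOT EXISTS shared_metastore_check (id INT);"
gcloud dataproc jobs submit hive --cluster <ANOTHER_CLUSTER_NAME> \
    -e "SHOW TABLES;"
```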

@@ -105,7 +105,7 @@ write to Cloud SQL. Set the `enable-cloud-sql-hive-metastore` metadata key to
```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--scopes sql-admin \
--initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
--initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \
--metadata "enable-cloud-sql-hive-metastore=false" \
--metadata "additional-cloud-sql-instances=<PROJECT_ID>:<REGION>:<ANOTHER_INSTANCE_NAME>"
```
@@ -180,7 +180,7 @@ additional setup.
```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--scopes sql-admin \
--initialization-actions gs://dataproc-initialization-actions/cloud-sql-proxy/cloud-sql-proxy.sh \
--initialization-actions gs://$MY_BUCKET/cloud-sql-proxy/cloud-sql-proxy.sh \
--properties hive:hive.metastore.warehouse.dir=gs://<HIVE_DATA_BUCKET>/hive-warehouse \
--metadata "hive-metastore-instance=<PROJECT_ID>:<REGION>:<INSTANCE_NAME>" \
--metadata "use-cloud-sql-private-ip=true" \
12 changes: 6 additions & 6 deletions conda/README.MD
@@ -27,7 +27,7 @@ Please see the following tutorial for full details https://cloud.google.com/data

```
gcloud dataproc clusters create foo --initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh
gs://$MY_BUCKET/conda/bootstrap-conda.sh,gs://$MY_BUCKET/conda/install-conda-env.sh
```

### Install extra conda and/or pip packages
@@ -38,7 +38,7 @@ You can add extra packages by using the metadata entries `CONDA_PACKAGES` and `P
gcloud dataproc clusters create foo \
--metadata 'CONDA_PACKAGES="numpy pandas",PIP_PACKAGES=pandas-gbq' \
--initialization-actions \
gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh
gs://$MY_BUCKET/conda/bootstrap-conda.sh,gs://$MY_BUCKET/conda/install-conda-env.sh
```

Alternatively, you can use environment variables, e.g.:
@@ -53,8 +53,8 @@ Where `create-my-cluster.sh` specifies a list of conda and/or pip packages to in
```
#!/usr/bin/env bash
gsutil -m cp -r gs://dataproc-initialization-actions/conda/bootstrap-conda.sh .
gsutil -m cp -r gs://dataproc-initialization-actions/conda/install-conda-env.sh .
gsutil -m cp -r gs://$MY_BUCKET/conda/bootstrap-conda.sh .
gsutil -m cp -r gs://$MY_BUCKET/conda/install-conda-env.sh .
chmod 755 ./*conda*.sh
@@ -77,8 +77,8 @@ CONDA_ENV_YAML_GSC_LOC="gs://my-bucket/path/to/conda-environment.yml"
CONDA_ENV_YAML_PATH="/root/conda-environment.yml"
echo "Downloading conda environment at $CONDA_ENV_YAML_GSC_LOC to $CONDA_ENV_YAML_PATH ... "
gsutil -m cp -r $CONDA_ENV_YAML_GSC_LOC $CONDA_ENV_YAML_PATH
gsutil -m cp -r gs://dataproc-initialization-actions/conda/bootstrap-conda.sh .
gsutil -m cp -r gs://dataproc-initialization-actions/conda/install-conda-env.sh .
gsutil -m cp -r gs://$MY_BUCKET/conda/bootstrap-conda.sh .
gsutil -m cp -r gs://$MY_BUCKET/conda/install-conda-env.sh .
chmod 755 ./*conda*.sh
6 changes: 3 additions & 3 deletions connectors/README.md
@@ -12,7 +12,7 @@ Google Cloud Storage and BigQuery connector installed:

```
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh \
--initialization-actions gs://$MY_BUCKET/connectors/connectors.sh \
--metadata gcs-connector-version=1.9.16 \
--metadata bigquery-connector-version=0.13.16
```
@@ -34,14 +34,14 @@ For example:
will be updated to 0.11.0 version:
```
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh \
--initialization-actions gs://$MY_BUCKET/connectors/connectors.sh \
--metadata gcs-connector-version=1.7.0
```
* if Google Cloud Storage connector version 1.8.0 is specified and no BigQuery connector version is
specified, then only the Google Cloud Storage connector will be updated to version 1.8.0 and the
BigQuery connector will be left intact:
```
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/connectors/connectors.sh \
--initialization-actions gs://$MY_BUCKET/connectors/connectors.sh \
--metadata gcs-connector-version=1.8.0
```
4 changes: 2 additions & 2 deletions datalab/README.md
@@ -12,7 +12,7 @@ Dataproc cluster. You will need to connect to Datalab using an SSH tunnel.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/datalab/datalab.sh \
--initialization-actions gs://$MY_BUCKET/datalab/datalab.sh \
--scopes cloud-platform
```
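
One way to reach Datalab once the cluster is up is a plain SSH port forward. A sketch, assuming Datalab listens on port 8080 of the master node (substitute your zone):

```bash
gcloud compute ssh <CLUSTER_NAME>-m --zone <ZONE> -- -N -L 8080:localhost:8080
# then browse to http://localhost:8080
```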

@@ -43,7 +43,7 @@ how to set up Python 3.5 on workers:
gcloud dataproc clusters create <CLUSTER_NAME> \
--metadata 'CONDA_PACKAGES="python==3.5"' \
--scopes cloud-platform \
--initialization-actions gs://dataproc-initialization-actions/conda/bootstrap-conda.sh,gs://dataproc-initialization-actions/conda/install-conda-env.sh,gs://dataproc-initialization-actions/datalab/datalab.sh
--initialization-actions gs://$MY_BUCKET/conda/bootstrap-conda.sh,gs://$MY_BUCKET/conda/install-conda-env.sh,gs://$MY_BUCKET/datalab/datalab.sh
```

In effect, this means that a particular Datalab-on-Dataproc cluster can only run
2 changes: 1 addition & 1 deletion docker/README.md
@@ -14,7 +14,7 @@ applications can access Docker.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/docker/docker.sh
--initialization-actions gs://$MY_BUCKET/docker/docker.sh
```

1. Docker is installed and configured on all nodes of the cluster (both master
8 changes: 4 additions & 4 deletions drill/README.md
@@ -12,24 +12,24 @@ Check the variables set in the script to ensure they're to your liking.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/zookeeper/zookeeper.sh,gs://dataproc-initialization-actions/drill/drill.sh
--initialization-actions gs://$MY_BUCKET/zookeeper/zookeeper.sh,gs://$MY_BUCKET/drill/drill.sh
```

High availability cluster (Zookeeper comes pre-installed)

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--num-masters 3 \
--initialization-actions gs://dataproc-initialization-actions/drill/drill.sh
--initialization-actions gs://$MY_BUCKET/drill/drill.sh
```

Single node cluster (Zookeeper is unnecessary)

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--single-node \
--initialization-actions gs://dataproc-initialization-actions/drill/drill.sh
--initialization-actions gs://$MY_BUCKET/drill/drill.sh
```

1. Once the cluster has been created, Drillbits will start on all nodes. You can log into any node of the cluster to run Drill queries. Drill is installed in `/usr/lib/drill` (unless you change the setting) which contains a `bin` directory with `sqlline`.
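
For example, after SSHing into a node you might start the Drill shell through ZooKeeper. A sketch (the install path matches the default above; `<ZOOKEEPER_HOST>` and port 2181 are assumptions about your ZooKeeper quorum):

```bash
/usr/lib/drill/bin/sqlline -u "jdbc:drill:zk=<ZOOKEEPER_HOST>:2181"
# at the sqlline prompt, try a smoke-test query such as:
#   SELECT * FROM sys.version;
```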
2 changes: 1 addition & 1 deletion flink/README.md
@@ -9,7 +9,7 @@ Flink and start a Flink session running on YARN.

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/flink/flink.sh
--initialization-actions gs://$MY_BUCKET/flink/flink.sh
```

1. You can log into the master node of the cluster to submit jobs to Flink. Flink is installed in `/usr/lib/flink` (unless you change the setting) which contains a `bin` directory with Flink. **Note** - you need to specify `HADOOP_CONF_DIR=/etc/hadoop/conf` before your Flink commands for them to execute properly.
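
For instance, a smoke test might run the bundled WordCount example on YARN. A sketch (the examples path is an assumption about the Flink distribution the init action lays down):

```bash
HADOOP_CONF_DIR=/etc/hadoop/conf /usr/lib/flink/bin/flink run \
    -m yarn-cluster /usr/lib/flink/examples/batch/WordCount.jar
```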
2 changes: 1 addition & 1 deletion ganglia/README.MD
@@ -8,7 +8,7 @@ This initialization action installs [Ganglia](http://ganglia.info/), a scalable

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/ganglia/ganglia.sh
--initialization-actions gs://$MY_BUCKET/ganglia/ganglia.sh
```

1. Once the cluster has been created, Ganglia is served on port `80` on the master node at `/ganglia`. To connect to the Ganglia web interface, you will need to create an SSH tunnel and use a SOCKS 5 Proxy with your web browser as described in the [dataproc web interfaces](https://cloud.google.com/dataproc/cluster-web-interfaces) documentation. In the opened web browser, go to `http://CLUSTER_NAME-m/ganglia` on Standard/Single Node clusters, or `http://CLUSTER_NAME-m-0/ganglia` on High Availability clusters.
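
For reference, the SOCKS approach from that documentation looks roughly like the following sketch (port 1080 and the Chrome binary path are arbitrary local choices):

```bash
# Open a SOCKS tunnel to the master node.
gcloud compute ssh <CLUSTER_NAME>-m --zone <ZONE> -- -D 1080 -N

# In another terminal, start a browser that routes traffic through the tunnel.
/usr/bin/google-chrome --proxy-server="socks5://localhost:1080" \
    --user-data-dir=/tmp/<CLUSTER_NAME>-proxy http://<CLUSTER_NAME>-m/ganglia
```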
2 changes: 1 addition & 1 deletion gobblin/README.md
@@ -14,7 +14,7 @@ You can use this initialization action to create a new Dataproc cluster with Gob

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/gobblin/gobblin.sh
--initialization-actions gs://$MY_BUCKET/gobblin/gobblin.sh
```

1. Submit jobs
4 changes: 2 additions & 2 deletions gpu/README.md
@@ -18,7 +18,7 @@ GPU driver please visit NVIDIA [site](https://www.nvidia.com/Download/index.aspx
gcloud beta dataproc clusters create <CLUSTER_NAME> \
--master-accelerator type=nvidia-tesla-v100 \
--worker-accelerator type=nvidia-tesla-v100,count=4 \
--initialization-actions gs://dataproc-initialization-actions/gpu/install_gpu_driver.sh \
--initialization-actions gs://$MY_BUCKET/gpu/install_gpu_driver.sh \
--metadata install_gpu_agent=false
```

@@ -28,7 +28,7 @@ GPU driver please visit NVIDIA [site](https://www.nvidia.com/Download/index.aspx
gcloud beta dataproc clusters create <CLUSTER_NAME> \
--master-accelerator type=nvidia-tesla-v100 \
--worker-accelerator type=nvidia-tesla-v100,count=4 \
--initialization-actions gs://dataproc-initialization-actions/gpu/install_gpu_driver.sh \
--initialization-actions gs://$MY_BUCKET/gpu/install_gpu_driver.sh \
--metadata install_gpu_agent=true \
--scopes https://www.googleapis.com/auth/monitoring.write
```
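
After the cluster comes up, a quick sanity check is to confirm the driver is visible on a node with GPUs attached, for example:

```bash
gcloud compute ssh <CLUSTER_NAME>-m --zone <ZONE> --command "nvidia-smi"
```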
8 changes: 4 additions & 4 deletions hbase/README.md
@@ -8,7 +8,7 @@ You can use this initialization action to create a new Dataproc cluster with Apa

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/hbase/hbase.sh \
--initialization-actions gs://$MY_BUCKET/hbase/hbase.sh \
--num-masters 3 --num-workers 2
```
1. You can validate your deployment by SSHing into any node and running:
@@ -36,7 +36,7 @@ On dataproc clusters HBase uses HDFS as storage backend by default. This mode ca

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/hbase/hbase.sh \
--initialization-actions gs://$MY_BUCKET/hbase/hbase.sh \
--metadata 'hbase-root-dir=gs://<BUCKET_NAME>/' \
--num-masters 3 --num-workers 2
```
@@ -49,7 +49,7 @@ changes the necessary configurations and creates all keytabs necessary for HBase

```bash
gcloud beta dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/hbase/hbase.sh \
--initialization-actions gs://$MY_BUCKET/hbase/hbase.sh \
--metadata 'enable-kerberos=true,keytab-bucket=gs://<BUCKET_NAME>' \
--num-masters 3 --num-workers 2 \
--kerberos-root-principal-password-uri="Cloud Storage URI of KMS-encrypted password for Kerberos root principal" \
@@ -76,6 +76,6 @@ changes the necessary configurations and creates all keytabs necessary for HBase
- In HA clusters, HBase uses the Zookeeper that is pre-installed on the master nodes.
- In standard and single-node clusters, you must install and configure Zookeeper yourself, which can be done with the zookeeper init action. Pass the additional init action when creating a standard HBase cluster:
```bash
--initialization-actions gs://dataproc-initialization-actions/zookeeper/zookeeper.sh,gs://dataproc-initialization-actions/hbase/hbase.sh
--initialization-actions gs://$MY_BUCKET/zookeeper/zookeeper.sh,gs://$MY_BUCKET/hbase/hbase.sh
```
- The Kerberos version of this initialization action should be used in HA mode. Otherwise, additional Zookeeper configuration is necessary.
2 changes: 1 addition & 1 deletion hive-hcatalog/README.md
@@ -10,7 +10,7 @@ You can use this initialization action to create a new Cloud Dataproc cluster wi

```bash
gcloud dataproc clusters create <CLUSTER-NAME> \
--initialization-actions gs://dataproc-initialization-actions/hive-hcatalog/hive-hcatalog.sh
--initialization-actions gs://$MY_BUCKET/hive-hcatalog/hive-hcatalog.sh
```

1. Once the cluster has been created, HCatalog should be installed and configured for use with Pig.
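
A minimal end-to-end check could create a table with Hive and read it back from Pig. A sketch (the table name is illustrative, and reading it via `HCatLoader` assumes the init action has wired Pig up to HCatalog):

```bash
gcloud dataproc jobs submit hive --cluster <CLUSTER-NAME> \
    -e "CREATE TABLE IF NOT EXISTS hcat_smoke (id INT, name STRING);"
gcloud dataproc jobs submit pig --cluster <CLUSTER-NAME> \
    -e "data = LOAD 'hcat_smoke' USING org.apache.hive.hcatalog.pig.HCatLoader(); DUMP data;"
```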
2 changes: 1 addition & 1 deletion hue/README.md
@@ -11,7 +11,7 @@ You can use this initialization action to create a new Dataproc cluster with Hue

```bash
gcloud dataproc clusters create <CLUSTER_NAME> \
--initialization-actions gs://dataproc-initialization-actions/hue/hue.sh
--initialization-actions gs://$MY_BUCKET/hue/hue.sh
```

1. Once the cluster has been created, Hue is configured to run on port `8888` on the master node in a Dataproc cluster. To connect to the Hue web interface, you will need to create an SSH tunnel and use a SOCKS 5 Proxy with your web browser as described in the [dataproc web interfaces](https://cloud.google.com/dataproc/cluster-web-interfaces) documentation. In the opened web browser, go to `localhost:8888` and you should see the Hue UI.