Name		Name	Last commit message	Last commit date
parent directory ..
data-demo		data-demo
README.md		README.md
backend.tf.sample		backend.tf.sample
bigquery.tftpl		bigquery.tftpl
dataflow.tftpl		dataflow.tftpl
datastorage.tf		datastorage.tf
diagram.png		diagram.png
kms.tf		kms.tf
main.tf		main.tf
outputs.tf		outputs.tf
serviceaccounts.tf		serviceaccounts.tf
terraform.tfvars.sample		terraform.tfvars.sample
variables.tf		variables.tf
versions.tf		versions.tf
vpc.tf		vpc.tf

README.md

Cloud Storage to Bigquery with Cloud Dataflow with least privileges

This example creates the infrastructure needed to run a Cloud Dataflow pipeline to import data from GCS to Bigquery. The example will create different service accounts with least privileges on resources. To run the pipeline, users listed in data_eng_principals can impersonate all those service accounts.

The solution will use:

internal IPs for GCE and Cloud Dataflow instances
Cloud NAT to let resources egress to the Internet, to run system updates and install packages
rely on Service Account Impersonation to avoid the use of service account keys
Service Accounts with least privilege on each resource
(Optional) CMEK encription for GCS bucket, DataFlow instances and BigQuery tables

The example is designed to match real-world use cases with a minimum amount of resources and some compromises listed below. It can be used as a starting point for more complex scenarios.

This is the high level diagram:

Move to real use case consideration

In the example we implemented some compromise to keep the example minimal and easy to read. On a real word use case, you may evaluate the option to:

Configure a Shared-VPC
Use only Identity Groups to assigne roles
Use Authorative IAM role assignement
Split resources in different project: Data Landing, Data Transformation, Data Lake, ...
Use VPC-SC to mitigate data exfiltration

Managed resources and services

This sample creates several distinct groups of resources:

projects
- Service Project configured for GCS buckets, Dataflow instances and BigQuery tables and orchestration
networking
- VPC network
- One subnet
- Firewall rules for SSH access via IAP and open communication within the VPC
IAM
- One service account for uploading data into the GCS landing bucket
- One service account for Orchestration
- One service account for Dataflow instances
- One service account for Bigquery tables
GCS
- One bucket
BQ
- One dataset
- One table. Tables are defined in Terraform for the porpuse of the example. Probably, in real scenario, would handle Tables creation in a separate Terraform State or using a different tool/pipeline (for example: Dataform).

In this example you can also configure users or group of user to assign them viewer role on the resources created and the ability to imprsonate service accounts to test dataflow pipelines before autometing them with Composer or any other orchestration systems.

Deploy your enviroment

We assume the identiy running the following steps has the following role:

resourcemanager.projectCreator in case a new project will be created.
owner on the project in case you use an existing project.

Run Terraform init:

$ terraform init

Configure the Terraform variable in your terraform.tfvars file. You need to spefify at least the following variables:

data_eng_principals = ["user:[email protected]"]
project_id      = "datalake-001"
prefix          = "prefix"

You can run now:

$ terraform apply

You should see the output of the Terraform script with resources created and some command pre-created for you to run the example following steps below.

Test your environment with Cloud Dataflow

We assume all those steps are run using a user listed on data_eng_principals. You can authenticate as the user using the following command:

$ gcloud init
$ gcloud auth application-default login

For the purpose of the example we will import from GCS to Bigquery a CSV file with the following structure:

name,surname,timestam

We need to create 3 file:

A person.csv file containing your data in the form name,surname,timestam. Here an example line `Lorenzo,Caggioni,1637771951'.
A person_udf.js containing the UDF javascript file used by the Dataflow template.
A person_schema.json file containing the table schema used to import the CSV.

You can find an example of those file in the folder ./data-demo. You can copy the example files in the GCS bucket using the command returned in the terraform output as command_01_gcs. Below an example:

gsutil -i [email protected] cp data-demo/* gs://LANDING_BUCKET

We can now run the Dataflow pipeline using the gcloud returned in the terraform output as command_02_dataflow. Below an example:

gcloud --impersonate-service-account=orch-test@PROJECT.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --project PROJECT \
    --region REGION \
    --disable-public-ips \
    --subnetwork https://www.googleapis.com/compute/v1/projects/PROJECT/regions/REGION/subnetworks/subnet \
    --staging-location gs://PREFIX-df-tmp \
    --service-account-email [email protected] \
    --parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://PREFIX-data/person_schema.json,\
javascriptTextTransformGcsPath=gs://PREFIX-data/person_udf.js,\
inputFilePattern=gs://PREFIX-data/person.csv,\
outputTable=PROJECT:datalake.person,\
bigQueryLoadingTemporaryDirectory=gs://PREFIX-df-tmp

You can check data imported into Google BigQuery using the command returned in the terraform output as command_03_bq. Below an example:

bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1000'

Variables

name	description	type	required	default
prefix	Unique prefix used for resource names. Not used for project if 'project_create' is null.	`string`	✓
project_id	Project id, references existing project if `project_create` is null.	`string`	✓
cmek_encryption	Flag to enable CMEK on GCP resources created.	`bool`		`false`
data_eng_principals	Groups with Service Account Token creator role on service accounts in IAM format, eg 'group:[email protected]'.	`list(string)`		`[]`
project_create	Provide values if project creation is needed, uses existing project if null. Parent is in 'folders/nnn' or 'organizations/nnn' format.	`object({…})`		`null`
region	The region where resources will be deployed.	`string`		`"europe-west1"`
vpc_subnet_range	Ip range used for the VPC subnet created for the example.	`string`		`"10.0.0.0/20"`

Outputs

name	description	sensitive
bq_tables	Bigquery Tables.
buckets	GCS bucket Cloud KMS crypto keys.
command_01_gcs	gcloud command to copy data into the created bucket impersonating the service account.
command_02_dataflow	Command to run Dataflow template impersonating the service account.
command_03_bq	BigQuery command to query imported data.
project_id	Project id.
service_accounts	Service account.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gcs-to-bq-with-least-privileges

gcs-to-bq-with-least-privileges

README.md

Cloud Storage to Bigquery with Cloud Dataflow with least privileges

Move to real use case consideration

Managed resources and services

Deploy your enviroment

Test your environment with Cloud Dataflow

Variables

Outputs

Files

gcs-to-bq-with-least-privileges

Directory actions

More options

Directory actions

More options

Latest commit

History

gcs-to-bq-with-least-privileges

Folders and files

parent directory

README.md

Cloud Storage to Bigquery with Cloud Dataflow with least privileges

Move to real use case consideration

Managed resources and services

Deploy your enviroment

Test your environment with Cloud Dataflow

Variables

Outputs