unmanaged-instances-healthcheck

TCP healthcheck and restart for unmanaged GCE instances

This blueprint shows how to use Serverless VPC Access and Cloud Functions to run a high-performance TCP healthcheck for unmanaged GCE instances. The healthchecker Cloud Function uses goroutines to check multiple instances in parallel and can check up to 1,000 VMs in less than a second of execution time.
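As a rough illustration of the parallel checking described above, the following minimal sketch runs TCP probes in goroutines and bounds concurrency with a channel used as a semaphore. It is not the blueprint's actual function code: the names probeTCP and checkAll, and the addresses in main, are assumptions made for this example.

```go
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

// probeTCP reports whether a TCP connection to addr succeeds within timeout.
func probeTCP(addr string, timeout time.Duration) bool {
	conn, err := net.DialTimeout("tcp", addr, timeout)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}

// checkAll probes every address concurrently, running at most maxParallel
// probes at a time, and returns the result per address.
// The probe is injected so the fan-out logic can be exercised without a network.
func checkAll(addrs []string, maxParallel int, probe func(string) bool) map[string]bool {
	if maxParallel < 1 {
		maxParallel = 1
	}
	results := make(map[string]bool, len(addrs))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, maxParallel) // counting semaphore

	for _, addr := range addrs {
		wg.Add(1)
		sem <- struct{}{} // blocks while maxParallel probes are in flight
		go func(a string) {
			defer wg.Done()
			defer func() { <-sem }()
			ok := probe(a)
			mu.Lock()
			results[a] = ok
			mu.Unlock()
		}(addr)
	}
	wg.Wait()
	return results
}

func main() {
	// Hypothetical private addresses, for illustration only.
	addrs := []string{"10.0.0.2:80", "10.0.0.3:80"}
	probe := func(a string) bool { return probeTCP(a, time.Second) }
	for addr, healthy := range checkAll(addrs, 100, probe) {
		fmt.Printf("%s healthy=%v\n", addr, healthy)
	}
}
```

Acquiring the semaphore before starting each goroutine caps both the number of goroutines and the number of open TCP connections.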

NOTE: Managed Instance Groups have autohealing functionality out of the box. This blueprint is more applicable to standalone VMs or VMs in an unmanaged instance group.

The blueprint contains the following components:

  • Cloud Scheduler to initiate a healthcheck on a schedule.
  • Serverless VPC Connector to give Cloud Functions TCP-level access to private GCE instances.
  • Healthchecker Cloud Function to perform TCP checks against GCE instances.
  • Restarter PubSub topic to keep track of instances that are to be restarted.
  • Restarter Cloud Function to reset GCE instances that are failing the TCP healthcheck.
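The handoff between the healthchecker and the restarter through the PubSub topic can be sketched as a small JSON payload that one side encodes and the other decodes. This is only an assumption-based illustration: the RestartTask fields are hypothetical, the blueprint's real message schema may differ, and the actual instance reset call is replaced by an injected function.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RestartTask is a hypothetical payload identifying the instance to reset.
// The blueprint's real message schema may differ.
type RestartTask struct {
	Project  string `json:"project"`
	Zone     string `json:"zone"`
	Instance string `json:"instance"`
}

// encodeTask serializes a task into the bytes that would be published.
func encodeTask(t RestartTask) ([]byte, error) {
	return json.Marshal(t)
}

// handleMessage decodes a received payload, validates it, and passes it to
// reset, which stands in for the call that resets the GCE instance.
func handleMessage(data []byte, reset func(RestartTask) error) error {
	var t RestartTask
	if err := json.Unmarshal(data, &t); err != nil {
		return fmt.Errorf("decode restart task: %w", err)
	}
	if t.Project == "" || t.Zone == "" || t.Instance == "" {
		return fmt.Errorf("incomplete restart task: %+v", t)
	}
	return reset(t)
}

func main() {
	data, err := encodeTask(RestartTask{
		Project:  "my-project-id",
		Zone:     "europe-west1-b",
		Instance: "nginx-test",
	})
	if err != nil {
		panic(err)
	}
	err = handleMessage(data, func(t RestartTask) error {
		fmt.Printf("would reset %s in %s\n", t.Instance, t.Zone)
		return nil
	})
	if err != nil {
		panic(err)
	}
}
```

Keeping the reset behind a function parameter also shows why the restarter is replaceable: any subscriber that understands the payload can act on it.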

The resources created in this blueprint are shown in the high-level diagram below:

Healthchecker configuration

The healthchecker Cloud Function has the following configuration options:

  • FILTER to filter the list of GCE instances the health check targets, for instance (name = nginx-*) AND (labels.env = dev).
  • GRACE_PERIOD time period during which a newly created instance is not checked, allowing services on the instance to start.
  • MAX_PARALLELISM maximum number of health checks performed in parallel. Be aware that every check requires an open TCP connection, and the number of open connections is limited.
  • PUBSUB_TOPIC topic to which the message with instance metadata is published.
  • RECHECK_INTERVAL time period to wait before a recheck. When a check fails, the instance is rechecked before being marked as unhealthy.
  • TCP_PORT port used for health checking.
  • TIMEOUT timeout of a TCP probe.
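To make the interplay of GRACE_PERIOD, RECHECK_INTERVAL and TIMEOUT concrete, here is a minimal sketch of how such duration options might be parsed and applied. The function names are hypothetical, and the "10s" recheck interval is an assumed value for illustration; "180s" and "1000ms" are the defaults from the Variables section. This is not the blueprint's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// Config holds the duration options; the field names are illustrative.
type Config struct {
	GracePeriod     time.Duration
	RecheckInterval time.Duration
	Timeout         time.Duration
}

// parseConfig reads duration strings such as "180s" or "1000ms" through a
// lookup function (os.Getenv in a real deployment).
func parseConfig(getenv func(string) string) (Config, error) {
	var cfg Config
	fields := []struct {
		name string
		dst  *time.Duration
	}{
		{"GRACE_PERIOD", &cfg.GracePeriod},
		{"RECHECK_INTERVAL", &cfg.RecheckInterval},
		{"TIMEOUT", &cfg.Timeout},
	}
	for _, f := range fields {
		d, err := time.ParseDuration(getenv(f.name))
		if err != nil {
			return Config{}, fmt.Errorf("%s: %w", f.name, err)
		}
		*f.dst = d
	}
	return cfg, nil
}

// inGracePeriod reports whether an instance of the given age is still too
// new to be checked.
func inGracePeriod(age, grace time.Duration) bool {
	return age < grace
}

// confirmUnhealthy probes once and, on failure, waits and probes again.
// Only two consecutive failures mark the instance as unhealthy.
func confirmUnhealthy(probe func() bool, wait func()) bool {
	if probe() {
		return false
	}
	wait()
	return !probe()
}

func main() {
	env := map[string]string{
		"GRACE_PERIOD":     "180s",
		"RECHECK_INTERVAL": "10s", // assumed value, not taken from the blueprint
		"TIMEOUT":          "1000ms",
	}
	cfg, err := parseConfig(func(k string) string { return env[k] })
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.GracePeriod, cfg.RecheckInterval, cfg.Timeout)

	alwaysDown := func() bool { return false } // stand-in for a failing TCP probe
	// A real caller would sleep for cfg.RecheckInterval between probes.
	fmt.Println("unhealthy:", confirmUnhealthy(alwaysDown, func() {}))
}
```

Rechecking before publishing a restart message avoids resetting an instance over a single dropped connection.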

NOTE: In the current example the healthchecker is used along with the restarter Cloud Function, but the restarter can be replaced with another function, such as Pubsub2Inbox for email notifications.

Running the blueprint

Clone this repository or open it in Cloud Shell, then go through the following steps to create resources:

  • terraform init
  • terraform apply -var project_id=my-project-id

Once done testing, you can clean up resources by running terraform destroy. To persist state, check out the backend.tf.sample file.

Testing the blueprint

Configure gcloud with the project used for the deployment

gcloud config set project <MY-PROJECT-ID>

Wait until Cloud Scheduler executes the healthchecker function

gcloud scheduler jobs describe healthchecker-schedule

Check the healthchecker function logs to ensure the instance is checked and healthy

gcloud functions logs read cf-healthchecker --region=europe-west1

#cf-healthchecker  ywn0mojbmgnw  2022-03-15 21:40:01.446  Function execution took 419 ms, finished with status code: 200
#cf-healthchecker  ywn0mojbmgnw  2022-03-15 21:40:01.442  1  instances found to be health checked.
#cf-healthchecker  ywn0mojbmgnw  2022-03-15 21:40:01.028  Function execution started

Stop the nginx service on the test instance

gcloud compute ssh --zone europe-west1-b  nginx-test -- 'sudo systemctl stop nginx'

Wait a few minutes to allow the scheduler to execute another healthcheck, then examine the function logs

gcloud functions logs read cf-healthchecker --region=europe-west1

#cf-healthchecker  ywn0bmojtrji  2022-03-15 21:59:21.202  Instance restart task has been sent for instance nginx-test
#cf-healthchecker  ywn0bmojtrji  2022-03-15 21:59:21.201  Restart message published with id=4211063168407327
#cf-healthchecker  ywn0bmojtrji  2022-03-15 21:59:20.919  Healthcheck failed for instance nginx-test
#cf-healthchecker  ywn0bmojtrji  2022-03-15 21:59:10.914  Instance nginx-test is not responding, will recheck.
#cf-healthchecker  ywn0bmojtrji  2022-03-15 21:59:10.910  1  instances found to be health checked.
#cf-healthchecker  ywn0bmojtrji  2022-03-15 21:59:10.522  Function execution started

Examine the cf-restarter function logs

gcloud functions logs read cf-restarter --region=europe-west1

#cf-restarter  yj6qiott5c4p  2022-03-15 21:59:24.625  Function execution took 975 ms, finished with status: 'ok'
#cf-restarter  yj6qiott5c4p  2022-03-15 21:59:24.623  Instance nginx-test has been reset.
#cf-restarter  yj6qiott5c4p  2022-03-15 21:59:23.653  Function execution started

Verify that the nginx service is running again and that uptime shows the instance has been reset

gcloud compute ssh --zone europe-west1-b  nginx-test -- 'sudo systemctl status nginx'
gcloud compute ssh --zone europe-west1-b  nginx-test -- 'uptime'

Variables

| name | description | type | required | default |
|---|---|---|---|---|
| billing_account | Billing account id used as default for new projects. | string | ✓ | |
| project_id | Project id to create a project when project_create is true, or to be used when false. | string | ✓ | |
| grace_period | Grace period for an instance startup. | string | | "180s" |
| location | App Engine location used in the example (required for CloudFunctions). | string | | "europe-west" |
| project_create | Create project instead of using an existing one. | bool | | false |
| region | Compute region used in the example. | string | | "europe-west1" |
| root_node | The resource name of the parent folder or organization for project creation, in 'folders/folder_id' or 'organizations/org_id' format. | string | | null |
| schedule | Cron schedule for executing compute instances healthcheck. | string | | "*/5 * * * *" # every five minutes |
| tcp_port | TCP port to run healthcheck against. | string | | "80" # http |
| timeout | TCP probe timeout. | string | | "1000ms" |

Outputs

| name | description | sensitive |
|---|---|---|
| cloud-function-healthchecker | Cloud Function Healthchecker instance details. | |
| cloud-function-restarter | Cloud Function Restarter instance details. | |
| pubsub-topic | Restarter PubSub topic. | |

Test

module "test" {
  source          = "./fabric/blueprints/cloud-operations/unmanaged-instances-healthcheck"
  project_id      = "project-1"
  billing_account = "123456-123456-123456"
  project_create  = true
}
# tftest modules=11 resources=46