Implement Cassandra backup and restore. (#418)
* First draft of icarus sidecar container usage.

* Refactor auth logic to use a separate struct that holds all auth data.

* Implement TLS support for icarus container. Pass the proper JMX credentials for icarus.

* Create skeleton for cassandrabackup controller.

* Fix auth logic. A fresh cluster couldn't init admin user.

* fix unit tests

* Create a separate icarus client. The generated one had bugs and was inconvenient to work with.

* Implement first draft of the backup controller.
Tested and works only with the file storage type.

* Track backup progress. It is shown in the status as an integer from 0 to 100; floats are avoided because the controller-gen tool does not recommend them.
Requeue until the backup is finished so the status stays up to date.

* Create the skeleton for restore controller.

* Regenerate manifests

* Move icarus backup related methods into a separate file.

* Add restore methods for the icarus client.

* Get the list of restores to later check if there's one existing already.

* Fix the check if the backup with the requested snapshot name exists already.
Fix typo.

* Implement restore logic. Tested with file storage type only.

* Create a serviceaccount with necessary roles for cassandra pods. Needed for icarus to allow reading k8s secrets.
Expose a secret name arg option for icarus to support storage providers other than file.

* Add description to the backup CR fields.

* Add backup duration option.

* Add bandwidth option.

* Add concurrentConnections option.

* Add dc option.

* Add entities, timeout and metadataDirective options.

* Add the rest of the backup options.

* Generate assets.

* Add most of the fields for restore CRD.

* Implement failed backup process restart if user changed the config and a failed backup exists in icarus. If the backup request is absent in icarus - tell the user to recreate the CR.

* Implement failed restore process restart if user changed the config and a failed restore exists in icarus. If the restore request is absent in icarus - tell the user to recreate the CR.

* Implement validating webhook for cassandrabackup.

* Implement validating webhook for cassandrarestore.

* Validate storage location in both controllers.

* validate duration

* Add more CRD fields validations.

* Fix docs.

* Move related backup search and failed backup reconcile logic into separate functions.

* Move status reconcile into separate func.

* Break up main func into smaller ones.

* Split controller into smaller functions.

* Move code around, rename vars and move icarus related funcs into separate file.

* Move main restore logic into separate file.

* Refactor restore logic.

* Track cluster readiness in the CassandraCluster status.

* Use CassandraCluster readiness status field in backup and restore controllers to block execution before the cluster becomes ready.

* Remove restorationStrategyType as only HARDLINKS can be supported. IMPORT is available only on Cassandra 4, and IN_PLACE can be used only on a node that's down. We support only live clusters (at least for now).
Remove the singlePhase field as we don't plan to support single-phase restores (at least not yet).
For that reason restorationPhase is also removed: only INIT can be supported when singlePhase is false.
Remove the actualSnapshotTag status field from backup since icarus supports specifying only the tag name, without the appended schema version and timestamp.

* Add schemaVersion and exactSchemaVersion fields.

* Fix updating the active admin secret with the wrong role and password.

* Drop support for file storage.

* Fix tests and add checks for icarus container.

* Fix lint issues.

* Make the backup and restore controllers more testable.
Implement new controllers initialization for test manager.
Create icarus mock.
Create a simple test for backup logic.

* Cover with tests failure scenario.

* Add restore tests.
Hardcode the downloaded sstables location on restore.
Fix CassandraRestore cleanup in tests.

* Add docs.

* Implement storage secret validation.

* make manifests

* Fix a few bugs and descriptions.

* Fix tests and lint issues.

* Fix not using the duration field.
Remove the unused field.

* Allow overriding the snapshotTag name.

* Build and push icarus image in CI

* Fix trivy vulnerability issue.

* fix Dockerfile for icarus

* Run tests against k8s 1.24.2.

* Don't run against old k8s versions.

* Fix CRD cleanup in CI script.

* Choose the container in `execPod`. It stopped working now that the pod has 2 containers; the container must be specified for the request to succeed.
Make `utils.MergeMap` resilient to nil maps.

* Allow more processes during e2e tests.

* Rename vars to avoid struct name shadowing.
Don't mix value and pointer receiver methods declaration.

* Fix misuse of util.MergeMap. The caller relied on a side effect of a bug that populated the map passed as the first argument, but only the resulting map should contain the merged elements. The arguments should not change.

* Return a nil map if the inputs are nil in `MergeMap`.

* Use .Before instead of comparing timestamps.

* Fix compile errors after main merge.

* Revert to run integration tests against 1.20.2

* Don't output debug logs into stdout on failed e2e tests since it became very verbose and hard to read. Users should download the logs from the artifacts on GitHub Actions or look at the /tmp/debug-logs folder if running tests locally.

* Use constants to identify storage providers.

* Remove commented code.

* Don't parse time twice.

* Mark the network policies test as Serial since it uses host ports.

* Fix network policies e2e test. Set the correct container name.

* Fix circular dependencies.

* Fix the network policies for icarus and cover them in the networkpolicy test.

* Fix deprecated io/ioutil package usage.

* Fix networkpolicy integration test.

* Upgrade icarus and re-enable trivy scanner for the image.

* Fix proxy registry URL.

* Replace string literals with constants.

* Apply suggestions from code review

Co-authored-by: Craig Ingram <[email protected]>

* Replace string literals with constants.

* Apply suggestions from code review

Co-authored-by: Craig Ingram <[email protected]>

* Apply suggestions from code review

Co-authored-by: Craig Ingram <[email protected]>

Co-authored-by: Craig Ingram <[email protected]>
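
The `MergeMap` items in the message above describe three properties: nil inputs are tolerated, the arguments are never modified, and two nil inputs yield a nil map. The following is a minimal sketch of a helper with those properties, not the operator's actual code. The `map[string]string` signature and the rule that the second map wins on duplicate keys are assumptions made for illustration.

```go
package main

import "fmt"

// MergeMap returns a new map with the entries of a overlaid by the entries of b.
// Hypothetical sketch: the signature and the "b wins on conflict" rule are
// assumptions, not taken from the commit.
func MergeMap(a, b map[string]string) map[string]string {
	// Two nil inputs produce a nil result rather than an empty map.
	if a == nil && b == nil {
		return nil
	}
	// Always build a fresh map so neither argument is mutated.
	merged := make(map[string]string, len(a)+len(b))
	for k, v := range a { // ranging over a nil map is a no-op
		merged[k] = v
	}
	for k, v := range b {
		merged[k] = v
	}
	return merged
}

func main() {
	base := map[string]string{"app": "cassandra"}
	extra := map[string]string{"dc": "dc1"}

	merged := MergeMap(base, extra)
	// The result has both entries; the inputs keep one entry each.
	fmt.Println(len(merged), len(base), len(extra))
	fmt.Println(MergeMap(nil, nil) == nil)
}
```

A caller that previously relied on the first argument being populated would have to switch to the returned map, which is the misuse the message says was fixed.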
tomashibm and Craig Ingram authored Aug 29, 2022
1 parent 7fa827d commit 5da30c0
Showing 77 changed files with 4,457 additions and 288 deletions.
75 changes: 70 additions & 5 deletions .github/workflows/pull_request.yml
@@ -6,6 +6,7 @@ on:
env:
GO_VERSION: 1.18
HELM_VERSION: v3.9.2
ICARUS_VERSION: 2.0.4
PYTHON_VERSION: 3.7 # required for helm tester
IBM_CLOUD_API_KEY: ${{ secrets.IBM_CLOUD_API_KEY }}
IBM_CLOUD_REGION: us-south
@@ -59,7 +60,7 @@ jobs:
strategy:
fail-fast: true
matrix:
k8s: [1.20.2, 1.21.4, 1.22.1, 1.23.5, 1.24.2]
k8s: [1.20.2, 1.21.2, 1.22.1, 1.23.1, 1.24.2]
steps:
- name: Checkout
uses: actions/checkout@v3
@@ -281,6 +282,56 @@ jobs:
retention-days: 1


build-icarus:
runs-on: ubuntu-latest
needs: [run-unit-tests, run-integration-tests, validate-helm-charts]
steps:
- name: Checkout
uses: actions/checkout@v3

- name: Inject slug/short variables
uses: rlespinasse/github-slug-action@v4

- name: Modify GITHUB_REF_SLUG
run: echo "GITHUB_REF_SLUG=$GITHUB_REF_SLUG-${{ github.run_id }}" >> $GITHUB_ENV

- name: Setup Buildx
uses: docker/setup-buildx-action@v2

- name: Authenticate to Docker Proxy Registry
uses: docker/login-action@v2
with:
registry: ${{ secrets.DOCKER_PROXY_REGISTRY }}
username: ${{ secrets.ARTIFACTORY_USER }}
password: ${{ secrets.ARTIFACTORY_PASS }}

- name: Build icarus image
uses: docker/build-push-action@v3
with:
file: ./icarus/Dockerfile
context: ./icarus
build-args: |
ICARUS_VERSION=${{ env.ICARUS_VERSION }}
DOCKER_PROXY_REGISTRY=${{ secrets.DOCKER_PROXY_REGISTRY }}/
tags: us.icr.io/${{ env.ICR_NAMESPACE }}/icarus:${{ env.GITHUB_REF_SLUG }}
outputs: type=docker,dest=icarus.tar

- name: Run Trivy vulnerability scanner
uses: aquasecurity/[email protected]
with:
input: "icarus.tar"
exit-code: "1"
ignore-unfixed: true
severity: ${{ env.TRIVY_SEVERITY }}

- name: Upload icarus image artifact
uses: actions/upload-artifact@v3
with:
name: icarus
path: icarus.tar
retention-days: 1


validate-helm-charts:
runs-on: ubuntu-latest
steps:
@@ -337,7 +388,7 @@ jobs:
push-images-for-e2e:
if: "!contains(github.event.head_commit.message, 'e2e skip')"
needs: [build-operator, build-cassandra, build-prober, build-jolokia, validate-helm-charts]
needs: [build-operator, build-cassandra, build-prober, build-jolokia, build-icarus, validate-helm-charts]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
@@ -387,18 +438,25 @@ jobs:
with:
name: jolokia

- name: Download icarus image artifact
uses: actions/download-artifact@v3
with:
name: icarus

- name: Load container images
run: |
docker load -i cassandra-operator.tar
docker load -i cassandra.tar
docker load -i prober.tar
docker load -i jolokia.tar
docker load -i icarus.tar
- name: Push Images to ICR
run: |
docker push "us.icr.io/${{ env.ICR_NAMESPACE }}/cassandra-operator:$GITHUB_REF_SLUG"
docker push "us.icr.io/${{ env.ICR_NAMESPACE }}/prober:$GITHUB_REF_SLUG"
docker push "us.icr.io/${{ env.ICR_NAMESPACE }}/cassandra:$GITHUB_REF_SLUG"
docker push "us.icr.io/${{ env.ICR_NAMESPACE }}/jolokia:$GITHUB_REF_SLUG"
docker push "us.icr.io/${{ env.ICR_NAMESPACE }}/icarus:$GITHUB_REF_SLUG"
run-e2e-tests:
needs: [push-images-for-e2e]
@@ -459,6 +517,7 @@ jobs:
--set "proberImage=us.icr.io/${{ env.ICR_NAMESPACE }}/prober:$GITHUB_REF_SLUG" \
--set "cassandraImage=us.icr.io/${{ env.ICR_NAMESPACE }}/cassandra:$GITHUB_REF_SLUG" \
--set "jolokiaImage=us.icr.io/${{ env.ICR_NAMESPACE }}/jolokia:$GITHUB_REF_SLUG" \
--set "icarusImage=us.icr.io/${{ env.ICR_NAMESPACE }}/icarus:$GITHUB_REF_SLUG" \
--set "logFormat=console" \
--set "logLevel=debug" \
--set "container.imagePullSecret=$IMAGE_PULL_SECRET"
@@ -492,16 +551,22 @@ jobs:
run: helm uninstall cassandra-operator
- name: Remove CassandraCluster CRD
if: ${{ always() }}
run: kubectl delete -f cassandra-operator/crds/cassandracluster.yaml
run: kubectl delete -f cassandra-operator/crds/db.ibm.com_cassandraclusters.yaml
- name: Remove CassandraBackup CRD
if: ${{ always() }}
run: kubectl delete -f cassandra-operator/crds/db.ibm.com_cassandrabackups.yaml
- name: Remove CassandraRestore CRD
if: ${{ always() }}
run: kubectl delete -f cassandra-operator/crds/db.ibm.com_cassandrarestores.yaml
# We have below logic bc when multiple tags exist for the same image digest within a repository, the ibmcloud cr image-rm command removes the underlying image and all its tags. See details: https://cloud.ibm.com/docs/container-registry-cli-plugin?topic=container-registry-cli-plugin-containerregcli#bx_cr_image_rm
# We can also add a check if commit message contains `no_image_del` then skip the image deletion step
- name: Clenaup k8s namespace
- name: Cleanup k8s namespace
if: ${{ always() }}
run: kubectl delete namespace $IKS_NAMESPACE
- name: Cleanup Images
if: ${{ always() }}
run: |
for image_name in cassandra-operator prober cassandra jolokia; do
for image_name in cassandra-operator prober cassandra jolokia icarus; do
image_digest=$(ibmcloud cr image-list --restrict ${{ env.ICR_NAMESPACE }} --format "{{if and (eq .Repository \"us.icr.io/cassandra-operator/$image_name\") (eq .Tag \"$GITHUB_REF_SLUG\")}}{{.Digest}}{{end}}" --no-trunc)
image_tags=$(ibmcloud cr image-digests --restrict ${{ env.ICR_NAMESPACE }} --format "{{if and (eq .Digest \"$image_digest\")}}{{.Tags}}{{end}}" | sed -e 's/\[//g' -e 's/\]//g')
image_tags_arr=($image_tags)
71 changes: 70 additions & 1 deletion .github/workflows/release.yml
@@ -14,6 +14,7 @@ env:
HELM_REPO_PASS: ${{ secrets.ARTIFACTORY_PASS }}
HELM_REPO: ${{ secrets.ARTIFACTORY_HELM_REPO }}
CASSANDRA_VERSION: 3.11.13
ICARUS_VERSION: 2.0.4
JMX_EXPORTER_VERSION: 0.17.0

jobs:
@@ -286,9 +287,76 @@ jobs:
labels: ${{ steps.meta.outputs.labels }}


icarus:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3

- name: Prepare image metadata
uses: docker/metadata-action@v4
id: meta
with:
images: |
us.icr.io/${{ env.ICR_NAMESPACE }}/icarus
uk.icr.io/${{ env.ICR_NAMESPACE }}/icarus
de.icr.io/${{ env.ICR_NAMESPACE }}/icarus
au.icr.io/${{ env.ICR_NAMESPACE }}/icarus
jp.icr.io/${{ env.ICR_NAMESPACE }}/icarus
tags: type=ref,event=tag

- name: Setup Buildx
uses: docker/setup-buildx-action@v2

- name: Login to IBM Cloud Container Registry US
uses: docker/login-action@v2
with:
registry: us.icr.io
username: ${{ env.ICR_USERNAME }}
password: ${{ env.ICR_PASSWORD }}

- name: Login to IBM Cloud Container Registry UK
uses: docker/login-action@v2
with:
registry: uk.icr.io
username: ${{ env.ICR_USERNAME }}
password: ${{ env.ICR_PASSWORD }}

- name: Login to IBM Cloud Container Registry DE
uses: docker/login-action@v2
with:
registry: de.icr.io
username: ${{ env.ICR_USERNAME }}
password: ${{ env.ICR_PASSWORD }}

- name: Login to IBM Cloud Container Registry AU
uses: docker/login-action@v2
with:
registry: au.icr.io
username: ${{ env.ICR_USERNAME }}
password: ${{ env.ICR_PASSWORD }}

- name: Login to IBM Cloud Container Registry JP
uses: docker/login-action@v2
with:
registry: jp.icr.io
username: ${{ env.ICR_USERNAME }}
password: ${{ env.ICR_PASSWORD }}

- name: Build and push icarus image
uses: docker/build-push-action@v3
with:
push: true
file: ./icarus/Dockerfile
context: ./icarus
build-args: |
ICARUS_VERSION=${{ env.ICARUS_VERSION }}
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}


helm-release:
runs-on: ubuntu-latest
needs: [operator, cassandra, prober, jolokia]
needs: [operator, cassandra, prober, jolokia, icarus]
outputs:
tag: ${{ steps.get_release_tag.outputs.tag }}
steps:
@@ -311,6 +379,7 @@ jobs:
./bin/yq w -i cassandra-operator/values.yaml 'proberImage' $(./bin/yq r cassandra-operator/values.yaml 'proberImage' | sed "s/:.*/:${{ steps.get_release_tag.outputs.tag }}/")
./bin/yq w -i cassandra-operator/values.yaml 'cassandraImage' $(./bin/yq r cassandra-operator/values.yaml 'cassandraImage' | sed "s/:.*/:${{ env.CASSANDRA_VERSION }}-${{ steps.get_release_tag.outputs.tag }}/")
./bin/yq w -i cassandra-operator/values.yaml 'jolokiaImage' $(./bin/yq r cassandra-operator/values.yaml 'jolokiaImage' | sed "s/:.*/:${{ steps.get_release_tag.outputs.tag }}/")
./bin/yq w -i cassandra-operator/values.yaml 'icarusImage' $(./bin/yq r cassandra-operator/values.yaml 'icarusImage' | sed "s/:.*/:${{ steps.get_release_tag.outputs.tag }}/")
./bin/yq w -i cassandra-operator/Chart.yaml 'appVersion' $(./bin/yq r cassandra-operator/Chart.yaml 'appVersion' | sed "s/:.*/:${{ steps.get_release_tag.outputs.tag }}/")
./bin/yq w -i cassandra-operator/Chart.yaml 'version' $(./bin/yq r cassandra-operator/Chart.yaml 'version' | sed "s/:.*/:${{ steps.get_release_tag.outputs.tag }}/")
5 changes: 3 additions & 2 deletions Makefile
@@ -42,7 +42,7 @@ integration-tests:

# Run e2e tests
e2e-tests:
ginkgo -v --procs 11 --timeout=$(E2E_TIMEOUT) --always-emit-ginkgo-writer --progress --fail-fast ./tests/e2e/ -- \
ginkgo -v --procs 20 --timeout=$(E2E_TIMEOUT) --always-emit-ginkgo-writer --progress --fail-fast ./tests/e2e/ -- \
-test.v -test.timeout=$(E2E_TIMEOUT) \
-operatorNamespace=$(K8S_NAMESPACE) \
-imagePullSecret=$(IMAGE_PULL_SECRET) \
@@ -82,7 +82,7 @@ deploy: manifests kustomize
# Generate manifests e.g. CRD, RBAC etc.
manifests: controller-gen
$(CONTROLLER_GEN) $(CRD_OPTIONS) rbac:roleName=manager-role output:rbac:none paths="./..." output:crd:artifacts:config=config/crd/bases
kustomize build $(ROOT_DIR)config/crd > $(ROOT_DIR)cassandra-operator/crds/cassandracluster.yaml
$(CONTROLLER_GEN) $(CRD_OPTIONS) rbac:roleName=manager-role output:rbac:none paths="./..." output:crd:artifacts:config=$(ROOT_DIR)cassandra-operator/crds
$(CONTROLLER_GEN) $(CRD_OPTIONS) rbac:roleName=cassandra-operator paths="./..." output:crd:none output:rbac:stdout > $(ROOT_DIR)cassandra-operator/templates/clusterrole.yaml

# Run go fmt against code
@@ -100,6 +100,7 @@ generate: controller-gen
mockgen -package=mocks -source=./controllers/prober/prober.go -destination=./controllers/mocks/mock_prober.go
mockgen -package=mocks -source=./controllers/reaper/reaper.go -destination=./controllers/mocks/mock_reaper.go
mockgen -package=mocks -source=./controllers/nodectl/nodectl.go -destination=./controllers/mocks/mock_nodectl.go
mockgen -package=mocks -source=./controllers/icarus/icarus.go -destination=./controllers/mocks/mock_icarus.go

# Build the docker image
docker-build: