Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes: #236 - Running Zammad with replicas > 1 #243

Merged
merged 12 commits into from
Apr 26, 2024
Merged

Conversation

mgruner
Copy link
Collaborator

@mgruner mgruner commented Dec 18, 2023

What this PR does / why we need it

  • Splits up the chart from one StatefulSet into 4 Deployments and one Job.
    • The nginx and railsserver Deployments are freely scalable, the scheduler and websocket must remain at replicas: 1
    • The Job will be re-created on any chart update (via uuid in the name) and run the migrations. Deployments will fail until migrations are executed.
  • Add custom TMP dir handling to all deployment pods. Not just the railsserver needs the ability to create tmpfiles, but also the scheduler (e.g. image resizing for incoming emails processing). Use a consistent set of ENVs and the TMP volume for all Zammad deployment pods.
  • Refactor some code into helper templates to reduce redundancy / improve maintainability.

Depends on

Which issue this PR fixes

Special notes for your reviewer

Open questions / issues:

  • The persistentVolumeClaim will need ReadWriteMany access from now on. This may complicate deployments.
  • Also, upgrading will involve manual intervention in case the default PVC of the StatefulSet was used, because a new PVC will be created.
  • What about the sidecars? I don't see a place right now where to put them.

Checklist

  • Chart Version bumped
  • Upgrading instructions are documented in the zammad/README.md

Open Tasks

Preview Give feedback

@mgruner
Copy link
Collaborator Author

mgruner commented Dec 19, 2023

@klml @monotek this is not finished yet, but I would like to ask if you want to try this and provide first feedback. Works here with nginx and railsserver scaled to 3:

NAME                                                     READY   STATUS      RESTARTS   AGE
zammad-elasticsearch-master-0                            1/1     Running     0          3h33m
zammad-init-244dc656-5965-4a7b-b08f-97bf28b0fe33-hx545   0/4     Completed   0          2m14s
zammad-memcached-95c5bfd6b-lstns                         1/1     Running     0          3h33m
zammad-nginx-7f9ff54f86-56x5r                            1/1     Running     0          2m14s
zammad-nginx-7f9ff54f86-h4k6l                            1/1     Running     0          2m14s
zammad-nginx-7f9ff54f86-nn8jw                            1/1     Running     0          3h28m
zammad-postgresql-0                                      1/1     Running     0          3h33m
zammad-railsserver-58b64bb55d-267g2                      1/1     Running     0          2m14s
zammad-railsserver-58b64bb55d-9gzh9                      1/1     Running     0          8m14s
zammad-railsserver-58b64bb55d-bfndb                      1/1     Running     0          2m14s
zammad-redis-master-0                                    1/1     Running     0          3h33m
zammad-scheduler-7ddbcf4b4c-nffst                        1/1     Running     0          3h28m
zammad-websocket-86dcd556d7-5ng7s                        1/1     Running     0          3h28m

@klml
Copy link
Contributor

klml commented Dec 21, 2023

@mgruner Thank you very much. Openshift is working perfectly. We are still doing a few technical tests, but it looks very good.

HA works flawlessly. I just simulated a data center failure and we survived it without any measurable downtime ;)

Moving the inits to the k8s job is great, makes the rollout so much faster and also more stable, as I don't necessarily have to restart nginx and rails during a helm update.

zammad/templates/pvc.yaml Outdated Show resolved Hide resolved
zammad/values.yaml Outdated Show resolved Hide resolved
@mgruner mgruner self-assigned this Dec 22, 2023
@mgruner mgruner requested a review from monotek December 22, 2023 06:54
@mgruner mgruner marked this pull request as ready for review December 22, 2023 06:54
@mgruner mgruner requested a review from t-shehab December 22, 2023 08:15
@monotek
Copy link
Member

monotek commented Jan 8, 2024

sorry for the delay.
i did not use the computer for the last 4 weeks :D
i hope i can find some time next week...

@mgruner
Copy link
Collaborator Author

mgruner commented Jan 8, 2024

sorry for the delay. i did not use the computer for the last 4 weeks :D i hope i can find some time next week...

Thanks for the update @monotek. Looking forward to your feedback!

@monotek monotek self-assigned this Jan 8, 2024
@mgruner
Copy link
Collaborator Author

mgruner commented Jan 16, 2024

@monotek just added the custom TMP handling to all deployment pods, as it turned out that not only the railsserver needs to be able to create temporary files.

@klml
Copy link
Contributor

klml commented Jan 16, 2024

@mgruner from our side this branch works still very well. its running stable (and faster) on testing since before christmas 👍

zammad/README.md Outdated Show resolved Hide resolved
zammad/README.md Outdated Show resolved Hide resolved
zammad/README.md Outdated Show resolved Hide resolved
zammad/templates/service.yaml Show resolved Hide resolved
zammad/templates/statefulset.yaml Show resolved Hide resolved
zammad/values.yaml Outdated Show resolved Hide resolved
zammad/values.yaml Outdated Show resolved Hide resolved
@mgruner
Copy link
Collaborator Author

mgruner commented Jan 23, 2024

@monotek and I had a very productive call about this matter. We agreed that it is the right direction and will bring a major improvement for Zammad users on k8s.

However, there is an important consideration we need to make here. Zammad 6.2 currently requires /opt/zammad/var to be writable. This would mean we will require ReadWriteMany storage for all users with this change. That is not an good option on certain cloud providers / environments as @monotek already outlined.

Therefore we propose the following procedure and intermediate steps:

  • Put this PR on hold for the moment.
  • Add a new minor version to zammad-helm which adds support for S3 storage (based on the current StatefulSet). This will allow users to migrate their stored files from FS storage to S3, as a preparatory step for the laters witch to Deployments.
  • We will also implement a change in Zammad that modifies the way unprocessable mails are handled, so that the var/ folder will not be used any more. We will try to implement this for Zammad 6.3, but I can't make promises here.
  • Once the previous change in Zammad is implemented and released, we can rebase this PR, and eventually release it. Using ReadWriteMany storage would then be entirely optional, and require an existingVolumeClaim provided by the administrator. This also avoids the potential data loss issue on chart uninstallation.

@mgruner mgruner mentioned this pull request Jan 23, 2024
6 tasks
@klml
Copy link
Contributor

klml commented Jan 23, 2024

Despite the fact that the branch is running, we had the problem today that a manual rake zammad:searchindex:rebuild runs into the problem like #212 ;)

@klml
Copy link
Contributor

klml commented Jan 23, 2024

Using ReadWriteMany storage would then be entirely optional, and require an existingVolumeClaim provided by the administrator. This also avoids the potential data loss issue on chart uninstallation.

Perfekt for us. 👍

@mgruner
Copy link
Collaborator Author

mgruner commented Jan 24, 2024

Despite the fact that the branch is running, we had the problem today that a manual rake zammad:searchindex:rebuild runs into the problem like #212 ;)

I looked into this and found an issue with a missing entry point in the elasticsearch-init job container. This should be fixed by 409937d. Can you let me know if this helps, or otherwise send details about the error, please? It cannot be the same issue as in #212 because the StaticAssets handling is no longer present in Zammad 6.2.

@mgruner mgruner force-pushed the isse-236-scalability branch from 409937d to cffcd8b Compare January 29, 2024 13:35
@mgruner
Copy link
Collaborator Author

mgruner commented Feb 5, 2024

@klml @monotek the recent commit drops the creation of an internal PVC, and requires an externalClaim to be provided - but only IF File storage is used. Also, configuration for this moved from persistence to zammadConfig.storageVolume.

@klml this will not work any more correctly with Zammad 6.2 as it has no volume for var/ any more.

@klml
Copy link
Contributor

klml commented Feb 5, 2024

@mgruner thanks for leting me know. We had this branch only on dev-environment, and wait for 6.3

monotek
monotek previously approved these changes Feb 5, 2024
Copy link
Member

@monotek monotek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgruner
Nice, that the "var" dir issue is fixed 👍
From the code it looks good to me now :)

Had no time to test it by myself though.

@mgruner mgruner force-pushed the isse-236-scalability branch 2 times, most recently from 7489495 to c74c8d1 Compare February 14, 2024 10:42
@mgruner
Copy link
Collaborator Author

mgruner commented Apr 23, 2024

@klml can you please have a look at f3bfc0b? This should fix the issue. It makes the volume-permissions container's command configurable, so that you can replace it with something that works in OpenShift. I would suggest to use /opt/zammad/tmp/tmp now to have a writable directory in the tmpDir volume.

Please let me know if this works and if the updated description for OpenShift in the Readme is correct now.

monotek
monotek previously approved these changes Apr 23, 2024
Copy link
Member

@monotek monotek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After checking again, I need to ask: can you explain this please? We already use static container names. Only the pod names are dynamic because they are controlled by the deployments, and this is probably by design. Did I miss something here?

Sorry, i was confused because the "{{ .Chart.Name }}" var is used in the container name. Example:

- name: {{ .Chart.Name }}-nginx

The var is "zammad" all the time and should not change.
Therefore lets keep it as it is.

For the rest i'm also ok with keeping it for now and change later, if somebody complains :)

The "/opt/zammad/tmp/tmp" dir does look a bit weird. Can we use "/opt/zammad/var/tmp" again?

@mgruner
Copy link
Collaborator Author

mgruner commented Apr 23, 2024

The "/opt/zammad/tmp/tmp" dir does look a bit weird. Can we use "/opt/zammad/var/tmp" again?

I don't think so, as there is no var/ any more, and in a readonly FS we cannot create one. Besides that, using a subdirectory in a tmpDir should be much faster, as it is not a shared mount point.

Alternatively @klml you could try not modifying the tempdir, but instead using zammadConfig.tmpDirVolume.emptyDir.medium: memory as described in the comment. It would be faster and probably work permission wise without a custom subdirectory.

@monotek
Copy link
Member

monotek commented Apr 23, 2024

I would prefer the in memory workaround too, so we can just use "/opt/zammad/tmp".
If this is no option lets go with ugly "/opt/zammad/tmp/tmp".

@klml
Copy link
Contributor

klml commented Apr 23, 2024

Great zammadConfig.tmpDirVolume.emptyDir.medium: memory works! ;)

And I removed my mkdir -pv /opt/zammad/tmp/tmp && chmod -v +t /opt/zammad/tmp/tmp from customInit.

and I run with

    volumePermissions:
      enabled: false

I tested this on

  • existing instance (startversion 10.3.4)
  • fresh database

@klml
Copy link
Contributor

klml commented Apr 23, 2024

ingress/routes is missing

ingress/routes get removed on an existing instance and an fresh deployment

ingress:
  enabled: true
  className: "openshift-default"
  annotations:
    # generate openshift route
    route.openshift.io/termination: "edge"

and

ingress:
  hosts:
    - host: zammad-nginx-mpdz-zammad-dev.apps.k8s.example.de
      paths:
        - path: /
          pathType: ImplementationSpecific

@klml
Copy link
Contributor

klml commented Apr 23, 2024

@mgruner ingress/routes were missing, because the ingress still listened to the dynamic {{ $fullName }}, but the svc has changed to zammad-nginx

index f4b36e3..2b5223c
--- a/zammad/templates/ingress.yaml
+++ b/zammad/templates/ingress.yaml
@@ -49,7 +49,7 @@ spec:
             backend:
               {{- if semverCompare ">=1.19-0" $.Capabilities.KubeVersion.GitVersion }}
               service:
-                name: {{ $fullName }}
+                name: zammad-nginx

@mgruner mgruner force-pushed the isse-236-scalability branch from f3bfc0b to b88b569 Compare April 24, 2024 07:24
@mgruner
Copy link
Collaborator Author

mgruner commented Apr 24, 2024

@klml absolutely awesome, thank you. Can you please check 0682363? There I applied your Ingress patch and updated the OpenShift documentation in the Readme. Is this the correct and final state that we should merge now?

@klml
Copy link
Contributor

klml commented Apr 24, 2024

@mgruner works now oob ;) let me get the approval from my the functional department, then we can merge.

@klml
Copy link
Contributor

klml commented Apr 25, 2024

@mgruner looks good to me! thank you very much for this 🙏

@mgruner mgruner merged commit 0877f7d into main Apr 26, 2024
12 checks passed
@mgruner mgruner deleted the isse-236-scalability branch April 26, 2024 05:06
@klml
Copy link
Contributor

klml commented Apr 26, 2024

I guess we could also use static containernames. Makes stuff like logging easier and independent from the helm release name.

I agree and opend #265

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Running zammad with replicas > 1
3 participants