docs: site: add guide for setting up Pinniped with high availability, redundancy and best practices for scalability #2164
Hi @Dentrax, these are great questions and suggestions. Thanks for posting. It'll be quicker to share whatever info I have here rather than drafting a whole new document, so I'll start with that.

At the moment, there are several settings to adjust. Both the Supervisor and the Concierge support running with multiple replicas in their Deployments. Both have cpu and memory limits that can be adjusted. In general, both apps are very lean and efficient, so they don't require a lot of resources, even at scale. One exception is documented in the section called "Performance implications of using OIDCClients in the Supervisor" in https://pinniped.dev/docs/howto/configure-auth-for-webapps. However, if you don't use that feature, then things are quite efficient.

We don't have specific recommendations at this time for adjusting the number of replicas or cpu/memory limits. Both apps should scale very well both out (when given more replicas) and up (when given more cpu/memory). Leader election happens automatically within each Deployment and is always enabled. Failover between the pods happens automatically by default because we use the regular Kubernetes health check mechanisms in each pod. We recommend that you keep an eye on the actual cpu and memory usage in your deployments and adjust accordingly. Make sure that the Kubernetes Service that is created for the Supervisor or the Concierge routes incoming https requests to all the available pods (e.g. round robin or similar) instead of always routing requests to the first pod, but hopefully Kubernetes will give you that behavior for free.

A known limitation of the Supervisor is that it uses Kubernetes Secrets as its session storage mechanism. This is convenient because it does not require you to install any database with the Supervisor. Each end-user session causes several Secrets to be created after successful authentication. The Secrets are automatically deleted when the session expires. This works fine for simultaneous user sessions in the low thousands, but may not scale easily to very large numbers of simultaneous user sessions, depending on how you tune etcd on that cluster.

We don't have documented support for running multiple Supervisor Deployments on separate clusters in a disaster recovery primary/secondary failover style configuration. However, this is theoretically possible by configuring both Deployments the same way with the same DNS name for the FederationDomain and the same TLS certs, and then synchronizing a select few of the auto-created Kubernetes Secrets from the first deployment to the second (e.g. the Supervisor's auto-generated signing keys). You would need to provide your own means of detecting that the primary deployment has gone offline and cutting over to the secondary deployment, e.g. at a load balancer. Once traffic cuts over, all active users would be prompted to log in again (unless they were using the techniques described in https://pinniped.dev/docs/howto/cicd) but would otherwise continue without knowing that they had switched to the backup deployment. Cutting back to the primary deployment would again cause users to log in again, assuming that you are not also synchronizing the end-user session Secrets back and forth between the primary and secondary. Since the cutover would only happen in the case of a disaster (e.g. the whole Kubernetes cluster running the primary deployment goes offline), the cost of causing end users to log in again should be acceptable in most cases.
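To illustrate the synchronization idea above, here is a very rough sketch using plain kubectl and jq. The kubectl context names (`primary`, `secondary`) and the Secret name are placeholders for illustration only, and the namespace assumes the default install; the actual names of the Supervisor's auto-generated Secrets depend on your installation, so treat this as a starting point rather than a supported procedure.

```sh
# Sketch: copy one of the Supervisor's auto-generated Secrets (e.g. a signing key
# Secret) from the primary cluster to the secondary cluster.
# "primary" and "secondary" are placeholder kubectl context names, and
# SIGNING_KEY_SECRET is a placeholder -- look up the real Secret name created
# by your own Supervisor installation.
SIGNING_KEY_SECRET="replace-with-real-secret-name"
NAMESPACE="pinniped-supervisor"   # default install namespace; adjust if customized

kubectl --context primary -n "$NAMESPACE" get secret "$SIGNING_KEY_SECRET" -o json \
  | jq 'del(.metadata.uid, .metadata.resourceVersion, .metadata.creationTimestamp,
            .metadata.ownerReferences, .metadata.managedFields)' \
  | kubectl --context secondary -n "$NAMESPACE" apply -f -
```

The jq step strips cluster-specific metadata (uid, resourceVersion, ownerReferences, etc.) so the object can be created cleanly in the second cluster; in a real setup you would likely automate this sync rather than run it by hand.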
I'm not sure if I understood this point. The Supervisor provides centralized login regardless of where the workload clusters reside, but maybe that's not what you were asking here. Please feel free to ask follow-up questions.
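And to make the earlier point about replicas and cpu/memory limits concrete: both components are ordinary Kubernetes Deployments, so the usual kubectl mechanisms apply. A minimal sketch follows, assuming the default Deployment and namespace names from the install YAML (`pinniped-supervisor` and `pinniped-concierge`); verify the names against your own installation, and note that the resource values shown are arbitrary examples, not recommendations.

```sh
# Scale the Supervisor out to more replicas.
kubectl -n pinniped-supervisor scale deployment pinniped-supervisor --replicas=3

# Scale the Supervisor up by adjusting cpu/memory requests and limits
# (example values only; tune based on observed usage).
kubectl -n pinniped-supervisor set resources deployment pinniped-supervisor \
  --requests=cpu=100m,memory=128Mi \
  --limits=cpu=500m,memory=256Mi

# The same commands work for the Concierge Deployment.
kubectl -n pinniped-concierge scale deployment pinniped-concierge --replicas=3
```

If you deploy via the ytt templates or a GitOps tool, it's better to set these values in your deployment configuration rather than patching live objects, so they aren't reverted on the next deploy.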
Is your feature request related to a problem? Please describe.
It'd be great to have guidance on configuring Pinniped for high availability at scale. Without comprehensive documentation, there can be challenges when setting up features like leader election, multi-supervisor support, and multi-data-center (multi-DC/region/zone) configurations.
Providing this information will help users confidently deploy in production environments with resiliency and reliability in mind.
Describe the solution you'd like
Create a new document that provides:
I've read the demo doc, but I'm not sure whether it covers some of the open questions and concerns above.
Describe alternatives you've considered
-
Are you considering submitting a PR for this feature?
-
Additional context
Include any insights into areas users frequently overlook or struggle with when deploying Pinniped at scale. Ensure documentation is concise and user-friendly.
Sorry if this seems a bit overwhelming due to the many questions. As a new user, I'd like to clarify things before actually using Pinniped. I'm happy to simplify if there's too much context here. Thank you!
/cc @developer-guy