docs: site: add guide for setting up Pinniped with high availability, redundancy and best practices for scalability #2164

Open
Dentrax opened this issue Dec 24, 2024 · 1 comment

Dentrax commented Dec 24, 2024

Is your feature request related to a problem? Please describe.

It would be great to have guidance on configuring Pinniped for high availability at scale. Without comprehensive documentation, setting up features like leader election, multi-Supervisor support, and multi-data-center (multi-DC/region/zone) configurations can be challenging.

Providing this information will help users deploy confidently in production environments with resiliency and reliability in mind.

Describe the solution you'd like

Create a new document that provides:

  • High Availability
    • Instructions for enabling leader election (if supported; if not, why not? Any ongoing plans?)
    • Recommended configurations for redundancy and fail-over
    • How many Supervisors are recommended at scale (how many Pods / clusters)?
  • Scalability best practices
    • Configurations to fine-tune for performance in large-scale deployments (e.g., multiple Supervisors with thousands of Concierges)
    • Any known limitations or suggested tuning parameters?
  • Other recommendations
    • Highlight additional/important configurations or tools
    • Address common questions (in the issues) with a FAQ?
  • Extra: Multi-Cluster/DC/Region/Zone Setup:
    • Steps to configure Pinniped Supervisor for multiple clusters (for H/A) - (not per-cluster)
    • Guidance on certificate management for multi-dc usage?
    • Clarify if centralized login is possible across data centers (i.e., single login for eu-west and us-central)

I've read the demo doc, but I suspect it doesn't cover some of the open questions and concerns above.

Describe alternatives you've considered

-

Are you considering submitting a PR for this feature?

-

Additional context

Include any insights into areas users frequently overlook or struggle with when deploying Pinniped at scale. Ensure the documentation is concise and user-friendly.

Sorry if this seems a bit overwhelming due to the many questions. As a new user, I’d like to clarify things before actual usage. I’m happy to simplify if there’s too much context. Thank you!

/cc @developer-guy

cfryanr (Member) commented Jan 2, 2025

Hi @Dentrax, these are great questions and suggestions. Thanks for posting.

It'll be quicker to try to share whatever info I have here rather than drafting a whole new document, so I'll start with that.

At the moment, there are several settings to adjust. Both the Supervisor and the Concierge support running with multiple replicas in their deployments. Both have cpu and memory limits that can be adjusted. In general, both apps are very lean and efficient so they don't require a lot of resources, even at scale. One exception is documented in the section called "Performance implications of using OIDCClients in the Supervisor" in the document https://pinniped.dev/docs/howto/configure-auth-for-webapps. However, if you don't use that feature then things are quite efficient.

We don't have specific recommendations at this time for adjusting the number of replicas or cpu/memory limits. Both apps should scale very well both out (when given more replicas) and up (when given more cpu/memory). Leader election happens automatically within each Deployment and is always enabled. Failover between the pods happens automatically by default because we use the regular Kubernetes health check mechanisms in each pod. We recommend that you keep an eye on the actual cpu and memory usage in your deployments and adjust accordingly.
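
A minimal sketch of what that looks like with plain kubectl, assuming default-style install names (namespace and deployment both called pinniped-supervisor; adjust to match your own installation). The numbers are placeholders, not recommendations:

```bash
# Scale the Supervisor out to more replicas.
kubectl -n pinniped-supervisor scale deployment/pinniped-supervisor --replicas=3

# Adjust CPU/memory requests and limits on the Supervisor's containers.
kubectl -n pinniped-supervisor set resources deployment/pinniped-supervisor \
  --requests=cpu=100m,memory=128Mi --limits=cpu=500m,memory=256Mi

# Keep an eye on actual usage over time (requires metrics-server).
kubectl -n pinniped-supervisor top pods
```

The same approach applies to the Concierge in its own namespace and deployment.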

Make sure that the Kubernetes Service that is created for the Supervisor or the Concierge will route incoming https requests to all the available pods (e.g. round robin or similar) instead of always routing requests to the first pod, but hopefully Kubernetes will give you that behavior for free.
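
To sanity-check that, you can confirm the Service has an endpoint for every ready pod (the Service name below is a placeholder for whichever Service you created to expose the Supervisor):

```bash
# There should be one address listed per ready Supervisor pod.
kubectl -n pinniped-supervisor get endpoints <your-supervisor-service> -o wide

# Compare against the ready pods in the Deployment.
kubectl -n pinniped-supervisor get pods -o wide
```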

A known limitation of the Supervisor is that it uses Kubernetes Secrets as a session storage mechanism. This is convenient because it does not require you to install any database with the Supervisor. Each end-user session causes several Secrets to be created after successful authentication. The Secrets are automatically deleted when the session expires. This works fine for simultaneous user sessions in the low thousands, but may not scale easily to very large numbers of simultaneous user sessions, depending on how you tune etcd on that cluster.
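
If you want to keep an eye on this, you can watch how many Secrets accumulate in the Supervisor's namespace over time. A rough sketch (the label selector is an assumption about how the session Secrets are labeled, so check with --show-labels on your own cluster first):

```bash
# Rough count of all Secrets in the Supervisor's namespace.
kubectl -n pinniped-supervisor get secrets --no-headers | wc -l

# If the session Secrets carry a distinguishing label (verify with
# `kubectl -n pinniped-supervisor get secrets --show-labels`), count only those.
kubectl -n pinniped-supervisor get secrets -l storage.pinniped.dev/type --no-headers | wc -l
```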

We don't have documented support for running multiple Supervisor Deployments on separate clusters in a disaster recovery primary/secondary fail-over style configuration. However, this is theoretically possible by configuring both Deployments the same way with the same DNS name for the FederationDomain and the same TLS certs, and then synchronizing a select few of the auto-created Kubernetes Secrets from the first deployment to the second (e.g. the Supervisor's auto-generated signing keys). You would need to provide your own means of detecting that the primary deployment has gone offline and cutting over to the second deployment, e.g. at a load balancer.

Once traffic cuts over, all active users would be prompted to log in again (unless they were using the techniques described in https://pinniped.dev/docs/howto/cicd) but would otherwise continue without knowing that they had switched to the backup deployment. Cutting back to the primary deployment would cause users to log in once more, assuming that you are not also synchronizing the end-user session Secrets back and forth between the primary and secondary. Since the cutover would only happen in the case of disaster (e.g. the whole Kubernetes cluster running the primary deployment goes offline), the cost of causing end users to log in again should be acceptable in most cases.
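
As a rough sketch of what that synchronization could look like (the Secret name is a placeholder; you would need to identify which of the Supervisor's auto-created Secrets to copy, and you may need to strip cluster-specific metadata such as resourceVersion and uid before applying):

```bash
# Hypothetical: copy one Supervisor-generated Secret (e.g. a signing-key Secret)
# from the primary cluster to the secondary cluster.
kubectl --context primary -n pinniped-supervisor get secret <signing-key-secret> -o yaml \
  | kubectl --context secondary -n pinniped-supervisor apply -f -
```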

> Clarify if centralized login is possible across data centers (i.e., single login for eu-west and us-central)

I'm not sure if I understood this point. The Supervisor provides centralized login regardless of where the workload clusters reside, but maybe that's not what you were asking here.

Please feel free to ask follow-up questions.
