Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling of service accounts for BuildRuns #662

Closed
qu1queee opened this issue Mar 12, 2021 · 25 comments
Closed

Handling of service accounts for BuildRuns #662

qu1queee opened this issue Mar 12, 2021 · 25 comments
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments

@qu1queee
Copy link
Contributor

qu1queee commented Mar 12, 2021

Idea:

When dealing with BuildRuns we have different scenarios on the interactions with service accounts. I´m summarizing the two most prominent ones, where we have room for improvement.

[1] We rely on default service accounts in the cluster

When a user does not define an specific service account to use, neither specifies that the autogenerated feature, we will use the service account default in vanilla k8s, and pipeline(AFAIK) for openshift clusters.

  • Do we want to ensure a BuildRun is marked as Failed if neither the default/pipeline service accounts are found?
  • Do we assume this service accounts will always be there in the namespace?

[2] We autogenerate a service account for users when the generate field is used.

When failing a BuildRun before a TaskRun gets created, if we manage to create the service account, we will keep the service account together with a Failed Buildrun.

Some questions on the above:

  • Do we want to cleanup service accounts as soon as a BuildRun is marked as Failed?. Note: Keep in mind that we use ownerreferences, so as soon as a BuildRun is deleted, the service account will also be deleted.
  • Do we want to keep autogenerated service accounts after a BuildRun is failed?. Might be useful for debugging for users.
@qu1queee qu1queee added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Mar 12, 2021
@imjasonh
Copy link
Contributor

imjasonh commented Mar 12, 2021

Do we want to cleanup service accounts as soon as a BuildRun is marked as Failed?. Note: Keep in mind that we use ownerreferences, so as soon as a BuildRun is deleted, the service account will also be deleted.

I think so, yes. Keeping SAs around until the BuildRun could lead to scaling issues, and if those SAs still have sensitive creds attached, they could be reused to perform malicious actions.

Do we want to keep autogenerated service accounts after a BuildRun is failed?. Might be useful for debugging for users.

I think not, for reasons listed above. SAs are sensitive resources, and shouldn't exist if we don't expect to use them again. If the user wants to have their SA stick around, they can create an SA and reuse it, and more thoroughly audit its use.

TBH I'm still a little fuzzy on the use case for single-use ephemeral SAs at all, and I think it adds quite a bit of conceptual load to users/operators to support it. I'd like to understand what problem they solve today, and if there's a better way (such as attaching secrets to BuildRuns and TaskRuns directly?) that would let us get out of the business of creating/destroying SAs.

At the very least, some docs about when and why you might want to use generate:true would be useful.

@imjasonh
Copy link
Contributor

Did some spelunking and found #105 which added .serviceAccount.generate and #142 which documented it, but even with those I don't fully understand the motivating use case.

@zhangtbj is there another doc or issue I'm missing?

@zhangtbj
Copy link
Contributor

First, about the two questions:

Do we want to ensure a BuildRun is marked as Failed if neither the default/pipeline service accounts are found? Do we assume this service accounts will always be there in the namespace?

I think yes, because generate field is not used, it means the end user WANTS to point and use a specific SA, if it cannot be found or not in the namespace, it should be fail.

Did some spelunking and found #105 which added .serviceAccount.generate and #142 which documented it, but even with those I don't fully understand the motivating use case.

Hi @imjasonh , First of all, at the beginning, I remember we did some investigation, and found we cannot use the Secret for the Tekton TaskRun directly, we can ONLY use ServiceAccount (https://github.com/tektoncd/pipeline/blob/main/pkg/apis/pipeline/v1beta1/taskrun_types.go#L50), so we have to append all secrets into a SA then pass the SA to the generated TaskRun in the Shipwright: https://github.com/shipwright-io/build/blob/master/pkg/reconciler/buildrun/resources/taskrun.go#L216

If we HAVE TO use the SA, then we have some problems:

  • If we use the shared default or pipeline SA, it is very bad in the multiple tenant env, because all secrets are stored in one SA, then someone can use this default SA to pull/push from other end user registry.
  • It is hard to ask end user to create the SA manually before the build

So we provide a new .serviceAccount.generate to help end user to generate a new SA and append all secrets to it then pass it to the generated taskrun.
But in the work flow before, there is no any SA deletion logic, the end user can delete build or buildrun, but the SA is still kept there, so that is why we delete the SA after the buildrun is done.

@qu1queee
Copy link
Contributor Author

The only open question I have here is around an enhancement in the API via https://github.com/shipwright-io/build/pull/608/files (which didnt happen). We wanted to enhance the BuildRun Status field, with a .Status.ServiceAccount. So if we delete an autogenerated service account on failure, ideally that .Status.ServiceAccount might display a resource that does not longer exists. So I´m wondering on whats the policy for populating such .Status.ServiceAccountor if this have an effect on deleting the service account on failure. I do understand the rational on the security aspect.

@imjasonh
Copy link
Contributor

I think generating SAs on-demand and deleting them is an anti-pattern in general, and we should think about how we can stage the deprecation of this feature.

People configuring builds should be aware that there's a SA with Secrets that will run that Build, and they should know who that is and be able to configure RBAC correctly for that SA. By having ephemeral, temporary SAs, we might make the API a bit easier to use, but we make it a lot harder to secure and audit. I don't think the UX improvement is worth it TBH.

Enabling ephemeral SAs requires the Shipwright controller to have the power to create and destroy any SA in any namespace. This is pretty much god-level power, which Shipwright doesn't need to be able to run most builds. Operators can't opt out of ephemeral SA support today, except to remove that permission and see what breaks.

@sbose78
Copy link
Member

sbose78 commented Mar 17, 2021

pipeline(AFAIK) for openshift clusters.

I would actually remove the legacy ;) logic of looking for pipeline sa.

@qu1queee
Copy link
Contributor Author

@sbose78 oh, is good that you say that, did something changed in OpenShift clusters?

@adambkaplan
Copy link
Member

I believe the pipeline ServiceAccount is something that OpenShift's original Tekton distribution added to namespaces. @sbose78 @imjasonh is that something the upstream Tekton operator or pipeline controllers do today?

OpenShift builds has a similar challenge, and we addressed that by adding a dedicated service account - builder - which is added to all namespaces and is the default for builds. That account is the default SA for build pods, and gets the extra RBAC needed to execute a build.

@imjasonh
Copy link
Contributor

I believe the pipeline ServiceAccount is something that OpenShift's original Tekton distribution added to namespaces. @sbose78 @imjasonh is that something the upstream Tekton operator or pipeline controllers do today?

Nope, upstream Tekton doesn't create any SAs except the one it uses to run its own controller/webhook.

@gabemontero
Copy link
Member

by "OpenShift's original Tekton distriubtion" @adambkaplan was referring to the "downstream" version of Tekton that Vincent's team provides, i.e. "OpenShift Pipelines" @imjasonh

The OLM Operator for that "downstream" offering certainly does add the pipeline SA to each namespace when it is installed/added to an OpenShift cluster.

But I suspect what @sbose78 is getting at is that shipwright should be "vendor neutral" and not depend on something that RH's downstream version of Tekton provides. At most, a future RH "downstream" version of Shipwright might leverage the pipeline SA.

@gabemontero
Copy link
Member

And furthermore, as @adambkaplan alluded to, I am pretty sure one can say that the pipeline SA was modeled off the pattern that the old builder SA that OpenShift Builds established.

@qu1queee
Copy link
Contributor Author

qu1queee commented Mar 18, 2021

ok. Let me try to consolidate what is happening on this issue, so we might close it soon.

[1] We agree we do not longer need to maintain a pipeline service account and that we can safely remove this in our code. I can open an issue for this, so that [1] is fully reflected there.

[2] I think @imjasonh is challenging the whole service account auto-generation mechanism. I investigated this during this week. What we currently use is what Tekton calls creds-init or build-in authentication, where via a reference to a service account, tekton will mount the secrets at the pod level in the right path. The main concerns here are:

  • security
  • resources usage

It seems we can disable the Tekton build-in auth as seen in the docs in favor of we ensuring that the secrets are mounted at the pod level, as seen in the following examples: Via env vars, Via volumes, Via volumes + params.

If [2] sounds reasonable. We should move that spike work in a different issue. Then we can safely close this ongoing issue. Opinions?

@imjasonh
Copy link
Contributor

(2) sounds reasonable, and thanks for writing this up.

Please let me know if you have any questions while investigating, I'd be happy to help.

@gabemontero
Copy link
Member

+1 from me as well thanks

@SaschaSchwarze0
Copy link
Member

In the context of getting rid of the service account (which would in general be a nice idea), I just clicked through the Tekton samples from Jason. Based on them, I can understand how we would mount a single container registry secret to a container. Two scenarios I do not see covered:

  • If there are multiple secrets needed, for example to pull the builder image of the s2i build strategy and another one to push the output image; then if there is no creds-init logic from Tekton anymore, which code merges the two secrets into one single dockerconfigjson file?
  • We can only mount secrets to those containers that we define in the build strategy or create in the taskrun generation code. But how would one then assign the SSH secret to clone the git sources in the container that Tekton adds?

How are they resolved?

@imjasonh
Copy link
Contributor

If there are multiple secrets needed, for example to pull the builder image of the s2i build strategy and another one to push the output image; then if there is no creds-init logic from Tekton anymore, which code merges the two secrets into one single dockerconfigjson file?

Good question. 🤔

If there are multiple dockerconfig secrets available, we could merge them ourselves, perhaps using jq, in a step we prepend to the TaskRun.

We can only mount secrets to those containers that we define in the build strategy or create in the taskrun generation code. But how would one then assign the SSH secret to clone the git sources in the container that Tekton adds?

Another good one, you're on a roll. 😄

Instead of using the git PipelineResource, we could prepend our own step that calls git clone with the secret mounted.

Basically, creds-init and PipelineResources are both features of Tekton that are unlikely to exist past v1beta1, while steps and Volumes are locked in forever, so we can save ourselves a migration later by only relying on those more basic reliable features.

@SaschaSchwarze0
Copy link
Member

SaschaSchwarze0 commented Mar 18, 2021

If there are multiple secrets needed, for example to pull the builder image of the s2i build strategy and another one to push the output image; then if there is no creds-init logic from Tekton anymore, which code merges the two secrets into one single dockerconfigjson file?

Good question. 🤔

If there are multiple dockerconfig secrets available, we could merge them ourselves, perhaps using jq, in a step we prepend to the TaskRun.

We can only mount secrets to those containers that we define in the build strategy or create in the taskrun generation code. But how would one then assign the SSH secret to clone the git sources in the container that Tekton adds?

Another good one, you're on a roll. 😄

Instead of using the git PipelineResource, we could prepend our own step that calls git clone with the secret mounted.

Basically, creds-init and PipelineResources are both features of Tekton that are unlikely to exist past v1beta1, while steps and Volumes are locked in forever, so we can save ourselves a migration later by only relying on those more basic reliable features.

Right, that would all work. But then we would remove the usage of a couple of nice Tekton features (credential merging, taking care of running git clone, taking care of the credentials for the git clone) and re-implement them on our own. We're getting closer to kick out Tekton completely then. ;-)

I think a few weeks ago we had a chat and I had not yet time to follow up on this. But the idea was basically to keep all this in place but to get rid of the service account by enhancing the TaskRunSpec to allow to specify the Secrets there directly instead of indirectly through a service account. I personally still prefer this approach.

@imjasonh
Copy link
Contributor

Right, that would all work. But then we would would remove the usage of a couple of Tekton features (credential merging, taking care of running git clone, taking care of the credentials for the git clone) and re-implement them on our own.

My point is most of those features are not going to make it to Tekton v1 anyway, so we shouldn't rely on them long-term. If being able to take advantage of these features is one of the main reasons we generate SAs, which brings in a load of negative effects, then we should get out of that business.

We're getting closer to kick out Tekton completely then. ;-)

That would be fine too! I do think that Tekton still provides a useful TaskRun interface, that are much easier to use than Pods, but perhaps I'm biased. If over time Shipwright decides Tekton isn't providing value enough to surpass its costs, that's a reasonable move.

I think a few weeks ago we had a chat and I had not yet time to follow up on this. But the idea was basically to keep all this in place but to get rid of the service account by enhancing the TaskRunSpec to allow to specify the Secrets there directly instead of indirectly through a service account. I personally still prefer this approach.

If the TaskRun specifies Secret-backed Volumes, it must also specify a ServiceAccount that has permission to read those Secrets (or default must have that permission), because the Pod it creates will need that, otherwise you'd leak those Secrets. This example works because the default SA has permission to read that Secret.

@SaschaSchwarze0
Copy link
Member

My point is most of those features are not going to make it to Tekton v1 anyway, so we shouldn't rely on them long-term. If being able to take advantage of these features is one of the main reasons we generate SAs, which brings in a load of negative effects, then we should get out of that business.

The statement of removing Tekton was ironic. It still brings the logic to run non-init containers sequentially, for example.

@imjasonh
Copy link
Contributor

The statement of removing Tekton was ironic. It still brings the logic to run non-init containers sequentially, for example.

Phew, okay good. 😅 I don't want to reimplement those parts of Tekton ever again ...unless it's to push them up into k8s itself. 🤔

@qu1queee
Copy link
Contributor Author

I think the details on the implications of getting rid of the sa´s should be addressed in #679 . I´m gonna close this issue soonish.

@SaschaSchwarze0
Copy link
Member

I think the details on the implications of getting rid of the sa´s should be addressed in #679 . I´m gonna close this issue soonish.

Yeah, I think if Tekton is gonna remove the functionality we use today, then we have no choice than re-implementing some of the things and while doing that getting rid of the service account. I guess we will need to do a staged approach to cover all aspects as discussed above.

@imjasonh
Copy link
Contributor

@qu1queee I'm not sure why this was closed, I think there are a few somewhat large issues still unresolved, that might be able to be split into separate dependent issues:

  1. Don't rely on Tekton's creds-init, add our own code to mount creds.
  2. Don't rely on the git PipelineResource, add our own step to clone source, where we have more control.
  3. Remove support for serviceAccount.generate and replace the entire serviceAccount field with serviceAccountName (for consistency with Tekton, K8s, Knative APIs)

@qu1queee
Copy link
Contributor Author

Closed because of the comment in #662 (comment) , where I understood we where ok to move the particular discussion on what means to stop relying on sa´s via a separate issue.

We can split, yes.

@qu1queee
Copy link
Contributor Author

qu1queee commented Mar 19, 2021

@imjasonh so there is valuable information here, but my assumption was that we should move it all to #679 . Whats your suggestion? As long as we have a proper documentation on what we think we need to address, the rest is fine for me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.
Projects
None yet
Development

No branches or pull requests

7 participants