Generate a temp certificate for OCP4 Trusted CA remediation #12226

Merged: 1 commit, Jul 31, 2024

Conversation

rhmdnd
Collaborator

@rhmdnd rhmdnd commented Jul 26, 2024

Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

=== RUN   TestE2e/Apply_manual_remediations
 <snip>
 helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
 helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
 helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

#12120
#12184

Because we updated the remediation to use `oc adm wait-for-stable-cluster`,
we're effectively checking all cluster operators to ensure they're healthy.
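
To make "checking all cluster operators" concrete, here is a minimal Python sketch of the kind of health check involved. It is not the implementation of `oc adm wait-for-stable-cluster`; it only assumes the JSON shape returned by `oc get clusteroperators -o json` (a list of items, each with `status.conditions`). The function name `unstable_operators` and the sample data are hypothetical.

```python
from typing import Any, Dict, List

# Condition values an operator is assumed to need before it counts as healthy.
EXPECTED = {"Available": "True", "Progressing": "False", "Degraded": "False"}


def unstable_operators(cluster_operators: Dict[str, Any]) -> List[str]:
    """Return the names of operators whose conditions differ from EXPECTED."""
    unstable = []
    for item in cluster_operators.get("items", []):
        conditions = {
            c["type"]: c["status"]
            for c in item.get("status", {}).get("conditions", [])
        }
        # A missing condition is treated as unstable: health cannot be confirmed.
        if any(conditions.get(kind) != want for kind, want in EXPECTED.items()):
            unstable.append(item["metadata"]["name"])
    return unstable


# Hypothetical sample: one healthy operator and one degraded operator.
sample = {
    "items": [
        {
            "metadata": {"name": "authentication"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Progressing", "status": "False"},
                {"type": "Degraded", "status": "False"},
            ]},
        },
        {
            "metadata": {"name": "network"},
            "status": {"conditions": [
                {"type": "Available", "status": "True"},
                {"type": "Progressing", "status": "False"},
                {"type": "Degraded", "status": "True"},
            ]},
        },
    ]
}
print(unstable_operators(sample))  # ['network']
```

The point of the sketch is that a single unhealthy operator, even one unrelated to the remediation under test, keeps the whole check from succeeding.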

This started causing timeouts because a separate, unrelated remediation
was also getting applied in our testing that updated the default CA, but
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The oc adm wait-for-stable-cluster
command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA
remediation to generate a certificate for testing purposes and create a
ConfigMap called trusted-ca-bundle before updating the trusted CA.

@rhmdnd rhmdnd requested review from Vincent056 and yuumasato July 26, 2024 16:45
@rhmdnd
Collaborator Author

rhmdnd commented Jul 26, 2024

/test


openshift-ci bot commented Jul 26, 2024

@rhmdnd: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test 4.12-e2e-aws-ocp4-cis
  • /test 4.12-e2e-aws-ocp4-cis-node
  • /test 4.12-e2e-aws-ocp4-e8
  • /test 4.12-e2e-aws-ocp4-high
  • /test 4.12-e2e-aws-ocp4-high-node
  • /test 4.12-e2e-aws-ocp4-moderate
  • /test 4.12-e2e-aws-ocp4-moderate-node
  • /test 4.12-e2e-aws-ocp4-pci-dss
  • /test 4.12-e2e-aws-ocp4-pci-dss-node
  • /test 4.12-e2e-aws-ocp4-stig
  • /test 4.12-e2e-aws-ocp4-stig-node
  • /test 4.12-e2e-aws-rhcos4-e8
  • /test 4.12-e2e-aws-rhcos4-high
  • /test 4.12-e2e-aws-rhcos4-moderate
  • /test 4.12-e2e-aws-rhcos4-stig
  • /test 4.12-images
  • /test 4.13-e2e-aws-ocp4-bsi
  • /test 4.13-e2e-aws-ocp4-bsi-node
  • /test 4.13-e2e-aws-ocp4-cis
  • /test 4.13-e2e-aws-ocp4-cis-node
  • /test 4.13-e2e-aws-ocp4-e8
  • /test 4.13-e2e-aws-ocp4-high
  • /test 4.13-e2e-aws-ocp4-high-node
  • /test 4.13-e2e-aws-ocp4-moderate
  • /test 4.13-e2e-aws-ocp4-moderate-node
  • /test 4.13-e2e-aws-ocp4-pci-dss
  • /test 4.13-e2e-aws-ocp4-pci-dss-node
  • /test 4.13-e2e-aws-ocp4-stig
  • /test 4.13-e2e-aws-ocp4-stig-node
  • /test 4.13-e2e-aws-rhcos4-bsi
  • /test 4.13-e2e-aws-rhcos4-e8
  • /test 4.13-e2e-aws-rhcos4-high
  • /test 4.13-e2e-aws-rhcos4-moderate
  • /test 4.13-e2e-aws-rhcos4-stig
  • /test 4.13-images
  • /test 4.14-e2e-aws-ocp4-bsi
  • /test 4.14-e2e-aws-ocp4-bsi-node
  • /test 4.14-e2e-aws-rhcos4-bsi
  • /test 4.14-images
  • /test 4.15-e2e-aws-ocp4-bsi
  • /test 4.15-e2e-aws-ocp4-bsi-node
  • /test 4.15-e2e-aws-ocp4-cis
  • /test 4.15-e2e-aws-ocp4-cis-node
  • /test 4.15-e2e-aws-ocp4-e8
  • /test 4.15-e2e-aws-ocp4-high
  • /test 4.15-e2e-aws-ocp4-high-node
  • /test 4.15-e2e-aws-ocp4-moderate
  • /test 4.15-e2e-aws-ocp4-moderate-node
  • /test 4.15-e2e-aws-ocp4-pci-dss
  • /test 4.15-e2e-aws-ocp4-pci-dss-node
  • /test 4.15-e2e-aws-ocp4-stig
  • /test 4.15-e2e-aws-ocp4-stig-node
  • /test 4.15-e2e-aws-rhcos4-bsi
  • /test 4.15-e2e-aws-rhcos4-e8
  • /test 4.15-e2e-aws-rhcos4-high
  • /test 4.15-e2e-aws-rhcos4-moderate
  • /test 4.15-e2e-aws-rhcos4-stig
  • /test 4.15-e2e-rosa-ocp4-cis-node
  • /test 4.15-e2e-rosa-ocp4-pci-dss-node
  • /test 4.15-images
  • /test 4.16-e2e-aws-ocp4-bsi
  • /test 4.16-e2e-aws-ocp4-bsi-node
  • /test 4.16-e2e-aws-ocp4-cis
  • /test 4.16-e2e-aws-ocp4-cis-node
  • /test 4.16-e2e-aws-ocp4-e8
  • /test 4.16-e2e-aws-ocp4-high
  • /test 4.16-e2e-aws-ocp4-high-node
  • /test 4.16-e2e-aws-ocp4-moderate
  • /test 4.16-e2e-aws-ocp4-moderate-node
  • /test 4.16-e2e-aws-ocp4-pci-dss
  • /test 4.16-e2e-aws-ocp4-pci-dss-node
  • /test 4.16-e2e-aws-ocp4-stig
  • /test 4.16-e2e-aws-ocp4-stig-node
  • /test 4.16-e2e-aws-rhcos4-bsi
  • /test 4.16-e2e-aws-rhcos4-e8
  • /test 4.16-e2e-aws-rhcos4-high
  • /test 4.16-e2e-aws-rhcos4-moderate
  • /test 4.16-e2e-aws-rhcos4-stig
  • /test 4.16-images
  • /test e2e-aws-ocp4-bsi
  • /test e2e-aws-ocp4-bsi-node
  • /test e2e-aws-ocp4-cis
  • /test e2e-aws-ocp4-cis-node
  • /test e2e-aws-ocp4-e8
  • /test e2e-aws-ocp4-high
  • /test e2e-aws-ocp4-high-node
  • /test e2e-aws-ocp4-moderate
  • /test e2e-aws-ocp4-moderate-node
  • /test e2e-aws-ocp4-pci-dss
  • /test e2e-aws-ocp4-pci-dss-node
  • /test e2e-aws-ocp4-stig
  • /test e2e-aws-ocp4-stig-node
  • /test e2e-aws-rhcos4-bsi
  • /test e2e-aws-rhcos4-e8
  • /test e2e-aws-rhcos4-high
  • /test e2e-aws-rhcos4-moderate
  • /test e2e-aws-rhcos4-stig
  • /test images

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-ComplianceAsCode-content-master-4.12-images
  • pull-ci-ComplianceAsCode-content-master-4.13-images
  • pull-ci-ComplianceAsCode-content-master-4.14-images
  • pull-ci-ComplianceAsCode-content-master-4.15-images
  • pull-ci-ComplianceAsCode-content-master-4.16-images
  • pull-ci-ComplianceAsCode-content-master-images

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rhmdnd
Collaborator Author

rhmdnd commented Jul 26, 2024

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate


github-actions bot commented Jul 26, 2024

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:12226
This image was built from commit: 6696642

Click here to see how to deploy it

If you already have Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:12226

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:12226 make deploy-local

@rhmdnd rhmdnd added this to the 0.1.74 milestone Jul 26, 2024
@rhmdnd rhmdnd force-pushed the fix-ca-manual-remediation branch from e7a5e89 to d9290cf Compare July 26, 2024 21:50
@rhmdnd
Collaborator Author

rhmdnd commented Jul 26, 2024

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate

@rhmdnd
Collaborator Author

rhmdnd commented Jul 26, 2024

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate
/test 4.12-images
/test 4.13-images
/test 4.14-images
/test 4.15-images
/test 4.16-images

@rhmdnd
Collaborator Author

rhmdnd commented Jul 26, 2024

/test images

@jan-cerny jan-cerny added the OpenShift OpenShift product related. label Jul 29, 2024
@yuumasato
Member

/test e2e-aws-ocp4-pci-dss
/test e2e-aws-ocp4-pci-dss-node

@yuumasato
Member

I see that the profiles that don't include default-ingress-ca-replaced are not failing, even without this PR.
I think the changes make sense, but the timeout is still happening, 🫠

@yuumasato
Member

/test 4.12-e2e-aws-ocp4-pci-dss
/test 4.14-e2e-aws-ocp4-pci-dss

@Mab879 Mab879 modified the milestones: 0.1.74, 0.1.75 Jul 29, 2024
@yuumasato
Member

/test 4.15-e2e-aws-ocp4-pci-dss
/test 4.16-e2e-aws-ocp4-pci-dss

@yuumasato
Member

yuumasato commented Jul 30, 2024

@rhmdnd The ocp4-pci-dss profile doesn't select the rule default-ingress-ca-replaced, yet this week's 4.12 ocp4-pci-dss run didn't time out while the CI run on this PR did.
To add more weird flakiness, the 4.16 ocp4-pci-dss didn't time out while the 4.15 ocp4-pci-dss timed out.

I'm running 4.15 and 4.16 ocp4-pci-dss test runs on this PR to gather more data.

@yuumasato
Member

To add more weird flakiness, the 4.16 ocp4-pci-dss didn't time out while the 4.15 ocp4-pci-dss timed out.

And more quirkiness:
4.16 ocp4-pci-dss on this PR timed out, while 4.15 ocp4-pci-dss didn't time out.

@rhmdnd
Collaborator Author

rhmdnd commented Jul 30, 2024

ComplianceAsCode/ocp4e2e#48 landed so let's rekick some of the jobs and see if we can get some more information from the CI clusters.

@rhmdnd
Collaborator Author

rhmdnd commented Jul 30, 2024

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate

@rhmdnd rhmdnd force-pushed the fix-ca-manual-remediation branch from d9290cf to 3b73daa Compare July 30, 2024 22:11
@rhmdnd
Collaborator Author

rhmdnd commented Jul 30, 2024

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate

@rhmdnd
Collaborator Author

rhmdnd commented Jul 30, 2024

I had to rewrite a good portion of the remediation so that it 1.) pointed to an actual config map and 2.) reused an existing certificate so the change propagated through the various components.

I was able to get this working on a local cluster, so hopefully CI works, too.

@rhmdnd
Collaborator Author

rhmdnd commented Jul 31, 2024

Still seeing the timeout issue. Checking to see if it's due to a lower context timeout in ComplianceAsCode/ocp4e2e#50.

@yuumasato
Member

/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-pci-dss
/test 4.16-e2e-aws-ocp4-pci-dss

@rhmdnd
Collaborator Author

rhmdnd commented Jul 31, 2024

The 4.13 and 4.12 tests failed because the networking operator was in a degraded state, timing out after an hour.

The 4.16 test failed because the timeout was reached even though the cluster operators eventually stabilized.
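
The 4.16 failure mode (a deadline expiring before a condition that would eventually hold) can be illustrated with a small polling loop. This is a sketch only, not the ocp4e2e harness code; `wait_until`, `FakeClock`, and the numbers are hypothetical, and the clock is injected so the behaviour can be shown without real waiting.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float,
    interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll predicate until it returns True or `timeout` seconds have elapsed."""
    deadline = clock() + timeout
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False  # Deadline hit; the condition may still become true later.
        sleep(interval)


class FakeClock:
    """Simulated clock so the example runs instantly."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


# Pretend the cluster stabilizes 90 simulated seconds after the change.
short = FakeClock()
print(wait_until(lambda: short.now >= 90, 60, 10, short.time, short.sleep))  # False

longer = FakeClock()
print(wait_until(lambda: longer.now >= 90, 120, 10, longer.time, longer.sleep))  # True
```

With the shorter deadline the wait reports failure even though the same cluster would have passed 30 simulated seconds later, which matches the shape of the 4.16 result described above.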

@rhmdnd
Collaborator Author

rhmdnd commented Jul 31, 2024

I was able to dig this out of the networking operator logs:

I0730 23:51:41.008637       1 log.go:198] Operconfig Controller complete
I0730 23:51:58.381293       1 log.go:198] Reconciling additional trust bundle configmap 'openshift-config/trusted-ca-bundle'
I0730 23:51:58.438717       1 log.go:198] httpProxy, httpsProxy and noProxy not defined for proxy 'trusted-ca-bundle'; validation will be skipped
I0730 23:51:58.478438       1 log.go:198] Reconciling additional trust bundle configmap 'openshift-config/trusted-ca-bundle' complete
I0730 23:51:58.527582       1 log.go:198] Failed to sync additional trust bundle configmap openshift-config-managed/trusted-ca-bundle: failed to update trusted CA bundle configmap 'openshift-config-managed/trusted-ca-bundle': ConfigMap "trusted-ca-bundle" is invalid: []: Too long: must have at most 1048576 bytes
I0730 23:52:20.478712       1 log.go:198] Reconciling Network.operator.openshift.io cluster

The actual certificate is huge, which might be blowing out the size limits of the config map depending on how the syncing is handled. Locally, I needed to use oc create instead of oc apply for that reason.
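
A quick way to reason about the size error is to measure the bundle before creating the ConfigMap. The sketch below assumes the 1048576-byte limit quoted in the log applies to the combined bytes of keys and values in the ConfigMap data; that is an approximation of the server-side validation, and the helper names are hypothetical.

```python
from typing import Dict

# Limit quoted in the networking operator log above.
CONFIGMAP_MAX_BYTES = 1048576


def configmap_payload_size(data: Dict[str, str]) -> int:
    """Approximate the validated size: UTF-8 bytes of every key plus every value."""
    return sum(
        len(key.encode("utf-8")) + len(value.encode("utf-8"))
        for key, value in data.items()
    )


def fits_in_configmap(data: Dict[str, str]) -> bool:
    """Return True when the payload is within the assumed limit."""
    return configmap_payload_size(data) <= CONFIGMAP_MAX_BYTES


# Hypothetical oversized bundle: just over 1 MiB of placeholder text.
oversized = {"ca-bundle.crt": "x" * CONFIGMAP_MAX_BYTES}
print(fits_in_configmap(oversized))  # False
```

Client-side `apply` additionally stores a copy of the manifest in the `kubectl.kubernetes.io/last-applied-configuration` annotation, which may be why `oc create` behaved differently locally; that is an inference, not something the logs confirm.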

Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

   === RUN   TestE2e/Apply_manual_remediations
    <snip>
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
    helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
    helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

  ComplianceAsCode#12120
  ComplianceAsCode#12184

Because we updated the remediation to use `oc adm
wait-for-stable-cluster`, we're effectively checking all cluster
operators to ensure they're healthy.

This started causing timeouts because a separate, unrelated remediation
was also getting applied in our testing that updated the default CA, but
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The `oc adm wait-for-stable-cluster`
command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA
remediation to create a ConfigMap for the expected certificate bundle.
@rhmdnd rhmdnd force-pushed the fix-ca-manual-remediation branch from 3b73daa to 6696642 Compare July 31, 2024 13:38
  ca-bundle.crt: $BUNDLE
metadata:
  name: trusted-ca-bundle
  namespace: openshift-config-managed
Collaborator Author

Experimenting with this to see if it works around the following issue:

I0730 23:51:58.527582       1 log.go:198] Failed to sync additional trust bundle configmap openshift-config-managed/trusted-ca-bundle: failed to update trusted CA bundle configmap 'openshift-config-managed/trusted-ca-bundle': ConfigMap "trusted-ca-bundle" is invalid: []: Too long: must have at most 1048576 bytes

By creating it manually instead of relying on the sync logic in the networking operator.
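
As an illustration of "creating it manually", the following sketch builds a ConfigMap manifest with the name, namespace, and `ca-bundle.crt` key shown in the diff fragment above. It is not the remediation script from this PR; the helper names and the placeholder bundle are hypothetical, and JSON is used only because it needs no third-party library.

```python
import json
from typing import Any, Dict


def trusted_ca_configmap(bundle_pem: str) -> Dict[str, Any]:
    """Build a ConfigMap manifest holding the CA bundle under ca-bundle.crt."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": "trusted-ca-bundle",
            "namespace": "openshift-config-managed",
        },
        "data": {"ca-bundle.crt": bundle_pem},
    }


def render_manifest(bundle_pem: str) -> str:
    """Serialize the manifest as JSON text."""
    return json.dumps(trusted_ca_configmap(bundle_pem), indent=2)


# Placeholder content; a real bundle would contain PEM-encoded certificates.
print(render_manifest("PLACEHOLDER-BUNDLE"))
```

The rendered text could then be piped to `oc create -f -`, which reads a manifest from standard input; whether that matches the exact commands in the merged remediation is not shown in this conversation.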


openshift-ci bot commented Jul 31, 2024

@rhmdnd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws-ocp4-pci-dss | d9290cf | link | true | /test e2e-aws-ocp4-pci-dss |
| ci/prow/e2e-aws-ocp4-pci-dss-node | d9290cf | link | true | /test e2e-aws-ocp4-pci-dss-node |
| ci/prow/4.12-e2e-aws-ocp4-pci-dss | d9290cf | link | true | /test 4.12-e2e-aws-ocp4-pci-dss |
| ci/prow/e2e-aws-ocp4-moderate | 3b73daa | link | true | /test e2e-aws-ocp4-moderate |
| ci/prow/4.13-e2e-aws-ocp4-moderate | 3b73daa | link | true | /test 4.13-e2e-aws-ocp4-moderate |
| ci/prow/4.12-e2e-aws-ocp4-moderate | 3b73daa | link | true | /test 4.12-e2e-aws-ocp4-moderate |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rhmdnd
Collaborator Author

rhmdnd commented Jul 31, 2024

/retest-required

@rhmdnd
Collaborator Author

rhmdnd commented Jul 31, 2024

/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-pci-dss
/test 4.16-e2e-aws-ocp4-pci-dss


codeclimate bot commented Jul 31, 2024

Code Climate has analyzed commit 6696642 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 59.4% (0.0% change).

View more on Code Climate.

@yuumasato yuumasato self-assigned this Jul 31, 2024
Member

@yuumasato yuumasato left a comment

@rhmdnd Thank you for this fix, it was quite a journey to figure out the problem and address it.

@yuumasato yuumasato merged commit 4707824 into ComplianceAsCode:master Jul 31, 2024
100 of 101 checks passed
Labels
OpenShift OpenShift product related.

5 participants