Generate a temp certificate for OCP4 Trusted CA remediation #12226

rhmdnd · 2024-07-26T16:45:00Z

Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

=== RUN   TestE2e/Apply_manual_remediations
 <snip>
 helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
 helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
 helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

#12120
#12184

Because we updated the remediation to use oc adm wait-for-stable-cluster, we're effectively checking all cluster
operators to ensure they're healthy.

This started causing timeouts because a separate, unrelated remediation
was also getting applied in our testing that updated the default CA, but
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The oc adm wait-for-stable-cluster
command was hanging on a legitimate issue in a separate remediation.
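
As a rough illustration of why a single unhealthy operator is enough to stall the wait, the sketch below filters a table shaped like `oc get clusteroperators` output and prints every operator that is not Available=True, Progressing=False, Degraded=False. The `unstable_operators` helper is hypothetical and is not how `oc adm wait-for-stable-cluster` is implemented; it also assumes every row has a non-empty VERSION column.

```shell
#!/bin/sh
# Hypothetical helper: read an `oc get clusteroperators`-style table on stdin
# (columns: NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE) and
# print the name of each operator that is not in the healthy steady state.
# Assumption: VERSION is never empty, so the status columns are fields 3-5.
unstable_operators() {
    awk 'NR > 1 && !($3 == "True" && $4 == "False" && $5 == "False") { print $1 }'
}
```

Against a live cluster this could be fed with `oc get clusteroperators | unstable_operators`; an empty result would mean no operator is blocking.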

This commit attempts to fix that issue by updating the trusted CA
remediation to generate a certificate for testing purposes and create
a ConfigMap called trusted-ca-bundle before updating the trusted CA.
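
A minimal sketch of the ConfigMap step, assuming the test certificate already exists as a PEM file. `render_ca_configmap` is a hypothetical helper and not the code in this PR; the ConfigMap name and namespace are passed in as arguments rather than assumed.

```shell
#!/bin/sh
# Hypothetical helper: print a ConfigMap manifest that embeds a PEM bundle
# under the ca-bundle.crt key.
# Usage: render_ca_configmap <pem-file> <name> <namespace>
render_ca_configmap() {
    pem_file=$1
    name=$2
    namespace=$3
    if [ ! -r "$pem_file" ]; then
        echo "cannot read PEM file: $pem_file" >&2
        return 1
    fi
    cat <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: $name
  namespace: $namespace
data:
  ca-bundle.crt: |
EOF
    # Indent every PEM line so it sits inside the YAML block scalar.
    sed 's/^/    /' "$pem_file"
}
```

The rendered manifest could then be piped to `oc create -f -`. Printing the manifest instead of applying it directly makes it easy to inspect before it touches the cluster.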

rhmdnd · 2024-07-26T16:46:31Z

/test

openshift-ci · 2024-07-26T16:46:34Z

@rhmdnd: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test 4.12-e2e-aws-ocp4-cis
/test 4.12-e2e-aws-ocp4-cis-node
/test 4.12-e2e-aws-ocp4-e8
/test 4.12-e2e-aws-ocp4-high
/test 4.12-e2e-aws-ocp4-high-node
/test 4.12-e2e-aws-ocp4-moderate
/test 4.12-e2e-aws-ocp4-moderate-node
/test 4.12-e2e-aws-ocp4-pci-dss
/test 4.12-e2e-aws-ocp4-pci-dss-node
/test 4.12-e2e-aws-ocp4-stig
/test 4.12-e2e-aws-ocp4-stig-node
/test 4.12-e2e-aws-rhcos4-e8
/test 4.12-e2e-aws-rhcos4-high
/test 4.12-e2e-aws-rhcos4-moderate
/test 4.12-e2e-aws-rhcos4-stig
/test 4.12-images
/test 4.13-e2e-aws-ocp4-bsi
/test 4.13-e2e-aws-ocp4-bsi-node
/test 4.13-e2e-aws-ocp4-cis
/test 4.13-e2e-aws-ocp4-cis-node
/test 4.13-e2e-aws-ocp4-e8
/test 4.13-e2e-aws-ocp4-high
/test 4.13-e2e-aws-ocp4-high-node
/test 4.13-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate-node
/test 4.13-e2e-aws-ocp4-pci-dss
/test 4.13-e2e-aws-ocp4-pci-dss-node
/test 4.13-e2e-aws-ocp4-stig
/test 4.13-e2e-aws-ocp4-stig-node
/test 4.13-e2e-aws-rhcos4-bsi
/test 4.13-e2e-aws-rhcos4-e8
/test 4.13-e2e-aws-rhcos4-high
/test 4.13-e2e-aws-rhcos4-moderate
/test 4.13-e2e-aws-rhcos4-stig
/test 4.13-images
/test 4.14-e2e-aws-ocp4-bsi
/test 4.14-e2e-aws-ocp4-bsi-node
/test 4.14-e2e-aws-rhcos4-bsi
/test 4.14-images
/test 4.15-e2e-aws-ocp4-bsi
/test 4.15-e2e-aws-ocp4-bsi-node
/test 4.15-e2e-aws-ocp4-cis
/test 4.15-e2e-aws-ocp4-cis-node
/test 4.15-e2e-aws-ocp4-e8
/test 4.15-e2e-aws-ocp4-high
/test 4.15-e2e-aws-ocp4-high-node
/test 4.15-e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate-node
/test 4.15-e2e-aws-ocp4-pci-dss
/test 4.15-e2e-aws-ocp4-pci-dss-node
/test 4.15-e2e-aws-ocp4-stig
/test 4.15-e2e-aws-ocp4-stig-node
/test 4.15-e2e-aws-rhcos4-bsi
/test 4.15-e2e-aws-rhcos4-e8
/test 4.15-e2e-aws-rhcos4-high
/test 4.15-e2e-aws-rhcos4-moderate
/test 4.15-e2e-aws-rhcos4-stig
/test 4.15-e2e-rosa-ocp4-cis-node
/test 4.15-e2e-rosa-ocp4-pci-dss-node
/test 4.15-images
/test 4.16-e2e-aws-ocp4-bsi
/test 4.16-e2e-aws-ocp4-bsi-node
/test 4.16-e2e-aws-ocp4-cis
/test 4.16-e2e-aws-ocp4-cis-node
/test 4.16-e2e-aws-ocp4-e8
/test 4.16-e2e-aws-ocp4-high
/test 4.16-e2e-aws-ocp4-high-node
/test 4.16-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate-node
/test 4.16-e2e-aws-ocp4-pci-dss
/test 4.16-e2e-aws-ocp4-pci-dss-node
/test 4.16-e2e-aws-ocp4-stig
/test 4.16-e2e-aws-ocp4-stig-node
/test 4.16-e2e-aws-rhcos4-bsi
/test 4.16-e2e-aws-rhcos4-e8
/test 4.16-e2e-aws-rhcos4-high
/test 4.16-e2e-aws-rhcos4-moderate
/test 4.16-e2e-aws-rhcos4-stig
/test 4.16-images
/test e2e-aws-ocp4-bsi
/test e2e-aws-ocp4-bsi-node
/test e2e-aws-ocp4-cis
/test e2e-aws-ocp4-cis-node
/test e2e-aws-ocp4-e8
/test e2e-aws-ocp4-high
/test e2e-aws-ocp4-high-node
/test e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate-node
/test e2e-aws-ocp4-pci-dss
/test e2e-aws-ocp4-pci-dss-node
/test e2e-aws-ocp4-stig
/test e2e-aws-ocp4-stig-node
/test e2e-aws-rhcos4-bsi
/test e2e-aws-rhcos4-e8
/test e2e-aws-rhcos4-high
/test e2e-aws-rhcos4-moderate
/test e2e-aws-rhcos4-stig
/test images

Use /test all to run the following jobs that were automatically triggered:

pull-ci-ComplianceAsCode-content-master-4.12-images
pull-ci-ComplianceAsCode-content-master-4.13-images
pull-ci-ComplianceAsCode-content-master-4.14-images
pull-ci-ComplianceAsCode-content-master-4.15-images
pull-ci-ComplianceAsCode-content-master-4.16-images
pull-ci-ComplianceAsCode-content-master-images

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

github-actions · 2024-07-26T16:47:18Z

Start a new ephemeral environment with changes proposed in this pull request:

Fedora Environment

Oracle Linux 8 Environment

rhmdnd · 2024-07-26T16:47:50Z

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate

github-actions · 2024-07-26T17:04:57Z

🤖 A k8s content image for this PR is available at:
ghcr.io/complianceascode/k8scontent:12226
This image was built from commit: 6696642

Click here to see how to deploy it

If you already have Compliance Operator deployed:
utils/build_ds_container.py -i ghcr.io/complianceascode/k8scontent:12226

Otherwise deploy the content and operator together by checking out ComplianceAsCode/compliance-operator and:
CONTENT_IMAGE=ghcr.io/complianceascode/k8scontent:12226 make deploy-local

applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh

rhmdnd · 2024-07-26T22:05:42Z

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate

rhmdnd · 2024-07-26T23:20:00Z

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate
/test 4.12-images
/test 4.13-images
/test 4.14-images
/test 4.15-images
/test 4.16-images

rhmdnd · 2024-07-26T23:20:23Z

/test images

yuumasato · 2024-07-29T16:31:17Z

/test e2e-aws-ocp4-pci-dss
/test e2e-aws-ocp4-pci-dss-node

yuumasato · 2024-07-29T16:37:17Z

I see that the profiles that don't include default-ingress-ca-replaced are not failing, even without this PR.
I think the changes make sense, but the timeout is still happening, 🫠

yuumasato · 2024-07-29T16:46:58Z

/test 4.12-e2e-aws-ocp4-pci-dss
/test 4.14-e2e-aws-ocp4-pci-dss

yuumasato · 2024-07-30T09:51:10Z

/test 4.15-e2e-aws-ocp4-pci-dss
/test 4.16-e2e-aws-ocp4-pci-dss

yuumasato · 2024-07-30T10:04:06Z

@rhmdnd The ocp4-pci-dss profile doesn't select the rule default-ingress-ca-replaced, yet this week's 4.12 ocp4-pci-dss profile didn't time out while the CI run on this PR timed out.
To add more weird flakiness, the 4.16 ocp4-pci-dss didn't time out while the 4.15 ocp4-pci-dss timed out.

I'm running 4.15 and 4.16 ocp4-pci-dss test runs on this PR to gather more data.

yuumasato · 2024-07-30T12:38:38Z

To add more weird flakiness, the 4.16 ocp4-pci-dss didn't time out while the 4.15 ocp4-pci-dss timed out.

And more quirkiness:
4.16 ocp4-pci-dss on this PR timed out, while 4.15 ocp4-pci-dss didn't time out.

rhmdnd · 2024-07-30T16:11:07Z

ComplianceAsCode/ocp4e2e#48 landed so let's rekick some of the jobs and see if we can get some more information from the CI clusters.

rhmdnd · 2024-07-30T16:11:41Z

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate

rhmdnd · 2024-07-30T22:12:49Z

/test 4.12-e2e-aws-ocp4-moderate
/test 4.13-e2e-aws-ocp4-moderate
/test e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate

rhmdnd · 2024-07-30T22:14:37Z

I had to rewrite a good portion of the remediation so that it 1.) pointed to an actual config map and 2.) reused an existing certificate so the change propagated through the various components.

I was able to get this working on a local cluster, so hopefully CI works, too.

rhmdnd · 2024-07-31T01:08:18Z

Still seeing the timeout issue. Checking to see if it's due to a lower context timeout in ComplianceAsCode/ocp4e2e#50.

yuumasato · 2024-07-31T08:05:08Z

/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-pci-dss
/test 4.16-e2e-aws-ocp4-pci-dss

rhmdnd · 2024-07-31T12:37:53Z

The 4.13 and 4.12 tests failed because the networking operator was in a degraded state, timing out after an hour.

The 4.16 test failed because the timeout was reached even though the cluster operators eventually stabilized.

rhmdnd · 2024-07-31T13:14:31Z

I was able to dig this out of the networking operator logs:

I0730 23:51:41.008637       1 log.go:198] Operconfig Controller complete
I0730 23:51:58.381293       1 log.go:198] Reconciling additional trust bundle configmap 'openshift-config/trusted-ca-bundle'
I0730 23:51:58.438717       1 log.go:198] httpProxy, httpsProxy and noProxy not defined for proxy 'trusted-ca-bundle'; validation will be skipped
I0730 23:51:58.478438       1 log.go:198] Reconciling additional trust bundle configmap 'openshift-config/trusted-ca-bundle' complete
I0730 23:51:58.527582       1 log.go:198] Failed to sync additional trust bundle configmap openshift-config-managed/trusted-ca-bundle: failed to update trusted CA bundle configmap 'openshift-config-managed/trusted-ca-bundle': ConfigMap "trusted-ca-bundle" is invalid: []: Too long: must have at most 1048576 bytes
I0730 23:52:20.478712       1 log.go:198] Reconciling Network.operator.openshift.io cluster

The actual certificate is huge, which might be blowing out the size limits of the config map depending on how the syncing is handled. Locally, I needed to use oc create instead of oc apply for that reason.
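
A rough pre-check for that limit, as a sketch: it compares the size of a bundle file against the 1048576 bytes quoted in the log above. `bundle_fits_configmap` is a hypothetical helper. Passing the check is not a guarantee, because the API server measures more than the file contents alone.

```shell
#!/bin/sh
# Limit quoted in the networking operator log above.
CONFIGMAP_MAX_BYTES=1048576

# Hypothetical helper: succeed only if the file is within the limit.
# Returns 2 if the file cannot be read.
bundle_fits_configmap() {
    size=$(wc -c < "$1") || return 2
    # $(( )) also drops the padding some wc implementations print.
    [ "$((size))" -le "$CONFIGMAP_MAX_BYTES" ]
}
```

A remediation script could call this before creating the ConfigMap and fail early with a clear message. One possible reason `oc create` behaves differently from `oc apply` here, not confirmed in this thread, is that `apply` also records a copy of the object in the last-applied-configuration annotation.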

Lately, we've been experiencing issues with manual remediations timing
out during functional testing. This manifests in the following error:

=== RUN   TestE2e/Apply_manual_remediations
 <snip>
 helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh'
 helpers.go:1225: Running manual remediation '/tmp/content-3345141771/applications/openshift/general/file_integrity_notification_enabled/tests/ocp4/e2e-remediation.sh'
 helpers.go:1231: Command '/tmp/content-3345141771/applications/openshift/authentication/idp_is_configured/tests/ocp4/e2e-remediation.sh' timed out

In this particular case, it looks like the remediation to add an
Identity Provider to the cluster failed, but this is actually an
unintended side-effect of another change that updated the
idp_is_configured remediation to use a more robust technique for
determining if the cluster applied the remediation successfully:

ComplianceAsCode#12120
ComplianceAsCode#12184

Because we updated the remediation to use `oc adm wait-for-stable-cluster`,
we're effectively checking all cluster operators to ensure they're healthy.

This started causing timeouts because a separate, unrelated remediation
was also getting applied in our testing that updated the default CA, but
didn't include a ConfigMap that contained the CA bundle. As a result,
one of the operators didn't come up because it was looking for a
ConfigMap that didn't exist. The `oc adm wait-for-stable-cluster`
command was hanging on a legitimate issue in a separate remediation.

This commit attempts to fix that issue by updating the trusted CA
remediation to create a configmap for the expected certificate bundle.

rhmdnd · 2024-07-31T13:41:23Z

applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh

+  ca-bundle.crt: $BUNDLE
+metadata:
+  name: trusted-ca-bundle
+  namespace: openshift-config-managed


Experimenting with this to see if it works around the following issue:

I0730 23:51:58.527582 1 log.go:198] Failed to sync additional trust bundle configmap openshift-config-managed/trusted-ca-bundle: failed to update trusted CA bundle configmap 'openshift-config-managed/trusted-ca-bundle': ConfigMap "trusted-ca-bundle" is invalid: []: Too long: must have at most 1048576 bytes

The idea is to create the ConfigMap manually instead of relying on the sync logic in the networking operator.

openshift-ci · 2024-07-31T13:41:28Z

@rhmdnd: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ocp4-pci-dss	`d9290cf`	link	true	`/test e2e-aws-ocp4-pci-dss`
ci/prow/e2e-aws-ocp4-pci-dss-node	`d9290cf`	link	true	`/test e2e-aws-ocp4-pci-dss-node`
ci/prow/4.12-e2e-aws-ocp4-pci-dss	`d9290cf`	link	true	`/test 4.12-e2e-aws-ocp4-pci-dss`
ci/prow/e2e-aws-ocp4-moderate	`3b73daa`	link	true	`/test e2e-aws-ocp4-moderate`
ci/prow/4.13-e2e-aws-ocp4-moderate	`3b73daa`	link	true	`/test 4.13-e2e-aws-ocp4-moderate`
ci/prow/4.12-e2e-aws-ocp4-moderate	`3b73daa`	link	true	`/test 4.12-e2e-aws-ocp4-moderate`


rhmdnd · 2024-07-31T13:45:33Z

/retest-required

rhmdnd · 2024-07-31T13:45:57Z

/test 4.15-e2e-aws-ocp4-moderate
/test 4.16-e2e-aws-ocp4-moderate
/test 4.15-e2e-aws-ocp4-pci-dss
/test 4.16-e2e-aws-ocp4-pci-dss

codeclimate · 2024-07-31T14:24:42Z

Code Climate has analyzed commit 6696642 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (50% is the threshold).

This pull request will bring the total coverage in the repository to 59.4% (0.0% change).

View more on Code Climate.

yuumasato

@rhmdnd Thank you for this fix, it was quite a journey to figure out the problem and address it.

rhmdnd requested review from Vincent056 and yuumasato July 26, 2024 16:45

rhmdnd mentioned this pull request Jul 26, 2024

🧹 upgrade golang to 1.22 ComplianceAsCode/ocp4e2e#47

Open

rhmdnd added this to the 0.1.74 milestone Jul 26, 2024

Vincent056 reviewed Jul 26, 2024

applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh (outdated)

rhmdnd commented Jul 26, 2024

applications/openshift/networking/default_ingress_ca_replaced/tests/ocp4/e2e-remediation.sh (outdated)

rhmdnd force-pushed the fix-ca-manual-remediation branch from e7a5e89 to d9290cf Compare July 26, 2024 21:50

jan-cerny added the OpenShift label Jul 29, 2024

Mab879 modified the milestones: 0.1.74, 0.1.75 Jul 29, 2024

This was referenced Jul 30, 2024

Use OpenShift 4.16 for ComplianceAsCode/content CI openshift/release#49323

Merged

Add more details to remediation timeout ComplianceAsCode/ocp4e2e#48

Merged

rhmdnd force-pushed the fix-ca-manual-remediation branch from d9290cf to 3b73daa Compare July 30, 2024 22:11

rhmdnd force-pushed the fix-ca-manual-remediation branch from 3b73daa to 6696642 Compare July 31, 2024 13:38

rhmdnd commented Jul 31, 2024

yuumasato self-assigned this Jul 31, 2024

yuumasato approved these changes Jul 31, 2024

yuumasato merged commit 4707824 into ComplianceAsCode:master Jul 31, 2024
100 of 101 checks passed
