Alarm api_unauthorized for HeadBucket/Object from SSM agent #6141

Open
dsotirho-ucsc opened this issue Apr 8, 2024 · 24 comments
Labels
- [priority] Medium
bug [type] A defect preventing use of the system as specified
debt [type] A defect incurring continued engineering cost
groomed [process] Issue was recently looked at during backlog grooming
infra [subject] Project infrastructure like CI/CD, build and deployment scripts
noise [subject] Causing many false alarms
orange [process] Done by the Azul team
spike:2 [process] Spike estimate of two points

Comments

@dsotirho-ucsc
Contributor

dsotirho-ucsc commented Apr 8, 2024

[
    {
        "@timestamp": "2024-04-06 07:44:55.908",
        "@message": {
            "eventVersion": "1.09",
            "userIdentity": {
                "type": "AssumedRole",
                "principalId": "AROARZFZ7W77QPVQJ2ZTQ:i-0795dacb9f30cf2a3",
                "arn": "arn:aws:sts::122796619775:assumed-role/azul-gitlab/i-0795dacb9f30cf2a3",
                "accountId": "122796619775",
                "accessKeyId": "ASIARZFZ7W776HXE5O7F",
                "sessionContext": {
                    "sessionIssuer": {
                        "type": "Role",
                        "principalId": "AROARZFZ7W77QPVQJ2ZTQ",
                        "arn": "arn:aws:iam::122796619775:role/azul-gitlab",
                        "accountId": "122796619775",
                        "userName": "azul-gitlab"
                    },
                    "attributes": {
                        "creationDate": "2024-04-06T07:25:44Z",
                        "mfaAuthenticated": "false"
                    },
                    "ec2RoleDelivery": "2.0"
                }
            },
            "eventTime": "2024-04-06T07:44:51Z",
            "eventSource": "s3.amazonaws.com",
            "eventName": "HeadObject",
            "awsRegion": "us-east-1",
            "sourceIPAddress": "172.21.0.99",
            "userAgent": "[aws-sdk-go/1.44.260 (go1.20.12; linux; amd64) amazon-ssm-agent/]",
            "errorCode": "AccessDenied",
            "errorMessage": "Access Denied",
            "requestParameters": {
                "bucketName": "amazon-ssm-packages-us-east-1",
                "Host": "amazon-ssm-packages-us-east-1.s3.us-east-1.amazonaws.com",
                "key": "active-birdwatcher-fallback"
            },
            "responseElements": null,
            "additionalEventData": {
                "SignatureVersion": "SigV4",
                "CipherSuite": "TLS_AES_128_GCM_SHA256",
                "bytesTransferredIn": 0,
                "AuthenticationMethod": "AuthHeader",
                "x-amz-id-2": "ykhtuVu1GynRZwj/o+JvZd8g2EDIo/zyER6uWwgTXxppgrZnh4Fssvgq2Q0XJSaBIwz7nRGye3k=",
                "bytesTransferredOut": 243
            },
            "requestID": "979KC8EJK7B5RMPE",
            "eventID": "fa1e86a0-5892-440a-ab91-8f1c40475d3e",
            "readOnly": true,
            "resources": [
                {
                    "type": "AWS::S3::Object",
                    "ARN": "arn:aws:s3:::amazon-ssm-packages-us-east-1/active-birdwatcher-fallback"
                },
                {
                    "accountId": "HIDDEN_DUE_TO_SECURITY_REASONS",
                    "type": "AWS::S3::Bucket",
                    "ARN": "arn:aws:s3:::amazon-ssm-packages-us-east-1"
                }
            ],
            "eventType": "AwsApiCall",
            "managementEvent": false,
            "recipientAccountId": "122796619775",
            "sharedEventID": "7e1b0c03-e624-4ab0-9fa0-5c7e61bbc5e8",
            "vpcEndpointId": "vpce-08e682b19051915de",
            "eventCategory": "Data",
            "tlsDetails": {
                "tlsVersion": "TLSv1.3",
                "cipherSuite": "TLS_AES_128_GCM_SHA256",
                "clientProvidedHostHeader": "amazon-ssm-packages-us-east-1.s3.us-east-1.amazonaws.com"
            }
        }
    }
]
[
    {
        "@timestamp": "2024-04-06 07:49:56.107",
        "@message": {
            "eventVersion": "1.09",
            "userIdentity": {
                "type": "AssumedRole",
                "principalId": "AROARZFZ7W77QPVQJ2ZTQ:i-0795dacb9f30cf2a3",
                "arn": "arn:aws:sts::122796619775:assumed-role/azul-gitlab/i-0795dacb9f30cf2a3",
                "accountId": "122796619775",
                "accessKeyId": "ASIARZFZ7W776HXE5O7F",
                "sessionContext": {
                    "sessionIssuer": {
                        "type": "Role",
                        "principalId": "AROARZFZ7W77QPVQJ2ZTQ",
                        "arn": "arn:aws:iam::122796619775:role/azul-gitlab",
                        "accountId": "122796619775",
                        "userName": "azul-gitlab"
                    },
                    "attributes": {
                        "creationDate": "2024-04-06T07:25:44Z",
                        "mfaAuthenticated": "false"
                    },
                    "ec2RoleDelivery": "2.0"
                }
            },
            "eventTime": "2024-04-06T07:44:51Z",
            "eventSource": "s3.amazonaws.com",
            "eventName": "HeadBucket",
            "awsRegion": "us-east-1",
            "sourceIPAddress": "172.21.0.99",
            "userAgent": "[aws-sdk-go/1.44.260 (go1.20.12; linux; amd64)]",
            "errorCode": "AccessDenied",
            "errorMessage": "Access Denied",
            "requestParameters": {
                "bucketName": "amazon-ssm-packages-us-east-1",
                "Host": "amazon-ssm-packages-us-east-1.s3.us-east-1.amazonaws.com"
            },
            "responseElements": null,
            "additionalEventData": {
                "SignatureVersion": "SigV4",
                "CipherSuite": "TLS_AES_128_GCM_SHA256",
                "bytesTransferredIn": 0,
                "AuthenticationMethod": "AuthHeader",
                "x-amz-id-2": "AYMJx28uNpYT5C2H4cehU8v5eAxFq5/dPl01jOzzbpy/1fFtY0ryoly5/AEfnXJ3eGmpEk6v/d4=",
                "bytesTransferredOut": 243
            },
            "requestID": "979R18NQ47RNWCMQ",
            "eventID": "965d098f-b7ba-4cc9-a338-10c402fb83cd",
            "readOnly": true,
            "resources": [
                {
                    "type": "AWS::S3::Object",
                    "ARNPrefix": "arn:aws:s3:::amazon-ssm-packages-us-east-1/"
                },
                {
                    "accountId": "HIDDEN_DUE_TO_SECURITY_REASONS",
                    "type": "AWS::S3::Bucket",
                    "ARN": "arn:aws:s3:::amazon-ssm-packages-us-east-1"
                }
            ],
            "eventType": "AwsApiCall",
            "managementEvent": false,
            "recipientAccountId": "122796619775",
            "sharedEventID": "7bcaf387-e3aa-4703-b913-874b103e8191",
            "vpcEndpointId": "vpce-08e682b19051915de",
            "eventCategory": "Data",
            "tlsDetails": {
                "tlsVersion": "TLSv1.3",
                "cipherSuite": "TLS_AES_128_GCM_SHA256",
                "clientProvidedHostHeader": "amazon-ssm-packages-us-east-1.s3.us-east-1.amazonaws.com"
            }
        }
    }
]
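
A minimal sketch of how events like the two above could be recognized when triaging this alarm. It assumes the events are available as parsed CloudTrail records (the `@message` objects). The function name and the prefix tuple are hypothetical and not taken from the Azul code base. Note that the second event's `userAgent` does not mention `amazon-ssm-agent`, so the sketch matches on the bucket name rather than on the user agent.

```python
# Hypothetical helper, not part of Azul: flags CloudTrail S3 data events that
# look like the SSM agent's denied HeadBucket/HeadObject requests shown above.
SSM_BUCKET_PREFIXES = (
    'amazon-ssm-packages-',
    # Assumption: this second prefix is taken from the patch later in this thread.
    'aws-ssm-document-attachments-',
)


def is_ssm_access_denied(event: dict) -> bool:
    # requestParameters can be null in a CloudTrail record, hence the `or {}`.
    params = event.get('requestParameters') or {}
    bucket = params.get('bucketName', '')
    return (
        event.get('eventSource') == 's3.amazonaws.com'
        and event.get('eventName') in ('HeadBucket', 'HeadObject')
        and event.get('errorCode') == 'AccessDenied'
        and bucket.startswith(SSM_BUCKET_PREFIXES)
    )


# Trimmed-down copy of the first event above.
head_object = {
    'eventSource': 's3.amazonaws.com',
    'eventName': 'HeadObject',
    'errorCode': 'AccessDenied',
    'requestParameters': {
        'bucketName': 'amazon-ssm-packages-us-east-1',
        'key': 'active-birdwatcher-fallback',
    },
}
print(is_ssm_access_denied(head_object))
```

Matching on the bucket name is a deliberate choice here: both events target `amazon-ssm-packages-us-east-1`, whereas only one of them carries the agent's name in its user agent string.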
@dsotirho-ucsc dsotirho-ucsc added the orange [process] Done by the Azul team label Apr 8, 2024
@dsotirho-ucsc
Contributor Author

Index: terraform/gitlab/gitlab.tf.json.template.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/terraform/gitlab/gitlab.tf.json.template.py b/terraform/gitlab/gitlab.tf.json.template.py
--- a/terraform/gitlab/gitlab.tf.json.template.py	(revision f1a3d58efe03021f754f89a1f8f03484574e3aaf)
+++ b/terraform/gitlab/gitlab.tf.json.template.py	(date 1712694300553)
@@ -345,7 +345,10 @@
                                     'edu-ucsc-gi-azul-*',
                                     '*.azul.data.humancellatlas.org',
                                 ]
-                            )
+                            ) + [
+                                f'amazon-ssm-packages-{aws.region_name}',
+                                f'aws-ssm-document-attachments-{aws.region_name}'
+                            ]
                         )
                     },
 
@@ -949,7 +952,9 @@
                             's3:HeadObject'
                         ],
                         'resources': [
+                            f'arn:aws:s3:::amazon-ssm-packages-{aws.region_name}',
                             f'arn:aws:s3:::amazon-ssm-packages-{aws.region_name}/*',
+                            f'arn:aws:s3:::aws-ssm-document-attachments-{aws.region_name}',
                             f'arn:aws:s3:::aws-ssm-document-attachments-{aws.region_name}/*'
                         ]
                     }
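
The second hunk of the patch adds the bare bucket ARNs next to the existing `/*` object ARNs. A small sketch of that pairing follows; the helper name is hypothetical and the code is not the actual template. The idea is that a bucket-level request such as HeadBucket is evaluated against the bucket ARN, while an object-level request such as HeadObject is evaluated against an object ARN.

```python
# Hypothetical helper mirroring the 'resources' list in the patch above.
def ssm_bucket_arns(region_name: str) -> list[str]:
    buckets = [
        f'amazon-ssm-packages-{region_name}',
        f'aws-ssm-document-attachments-{region_name}',
    ]
    arns = []
    for bucket in buckets:
        # Bucket-level resource, relevant to requests such as HeadBucket
        arns.append(f'arn:aws:s3:::{bucket}')
        # Object-level resource, relevant to requests such as HeadObject
        arns.append(f'arn:aws:s3:::{bucket}/*')
    return arns


for arn in ssm_bucket_arns('us-east-1'):
    print(arn)
```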

@dsotirho-ucsc dsotirho-ucsc self-assigned this Apr 9, 2024
@dsotirho-ucsc dsotirho-ucsc added bug [type] A defect preventing use of the system as specified debt [type] A defect incurring continued engineering cost infra [subject] Project infrastructure like CI/CD, build and deployment scripts + [priority] High noise [subject] Causing many false alarms labels Apr 9, 2024
@dsotirho-ucsc dsotirho-ucsc changed the title Alarm api_unauthorized for HeadBucket/Object from AssumedRole azul-gitlab Alarm api_unauthorized for HeadBucket/Object from SSM agent Apr 10, 2024
@hannes-ucsc
Member

hannes-ucsc commented Apr 10, 2024

For demo, show absence of matching trail log events for one week after this lands in a main deployment.

@hannes-ucsc hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Apr 10, 2024
@dsotirho-ucsc dsotirho-ucsc removed the demo [process] To be demonstrated at the end of the sprint label Apr 15, 2024
@dsotirho-ucsc
Contributor Author

https://docs.aws.amazon.com/systems-manager/latest/userguide/ssm-agent-minimum-s3-permissions.html

Assignee to chase PR with another one that mentions all buckets as documented above.

@hannes-ucsc
Member

For demo, show absence of matching log events for one week after this lands in a main deployment.

@hannes-ucsc
Member

hannes-ucsc commented Jun 26, 2024

Originally posted on another issue #6134 (comment):

From the spike experiments for #6134 it appears that AccessDenied requests occur in two cases. First, when a newly created instance starts up for the first time (two AccessDenied requests). Second, when a new version of the SSM agent is available and automatically installed by the agent: new versions are checked for twice daily, uninstalling the old version incurs two AccessDenied requests, and installing the new version incurs another two. We typically observe one SSM agent update per week, so we expect four false AccessDenied alarms per week. If new versions are released more frequently, we could observe up to eight false AccessDenied alarms (2 * (2 + 2)).

Assignee to try s3:* in the IAM policy.
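
The alarm estimate in the comment above can be written out as a small calculation. This is only a restatement of the quoted figures (two denied requests per uninstall, two per install); the constant and function names are made up for illustration, and the two denied requests at first instance startup are not included.

```python
# Figures quoted in the comment above; the names are hypothetical.
DENIED_PER_UNINSTALL = 2
DENIED_PER_INSTALL = 2


def expected_false_alarms(agent_updates: int) -> int:
    # Each agent update uninstalls the old version and installs the new one.
    return agent_updates * (DENIED_PER_UNINSTALL + DENIED_PER_INSTALL)


print(expected_false_alarms(1))  # 4, the typical week with one update
print(expected_false_alarms(2))  # 8, i.e. 2 * (2 + 2)
```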

@achave11-ucsc
Member

Originally posted on another issue #6134 (comment):

Assignee to draft an AWS support request on Google Docs.

@achave11-ucsc
Member

AWS Support case has been created.

@dsotirho-ucsc
Contributor Author

Assignee to monitor AWS Support ticket and follow up if necessary.

@achave11-ucsc
Member

AWS Support responded. They mentioned that this is a known issue and that there's nothing we can do to prevent it. They also said that they've urged the service team responsible to look into it and that they'll keep us posted on the details.

@achave11-ucsc
Member

Followed up with some questions, awaiting response.

@dsotirho-ucsc dsotirho-ucsc added spike:2 [process] Spike estimate of two points and removed no demo [process] Not to be demonstrated at the end of the sprint labels Jul 2, 2024
@dsotirho-ucsc
Contributor Author

Note that this was already closed with a fix in stable, but the fix was ineffective so we went back to AWS Support. Spike to continue to monitor the AWS Support ticket.

@dsotirho-ucsc
Contributor Author

@hannes-ucsc: "AWS responded stating that they are working on a fix, but can't release an ETA for it. Assignee to continue to monitor the support ticket."

@achave11-ucsc
Member

AWS responded,

Sadly, I have not yet received an update regarding the fix for this issue. I have escalated the matter internally in the hope that we can drive this to resolution soonest.

I apologise for the delay here and appreciate your continued patience.

Still waiting for an upstream resolution.

@dsotirho-ucsc
Contributor Author

Assignee to periodically check with AWS Support.

@achave11-ucsc
Member

AWS replied:

Good news, the issue has finally been resolved. I have tested this using SSM Distributor and I can confirm that the AccessDenied error is no longer present in debug logs. Please update to the latest available SSM agent version (3.3.1230.0) and let me know if you are still seeing this error.

@dsotirho-ucsc
Contributor Author

dsotirho-ucsc commented Nov 14, 2024

Assignee to verify Amazon's fix in two weeks' time.

@achave11-ucsc
Member

After AWS released the fix, we haven't observed any of the 'weekly' SSM AccessDenied alarms we were previously seeing. Furthermore, rebooting or scrapping the instance no longer causes these AccessDenied alarms from SSM.

@achave11-ucsc
Member

AWS Support ticket has been marked as resolved.

@dsotirho-ucsc dsotirho-ucsc added the no demo [process] Not to be demonstrated at the end of the sprint label Dec 9, 2024
@achave11-ucsc achave11-ucsc reopened this Dec 10, 2024
@achave11-ucsc
Member

@hannes-ucsc: "It turns out that rebooting GitLab did trigger more AccessDenied trail events. We are using a version that is several months old (3.3.987) while the newest version is 3.3.1345.0. First, we need to install the latest version and reboot several times with that version installed. Before that though, we need to reboot with the old version in order to ensure that we have a reliable reproduction. We should also investigate why the version currently in use is so old. We may need to hard-code the version in the Terraform config."
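
A short sketch of how the installed agent version could be checked against the version AWS Support named as containing the fix (3.3.1230.0, quoted earlier in this thread). The function names are hypothetical. The sketch compares versions as tuples of integers, because comparing the dotted strings directly gives the wrong answer for 3.3.987.

```python
def parse_version(version: str) -> tuple[int, ...]:
    # '3.3.1230.0' -> (3, 3, 1230, 0)
    return tuple(int(part) for part in version.split('.'))


# Version that AWS Support said contains the fix, per the earlier comment.
FIXED_IN = parse_version('3.3.1230.0')


def has_fix(installed: str) -> bool:
    return parse_version(installed) >= FIXED_IN


print(has_fix('3.3.987'))         # False
print(has_fix('3.3.1345.0'))      # True
print('3.3.987' >= '3.3.1230.0')  # True: plain string comparison is misleading
```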

@achave11-ucsc achave11-ucsc removed the no demo [process] Not to be demonstrated at the end of the sprint label Dec 10, 2024
@achave11-ucsc
Member

A recent upgrades PR updated the instance AMI to a more recent version. This confirmed that when the instance is booted, the latest version of the amazon-ssm-agent package isn't installed (it still uses 3.3.987). We associate this outdated version with the false positive alarms. The update also confirmed that the package manager used by the instance isn't updating amazon-ssm-agent, at least, to the latest available version.
Because the issue isn't reproducible via reboot, we need to explicitly install the package (with the desired version) and scrap the instance in order to 1) confirm that the most recent version is actually being installed on the instance and 2) ensure that the problem with the SSM agent causing false positive alarms is resolved. To be tested with the following patch:

Index: terraform/gitlab/gitlab.tf.json.template.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/terraform/gitlab/gitlab.tf.json.template.py b/terraform/gitlab/gitlab.tf.json.template.py
--- a/terraform/gitlab/gitlab.tf.json.template.py	(revision e39203ce85b62e52483c49716357ffb70e962120)
+++ b/terraform/gitlab/gitlab.tf.json.template.py	(date 1734720734323)
@@ -1609,7 +1609,8 @@
                         'docker',
                         'amazon-cloudwatch-agent',
                         'amazon-ecr-credential-helper',
-                        'dracut-fips'
+                        'dracut-fips',
+                        ['amazon-ssm-agent', '3.3.1345.0'],
                     ],
                     'ssh_authorized_keys': [] if config.deployment.is_stable else operator_keys,
                     'bootcmd': [
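
The patch above mixes two kinds of entries in the cloud-config `packages` list: bare package names and a `[name, version]` pair, both of which cloud-init accepts. The following sketch only illustrates that shape; the helper is hypothetical and is not part of the template or of cloud-init.

```python
# Hypothetical helper: normalizes cloud-config 'packages' entries into
# (name, version) pairs, with None meaning "no version pinned".
def describe_packages(packages: list) -> list[tuple]:
    described = []
    for entry in packages:
        if isinstance(entry, str):
            described.append((entry, None))
        else:
            name, version = entry
            described.append((name, version))
    return described


packages = [
    'docker',
    'dracut-fips',
    ['amazon-ssm-agent', '3.3.1345.0'],
]
for name, version in describe_packages(packages):
    print(name, version or '(unpinned)')
```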

@achave11-ucsc
Member

Assignee to try specifying the fully qualified URL in cloud-config user data, as suggested by https://docs.aws.amazon.com/systems-manager/latest/userguide/agent-install-al2.html#quick-install-al2

@achave11-ucsc achave11-ucsc added - [priority] Medium and removed + [priority] High labels Dec 20, 2024
@achave11-ucsc
Member

Using the following patch installed amazon-ssm-agent v3.3.1345.0 (the latest version),

diff --git a/terraform/gitlab/gitlab.tf.json.template.py b/terraform/gitlab/gitlab.tf.json.template.py
index e9fabf394..d10ddda9d 100644
--- a/terraform/gitlab/gitlab.tf.json.template.py
+++ b/terraform/gitlab/gitlab.tf.json.template.py
@@ -1609,7 +1609,8 @@ emit_tf({} if config.terraform_component != 'gitlab' else {
                         'docker',
                         'amazon-cloudwatch-agent',
                         'amazon-ecr-credential-helper',
-                        'dracut-fips'
+                        'dracut-fips',
+                        'https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm'
                     ],
                     'ssh_authorized_keys': [] if config.deployment.is_stable else operator_keys,
                     'bootcmd': [

… which didn't prevent the false positive AccessDenied alarms caused by the SSM agent.


@achave11-ucsc
Member

achave11-ucsc commented Jan 7, 2025

@hannes-ucsc: "Assignee to open new support ticket, referring to the old one, providing the above evidence that the issue persists."

@hannes-ucsc hannes-ucsc added the groomed [process] Issue was recently looked at during backlog grooming label Jan 7, 2025

3 participants