Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: VMSS agents go offline unexpectedly due to 401 auth error #5023

Open
2 of 4 tasks
angaaruriakhil opened this issue Oct 25, 2024 · 6 comments
Open
2 of 4 tasks

Comments

@angaaruriakhil
Copy link

angaaruriakhil commented Oct 25, 2024

What happened?

Context: We're hosting VMSS in Azure with images built with runner-images code as Azure DevOps agents. This is the source code for Azure DevOps MS hosted agents. We use the operating systems: Windows 2022, Windows 2019, Ubuntu 22.04, Ubuntu 20.04 and they are all affected by this issue. The issue happens intermittently across scale sets hosted in multiple regions across all the operating systems mentioned.

We intermittently are seeing VMs failing with the error for at least the last month in the Diagnostics tab for each pool in Azure DevOps:

Pipeline agent went offline unexpectedly

Which then will cause the VM to go offline and skip the jobs it may be running. This is causing big problems for us as VMs unexpectedly go offline and if unlucky, this could be while they are running a business critical job. This has been happening for at least a month.

We have checked our proxy/all firewalls for networking blocks and there are no blocks reported from our VMSS subnets to any destination or port. All the outbound traffic being executed is allowed.

Saving an unhealthy Ubuntu 22.04 agent for investigation and investigating the logs under /agent/_diag shows that there is a 401 error (scrubbed excepts attached in log box, I don't want to share sensitive information). See the logs box for relevant logs.

We have similarly looked at the log files under:

  • /var/log/waagent.log
  • /var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/enableagent.log
  • /var/log/azure/Microsoft.VisualStudio.Services.TeamServicesAgentLinux/extension.log

All of which report nothing out of the ordinary.

There are similar issues reporting this problem here. #4826

As reported in #4826 , running the ./run.sh --diagnostics command, also reports an error writing to the log.

Versions

Azure DevOps Services
Images built with runner-images on October 15th 2024
Azure Pipelines Agent v3.246.0
WA Linux Agent v2.11.1.12

Environment type (Please select at least one enviroment where you face this issue)

  • Self-Hosted
  • Microsoft Hosted
  • VMSS Pool
  • Container

Azure DevOps Server type

dev.azure.com (formerly visualstudio.com)

Azure DevOps Server Version (if applicable)

No response

Operation system

Ubuntu 22.04. Ubuntu 20.04, Windows 2022, Windows 2019

Version controll system

No response

Relevant log output

[2024-10-25 10:35:40Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z WARN VisualStudioServices] GET request to https://dev.azure.com/{scrubbed}/_apis/distributedtask/pools/1166/messages timed out after 60 seconds.
[2024-10-25 10:36:15Z ERR  MessageListener] System.TimeoutException: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z INFO MessageListener] Retriable exception: The HTTP request timed out after 00:01:00.
[2024-10-25 10:36:15Z ERR  Terminal] WRITE ERROR: 2024-10-25 10:36:15Z: Agent connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
[2024-10-25 11:20:38Z INFO MessageListener] Sleeping for 13.802 seconds before retrying.
[2024-10-25 11:21:42Z INFO MessageListener] Sleeping for 7.404 seconds before retrying.
[2024-10-25 11:21:43Z INFO MessageListener] Sent GetAgentMessage to keep alive agent 926980, session '{scrubbed}'.
[2024-10-25 11:22:13Z WARN VisualStudioServices] Authentication failed with status code 401.

[2024-10-25 11:31:36Z INFO MessageListener] No message retrieved from session '{scrubbed}' within last 30 minutes.
Results from ./run.sh --diagnostics:

System.UnauthorizedAccessException: Access to the path '/agent/_diag/Agent_20241025-190446-utc.log' is denied. ---> System.IO.IOException: Permission denied --- End of inner exception stack trace --- at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func2 errorRewriter)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
at Microsoft.VisualStudio.Services.Agent.HostTraceListener.CreatePageLogWriter() in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 178
at Microsoft.VisualStudio.Services.Agent.HostTraceListener..ctor(String logFileDirectory, String logFilePrefix, Int32 pageSizeLimit, Int32 retentionDays) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 50
at Microsoft.VisualStudio.Services.Agent.HostContext..ctor(HostType hostType, String logFile) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostContext.cs:line 135
at Microsoft.VisualStudio.Services.Agent.Listener.Program.Main(String[] args) in /mnt/vss/_work/1/s/src/Agent.Listener/Program.cs:line 28
./run.sh: line 68: 24614 Aborted (core dumped) "$DIR"/bin/Agent.Listener run $*
`

@DenisNikulin5
Copy link
Contributor

DenisNikulin5 commented Oct 29, 2024

Hi @angaaruriakhil
The agent should acquire a new token if the old one is expired

Do you see any attempts in the logs before the VM is terminated?
It should look like this:

[2024-10-28 12:39:06Z WARN VisualStudioServices] Authentication failed with status code 401.
[2024-10-28 12:39:06Z VERB VisualStudioServices] Started authentication
[2024-10-28 12:39:06Z VERB VisualStudioServices] OAuth issued token provider instance 4920083 is acquiring a token
...
[2024-10-28 12:39:07Z VERB VisualStudioServices] OAuth issued token provider instance 4920083 acquired new token instance 38778824
[2024-10-28 12:39:07Z VERB VisualStudioServices] Finished authentication
[2024-10-28 12:39:07Z VERB VisualStudioServices] OAuth issued token provider instance 4920083 validated token instance 38778824

@angaaruriakhil
Copy link
Author

Hi @DenisNikulin5 , No I don't see that in the logs.

It basically loops through this block of code and goes back to 401.

Let me know if more details are required, I scrubbed a lot to protect the details.

[2024-10-24 19:42:05Z WARN VisualStudioServices] Authentication failed with status code 401.
Cache-Control: no-cache
Pragma: no-cache
P3P: CP="CAO DSP COR ADMa DEV CONo TELo CUR PSA PSD TAI IVDo OUR SAMi BUS DEM NAV STA UNI COM INT PHY ONL FIN PUR LOC CNT"
WWW-Authenticate: Bearer authorization_uri=https://login.microsoftonline.com/{scrubbed}, Basic realm="https://tfsprodcus6.visualstudio.com/", TFS-Federated
X-TFS-ProcessId: {scrubbed}
Strict-Transport-Security: max-age=31536000; includeSubDomains
ActivityId:  {scrubbed}
X-TFS-Session:  {scrubbed}
X-VSS-E2EID:  {scrubbed}
X-VSS-SenderDeploymentId:  {scrubbed}
X-Frame-Options: SAMEORIGIN
X-TFS-FedAuthRealm: https://tfsprodcus6.visualstudio.com/
X-TFS-FedAuthIssuer: https://dev.azure.com/{scrubbed}
X-VSS-AuthorizationEndpoint: https://vssps.dev.azure.com/{scrubbed}
X-VSS-ResourceTenant:  {scrubbed}
X-VSS-S2STargetService:  {scrubbed}/visualstudio.com
X-TFS-FedAuthRedirect: https://spsprodcus5.vssps.visualstudio.com/_signin?realm={scrubbed}
Request-Context: appId=cid-v1:{scrubbed}
Access-Control-Expose-Headers: Request-Context
X-Content-Type-Options: nosniff
X-Cache: CONFIG_NOCACHE
X-MSEdge-Ref: Ref A: {scrubbed} Ref B: CHGEDGE1009 Ref C: 2024-10-24T19:42:03Z
Date: Thu, 24 Oct 2024 19:42:03 GMT

[2024-10-24 19:42:03Z INFO RSAFileKeyManager] Loading RSA key parameters from file /agent/.credentials_rsaparams
[2024-10-24 19:42:03Z INFO RSAFileKeyManager] Loading RSA key parameters from file /agent/.credentials_rsaparams
[2024-10-24 19:42:04Z INFO VisualStudioServices] AAD Correlation ID for this token request: Unknown
[2024-10-24 19:42:04Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2024-10-24 19:42:04Z INFO AgentServer] Establish connection with 60 seconds timeout.
[2024-10-24 19:42:04Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2024-10-24 19:42:04Z INFO AgentServer] Establish connection with 60 seconds timeout.
[2024-10-24 19:42:04Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2024-10-24 19:42:04Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2024-10-24 19:42:05Z INFO VisualStudioServices] Finished operation Location.GetConnectionData
[2024-10-24 19:42:05Z INFO CommandSettings] Arg 'work': ''
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'unattended': 'True'
[2024-10-24 19:42:05Z INFO PromptManager] ReadValue
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'disableloguploads': 'False'
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'restreamlogstofiles': 'False'
[2024-10-24 19:42:05Z INFO CommandSettings] Flag 'alwaysextracttask': 'False'
[2024-10-24 19:42:05Z INFO ConfigurationStore] Saving agent settings.
[2024-10-24 19:42:05Z INFO ConfigurationStore] Settings Saved.
[2024-10-24 19:42:05Z INFO Terminal] WRITE LINE: 2024-10-24 19:42:05Z: Settings Saved.
[2024-10-24 19:42:05Z INFO SystemDControlManager] Service name 'vsts.agent. {scrubbed}.Scale Set Pool - Linux {scrubbed}.vmsslnap{scrubbed}.service' display name 'Azure Pipelines Agent ({scrubbed}.Scale Set Pool - Linux {scrubbed}.vmsslnap{scrubbed})' will be used for service configuration.
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Bin': '/agent/bin'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Root': '/agent'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Bin': '/agent/bin'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Bin': '/agent/bin'
[2024-10-24 19:42:05Z INFO HostContext] Well known directory 'Root': '/agent'
[2024-10-24 19:42:05Z INFO UnixUtil] Which: 'chmod'
[2024-10-24 19:42:05Z INFO UnixUtil] Location: '{matches.First()}'
[2024-10-24 19:42:05Z INFO UnixUtil] Running /usr/bin/chmod 755 "/agent/svc.sh"
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Starting process:
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   File name: '/usr/bin/chmod'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Arguments: '755 "/agent/svc.sh"'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Working directory: '/agent'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Require exit code zero: 'True'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Encoding web name:  ; code page: ''
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Force kill process on cancellation: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Redirected STDIN: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Persist current code page: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Keep redirected STDIN open: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   High priority process: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   ContinueAfterCancelProcessTreeKillAttempt: 'False'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Sigint timeout: '00:00:07.5000000'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Sigterm timeout: '00:00:02.5000000'
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper]   Try to use graceful shutdown: False
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Updated oom_score_adj to 500 for PID: 4583.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Process started with process id 4583, waiting for process exit.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] STDOUT/STDERR stream read finished.
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Exited process 4583 with exit code 0
[2024-10-24 19:42:05Z INFO ProcessInvokerWrapper] Finished process 4583 with exit code 0, and elapsed time 00:00:00.0023222.
[2024-10-24 19:42:05Z INFO ConfigurationStore] HasCredentials()
[2024-10-24 19:42:05Z INFO ConfigurationStore] stored True
[2024-10-24 19:42:05Z INFO CredentialManager] GetCredentialProvider
[2024-10-24 19:42:05Z INFO CredentialManager] Creating type OAuth
[2024-10-24 19:42:05Z INFO CredentialManager] Creating credential type: OAuth
[2024-10-24 19:42:05Z INFO RSAFileKeyManager] Loading RSA key parameters from file /agent/.credentials_rsaparams
[2024-10-24 19:42:05Z INFO ConfigurationManager] LoadSettings
[2024-10-24 19:42:05Z INFO ConfigurationStore] IsConfigured()
[2024-10-24 19:42:05Z INFO ConfigurationStore] IsConfigured: True
[2024-10-24 19:42:05Z INFO ConfigurationManager] Is configured: True
[2024-10-24 19:42:05Z INFO ConfigurationStore] Read setting file: 225 chars
[2024-10-24 19:42:05Z INFO ConfigurationManager] Settings Loaded
[2024-10-24 19:42:05Z WARN VisualStudioServices] Authentication failed with status code 401.

@HAldermo-Magnet
Copy link

We're seeing the same issue with a new VMSS Agent Pool. Existing pools seems unaffected.
Ubuntu 24.04 built using Azure Image builder. The same image works fine in other pools.

2024-12-10 14:19:34Z INFO AgentCapabilitiesProvider] Adding 'Agent.Name': 'xxxxxd00000E'
[2024-12-10 14:19:34Z INFO AgentCapabilitiesProvider] Adding 'Agent.OS': 'Linux'
[2024-12-10 14:19:34Z INFO AgentCapabilitiesProvider] Adding 'Agent.OSArchitecture': 'X64'
[2024-12-10 14:19:34Z INFO AgentCapabilitiesProvider] Adding 'InteractiveSession': 'True'
[2024-12-10 14:19:34Z INFO AgentCapabilitiesProvider] Adding 'Agent.Version': '3.248.0'
[2024-12-10 14:19:34Z INFO AgentCapabilitiesProvider] Adding 'Agent.ComputerName': 'xxxxxd00000E'
[2024-12-10 14:19:34Z INFO HostContext] Well known directory 'Bin': '/agent/bin'
[2024-12-10 14:19:34Z INFO HostContext] Well known directory 'Root': '/agent'
[2024-12-10 14:19:34Z INFO AgentCapabilitiesProvider] Adding 'Agent.HomeDirectory': '/agent'
[2024-12-10 14:19:34Z INFO Terminal] WRITE LINE: Connecting to the server.
[2024-12-10 14:19:35Z INFO Terminal] WRITE LINE: Successfully added the agent
[2024-12-10 14:19:35Z INFO ConfigurationManager] Agent server url resolve by server: 'https://dev.azure.com/xxxxx-dev/'.
[2024-12-10 14:19:35Z INFO ConfigurationStore] Saving OAuth credential @ /agent/.credentials
[2024-12-10 14:19:35Z INFO ConfigurationStore] Credentials Saved.
[2024-12-10 14:19:35Z INFO Terminal] WRITE LINE: Testing agent connection.
[2024-12-10 14:19:35Z INFO ConfigurationStore] HasCredentials()
[2024-12-10 14:19:35Z INFO ConfigurationStore] stored True
[2024-12-10 14:19:35Z INFO CredentialManager] GetCredentialProvider
[2024-12-10 14:19:35Z INFO CredentialManager] Creating type OAuth
[2024-12-10 14:19:35Z INFO CredentialManager] Creating credential type: OAuth
[2024-12-10 14:19:35Z INFO RSAFileKeyManager] Loading RSA key parameters from file /agent/.credentials_rsaparams
[2024-12-10 14:19:35Z INFO AgentServer] Establish connection with 100 seconds timeout.
[2024-12-10 14:19:35Z INFO VisualStudioServices] Starting operation Location.GetConnectionData
[2024-12-10 14:19:35Z WARN VisualStudioServices] Authentication failed with status code 401.

@Steveiwonder
Copy link

Hey all, is there any temporary work around for this issue?

@mjblue
Copy link

mjblue commented Dec 18, 2024

Also seeing this issue, have rebuilt on latest ubuntu2204 images on 5/12 and the problem is still existing

@sicil1ano
Copy link

sicil1ano commented Jan 7, 2025

Hi. Like @angaaruriakhil mentioned in the OP, there's an issue with AzDevOps user used by the agent when running the ./run.sh --diagnostics command, because the user does not have enough permissions to use the agent directory.
The 401 errors are a red herring. The agents don't go offline due to 401 errors, but instead because of lack of adequate permissions not granted to AzDevOps to use agent directory.

adminuser@vmss-agents-142:/agent$ ./run.sh --diagnostics
Unhandled exception. System.UnauthorizedAccessException: Access to the path '/agent/_diag/Agent_20240604-105127-utc.log' is denied.
 ---> System.IO.IOException: Permission denied
   --- End of inner exception stack trace ---
   at Interop.ThrowExceptionForIoErrno(ErrorInfo errorInfo, String path, Boolean isDirectory, Func`2 errorRewriter)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String path, OpenFlags flags, Int32 mode)
   at Microsoft.Win32.SafeHandles.SafeFileHandle.Open(String fullPath, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
   at System.IO.Strategies.OSFileStreamStrategy..ctor(String path, FileMode mode, FileAccess access, FileShare share, FileOptions options, Int64 preallocationSize)
   at Microsoft.VisualStudio.Services.Agent.HostTraceListener.CreatePageLogWriter() in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 178
   at Microsoft.VisualStudio.Services.Agent.HostTraceListener..ctor(String logFileDirectory, String logFilePrefix, Int32 pageSizeLimit, Int32 retentionDays) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostTraceListener.cs:line 50
   at Microsoft.VisualStudio.Services.Agent.HostContext..ctor(HostType hostType, String logFile) in /mnt/vss/_work/1/s/src/Microsoft.VisualStudio.Services.Agent/HostContext.cs:line 135
   at Microsoft.VisualStudio.Services.Agent.Listener.Program.Main(String[] args) in /mnt/vss/_work/1/s/src/Agent.Listener/Program.cs:line 28
./run.sh: line 68:  3425 Aborted                 (core dumped) "$DIR"/bin/Agent.Listener run $*

I executed the sudo chmod -R a+rwx command for the agent directory (be careful: it assigns read/write/execute permissions to ALL users recursively) and the agent finally worked properly.
Please note that the 401 errors are STILL reported in the logs, but they are not blocking, like I mentioned here.
I created another issue to solve this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants