AC/FC integration #2995
base: dev
Changes from 2 commits
671edf2
a510eb2
e5ad084
f333e68
aa4409d
95808e8
0d55f7d
20cdad8
9538c38
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -198,6 +198,7 @@ def connect(options = {}) | |
rescue NewRelic::Agent::UnrecoverableAgentException => e | ||
handle_unrecoverable_agent_error(e) | ||
rescue StandardError, Timeout::Error, NewRelic::Agent::ServerConnectionException => e | ||
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT) | ||
I've struggled to write tests to verify the status is updated when we expect it to be updated. If we think tests are valuable for these updates, I could use some help! |
||
retry if retry_from_error?(e, opts) | ||
rescue Exception => e | ||
::NewRelic::Agent.logger.error('Exception of unexpected type during Agent#connect():', e) | ||
|
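One possible approach to the testing question in the comment above is to hand the connect path a spy in place of the health check and force the network call to fail. The sketch below is self-contained and uses stand-in classes only: `SpyHealthCheck`, `FakeConnector`, and the injected block are hypothetical names for illustration, not part of the agent, and the real `connect` method has more branches than this.

```ruby
# Hypothetical stand-ins for illustration; none of these are the agent's real classes.
FAILED_TO_CONNECT = {healthy: false, last_error: 'NR-APM-009',
                     message: 'Failed to connect to New Relic data collector'}.freeze

# Spy that records every status passed to update_status.
class SpyHealthCheck
  attr_reader :statuses

  def initialize
    @statuses = []
  end

  def update_status(status, _options = [])
    @statuses << status
  end
end

# Simplified connect flow. The network call is injected as a block so a test
# can force a failure without touching the network.
class FakeConnector
  def initialize(health_check, &network_call)
    @health_check = health_check
    @network_call = network_call
  end

  def connect
    @network_call.call
    true
  rescue StandardError
    @health_check.update_status(FAILED_TO_CONNECT)
    false
  end
end

spy = SpyHealthCheck.new
connector = FakeConnector.new(spy) { raise IOError, 'simulated connection failure' }
connected = connector.connect
```

After the forced failure, `spy.statuses` holds exactly one entry, so a test can assert on it directly. Whether the agent's own test helpers make a similar substitution easy is not something this sketch assumes.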
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,125 @@ | ||
# This file is distributed under New Relic's license terms. | ||
# See https://github.com/newrelic/newrelic-ruby-agent/blob/main/LICENSE for complete details. | ||
# frozen_string_literal: true | ||
|
||
module NewRelic | ||
module Agent | ||
class HealthCheck | ||
def initialize | ||
@start_time = nano_time | ||
@fleet_id = ENV['NEW_RELIC_AGENT_CONTROL_FLEET_ID'] | ||
# The spec states file paths for the delivery location will begin with file:// | ||
# This does not create a valid path in Ruby, so remove the prefix when present | ||
@delivery_location = ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION']&.gsub('file://', '') | ||
@frequency = ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY'].to_i | ||
kaylareopelle marked this conversation as resolved.
|
||
@continue = true | ||
@status = HEALTHY | ||
end | ||
|
||
HEALTHY = {healthy: true, last_error: 'NR-APM-000', message: 'Healthy'} | ||
INVALID_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-001', message: 'Invalid license key (HTTP status code 401)'} | ||
Are any of these errors recoverable for the Ruby Agent? i.e. where we could potentially need to set the status back to
Great question! Some of these errors are recoverable. From what I'm seeing in the code, a successful HTTP request made by the However, I may have missed some scenarios and there could be other places where explicitly setting a |
||
MISSING_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-002', message: 'License key missing in configuration'} | ||
FORCED_DISCONNECT = {healthy: false, last_error: 'NR-APM-003', message: 'Forced disconnect received from New Relic (HTTP status code 410)'} | ||
HTTP_ERROR = {healthy: false, last_error: 'NR-APM-004', message: 'HTTP error response code [%s] received from New Relic while sending data type [%s]'} | ||
MISSING_APP_NAME = {healthy: false, last_error: 'NR-APM-005', message: 'Missing application name in agent configuration'} | ||
APP_NAME_EXCEEDED = {healthy: false, last_error: 'NR-APM-006', message: 'The maximum number of configured app names (3) exceeded'} | ||
PROXY_CONFIG_ERROR = {healthy: false, last_error: 'NR-APM-007', message: 'HTTP Proxy configuration error; response code [%s]'} | ||
The spec defines these two error codes, but our agent doesn't have any behavior to recognize when these problems occur. As things stand currently, we don't need to update our agent to record these states. I left them here to make sure we match the spec, but would also be open to removing them, since the status will never be updated to use these constants.
I think it makes sense to leave them, to be complete. Plus, if we ever update the agent to handle those things, then these will already be here ready to use. And if we never do make those updates, at least in the future when we're comparing the code to the spec to look into something, we won't be confused about why some are missing.
Great! Do you think we would benefit from a comment stating they're not called anywhere? |
||
AGENT_DISABLED = {healthy: false, last_error: 'NR-APM-008', message: 'Agent is disabled via configuration'} | ||
FAILED_TO_CONNECT = {healthy: false, last_error: 'NR-APM-009', message: 'Failed to connect to New Relic data collector'} | ||
FAILED_TO_PARSE_CONFIG = {healthy: false, last_error: 'NR-APM-010', message: 'Agent config file is not able to be parsed'} | ||
SHUTDOWN = {healthy: true, last_error: 'NR-APM-099', message: 'Agent has shutdown'} | ||
|
||
def create_and_run_health_check_loop | ||
unless health_check_enabled? | ||
@continue = false | ||
end | ||
|
||
return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_FLEET_ID not found, skipping health checks') unless @fleet_id | ||
return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION not found, skipping health checks') unless @delivery_location | ||
return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY zero or less, skipping health checks') unless @frequency > 0 | ||
|
||
NewRelic::Agent.logger.debug('Agent control health check conditions met. Starting health checks.') | ||
NewRelic::Agent.record_metric('Supportability/AgentControl/Health/enabled', 1) | ||
|
||
Thread.new do | ||
while @continue | ||
begin | ||
sleep @frequency | ||
write_file | ||
@continue = false if @status == SHUTDOWN | ||
rescue StandardError => e | ||
NewRelic::Agent.logger.error("Aborting agent control health check. Error raised: #{e}") | ||
@continue = false | ||
end | ||
end | ||
end | ||
end | ||
|
||
def update_status(status, options = []) | ||
return unless @continue | ||
|
||
@status = status.dup | ||
update_message(options) unless options.empty? | ||
end | ||
|
||
private | ||
|
||
def contents | ||
<<~CONTENTS | ||
healthy: #{@status[:healthy]} | ||
status: #{@status[:message]}#{last_error} | ||
start_time_unix_nano: #{@start_time} | ||
status_time_unix_nano: #{nano_time} | ||
CONTENTS | ||
end | ||
|
||
def last_error | ||
@status[:healthy] ? '' : "\nlast_error: #{@status[:last_error]}" | ||
end | ||
|
||
def nano_time | ||
Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond) | ||
end | ||
|
||
def file_name | ||
"health-#{NewRelic::Agent::GuidGenerator.generate_guid(32)}.yml" | ||
end | ||
|
||
def write_file | ||
@path ||= create_file_path | ||
|
||
File.write("#{@path}/#{file_name}", contents) | ||
rescue StandardError => e | ||
NewRelic::Agent.logger.error("Agent control health check raised an error while writing a file: #{e}") | ||
@continue = false | ||
end | ||
|
||
def create_file_path | ||
for abs_path in [File.expand_path(@delivery_location), | ||
File.expand_path(File.join('', @delivery_location))] do | ||
if File.directory?(abs_path) || (Dir.mkdir(abs_path) rescue nil) | ||
return abs_path[%r{^(.*?)/?$}] | ||
end | ||
end | ||
nil | ||
rescue StandardError => e | ||
NewRelic::Agent.logger.error( | ||
'Agent control health check raised an error while finding or creating the file path defined in ' \ | ||
"NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION: #{e}" | ||
) | ||
@continue = false | ||
end | ||
|
||
def health_check_enabled? | ||
@fleet_id && @delivery_location && (@frequency > 0) | ||
end | ||
|
||
def update_message(options) | ||
@status[:message] = sprintf(@status[:message], *options) | ||
rescue StandardError => e | ||
NewRelic::Agent.logger.debug("Error raised while updating agent control health check message: #{e}. " \ | ||
"Reverting to original message.\noptions = #{options}, @status[:message] = #{@status[:message]}") | ||
end | ||
end | ||
end | ||
end |
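To make the message templating and file layout in this diff easier to reason about, here is a minimal standalone sketch. It assumes the `options` passed to `update_status` is an array of positional values for the `%s` placeholders. `formatted_status` and `health_file_contents` are hypothetical helper names, not methods from this PR. The sketch copies the status hash before formatting so the shared constant keeps its placeholders.

```ruby
# Illustrative constant mirroring the shape used in the diff (frozen here for safety).
HTTP_ERROR = {healthy: false, last_error: 'NR-APM-004',
              message: 'HTTP error response code [%s] received from New Relic ' \
                       'while sending data type [%s]'}.freeze

# Hypothetical helper: returns a formatted copy and leaves the constant untouched.
def formatted_status(status, options = [])
  copy = status.dup # dup returns an unfrozen shallow copy
  copy[:message] = format(status[:message], *options) unless options.empty?
  copy
end

# Hypothetical helper: builds the health file body in the same field order as
# the heredoc in the diff, adding last_error only for unhealthy statuses.
def health_file_contents(status, start_time, status_time)
  lines = ["healthy: #{status[:healthy]}", "status: #{status[:message]}"]
  lines << "last_error: #{status[:last_error]}" unless status[:healthy]
  lines << "start_time_unix_nano: #{start_time}"
  lines << "status_time_unix_nano: #{status_time}"
  lines.join("\n") + "\n"
end

status = formatted_status(HTTP_ERROR, ['503', 'metric_data'])
output = health_file_contents(status, 1, 2)
```

Passing the array with a single splat fills the placeholders in order. A double splat on an array raises a `TypeError`, which the `rescue` in `update_message` would swallow, leaving the message unformatted.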
Waiting to hear if there's a product-approved changelog entry, so this may need to be updated.