Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AC/FC integration #2995

Open
wants to merge 9 commits into
base: dev
Choose a base branch
from
Open
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,10 @@

The agent now supports Ruby 3.4.0. We've made incremental changes throughout the preview stage to reach compatibility. This release includes an update to the Thread Profiler for compatibility with Ruby 3.4.0's new backtrace format. [Issue#2992](https://github.com/newrelic/newrelic-ruby-agent/issues/2992) [PR#2997](https://github.com/newrelic/newrelic-ruby-agent/pull/2997)

- **Feature: Add health checks when the agent runs within Agent Control**

When the agent is started with a within an agent control environment, automatic health check files will be created within the configured file destination at the configured frequency. [PR#2995](https://github.com/newrelic/newrelic-ruby-agent/pull/2995)
Comment on lines +9 to +11
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Waiting to hear if there's a product-approved changelog entry, so this may need to be updated.


- **Bugfix: Stop emitting inaccurate debug-level log about deprecated configuration options**

In the previous major release, we dropped support for `disable_<library_name>` configuration options in favor of `instrumentation.<library_name>`. Previously, a DEBUG level log warning appeared whenever `disable_*` options were set to `true`, even for libraries (e.g. Action Dispatch) without equivalent `instrumentation.*` options:
Expand Down
4 changes: 4 additions & 0 deletions lib/new_relic/agent/agent.rb
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@
require 'new_relic/coerce'
require 'new_relic/agent/autostart'
require 'new_relic/agent/harvester'
require 'new_relic/agent/health_check'
require 'new_relic/agent/hostname'
require 'new_relic/agent/new_relic_service'
require 'new_relic/agent/pipe_service'
Expand Down Expand Up @@ -88,6 +89,7 @@ def init_basics
end

def init_components
@health_check = HealthCheck.new
@service = NewRelicService.new
@events = EventListener.new
@stats_engine = StatsEngine.new
Expand Down Expand Up @@ -139,6 +141,8 @@ def instance
# Holds all the methods defined on NewRelic::Agent::Agent
# instances
module InstanceMethods
# the agent control health check file generator
attr_reader :health_check
# the statistics engine that holds all the timeslice data
attr_reader :stats_engine
# the transaction sampler that handles recording transactions
Expand Down
1 change: 1 addition & 0 deletions lib/new_relic/agent/agent_helpers/connect.rb
Original file line number Diff line number Diff line change
Expand Up @@ -198,6 +198,7 @@ def connect(options = {})
rescue NewRelic::Agent::UnrecoverableAgentException => e
handle_unrecoverable_agent_error(e)
rescue StandardError, Timeout::Error, NewRelic::Agent::ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've struggled to write tests to verify the status is updated when we expect it to be updated. If we think tests are valuable for these updates, I could use some help!

retry if retry_from_error?(e, opts)
rescue Exception => e
::NewRelic::Agent.logger.error('Exception of unexpected type during Agent#connect():', e)
Expand Down
3 changes: 3 additions & 0 deletions lib/new_relic/agent/agent_helpers/harvest.rb
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,7 @@ def send_data_to_endpoint(endpoint, payload, container)
rescue UnrecoverableServerException => e
NewRelic::Agent.logger.warn("#{endpoint} data was rejected by remote service, discarding. Error: ", e)
rescue ServerConnectionException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
log_remote_unavailable(endpoint, e)
container.merge!(payload)
rescue => e
Expand All @@ -133,9 +134,11 @@ def check_for_and_handle_agent_commands
rescue ForceRestartException, ForceDisconnectException
raise
rescue UnrecoverableServerException => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
NewRelic::Agent.logger.warn('get_agent_commands message was rejected by remote service, discarding. ' \
'Error: ', e)
rescue ServerConnectionException => e
NewRelic::Agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_CONNECT)
log_remote_unavailable(:get_agent_commands, e)
rescue => e
NewRelic::Agent.logger.info('Error during check_for_and_handle_agent_commands, will retry later: ', e)
Expand Down
1 change: 1 addition & 0 deletions lib/new_relic/agent/agent_helpers/shutdown.rb
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ def shutdown
revert_to_default_configuration

@started = nil
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::SHUTDOWN)
Control.reset
end

Expand Down
1 change: 1 addition & 0 deletions lib/new_relic/agent/agent_helpers/start_worker_thread.rb
Original file line number Diff line number Diff line change
Expand Up @@ -86,6 +86,7 @@ def handle_force_restart(error)
# is the worker thread that gathers data and talks to the
# server.
def handle_force_disconnect(error)
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FORCED_DISCONNECT)
::NewRelic::Agent.logger.warn('Agent received a ForceDisconnectException from the server, disconnecting. ' \
"(#{error.message})")
disconnect
Expand Down
5 changes: 5 additions & 0 deletions lib/new_relic/agent/agent_helpers/startup.rb
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ def check_config_and_start_agent
# Treatment of @started and env report is important to get right.
def setup_and_start_agent(options = {})
@started = true
@health_check.create_and_run_health_check_loop
@harvester.mark_started

unless in_resque_child_process?
Expand Down Expand Up @@ -129,6 +130,7 @@ def monitoring?
if Agent.config[:monitor_mode]
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::AGENT_DISABLED)
::NewRelic::Agent.logger.warn('Agent configured not to send data in this environment.')
false
end
Expand All @@ -140,6 +142,7 @@ def has_license_key?
if Agent.config[:license_key] && Agent.config[:license_key].length > 0
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::MISSING_LICENSE_KEY)
::NewRelic::Agent.logger.warn('No license key found. ' +
'This often means your newrelic.yml file was not found, or it lacks a section for the running ' \
"environment, '#{NewRelic::Control.instance.env}'. You may also want to try linting your newrelic.yml " \
Expand All @@ -160,6 +163,7 @@ def correct_license_length
if key.length == 40
true
else
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::INVALID_LICENSE_KEY)
::NewRelic::Agent.logger.error("Invalid license key: #{key}")
false
end
Expand All @@ -180,6 +184,7 @@ def agent_should_start?
end

unless app_name_configured?
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::MISSING_APP_NAME)
NewRelic::Agent.logger.error('No application name configured.',
'The agent cannot start without at least one. Please check your ',
'newrelic.yml and ensure that it is valid and has at least one ',
Expand Down
24 changes: 24 additions & 0 deletions lib/new_relic/agent/configuration/default_source.rb
Original file line number Diff line number Diff line change
Expand Up @@ -2188,6 +2188,30 @@ def self.notify
:transform => DefaultSource.method(:convert_to_constant_list),
:description => 'Specify a list of exceptions you do not want the agent to strip when [strip_exception_messages](#strip_exception_messages-enabled) is `true`. Separate exceptions with a comma. For example, `"ImportantException,PreserveMessageException"`.'
},
# Agent Control
:'agent_control.fleet_id' => {
:default => nil,
:allow_nil => true,
:public => true,
:type => String,
:allowed_from_server => false,
:description => 'This assigns a fleet id to the language agent. This id is generated by agent control. If this setting is present, it indicates the agent is running in a super agent/fleet environment and health file(s) will be generated.'
kaylareopelle marked this conversation as resolved.
Show resolved Hide resolved
},
:'agent_control.health.delivery_location' => {
:default => nil,
:allow_nil => true,
:public => true,
:type => String,
:allowed_from_server => false,
:description => 'A `file:` URI that specifies the fully qualified directory path for health file(s) to be written to. For example: `file:///var/lib/newrelic-super-agent/fleet/agents.d/<fleet_id>`. This configuration will be set by agent control, or one of its components, prior to agent startup.'
},
:'agent_control.health.frequency' => {
:default => 5,
:public => true,
:type => Integer,
:allowed_from_server => false,
:description => 'The interval, in seconds, of how often the health file(s) will be written to. This configuration will be set by agent control, or one of its components, prior to agent startup.'
},
# Thread profiler
:'thread_profiler.enabled' => {
:default => DefaultSource.thread_profiler_enabled,
Expand Down
2 changes: 2 additions & 0 deletions lib/new_relic/agent/configuration/yaml_source.rb
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,7 @@ def initialize(path, env)
erb_file = process_erb(raw_file)
config = process_yaml(erb_file, env, config, @file_path)
rescue ScriptError, StandardError => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_PARSE_CONFIG)
log_failure("Failed to read or parse configuration file at #{path}", e)
end

Expand Down Expand Up @@ -99,6 +100,7 @@ def process_erb(file)
file.gsub!(/^\s*#.*$/, '#')
ERB.new(file).result(binding)
rescue ScriptError, StandardError => e
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::FAILED_TO_PARSE_CONFIG)
message = 'Failed ERB processing configuration file. This is typically caused by a Ruby error in <% %> templating blocks in your newrelic.yml file.'
failure_array = [message, e]
failure_array << e.backtrace[0] if Gem::Version.new(RUBY_VERSION) >= Gem::Version.new('3.4.0')
Expand Down
125 changes: 125 additions & 0 deletions lib/new_relic/agent/health_check.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# This file is distributed under New Relic's license terms.
# See https://github.com/newrelic/newrelic-ruby-agent/blob/main/LICENSE for complete details.
# frozen_string_literal: true

module NewRelic
module Agent
class HealthCheck
def initialize
@start_time = nano_time
@fleet_id = ENV['NEW_RELIC_AGENT_CONTROL_FLEET_ID']
# The spec states file paths for the delivery location will begin with file://
# This does not create a valid path in Ruby, so remove the prefix when present
@delivery_location = ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION']&.gsub('file://', '')
@frequency = ENV['NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY'].to_i
kaylareopelle marked this conversation as resolved.
Show resolved Hide resolved
@continue = true
@status = HEALTHY
end

HEALTHY = {healthy: true, last_error: 'NR-APM-000', message: 'Healthy'}
INVALID_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-001', message: 'Invalid liense key (HTTP status code 401)'}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are any of these errors recoverable for the Ruby Agent? i.e. where we could potentially need to set the status back to HEALTHY?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question! Some of these errors are recoverable. From what I'm seeing in the code, a successful HTTP request made by the NewRelicService class to an external endpoint would erase the error state. As of now, this is the only place where I'll update the status to HEALTHY:
https://github.com/newrelic/newrelic-ruby-agent/pull/2995/files#diff-ecdbe8dde33106ec28a206dcdbfca4ea2c647caff89504a82c58dbceaf749e9aR644

However, I may have missed some scenarios and there could be other places where explicitly setting a HEALTHY status would demonstrate recovery.

MISSING_LICENSE_KEY = {healthy: false, last_error: 'NR-APM-002', message: 'License key missing in configuration'}
FORCED_DISCONNECT = {healthy: false, last_error: 'NR-APM-003', message: 'Forced disconnect received from New Relic (HTTP status code 410)'}
HTTP_ERROR = {healthy: false, last_error: 'NR-APM-004', message: 'HTTP error response code [%s] recevied from New Relic while sending data type [%s]'}
MISSING_APP_NAME = {healthy: false, last_error: 'NR-APM-005', message: 'Missing application name in agent configuration'}
APP_NAME_EXCEEDED = {healthy: false, last_error: 'NR-APM-006', message: 'The maximum number of configured app names (3) exceeded'}
PROXY_CONFIG_ERROR = {healthy: false, last_error: 'NR-APM-007', message: 'HTTP Proxy configuration error; response code [%s]'}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec defines these two error codes, but our agent doesn't have any behavior to recognize when these problems occur. As things stand currently, we don't need to update our agent to record these states. I left them here to make sure we match the spec, but would also be open to removing them, since the status will never be updated to use these constants.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes sense to leave them, to be complete. Plus, if we ever update the agent to handle those things, then these will already be here ready to use. And if we never do make those updates, at least in the future when we're comparing the code to the spec to look into something we won't be confused by why some would be missing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Do you think we would benefit from a comment stating they're not called anywhere?

AGENT_DISABLED = {healthy: false, last_error: 'NR-APM-008', message: 'Agent is disabled via configuration'}
FAILED_TO_CONNECT = {healthy: false, last_error: 'NR-APM-009', message: 'Failed to connect to New Relic data collector'}
FAILED_TO_PARSE_CONFIG = {healthy: false, last_error: 'NR-APM-010', message: 'Agent config file is not able to be parsed'}
SHUTDOWN = {healthy: true, last_error: 'NR-APM-099', message: 'Agent has shutdown'}

def create_and_run_health_check_loop
unless health_check_enabled?
@continue = false
end

return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_FLEET_ID not found, skipping health checks') unless @fleet_id
return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION not found, skipping health checks') unless @delivery_location
return NewRelic::Agent.logger.debug('NEW_RELIC_AGENT_CONTROL_HEALTH_FREQUENCY zero or less, skipping health checks') unless @frequency > 0

NewRelic::Agent.logger.debug('Agent control health check conditions met. Starting health checks.')
NewRelic::Agent.record_metric('Supportability/AgentControl/Health/enabled', 1)

Thread.new do
while @continue
begin
sleep @frequency
write_file
@continue = false if @status == SHUTDOWN
rescue StandardError => e
NewRelic::Agent.logger.error("Aborting agent control health check. Error raised: #{e}")
@continue = false
end
end
end
end

def update_status(status, options = [])
return unless @continue

@status = status
update_message(options) unless options.empty?
end

private

def contents
<<~CONTENTS
healthy: #{@status[:healthy]}
status: #{@status[:message]}#{last_error}
start_time_unix_nano: #{@start_time}
status_time_unix_nano: #{nano_time}
CONTENTS
end

def last_error
@status[:healthy] ? '' : "\nlast_error: #{@status[:last_error]}"
end

def nano_time
Process.clock_gettime(Process::CLOCK_REALTIME, :nanosecond)
end

def file_name
"health-#{NewRelic::Agent::GuidGenerator.generate_guid(32)}.yml"
end

def write_file
@path ||= create_file_path

File.write("#{@path}/#{file_name}", contents)
rescue StandardError => e
NewRelic::Agent.logger.error("Agent control health check raised an error while writing a file: #{e}")
@continue = false
end

def create_file_path
for abs_path in [File.expand_path(@delivery_location),
File.expand_path(File.join('', @delivery_location))] do
if File.directory?(abs_path) || (Dir.mkdir(abs_path) rescue nil)
return abs_path[%r{^(.*?)/?$}]
end
end
nil
rescue StandardError => e
NewRelic::Agent.logger.error(
'Agent control health check raised an error while finding or creating the file path defined in ' \
"NEW_RELIC_AGENT_CONTROL_HEALTH_DELIVERY_LOCATION: #{e}"
)
@continue = false
end

def health_check_enabled?
@fleet_id && @delivery_location && (@frequency > 0)
end

def update_message(options)
@status[:message] = sprintf(@status[:message], **options)
rescue StandardError => e
NewRelic::Agent.logger.debug("Error raised while updating agent control health check message: #{e}." \
"Reverting to original message.\noptions = #{options}, @status[:message] = #{@status[:message]}")
end
end
end
end
10 changes: 8 additions & 2 deletions lib/new_relic/agent/new_relic_service.rb
Original file line number Diff line number Diff line change
Expand Up @@ -455,6 +455,8 @@ def attempt_request(request, opts)
end

def handle_error_response(response, endpoint)
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::HTTP_ERROR, [response.code, endpoint])

case response
when Net::HTTPRequestTimeOut,
Net::HTTPTooManyRequests,
Expand Down Expand Up @@ -637,9 +639,13 @@ def check_post_size(post_string, endpoint)
def send_request(opts)
request = prep_request(opts)
response = relay_request(request, opts)
return response if response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPAccepted)

handle_error_response(response, opts[:endpoint])
if response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPAccepted)
NewRelic::Agent.agent.health_check.update_status(NewRelic::Agent::HealthCheck::HEALTHY)
response
else
handle_error_response(response, opts[:endpoint])
end
end

def log_response(response)
Expand Down
7 changes: 7 additions & 0 deletions test/agent_helper.rb
Original file line number Diff line number Diff line change
Expand Up @@ -115,6 +115,13 @@ def assert_log_contains(log, message)
"Could not find message: '#{message.inspect}'. Log contained: #{lines.join("\n")}"
end

def refute_log_contains(log, message)
lines = log.array

refute (lines.any? { |line| line.match(message) }),
"Found message: '#{message.inspect}'. Log contained: #{lines.join("\n")}"
end

def assert_audit_log_contains(audit_log_contents, needle)
# Original request bodies dumped to the log have symbol keys, but once
# they go through a dump/load, they're strings again, so we strip
Expand Down
3 changes: 3 additions & 0 deletions test/new_relic/agent/agent/start_test.rb
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ class NewRelic::Agent::Agent::StartTest < Minitest::Test

def setup
@harvester = stub('dummy harvester')
@health_check = stub('dummy health check')
@harvest_samplers = stub('dummy sampler collection')
end

Expand Down Expand Up @@ -62,6 +63,7 @@ def test_check_config_and_start_agent_forking
def test_check_config_and_start_agent_normal
@harvester.expects(:mark_started)
@harvest_samplers.expects(:load_samplers)
@health_check.expects(:create_and_run_health_check_loop)
self.expects(:start_worker_thread)
self.expects(:install_exit_handler)
self.expects(:environment_for_connect)
Expand All @@ -74,6 +76,7 @@ def test_check_config_and_start_agent_normal
def test_check_config_and_start_agent_sync
@harvester.expects(:mark_started)
@harvest_samplers.expects(:load_samplers)
@health_check.expects(:create_and_run_health_check_loop)
self.expects(:connect_in_foreground)
self.expects(:start_worker_thread)
self.expects(:install_exit_handler)
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,15 @@ class OrphanedConfigTest < Minitest::Test
include NewRelic::TestHelpers::ConfigScanning

# :automatic_custom_instrumentation_method_list - the tranform proc handles all processing, no other reference exists
IGNORED_KEYS = %i[automatic_custom_instrumentation_method_list]
# :'agent_control.fleet_id' - the config is set by environment variable in agent control, the symbol config is not used
# :'agent_control.health.delivery_location - the config is set by environment variable in agent control, the symbol config is not used
# :'agent_control.health.frequency' - the config is set by environment variable in agent control, the symbol config is not used
IGNORED_KEYS = %i[
automatic_custom_instrumentation_method_list
agent_control.fleet_id
agent_control.health.delivery_location
agent_control.health.frequency
]

def setup
@default_keys = ::NewRelic::Agent::Configuration::DEFAULTS.keys
Expand Down
Loading
Loading