
Merge pull request #15489 from newrelic/daily-release/Dec-12-2023-1_48
Daily release/dec 12 2023 1 48
akristen authored Dec 12, 2023
2 parents 30a8b9c + 4d5ce4a commit c1279c1
Showing 27 changed files with 203 additions and 171 deletions.
@@ -38,7 +38,6 @@ The following PHP versions are supported:
7.0 - 7.4
</td>
<td>
</td>
</tr>
<tr>
@@ -99,7 +98,7 @@ The following processors are supported:
When vendors announce end of life (such as on [Ubuntu's End of Standard Support page](https://wiki.ubuntu.com/Releases)), we will continue to support those latest versions for one year. However, if the PHP version you're using is no longer officially supported, then support could end sooner than one year.

This is why we recommend always using the latest version of the OS that is officially supported by the vendor.
The latest versions of our agent may work on OS versions that are past End of Life, but we no longer test or officially support the PHP agent with older versions.

The PHP agent supports the operating systems listed in the table below.

@@ -10,133 +10,148 @@ freshnessValidatedDate: 2023-08-10

import infrastructureNVIDIAGPUDashboard from 'images/infrastructure_screenshot-full_NVIDIA-GPU-dashboard.webp'


Our NVIDIA GPU integration lets you monitor the status of your GPUs. This integration uses our infrastructure agent with the Flex integration, which lets us access NVIDIA's SMI utility.

<img
title="NVIDIA GPUs dashboard"
alt="NVIDIA GPUs dashboard"
src={infrastructureNVIDIAGPUDashboard}
/>

<figcaption>
After you set up our NVIDIA GPU integration, we give you a dashboard for your GPU metrics.
</figcaption>

When you install, you'll get a pre-built dashboard containing crucial GPU metrics:

* GPU utilization
* ECC error counts
* Active compute processes
* Clock and performance states
* Temperature and fan speed
* Dynamic and static information about each supported device

<Steps>
<Step>
## Install the infrastructure agent

To capture data with New Relic, install our infrastructure agent. Our infrastructure agent collects and ingests data so you can keep track of your GPUs' performance.

You can install the infrastructure agent two different ways:

* Our [guided install](https://one.newrelic.com/nr1-core?state=4f81feab-35f7-e97e-9903-52510f8542bd) is a CLI tool that inspects your system and installs the infrastructure agent alongside the application monitoring agent that best works for your system. To learn more about how our guided install works, check out our [Guided install overview](/docs/infrastructure/host-integrations/installation/new-relic-guided-install-overview).
* If you'd rather install our infrastructure agent manually, you can follow a tutorial for manual installation for [Linux](/docs/infrastructure/install-infrastructure-agent/linux-installation/install-infrastructure-monitoring-agent-linux) or [Windows](/docs/infrastructure/install-infrastructure-agent/windows-installation/install-infrastructure-monitoring-agent-windows/).

</Step>
<Step>
## Configure Flex integration for NVIDIA GPUs

Flex comes bundled with the New Relic infrastructure agent and can be integrated with [NVIDIA SMI](https://developer.nvidia.com/nvidia-management-library-nvml), a command-line utility for monitoring NVIDIA GPU devices.

<Callout variant="important">
`nvidia-smi` ships pre-installed with NVIDIA GPU display drivers on Linux and Windows Server.
</Callout>

Follow these steps to configure Flex:

1. Create a file named `nvidia-smi-gpu-monitoring.yml` in this path:


```shell
sudo touch /etc/newrelic-infra/integrations.d/nvidia-smi-gpu-monitoring.yml
```

You can also download it from the [git repository](https://github.com/newrelic/nri-flex/blob/master/examples/nvidia-smi-gpu-monitoring.yml).

2. Update the `nvidia-smi-gpu-monitoring.yml` file with the integration config:


```yml
---
integrations:
- name: nri-flex
# interval: 30s
config:
name: NvidiaSMI
variable_store:
metrics:
"name,driver_version,count,serial,pci.bus_id,pci.domain,pci.bus,\
pci.device_id,pci.sub_device_id,pcie.link.gen.current,pcie.link.gen.max,\
pcie.link.width.current,pcie.link.width.max,index,display_mode,display_active,\
persistence_mode,accounting.mode,accounting.buffer_size,driver_model.current,\
driver_model.pending,vbios_version,inforom.img,inforom.oem,inforom.ecc,inforom.pwr,\
gom.current,gom.pending,fan.speed,pstate,clocks_throttle_reasons.supported,\
clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,\
clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,\
clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sw_thermal_slowdown,\
clocks_throttle_reasons.sync_boost,memory.total,memory.used,memory.free,compute_mode,\
utilization.gpu,utilization.memory,encoder.stats.sessionCount,encoder.stats.averageFps,\
encoder.stats.averageLatency,ecc.mode.current,ecc.mode.pending,ecc.errors.corrected.volatile.device_memory,\
ecc.errors.corrected.volatile.dram,ecc.errors.corrected.volatile.register_file,ecc.errors.corrected.volatile.l1_cache,\
ecc.errors.corrected.volatile.l2_cache,ecc.errors.corrected.volatile.texture_memory,ecc.errors.corrected.volatile.cbu,\
ecc.errors.corrected.volatile.sram,ecc.errors.corrected.volatile.total,ecc.errors.corrected.aggregate.device_memory,\
ecc.errors.corrected.aggregate.dram,ecc.errors.corrected.aggregate.register_file,ecc.errors.corrected.aggregate.l1_cache,\
ecc.errors.corrected.aggregate.l2_cache,ecc.errors.corrected.aggregate.texture_memory,ecc.errors.corrected.aggregate.cbu,\
ecc.errors.corrected.aggregate.sram,ecc.errors.corrected.aggregate.total,ecc.errors.uncorrected.volatile.device_memory,\
ecc.errors.uncorrected.volatile.dram,ecc.errors.uncorrected.volatile.register_file,ecc.errors.uncorrected.volatile.l1_cache,\
ecc.errors.uncorrected.volatile.l2_cache,ecc.errors.uncorrected.volatile.texture_memory,ecc.errors.uncorrected.volatile.cbu,\
ecc.errors.uncorrected.volatile.sram,ecc.errors.uncorrected.volatile.total,ecc.errors.uncorrected.aggregate.device_memory,\
ecc.errors.uncorrected.aggregate.dram,ecc.errors.uncorrected.aggregate.register_file,ecc.errors.uncorrected.aggregate.l1_cache,\
ecc.errors.uncorrected.aggregate.l2_cache,ecc.errors.uncorrected.aggregate.texture_memory,ecc.errors.uncorrected.aggregate.cbu,\
ecc.errors.uncorrected.aggregate.sram,ecc.errors.uncorrected.aggregate.total,retired_pages.single_bit_ecc.count,\
retired_pages.double_bit.count,retired_pages.pending,temperature.gpu,temperature.memory,power.management,power.draw,\
power.limit,enforced.power.limit,power.default_limit,power.min_limit,power.max_limit,clocks.current.graphics,clocks.current.sm,\
clocks.current.memory,clocks.current.video,clocks.applications.graphics,clocks.applications.memory,\
clocks.default_applications.graphics,clocks.default_applications.memory,clocks.max.graphics,clocks.max.sm,clocks.max.memory,\
mig.mode.current,mig.mode.pending"
apis:
- name: NvidiaGpu
commands:
- run: nvidia-smi --query-gpu=${var:metrics} --format=csv # update this if you have an alternate path
output: csv
rename_keys:
" ": ""
"\\[MiB\\]": ".MiB"
"\\[%\\]": ".percent"
"\\[W\\]": ".watts"
"\\[MHz\\]": ".MHz"
value_parser:
"clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie": '\d*\.?\d+'
'.': '\[N\/A\]|N\/A|Not Active|Disabled|Enabled|Default'
```
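The `rename_keys` and `value_parser` rules above are easier to understand with a worked example. This is a small Python sketch (not part of the integration; the sample header fields and row values are hypothetical) of how those rules turn nvidia-smi's CSV headers and values into the attribute names and numbers you'll see in New Relic:

```python
import re

# Hypothetical nvidia-smi CSV header fields and one matching data row.
headers = ["name", "utilization.gpu [%]", "memory.used [MiB]", "power.draw [W]"]
row = ["NVIDIA T4", "37 %", "1024 MiB", "28.15 W"]

# Mirrors the rename_keys rules in the Flex config above.
RENAME_KEYS = [(r" ", ""), (r"\[MiB\]", ".MiB"), (r"\[%\]", ".percent"),
               (r"\[W\]", ".watts"), (r"\[MHz\]", ".MHz")]

# Mirrors the value_parser rule: numeric-looking fields keep only the number.
NUMERIC_KEYS = r"clocks|power|fan|memory|temp|util|ecc|stats|gom|mig|count|pcie"

def rename(key: str) -> str:
    for pattern, replacement in RENAME_KEYS:
        key = re.sub(pattern, replacement, key)
    return key

def parse_value(key: str, value: str):
    if re.search(NUMERIC_KEYS, key):
        match = re.search(r"\d*\.?\d+", value)
        if match:
            return float(match.group())
    return value  # status strings such as "Enabled" or "N/A" pass through

sample = {rename(h): parse_value(rename(h), v) for h, v in zip(headers, row)}
print(sample)
# {'name': 'NVIDIA T4', 'utilization.gpu.percent': 37.0,
#  'memory.used.MiB': 1024.0, 'power.draw.watts': 28.15}
```

Flex applies equivalent transformations to each row before sending it to New Relic as an event.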
</Step>
<Step>
## Confirm GPU metrics are being ingested
The infrastructure agent automatically detects and executes the Flex configuration; there's no need to restart the agent. You can confirm metrics are being ingested by running this NRQL query:
```sql
SELECT * FROM NvidiaGpuSample
```

</Step>
<Step>
## Monitor your application

You can use our pre-built dashboard template to monitor your GPU metrics. Follow these steps:

1. Go to **[one.newrelic.com](https://one.newrelic.com/)** and click on **Dashboards**.
2. Click on the **Import dashboard** tab.
3. Copy the file content (`.json`) from the [NVIDIA GPU dashboard](https://raw.githubusercontent.com/newrelic/nri-flex/master/examples/nvidia-smi-gpu-monitoring-dashboard.json).
4. Select the target account where the dashboard needs to be imported.
5. Click on **Import dashboard** to confirm the action.

Your `NVIDIA GPU Monitoring` dashboard is considered a custom dashboard and can be found in the **Dashboards** UI. For docs on using and editing dashboards, see [our dashboard docs](/docs/query-your-data/explore-query-data/dashboards/introduction-dashboards).

Here is an NRQL query to view all the telemetry available:

```sql
SELECT * FROM NvidiaGpuSample
```
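You can also chart an individual field rather than the full sample. For example, assuming the attribute names produced by the `rename_keys` rules in the Flex config (such as `utilization.gpu.percent`, an assumption to adapt to your data), a query like this would plot average GPU utilization per device:

```sql
SELECT average(numeric(utilization.gpu.percent)) FROM NvidiaGpuSample FACET name TIMESERIES
```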

</Step>
</Steps>

## What's next? [#next]

You can adapt the Flex configuration to include or exclude information available from the NVIDIA SMI utility.

@@ -512,69 +512,13 @@ For an example of how all these variables can be used, see our [sample configura
</tbody>
</table>
</Collapser>
<Collapser
className="freq-link"
id="logging_retry-limit"
title="logging_retry_limit"
>
Enables [Agent Retry for Log Transmission](/docs/infrastructure/install-infrastructure-agent/manage-your-agent/infrastructure-agent-behavior/#retry) via the embedded Fluent Bit logging forwarder. An integer value sets the number of retries. Other possible values include `False`, which sets the number of retries to infinite, and `no_retries`, which turns off the retry functionality entirely.
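For example, a minimal sketch of the agent config file (the path is the usual default; the values shown are illustrative, not recommendations):

```yml
# Fragment of /etc/newrelic-infra.yml -- illustrative values
logging_retry_limit: 5           # retry sending each log chunk up to 5 times
# logging_retry_limit: False      # retry indefinitely
# logging_retry_limit: no_retries # disable the retry functionality
```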

<table>
<thead>
@@ -616,7 +560,7 @@ For an example of how all these variables can be used, see our [sample configura
</td>

<td>
`5`
</td>

<td>
@@ -279,7 +279,7 @@ txnLogger := log.WithContext(newrelic.NewContext(context.Background(), txn))
go get github.com/newrelic/go-agent/v3/integrations/logcontext-v2/nrzap
```

2. Import the `nrzap` package in the file where you initialize your Zap logger.

```go
import (
Expand Down
