Initial commit
trader-payne committed Oct 21, 2024
1 parent fa8237c commit eb65115
Showing 5 changed files with 1,442 additions and 2 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -0,0 +1,2 @@
/bin/
config.yaml
226 changes: 224 additions & 2 deletions README.md
@@ -1,2 +1,224 @@
# evm-loadbalancer
A simple load balancer for EVM chains that supports multiple networks and endpoints.
# StakeSquid EVM Loadbalancer

This is a load balancer for EVM nodes that supports multiple networks and endpoints.
It monitors and balances the load across local, monitoring, and fallback nodes based on chainhead, latency, and load metrics.
It integrates with Prometheus for monitoring.

## Features
- Load balancing across Ethereum nodes based on multiple factors (chainhead, latency, load)
- Supports multiple networks
- Prometheus integration for metrics collection
- Fallback mechanisms for node failures
- Rate-limited logging to reduce log noise
- Simple YAML-based configuration

# How it works
## Overview

The StakeSquid EVM Loadbalancer is designed to efficiently manage requests to Ethereum nodes across multiple networks by balancing load, monitoring node health, and ensuring high availability through failover mechanisms. It continuously monitors nodes, collects metrics, and selects the best node for routing based on configurable factors such as latency, load, and chainhead synchronization.

The core logic behind the load balancer is to provide a reliable and performant infrastructure for querying Ethereum nodes, even in the presence of node failures or network issues.

## Core Components

### 1. Node Monitoring
The load balancer continuously monitors the nodes (local, monitoring, and fallback) to collect real-time data about their health. This includes:

- **Chainhead**: The current block number reported by the node.
- **Latency**: The time taken to respond to requests.
- **Load** (optional): The server load, typically collected using external tools like node exporters.

Each node is periodically polled based on the `local_poll_interval` and `monitoring_poll_interval` defined in the configuration. If a node becomes unresponsive or lags behind the network's chainhead, it will be deprioritized.
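
A single poll could look roughly like the sketch below. It assumes the chainhead is read with the standard `eth_blockNumber` JSON-RPC method and that latency is the round-trip time of that call; the function names `pollNode` and `parseChainhead` are hypothetical and are not taken from the repository. A stand-in HTTP server replaces a real node so the example runs offline.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strconv"
	"strings"
	"time"
)

// parseChainhead converts a hex quantity such as "0x10" into a block number.
func parseChainhead(hexQuantity string) (uint64, error) {
	return strconv.ParseUint(strings.TrimPrefix(hexQuantity, "0x"), 16, 64)
}

// pollNode sends one eth_blockNumber request and reports chainhead and latency.
// (Hypothetical helper; the real poller may differ.)
func pollNode(ctx context.Context, client *http.Client, url string) (uint64, time.Duration, error) {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"eth_blockNumber","params":[]}`)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, 0, err
	}
	req.Header.Set("Content-Type", "application/json")

	start := time.Now()
	resp, err := client.Do(req)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	latency := time.Since(start)

	var reply struct {
		Result string `json:"result"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&reply); err != nil {
		return 0, latency, err
	}
	head, err := parseChainhead(reply.Result)
	return head, latency, err
}

func main() {
	// Stand-in node so the sketch runs without a real RPC endpoint.
	node := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, `{"jsonrpc":"2.0","id":1,"result":"0x10"}`)
	}))
	defer node.Close()

	// The 5s timeout mirrors the rpc_timeout value used in the example configuration.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	head, latency, err := pollNode(ctx, node.Client(), node.URL)
	fmt.Println(head, latency, err)
}
```

In a running balancer, a call like this would presumably be repeated on a ticker at `local_poll_interval` or `monitoring_poll_interval`.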

### 2. Load Balancing Logic
The core of the system's decision-making revolves around selecting the best node to route requests to based on multiple factors. These factors can be prioritized in the configuration, including:

- **Chainhead**: Nodes that are most in sync with the network are preferred. Nodes with higher block numbers are considered more up-to-date.

- **Latency**: Nodes with lower response times are prioritized for faster request handling.

- **Load**: If load tracking is enabled, nodes with lower resource usage are preferred to balance the load and avoid overloading a single node.

#### Selection Process:

1. **Chainhead Validation**:
   - The load balancer checks if nodes are within an acceptable block difference (`network_block_diff`) from the network chainhead. Nodes that are too far behind are excluded.

2. **Initial Selection**:
   - Nodes with the highest chainhead are selected first. If multiple nodes have the same chainhead, the load balancer will further prioritize based on the configured load balancing priority (latency or load).

3. **Refinement by Latency and Load**:
   - If two or more nodes have the same chainhead, the balancer selects the one with the lowest latency or load, based on the priority defined in the configuration (`load_balance_priority`).

4. **Local Node Prioritization**:
   - If a local node (i.e., a node in the `local_endpoints`) is up-to-date and responsive, it is generally prioritized over monitoring or fallback nodes.
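
The first three steps can be sketched as a filter followed by a sort. This is a minimal illustration under stated assumptions, not the repository's implementation: the `node` type and `selectBest` function are hypothetical, and the tie-break simply walks the `load_balance_priority` entries in order.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// node holds the last polled metrics for one endpoint (hypothetical type).
type node struct {
	URL       string
	Chainhead uint64
	Latency   time.Duration
	Load      float64
}

// selectBest drops nodes more than maxDiff blocks behind networkHead, then
// orders the rest by chainhead (highest first) and breaks ties using the
// criteria in priority ("latency" or "load"), in the order given.
func selectBest(nodes []node, networkHead, maxDiff uint64, priority []string) (node, bool) {
	var valid []node
	for _, n := range nodes {
		if n.Chainhead+maxDiff >= networkHead {
			valid = append(valid, n)
		}
	}
	if len(valid) == 0 {
		return node{}, false
	}
	sort.SliceStable(valid, func(i, j int) bool {
		a, b := valid[i], valid[j]
		if a.Chainhead != b.Chainhead {
			return a.Chainhead > b.Chainhead
		}
		for _, p := range priority {
			switch p {
			case "latency":
				if a.Latency != b.Latency {
					return a.Latency < b.Latency
				}
			case "load":
				if a.Load != b.Load {
					return a.Load < b.Load
				}
			}
		}
		return false
	})
	return valid[0], true
}

func main() {
	nodes := []node{
		{URL: "http://node-a:8545", Chainhead: 90, Latency: 5 * time.Millisecond, Load: 0.1},
		{URL: "http://node-b:8545", Chainhead: 100, Latency: 40 * time.Millisecond, Load: 0.2},
		{URL: "http://node-c:8545", Chainhead: 100, Latency: 12 * time.Millisecond, Load: 0.9},
	}
	// node-a is excluded (10 blocks behind with maxDiff 5); node-c wins the latency tie-break.
	best, ok := selectBest(nodes, 100, 5, []string{"latency", "load"})
	fmt.Println(best.URL, ok) // http://node-c:8545 true
}
```

Swapping the priority to `[]string{"load"}` would pick `node-b` instead, since it reports the lower load among the nodes at the highest chainhead.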

### 3. Failover Mechanism

Failover is a key part of the system’s resilience. The load balancer categorizes nodes into three types:

- **Local Endpoints**: These are the primary nodes, typically located in the same network environment as the balancer.
- **Monitoring Endpoints** (optional): These are external nodes used for monitoring purposes, and as a backup in case local nodes become unavailable.
- **Fallback Endpoints** (optional): These are the final fallback nodes in case both local and monitoring nodes are down or not in sync.

The balancer fails over to monitoring or fallback nodes when a local node is behind in blocks (beyond the `network_block_diff`), slow, or unresponsive:

- Monitoring nodes are considered only when local nodes fail.
- Fallback nodes are considered only when neither local nor monitoring nodes are suitable.

The failover process happens automatically without interruption in service, ensuring high availability. Once a previously failed node recovers, it can be reintroduced into the pool of valid nodes.
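
The tier order described above can be sketched as a walk over local, monitoring, and fallback endpoints that stops at the first tier with a usable node. The `endpoint` type and the `usable` and `pickTier` functions are hypothetical names for illustration; "usable" is assumed to mean "responded to the last poll and within `network_block_diff` of the network chainhead".

```go
package main

import "fmt"

// endpoint is a hypothetical view of one node's health.
type endpoint struct {
	URL       string
	Chainhead uint64
	Healthy   bool // false when the last poll failed or timed out
}

// usable returns the endpoints that responded and are within maxDiff
// blocks of networkHead.
func usable(tier []endpoint, networkHead, maxDiff uint64) []endpoint {
	var out []endpoint
	for _, e := range tier {
		if e.Healthy && e.Chainhead+maxDiff >= networkHead {
			out = append(out, e)
		}
	}
	return out
}

// pickTier walks the tiers in order and returns the index and candidates of
// the first tier that has at least one usable endpoint, or -1 if none does.
func pickTier(networkHead, maxDiff uint64, tiers ...[]endpoint) (int, []endpoint) {
	for i, tier := range tiers {
		if candidates := usable(tier, networkHead, maxDiff); len(candidates) > 0 {
			return i, candidates
		}
	}
	return -1, nil
}

func main() {
	local := []endpoint{{URL: "http://localhost:8545", Chainhead: 80, Healthy: true}}
	monitoring := []endpoint{{URL: "http://monitoring-node:8546", Chainhead: 100, Healthy: true}}
	fallback := []endpoint{{URL: "http://fallback-node:8547", Chainhead: 100, Healthy: true}}

	// The local node is 20 blocks behind, so the monitoring tier (index 1) is used.
	tier, candidates := pickTier(100, 5, local, monitoring, fallback)
	fmt.Println(tier, candidates[0].URL) // 1 http://monitoring-node:8546
}
```

Because the tiers are re-evaluated on every call, a local node that catches up is picked again on the next selection, which matches the recovery behavior described above.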

### 4. Prometheus Metrics

Prometheus metrics are used to monitor the load balancer's performance and the health of the nodes. Metrics are exposed on a dedicated `/metrics` endpoint and can be scraped by a Prometheus server for detailed monitoring and alerting.

Key metrics include:

- **Latency (`loadbalancer_node_latency_seconds`)**: Measures the response time of each node.
- **Chainhead (`loadbalancer_node_chainhead`)**: Monitors the current block number for each node.
- **Best Endpoint (`loadbalancer_best_endpoint`)**: Indicates the currently selected node for a network.
- **Request Count and Duration (`loadbalancer_requests_total`, `loadbalancer_request_duration_seconds`)**: Track the number of requests and how long they take to process.

### 5. Proxying Requests

When a request is made to the load balancer, it forwards the request to the best node based on the above logic. The request is routed via an HTTP reverse proxy to the selected Ethereum node.

- If a node fails during the proxying, the load balancer will try another node, ensuring minimal downtime.
- The balancer maintains a cache of reverse proxies to avoid repeatedly creating new ones for the same endpoint, optimizing performance.
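
A proxy cache of this kind can be sketched with the standard library's `net/http/httputil` package. The `proxyCache` type and its `get` method are hypothetical; the sketch only assumes that one `ReverseProxy` is created per endpoint URL and reused afterwards.

```go
package main

import (
	"fmt"
	"net/http/httputil"
	"net/url"
	"sync"
)

// proxyCache keeps one reverse proxy per endpoint URL (hypothetical type).
type proxyCache struct {
	mu      sync.Mutex
	proxies map[string]*httputil.ReverseProxy
}

func newProxyCache() *proxyCache {
	return &proxyCache{proxies: make(map[string]*httputil.ReverseProxy)}
}

// get returns the cached proxy for endpoint, creating it on first use.
func (c *proxyCache) get(endpoint string) (*httputil.ReverseProxy, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if p, ok := c.proxies[endpoint]; ok {
		return p, nil
	}
	target, err := url.Parse(endpoint)
	if err != nil {
		return nil, err
	}
	p := httputil.NewSingleHostReverseProxy(target)
	c.proxies[endpoint] = p
	return p, nil
}

func main() {
	cache := newProxyCache()
	first, _ := cache.get("http://localhost:8545")
	second, _ := cache.get("http://localhost:8545")
	fmt.Println(first == second) // true: the second call reuses the cached proxy
}
```

The returned proxy implements `http.Handler`, so a request handler can call `ServeHTTP` on it once the best endpoint has been chosen.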

## Example Load Balancing Flow

1. A request comes in for the Ethereum Mainnet.
2. The load balancer checks the status of all configured local nodes.
3. It finds that one local node is behind in blocks, so it skips that node.
4. The balancer then checks the remaining local nodes and finds two nodes with the same chainhead, but one has lower latency.
5. The balancer selects the node with the lower latency and forwards the request to it via an HTTP reverse proxy.
6. Metrics are updated (request count, latency, etc.), and the response is sent back to the client.


# Installation

## Prerequisites
- Go 1.23 or higher (the module's `go.mod` declares `go 1.23.0`)
- Prometheus (optional): For monitoring the load balancer and nodes.
- Access to EVM RPC nodes

### Build from source

Clone the repository
```bash
git clone https://github.com/StakeSquid/evm-loadbalancer
```

Enter the directory
```bash
cd evm-loadbalancer
```

Build the binary
```bash
go build -o bin/loadbalancer main.go
```
This will generate the `loadbalancer` executable in the `bin` directory.

## Configuration
The load balancer is configured via a YAML file. A sample configuration file (config.yaml) is provided in the repository. The key configuration parameters include:
### Main Configurations
- `port`: The port on which the load balancer listens for incoming requests.
- `log_level`: Logging verbosity. Can be ERROR, INFO, or DEBUG.
- `log_rate_limit`: Time duration between log messages to prevent excessive logging.
- `metrics_port`: Port on which Prometheus metrics are served.
### Network Configuration
For each network you want to monitor and balance across nodes:

- `name`: A unique name for the network.
- `local_endpoints`: List of local Ethereum RPC endpoints.
- `monitoring_endpoints` (optional): List of external monitoring Ethereum RPC endpoints.
- `fallback_endpoints` (optional): List of fallback Ethereum RPC endpoints.
- `load_balance_priority` (optional): A list of load-balancing criteria. Options are `latency`, `load`, and `chainhead`.
- `load_period` (optional): Time window to measure server load.
- `local_poll_interval`: How often to poll local nodes (in Go duration format, e.g., `1s`, `5m`).
- `monitoring_poll_interval` (optional): How often to poll monitoring nodes.
- `network_block_diff`: Max block difference allowed between nodes and the network chainhead.
- `use_load_tracker` (optional): Whether to monitor server load (requires node exporters or similar setups).
- `rpc_timeout`: Timeout duration for RPC calls.
- `rpc_retries`: Number of retries for RPC requests.
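
Interval and timeout values use Go duration strings, which the standard library parses with `time.ParseDuration`. The sketch below shows how such a value could be validated before use; `intervalSeconds` is a hypothetical helper, and rejecting non-positive intervals is an assumption rather than documented behavior.

```go
package main

import (
	"fmt"
	"time"
)

// intervalSeconds parses a Go duration string such as "10s" or "5m" and
// rejects non-positive values. (Hypothetical helper for illustration.)
func intervalSeconds(value string) (float64, error) {
	d, err := time.ParseDuration(value)
	if err != nil {
		return 0, err
	}
	if d <= 0 {
		return 0, fmt.Errorf("interval must be positive, got %s", value)
	}
	return d.Seconds(), nil
}

func main() {
	// "10" has no unit, so time.ParseDuration returns an error for it.
	for _, v := range []string{"1s", "5m", "10"} {
		secs, err := intervalSeconds(v)
		fmt.Println(v, secs, err)
	}
}
```

A bare number such as `10` is not a valid Go duration, so values like `local_poll_interval` and `rpc_timeout` need a unit suffix.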
#### Note on Optional Parameters
Parameters marked as **optional** can be omitted. The load balancer will function without them, using default behaviors.
For example, if `monitoring_endpoints` or `fallback_endpoints` are not provided, the load balancer will rely solely on the `local_endpoints`.
If `load_balance_priority` is not specified, the default priority is to use the chainhead for node selection.

## Example configuration
```yaml
port: "8080" # Port on which the load balancer will listen for incoming requests
log_level: "INFO" # Logging level: DEBUG, INFO, or ERROR
log_rate_limit: 10s # Rate limit for logging repeated messages
metrics_port: "9101" # Port for exposing Prometheus metrics

# Define multiple networks that the load balancer will handle
networks:
  - name: "mainnet" # Network name, used in request paths
    local_endpoints: # List of local node endpoints (primary nodes)
      - "http://localhost:8545"
    monitoring_endpoints: # (Optional) List of monitoring node endpoints
      - "http://monitoring-node:8546"
    fallback_endpoints: # (Optional) List of fallback node endpoints
      - "http://fallback-node:8547"
    load_balance_priority: # (Optional) Priority for load balancing: latency, load, chainhead
      - "latency"
      - "load"
    load_period: 10 # (Optional) Load tracking period (in seconds)
    local_poll_interval: "10s" # Interval for polling local nodes for status
    monitoring_poll_interval: "30s" # (Optional) Interval for polling monitoring nodes
    network_block_diff: 5 # Allowed block difference between network and node
    use_load_tracker: true # (Optional) Enable or disable load tracking
    rpc_timeout: "5s" # Timeout for RPC calls to nodes
    rpc_retries: 3 # Number of retries for failed RPC calls

  - name: "ropsten" # Another network configuration (e.g., a testnet)
    local_endpoints:
      - "http://localhost:8548"
    # monitoring_endpoints and fallback_endpoints can be omitted
    local_poll_interval: "15s"
    network_block_diff: 10
    rpc_timeout: "10s"
    rpc_retries: 5
```

## Running the Load Balancer
After building the project and setting up the configuration file, you can run the load balancer as follows:

```bash
./bin/loadbalancer --config config.yaml
```

### Command-line Options
- `--config`: Path to the configuration YAML file (default: `config.yaml`).

## Metrics
The load balancer exposes Prometheus metrics for monitoring. By default, these are served on `http://localhost:9101/metrics`. Metrics include:

- `loadbalancer_node_latency_seconds`: Latency to the nodes in seconds.
- `loadbalancer_node_chainhead`: Current block number at the node.
- `loadbalancer_requests_total`: Total number of requests handled.
- `loadbalancer_request_duration_seconds`: Histogram of request durations.
- `loadbalancer_best_endpoint`: Tracks the best endpoint chosen for each network.

## Monitoring
To monitor the load balancer and nodes, configure your Prometheus server to scrape the /metrics endpoint:

```yaml
scrape_configs:
  - job_name: 'evm-loadbalancer'
    static_configs:
      - targets: ['localhost:9101']
```
## Logging
Logs can be emitted at different verbosity levels:
- `ERROR`: Only log errors.
- `INFO`: Log key information such as startup and major events.
- `DEBUG`: Verbose logging of all events, including chainhead updates.

## License
This project is licensed under the MIT License. See the LICENSE file for details.
32 changes: 32 additions & 0 deletions go.mod
@@ -0,0 +1,32 @@
module loadbalancer

go 1.23.0

require (
github.com/ethereum/go-ethereum v1.14.11
github.com/prometheus/client_golang v1.12.0
gopkg.in/yaml.v2 v2.4.0
)

require (
github.com/Microsoft/go-winio v0.6.2 // indirect
github.com/StackExchange/wmi v1.2.1 // indirect
github.com/beorn7/perks v1.0.1 // indirect
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/deckarep/golang-set/v2 v2.6.0 // indirect
github.com/go-ole/go-ole v1.3.0 // indirect
github.com/golang/protobuf v1.5.4 // indirect
github.com/gorilla/websocket v1.4.2 // indirect
github.com/holiman/uint256 v1.3.1 // indirect
github.com/matttproud/golang_protobuf_extensions v1.0.2-0.20181231171920-c182affec369 // indirect
github.com/prometheus/client_model v0.2.1-0.20210607210712-147c58e9608a // indirect
github.com/prometheus/common v0.32.1 // indirect
github.com/prometheus/procfs v0.7.3 // indirect
github.com/shirou/gopsutil v3.21.4-0.20210419000835-c7a38de76ee5+incompatible // indirect
github.com/tklauser/go-sysconf v0.3.12 // indirect
github.com/tklauser/numcpus v0.6.1 // indirect
golang.org/x/crypto v0.22.0 // indirect
golang.org/x/exp v0.0.0-20231110203233-9a3e6036ecaa // indirect
golang.org/x/sys v0.22.0 // indirect
google.golang.org/protobuf v1.34.2 // indirect
)