ChainOS Node Monitoring

Monitoring Overview

Effective monitoring is essential for maintaining reliable ChainOS nodes. This guide covers monitoring setup, important metrics, alerting, and visualization tools.

Why Monitor Your Node?

Proper monitoring lets you catch problems such as a stalled sync, missed blocks, or a full disk before they take your node offline.

Monitoring Stack

We recommend the following monitoring stack for ChainOS nodes:

Recommended Components

  • Prometheus: Time-series database for metrics collection
  • Node Exporter: System metrics exporter for Linux servers
  • ChainOS Exporter: Custom metrics exporter for ChainOS-specific metrics
  • Grafana: Visualization and dashboarding platform
  • Alertmanager: Alert handling and notification system

Installation and Setup

Follow these steps to set up a comprehensive monitoring system:

Prometheus Setup

# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvfz prometheus-2.40.0.linux-amd64.tar.gz
cd prometheus-2.40.0.linux-amd64/

# Create a Prometheus configuration file
cat > prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']

  - job_name: 'chainos'
    static_configs:
      - targets: ['localhost:26660']
EOF

# Create a systemd service file
sudo tee /etc/systemd/system/prometheus.service > /dev/null << EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \\
    --config.file=/etc/prometheus/prometheus.yml \\
    --storage.tsdb.path=/var/lib/prometheus/ \\
    --web.console.templates=/etc/prometheus/consoles \\
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target
EOF

# Create prometheus user and directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
# Console files and the config go where the service file expects them
sudo cp -r consoles console_libraries prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}

# Start and enable Prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
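
Before relying on the service, it can help to validate the configuration and confirm that the server answers. The following is a minimal sketch, assuming promtool is on the PATH and Prometheus listens on its default address of localhost:9090. The function name check_prometheus is illustrative and not part of any tool.

```shell
# check_prometheus: validate the config file, then probe the health endpoint.
# Both arguments are optional; the defaults are assumptions based on the
# paths used in this guide.
#   $1 - path to prometheus.yml
#   $2 - base URL of the Prometheus server
check_prometheus() {
    local config="${1:-/etc/prometheus/prometheus.yml}"
    local url="${2:-http://localhost:9090}"

    # promtool ships in the Prometheus tarball; "check config" parses the
    # file and any rule files it references.
    if ! promtool check config "$config"; then
        echo "config check failed: $config" >&2
        return 1
    fi

    # /-/healthy answers with HTTP 200 once the server is up;
    # curl -f turns any HTTP error status into a non-zero exit code.
    if ! curl -fsS "$url/-/healthy" > /dev/null; then
        echo "Prometheus is not answering at $url" >&2
        return 1
    fi
    echo "Prometheus OK"
}
```

Calling check_prometheus with no arguments uses the defaults. If it fails, sudo journalctl -u prometheus usually shows the reason.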

Node Exporter Setup

# Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.4.0.linux-amd64.tar.gz
cd node_exporter-1.4.0.linux-amd64/

# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

# Create node_exporter user
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Start and enable Node Exporter
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
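
Once the services are installed, a quick status check confirms that they are running. This is a sketch using a hypothetical helper named check_units; it relies only on systemctl is-active, which exits with status 0 when a unit is active.

```shell
# check_units: report whether each named systemd unit is active.
# Returns non-zero if at least one unit is not active.
check_units() {
    local failed=0
    local unit
    for unit in "$@"; do
        # --quiet suppresses the state text; only the exit status is used.
        if systemctl is-active --quiet "$unit"; then
            echo "$unit: active"
        else
            echo "$unit: NOT active"
            failed=1
        fi
    done
    return "$failed"
}
```

For example, check_units prometheus node_exporter prints one line per service. Node Exporter itself can be probed with curl -s http://localhost:9100/metrics, which should return plain-text metrics.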

ChainOS Metrics Configuration

ChainOS exposes metrics via a Prometheus endpoint. Enable it in your config.toml file:

# In ~/.chainosd/config/config.toml
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "chainos"
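
After the node has been restarted with this setting, the endpoint should serve metrics whose names begin with the configured namespace. The sketch below counts them; count_metrics is a hypothetical helper that reads the Prometheus text format on standard input.

```shell
# count_metrics: count sample lines on stdin whose metric name starts with
# the given namespace prefix. HELP and TYPE comment lines begin with "#"
# and are therefore not counted.
#   $1 - namespace, e.g. chainos
count_metrics() {
    local namespace="$1"
    # grep -c prints 0 and exits 1 when nothing matches; "|| true" keeps the
    # function from reporting that as a failure.
    grep -c "^${namespace}_" || true
}
```

A typical call is curl -s http://localhost:26660/metrics | count_metrics chainos. A result of 0 suggests that the endpoint is not enabled or that the namespace differs from the one configured.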

Grafana Setup

# Install Grafana
sudo apt-get install -y apt-transport-https software-properties-common
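# apt-key is deprecated; store the signing key in a dedicated keyring instead
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list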
sudo apt-get update
sudo apt-get install -y grafana

# Start and enable Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

Alertmanager Setup

# Install Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.24.0.linux-amd64.tar.gz
cd alertmanager-0.24.0.linux-amd64/

# Create Alertmanager configuration
cat > alertmanager.yml << EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'

receivers:
- name: 'email'
  email_configs:
  - to: 'your-email@example.com'
    send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF

# Create a systemd service file
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \\
    --config.file=/etc/alertmanager/alertmanager.yml \\
    --storage.path=/var/lib/alertmanager

[Install]
WantedBy=multi-user.target
EOF

# Create alertmanager user and directories
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp alertmanager /usr/local/bin/
sudo cp alertmanager.yml /etc/alertmanager/
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager

# Start and enable Alertmanager
sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager
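
To confirm that notifications are actually delivered, a synthetic alert can be posted to Alertmanager's v2 HTTP API. This is a sketch only: the function names, the alert name PipelineTest, and the label values are made up for illustration, and the default URL assumes the local setup above.

```shell
# test_alert_json: print a JSON body for POST /api/v2/alerts.
#   $1 - alert name, $2 - severity label, $3 - instance label
test_alert_json() {
    printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"Test alert, please ignore"}}]\n' "$1" "$2" "$3"
}

# send_test_alert: POST the test alert to a running Alertmanager.
#   $1 - Alertmanager base URL (optional)
send_test_alert() {
    local url="${1:-http://localhost:9093}"
    # --data-binary @- reads the request body from stdin.
    test_alert_json "PipelineTest" "warning" "test-instance" |
        curl -fsS -X POST -H 'Content-Type: application/json' \
             --data-binary @- "$url/api/v2/alerts"
}
```

With the route shown above, the email should arrive after the 30s group_wait. The Alertmanager tarball also contains amtool, and running ./amtool check-config alertmanager.yml validates the file before it is copied into place.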

Important Metrics to Monitor

Here are the key metrics you should monitor for your ChainOS node:

System Metrics

CPU Usage

Monitor CPU utilization to ensure your node has sufficient processing power:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Alert threshold: >80% sustained for 5 minutes
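
The expressions in this section can be tried from the command line before they go into a dashboard or an alert rule. The sketch below wraps the Prometheus instant-query endpoint (/api/v1/query); prom_query is a hypothetical helper name, and the default URL assumes Prometheus runs locally.

```shell
# prom_query: run an instant PromQL query against the Prometheus HTTP API
# and print the raw JSON response.
#   $1 - PromQL expression
#   $2 - Prometheus base URL (optional)
prom_query() {
    local expr="$1"
    local url="${2:-http://localhost:9090}"
    # -G makes curl append the url-encoded data to the URL as a GET query.
    curl -fsS -G --data-urlencode "query=${expr}" "$url/api/v1/query"
}
```

For example, prom_query 'up' lists every scrape target with a value of 1 (reachable) or 0 (down). Single quotes around the expression keep the shell from interpreting braces and double quotes inside it.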

Memory Usage

Track memory usage to prevent swapping and OOM kills:

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

Alert threshold: <10% available memory for 5 minutes

Disk Usage

Monitor disk space to prevent running out of storage:

node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100

Alert threshold: <10% available space
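
The same percentage can be cross-checked on the host itself. This sketch shows the arithmetic behind the expression and reads the figures from df; both function names are illustrative. The Available column of df counts space usable by unprivileged users, which is also what node_filesystem_avail_bytes reports.

```shell
# pct_available: integer percentage of available space, given the available
# and total sizes in the same unit. The result is rounded down.
pct_available() {
    local avail="$1"
    local size="$2"
    echo $(( avail * 100 / size ))
}

# mount_pct_available: the same figure for a mount point, read from df.
# -P forces one line per filesystem; -k reports sizes in 1K blocks.
# In that layout column 2 is the total size and column 4 the available space.
mount_pct_available() {
    local mount="${1:-/}"
    df -Pk "$mount" | awk 'NR == 2 { printf "%d\n", $4 * 100 / $2 }'
}
```

For instance, 10 GB available on a 100 GB volume gives pct_available 10 100, which prints 10 and sits exactly at the alert threshold.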

Disk I/O

Track disk I/O to identify potential bottlenecks:

rate(node_disk_io_time_seconds_total[1m]) * 100

Alert threshold: >80% disk utilization for 15 minutes

Network Traffic

Monitor network traffic to ensure sufficient bandwidth:

rate(node_network_receive_bytes_total{device!="lo"}[1m])
rate(node_network_transmit_bytes_total{device!="lo"}[1m])

Alert threshold: >80% of available bandwidth
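
The two expressions return bytes per second, while the threshold is stated as a share of link capacity, so a conversion is needed. A minimal sketch of that arithmetic follows; bandwidth_pct is a hypothetical helper, and it expects whole numbers because shell arithmetic is integer-only.

```shell
# bandwidth_pct: convert a byte rate into a percentage of link capacity.
#   $1 - observed rate in bytes per second (integer)
#   $2 - link speed in megabits per second (integer)
bandwidth_pct() {
    local bytes_per_sec="$1"
    local link_mbps="$2"
    # bytes -> bits (x8), megabits -> bits (x1000000), then scale to percent.
    echo $(( bytes_per_sec * 8 * 100 / (link_mbps * 1000000) ))
}
```

On a 1 Gbit/s link, a sustained 100000000 bytes per second works out to bandwidth_pct 100000000 1000, which prints 80.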

ChainOS-Specific Metrics

Node Sync Status

Track whether your node is in sync with the network:

chainos_consensus_height
chainos_p2p_peers

Alert threshold: Height not increasing for 5 minutes or peer count <3
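
A stalled height can also be detected directly from the metrics endpoint, without Prometheus, by sampling the value twice. The sketch below assumes the metric names shown above and an endpoint that emits no timestamps after the value; metric_value and height_advanced are hypothetical helpers.

```shell
# metric_value: read Prometheus text-format metrics on stdin and print the
# value of the first sample named $1. Labels, if present, are ignored.
# The value is printed as a plain integer because exporters often write
# large numbers in scientific notation (for example 1.234567e+06).
metric_value() {
    awk -v name="$1" '
        /^#/ { next }    # skip HELP and TYPE comment lines
        {
            sample = $1
            i = index(sample, "{")
            if (i > 0) sample = substr(sample, 1, i - 1)
            if (sample == name) { printf "%.0f\n", $NF; exit }
        }'
}

# height_advanced: succeed if the second height is greater than the first.
#   $1 - earlier height, $2 - later height
height_advanced() {
    awk -v prev="$1" -v cur="$2" 'BEGIN { exit !(cur + 0 > prev + 0) }'
}
```

In practice, store the output of curl -s http://localhost:26660/metrics | metric_value chainos_consensus_height in a variable, wait a few block times, sample again, and pass both values to height_advanced.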

Validator Performance (for validators)

Monitor validator signing performance:

chainos_consensus_validator_missed_blocks
chainos_consensus_validator_power

Alert threshold: Any missed blocks or power change

Transaction Throughput

Monitor transaction processing:

deriv(chainos_mempool_size[5m])
rate(chainos_mempool_tx_size_bytes[5m])

Alert threshold: Mempool size consistently growing

Consensus Rounds

Track consensus performance:

chainos_consensus_rounds
chainos_consensus_num_txs

Alert threshold: Multiple rounds per block

Alert Rules

Create a Prometheus alert rules file to notify you of important issues:

# /etc/prometheus/rules/chainos_alerts.yml
groups:
- name: chainos
  rules:
  - alert: NodeExporterDown
    expr: up{job="node_exporter"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node Exporter down on {{ $labels.instance }}"
      description: "Node Exporter has been down for more than 5 minutes."

  - alert: ChainOSNodeDown
    expr: up{job="chainos"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "ChainOS node down on {{ $labels.instance }}"
      description: "ChainOS node has been down for more than 5 minutes."

  - alert: ChainOSNodeNotSyncing
    expr: increase(chainos_consensus_height[10m]) < 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "ChainOS node not syncing on {{ $labels.instance }}"
      description: "ChainOS node has not increased in height for 10 minutes."

  - alert: LowPeerCount
    expr: chainos_p2p_peers < 3
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Low peer count on {{ $labels.instance }}"
      description: "ChainOS node has fewer than 3 peers."

  - alert: HighCPULoad
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU load on {{ $labels.instance }}"
      description: "CPU load is above 80% for 5 minutes."

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for 5 minutes."

  - alert: DiskSpaceRunningOut
    expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Disk space running out on {{ $labels.instance }}"
      description: "Disk space is below 10%."

  - alert: ValidatorMissedBlocks
    expr: increase(chainos_consensus_validator_missed_blocks[1h]) > 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Validator missed blocks on {{ $labels.instance }}"
      description: "Validator has missed blocks in the last hour."
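
A syntax error in a rules file prevents Prometheus from loading it, so it is worth validating the file before installing it. The sketch below uses promtool check rules, which ships with Prometheus; install_rules is a hypothetical helper, and its default destination assumes the rule_files path from the configuration in this guide.

```shell
# install_rules: validate an alert rules file with promtool and, only if it
# passes, copy it into the rules directory.
#   $1 - rules file to install
#   $2 - destination directory (optional)
install_rules() {
    local src="$1"
    local dest="${2:-/etc/prometheus/rules}"

    # "promtool check rules" catches YAML and PromQL syntax errors.
    promtool check rules "$src" || return 1

    mkdir -p "$dest" && cp "$src" "$dest/"
}
```

Writing to /etc/prometheus requires root. After the file is in place, sudo systemctl restart prometheus makes the server load the new rules.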

Grafana Dashboards

Set up Grafana dashboards to visualize your metrics:

Setting Up Dashboards

  1. Access Grafana at http://your-server-ip:3000 (default credentials: admin/admin)
  2. Add Prometheus as a data source:
    • Go to Configuration > Data Sources > Add data source
    • Select Prometheus
    • Set URL to http://localhost:9090
    • Click "Save & Test"
  3. Import dashboards:
    • Go to Create > Import
    • Enter dashboard ID or upload JSON file
    • Select your Prometheus data source
    • Click "Import"
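
Step 2 can also be done without the UI, because Grafana reads data source definitions from provisioning files at startup. The following is a sketch under stated assumptions: the default directory is the one used by Grafana's deb package, and the function name and the file name prometheus.yaml are illustrative.

```shell
# write_datasource: write a Grafana data source provisioning file that
# registers Prometheus as the default data source.
#   $1 - provisioning directory (optional)
#   $2 - Prometheus URL (optional)
write_datasource() {
    local dir="${1:-/etc/grafana/provisioning/datasources}"
    local prom_url="${2:-http://localhost:9090}"
    cat > "$dir/prometheus.yaml" << EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: ${prom_url}
    isDefault: true
EOF
}
```

Writing to the default directory requires root, and sudo systemctl restart grafana-server makes Grafana pick up the file.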

Here are some recommended Grafana dashboards for ChainOS monitoring:

Dashboard JSON Files

You can find our custom Grafana dashboard JSON files in the monitoring directory of our GitHub repository.

Notification Channels

Configure Alertmanager to send notifications through various channels:

Email Notifications

Configure email notifications in Alertmanager:

global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true

receivers:
- name: 'email'
  email_configs:
  - to: 'your-email@example.com'
    send_resolved: true

Slack Notifications

Configure Slack notifications in Alertmanager:

receivers:
- name: 'slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
    channel: '#monitoring'
    send_resolved: true

Telegram Notifications

Configure Telegram notifications in Alertmanager:

receivers:
- name: 'telegram'
  telegram_configs:
  - bot_token: 'your-telegram-bot-token'
    chat_id: 123456789
    parse_mode: 'HTML'
    send_resolved: true
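
Each snippet above defines a receiver, but an alert only reaches a receiver that the route tree points to, and all receivers must sit under a single receivers key in alertmanager.yml. The sketch below prints a route section that sends critical alerts to one receiver and everything else to a default one. The function name is hypothetical, and the matchers syntax assumes Alertmanager 0.22 or later.

```shell
# severity_routes: print an Alertmanager "route" section.
#   $1 - default receiver name
#   $2 - receiver name for alerts labelled severity="critical"
severity_routes() {
    local default_receiver="$1"
    local critical_receiver="$2"
    cat << EOF
route:
  group_by: ['alertname', 'instance']
  receiver: '${default_receiver}'
  routes:
    - matchers:
        - severity="critical"
      receiver: '${critical_receiver}'
EOF
}
```

For example, severity_routes email telegram keeps email as the default and pages through Telegram for critical alerts. The output replaces the existing route section; both receivers must be defined in the same file.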

Monitoring Best Practices

Follow these best practices for effective node monitoring:

Tiered Alert Severity

Categorize alerts by severity to prioritize responses:

  • Critical: Immediate action required (node down, missed blocks)
  • Warning: Potential issues that need attention soon
  • Info: Informational alerts for awareness

Alert Fatigue Prevention

Avoid alert fatigue with these strategies:

  • Set appropriate thresholds to avoid false positives
  • Group related alerts to reduce notification volume
  • Implement alert silencing during maintenance windows
  • Use different notification channels for different severity levels
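
Silencing during a maintenance window can be scripted with amtool, the command-line client shipped in the Alertmanager tarball. This sketch assumes amtool has been copied onto the PATH, which the setup steps in this guide do not do. The function name and the comment text are illustrative.

```shell
# maintenance_silence: create an Alertmanager silence for one instance.
#   $1 - value of the "instance" label to silence
#   $2 - duration such as 30m or 2h (optional, defaults to 1h)
#   $3 - Alertmanager URL (optional)
maintenance_silence() {
    local instance="$1"
    local duration="${2:-1h}"
    local url="${3:-http://localhost:9093}"
    amtool silence add "instance=${instance}" \
        --duration="$duration" \
        --comment="planned maintenance" \
        --alertmanager.url="$url"
}
```

For example, maintenance_silence localhost:26660 2h mutes every alert for that instance for two hours. The silence expires on its own, so alerts resume without a manual cleanup step.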

Monitoring Security

Secure your monitoring infrastructure:

  • Use TLS for all monitoring endpoints
  • Implement authentication for Grafana and Prometheus
  • Restrict access to monitoring ports with firewall rules
  • Regularly update monitoring tools
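
As an illustration of the firewall point, the sketch below prints rules that let a single trusted address reach the ports used in this guide. It assumes ufw as the firewall, which this guide does not otherwise require, and the function name is hypothetical. Nothing is executed; the commands are only printed for review.

```shell
# monitoring_firewall_rules: print ufw commands that allow one trusted
# source address to reach the monitoring ports.
#   $1 - trusted source IP address
monitoring_firewall_rules() {
    local trusted_ip="$1"
    local port
    # 9090 Prometheus, 9093 Alertmanager, 9100 Node Exporter,
    # 26660 ChainOS metrics, 3000 Grafana
    for port in 9090 9093 9100 26660 3000; do
        echo "sudo ufw allow from ${trusted_ip} to any port ${port} proto tcp"
    done
}
```

These rules restrict access only when ufw's default incoming policy is deny. Make sure SSH is allowed before changing that policy on a remote server.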

Need Help?

If you need assistance with node monitoring, join our Discord community where our team and other node operators can help.