ChainOS Node Monitoring
Monitoring Overview
Effective monitoring is essential for maintaining reliable ChainOS nodes. This guide covers monitoring setup, important metrics, alerting, and visualization tools.
Why Monitor Your Node?
Proper monitoring provides several benefits:
- Proactive Issue Detection: Identify and address problems before they cause downtime
- Performance Optimization: Analyze resource usage patterns to optimize your configuration
- Security: Detect unusual activity that might indicate security issues
- Compliance: For validators, ensure you're meeting uptime requirements
- Capacity Planning: Track growth trends to plan for hardware upgrades
Monitoring Stack
We recommend the following monitoring stack for ChainOS nodes:
Recommended Components
- Prometheus: Time-series database for metrics collection
- Node Exporter: System metrics exporter for Linux servers
- ChainOS Exporter: Custom metrics exporter for ChainOS-specific metrics
- Grafana: Visualization and dashboarding platform
- Alertmanager: Alert handling and notification system
Installation and Setup
Follow these steps to set up a comprehensive monitoring system:
Prometheus Setup
# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.40.0/prometheus-2.40.0.linux-amd64.tar.gz
tar xvfz prometheus-2.40.0.linux-amd64.tar.gz
cd prometheus-2.40.0.linux-amd64/
# Create a Prometheus configuration file
cat > prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'chainos'
    static_configs:
      - targets: ['localhost:26660']
EOF
# Create a systemd service file
sudo tee /etc/systemd/system/prometheus.service > /dev/null << EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \\
--config.file=/etc/prometheus/prometheus.yml \\
--storage.tsdb.path=/var/lib/prometheus/ \\
--web.console.templates=/etc/prometheus/consoles \\
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
EOF
# Create prometheus user and directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus/rules /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo cp -r consoles console_libraries /etc/prometheus/
sudo cp prometheus.yml /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /usr/local/bin/{prometheus,promtool}
# Start and enable Prometheus
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
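Once the service is running, it is worth a quick sanity check that the configuration parses and the HTTP endpoint responds (promtool ships in the release archive you just extracted):
# Validate the configuration file
promtool check config /etc/prometheus/prometheus.yml
# Confirm the service is active and answering
sudo systemctl status prometheus --no-pager
curl -s http://localhost:9090/-/healthy
# Inspect recent logs if anything looks wrong
sudo journalctl -u prometheus -n 50 --no-pager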
Node Exporter Setup
# Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.4.0/node_exporter-1.4.0.linux-amd64.tar.gz
tar xvfz node_exporter-1.4.0.linux-amd64.tar.gz
cd node_exporter-1.4.0.linux-amd64/
# Create a systemd service file
sudo tee /etc/systemd/system/node_exporter.service > /dev/null << EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
EOF
# Create node_exporter user
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo cp node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
# Start and enable Node Exporter
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
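To confirm Node Exporter is serving system metrics, query its endpoint directly and check that Prometheus has registered the target as healthy:
# Node Exporter listens on :9100 by default
curl -s http://localhost:9100/metrics | grep -E '^node_(cpu|memory|filesystem)' | head
# Check scrape target health via the Prometheus API
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'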
ChainOS Metrics Configuration
ChainOS exposes metrics via a Prometheus endpoint. Enable it in your config.toml file:
# In ~/.chainosd/config/config.toml
[instrumentation]
prometheus = true
prometheus_listen_addr = ":26660"
namespace = "chainos"
Grafana Setup
# Install Grafana
sudo apt-get install -y apt-transport-https software-properties-common
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana
# Start and enable Grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
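A quick way to confirm Grafana is up before logging in is its health endpoint:
sudo systemctl status grafana-server --no-pager
# Returns the running Grafana version and database status
curl -s http://localhost:3000/api/health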
Alertmanager Setup
# Install Alertmanager
wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz
tar xvfz alertmanager-0.24.0.linux-amd64.tar.gz
cd alertmanager-0.24.0.linux-amd64/
# Create Alertmanager configuration
cat > alertmanager.yml << EOF
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
      - to: 'your-email@example.com'
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
EOF
# Create a systemd service file
sudo tee /etc/systemd/system/alertmanager.service > /dev/null << EOF
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \\
--config.file=/etc/alertmanager/alertmanager.yml \\
--storage.path=/var/lib/alertmanager
[Install]
WantedBy=multi-user.target
EOF
# Create alertmanager user and directories
sudo useradd --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo cp alertmanager /usr/local/bin/
sudo cp alertmanager.yml /etc/alertmanager/
sudo chown -R alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /usr/local/bin/alertmanager
# Start and enable Alertmanager
sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager
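Validate the configuration with amtool (included in the Alertmanager release archive) and confirm the service is reachable before relying on it for notifications:
# Check the configuration syntax
amtool check-config /etc/alertmanager/alertmanager.yml
# Confirm the service is active and answering
sudo systemctl status alertmanager --no-pager
curl -s http://localhost:9093/-/healthy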
Important Metrics to Monitor
Here are the key metrics you should monitor for your ChainOS node:
System Metrics
CPU Usage
Monitor CPU utilization to ensure your node has sufficient processing power:
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Alert threshold: >80% sustained for 5 minutes
Memory Usage
Track memory usage to prevent swapping and OOM kills:
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100
Alert threshold: <10% available memory for 5 minutes
Disk Usage
Monitor disk space to prevent running out of storage:
node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100
Alert threshold: <10% available space
Disk I/O
Track disk I/O to identify potential bottlenecks:
rate(node_disk_io_time_seconds_total[1m]) * 100
Alert threshold: >80% disk utilization for 15 minutes
Network Traffic
Monitor network traffic to ensure sufficient bandwidth:
rate(node_network_receive_bytes_total{device!="lo"}[1m])
rate(node_network_transmit_bytes_total{device!="lo"}[1m])
Alert threshold: >80% of available bandwidth
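You can evaluate any of these expressions ad hoc against the Prometheus HTTP API before wiring them into dashboards or alert rules. For example, to run the disk-space query as an instant query:
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100'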
ChainOS-Specific Metrics
Node Sync Status
Track whether your node is in sync with the network:
chainos_consensus_height
chainos_p2p_peers
Alert threshold: Height not increasing for 5 minutes or peer count <3
Validator Performance (for validators)
Monitor validator signing performance:
chainos_consensus_validator_missed_blocks
chainos_consensus_validator_power
Alert threshold: Any missed blocks or power change
Transaction Throughput
Monitor mempool backlog and transaction throughput:
chainos_mempool_size
chainos_mempool_tx_size_bytes
Alert threshold: Mempool size consistently growing
Consensus Rounds
Track consensus performance:
chainos_consensus_rounds
chainos_consensus_num_txs
Alert threshold: Multiple rounds per block
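A quick way to confirm these ChainOS metrics are actually being exported is to pull them straight from the node's instrumentation endpoint:
# Spot-check consensus and p2p metrics on the node itself
curl -s http://localhost:26660/metrics | grep -E '^chainos_(consensus_height|p2p_peers|consensus_validator_missed_blocks)'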
Alert Rules
Create a Prometheus alert rules file to notify you of important issues:
# /etc/prometheus/rules/chainos_alerts.yml
groups:
  - name: chainos
    rules:
      - alert: NodeExporterDown
        expr: up{job="node_exporter"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node Exporter down on {{ $labels.instance }}"
          description: "Node Exporter has been down for more than 5 minutes."

      - alert: ChainOSNodeDown
        expr: up{job="chainos"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ChainOS node down on {{ $labels.instance }}"
          description: "ChainOS node has been down for more than 5 minutes."

      - alert: ChainOSNodeNotSyncing
        expr: increase(chainos_consensus_height[10m]) < 1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "ChainOS node not syncing on {{ $labels.instance }}"
          description: "ChainOS node has not increased in height for 10 minutes."

      - alert: LowPeerCount
        expr: chainos_p2p_peers < 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count on {{ $labels.instance }}"
          description: "ChainOS node has fewer than 3 peers."

      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load on {{ $labels.instance }}"
          description: "CPU load has been above 80% for 5 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage has been above 90% for 5 minutes."

      - alert: DiskSpaceRunningOut
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running out on {{ $labels.instance }}"
          description: "Available disk space on / is below 10%."

      - alert: ValidatorMissedBlocks
        expr: increase(chainos_consensus_validator_missed_blocks[1h]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Validator missed blocks on {{ $labels.instance }}"
          description: "Validator has missed blocks in the last hour."
Grafana Dashboards
Set up Grafana dashboards to visualize your metrics:
Setting Up Dashboards
- Access Grafana at http://your-server-ip:3000 (default credentials: admin/admin)
- Add Prometheus as a data source (or use the API sketch after this list):
  - Go to Configuration > Data Sources > Add data source
  - Select Prometheus
  - Set URL to http://localhost:9090
  - Click "Save & Test"
- Import dashboards:
  - Go to Create > Import
  - Enter dashboard ID or upload JSON file
  - Select your Prometheus data source
  - Click "Import"
Recommended Dashboards
Here are some recommended Grafana dashboards for ChainOS monitoring:
- Node Exporter Dashboard: ID 1860 - System metrics
- ChainOS Node Dashboard: Available in our GitHub repository
- Validator Dashboard: Available in our GitHub repository
Dashboard JSON Files
You can find our custom Grafana dashboard JSON files in the monitoring directory of our GitHub repository.
Notification Channels
Configure Alertmanager to send notifications through various channels:
Email Notifications
Configure email notifications in Alertmanager:
global:
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'your-email@gmail.com'
  smtp_auth_password: 'your-password'
  smtp_require_tls: true

receivers:
  - name: 'email'
    email_configs:
      - to: 'your-email@example.com'
        send_resolved: true
Slack Notifications
Configure Slack notifications in Alertmanager:
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#monitoring'
        send_resolved: true
Telegram Notifications
Configure Telegram notifications in Alertmanager:
receivers:
  - name: 'telegram'
    telegram_configs:
      - bot_token: 'your-telegram-bot-token'
        chat_id: 123456789
        parse_mode: 'HTML'
        send_resolved: true
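Whichever channel you configure, push a synthetic alert through Alertmanager once to confirm that routing and credentials work. amtool can inject one directly; the alert name below is arbitrary:
# Fire a test alert through Alertmanager
amtool alert add alertname=TestNotification severity=warning \
  --alertmanager.url=http://localhost:9093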
Monitoring Best Practices
Follow these best practices for effective node monitoring:
Tiered Alert Severity
Categorize alerts by severity to prioritize responses:
- Critical: Immediate action required (node down, missed blocks)
- Warning: Potential issues that need attention soon
- Info: Informational alerts for awareness
Alert Fatigue Prevention
Avoid alert fatigue with these strategies:
- Set appropriate thresholds to avoid false positives
- Group related alerts to reduce notification volume
- Implement alert silencing during maintenance windows (see the amtool example after this list)
- Use different notification channels for different severity levels
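For the maintenance-window silencing mentioned above, amtool can create a time-boxed silence against your Alertmanager. A sketch assuming a two-hour window:
# Silence node-down alerts during planned maintenance
amtool silence add alertname=ChainOSNodeDown \
  --duration=2h --comment="planned maintenance" \
  --alertmanager.url=http://localhost:9093
# Review active silences (and expire them early if needed)
amtool silence query --alertmanager.url=http://localhost:9093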
Monitoring Security
Secure your monitoring infrastructure:
- Use TLS for all monitoring endpoints
- Implement authentication for Grafana and Prometheus
- Restrict access to monitoring ports with firewall rules (see the ufw example after this list)
- Regularly update monitoring tools
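As an example of the firewall point above, here is a minimal ufw sketch. It assumes ufw with a default-deny incoming policy, and 203.0.113.10 is a placeholder for the host that runs your monitoring stack or needs dashboard access:
# Allow only the monitoring host (placeholder IP) to reach metrics and dashboard ports
sudo ufw allow from 203.0.113.10 to any port 9090 proto tcp
sudo ufw allow from 203.0.113.10 to any port 9100 proto tcp
sudo ufw allow from 203.0.113.10 to any port 26660 proto tcp
sudo ufw allow from 203.0.113.10 to any port 3000 proto tcp
# Verify the resulting rule set
sudo ufw status numbered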
Need Help?
If you need assistance with node monitoring, join our Discord community where our team and other node operators can help.