# Monitoring Sockudo
Monitoring your Sockudo server is crucial for understanding its performance, identifying bottlenecks, and ensuring its reliability in a production environment. Sockudo can expose metrics via a Prometheus-compatible endpoint and provides comprehensive observability features.
## Enabling Metrics

First, ensure that metrics are enabled in your Sockudo configuration:

```json
{
  "metrics": {
    "enabled": true,
    "driver": "prometheus",
    "host": "0.0.0.0",
    "port": 9601,
    "prometheus": {
      "prefix": "sockudo_"
    }
  }
}
```

Equivalent environment variables:

```bash
METRICS_ENABLED=true
METRICS_HOST="0.0.0.0"
METRICS_PORT=9601
PROMETHEUS_METRICS_PREFIX="sockudo_"
```

By default, metrics are available at `http://<metrics_host>:<metrics_port>/metrics` (e.g., `http://localhost:9601/metrics`).
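The endpoint serves plain Prometheus exposition text, so a quick spot check needs nothing beyond an HTTP GET and a line parser. A minimal sketch (the sample payload below is illustrative, not Sockudo's exact output; it also ignores exposition timestamps):

```python
# Minimal parser for Prometheus text exposition format, for spot-checking
# the /metrics endpoint. HELP/TYPE comment lines are skipped.

def parse_metrics(text: str) -> dict[str, float]:
    """Parse exposition-format lines into {metric_with_labels: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics

# Illustrative payload; real metric names come from Sockudo's exporter.
sample = """\
# HELP sockudo_connected Number of active connections
# TYPE sockudo_connected gauge
sockudo_connected{app_id="demo"} 42
sockudo_ws_messages_sent_total{app_id="demo"} 1234
"""

parsed = parse_metrics(sample)
print(parsed['sockudo_connected{app_id="demo"}'])  # 42.0
```

Against a live server you would feed it `urllib.request.urlopen("http://localhost:9601/metrics").read().decode()` instead of the sample string.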
## Available Metrics

Sockudo exposes comprehensive metrics for monitoring server performance and health. For a complete list of available metrics and their descriptions, see the Metrics Configuration Guide.

### Key Monitoring Categories

The available metrics cover the following areas:

- Connection Management: Track active connections, connection/disconnection rates, and connection errors
- Message Throughput: Monitor WebSocket messages and bytes transmitted/received
- HTTP API Performance: Track API calls and data transfer
- Channel Activity: Monitor channel subscriptions and active channels
- Rate Limiting: Track rate limit checks and triggers
- Horizontal Scaling: Monitor adapter performance for distributed deployments (Redis, NATS)
- Broadcast Performance: End-to-end latency tracking for message broadcasts (v2.6.1+)
## Setting up Prometheus

### Prometheus Configuration

Add a scrape job to your `prometheus.yml`:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'sockudo'
    static_configs:
      - targets: ['localhost:9601']
        labels:
          instance: 'sockudo-1'
          environment: 'production'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: http
```
### Multi-Instance Setup

For multiple Sockudo instances:

```yaml
scrape_configs:
  - job_name: 'sockudo'
    static_configs:
      - targets:
          - 'sockudo-1.example.com:9601'
          - 'sockudo-2.example.com:9601'
          - 'sockudo-3.example.com:9601'
        labels:
          environment: 'production'
          cluster: 'main'
```
### Docker Compose with Prometheus

```yaml
version: '3.8'

services:
  sockudo:
    image: sockudo/sockudo:latest
    ports:
      - "6001:6001"
      - "9601:9601"
    environment:
      - METRICS_ENABLED=true
    labels:
      - "prometheus.io/scrape=true"
      - "prometheus.io/port=9601"

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'

volumes:
  prometheus_data:
```
### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: 'sockudo'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['sockudo-namespace']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: sockudo
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: instance
```
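Prometheus relabeling joins the listed `source_labels` with `;`, matches the (fully anchored) regex against the result, and writes the replacement into `target_label`. The standard address-rewrite recipe, which swaps the scrape port for the one in the pod's `prometheus.io/port` annotation, behaves like this standalone Python sketch (the sample addresses are made up):

```python
import re

# Sketch of Prometheus relabel semantics for the address-rewrite rule:
# source labels are joined with ';', the anchored regex is matched against
# the result, and the replacement becomes the new __address__.
relabel_regex = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address: str, port_annotation: str) -> str:
    joined = f"{address};{port_annotation}"  # e.g. "10.0.0.5:8080;9601"
    m = relabel_regex.fullmatch(joined)      # Prometheus anchors regexes
    if not m:
        return address                       # no match: label left unchanged
    return m.expand(r"\1:\2")                # "$1:$2" in Prometheus syntax

print(rewrite_address("10.0.0.5:8080", "9601"))  # 10.0.0.5:9601
print(rewrite_address("10.0.0.5", "9601"))       # 10.0.0.5:9601
```

The optional `(?::\d+)?` group is what lets the rule work whether or not the discovered address already carries a port.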
## Visualization with Grafana

### Installing Grafana

```yaml
# Add to docker-compose.yml
grafana:
  image: grafana/grafana:latest
  ports:
    - "3000:3000"
  environment:
    - GF_SECURITY_ADMIN_PASSWORD=admin
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning
```
### Key Dashboard Panels

#### 1. Connection Overview

```promql
# Active connections gauge
sockudo_connected

# Connection rate (connections per second)
rate(sockudo_new_connections_total[5m])

# Connections by app
sum by (app_id) (sockudo_connected)
```
#### 2. Message Throughput

```promql
# Messages sent rate
rate(sockudo_ws_messages_sent_total[5m])

# Messages received rate
rate(sockudo_ws_messages_received_total[5m])

# Bytes transmitted rate
rate(sockudo_socket_transmitted_bytes[5m])

# Bytes received rate
rate(sockudo_socket_received_bytes[5m])
```
#### 3. HTTP API Performance

```promql
# HTTP API calls rate
rate(sockudo_http_calls_received_total[5m])

# HTTP bytes received rate
rate(sockudo_http_received_bytes[5m])

# HTTP bytes transmitted rate
rate(sockudo_http_transmitted_bytes[5m])
```
#### 4. Channel Activity

```promql
# Active channels
sockudo_active_channels

# Subscription rate
rate(sockudo_channel_subscriptions_total[5m])

# Channel activity by type
sum by (channel_type) (sockudo_active_channels)
```
#### 5. System Health

```promql
# Rate limit triggers
rate(sockudo_rate_limit_triggered_total[5m])

# Horizontal adapter resolve time (p95)
histogram_quantile(0.95, rate(sockudo_horizontal_adapter_resolve_time_bucket[5m]))

# Horizontal adapter operations
rate(sockudo_horizontal_adapter_sent_requests[5m])
rate(sockudo_horizontal_adapter_received_requests[5m])
```
#### 6. Broadcast Performance (v2.6.1+)

```promql
# Broadcast latency percentiles by recipient count
histogram_quantile(0.50, sum by (le, recipient_count_bucket) (rate(sockudo_broadcast_latency_ms_bucket[5m])))
histogram_quantile(0.95, sum by (le, recipient_count_bucket) (rate(sockudo_broadcast_latency_ms_bucket[5m])))
histogram_quantile(0.99, sum by (le, recipient_count_bucket) (rate(sockudo_broadcast_latency_ms_bucket[5m])))

# Average broadcast latency by channel type
sum by (channel_type) (rate(sockudo_broadcast_latency_ms_sum[5m]))
  / sum by (channel_type) (rate(sockudo_broadcast_latency_ms_count[5m]))

# Broadcast latency distribution heatmap
rate(sockudo_broadcast_latency_ms_bucket[5m])

# Broadcast performance by size category
histogram_quantile(0.95, rate(sockudo_broadcast_latency_ms_bucket{recipient_count_bucket="xs"}[5m]))
histogram_quantile(0.95, rate(sockudo_broadcast_latency_ms_bucket{recipient_count_bucket="md"}[5m]))
histogram_quantile(0.95, rate(sockudo_broadcast_latency_ms_bucket{recipient_count_bucket="xl"}[5m]))
```
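`histogram_quantile` estimates a percentile from cumulative `_bucket` series by linearly interpolating inside the bucket that contains the target rank. A simplified standalone sketch of that computation (the bucket bounds and counts below are made up, and real Prometheus handles more edge cases, such as NaN counts and non-monotonic buckets):

```python
# Sketch of histogram_quantile's linear interpolation over cumulative buckets.
# Each entry is (upper bound in ms, cumulative observation count); made-up data.
buckets = [(5, 10), (10, 55), (25, 90), (float("inf"), 100)]

def quantile(q: float, buckets) -> float:
    total = buckets[-1][1]          # count in the +Inf bucket == total observations
    rank = q * total
    prev_le, prev_count = 0.0, 0
    for le, count in buckets:
        if count >= rank:
            if le == float("inf"):
                return prev_le      # quantile falls in +Inf: return last finite bound
            # Linear interpolation inside the bucket containing the rank.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

print(quantile(0.95, buckets))
```

This is also why the `sum by (le, ...)` wrapper matters: `histogram_quantile` needs the `le` label intact to reconstruct the bucket boundaries after aggregation.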
### Sample Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Sockudo Metrics",
    "panels": [
      {
        "title": "Active Connections",
        "type": "stat",
        "targets": [
          {
            "expr": "sockudo_connected",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Message Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(sockudo_ws_messages_sent_total[5m])",
            "legendFormat": "Sent"
          },
          {
            "expr": "rate(sockudo_ws_messages_received_total[5m])",
            "legendFormat": "Received"
          }
        ]
      },
      {
        "title": "HTTP API Throughput",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(sockudo_http_calls_received_total[5m])",
            "legendFormat": "API Calls"
          },
          {
            "expr": "rate(sockudo_http_received_bytes[5m])",
            "legendFormat": "Bytes In"
          },
          {
            "expr": "rate(sockudo_http_transmitted_bytes[5m])",
            "legendFormat": "Bytes Out"
          }
        ]
      },
      {
        "title": "Broadcast Latency by Size (v2.6.1+)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, sum by (le, recipient_count_bucket) (rate(sockudo_broadcast_latency_ms_bucket[5m])))",
            "legendFormat": "{{recipient_count_bucket}} (p95)"
          }
        ]
      }
    ]
  }
}
```
## Alerting Rules

### Prometheus Alerting Rules

Create an `alerts.yml` file:

```yaml
groups:
  - name: sockudo_alerts
    rules:
      # Connection Alerts
      - alert: SockudoHighConnectionCount
        expr: sockudo_connected > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.instance }}"
          description: "Sockudo instance {{ $labels.instance }} has {{ $value }} active connections"

      - alert: SockudoConnectionDrops
        expr: rate(sockudo_new_disconnections_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High connection drop rate on {{ $labels.instance }}"
          description: "Connection drop rate is {{ $value }} per second"

      # System Health Alerts
      - alert: SockudoInstanceDown
        expr: up{job="sockudo"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sockudo instance down"
          description: "Instance {{ $labels.instance }} is not responding"

      - alert: SockudoRateLimitTriggered
        expr: rate(sockudo_rate_limit_triggered_total[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate limit triggers on {{ $labels.instance }}"
          description: "Rate limits are being triggered {{ $value }} times per second"

      # Horizontal Adapter Alerts
      - alert: SockudoHighAdapterLatency
        expr: histogram_quantile(0.95, rate(sockudo_horizontal_adapter_resolve_time_bucket[5m])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High horizontal adapter latency on {{ $labels.instance }}"
          description: "95th percentile adapter resolve time is {{ $value }}ms"

      # Broadcast Performance Alerts (v2.6.1+)
      - alert: SockudoHighBroadcastLatency
        expr: histogram_quantile(0.95, rate(sockudo_broadcast_latency_ms_bucket[5m])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High broadcast latency on {{ $labels.instance }}"
          description: "95th percentile broadcast latency is {{ $value }}ms for the {{ $labels.recipient_count_bucket }} recipient bucket"

      - alert: SockudoVeryHighBroadcastLatency
        expr: histogram_quantile(0.99, rate(sockudo_broadcast_latency_ms_bucket[5m])) > 500
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical broadcast latency on {{ $labels.instance }}"
          description: "99th percentile broadcast latency is {{ $value }}ms for the {{ $labels.recipient_count_bucket }} recipient bucket"
```
### Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@yourcompany.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@yourcompany.com'
        subject: 'Sockudo Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Sockudo Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
```
## Logging and Log Aggregation

### Structured Logging

Configure Sockudo for structured logging:

```json
{
  "debug": false,
  "log_format": "json"
}
```

Equivalent environment variables:

```bash
LOG_FORMAT=json
LOG_LEVEL=info
```
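With JSON logging enabled, each log line is a single JSON object, so downstream tooling can filter by field instead of pattern-matching free text. A minimal sketch (the field names here are illustrative, not Sockudo's exact log schema):

```python
import json

# Illustrative structured log line; the exact field names depend on
# Sockudo's logger, not on this sketch.
line = ('{"timestamp": "2024-05-01T12:00:00Z", "level": "info", '
        '"message": "connection opened", "app_id": "demo"}')

record = json.loads(line)

# Filtering by severity becomes a dict lookup instead of a regex.
if record["level"] in ("error", "warn"):
    print(f'{record["timestamp"]} {record["message"]}')

print(record["app_id"])  # demo
```

This is also what makes the Logstash `json` filter below work: it simply lifts these fields into the event document.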
### Log Aggregation with ELK Stack

#### Filebeat Configuration

```yaml
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/sockudo/*.log
    fields:
      service: sockudo
      environment: production
    fields_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "sockudo-logs-%{+yyyy.MM.dd}"

setup.template:
  name: "sockudo-logs"
  pattern: "sockudo-logs-*"
```
#### Logstash Configuration

```ruby
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [service] == "sockudo" {
    json {
      source => "message"
    }
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    mutate {
      add_field => { "[@metadata][index]" => "sockudo-logs-%{+YYYY.MM.dd}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[@metadata][index]}"
  }
}
```
### Log Queries and Dashboards

#### Common Log Queries (Kibana/Elasticsearch)

Error logs:

```json
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "sockudo"}},
        {"term": {"level": "ERROR"}}
      ],
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "now-1h"
          }
        }
      }
    }
  }
}
```

Connection events:

```json
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "sockudo"}},
        {"wildcard": {"message": "*connection*"}}
      ]
    }
  }
}
```

Authentication failures:

```json
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "sockudo"}},
        {"match": {"message": "authentication failed"}}
      ]
    }
  }
}
```
## Performance Monitoring

### System Resource Monitoring

Use node_exporter with Prometheus to monitor system resources:

```yaml
# Add to docker-compose.yml
node-exporter:
  image: prom/node-exporter:latest
  ports:
    - "9100:9100"
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/rootfs:ro
  command:
    - '--path.procfs=/host/proc'
    - '--path.sysfs=/host/sys'
    - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($|/)'
```
### Key System Metrics

```promql
# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# File descriptors
node_filefd_allocated / node_filefd_maximum * 100
```
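The memory-usage expression is plain arithmetic over two gauges: the fraction of memory that is not available, scaled to a percentage. As a quick sanity check (the byte values are made up):

```python
# Sanity check of the memory-usage formula with made-up gauge values.
mem_total_bytes = 16 * 1024**3      # stands in for node_memory_MemAvailable_bytes' peer, node_memory_MemTotal_bytes
mem_available_bytes = 4 * 1024**3   # stands in for node_memory_MemAvailable_bytes

usage_percent = (1 - mem_available_bytes / mem_total_bytes) * 100
print(usage_percent)  # 75.0
```

The disk-usage and file-descriptor expressions follow the same ratio-to-percentage pattern.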
## Best Practices

### Monitoring Strategy

- Start with key metrics: Focus on connection count, message rates, and error rates
- Set meaningful thresholds: Base alerts on your actual usage patterns
- Use percentiles: Monitor 95th and 99th percentiles for latency metrics
- Monitor trends: Look for gradual changes that might indicate issues
### Alert Fatigue Prevention

- Tune alert thresholds: Avoid false positives
- Use appropriate time windows: Don't alert on brief spikes
- Group related alerts: Use alert grouping in Alertmanager
- Regular review: Periodically review and adjust alert rules
### Dashboard Design

- Hierarchy of dashboards: Overview → Detailed → Troubleshooting
- Consistent time ranges: Use standard time ranges across panels
- Meaningful legends: Use clear, descriptive legend formats
- Color coding: Use consistent colors for similar metrics
### Data Retention

- Metrics retention: Configure appropriate retention for Prometheus
- Log retention: Set up log rotation and archival policies
- Historical analysis: Keep enough history for trend analysis
- Storage costs: Balance retention needs with storage costs
By implementing comprehensive monitoring with Prometheus, Grafana, and proper alerting, you can ensure your Sockudo deployment remains healthy, performant, and reliable in production environments.