Monitoring Sockudo

Monitoring your Sockudo server is crucial for understanding its performance, identifying bottlenecks, and ensuring its reliability in a production environment. Sockudo can expose metrics via a Prometheus-compatible endpoint and provides comprehensive observability features.

Enabling Metrics

First, ensure that metrics are enabled in your Sockudo configuration:

json
{
  "metrics": {
    "enabled": true,
    "driver": "prometheus",
    "host": "0.0.0.0",
    "port": 9601,
    "prometheus": {
      "prefix": "sockudo_"
    }
  }
}

Environment Variables:

bash
METRICS_ENABLED=true
METRICS_HOST="0.0.0.0"
METRICS_PORT=9601
PROMETHEUS_METRICS_PREFIX="sockudo_"

By default, metrics will be available at http://<metrics_host>:<metrics_port>/metrics (e.g., http://localhost:9601/metrics).
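
To confirm the endpoint is serving data, fetch it and filter for the prefixed series (assuming the default host, port, and sockudo_ prefix above):

bash
# Fetch the raw metrics and show the Sockudo-specific series
curl -s http://localhost:9601/metrics | grep '^sockudo_' | head -n 20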

Key Metrics to Monitor

Sockudo exposes comprehensive metrics across different categories:

Connection Metrics

Active Connections

  • sockudo_active_connections: Current number of active WebSocket connections
  • sockudo_connections_per_app: Active connections broken down by application
  • sockudo_total_connections: Total number of connections established since startup

Connection Events

  • sockudo_connection_established_total: Total connections established
  • sockudo_connection_closed_total: Total connections closed
  • sockudo_connection_errors_total: Connection errors (timeouts, protocol errors)

Message Throughput

Message Flows

  • sockudo_messages_sent_total: Total messages sent by the server to clients
  • sockudo_messages_received_total: Total messages received from clients
  • sockudo_broadcast_messages_total: Messages broadcast to multiple subscribers
  • sockudo_client_events_total: Client-triggered events processed

Message Processing

  • sockudo_message_processing_duration_seconds: Histogram of message processing times
  • sockudo_message_size_bytes: Histogram of message sizes

HTTP API Performance

Request Metrics

  • sockudo_http_requests_total: Total HTTP API requests with labels for method, endpoint, status
  • sockudo_http_request_duration_seconds: Request latency histogram
  • sockudo_http_response_size_bytes: Response size histogram

API Errors

  • sockudo_http_errors_total: HTTP errors by status code
  • sockudo_api_authentication_failures_total: Failed authentication attempts

Channel Statistics

Channel Activity

  • sockudo_active_channels: Current number of channels with subscribers
  • sockudo_channels_per_app: Active channels per application
  • sockudo_channel_subscriptions_total: Total channel subscriptions
  • sockudo_channel_unsubscriptions_total: Total channel unsubscriptions

Presence Channels

  • sockudo_presence_members: Current members in presence channels
  • sockudo_presence_events_total: Member join/leave events

Rate Limiting

Rate Limit Events

  • sockudo_rate_limit_triggered_total: Rate limits triggered by type (API, WebSocket)
  • sockudo_rate_limit_checks_total: Total rate limit checks with results

Queue Performance (if enabled)

Job Processing

  • sockudo_queue_jobs_processed_total: Successfully processed queue jobs
  • sockudo_queue_jobs_failed_total: Failed queue jobs
  • sockudo_queue_active_jobs: Current jobs waiting in queue
  • sockudo_queue_job_duration_seconds: Job processing time histogram

Webhook Metrics

Webhook Delivery

  • sockudo_webhooks_sent_total: Total webhooks sent
  • sockudo_webhooks_failed_total: Failed webhook deliveries
  • sockudo_webhook_duration_seconds: Webhook request duration
  • sockudo_webhook_retries_total: Webhook retry attempts

Cache Performance

Cache Operations

  • sockudo_cache_hits_total: Cache hits
  • sockudo_cache_misses_total: Cache misses
  • sockudo_cache_operations_total: Total cache operations
  • sockudo_cache_memory_usage_bytes: Current cache memory usage

Adapter Metrics

Adapter Performance

  • sockudo_adapter_operations_total: Adapter operations (publish, subscribe)
  • sockudo_adapter_errors_total: Adapter errors
  • sockudo_adapter_latency_seconds: Adapter operation latency
  • sockudo_adapter_message_size_bytes: Size of messages through adapter

Setting up Prometheus

Prometheus Configuration

Add a scrape job to your prometheus.yml:

yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'sockudo'
    static_configs:
      - targets: ['localhost:9601']
        labels:
          instance: 'sockudo-1'
          environment: 'production'
    scrape_interval: 15s
    metrics_path: /metrics
    scheme: http
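
Before reloading Prometheus, it is worth validating the file with promtool, which ships with Prometheus:

bash
# Syntax-check the scrape configuration
promtool check config prometheus.yml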

Multi-Instance Setup

For multiple Sockudo instances:

yaml
scrape_configs:
  - job_name: 'sockudo'
    static_configs:
      - targets: 
          - 'sockudo-1.example.com:9601'
          - 'sockudo-2.example.com:9601'
          - 'sockudo-3.example.com:9601'
        labels:
          environment: 'production'
          cluster: 'main'

Docker Compose with Prometheus

yaml
version: '3.8'
services:
  sockudo:
    image: sockudo/sockudo:latest
    ports:
      - "6001:6001"
      - "9601:9601"
    environment:
      - METRICS_ENABLED=true
    labels:
      - "prometheus.io/scrape=true"
      - "prometheus.io/port=9601"

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'

volumes:
  prometheus_data:
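
Assuming the file above is saved as docker-compose.yml next to prometheus.yml, you can start the stack and confirm that Prometheus has discovered the Sockudo target:

bash
# Start Sockudo and Prometheus in the background
docker compose up -d

# List scrape targets and their health (requires jq)
curl -s http://localhost:9090/api/v1/targets | \
  jq '.data.activeTargets[] | {job: .labels.job, health: .health}'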

Kubernetes Service Discovery

yaml
scrape_configs:
  - job_name: 'sockudo'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ['sockudo-namespace']
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: sockudo
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: instance
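
For these relabeling rules to pick a pod up, its template must carry the matching label and annotations. A minimal sketch with kubectl (the deployment name sockudo is an assumption):

bash
# Add the label and annotations the relabel_configs above match on
kubectl -n sockudo-namespace patch deployment sockudo --type merge -p '
spec:
  template:
    metadata:
      labels:
        app: sockudo
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9601"
'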

Visualization with Grafana

Installing Grafana

yaml
# Add under the services: key in docker-compose.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
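
Rather than adding the data source by hand, you can provision it from the mounted directory. A minimal sketch; the URL assumes the Prometheus service name from the Compose file above:

bash
# Provision a Prometheus data source for Grafana
mkdir -p grafana/provisioning/datasources
cat > grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF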

Key Dashboard Panels

1. Connection Overview

promql
# Active connections gauge
sockudo_active_connections

# Connection rate (connections per second)
rate(sockudo_connection_established_total[5m])

# Connections by app
sum by (app_id) (sockudo_active_connections)

2. Message Throughput

promql
# Messages sent rate
rate(sockudo_messages_sent_total[5m])

# Messages received rate
rate(sockudo_messages_received_total[5m])

# Client events rate
rate(sockudo_client_events_total[5m])

# Broadcast efficiency (messages sent vs received)
rate(sockudo_broadcast_messages_total[5m]) / rate(sockudo_messages_received_total[5m])

3. HTTP API Performance

promql
# Request rate
rate(sockudo_http_requests_total[5m])

# 95th percentile response time
histogram_quantile(0.95, rate(sockudo_http_request_duration_seconds_bucket[5m]))

# Error rate percentage
rate(sockudo_http_requests_total{status=~"5.."}[5m]) / rate(sockudo_http_requests_total[5m]) * 100

# Requests by endpoint
sum by (endpoint) (rate(sockudo_http_requests_total[5m]))

4. Channel Activity

promql
# Active channels
sockudo_active_channels

# Subscription rate
rate(sockudo_channel_subscriptions_total[5m])

# Presence channel members
sockudo_presence_members

# Channel activity by type
sum by (channel_type) (sockudo_active_channels)

5. System Health

promql
# Rate limit triggers
rate(sockudo_rate_limit_triggered_total[5m])

# Queue depth (if using queues)
sockudo_queue_active_jobs

# Cache hit rate
rate(sockudo_cache_hits_total[5m]) / (rate(sockudo_cache_hits_total[5m]) + rate(sockudo_cache_misses_total[5m])) * 100

# Webhook success rate
rate(sockudo_webhooks_sent_total[5m]) / (rate(sockudo_webhooks_sent_total[5m]) + rate(sockudo_webhooks_failed_total[5m])) * 100
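
Any of these expressions can also be evaluated outside Grafana through the Prometheus HTTP API, which is useful for scripted health checks:

bash
# Evaluate a PromQL expression via the query API
curl -s --data-urlencode 'query=rate(sockudo_messages_sent_total[5m])' \
  http://localhost:9090/api/v1/query | jq '.data.result'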

Sample Grafana Dashboard JSON

json
{
  "dashboard": {
    "title": "Sockudo Metrics",
    "panels": [
      {
        "title": "Active Connections",
        "type": "stat",
        "targets": [
          {
            "expr": "sockudo_active_connections",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Message Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(sockudo_messages_sent_total[5m])",
            "legendFormat": "Sent"
          },
          {
            "expr": "rate(sockudo_messages_received_total[5m])",
            "legendFormat": "Received"
          }
        ]
      },
      {
        "title": "API Response Time",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(sockudo_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(sockudo_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(sockudo_http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "p99"
          }
        ]
      }
    ]
  }
}

Alerting Rules

Prometheus Alerting Rules

Create an alerts.yml file:

yaml
groups:
  - name: sockudo_alerts
    rules:
      # Connection Alerts
      - alert: SockudoHighConnectionCount
        expr: sockudo_active_connections > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection count on {{ $labels.instance }}"
          description: "Sockudo instance {{ $labels.instance }} has {{ $value }} active connections"

      - alert: SockudoConnectionDrops
        expr: rate(sockudo_connection_closed_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High connection drop rate on {{ $labels.instance }}"
          description: "Connection drop rate is {{ $value }} per second"

      # Performance Alerts
      - alert: SockudoHighLatency
        expr: histogram_quantile(0.95, rate(sockudo_http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High API latency on {{ $labels.instance }}"
          description: "95th percentile latency is {{ $value }}s"

      - alert: SockudoHighErrorRate
        expr: rate(sockudo_http_requests_total{status=~"5.."}[5m]) / rate(sockudo_http_requests_total[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.instance }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # System Health Alerts
      - alert: SockudoInstanceDown
        expr: up{job="sockudo"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Sockudo instance down"
          description: "Instance {{ $labels.instance }} is not responding"

      - alert: SockudoRateLimitTriggered
        expr: rate(sockudo_rate_limit_triggered_total[5m]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High rate limit triggers on {{ $labels.instance }}"
          description: "Rate limits are being triggered {{ $value }} times per second"

      # Queue Alerts (if using queues)
      - alert: SockudoQueueBacklog
        expr: sockudo_queue_active_jobs > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High queue backlog on {{ $labels.instance }}"
          description: "Queue has {{ $value }} pending jobs"

      - alert: SockudoWebhookFailures
        expr: rate(sockudo_webhooks_failed_total[5m]) / rate(sockudo_webhooks_sent_total[5m]) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High webhook failure rate on {{ $labels.instance }}"
          description: "Webhook failure rate is {{ $value | humanizePercentage }}"

      # Cache Performance
      - alert: SockudoLowCacheHitRate
        expr: rate(sockudo_cache_hits_total[5m]) / (rate(sockudo_cache_hits_total[5m]) + rate(sockudo_cache_misses_total[5m])) < 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low cache hit rate on {{ $labels.instance }}"
          description: "Cache hit rate is {{ $value | humanizePercentage }}"

Alertmanager Configuration

yaml
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@yourcompany.com'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    email_configs:
      - to: 'admin@yourcompany.com'
        subject: 'Sockudo Alert: {{ .GroupLabels.alertname }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#alerts'
        title: 'Sockudo Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
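
Alertmanager ships with amtool, which can validate this configuration before deployment:

bash
# Check the Alertmanager configuration for errors
amtool check-config alertmanager.yml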

Logging and Log Aggregation

Structured Logging

Configure Sockudo for structured logging:

json
{
  "debug": false,
  "log_format": "json"
}

Environment Variables:

bash
LOG_FORMAT=json
LOG_LEVEL=info
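
With JSON output, logs can be filtered with standard tooling. A sketch assuming Sockudo runs in a Docker container named sockudo and emits a level field on each line (both assumptions):

bash
# Show only error-level log lines; fromjson? skips any non-JSON lines
docker logs sockudo 2>&1 | jq -R 'fromjson? | select(.level == "ERROR")'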

Log Aggregation with ELK Stack

Filebeat Configuration

yaml
# filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /var/log/sockudo/*.log
  fields:
    service: sockudo
    environment: production
  fields_under_root: true

output.elasticsearch:
  hosts: ["elasticsearch:9200"]
  index: "sockudo-logs-%{+yyyy.MM.dd}"

setup.template:
  name: "sockudo-logs"
  pattern: "sockudo-logs-*"

Logstash Configuration

ruby
# logstash.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [service] == "sockudo" {
    json {
      source => "message"
    }
    
    date {
      match => [ "timestamp", "ISO8601" ]
    }
    
    mutate {
      add_field => { "[@metadata][index]" => "sockudo-logs-%{+YYYY.MM.dd}" }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "%{[@metadata][index]}"
  }
}
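
As with Filebeat, the pipeline can be syntax-checked before it is deployed:

bash
# Parse the pipeline definition and exit without starting Logstash
logstash -f logstash.conf --config.test_and_exit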

Log Queries and Dashboards

Common Log Queries (Kibana/Elasticsearch)

json
// Error logs
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "sockudo"}},
        {"term": {"level": "ERROR"}}
      ],
      "filter": {
        "range": {
          "@timestamp": {
            "gte": "now-1h"
          }
        }
      }
    }
  }
}

// Connection events
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "sockudo"}},
        {"wildcard": {"message": "*connection*"}}
      ]
    }
  }
}

// Authentication failures
{
  "query": {
    "bool": {
      "must": [
        {"term": {"service": "sockudo"}},
        {"match": {"message": "authentication failed"}}
      ]
    }
  }
}
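
These query bodies can also be run directly against Elasticsearch. For example, after saving the first query (without the comment line) as error-query.json, a hypothetical file name:

bash
# Count recent Sockudo error logs across the daily indices
curl -s -H 'Content-Type: application/json' \
  -d @error-query.json \
  'http://localhost:9200/sockudo-logs-*/_search' | jq '.hits.total'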

Performance Monitoring

System Resource Monitoring

Use node_exporter with Prometheus to monitor system resources:

yaml
# Add under the services: key in docker-compose.yml
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc|rootfs/var/lib/docker/containers|rootfs/var/lib/docker/overlay2|rootfs/run/docker/netns|rootfs/var/lib/docker/aufs)($|/)'

Key System Metrics

promql
# CPU usage
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100

# Network I/O
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])

# File descriptors
node_filefd_allocated / node_filefd_maximum * 100

Best Practices

Monitoring Strategy

  1. Start with key metrics: Focus on connection count, message rates, and error rates
  2. Set meaningful thresholds: Base alerts on your actual usage patterns
  3. Use percentiles: Monitor 95th and 99th percentiles for latency metrics
  4. Monitor trends: Look for gradual changes that might indicate issues

Alert Fatigue Prevention

  1. Tune alert thresholds: Avoid false positives
  2. Use appropriate time windows: Don't alert on brief spikes
  3. Group related alerts: Use alert grouping in Alertmanager
  4. Regular review: Periodically review and adjust alert rules

Dashboard Design

  1. Hierarchy of dashboards: Overview → Detailed → Troubleshooting
  2. Consistent time ranges: Use standard time ranges across panels
  3. Meaningful legends: Use clear, descriptive legend formats
  4. Color coding: Use consistent colors for similar metrics

Data Retention

  1. Metrics retention: Configure appropriate retention for Prometheus (see the sketch after this list)
  2. Log retention: Set up log rotation and archival policies
  3. Historical analysis: Keep enough history for trend analysis
  4. Storage costs: Balance retention needs with storage costs
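
As a sketch of the first point, Prometheus retention is controlled by startup flags, which could extend the command list in the Docker Compose file above:

bash
# Keep 30 days of TSDB data, capped at 50GB, whichever limit is hit first
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB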

By implementing comprehensive monitoring with Prometheus, Grafana, and proper alerting, you can ensure your Sockudo deployment remains healthy, performant, and reliable in production environments.
