System Monitoring

Real-time monitoring and performance metrics for collabrains.eu.

Monitoring Stack

Components

| Component     | Purpose                            | URL                            |
|---------------|------------------------------------|--------------------------------|
| Prometheus    | Metrics collection and storage     | http://localhost:9090          |
| Node Exporter | System metrics (CPU, memory, disk) | http://localhost:9100          |
| cAdvisor      | Container metrics                  | http://localhost:8081          |
| Grafana       | Visualization and dashboards       | https://grafana.collabrains.eu |

Prometheus

Access

# Local access (from server)
curl http://localhost:9090

# View targets
curl http://localhost:9090/api/v1/targets | jq
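The targets response can be long, so it may help to filter it down to the targets that are failing. The following is a minimal sketch, assuming the usual /api/v1/targets response shape (data.activeTargets[] entries with health, scrapeUrl, and lastError fields). The function name unhealthy_targets is hypothetical and not part of this setup.

```shell
# Hypothetical helper: print one tab-separated line (job, scrape URL, last error)
# for every active target whose health is not "up".
# Reads the /api/v1/targets JSON from stdin, so it also works on saved responses.
unhealthy_targets() {
  jq -r '.data.activeTargets[]
         | select(.health != "up")
         | "\(.labels.job)\t\(.scrapeUrl)\t\(.lastError)"'
}

# Usage (on the server):
#   curl -s http://localhost:9090/api/v1/targets | unhealthy_targets
```

No output means every active target is currently reported as up.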

Query Metrics

Common queries:

# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# Disk usage percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

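# Container memory usage (GiB); name!="" limits the sum to named containers so parent cgroup totals are not counted twice
sum(container_memory_usage_bytes{name!=""}) / 1024 / 1024 / 1024

# Network I/O (bytes per second, received and transmitted)
rate(node_network_receive_bytes_total[5m])
rate(node_network_transmit_bytes_total[5m])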
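These queries can also be run from the shell through the Prometheus HTTP API. The sketch below is one possible wrapper, not part of the existing setup: the function name prom_query and the PROM_URL override variable are assumptions, and the default URL is the local Prometheus address listed under Components.

```shell
# Hypothetical helper: run an instant PromQL query against Prometheus.
# PROM_URL is an assumed override variable; it defaults to the local instance.
prom_query() {
  # -G sends the data as a GET query string; --data-urlencode escapes
  # characters such as { } [ ] and " that would otherwise break the URL.
  curl -sG "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   prom_query '(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100' | jq '.data.result'
```

Passing the expression as a single-quoted argument keeps the shell from touching the braces, brackets, and double quotes in the PromQL.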

Grafana Dashboards

Access

  • URL: https://grafana.collabrains.eu
  • Default user: admin
  • Password: Check /data/coolify/services/GRAFANA_ID/.env

Pre-built Dashboards

Dashboards track:

  • System CPU, memory, disk
  • Container resource usage
  • Network I/O
  • Service-specific metrics

Create Custom Dashboard

  1. Open Grafana
  2. Dashboards → Create
  3. Add panels with PromQL queries
  4. Configure visualization
  5. Save

Import Dashboards

Grafana comes with pre-configured dashboards. To add a community dashboard from Grafana.com:

  1. Dashboards → Import
  2. Paste the dashboard ID from Grafana.com
  3. Select the Prometheus data source
  4. Import

Key Metrics

System Health

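# -G with --data-urlencode sends the query URL-encoded, so curl does not
# treat { } and [ ] in the PromQL as URL glob patterns

# CPU usage
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=100-avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))*100'

# Memory usage
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=(1-(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes))*100'

# Disk usage (one result per mounted filesystem)
curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=(1-(node_filesystem_avail_bytes/node_filesystem_size_bytes))*100'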

Container Metrics

# All container memory usage
docker stats --no-stream

# Specific container
docker stats CONTAINER_NAME --no-stream

Alert Conditions

Consider alerts for:

  • CPU > 80% for 5 minutes
  • Memory > 90%
  • Disk > 85%
  • Container restart loops

Monitoring Queries

Database Performance

# PostgreSQL query time (if metrics exposed)
histogram_quantile(0.95, rate(pg_query_duration_seconds_bucket[5m]))

# Connection count
pg_stat_activity_count

Service Health

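# Container status: seconds since cAdvisor last saw the container (a growing value suggests it is down)
time() - container_last_seen{name="SERVICE_NAME"}

# Restart count over the last hour (the start time changes on each restart)
changes(container_start_time_seconds{name="SERVICE_NAME"}[1h])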

Performance Troubleshooting

High CPU Usage

  1. Identify the culprit:

     docker stats --no-stream | sort -k 3 -h

  2. Check Prometheus:
       • Query: rate(container_cpu_usage_seconds_total[5m]) * 100
       • Filter by container

  3. Common causes:
       • OCR processing (Paperless)
       • AI indexing (Immich)
       • Workflow execution (n8n)

  4. Solution:
       • Let the process complete
       • Adjust settings to run at off-peak hours
       • Restart if stuck: docker restart CONTAINER_NAME
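When many containers are running, it can be easier to reduce the docker stats output to just the top CPU consumers. The sketch below is one way to do that. The function name rank_by_cpu is hypothetical, and it assumes input lines of the form "CPU% NAME", which docker stats can produce with its --format option.

```shell
# Hypothetical helper: sort "CPU% NAME" lines by CPU, highest first,
# and keep the top N lines (default 5). Reads from stdin.
rank_by_cpu() {
  # LC_ALL=C keeps the decimal point handling independent of the system locale.
  LC_ALL=C sort -k1,1nr | head -n "${1:-5}"
}

# Usage (requires Docker):
#   docker stats --no-stream --format '{{.CPUPerc}} {{.Name}}' | rank_by_cpu 3
```

Because the custom format omits the header row, no header line gets mixed into the sorted output.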

High Memory Usage

  1. Monitor:

     free -h
     docker stats CONTAINER_NAME --no-stream

  2. Query Prometheus:
       • sum(container_memory_usage_bytes{name!=""}) / 1024 / 1024 / 1024 (GiB across named containers)

  3. Common causes:
       • Memory leak in the application
       • OCR/AI processing
       • Large dataset operations

  4. Solution:
       • Restart the container: docker restart CONTAINER_NAME
       • Increase swap if needed
       • Reduce concurrent tasks

Disk Space Issues

  1. Check usage:

     df -h /
     du -sh /data/coolify
     du -sh /backups

  2. Query Prometheus:
       • node_filesystem_avail_bytes{mountpoint="/"} (bytes available)

  3. Common causes:
       • Old backups
       • Large volumes
       • Log files

  4. Solution:

     # Remove backups older than 30 days (-mindepth 1 keeps find from matching /backups itself)
     find /backups -mindepth 1 -mtime +30 -exec rm -rf {} \;

     # Clean Docker (caution: -a removes all unused images, --volumes removes unused volumes and their data)
     docker system prune -a --volumes

     # Check service sizes
     du -sh /data/coolify/services/*/
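Since the cleanup deletes data, it may be worth previewing what would be removed first. The following is a minimal dry-run sketch. The function name list_old_backups is hypothetical, and it assumes each backup is a top-level file or directory under the backup directory.

```shell
# Hypothetical helper: list top-level entries in DIR that were last modified
# more than DAYS days ago (default 30). Nothing is deleted.
list_old_backups() {
  find "$1" -mindepth 1 -maxdepth 1 -mtime +"${2:-30}" -print
}

# Usage:
#   list_old_backups /backups 30
```

Only once the list looks right would the same criteria be passed to a deleting command.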

Alerting

Manual Alert Setup

Set up alerts in Grafana:

  1. Dashboard → Alert rules
  2. Create new alert
  3. Set condition and notification channel
  4. Configure actions

Alert Examples

CPU Alert:

Condition: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
For: 5 minutes

Memory Alert:

Condition: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
For: 5 minutes

Disk Alert:

Condition: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
For: 5 minutes
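The same thresholds can also be checked from a script, for example in a cron job that runs alongside Grafana. The sketch below shows only the comparison step and does not model the "For: 5 minutes" duration. The function and variable names are hypothetical, and the measured value is assumed to come from a Prometheus query made elsewhere.

```shell
# Hypothetical helper: succeed (exit 0) when VALUE is strictly greater than LIMIT.
# awk does the comparison because POSIX shell arithmetic cannot handle decimals.
over_threshold() {
  LC_ALL=C awk -v v="$1" -v l="$2" 'BEGIN { exit !(v + 0 > l + 0) }'
}

# Thresholds from the alert examples above (percent).
CPU_LIMIT=80
MEM_LIMIT=90
DISK_LIMIT=85

# Usage with a measured value, e.g. 91.4% memory:
#   over_threshold 91.4 "$MEM_LIMIT" && echo "memory alert"
```

A single reading above the limit fires immediately here, so a short CPU spike triggers it even though the Grafana rule would wait five minutes.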

Retention Policies

Prometheus

  • Default retention: 15 days
  • Metrics resolution: 15 seconds
  • Storage: ~1GB per week

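Adjust in the Prometheus configuration if needed: retention is controlled by the --storage.tsdb.retention.time startup flag, and resolution by scrape_interval in the Prometheus config file.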
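As a rough planning aid, the figures above can be turned into a disk estimate for a given retention period. The sketch below assumes storage grows linearly at the stated ~1GB per week, which is only an approximation. The function name estimate_storage_gb is hypothetical.

```shell
# Hypothetical helper: estimate Prometheus disk usage in GB for a retention
# period of DAYS, assuming linear growth at GB_PER_WEEK (default 1).
estimate_storage_gb() {
  LC_ALL=C awk -v days="$1" -v per_week="${2:-1}" \
    'BEGIN { printf "%.1f\n", days / 7 * per_week }'
}

# Usage:
#   estimate_storage_gb 15   # current 15-day retention, prints 2.1
#   estimate_storage_gb 30   # doubled retention, prints 4.3
```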

Exporting Metrics

Export Data

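# Query a time range and export to JSON
# (cpu_usage is a placeholder expression; start and end are Unix timestamps, step is in seconds)
curl -sG http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=cpu_usage' \
  --data-urlencode 'start=1609459200' \
  --data-urlencode 'end=1609545600' \
  --data-urlencode 'step=60' | jq . > metrics.json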
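The start, end, and step values determine how many samples the export contains. The sketch below computes a recent time window and the resulting number of points per series. Both function names are hypothetical, and date +%s is assumed to be available (it is on GNU and BSD systems).

```shell
# Hypothetical helper: print "START END" Unix timestamps for the last HOURS hours.
last_hours_range() {
  end=$(date +%s)
  start=$((end - $1 * 3600))
  printf '%s %s\n' "$start" "$end"
}

# Hypothetical helper: samples per series for a range query (both ends inclusive).
range_points() {
  echo $(( ($2 - $1) / $3 + 1 ))
}

# Usage:
#   range_points 1609459200 1609545600 60   # the export above: 1441
#   set -- $(last_hours_range 24); echo "start=$1 end=$2"
```

Prometheus rejects range queries that exceed 11,000 points per series, so a larger step is needed for long windows.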

Grafana Export

  1. Dashboard → Menu → Share
  2. Export → Download JSON
  3. Use the JSON to restore the dashboard on another Grafana instance