System Monitoring
Real-time monitoring and performance metrics for collabrains.eu.
Monitoring Stack
Components
| Component | Purpose | URL |
|---|---|---|
| Prometheus | Metrics collection and storage | http://localhost:9090 |
| Node Exporter | System metrics (CPU, memory, disk) | http://localhost:9100 |
| cAdvisor | Container metrics | http://localhost:8081 |
| Grafana | Visualization and dashboards | https://grafana.collabrains.eu |
Prometheus
Access
# Local access (from server)
curl http://localhost:9090
# View targets
curl http://localhost:9090/api/v1/targets | jq
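The raw targets response is verbose, so it can help to reduce it to one line per scrape target. The following is a minimal sketch, assuming jq is installed and that the response follows the usual layout of the Prometheus targets endpoint; the function name summarize_targets is hypothetical.

```bash
# Reduce a /api/v1/targets response (read from stdin) to tab-separated
# "job, scrape URL, health" lines, one per active target.
# summarize_targets is an illustrative name, not part of any tool.
summarize_targets() {
  jq -r '.data.activeTargets[] | [.labels.job, .scrapeUrl, .health] | @tsv'
}

# Usage on the server (assumes Prometheus on localhost:9090):
#   curl -s http://localhost:9090/api/v1/targets | summarize_targets
```

Any target whose last column is not "up" is worth checking first.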
Query Metrics
Common queries:
# CPU usage percentage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
# Container memory usage (GB)
sum(container_memory_usage_bytes) / 1024 / 1024 / 1024
# Network I/O
rate(node_network_receive_bytes_total[5m])
Grafana Dashboards
Access
- URL: https://grafana.collabrains.eu
- Default user: admin
- Password: Check /data/coolify/services/GRAFANA_ID/.env
Pre-built Dashboards
Dashboards track:
- System CPU, memory, disk
- Container resource usage
- Network I/O
- Service-specific metrics
Create Custom Dashboard
- Open Grafana
- Dashboards → Create
- Add panels with PromQL queries
- Configure visualization
- Save
Import Dashboards
Grafana comes with pre-configured dashboards; additional ones can be imported from Grafana.com:
1. Dashboards → Import
2. Paste dashboard ID from Grafana.com
3. Select Prometheus data source
4. Import
Key Metrics
System Health
# -G with --data-urlencode sends the expression URL-encoded, so curl does not misread the braces and brackets
# CPU usage
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=100-avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))*100'
# Memory usage
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=(1-(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes))*100'
# Disk space
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=(1-(node_filesystem_avail_bytes/node_filesystem_size_bytes))*100'
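The query endpoint wraps each result in a JSON envelope, which is awkward when a script only needs the number. The sketch below shows one way to run an instant query and pull out the first value. It assumes jq is installed; PROM_URL, prom_query, and prom_first_value are hypothetical names introduced for illustration.

```bash
# Base URL of the Prometheus server; override via the environment if needed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant query. curl -G with --data-urlencode encodes the
# quotes, braces, and brackets that PromQL expressions contain.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Print the value of the first series in an instant-query response (stdin).
# Prints "null" when the query returned no series.
prom_first_value() {
  jq -r '.data.result[0].value[1]'
}

# Usage:
#   prom_query '(1-(node_memory_MemAvailable_bytes/node_memory_MemTotal_bytes))*100' | prom_first_value
```

Keeping the network call and the parsing in separate functions makes the parsing step easy to try against a saved response.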
Container Metrics
# Resource usage for all containers (CPU, memory, network, block I/O)
docker stats --no-stream
# Specific container
docker stats CONTAINER_NAME --no-stream
Alert Conditions
Consider alerts for:
- CPU > 80% for 5 minutes
- Memory > 90%
- Disk > 85%
- Container restart loops
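Before wiring up real alerts, the thresholds above can be tried out against current readings from the shell. This is a rough sketch only: check_threshold is a hypothetical helper, and the three values passed to it are made-up sample numbers rather than measurements.

```bash
# Compare a metric value against a limit using awk, since bash
# arithmetic cannot handle the fractional values Prometheus returns.
# Prints "ALERT" or "ok" with the metric name. check_threshold is an
# illustrative helper; the limits mirror the list above.
check_threshold() {
  local name="$1" value="$2" limit="$3"
  awk -v n="$name" -v v="$value" -v l="$limit" 'BEGIN {
    if (v + 0 > l + 0) printf "ALERT %s %.1f > %s\n", n, v, l
    else               printf "ok %s %.1f <= %s\n", n, v, l
  }'
}

# Sample values for illustration only
check_threshold cpu    72.4 80
check_threshold memory 93.1 90
check_threshold disk   85.0 85
```

The comparison is strict, so a reading exactly at the limit (the disk example) does not count as an alert.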
Monitoring Queries
Database Performance
# PostgreSQL query time (if metrics exposed)
histogram_quantile(0.95, rate(pg_query_duration_seconds_bucket[5m]))
# Connection count
pg_stat_activity_count
Service Health
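# Container status (Unix timestamp of the last time cAdvisor saw the container)
container_last_seen{name="SERVICE_NAME"}
# Restart count (number of times the container start time changed in the last hour)
changes(container_start_time_seconds{name="SERVICE_NAME"}[1h])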
Performance Troubleshooting
High CPU Usage
- Identify culprit:
  docker stats --no-stream | sort -k 3 -h
- Check Prometheus:
  - Query: rate(container_cpu_usage_seconds_total[5m]) * 100
  - Filters by container
- Common causes:
  - OCR processing (Paperless)
  - AI indexing (Immich)
  - Workflow execution (n8n)
- Solution:
  - Let process complete
  - Adjust settings to run at off-peak hours
  - Restart if stuck:
    docker restart CONTAINER_NAME
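Sorting the default docker stats table depends on column positions and also sorts the header row. A format template with only the name and CPU columns is easier to rank. The sketch below assumes Docker's --format placeholders .Name and .CPUPerc; top_cpu is a hypothetical helper name.

```bash
# Sort "name<TAB>cpu%" lines from stdin by CPU, highest first,
# and keep the top N (default 5). top_cpu is an illustrative name.
top_cpu() {
  LC_ALL=C sort -t "$(printf '\t')" -k2,2 -rn | head -n "${1:-5}"
}

# Usage on the server:
#   docker stats --no-stream --format '{{.Name}}\t{{.CPUPerc}}' | top_cpu 3
```

The numeric sort reads the leading number of each percentage, so the trailing "%" does not need to be stripped.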
High Memory Usage
- Monitor:
  free -h
  docker stats CONTAINER_NAME --no-stream
- Query Prometheus:
  - sum(container_memory_usage_bytes) / 1024 / 1024 / 1024
- Common causes:
  - Memory leak in application
  - OCR/AI processing
  - Large dataset operations
- Solution:
  - Restart container:
    docker restart CONTAINER_NAME
  - Increase swap if needed
  - Reduce concurrent tasks
Disk Space Issues
- Check usage:
  df -h /
  du -sh /data/coolify
  du -sh /backups
- Query Prometheus:
  - node_filesystem_avail_bytes{mountpoint="/"} (bytes available)
- Common causes:
  - Old backups
  - Large volumes
  - Log files
- Solution:
  ```bash
  # Remove backups older than 30 days (-mindepth 1 keeps /backups itself from being deleted)
  find /backups -mindepth 1 -mtime +30 -exec rm -rf {} \;

  # Clean Docker (removes all unused images and unused volumes; make sure no needed service is stopped)
  docker system prune -a --volumes

  # Check service sizes
  du -sh /data/coolify/services/*/
  ```
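Because the cleanup deletes files permanently, it is worth previewing what would match before running it. The following sketch lists candidates without removing anything. list_old_backups is a hypothetical helper, and limiting the search to top-level entries with -maxdepth 1 is an assumption about how the backup directory is organised.

```bash
# List entries directly inside a backup directory that are older than
# N days, without deleting anything. -mindepth 1 keeps the directory
# itself out of the results. list_old_backups is an illustrative name.
list_old_backups() {
  local dir="$1" days="${2:-30}"
  find "$dir" -mindepth 1 -maxdepth 1 -mtime "+${days}" -print
}

# Usage:
#   list_old_backups /backups 30
```

If the listing looks right, the same find expression can be reused with the delete action.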
Alerting
Manual Alert Setup
Set up alerts in Grafana:
1. Dashboard → Alert rules
2. Create new alert
3. Set condition and notification channel
4. Configure actions
Alert Examples
CPU Alert:
Condition: 100 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80
For: 5 minutes
Memory Alert:
Condition: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
For: 5 minutes
Disk Alert:
Condition: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 85
For: 5 minutes
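The "For" value means the condition has to hold for the whole period, not just at one evaluation. As a rough illustration of that idea (not how Grafana evaluates rules internally), the sketch below checks whether every sample in a window exceeds a limit. sustained_above is a hypothetical helper and the sample values are made up.

```bash
# Succeed (exit 0) only if every sample on stdin exceeds the limit,
# which approximates an alert's "For" clause over that window.
# Empty input counts as "not sustained". sustained_above is illustrative.
sustained_above() {
  awk -v l="$1" '
    { n++; if ($1 + 0 <= l + 0) bad = 1 }
    END { if (n > 0 && !bad) exit 0; exit 1 }
  '
}

# Five made-up CPU samples, one per minute
if printf '%s\n' 91 88 85 83 82 | sustained_above 80; then
  echo "CPU alert would fire"
fi
```

A single dip to or below the limit within the window resets the outcome, which is why short spikes do not trigger the alert.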
Retention Policies
Prometheus
- Default retention: 15 days
- Metrics resolution: 15 seconds
- Storage: ~1GB per week
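Adjust in the Prometheus configuration if needed: retention is set with the --storage.tsdb.retention.time startup flag, and resolution with scrape_interval in prometheus.yml.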
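The figures above give a quick way to estimate disk needs before changing retention. This is a back-of-the-envelope sketch that assumes growth stays linear at roughly 1GB per week; estimate_storage_gb is a hypothetical helper, and the 90-day case is only an example value.

```bash
# Estimate Prometheus disk usage from retention and weekly growth.
# Uses awk for fractional arithmetic; estimate_storage_gb is illustrative.
estimate_storage_gb() {
  local days="$1" gb_per_week="${2:-1}"
  awk -v d="$days" -v w="$gb_per_week" 'BEGIN { printf "%.1f\n", d / 7 * w }'
}

estimate_storage_gb 15   # default retention: prints 2.1
estimate_storage_gb 90   # example 90-day retention: prints 12.9
```

Actual usage depends on the number of series and scrape targets, so the result is a starting point rather than a guarantee.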
Exporting Metrics
Export Data
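# Query a time range and export to JSON
# (cpu_usage is a placeholder metric name; start and end are Unix timestamps and must fall within the retention window)
curl 'http://localhost:9090/api/v1/query_range?query=cpu_usage&start=1609459200&end=1609545600&step=60' | jq . > metrics.json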
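The exported JSON nests samples inside each series, which spreadsheets do not read directly. The sketch below flattens a range-query response into CSV. It assumes jq is installed and the standard matrix layout of the range-query response; range_to_csv is a hypothetical name.

```bash
# Flatten a /api/v1/query_range response (stdin) into CSV rows of
# "timestamp,value", one row per sample across all returned series.
# Values stay quoted because Prometheus returns them as strings.
# range_to_csv is an illustrative name; requires jq.
range_to_csv() {
  jq -r '.data.result[] | .values[] | [.[0], .[1]] | @csv'
}

# Usage:
#   range_to_csv < metrics.json > metrics.csv
```

When a query returns several series, their rows are written one after another without a label column, so this form suits single-series queries best.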
Grafana Export
- Dashboard → Menu → Share
- Export → Download JSON
- Use the JSON file to restore the dashboard on another Grafana instance
Related Documentation
- Troubleshooting — Performance issues
- Common Commands — Manual checks
- Services Overview — Service metrics