Production-grade monitoring for a Kubernetes media stack: custom Grafana dashboards, intelligent Prometheus alerts, and formatted email notifications via Gmail SMTP.

“Hope is not a strategy.” - Everyone who checked the logs too late


The Problem

Running 8+ media apps on Kubernetes without monitoring is flying blind. Is Plex streaming or just “Running”? Which app is eating RAM? Did Sonarr crash at 3am?

You need observability before users complain.


Architecture

Prometheus
  ├─► Scrapes metrics (cAdvisor, kube-state-metrics, kubelet)
  ├─► Evaluates alert rules every 30s
  └─► Sends alerts to Alertmanager
       └─► Routes to Gmail SMTP with HTML templates

Grafana
  ├─► Queries Prometheus for metrics
  ├─► Displays custom dashboards
  └─► Includes pre-built K8s dashboards

Deployment

kube-prometheus-stack

Single Helm chart includes everything: Prometheus, Grafana, Alertmanager, node-exporter, kube-state-metrics.

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts

helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n media \
  -f apps/monitoring/values.yaml
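
Once the release is up, a quick sanity check before moving on (everything lands in the media namespace next to the media apps):

kubectl get pods -n media

# Expect prometheus-*, alertmanager-*, grafana, kube-state-metrics,
# and one node-exporter pod per node, all Running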

Configuration

# apps/monitoring/values.yaml
grafana:
  enabled: true
  adminPassword: admin
  
  initChownData:
    enabled: false  # Causes permission errors on Talos
  
  ingress:
    enabled: true
    ingressClassName: traefik
    hosts:
      - grafana.media.lan
    tls:
      - hosts:
          - grafana.media.lan
  
  persistence:
    enabled: true
    storageClassName: local-path  # SQLite needs fast I/O
    size: 5Gi
  
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi
  
  defaultDashboardsEnabled: true

prometheus:
  enabled: true
  
  prometheusSpec:
    retention: 7d
    
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-appdata  # See storage notes below
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 10Gi
    
    resources:
      requests:
        cpu: 200m
        memory: 512Mi
      limits:
        cpu: 1000m
        memory: 2Gi

alertmanager:
  enabled: true
  
  config:
    global:
      resolve_timeout: 5m
      smtp_from: 'jlambert229@gmail.com'
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_auth_username: 'jlambert229@gmail.com'
      smtp_auth_password: 'your-gmail-app-password'
      smtp_require_tls: true
    
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 10s
      group_interval: 10s
      repeat_interval: 12h
      receiver: 'email'
    
    receivers:
      - name: 'email'
        email_configs:
          - to: 'jlambert229@gmail.com'
            send_resolved: true
            headers:
              Subject: '{{ if eq .Status "firing" }}🚨{{ else }}✅{{ end }} [{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
            html: |
              <!DOCTYPE html>
              <html>
              <head>
                <style>
                  .critical { background-color: #ffebee; border-left: 5px solid #c62828; }
                  .warning { background-color: #fff3e0; border-left: 5px solid #f57c00; }
                  .resolved { background-color: #e8f5e9; border-left: 5px solid #388e3c; }
                </style>
              </head>
              <body>
                <h1>{{ if eq .Status "firing" }}🚨 Alert Firing{{ else }}✅ Alert Resolved{{ end }}</h1>
                {{ range .Alerts }}
                <div class="alert {{ .Labels.severity }}">
                  <h2>{{ .Labels.alertname }}</h2>
                  <p><strong>Summary:</strong> {{ .Annotations.summary }}</p>
                  <p><strong>Description:</strong> {{ .Annotations.description }}</p>
                  <p><strong>Started:</strong> {{ .StartsAt }}</p>
                </div>
                {{ end }}
              </body>
              </html>

💡 Tip
Get your Gmail app password at https://myaccount.google.com/apppasswords. Requires 2FA enabled on your Google account.
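
The values above inline the Gmail app password, which is fine for a lab but awkward once values.yaml lives in Git. One alternative (a sketch, not what's deployed here; the file and Secret names are hypothetical) is to build the whole Alertmanager config as a Secret and point the operator at it through the Alertmanager CRD's configSecret field, which the chart exposes under alertmanager.alertmanagerSpec:

# alertmanager.yaml is a local file holding the same config block shown above
kubectl create secret generic alertmanager-media-config \
  -n media \
  --from-file=alertmanager.yaml=./alertmanager.yaml

Then replace the inline config in values.yaml with a reference to the Secret:

alertmanager:
  alertmanagerSpec:
    configSecret: alertmanager-media-config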

Storage Decisions

Why NFS for Prometheus?

Tried local-path first. Got permission errors:

Error: open /prometheus/queries.active: permission denied

Root cause: Talos Linux’s strict security + Prometheus’s file permissions = conflict

Solution: Use NFS storage (more lenient permissions)

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-appdata  # Not local-path

Trade-off: Historical metrics aren’t latency-sensitive. NFS network overhead is acceptable.

Why local-path for Grafana?

Grafana uses SQLite. Random I/O needs fast storage:

grafana:
  persistence:
    storageClassName: local-path  # Fast SSD

ℹ️ Info
This is the same split as the media apps: databases on local-path, large sequential data on NFS.
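
Once both pods are up, it's worth confirming the claims actually bound to the intended classes:

kubectl get pvc -n media | grep -E 'prometheus|grafana'

# The Prometheus claim should show nfs-appdata, the Grafana claim local-path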

Custom Dashboard

Grafana ships with excellent Kubernetes dashboards. But I wanted media-stack-specific panels.

Auto-Import via ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-media-stack
  namespace: media
  labels:
    grafana_dashboard: "1"  # Sidecar watches for this label
data:
  media-stack-overview.json: |
    {
      "title": "Media Stack Overview",
      "panels": [
        {
          "title": "CPU Usage by App",
          "targets": [{
            "expr": "sum(rate(container_cpu_usage_seconds_total{namespace=\"media\", pod=~\"plex.*|sonarr.*|radarr.*\"}[5m])) by (pod)"
          }]
        },
        {
          "title": "Memory Usage by App",
          "targets": [{
            "expr": "sum(container_memory_working_set_bytes{namespace=\"media\", pod=~\"plex.*|sonarr.*|radarr.*\"}) by (pod)"
          }]
        },
        {
          "title": "Pod Status",
          "targets": [{
            "expr": "kube_pod_status_ready{namespace=\"media\", condition=\"true\", pod=~\"plex.*|sonarr.*|radarr.*\"}"
          }]
        },
        {
          "title": "Network I/O",
          "targets": [{
            "expr": "sum(rate(container_network_receive_bytes_total{namespace=\"media\", pod=~\"plex.*|sabnzbd.*\"}[5m])) by (pod)"
          }]
        },
        {
          "title": "Storage Usage",
          "targets": [{
            "expr": "100 * sum(kubelet_volume_stats_used_bytes{persistentvolumeclaim=\"media-data\"}) / sum(kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=\"media-data\"})"
          }]
        }
      ]
    }

Deploy:

kubectl apply -f dashboards/media-stack-overview.yaml

Grafana’s sidecar container watches for grafana_dashboard: "1" labels and imports automatically. No manual clicking.

💡 Tip
Check import status: kubectl logs -n media -l app.kubernetes.io/name=grafana -c grafana-sc-dashboard
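
The sidecar log is the source of truth, but Grafana's search API is a quick second check that the dashboard actually landed. The service name and port below are the chart defaults; the credentials are whatever adminPassword is set to in values.yaml:

kubectl port-forward -n media svc/kube-prometheus-stack-grafana 3000:80 &

# Should return a JSON entry for "Media Stack Overview"
curl -s -u admin:admin 'http://localhost:3000/api/search?query=Media%20Stack'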

Alert Rules

Philosophy

Every alert should be:

  1. Actionable - You can fix it
  2. Urgent - Needs timely response
  3. Real - Actually indicates a problem

Don’t alert on noise.

Configuration

# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: media-stack-alerts
  namespace: media
  labels:
    prometheus: kube-prometheus-stack-prometheus
    role: alert-rules
spec:
  groups:
    - name: media-stack
      interval: 30s
      rules:
        # CRITICAL: App down
        - alert: MediaAppPodDown
          expr: |
            kube_pod_status_phase{
              namespace="media",
              phase="Running",
              pod=~"plex.*|sonarr.*|radarr.*|sabnzbd.*"
            } == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.pod }} is down"
            description: "Pod has not been running for 2 minutes."
        
        # WARNING: High CPU
        - alert: MediaAppHighCPU
          expr: |
            sum(rate(container_cpu_usage_seconds_total{
              namespace="media",
              pod=~"plex.*|sonarr.*|radarr.*|sabnzbd.*"
            }[5m])) by (pod) > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High CPU on {{ $labels.pod }}"
            description: "Pod using {{ $value | humanizePercentage }} CPU for 5+ minutes."
        
        # WARNING: High Memory
        - alert: MediaAppHighMemory
          expr: |
            sum(container_memory_working_set_bytes{
              namespace="media",
              pod=~"plex.*|sonarr.*|radarr.*"
            }) by (pod) /
            sum(container_spec_memory_limit_bytes{
              namespace="media",
              pod=~"plex.*|sonarr.*|radarr.*"
            }) by (pod) > 0.85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High memory on {{ $labels.pod }}"
            description: "Pod using {{ $value | humanizePercentage }} of memory limit."
        
        # WARNING: Storage full
        - alert: MediaStorageFull
          expr: |
            100 * sum(kubelet_volume_stats_used_bytes{
              namespace="media",
              persistentvolumeclaim="media-data"
            }) / sum(kubelet_volume_stats_capacity_bytes{
              namespace="media",
              persistentvolumeclaim="media-data"
            }) > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Media storage almost full"
            description: "NFS storage is {{ $value | humanize }}% full."
        
        # INFO: Container restarted
        - alert: MediaAppRestarting
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="media",
              pod=~"plex.*|sonarr.*|radarr.*"
            }[1h]) > 0
          labels:
            severity: info
          annotations:
            summary: "{{ $labels.pod }} restarted"
            description: "Pod restarted {{ $value }} time(s) in the last hour."

Deploy:

kubectl apply -f monitoring/prometheus-rules.yaml

ℹ️ Info
The for: 5m clause prevents flapping. Plex CPU spikes for 30s during a transcode? Not worth alerting.
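
One gotcha worth checking after the apply: the operator only loads PrometheusRules whose labels match the Prometheus ruleSelector. Out of the box the chart selects rules labeled release: kube-prometheus-stack (unless ruleSelectorNilUsesHelmValues is disabled), so if the /rules page stays empty, compare the selector with the labels above:

# The CR itself
kubectl get prometheusrule -n media media-stack-alerts

# What the running Prometheus is actually selecting
kubectl get prometheus -n media -o jsonpath='{.items[0].spec.ruleSelector}'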

Gmail SMTP Setup

Get App Password

  1. Enable 2-Step Verification: https://myaccount.google.com/security
  2. Generate app password: https://myaccount.google.com/apppasswords
  3. Name it “Kubernetes Alerts”
  4. Copy the 16-character password

Test SMTP

curl --url 'smtp://smtp.gmail.com:587' \
  --ssl-reqd \
  --mail-from 'your-email@gmail.com' \
  --mail-rcpt 'your-email@gmail.com' \
  --user 'your-email@gmail.com:app-password' \
  --upload-file - <<EOF
Subject: Test Alert

This is a test from Alertmanager.
EOF

If that succeeds, Alertmanager will work.


Useful PromQL Queries

Top CPU Consumers

topk(5, sum(rate(container_cpu_usage_seconds_total{namespace="media"}[5m])) by (pod))

Memory as % of Limit

100 * container_memory_working_set_bytes / container_spec_memory_limit_bytes

Pods with >5 Restarts (24h)

increase(kube_pod_container_status_restarts_total{namespace="media"}[24h]) > 5

Storage Trend (7d)

# Graph this over a 7-day time range in Explore; querying a bare range
# selector like [7d] returns a range vector, which isn't graphable directly
kubelet_volume_stats_used_bytes{persistentvolumeclaim="media-data"}

💡 Tip
Run these in Grafana’s Explore tab (left sidebar) to test queries before adding to dashboards.
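
They also work from a terminal against Prometheus's HTTP API, using the same port-forward shown in Troubleshooting below:

kubectl port-forward -n media svc/kube-prometheus-stack-prometheus 9090:9090 &

# Instant query: top 5 CPU consumers in the media namespace
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=topk(5, sum(rate(container_cpu_usage_seconds_total{namespace="media"}[5m])) by (pod))'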

Troubleshooting

Prometheus CrashLoopBackOff

Symptom:

Error: open /prometheus/queries.active: permission denied

Fix: Use NFS storage

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-appdata  # Not local-path

Talos Linux’s strict security conflicts with Prometheus file permissions. NFS bypasses the issue.

Dashboard Not Auto-Importing

Check sidecar logs:

kubectl logs -n media -l app.kubernetes.io/name=grafana -c grafana-sc-dashboard

Verify label:

labels:
  grafana_dashboard: "1"  # Must be string "1", not int 1
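
A label selector confirms the ConfigMap is actually carrying it:

# Empty output means the label is missing or mistyped
kubectl get configmap -n media -l grafana_dashboard=1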

Alerts Not Firing

Check rule status:

kubectl port-forward -n media svc/kube-prometheus-stack-prometheus 9090:9090
# Visit http://localhost:9090/rules

Manually trigger test alert:

- alert: TestAlert
  expr: vector(1)  # Always true
  annotations:
    summary: "Test alert"
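
For that rule to load, it has to live inside a PrometheusRule the operator will pick up. A minimal throwaway example (the name test-alert is arbitrary; the labels mirror the main rule CR so the same ruleSelector matches):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test-alert
  namespace: media
  labels:
    prometheus: kube-prometheus-stack-prometheus
    role: alert-rules
spec:
  groups:
    - name: test
      rules:
        - alert: TestAlert
          expr: vector(1)  # Always true
          labels:
            severity: info
          annotations:
            summary: "Test alert"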

Email Not Sending

Check Alertmanager logs:

kubectl logs -n media alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager | grep -i smtp

Common issues:

  • Wrong app password (16 chars, no spaces)
  • 2FA not enabled on Gmail
  • Port 587 blocked by firewall
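
Two more checks that isolate Alertmanager from Prometheus entirely (port-forward and alert labels are illustrative):

kubectl port-forward -n media svc/kube-prometheus-stack-alertmanager 9093:9093 &

# 1. The status endpoint echoes the loaded config; smtp_smarthost and smtp_from
#    should appear under config.original (the password is masked)
curl -s http://localhost:9093/api/v2/status

# 2. Post a fake alert straight to the v2 API; if the email arrives,
#    the problem is upstream in Prometheus, not in SMTP or routing
curl -s -XPOST 'http://localhost:9093/api/v2/alerts' \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ManualTest", "severity": "warning"},
        "annotations": {"summary": "Manual test", "description": "Fired by hand via the v2 API"}}]'
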
💡 Tip
Test Gmail SMTP directly (curl command above) before debugging Alertmanager. Separate infrastructure from config issues.

Alert Tuning

Start with Too Many

Deploy all alerts, then tune down based on noise. Better to know what’s noisy than miss real issues.

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s

If 3 pods go down simultaneously, you get one email with 3 alerts, not 3 separate emails.

Use for: to Prevent Flapping

for: 5m  # Must be true for 5 consecutive minutes

Transient CPU spikes don’t trigger alerts. Only sustained issues do.

Send Resolved Notifications

send_resolved: true

Getting the “all clear” email is satisfying. Always enable it.


Operational Insights

What Actually Gets Alerted (Last 7 Days)

  • 3x MediaAppRestarting (info) - Sonarr updated
  • 1x MediaAppHighMemory (warning) - Plex hit 85% but didn’t OOM
  • 0x MediaAppPodDown (critical) - Stack stable

Key insight: Most restarts are planned (updates). Use severity: info so you’re aware but not paged at 3am.
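
One way to encode that in Alertmanager is a child route that matches severity: info and repeats far less aggressively. A sketch layered on the route block from the values file (the 24h interval is just an example):

route:
  receiver: 'email'
  routes:
    # Planned-restart noise: keep the email, drop the urgency
    - matchers:
        - severity="info"
      receiver: 'email'
      repeat_interval: 24h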

Dashboards I Use Daily

  1. Media Stack Overview - Quick health check
  2. Kubernetes / Compute Resources / Namespace - When something feels slow
  3. Node Exporter / Nodes - Host-level investigation

The pre-built Kubernetes dashboards are excellent. Don’t reinvent them.


Checklist

Before calling monitoring “done”:

  • Grafana accessible at https://grafana.media.lan
  • Prometheus /targets page shows all scrape targets
  • Custom dashboard auto-imported
  • Alert rules loaded (/rules page)
  • Test alert sent to email
  • Resolved alert sent when cleared
  • Dashboard panels show live data

What’s Next

Enhance monitoring:

  • Add app-specific metrics (Plex API, SABnzbd queue)
  • Create per-app dashboards
  • Set up SLO tracking (99.9% uptime target)

Improve alerting:

  • Silence windows during maintenance (see the amtool sketch below)
  • Add Slack/Discord integration
  • Create runbooks linked to each alert
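
For the maintenance-window item, Alertmanager already supports ad-hoc silences. A sketch using amtool from inside the Alertmanager pod (matchers, author, and duration are examples):

# Silence every media-stack alert for two hours during planned maintenance
kubectl exec -n media alertmanager-kube-prometheus-stack-alertmanager-0 -c alertmanager -- \
  amtool silence add \
    --alertmanager.url=http://localhost:9093 \
    --author="homelab" --comment="planned maintenance" \
    --duration="2h" \
    alertname=~"MediaApp.*|MediaStorage.*"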

External monitoring:

  • Uptime Robot for external checks
  • Synthetic monitoring (test Plex stream every 5min)
