A complete automated media stack on Kubernetes: streaming, downloads, requests, and monitoring. TLS everywhere, SQLite on fast storage, NFS for media files.

“If it’s not reproducible, it doesn’t exist.” - Everyone rebuilding from scratch


Why Kubernetes for Media?

I ran this on Docker Compose for two years. It worked great until I had to rebuild the host. Spent a weekend trying to remember which directories went where, what ports I’d used, and which environment variables actually mattered.

Kubernetes fixes this:

  • Declarative config - Lose the entire cluster? git clone && kubectl apply. You’re back up in 15 minutes.
  • Health probes - Apps crash. Kubernetes restarts them. You sleep through it.
  • Resource limits - Plex will eat all your RAM if you let it. Don’t let it.
  • Observability - Know what broke before your family complains.

This isn’t theoretical. I’ve rebuilt this stack from scratch three times. Each time took less effort because everything’s in Git.

Full source: k8s-media-stack

⚠️ This is a homelab
I’m using self-signed certs and hardcoding some passwords in Helm values. Don’t do this at work. For a media server running on your home network behind a firewall, it’s fine. I’ll point out where production would diverge.

The Stack

App         Purpose
Plex        Media server for streaming
Sonarr      TV show automation
Radarr      Movie automation
SABnzbd     Usenet downloads
Overseerr   User request management
Prowlarr    Indexer management
Bazarr      Subtitle downloads
Tautulli    Plex statistics
Homepage    Unified dashboard

Infrastructure:

Component        Purpose
Traefik          Ingress + TLS termination
cert-manager     Automated cert issuance
MetalLB          LoadBalancer for bare metal
NFS CSI Driver   Network storage
local-path       Fast local storage

Monitoring:

Component      Purpose
Grafana        Dashboards and visualization
Prometheus     Metrics collection
Alertmanager   Email notifications

Storage Strategy

“I’ll just put SQLite on NFS, what could go wrong?” - Everyone, once

The most important decision: where to put what.

I made this mistake so you don’t have to: I put everything on NFS. Seemed elegant - one place for all the data. Sonarr took 45 seconds to load a page. Radarr was worse. The apps work by hammering SQLite with thousands of small reads. NFS is terrible at this.

Here’s what actually works:

Media files (movies/TV) → NFS
  - Large, sequential I/O
  - Shared across apps
  - Network latency doesn't matter

App configs (SQLite DBs) → local-path
  - Small, random I/O
  - Performance critical
  - Each app pinned to one node

After moving databases to local SSDs, page loads dropped from 45s to 2s. This isn’t a nice-to-have optimization. It’s the difference between usable and unusable.

💡 Tip
Check your database size: kubectl exec -n media deploy/sonarr -- du -sh /config. If it’s under 2GB (and these apps are usually under 500MB), put it on local storage. The performance difference is night and day.

Foundation Setup

“Security is not something you add at the end. It’s something you build in from the start.” - Every security incident postmortem

TLS Certificates

Using cert-manager with a self-signed CA for *.media.lan:

# Generate CA
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
  -nodes -keyout ca.key -out ca.crt \
  -subj "/CN=Media Stack CA" \
  -addext "subjectAltName=DNS:*.media.lan"

# Create secret
kubectl create secret tls media-lan-ca \
  --cert=ca.crt --key=ca.key -n cert-manager

# ClusterIssuer
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: media-lan-ca
spec:
  ca:
    secretName: media-lan-ca
EOF

cert-manager now issues certs automatically for any Ingress in the media namespace. Add the annotation, get a cert.
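
Concretely, the per-app Ingress side of that contract looks like this. A minimal sketch with a placeholder host and service name; the bjw-s values later in this post render the equivalent:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example
  namespace: media
  annotations:
    cert-manager.io/cluster-issuer: media-lan-ca  # the one annotation that matters
spec:
  ingressClassName: traefik
  rules:
    - host: example.media.lan
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example
                port:
                  number: 80
  tls:
    - hosts:
        - example.media.lan
      secretName: example-tls  # cert-manager creates and renews this Secret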

🏭 In production
For anything exposed to the internet or used at work: Let’s Encrypt with DNS-01 challenges, or run a proper internal CA (Vault, etc). Self-signed is fine when you control all the clients and can add your CA to their trust store.
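
For reference, the Let's Encrypt version of the ClusterIssuer looks roughly like this. A sketch assuming Cloudflare-managed DNS and an API token stored in a Secret named cloudflare-api-token in the cert-manager namespace:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@example.com               # placeholder
    privateKeySecretRef:
      name: letsencrypt-account-key      # ACME account key lives here
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token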

Storage Classes

# NFS storage class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-appdata
provisioner: nfs.csi.k8s.io
parameters:
  server: 192.168.2.129
  share: /volume1/nfs01/data
mountOptions:
  - vers=3
  - soft
  - intr

---
# Local storage (Rancher local-path-provisioner)
# Already installed - just use storageClassName: local-path

💡 Tip
The soft,intr mount options mean NFS timeouts return errors instead of hanging forever. Without this, when your NAS reboots, every pod with an NFS mount hangs indefinitely. You can’t even kill them. Learn from my pain: use soft,intr.
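
You can verify the options actually landed on the mount from inside any pod that uses the NFS PVC (sonarr used here as an example; /proc/mounts exists in every container):

# Check live mount options for the NFS share
kubectl exec -n media deploy/sonarr -- cat /proc/mounts | grep '192.168.2.129'
# Expect soft and vers=3 in the option list for the /data mount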

App Configuration Pattern

Every app uses bjw-s/app-template for consistency. Here’s the pattern:

# Example: Sonarr
controllers:
  sonarr:
    containers:
      app:
        image:
          repository: ghcr.io/onedr0p/sonarr
          tag: 4.0.11
        
        # TCP probes more reliable than HTTP
        probes:
          startup:
            enabled: true
            spec:
              tcpSocket:
                port: 8989
              failureThreshold: 60  # 5 minutes to start
              periodSeconds: 5
          
          liveness:
            enabled: true
            spec:
              tcpSocket:
                port: 8989
              periodSeconds: 30
              failureThreshold: 10  # 5 minutes before restart
          
          readiness:
            enabled: true
            spec:
              tcpSocket:
                port: 8989
              periodSeconds: 10
        
        # Prevent memory leaks from killing the node
        resources:
          requests:
            cpu: 200m
            memory: 512Mi
          limits:
            cpu: "2"
            memory: 2Gi

service:
  app:
    controller: sonarr
    ports:
      http:
        port: 8989

ingress:
  app:
    className: traefik
    annotations:
      cert-manager.io/cluster-issuer: media-lan-ca
    hosts:
      - host: sonarr.media.lan
        paths:
          - path: /
            service:
              identifier: app
              port: http
    tls:
      - hosts:
          - sonarr.media.lan
        secretName: sonarr-tls

persistence:
  config:
    type: persistentVolumeClaim
    accessMode: ReadWriteOnce
    size: 2Gi
    storageClass: local-path  # Fast local storage
    globalMounts:
      - path: /config
  
  # Cache posters locally instead of hitting NFS
  mediacover:
    type: emptyDir
    globalMounts:
      - path: /config/MediaCover
  
  # Shared media storage
  data:
    type: persistentVolumeClaim
    existingClaim: media-data
    globalMounts:
      - path: /data

ℹ️ Info
The bjw-s/app-template chart is generic. Learn it once, deploy anything. I use the same pattern for all eight apps. Consistency means less context switching when something breaks at 2am.
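
If you haven't added the chart repo yet, it's one command. The repo URL below is the one I believe is current for app-template; verify it against the chart's README before relying on it:

helm repo add bjw-s https://bjw-s.github.io/helm-charts
helm repo update
helm upgrade --install sonarr bjw-s/app-template -n media -f apps/sonarr/values.yaml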

Key Insights

1. TCP probes over HTTP

I wasted an afternoon debugging why Sonarr kept restarting during updates. The HTTP probe was hitting the endpoint before the app was actually ready, probe failed, Kubernetes killed it, restart loop.

TCP probes just check if the port is open. Much more reliable for these apps:

tcpSocket:
  port: 8989
failureThreshold: 60  # Give it time. These apps are slow to start.

Set that failureThreshold high. These aren’t cloud-native apps. They take their time.

2. EmptyDir for caches

Sonarr was downloading the same poster images over NFS every time it loaded a page. Hundreds of images, every page load. My NAS was getting hammered.

Mount an emptyDir over the cache directory. Posters get cached on the node’s local disk:

mediacover:
  type: emptyDir
  globalMounts:
    - path: /config/MediaCover

This single change cut NFS I/O by 60%. Your NAS will thank you.

3. Resource limits matter

“Surely Plex won’t eat ALL the RAM…” - Famous last words before your node dies

Plex killed my node. Twice. Without limits, four simultaneous 4K transcodes consumed every byte of RAM until the kernel OOM killer started taking down random system processes to free memory.

Set limits. Force Plex to OOM the pod before it affects your node:

resources:
  limits:
    memory: 4Gi  # Pod dies. Node survives.

Don’t guess at limits. Run your workload for a week, then check actual usage:

kubectl top pods -n media

Set limits 20-30% above your peaks. Too tight and legitimate spikes OOM the pod. Too loose and you’re back where you started.
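
Once the monitoring stack described later is running, Prometheus can answer the "what were my peaks" question directly. Two queries to paste into the Prometheus or Grafana Explore UI; these use the standard cAdvisor metrics that kube-prometheus-stack scrapes by default:

# Peak working-set memory per pod over the last 7 days
max_over_time(container_memory_working_set_bytes{namespace="media", container!=""}[7d])

# Peak CPU usage (in cores, 5m-averaged) per pod over the last 7 days
max_over_time(rate(container_cpu_usage_seconds_total{namespace="media", container!=""}[5m])[7d:5m])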


Plex Configuration

Plex needs a LoadBalancer IP for direct access (DLNA, clients):

service:
  app:
    controller: plex
    type: LoadBalancer  # MetalLB assigns an IP
    loadBalancerIP: 192.168.2.245
    ports:
      http:
        port: 32400

persistence:
  config:
    storageClass: local-path  # Metadata DB needs fast I/O
    size: 25Gi
  
  data:
    existingClaim: media-data  # Shared NFS
    globalMounts:
      - path: /data/media
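
The loadBalancerIP above only works if MetalLB has an address pool covering it. A minimal sketch, assuming MetalLB v0.13+ CRDs and that 192.168.2.240-250 is outside your DHCP range:

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: media-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.2.240-192.168.2.250  # must include 192.168.2.245 used by Plex
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: media-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - media-pool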

Hardlinks require a shared filesystem:

This matters more than you think. When Sonarr “moves” a completed download into your library, you want that move to be instant. If downloads and media are on different filesystems, it actually copies the file. That means:

  • Your 50GB 4K movie takes 5 minutes to “move”
  • You need 50GB of free space you didn’t need before
  • It’s hitting NFS with sustained sequential writes

Hardlinks solve this. Same NFS volume, different mount paths:

# Same NFS PVC mounted at /data in all apps
volumes:
  - name: media-data
    nfs:
      server: 192.168.2.129
      path: /volume1/nfs01/data

# Directory structure on NFS:
/data/
  media/
    tv/      # Sonarr final location
    movies/  # Radarr final location
  downloads/
    complete/  # SABnzbd output

Now Sonarr “moves” a 50GB file in under a second. It’s just updating the directory entry. Same file, new path. No copy.
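
A quick way to prove it before trusting Sonarr with a 50GB file: create a hardlink across the two paths by hand (hypothetical test file; ln fails with "Invalid cross-device link" if they're on different filesystems):

kubectl exec -n media deploy/sonarr -- sh -c '
  touch /data/downloads/complete/.linktest &&
  ln /data/downloads/complete/.linktest /data/media/tv/.linktest &&
  echo "hardlinks OK" &&
  rm /data/downloads/complete/.linktest /data/media/tv/.linktest'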


SABnzbd Setup

Switched from qBittorrent after dealing with too many dead torrents and ratio requirements. Usenet is faster, more reliable, and doesn’t care about seeding.

persistence:
  config:
    storageClass: local-path  # Fast DB
    size: 1Gi
  
  incomplete-downloads:
    type: emptyDir  # Temp downloads on fast local disk
    sizeLimit: 100Gi
  
  data:
    existingClaim: media-data  # Final destination on NFS
    globalMounts:
      - path: /data

Configure in Web UI:

  1. Add Newshosting server (Settings → Servers)
  2. Set categories (Settings → Categories):
    • tv → /data/downloads/complete/tv
    • movies → /data/downloads/complete/movies
  3. Enable API (Settings → General)

Integrate with Prowlarr:

Prowlarr is the indexer manager. Add your indexers once there, and it pushes them to Sonarr/Radarr via API. This beats manually adding the same 5 indexers to three different apps.

In Prowlarr: Settings → Apps → Add Application → Pick Sonarr or Radarr. It autodiscovers via DNS if they’re in the same namespace. If not, use service DNS names: http://sonarr.media.svc.cluster.local:8989

💡 Tip
Grab API keys from Settings → General in each app. You’ll need them for cross-app integrations. Store them somewhere - you’ll use them again when you rebuild this in six months.

Monitoring Stack

“If you can’t measure it, you can’t improve it. If you can’t observe it, you can’t fix it.” - Operations Mantra

You need monitoring. Not because it’s best practice - because you need to know when Plex is approaching its memory limit before the pod OOMs at 10pm on a Friday while your family is watching a movie.

Deploy kube-prometheus-stack:

helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts

helm install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n media \
  -f apps/monitoring/values.yaml

Key Configuration

# apps/monitoring/values.yaml
grafana:
  persistence:
    storageClass: local-path  # SQLite needs fast I/O
    size: 5Gi
  
  ingress:
    enabled: true
    className: traefik
    hosts:
      - grafana.media.lan
    tls:
      - hosts:
          - grafana.media.lan
  
  defaultDashboardsEnabled: true  # Pre-built K8s dashboards

prometheus:
  prometheusSpec:
    retention: 7d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-appdata  # Historical data on NFS
          resources:
            requests:
              storage: 10Gi

alertmanager:
  config:
    global:
      smtp_from: 'your-email@gmail.com'
      smtp_smarthost: 'smtp.gmail.com:587'
      smtp_auth_username: 'your-email@gmail.com'
      smtp_auth_password: 'your-gmail-app-password'
      smtp_require_tls: true
    
    # Alertmanager needs a top-level route; without one the config is rejected
    route:
      receiver: 'email'

    receivers:
      - name: 'email'
        email_configs:
          - to: 'your-email@gmail.com'
            send_resolved: true

💡 Tip
Gmail requires an app password for SMTP. Generate one at https://myaccount.google.com/apppasswords. Regular password won’t work. Don’t ask me how I know.

Custom Dashboard

Auto-imported via ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-media-stack
  namespace: media
  labels:
    grafana_dashboard: "1"  # Grafana sidecar imports this
data:
  media-stack-overview.json: |
    { ... dashboard JSON ... }

Panels:

  • CPU/Memory usage by app
  • Pod status (visual health check)
  • Network I/O (Plex streaming + downloads)
  • Container restarts
  • Storage usage

Access at: https://grafana.media.lan
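
The chart generates an admin password and stores it in a Secret named after the release; with the release name used here that should be kube-prometheus-stack-grafana (adjust if yours differs):

# Retrieve the generated Grafana admin password
kubectl get secret -n media kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo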

Alert Rules

# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: media-stack-alerts
  namespace: media
spec:
  groups:
    - name: media-stack
      rules:
        - alert: MediaAppPodDown
          expr: kube_pod_status_phase{namespace="media", phase="Running", pod=~"plex.*|sonarr.*|radarr.*"} == 0
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "{{ $labels.pod }} is down"
        
        - alert: MediaStorageFull
          expr: 100 * kubelet_volume_stats_used_bytes{persistentvolumeclaim="media-data"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="media-data"} > 85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Media storage {{ $value | humanize }}% full"

ℹ️ Info
You’ll get two emails per alert: one when it fires, one when it resolves. Don’t disable the resolved notifications - knowing that things fixed themselves is useful information.
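
It's worth firing one throwaway alert to confirm the SMTP path delivers before you need it for real. A sketch against Alertmanager's v2 API, assuming it's reachable on localhost:9093 via port-forward:

kubectl port-forward -n media svc/kube-prometheus-stack-alertmanager 9093:9093 &

curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestEmail","severity":"warning"},
        "annotations":{"summary":"Test alert - ignore"}}]'
# A firing email should arrive shortly; a resolved one follows once the alert times out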

Homepage Dashboard

Unified view with live widgets:

# dashboard/homepage.yaml
services:
  - Media:
      - Plex:
          href: http://192.168.2.245:32400/web
      - Overseerr:
          widget:
            type: overseerr
            url: http://overseerr.media.svc.cluster.local:5055
            key: <api-key>
  
  - Downloads:
      - Sonarr:
          widget:
            type: sonarr
            url: http://sonarr.media.svc.cluster.local:8989
            key: <api-key>
      - Radarr:
          widget:
            type: radarr
            url: http://radarr.media.svc.cluster.local:7878
            key: <api-key>
      - SABnzbd:
          widget:
            type: sabnzbd
            url: http://sabnzbd.media.svc.cluster.local:8080
            key: <api-key>
  
  - Monitoring:
      - Grafana:
          widget:
            type: grafana
            url: http://kube-prometheus-stack-grafana.media.svc.cluster.local

Access at: https://home.media.lan

💡 Tip
Grab API keys from Settings → General → API Key in each app. The widgets show real-time queue sizes and disk usage. Way better than clicking through eight different URLs to check on things.

Deployment

# deploy.sh
#!/bin/bash
set -e

NAMESPACE="media"
export KUBECONFIG="/path/to/kubeconfig"  # export so kubectl and helm pick it up

# Foundation
kubectl apply -f foundation/namespace.yaml

# Storage classes + shared media PVC
kubectl apply -f storage/

# Apps (order matters)
for app in plex prowlarr sonarr radarr sabnzbd overseerr bazarr tautulli; do
  helm upgrade --install $app bjw-s/app-template \
    -n $NAMESPACE -f apps/$app/values.yaml
done

# Monitoring
helm upgrade --install kube-prometheus-stack \
  prometheus-community/kube-prometheus-stack \
  -n $NAMESPACE -f apps/monitoring/values.yaml

# Dashboard
kubectl apply -f dashboard/homepage.yaml

echo "✓ Media stack deployed"

💡 Tip
Deploy Prowlarr first. Add your indexers once there, and it pushes them to Sonarr/Radarr via API. Do it in any other order and you’ll be manually configuring the same indexers three times.

Troubleshooting

“The best time to test your backups is before you need them. The second best time is now.” - Disaster Recovery 101

Sonarr/Radarr Slow

Symptom: 30+ second page loads. Console shows SQLite locking warnings.

Root cause: SQLite over NFS. This is a known bad combination. The apps do thousands of small random reads. NFS adds 5-10ms to each one.

Fix: Move config to local-path storage

# 1. Backup
kubectl exec -n media deploy/sonarr -- tar czf /tmp/backup.tar.gz -C /config .
kubectl cp media/sonarr-xxx:/tmp/backup.tar.gz /tmp/

# 2. Delete and recreate with local-path
helm uninstall sonarr -n media
kubectl delete pvc sonarr -n media

# Edit values.yaml: storageClass: local-path
helm install sonarr bjw-s/app-template -n media -f apps/sonarr/values.yaml

# 3. Restore
kubectl cp /tmp/backup.tar.gz media/sonarr-xxx:/tmp/
kubectl exec -n media deploy/sonarr -- tar xzf /tmp/backup.tar.gz -C /config
kubectl rollout restart -n media deploy/sonarr

Prometheus Permission Errors

Symptom: open /prometheus/queries.active: permission denied

Root cause: Talos Linux runs a tight security model. Prometheus wants to run as a specific UID that doesn’t have permissions on local-path volumes.

Fix: Use NFS storage instead of local-path

prometheus:
  prometheusSpec:
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: nfs-appdata  # Not local-path

Yes, this is backwards from what I told you about databases. Prometheus is the exception. Its workload pattern is fine on NFS, and NFS has looser permissions. Annoying, but it works.

Plex Not Updating Metadata

Fix: Trigger refresh via API

# Get token
PLEX_TOKEN=$(kubectl exec -n media deploy/plex -- \
  cat "/config/Library/Application Support/Plex Media Server/Preferences.xml" \
  | grep -oP 'PlexOnlineToken="\K[^"]+')

# Refresh library
curl -X PUT "http://192.168.2.245:32400/library/sections/1/refresh?X-Plex-Token=$PLEX_TOKEN"

Check Alert Status

# View active alerts
kubectl port-forward -n media svc/kube-prometheus-stack-alertmanager 9093:9093
# Visit http://localhost:9093

# Check Prometheus rules
kubectl port-forward -n media svc/kube-prometheus-stack-prometheus 9090:9090
# Visit http://localhost:9090/alerts

💡 Tip
Before you close that SSH session, load each app in a browser and make sure it works. I’ve deployed, disconnected, and then realized I fat-fingered an environment variable. Remote troubleshooting takes 3x longer than fixing it while you’re still connected.
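
A scripted version of that check. Hostnames for apps whose Ingress isn't shown above are assumed to follow the same *.media.lan pattern; -k because the CA is self-signed and probably not in your local trust store:

for host in sonarr radarr prowlarr sabnzbd overseerr bazarr tautulli home grafana; do
  code=$(curl -ks -o /dev/null -w '%{http_code}' "https://${host}.media.lan/")
  echo "${host}.media.lan -> HTTP ${code}"
done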

Lessons Learned (The Hard Way)

1. SQLite and NFS Don’t Mix

30 second page loads in Sonarr. Constant database lock errors. Moved to local SSD, problem gone. This isn’t a performance optimization, it’s a requirement. Don’t even try NFS for SQLite.

2. Cache Everything You Can Locally

Sonarr was downloading the same 500 poster images every page load. emptyDir cache cut NFS I/O by 60%. Your NAS has better things to do than serve the same cached image 100 times a day.

3. Set Resource Limits Before Production

Plex killed my node twice before I added limits. Don’t learn this lesson yourself. Set limits from day one based on your expected workload, not your available resources.

4. TCP Probes Save Debugging Time

Spent an afternoon figuring out why Sonarr restart-looped during updates. HTTP health checks were failing before the app was ready. Switched to TCP probes, problem disappeared. These apps are slow to start - check if the port is open, that’s enough.


What’s Next

“It works on my cluster.” - Now you can say this unironically

You now have a media stack that you can destroy and rebuild from Git in under 20 minutes. I’ve done it three times. Twice for hardware upgrades, once because I wanted to test Talos Linux.

That’s the point. Infrastructure as code isn’t about being clever. It’s about having a bad day and knowing you can recover.

What I haven’t covered:

  • Backups - Config is in Git. Media files are replaceable (you have the NZBs, right?). But back up your Plex watch history if you care about it.
  • External access - Tailscale is the easy answer. Don’t expose Plex directly to the internet without Cloudflare in front.
  • GPU transcoding - Plex with Intel Quick Sync is worth the effort if you have multiple remote users.
  • Scaling - You don’t need it. These apps are single-instance by design. Horizontal scaling doesn’t help here.

References