A complete automated media stack on Kubernetes: streaming, downloads, requests, and monitoring. TLS everywhere, SQLite on fast storage, NFS for media files.
“If it’s not reproducible, it doesn’t exist.” - Everyone rebuilding from scratch
Why Kubernetes for Media?
I ran this on Docker Compose for two years. It worked great until I had to rebuild the host. Spent a weekend trying to remember which directories went where, what ports I’d used, and which environment variables actually mattered.
Kubernetes fixes this:
- Declarative config - Lose the entire cluster? git clone && kubectl apply. You're back up in 15 minutes.
- Health probes - Apps crash. Kubernetes restarts them. You sleep through it.
- Resource limits - Plex will eat all your RAM if you let it. Don’t let it.
- Observability - Know what broke before your family complains.
This isn’t theoretical. I’ve rebuilt this stack from scratch three times. Each time took less effort because everything’s in Git.
Full source: k8s-media-stack
The Stack
| App | Purpose |
|---|---|
| Plex | Media server for streaming |
| Sonarr | TV show automation |
| Radarr | Movie automation |
| SABnzbd | Usenet downloads |
| Overseerr | User request management |
| Prowlarr | Indexer management |
| Bazarr | Subtitle downloads |
| Tautulli | Plex statistics |
| Homepage | Unified dashboard |
Infrastructure:
| Component | Purpose |
|---|---|
| Traefik | Ingress + TLS termination |
| cert-manager | Automated cert issuance |
| MetalLB | LoadBalancer for bare metal |
| NFS CSI Driver | Network storage |
| local-path | Fast local storage |
Monitoring:
| Component | Purpose |
|---|---|
| Grafana | Dashboards and visualization |
| Prometheus | Metrics collection |
| Alertmanager | Email notifications |
Storage Strategy
“I’ll just put SQLite on NFS, what could go wrong?” - Everyone, once
The most important decision: where to put what.
I made this mistake so you don’t have to: I put everything on NFS. Seemed elegant - one place for all the data. Sonarr took 45 seconds to load a page. Radarr was worse. The apps work by hammering SQLite with thousands of small reads. NFS is terrible at this.
Here’s what actually works:
Media files (movies/TV) → NFS
- Large, sequential I/O
- Shared across apps
- Network latency doesn't matter
App configs (SQLite DBs) → local-path
- Small, random I/O
- Performance critical
- Each app pinned to one node
After moving databases to local SSDs, page loads dropped from 45s to 2s. This isn’t a nice-to-have optimization. It’s the difference between usable and unusable.
Not sure how big a config really is? Check: kubectl exec -n media deploy/sonarr -- du -sh /config. If it's under 2GB (and these apps are usually under 500MB), put it on local storage. The performance difference is night and day.

Foundation Setup
“Security is not something you add at the end. It’s something you build in from the start.” - Every security incident postmortem
TLS Certificates
Using cert-manager with a self-signed CA for *.media.lan:
# Generate CA
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 \
-nodes -keyout ca.key -out ca.crt \
-subj "/CN=Media Stack CA" \
-addext "subjectAltName=DNS:*.media.lan"
# Create secret
kubectl create secret tls media-lan-ca \
--cert=ca.crt --key=ca.key -n cert-manager
# ClusterIssuer
kubectl apply -f - <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
name: media-lan-ca
spec:
ca:
secretName: media-lan-ca
EOF
cert-manager now issues certs automatically for any Ingress in the media namespace. Add the annotation, get a cert.
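One step the manifests can't do for you: your devices have to trust that CA, or every app greets you with a certificate warning. On a Debian/Ubuntu-style Linux client it's roughly the following (path and tool vary by OS, and Firefox keeps its own certificate store):
# Trust the media-lan CA on a client machine
sudo cp ca.crt /usr/local/share/ca-certificates/media-lan-ca.crt
sudo update-ca-certificates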
Storage Classes
# NFS storage class
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: nfs-appdata
provisioner: nfs.csi.k8s.io
parameters:
server: 192.168.2.129
share: /volume1/nfs01/data
mountOptions:
- vers=3
- soft
- intr
---
# Local storage (Rancher local-path-provisioner)
# Already installed - just use storageClassName: local-path
The soft,intr mount options mean NFS timeouts return errors instead of hanging forever. Without them, when your NAS reboots, every pod with an NFS mount hangs indefinitely. You can't even kill them. Learn from my pain: use soft,intr.
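The shared media volume that every app mounts later (existingClaim: media-data) is just a claim against this class. A minimal sketch, assuming the nfs-appdata class above and the media namespace - the actual manifest lives in the repo:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: media-data
  namespace: media
spec:
  accessModes:
    - ReadWriteMany        # shared by Plex, Sonarr, Radarr, SABnzbd
  storageClassName: nfs-appdata
  resources:
    requests:
      storage: 1Ti         # illustrative; size it to your library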
App Configuration Pattern

Every app uses bjw-s/app-template for consistency. Here's the pattern:
# Example: Sonarr
controllers:
sonarr:
containers:
app:
image:
repository: ghcr.io/onedr0p/sonarr
tag: 4.0.11
# TCP probes more reliable than HTTP
probes:
startup:
enabled: true
spec:
tcpSocket:
port: 8989
failureThreshold: 60 # 5 minutes to start
periodSeconds: 5
liveness:
enabled: true
spec:
tcpSocket:
port: 8989
periodSeconds: 30
failureThreshold: 10 # 5 minutes before restart
readiness:
enabled: true
spec:
tcpSocket:
port: 8989
periodSeconds: 10
# Prevent memory leaks from killing the node
resources:
requests:
cpu: 200m
memory: 512Mi
limits:
cpu: "2"
memory: 2Gi
service:
app:
controller: sonarr
ports:
http:
port: 8989
ingress:
app:
className: traefik
annotations:
cert-manager.io/cluster-issuer: media-lan-ca
hosts:
- host: sonarr.media.lan
paths:
- path: /
service:
identifier: app
port: http
tls:
- hosts:
- sonarr.media.lan
secretName: sonarr-tls
persistence:
config:
type: persistentVolumeClaim
accessMode: ReadWriteOnce
size: 2Gi
storageClass: local-path # Fast local storage
globalMounts:
- path: /config
# Cache posters locally instead of hitting NFS
mediacover:
type: emptyDir
globalMounts:
- path: /config/MediaCover
# Shared media storage
data:
type: persistentVolumeClaim
existingClaim: media-data
globalMounts:
- path: /data
The bjw-s/app-template chart is generic. Learn it once, deploy anything. I use the same pattern for all eight apps. Consistency means less context switching when something breaks at 2am.

Key Insights
1. TCP probes over HTTP
I wasted an afternoon debugging why Sonarr kept restarting during updates. The HTTP probe was hitting the endpoint before the app was actually ready, probe failed, Kubernetes killed it, restart loop.
TCP probes just check if the port is open. Much more reliable for these apps:
tcpSocket:
port: 8989
failureThreshold: 60 # Give it time. These apps are slow to start.
Set that failureThreshold high. These aren’t cloud-native apps. They take their time.
2. EmptyDir for caches
Sonarr was downloading the same poster images over NFS every time it loaded a page. Hundreds of images, every page load. My NAS was getting hammered.
Mount an emptyDir over the cache directory. Posters get cached on the node’s local disk:
mediacover:
type: emptyDir
globalMounts:
- path: /config/MediaCover
This single change cut NFS I/O by 60%. Your NAS will thank you.
3. Resource limits matter
“Surely Plex won’t eat ALL the RAM…” - Famous last words before your node dies
Plex killed my node. Twice. Without limits, four simultaneous 4K transcodes consumed every byte of RAM until the kernel OOM killer started taking down random system processes to free memory.
Set limits. Force Plex to OOM the pod before it affects your node:
resources:
limits:
memory: 4Gi # Pod dies. Node survives.
Don’t guess at limits. Run your workload for a week, then check actual usage:
kubectl top pods -n media
Set limits 20-30% above your peaks. Too tight and legitimate spikes OOM the pod. Too loose and you’re back where you started.
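Once the monitoring stack below is running, you can read the actual weekly peak instead of eyeballing a snapshot. A quick PromQL sketch using the standard cAdvisor metric (run it in Prometheus or Grafana's Explore view):
# Peak working-set memory per pod over the last 7 days
max by (pod) (
  max_over_time(container_memory_working_set_bytes{namespace="media", container!=""}[7d])
)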
Plex Configuration
Plex needs a LoadBalancer IP for direct access (DLNA, clients):
service:
app:
controller: plex
type: LoadBalancer # MetalLB assigns an IP
loadBalancerIP: 192.168.2.245
ports:
http:
port: 32400
persistence:
config:
storageClass: local-path # Metadata DB needs fast I/O
size: 25Gi
data:
existingClaim: media-data # Shared NFS
globalMounts:
- path: /data/media
Hardlinks require a shared filesystem:
This matters more than you think. When Sonarr “moves” a completed download to your library, it needs to be instant. If downloads and media are on different filesystems, it actually copies the file. That means:
- Your 50GB 4K movie takes 5 minutes to “move”
- You need 50GB of free space you didn’t need before
- It’s hitting NFS with sustained sequential writes
Hardlinks solve this. Same NFS volume, different mount paths:
# Same NFS PVC mounted at /data in all apps
volumes:
- name: media-data
nfs:
server: 192.168.2.129
path: /volume1/nfs01/data
# Directory structure on NFS:
/data/
media/
tv/ # Sonarr final location
movies/ # Radarr final location
downloads/
complete/ # SABnzbd output
Now Sonarr “moves” a 50GB file in under a second. It’s just updating the directory entry. Same file, new path. No copy.
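Worth verifying before you trust it: hardlinks only work if both paths really are the same filesystem inside the pod. A quick check, assuming the directory layout above and basic busybox/coreutils tools in the image - stat should print a link count of 2:
kubectl exec -n media deploy/sonarr -- sh -c '
  touch /data/downloads/complete/.hardlink-test &&
  ln /data/downloads/complete/.hardlink-test /data/media/.hardlink-test &&
  stat -c %h /data/media/.hardlink-test &&
  rm /data/downloads/complete/.hardlink-test /data/media/.hardlink-test'
Sonarr and Radarr also need "Use Hardlinks instead of Copy" enabled under Settings → Media Management (it's the default, but worth confirming).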
SABnzbd Setup
Switched from qBittorrent after dealing with too many dead torrents and ratio requirements. Usenet is faster, more reliable, and doesn’t care about seeding.
persistence:
config:
storageClass: local-path # Fast DB
size: 1Gi
incomplete-downloads:
type: emptyDir # Temp downloads on fast local disk
sizeLimit: 100Gi
data:
existingClaim: media-data # Final destination on NFS
globalMounts:
- path: /data
Configure in Web UI:
- Add Newshosting server (Settings → Servers)
- Set categories (Settings → Categories):
  - tv → /data/downloads/complete/tv
  - movies → /data/downloads/complete/movies
- Enable API (Settings → General)
Integrate with Prowlarr:
Prowlarr is the indexer manager. Add your indexers once there, and it pushes them to Sonarr/Radarr via API. This beats manually adding the same 5 indexers to three different apps.
In Prowlarr: Settings → Apps → Add Application → pick Sonarr or Radarr, then point it at the in-cluster service DNS name: http://sonarr.media.svc.cluster.local:8989
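If Prowlarr's connection test fails, it's almost always DNS or the wrong port. A quick check from inside the cluster, assuming curl is available in the image (/ping is the *arr apps' unauthenticated health endpoint):
# Confirm Sonarr is reachable by its in-cluster DNS name
kubectl exec -n media deploy/prowlarr -- curl -fsS http://sonarr.media.svc.cluster.local:8989/ping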
Monitoring Stack
“If you can’t measure it, you can’t improve it. If you can’t observe it, you can’t fix it.” - Operations Mantra
You need monitoring. Not because it’s best practice - because you need to know when Plex is approaching its memory limit before the pod OOMs at 10pm on a Friday while your family is watching a movie.
Deploy kube-prometheus-stack:
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
helm install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
-n media \
-f apps/monitoring/values.yaml
Key Configuration
# apps/monitoring/values.yaml
grafana:
persistence:
storageClass: local-path # SQLite needs fast I/O
size: 5Gi
ingress:
enabled: true
className: traefik
hosts:
- grafana.media.lan
tls:
- hosts:
- grafana.media.lan
defaultDashboardsEnabled: true # Pre-built K8s dashboards
prometheus:
prometheusSpec:
retention: 7d
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: nfs-appdata # Historical data on NFS
resources:
requests:
storage: 10Gi
alertmanager:
config:
global:
smtp_from: 'your-email@gmail.com'
smtp_smarthost: 'smtp.gmail.com:587'
smtp_auth_username: 'your-email@gmail.com'
smtp_auth_password: 'your-gmail-app-password'
smtp_require_tls: true
receivers:
- name: 'email'
email_configs:
- to: 'your-email@gmail.com'
send_resolved: true
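One gotcha: defining a receiver isn't enough. The chart's default route sends everything to a 'null' receiver, so you also have to point the route tree at 'email'. A minimal sketch that sits next to receivers under alertmanager.config (grouping and timing values are just sane starting points):
route:
  receiver: 'email'
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h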
Custom Dashboard
Auto-imported via ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-media-stack
namespace: media
labels:
grafana_dashboard: "1" # Grafana sidecar imports this
data:
media-stack-overview.json: |
{ ... dashboard JSON ... }
Panels:
- CPU/Memory usage by app
- Pod status (visual health check)
- Network I/O (Plex streaming + downloads)
- Container restarts
- Storage usage
Access at: https://grafana.media.lan
Alert Rules
# monitoring/prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: media-stack-alerts
namespace: media
spec:
groups:
- name: media-stack
rules:
- alert: MediaAppPodDown
expr: kube_pod_status_phase{namespace="media", phase="Running", pod=~"plex.*|sonarr.*|radarr.*"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "{{ $labels.pod }} is down"
- alert: MediaStorageFull
expr: 100 * kubelet_volume_stats_used_bytes{persistentvolumeclaim="media-data"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim="media-data"} > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Media storage {{ $value | humanize }}% full"
Homepage Dashboard
Unified view with live widgets:
# dashboard/homepage.yaml
services:
- Media:
- Plex:
href: http://192.168.2.245:32400/web
- Overseerr:
widget:
type: overseerr
url: http://overseerr.media.svc.cluster.local:5055
key: <api-key>
- Downloads:
- Sonarr:
widget:
type: sonarr
url: http://sonarr.media.svc.cluster.local:8989
key: <api-key>
- Radarr:
widget:
type: radarr
url: http://radarr.media.svc.cluster.local:7878
key: <api-key>
- SABnzbd:
widget:
type: sabnzbd
url: http://sabnzbd.media.svc.cluster.local:8080
key: <api-key>
- Monitoring:
- Grafana:
widget:
type: grafana
url: http://kube-prometheus-stack-grafana.media.svc.cluster.local
Access at: https://home.media.lan
Deployment
#!/bin/bash
# deploy.sh
set -e
NAMESPACE="media"
export KUBECONFIG="/path/to/kubeconfig"
# Foundation
kubectl apply -f foundation/namespace.yaml
kubectl apply -f storage/
# Storage
kubectl apply -f storage/media-pvc.yaml
# Apps (order matters)
for app in plex prowlarr sonarr radarr sabnzbd overseerr bazarr tautulli; do
helm upgrade --install $app bjw-s/app-template \
-n $NAMESPACE -f apps/$app/values.yaml
done
# Monitoring
helm upgrade --install kube-prometheus-stack \
prometheus-community/kube-prometheus-stack \
-n $NAMESPACE -f apps/monitoring/values.yaml
# Dashboard
kubectl apply -f dashboard/homepage.yaml
echo "✓ Media stack deployed"
Troubleshooting
“The best time to test your backups is before you need them. The second best time is now.” - Disaster Recovery 101
Sonarr/Radarr Slow
Symptom: 30+ second page loads. Console shows SQLite locking warnings.
Root cause: SQLite over NFS. This is a known bad combination. The apps do thousands of small random reads. NFS adds 5-10ms to each one.
Fix: Move config to local-path storage
# 1. Backup
kubectl exec -n media deploy/sonarr -- tar czf /tmp/backup.tar.gz -C /config .
kubectl cp media/sonarr-xxx:/tmp/backup.tar.gz /tmp/
# 2. Delete and recreate with local-path
helm uninstall sonarr -n media
kubectl delete pvc sonarr -n media
# Edit values.yaml: storageClass: local-path
helm install sonarr bjw-s/app-template -n media -f apps/sonarr/values.yaml
# 3. Restore
kubectl cp /tmp/backup.tar.gz media/sonarr-xxx:/tmp/
kubectl exec -n media deploy/sonarr -- tar xzf /tmp/backup.tar.gz -C /config
kubectl rollout restart -n media deploy/sonarr
Prometheus Permission Errors
Symptom: open /prometheus/queries.active: permission denied
Root cause: Talos Linux runs a tight security model. Prometheus wants to run as a specific UID that doesn’t have permissions on local-path volumes.
Fix: Use NFS storage instead of local-path
prometheus:
prometheusSpec:
storageSpec:
volumeClaimTemplate:
spec:
storageClassName: nfs-appdata # Not local-path
Yes, this is backwards from what I told you about databases. Prometheus is the exception. Its workload pattern is fine on NFS, and NFS has looser permissions. Annoying, but it works.
Plex Not Updating Metadata
Fix: Trigger refresh via API
# Get token
PLEX_TOKEN=$(kubectl exec -n media deploy/plex -- \
cat "/config/Library/Application Support/Plex Media Server/Preferences.xml" \
| grep -oP 'PlexOnlineToken="\K[^"]+')
# Refresh library
curl -X PUT "http://192.168.2.245:32400/library/sections/1/refresh?X-Plex-Token=$PLEX_TOKEN"
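The section ID (1 above) is whatever Plex assigned to that library. If you're not sure, list the sections first and read the key attribute from the XML:
# List library sections and their IDs
curl "http://192.168.2.245:32400/library/sections?X-Plex-Token=$PLEX_TOKEN"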
Check Alert Status
# View active alerts
kubectl port-forward -n media svc/kube-prometheus-stack-alertmanager 9093:9093
# Visit http://localhost:9093
# Check Prometheus rules
kubectl port-forward -n media svc/kube-prometheus-stack-prometheus 9090:9090
# Visit http://localhost:9090/alerts
Lessons Learned (The Hard Way)
1. SQLite and NFS Don’t Mix
30 second page loads in Sonarr. Constant database lock errors. Moved to local SSD, problem gone. This isn’t a performance optimization, it’s a requirement. Don’t even try NFS for SQLite.
2. Cache Everything You Can Locally
Sonarr was downloading the same 500 poster images every page load. emptyDir cache cut NFS I/O by 60%. Your NAS has better things to do than serve the same cached image 100 times a day.
3. Set Resource Limits Before Production
Plex killed my node twice before I added limits. Don’t learn this lesson yourself. Set limits from day one based on your expected workload, not your available resources.
4. TCP Probes Save Debugging Time
Spent an afternoon figuring out why Sonarr restart-looped during updates. HTTP health checks were failing before the app was ready. Switched to TCP probes, problem disappeared. These apps are slow to start - check if the port is open, that’s enough.
What’s Next
“It works on my cluster.” - Now you can say this unironically
You now have a media stack that you can destroy and rebuild from Git in under 20 minutes. I’ve done it three times. Twice for hardware upgrades, once because I wanted to test Talos Linux.
That’s the point. Infrastructure as code isn’t about being clever. It’s about having a bad day and knowing you can recover.
What I haven’t covered:
- Backups - Config is in Git. Media files are replaceable (you have the NZBs, right?). But back up your Plex watch history if you care about it.
- External access - Tailscale is the easy answer. Don’t expose Plex directly to the internet without Cloudflare in front.
- GPU transcoding - Plex with Intel Quick Sync is worth the effort if you have multiple remote users.
- Scaling - You don’t need it. These apps are single-instance by design. Horizontal scaling doesn’t help here.