Cluster-level disaster recovery for Kubernetes. Back up all resources, persistent volumes, and configs. Restore an entire namespace (or the whole cluster) to a new environment in minutes.
“My cluster’s etcd is corrupted. Do I have backups?” - Questions you don’t want to ask at 2am

Why Velero
I had per-app backups (Sonarr SQLite, Radarr DB) but no cluster-level disaster recovery. One bad Helm upgrade - helm upgrade metallb with the wrong IP pool - cascaded. MetalLB stopped assigning IPs. Traefik had no external IP. Plex, Sonarr, Overseerr: all unreachable. Spent four hours manually recreating the pool, ingress rules, and re-allocating LoadBalancer IPs from Git history. “We’re going to need a bigger boat.” - Jaws. Or in your case: better backups. Per-app wasn’t enough.
Velero gives you:
- Full namespace backups - All resources (Deployments, Services, PVCs, Secrets, ConfigMaps)
- Persistent volume snapshots - Actual data, not just resource definitions
- Scheduled backups - Daily/weekly cron-like automation
- Cross-cluster restore - Rebuild on new hardware
- Selective recovery - Restore one app or the whole cluster
This complements per-app backups. Velero captures the Kubernetes layer (manifests, volumes). App backups capture internal state (databases, configs).
“Backups are worthless until you’ve restored from them. Test restores before you need them.” - Disaster recovery 101
Architecture
Velero (velero namespace)
  ↓
Scheduled Backups:
  • media-daily (all resources + PVCs)
  • cluster-weekly (full cluster state)
  ↓
Storage:
  • NFS backend (Synology: /volume1/nfs01/velero-backups)
  • Restic for volume snapshots (filesystem-level)
  ↓
Restore:
  • Same cluster (rollback bad upgrades)
  • New cluster (disaster recovery)
Deployment Repo
Full source: k8s-velero-backups on GitHub
k8s-velero-backups/
├── values.yaml # Velero Helm values
├── backup-schedules/ # CronJob-style backup definitions
│ ├── media-daily.yaml
│ └── cluster-weekly.yaml
├── deploy.sh # Automated deployment
├── restore.sh # Interactive restore script
└── verify-backup.sh # Test backup integrity
Storage Backend
Velero supports S3, GCS, Azure Blob, and filesystem targets. For homelab, NFS is simplest.
NFS Setup on Synology
SSH to your NAS and create the backup directory:
ssh youruser@192.168.1.10
sudo mkdir -p /volume1/nfs01/velero-backups
sudo chown -R nobody:nogroup /volume1/nfs01/velero-backups
sudo chmod 755 /volume1/nfs01/velero-backups
Verify NFS export in DSM: Control Panel → Shared Folder → nfs01 → Edit → NFS Permissions
Ensure your K8s subnet (192.168.1.0/24) has read/write access.
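As a sanity check, you can confirm the export is visible from one of your Kubernetes nodes before wiring up anything else (assumes showmount from nfs-common is installed on the node):
# From any K8s node: list the exports the NAS offers
showmount -e 192.168.1.10
# Expect /volume1/nfs01 in the list, accessible from 192.168.1.0/24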
Helm Values
# values.yaml
image:
  repository: velero/velero
  tag: v1.14.1

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    volumeMounts:
      - mountPath: /target
        name: plugins

configuration:
  # NFS storage via S3 API (MinIO running on NFS)
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      config:
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://minio.velero.svc.cluster.local:9000
        publicUrl: http://minio.velero.svc.cluster.local:9000
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: minio
  # Use Restic for filesystem-level PVC backups
  uploaderType: restic
  defaultVolumesToFsBackup: true
  backupSyncPeriod: 1h
  restoreOnlyMode: false

# Node agent DaemonSet mounts pod volumes so Restic can back up PVCs
deployNodeAgent: true
nodeAgent:
  podVolumePath: /var/lib/kubelet/pods
  privileged: false
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

credentials:
  useSecret: true
  existingSecret: velero-credentials

# Schedules (defined separately as CRDs)
schedules: {}

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

rbac:
  create: true

serviceAccount:
  server:
    create: true
Key decisions:
- MinIO as S3 gateway - Wraps NFS in S3 API (Velero’s native interface)
- Restic for volumes - Filesystem-level snapshots (doesn’t require CSI snapshot support)
- Node agent - Runs DaemonSet to access PVCs for backup
Velero was originally designed for cloud object storage (S3, GCS). For homelab NFS, we run MinIO as a lightweight S3-compatible shim. This keeps Velero’s API clean while storing backups on your NAS.
Deploy
1. Install MinIO (S3 Backend)
MinIO provides the S3 API that Velero expects, backed by NFS.
Create minio-values.yaml:
mode: standalone
replicas: 1

persistence:
  enabled: true
  storageClass: nfs-appdata
  size: 50Gi

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

service:
  type: ClusterIP
  port: 9000

consoleService:
  enabled: true
  port: 9001

buckets:
  - name: velero
    policy: none
    purge: false

users:
  - accessKey: velero
    secretKey: velero-secret-key
    policy: readwrite

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - minio.media.lan
  tls: []
Deploy:
helm repo add minio https://charts.min.io/
helm repo update
kubectl create namespace velero
helm upgrade --install minio minio/minio \
-n velero -f minio-values.yaml --wait
Verify:
kubectl get pods -n velero
kubectl get svc -n velero
Access MinIO console: http://minio.media.lan (user: velero, password: velero-secret-key)
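If you prefer the CLI to the console, the MinIO client (mc) can confirm the credentials and bucket; the alias name nas below is arbitrary:
# Register the MinIO endpoint, then list buckets - you should see "velero"
mc alias set nas http://minio.media.lan velero velero-secret-key
mc ls nas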
2. Create Velero Credentials Secret
cat <<EOF > credentials-velero
[default]
aws_access_key_id = velero
aws_secret_access_key = velero-secret-key
EOF
kubectl create secret generic velero-credentials \
-n velero \
--from-file=cloud=credentials-velero
rm credentials-velero
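As a quick check, confirm the secret landed with the cloud key that the AWS plugin reads:
# Decode the stored credentials file to verify its contents
kubectl get secret velero-credentials -n velero -o jsonpath='{.data.cloud}' | base64 -d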
3. Install Velero
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm upgrade --install velero vmware-tanzu/velero \
-n velero -f values.yaml --wait
Verify:
kubectl get pods -n velero
kubectl logs -n velero -l app.kubernetes.io/name=velero
You should see: "Backup storage location is valid"
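You can also ask Velero directly (requires the velero CLI installed on your workstation):
velero backup-location get
# The default location should show PHASE "Available"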
Backup Schedules
Create scheduled backups using Velero’s Schedule CRD (like CronJobs for backups).
Daily Media Namespace Backup
# backup-schedules/media-daily.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: media-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2am daily
  template:
    includedNamespaces:
      - media
    includedResources:
      - '*'
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 168h  # Keep 7 days
Apply:
kubectl apply -f backup-schedules/media-daily.yaml
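To exercise the schedule without waiting for 2am, trigger a one-off backup from it:
# Creates an immediate backup using the schedule's template
velero backup create --from-schedule media-daily --wait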
Weekly Full Cluster Backup
# backup-schedules/cluster-weekly.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3am Sunday
  template:
    includedNamespaces:
      - '*'
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
    includedResources:
      - '*'
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 720h  # Keep 30 days
Apply:
kubectl apply -f backup-schedules/cluster-weekly.yaml
Verify schedules:
velero schedule get
velero backup get
Manual Backups
Trigger an immediate backup:
# Backup entire media namespace
velero backup create media-manual \
--include-namespaces media \
--default-volumes-to-fs-backup \
--wait
# Backup single app (Sonarr)
velero backup create sonarr-manual \
--include-namespaces media \
--selector app.kubernetes.io/name=sonarr \
--default-volumes-to-fs-backup \
--wait
# Full cluster backup
velero backup create cluster-manual \
--exclude-namespaces kube-system,kube-public,kube-node-lease \
--default-volumes-to-fs-backup \
--wait
Check status:
velero backup describe media-manual
velero backup logs media-manual
Restore
Restore Entire Namespace
Scenario: Bad Helm upgrade broke the media namespace.
# 1. Delete broken namespace (optional but cleaner)
kubectl delete namespace media
# 2. Restore from latest backup
velero restore create media-restore-$(date +%s) \
--from-backup media-daily-20260208020000 \
--wait
# 3. Verify
kubectl get all -n media
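If pods don’t come back as expected, inspect the restore object itself before digging into the apps (substitute the actual restore name printed when you created it):
velero restore get
velero restore describe media-restore-<timestamp>
velero restore logs media-restore-<timestamp> | grep -i error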
Restore Single App
Scenario: Sonarr’s database corrupted. Restore just Sonarr.
# 1. Scale down Sonarr
kubectl scale -n media deploy/sonarr --replicas=0
# 2. Restore Sonarr resources
velero restore create sonarr-restore-$(date +%s) \
--from-backup media-daily-20260208020000 \
--include-resources deployment,service,ingress,pvc,secret,configmap \
--selector app.kubernetes.io/name=sonarr \
--wait
# 3. Verify
kubectl get pods -n media -l app.kubernetes.io/name=sonarr
Disaster Recovery (New Cluster)
Scenario: Entire cluster lost (hardware failure, etcd corruption).
- Build new cluster - Use k8s-deploy Terraform repo
- Install foundation - MetalLB, Traefik, NFS CSI, Velero (same as original)
- Point Velero at existing backups:
# MinIO already deployed with same NFS backend
# Velero sees existing backups automatically
velero backup get
- Restore cluster state:
velero restore create full-restore-$(date +%s) \
--from-backup cluster-weekly-20260202030000 \
--wait
- Verify all namespaces:
kubectl get namespaces
kubectl get all -n media
kubectl get all -n monitoring
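Persistent volume data comes back through the node agent and can lag behind the API objects. A quick check before declaring victory (the per-volume section name in the describe output varies slightly between Velero versions):
# Overall result plus any warnings/errors
velero restore describe full-restore-<timestamp> | grep -iE "phase|warnings|errors"
# Per-PVC detail (section is "Restic Restores" or "Pod Volume Restores" depending on version)
velero restore describe full-restore-<timestamp> --details | grep -iA 20 -E "restic restores|pod volume restores"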
Backup Verification
“Hope is not a restore strategy. Verify before the incident.” - Every SRE who learned the hard way
Don’t trust backups you haven’t tested. The repo includes verify-backup.sh:
#!/bin/bash
set -euo pipefail
BACKUP_NAME="${1:-}"
if [[ -z "$BACKUP_NAME" ]]; then
  echo "Usage: $0 <backup-name>"
  echo ""
  echo "Available backups:"
  velero backup get
  exit 1
fi

echo "Verifying backup: $BACKUP_NAME"

# Check backup completed successfully
STATUS=$(velero backup describe "$BACKUP_NAME" --details | grep -i phase | awk '{print $2}')
if [[ "$STATUS" != "Completed" ]]; then
  echo "❌ Backup status: $STATUS"
  exit 1
fi
echo "✅ Backup status: Completed"

# Check for errors
ERRORS=$(velero backup describe "$BACKUP_NAME" --details | grep -i errors | awk '{print $2}')
if [[ "$ERRORS" != "0" ]]; then
  echo "⚠️ Backup has $ERRORS errors:"
  velero backup logs "$BACKUP_NAME" | grep -i error
  exit 1
fi
echo "✅ No errors"

# Verify volumes backed up
VOLUMES=$(velero backup describe "$BACKUP_NAME" --details | grep -A 20 "Restic Backups" | grep "Completed: " | awk '{print $2}')
echo "✅ Volumes backed up: $VOLUMES"

# Check backup size in MinIO
echo ""
echo "Backup stored in MinIO (velero bucket)"
Run monthly:
./verify-backup.sh media-daily-20260208020000
Troubleshooting
Backup Stuck in Progress
Symptom: Backup never completes.
velero backup describe <backup-name>
Common causes:
- Restic timeout - Large volumes take time. Increase the timeout in values.yaml:
# values.yaml
configuration:
  fsBackupTimeout: 4h  # Default 1h
- PVC not found - Velero can’t access the PVC. Check the node agent pods:
kubectl get pods -n velero -l name=node-agent
kubectl logs -n velero -l name=node-agent
Restore Fails with “Already Exists”
Symptom: velero restore fails because resources already exist.
Fix: Delete the conflicting resources and retry, or run the restore with --existing-resource-policy=update so Velero updates resources that already exist.
kubectl delete namespace media
velero restore create media-restore-$(date +%s) --from-backup <backup> --wait
MinIO Connection Refused
Symptom: Velero logs show connection refused to MinIO.
Check:
kubectl get svc -n velero minio
kubectl logs -n velero -l app.kubernetes.io/name=velero | grep -i minio
Fix: Verify MinIO service is running and accessible:
kubectl exec -n velero deploy/velero -- wget -O- http://minio.velero.svc.cluster.local:9000
Backup Storage Location Unavailable
Symptom: velero backup-location get shows Unavailable.
Check:
velero backup-location describe default
Common causes:
- Wrong credentials - Verify the velero-credentials secret
- MinIO not running - Check kubectl get pods -n velero
- Bucket doesn’t exist - Create the velero bucket in the MinIO console
Resource Usage
Tested on 2-worker cluster (2 vCPU, 4 GB RAM per worker):
- Velero server: 50 MB RAM, <1% CPU (idle)
- Node agent (per node): 100 MB RAM, <5% CPU (during backup)
- MinIO: 200 MB RAM, <5% CPU
Backup times:
- Media namespace (8 apps, 40 GB PVCs): ~15 minutes
- Full cluster (3 namespaces, 60 GB total): ~25 minutes
Storage:
- Daily media backups: ~8 GB each (with compression)
- 7-day retention: ~56 GB
- Weekly full cluster: ~15 GB each
- 30-day retention: ~120 GB total
Provision 200 GB on your NAS for Velero backups.
What I Learned
1. Test Restores, Not Just Backups
“The only backup that matters is the one you’ve successfully restored from.” - Chaos engineering corollary
I ran Velero backups for four months. Never tested a restore. Then a bad Helm upgrade (helm upgrade traefik with a broken values file) nuked the media namespace - Traefik’s CRDs got deleted, which cascade-deleted a bunch of IngressRoutes with them. I ran velero restore create. It failed. Missing CRDs. The Traefik IngressRoute CRDs weren’t in the backup scope; Velero backs up resources, not necessarily the CRDs that define them. I spent two hours manually recreating ingress resources from Git history while Plex stayed down. My family’s weekend movie plans were collateral damage. I now schedule a quarterly “chaos day” where I delete a namespace and restore. Found two more issues: PVC restore failed because the StorageClass had been renamed, and Secrets with immutable fields broke. I fix these proactively now. Don’t learn this the way I did.
2. Separate App-Level and Cluster-Level Backups
Velero backs up Kubernetes state. Per-app backups back up internal databases. You need both.
True story: I restored Sonarr from Velero. The Deployment, Service, PVC - all back. Sonarr started. Opened the UI. Zero shows. Zero history. The PVC had the right permissions but the SQLite database was from before I’d added anything. Velero restored the empty volume. App-level backups would have restored the actual database. I had to re-add 200+ series. I am still finding things I forgot to re-add. Both layers. Always both.
3. MinIO Adds Latency but Simplifies Ops
Considered backing up directly to NFS (Velero’s filesystem plugin). MinIO adds a hop, but: S3 API is Velero’s native interface (fewer bugs), MinIO console makes browsing backups easy, and I can switch to Backblaze B2 later with zero config change. The 50 MB of extra memory is worth it. I tried the filesystem plugin once. Velero’s NFS support was… let’s say “enthusiastically documented but sparingly implemented.” MinIO just works.
4. Backup Everything, Restore Selectively
Full cluster backups sound expensive. They’re 15 GB. I back up everything weekly. One time I ran kubectl delete namespace media while tired. I’d meant to delete media-test or default - I was cleaning up. Autocomplete betrayed me. I stared at the terminal. Plex, Sonarr, Radarr, all of it - gone. The backup from 2am had everything. velero restore create media-restore --from-backup media-daily-20260208020000 --wait. Twenty minutes. Crisis averted. I added media to my “never delete” list. And stopped using kubectl when sleepy. Having options reduces stress. Having no option leads to a very long night.
5. Retention Policies Save Disk Space
First month, I kept every backup forever. “Storage is cheap.” Hit 500 GB. The NAS sent me a warning. Then my spouse asked why the homelab backup folder was larger than our entire photo library. I had no defensible answer. Set TTLs: daily 7 days, weekly 30 days, monthly 1 year. Now 200 GB. Auto-pruning means I never have to think about it. “I’ll clean this up later” is a lie we tell ourselves. Make the system do it.
What’s Next
You have disaster recovery for your Kubernetes cluster. Restore individual apps or rebuild from scratch.
Optional enhancements:
- Off-site backups - Sync MinIO to Backblaze B2 or AWS S3 for geographic redundancy (see the sketch after this list)
- Pre/post hooks - Quiesce databases before backup (flush writes, snapshot consistency)
- Monitoring integration - Alert on failed backups via Uptime Kuma
- Immutable backups - Enable S3 object lock to prevent ransomware deletion
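For the off-site option, one low-effort sketch using the MinIO client (mc); the B2 endpoint, alias names, and keys below are placeholders, not values from this setup:
# Register both endpoints once (names and endpoint are examples)
mc alias set nas http://minio.media.lan velero velero-secret-key
mc alias set offsite https://s3.<region>.backblazeb2.com <keyID> <applicationKey>
# Mirror the velero bucket off-site; run from cron or a scheduled job
mc mirror nas/velero offsite/velero-backups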
The core setup is production-ready. Sleep better knowing you can rebuild in minutes.