Cluster-level disaster recovery for Kubernetes. Back up all resources, persistent volumes, and configs. Restore an entire namespace (or the whole cluster) to a new environment in minutes.
“My cluster’s etcd is corrupted. Do I have backups?” - Questions you don’t want to ask at 2am
Why Velero#
I had per-app backups (Sonarr config, Radarr database) but no cluster-level disaster recovery. One bad Helm upgrade cascaded and broke networking. I spent four hours manually recreating ingress rules, secrets, and PVCs from memory.
Velero gives you:
- Full namespace backups - All resources (Deployments, Services, PVCs, Secrets, ConfigMaps)
- Persistent volume snapshots - Actual data, not just resource definitions
- Scheduled backups - Daily/weekly cron-like automation
- Cross-cluster restore - Rebuild on new hardware
- Selective recovery - Restore one app or the whole cluster
This complements per-app backups. Velero captures the Kubernetes layer (manifests, volumes). App backups capture internal state (databases, configs).
Architecture#
Velero (velero namespace)
    ↓
Scheduled Backups:
    • media-daily (all resources + PVCs)
    • cluster-weekly (full cluster state)
    ↓
Storage:
    • NFS backend (Synology: /volume1/nfs01/velero-backups)
    • Restic for volume snapshots (filesystem-level)
    ↓
Restore:
    • Same cluster (rollback bad upgrades)
    • New cluster (disaster recovery)
Deployment Repo#
Full source: k8s-velero-backups on GitHub
k8s-velero-backups/
├── values.yaml        # Velero Helm values
├── backup-schedules/  # CronJob-style backup definitions
│   ├── media-daily.yaml
│   └── cluster-weekly.yaml
├── deploy.sh          # Automated deployment
├── restore.sh         # Interactive restore script
└── verify-backup.sh   # Test backup integrity
Storage Backend#
Velero writes backups to S3-compatible object storage (or GCS/Azure Blob via the corresponding plugins). For a homelab, the simplest option is NFS fronted by an S3-compatible gateway, which is what the MinIO setup below provides.
NFS Setup on Synology#
SSH to your NAS and create the backup directory:
ssh jlambert@192.168.2.129
sudo mkdir -p /volume1/nfs01/velero-backups
sudo chown -R nobody:nogroup /volume1/nfs01/velero-backups
sudo chmod 755 /volume1/nfs01/velero-backups
Verify the NFS export in DSM: Control Panel → Shared Folder → nfs01 → Edit → NFS Permissions.
Ensure your K8s subnet (192.168.2.0/24) has read/write access.
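Before deploying anything on top of it, confirm a worker node can actually reach the export. A quick check from any cluster node (a sketch assuming the NFS client utilities, including showmount, are installed on the node; the IP and path are the ones above):
# List exports from the NAS, then do a throwaway mount/unmount of the share
showmount -e 192.168.2.129
sudo mount -t nfs 192.168.2.129:/volume1/nfs01 /mnt && sudo umount /mnt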
Helm Values#
# values.yaml
image:
  repository: velero/velero
  tag: v1.14.1

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    volumeMounts:
      - mountPath: /target
        name: plugins

configuration:
  # NFS storage via S3 API (MinIO running on NFS)
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      config:
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://minio.velero.svc.cluster.local:9000
        publicUrl: http://minio.velero.svc.cluster.local:9000
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: minio
  # Use Restic for filesystem-level PVC backups
  uploaderType: restic
  defaultVolumesToFsBackup: true
  # How often Velero reconciles backup metadata from object storage
  backupSyncPeriod: 1h
  restoreOnlyMode: false

# Node agent DaemonSet mounts pod volumes for Restic backups
deployNodeAgent: true
nodeAgent:
  podVolumePath: /var/lib/kubelet/pods
  privileged: false
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

credentials:
  useSecret: true
  existingSecret: velero-credentials

# Schedules (defined separately as CRDs)
schedules: {}

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

rbac:
  create: true

serviceAccount:
  server:
    create: true
Key decisions:
- MinIO as S3 gateway - Wraps NFS in S3 API (Velero’s native interface)
- Restic for volumes - Filesystem-level snapshots (doesn’t require CSI snapshot support)
- Node agent - Runs DaemonSet to access PVCs for backup
Deploy#
1. Install MinIO (S3 Backend)#
MinIO provides the S3 API that Velero expects, backed by NFS.
Create minio-values.yaml:
mode: standalone
replicas: 1

persistence:
  enabled: true
  storageClass: nfs-appdata
  size: 50Gi

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

service:
  type: ClusterIP
  port: 9000

consoleService:
  enabled: true
  port: 9001

buckets:
  - name: velero
    policy: none
    purge: false

users:
  - accessKey: velero
    secretKey: velero-secret-key
    policy: readwrite

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - minio.media.lan
  tls: []
Deploy:
helm repo add minio https://charts.min.io/
helm repo update
kubectl create namespace velero
helm upgrade --install minio minio/minio \
  -n velero -f minio-values.yaml --wait
Verify:
kubectl get pods -n velero
kubectl get svc -n velero
Access MinIO console: http://minio.media.lan (user: velero, password: velero-secret-key)
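The chart should have created the velero bucket. To confirm from the CLI instead of the console, a throwaway MinIO client pod works; a sketch that assumes the minio service name and the credentials from the values above:
# One-off mc pod: list buckets via the in-cluster MinIO service
kubectl run mc --rm -i --restart=Never -n velero --image=minio/mc \
  --env MC_HOST_local=http://velero:velero-secret-key@minio:9000 \
  --command -- mc ls local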
2. Create Velero Credentials Secret#
cat <<EOF > credentials-velero
[default]
aws_access_key_id = velero
aws_secret_access_key = velero-secret-key
EOF
kubectl create secret generic velero-credentials \
-n velero \
--from-file=cloud=credentials-velero
rm credentials-velero
3. Install Velero#
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm upgrade --install velero vmware-tanzu/velero \
  -n velero -f values.yaml --wait
Verify:
kubectl get pods -n velero
kubectl logs -n velero -l app.kubernetes.io/name=velero
You should see: "Backup storage location is valid"
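Two more quick checks: the backup storage location should report as available, and the node agent DaemonSet (named node-agent in recent chart versions, created because deployNodeAgent is true) should have a pod on every worker:
velero backup-location get
kubectl get daemonset node-agent -n velero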
Backup Schedules#
Create scheduled backups using Velero’s Schedule CRD (like CronJobs for backups).
Daily Media Namespace Backup#
# backup-schedules/media-daily.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: media-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2am daily
  template:
    includedNamespaces:
      - media
    includedResources:
      - '*'
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 168h  # Keep 7 days
Apply:
kubectl apply -f backup-schedules/media-daily.yaml
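Rather than waiting for 2am, you can exercise the schedule right away; recent Velero releases can create a one-off backup from a schedule's template:
velero backup create --from-schedule media-daily --wait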
Weekly Full Cluster Backup#
# backup-schedules/cluster-weekly.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3am Sunday
  template:
    includedNamespaces:
      - '*'
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
    includedResources:
      - '*'
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 720h  # Keep 30 days
Apply:
kubectl apply -f backup-schedules/cluster-weekly.yaml
Verify schedules:
velero schedule get
velero backup get
Manual Backups#
Trigger an immediate backup:
# Backup entire media namespace
velero backup create media-manual \
--include-namespaces media \
--default-volumes-to-fs-backup \
--wait
# Backup single app (Sonarr)
velero backup create sonarr-manual \
--include-namespaces media \
--selector app.kubernetes.io/name=sonarr \
--default-volumes-to-fs-backup \
--wait
# Full cluster backup
velero backup create cluster-manual \
--exclude-namespaces kube-system,kube-public,kube-node-lease \
--default-volumes-to-fs-backup \
--wait
Check status:
velero backup describe media-manual
velero backup logs media-manual
Restore#
Restore Entire Namespace#
Scenario: Bad Helm upgrade broke the media namespace.
# 1. Delete broken namespace (optional but cleaner)
kubectl delete namespace media
# 2. Restore from latest backup
velero restore create media-restore-$(date +%s) \
--from-backup media-daily-20260208020000 \
--wait
# 3. Verify
kubectl get all -n media
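If you'd rather not look up the timestamped backup name, recent Velero releases can also restore from the most recent successful backup of a schedule:
velero restore create media-restore-$(date +%s) \
  --from-schedule media-daily \
  --wait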
Restore Single App#
Scenario: Sonarr’s database corrupted. Restore just Sonarr.
# 1. Scale down Sonarr
kubectl scale -n media deploy/sonarr --replicas=0
# 2. Restore Sonarr resources
velero restore create sonarr-restore-$(date +%s) \
--from-backup media-daily-20260208020000 \
--include-resources deployment,service,ingress,pvc,secret,configmap \
--selector app.kubernetes.io/name=sonarr \
--wait
# 3. Verify
kubectl get pods -n media -l app.kubernetes.io/name=sonarr
Disaster Recovery (New Cluster)#
Scenario: Entire cluster lost (hardware failure, etcd corruption).
- Build new cluster - Use k8s-deploy Terraform repo
- Install foundation - MetalLB, Traefik, NFS CSI, Velero (same as original)
- Point Velero at existing backups:
# MinIO already deployed with same NFS backend
# Velero sees existing backups automatically
velero backup get
- Restore cluster state:
velero restore create full-restore-$(date +%s) \
--from-backup cluster-weekly-20260202030000 \
--wait
- Verify all namespaces:
kubectl get namespaces
kubectl get all -n media
kubectl get all -n monitoring
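Also check the restore objects themselves for warnings or partial failures (the restore name is whatever the create command above generated):
velero restore get
velero restore describe full-restore-<timestamp> --details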
Backup Verification#
Don’t trust backups you haven’t tested. The repo includes verify-backup.sh:
#!/bin/bash
set -euo pipefail

BACKUP_NAME="${1:-}"
if [[ -z "$BACKUP_NAME" ]]; then
  echo "Usage: $0 <backup-name>"
  echo ""
  echo "Available backups:"
  velero backup get
  exit 1
fi

echo "Verifying backup: $BACKUP_NAME"

# Check backup completed successfully
STATUS=$(velero backup describe "$BACKUP_NAME" --details | grep -i phase | awk '{print $2}')
if [[ "$STATUS" != "Completed" ]]; then
  echo "❌ Backup status: $STATUS"
  exit 1
fi
echo "✅ Backup status: Completed"

# Check for errors
ERRORS=$(velero backup describe "$BACKUP_NAME" --details | grep -i errors | awk '{print $2}')
if [[ "$ERRORS" != "0" ]]; then
  echo "⚠️ Backup has $ERRORS errors:"
  velero backup logs "$BACKUP_NAME" | grep -i error
  exit 1
fi
echo "✅ No errors"

# Verify volumes backed up
VOLUMES=$(velero backup describe "$BACKUP_NAME" --details | grep -A 20 "Restic Backups" | grep "Completed: " | awk '{print $2}')
echo "✅ Volumes backed up: $VOLUMES"

# Check backup size in MinIO
echo ""
echo "Backup stored in MinIO (velero bucket)"
Run monthly:
./verify-backup.sh media-daily-20260208020000
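To avoid remembering the timestamped name, you can feed the script the most recent daily backup by parsing the CLI's table output (a small convenience, assuming your backup names keep the media-daily prefix):
./verify-backup.sh "$(velero backup get | awk '/^media-daily-/ {print $1}' | sort | tail -1)"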
Troubleshooting#
Backup Stuck in Progress#
Symptom: Backup never completes.
velero backup describe <backup-name>
Common causes:
Restic timeout - Large volumes take time. Increase timeout:
# values.yaml
configuration:
  fsBackupTimeout: 4h  # Default 1h
PVC not found - Velero can’t access PVC. Check node agent pods:
kubectl get pods -n velero -l name=node-agent
kubectl logs -n velero -l name=node-agent
Restore Fails with “Already Exists”#
Symptom: velero restore fails because resources already exist.
Fix: Delete and retry, or use --preserve-nodeports=false for Services.
kubectl delete namespace media
velero restore create media-restore-$(date +%s) --from-backup <backup> --wait
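Alternatively, newer Velero releases can update existing resources in place during a restore instead of colliding with them; check that your version supports the flag:
velero restore create media-restore-$(date +%s) \
  --from-backup <backup> \
  --existing-resource-policy=update \
  --wait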
MinIO Connection Refused#
Symptom: Velero logs show connection refused to MinIO.
Check:
kubectl get svc -n velero minio
kubectl logs -n velero -l app.kubernetes.io/name=velero | grep -i minio
Fix: Verify MinIO service is running and accessible:
kubectl exec -n velero deploy/velero -- wget -O- http://minio.velero.svc.cluster.local:9000
Backup Storage Location Unavailable#
Symptom: velero backup-location get shows Unavailable.
Check:
velero backup-location describe default
Common causes:
- Wrong credentials - Verify the velero-credentials secret
- MinIO not running - Check kubectl get pods -n velero
- Bucket doesn’t exist - Create the velero bucket in the MinIO console
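A quick way to rule out the credentials cause is to dump the secret Velero is actually using and compare it against the MinIO user from the chart values:
kubectl get secret velero-credentials -n velero -o jsonpath='{.data.cloud}' | base64 -d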
Resource Usage#
Tested on 2-worker cluster (2 vCPU, 4 GB RAM per worker):
- Velero server: 50 MB RAM, <1% CPU (idle)
- Node agent (per node): 100 MB RAM, <5% CPU (during backup)
- MinIO: 200 MB RAM, <5% CPU
Backup times:
- Media namespace (8 apps, 40 GB PVCs): ~15 minutes
- Full cluster (3 namespaces, 60 GB total): ~25 minutes
Storage:
- Daily media backups: ~8 GB each (with compression)
- 7-day retention: ~56 GB
- Weekly full cluster: ~15 GB each
- 30-day retention: ~120 GB total
Provision 200 GB on your NAS for Velero backups.
What I Learned#
1. Test Restores, Not Just Backups#
Backups are worthless until proven. I schedule a quarterly “chaos day” where I delete a namespace and restore from backup. I found three issues this way before they mattered:
- Velero couldn’t restore ingress due to missing CRDs
- PVC restore failed because StorageClass disappeared
- Secrets with immutable fields broke updates
Now I fix these proactively.
2. Separate App-Level and Cluster-Level Backups#
Velero backs up Kubernetes state. Per-app backups back up internal databases. You need both.
Example: Sonarr’s Kubernetes resources (Deployment, Service, PVC) exist, but the SQLite database inside is corrupted. Velero restores the PVC (empty or old data). App-level backup restores the database.
3. MinIO Adds Latency but Simplifies Ops#
I considered backing up directly to NFS instead of going through an S3 gateway. MinIO adds a hop, but:
- S3 API is Velero’s native interface (less buggy)
- MinIO console makes browsing backups easy
- Portable - switch to real S3 or Backblaze B2 later by changing only the endpoint and credentials
The 50 MB of extra memory is worth the operational simplicity.
4. Backup Everything, Restore Selectively#
Full cluster backups sound expensive. They’re cheap (15 GB). Storage is cheaper than reconstruction time.
I back up everything weekly. Restore only what’s needed. Deleted the wrong namespace? Restore it. Entire cluster? Restore everything. Having options reduces stress.
5. Retention Policies Save Disk Space#
First month, I kept every backup forever. Hit 500 GB. Set TTLs on schedules:
- Daily: 7 days
- Weekly: 30 days
- Monthly: 1 year
Now 200 GB covers everything. Auto-pruning prevents “I’ll clean this up later” debt.
What’s Next#
You have disaster recovery for your Kubernetes cluster. Restore individual apps or rebuild from scratch.
Optional enhancements:
- Off-site backups - Sync MinIO to Backblaze B2 or AWS S3 for geographic redundancy
- Pre/post hooks - Quiesce databases before backup (flush writes, snapshot consistency); see the annotation sketch after this list
- Monitoring integration - Alert on failed backups via Uptime Kuma
- Immutable backups - Enable S3 object lock to prevent ransomware deletion
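For the hooks item, Velero reads pre/post backup hooks from pod annotations. A minimal sketch for a quick test against a hypothetical Postgres pod (the container name and command are assumptions; for a durable setup, put the annotations on the Deployment's pod template instead):
# Annotate the running pod so Velero runs a CHECKPOINT before backing up its volumes
kubectl -n media annotate pod -l app.kubernetes.io/name=postgres \
  pre.hook.backup.velero.io/container=postgres \
  pre.hook.backup.velero.io/command='["/bin/sh", "-c", "psql -U postgres -c CHECKPOINT"]' \
  pre.hook.backup.velero.io/timeout=120s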
The core setup is production-ready. Sleep better knowing you can rebuild in minutes.