
Kubernetes Cluster Backups with Velero

Author: Jourdan Lambert

Kubernetes Homelab - This article is part of a series.

Cluster-level disaster recovery for Kubernetes. Back up all resources, persistent volumes, and configs. Restore an entire namespace (or the whole cluster) to a new environment in minutes.

“My cluster’s etcd is corrupted. Do I have backups?” - Questions you don’t want to ask at 2am


Why Velero
#

I had per-app backups (Sonarr config, Radarr database) but no cluster-level disaster recovery. One bad Helm upgrade cascaded and broke networking, and I spent four hours manually recreating ingress rules, secrets, and PVCs from memory.

Velero gives you:

  • Full namespace backups - All resources (Deployments, Services, PVCs, Secrets, ConfigMaps)
  • Persistent volume snapshots - Actual data, not just resource definitions
  • Scheduled backups - Daily/weekly cron-like automation
  • Cross-cluster restore - Rebuild on new hardware
  • Selective recovery - Restore one app or the whole cluster

This complements per-app backups. Velero captures the Kubernetes layer (manifests, volumes). App backups capture internal state (databases, configs).


Architecture
#

┌─────────────────────────────────────────────────────────────┐
│  Velero (velero namespace)                                   │
│       ↓                                                      │
│  Scheduled Backups:                                          │
│   • media-daily (all resources + PVCs)                       │
│   • cluster-weekly (full cluster state)                      │
│       ↓                                                      │
│  Storage:                                                    │
│   • NFS backend (Synology: /volume1/nfs01/velero-backups)  │
│   • Restic for volume snapshots (filesystem-level)          │
│       ↓                                                      │
│  Restore:                                                    │
│   • Same cluster (rollback bad upgrades)                    │
│   • New cluster (disaster recovery)                         │
└─────────────────────────────────────────────────────────────┘

Deployment Repo
#

Full source: k8s-velero-backups on GitHub

k8s-velero-backups/
├── values.yaml              # Velero Helm values
├── backup-schedules/        # CronJob-style backup definitions
│   ├── media-daily.yaml
│   └── cluster-weekly.yaml
├── deploy.sh                # Automated deployment
├── restore.sh               # Interactive restore script
└── verify-backup.sh         # Test backup integrity

Storage Backend
#

Velero supports S3, GCS, Azure Blob, and filesystem targets. For a homelab, NFS is the simplest.

NFS Setup on Synology
#

SSH to your NAS and create the backup directory:

ssh jlambert@192.168.2.129
sudo mkdir -p /volume1/nfs01/velero-backups
sudo chown -R nobody:nogroup /volume1/nfs01/velero-backups
sudo chmod 755 /volume1/nfs01/velero-backups

Verify NFS export in DSM: Control Panel → Shared Folder → nfs01 → Edit → NFS Permissions

Ensure your K8s subnet (192.168.2.0/24) has read/write access.
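Before pointing anything at the share, it's worth confirming the export is actually visible from a cluster node. A quick sketch, assuming `showmount` (from `nfs-common` or equivalent) is available on the node and using the NAS IP from above:

```shell
# List the exports the NAS is offering; /volume1/nfs01 should appear
showmount -e 192.168.2.129

# Optionally test-mount the backup path and confirm write access
sudo mount -t nfs 192.168.2.129:/volume1/nfs01/velero-backups /mnt
sudo touch /mnt/.velero-write-test && sudo rm /mnt/.velero-write-test
sudo umount /mnt
```

If the write test fails, revisit the squash settings in the DSM NFS permissions dialog.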


Helm Values
#

# values.yaml
image:
  repository: velero/velero
  tag: v1.14.1

initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.10.1
    volumeMounts:
      - mountPath: /target
        name: plugins

configuration:
  # NFS storage via S3 API (MinIO running on NFS)
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: velero
      config:
        region: minio
        s3ForcePathStyle: "true"
        s3Url: http://minio.velero.svc.cluster.local:9000
        publicUrl: http://minio.velero.svc.cluster.local:9000

  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: minio

  # Use Restic for filesystem-level PVC backups
  uploaderType: restic
  defaultVolumesToFsBackup: true

  # How often Velero re-syncs backup metadata from object storage
  backupSyncPeriod: 1h
  restoreOnlyMode: false

# Node agent DaemonSet (required for Restic filesystem-level volume backups)
deployNodeAgent: true

nodeAgent:
  podVolumePath: /var/lib/kubelet/pods
  privileged: false
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 500m
      memory: 512Mi

credentials:
  useSecret: true
  existingSecret: velero-credentials

# Schedules (defined separately as CRDs)
schedules: {}

resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi

rbac:
  create: true

serviceAccount:
  server:
    create: true

Key decisions:

  • MinIO as S3 gateway - Wraps NFS in S3 API (Velero’s native interface)
  • Restic for volumes - Filesystem-level snapshots (doesn’t require CSI snapshot support)
  • Node agent - Runs DaemonSet to access PVCs for backup
ℹ️ Info
Velero was originally designed for cloud object storage (S3, GCS). For homelab NFS, we run MinIO as a lightweight S3-compatible shim. This keeps Velero’s API clean while storing backups on your NAS.

Deploy
#

1. Install MinIO (S3 Backend)
#

MinIO provides the S3 API that Velero expects, backed by NFS.

Create minio-values.yaml:

mode: standalone

replicas: 1

persistence:
  enabled: true
  storageClass: nfs-appdata
  size: 50Gi

resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi

service:
  type: ClusterIP
  port: 9000

consoleService:
  enabled: true
  port: 9001

buckets:
  - name: velero
    policy: none
    purge: false

users:
  - accessKey: velero
    secretKey: velero-secret-key
    policy: readwrite

ingress:
  enabled: true
  ingressClassName: traefik
  hosts:
    - minio.media.lan
  tls: []

Deploy:

helm repo add minio https://charts.min.io/
helm repo update

kubectl create namespace velero

helm upgrade --install minio minio/minio \
    -n velero -f minio-values.yaml --wait

Verify:

kubectl get pods -n velero
kubectl get svc -n velero

Access MinIO console: http://minio.media.lan (user: velero, password: velero-secret-key)

2. Create Velero Credentials Secret
#

cat <<EOF > credentials-velero
[default]
aws_access_key_id = velero
aws_secret_access_key = velero-secret-key
EOF

kubectl create secret generic velero-credentials \
    -n velero \
    --from-file=cloud=credentials-velero

rm credentials-velero

3. Install Velero
#

helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update

helm upgrade --install velero vmware-tanzu/velero \
    -n velero -f values.yaml --wait

Verify:

kubectl get pods -n velero
kubectl logs -n velero -l app.kubernetes.io/name=velero

You should see: "Backup storage location is valid"
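The same health check is available from the CLI; the default backup storage location should report Available:

```shell
# PHASE should read "Available"; anything else means Velero
# cannot reach or authenticate to MinIO
velero backup-location get
```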


Backup Schedules
#

Create scheduled backups using Velero’s Schedule CRD (like CronJobs for backups).

Daily Media Namespace Backup
#

# backup-schedules/media-daily.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: media-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"  # 2am daily
  template:
    includedNamespaces:
      - media
    includedResources:
      - '*'
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 168h  # Keep 7 days

Apply:

kubectl apply -f backup-schedules/media-daily.yaml

Weekly Full Cluster Backup
#

# backup-schedules/cluster-weekly.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: cluster-weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # 3am Sunday
  template:
    includedNamespaces:
      - '*'
    excludedNamespaces:
      - kube-system
      - kube-public
      - kube-node-lease
    includedResources:
      - '*'
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 720h  # Keep 30 days

Apply:

kubectl apply -f backup-schedules/cluster-weekly.yaml

Verify schedules:

velero schedule get
velero backup get

Manual Backups
#

Trigger an immediate backup:

# Backup entire media namespace
velero backup create media-manual \
    --include-namespaces media \
    --default-volumes-to-fs-backup \
    --wait

# Backup single app (Sonarr)
velero backup create sonarr-manual \
    --include-namespaces media \
    --selector app.kubernetes.io/name=sonarr \
    --default-volumes-to-fs-backup \
    --wait

# Full cluster backup
velero backup create cluster-manual \
    --exclude-namespaces kube-system,kube-public,kube-node-lease \
    --default-volumes-to-fs-backup \
    --wait

Check status:

velero backup describe media-manual
velero backup logs media-manual

Restore
#

Restore Entire Namespace
#

Scenario: Bad Helm upgrade broke the media namespace.

# 1. Delete broken namespace (optional but cleaner)
kubectl delete namespace media

# 2. Restore from latest backup
velero restore create media-restore-$(date +%s) \
    --from-backup media-daily-20260208020000 \
    --wait

# 3. Verify
kubectl get all -n media

Restore Single App
#

Scenario: Sonarr’s database corrupted. Restore just Sonarr.

# 1. Scale down Sonarr
kubectl scale -n media deploy/sonarr --replicas=0

# 2. Restore Sonarr resources
velero restore create sonarr-restore-$(date +%s) \
    --from-backup media-daily-20260208020000 \
    --include-resources deployment,service,ingress,pvc,secret,configmap \
    --selector app.kubernetes.io/name=sonarr \
    --wait

# 3. Verify
kubectl get pods -n media -l app.kubernetes.io/name=sonarr

Disaster Recovery (New Cluster)
#

Scenario: Entire cluster lost (hardware failure, etcd corruption).

  1. Build new cluster - Use k8s-deploy Terraform repo
  2. Install foundation - MetalLB, Traefik, NFS CSI, Velero (same as original)
  3. Point Velero at existing backups:

# MinIO already deployed with same NFS backend
# Velero sees existing backups automatically
velero backup get

  4. Restore cluster state:

velero restore create full-restore-$(date +%s) \
    --from-backup cluster-weekly-20260202030000 \
    --wait

  5. Verify all namespaces:

kubectl get namespaces
kubectl get all -n media
kubectl get all -n monitoring

Backup Verification
#

Don’t trust backups you haven’t tested. The repo includes verify-backup.sh:

#!/bin/bash
set -euo pipefail

BACKUP_NAME="${1:-}"

if [[ -z "$BACKUP_NAME" ]]; then
    echo "Usage: $0 <backup-name>"
    echo ""
    echo "Available backups:"
    velero backup get
    exit 1
fi

echo "Verifying backup: $BACKUP_NAME"

# Check backup completed successfully
STATUS=$(velero backup describe "$BACKUP_NAME" --details | grep -i phase | awk '{print $2}')

if [[ "$STATUS" != "Completed" ]]; then
    echo "❌ Backup status: $STATUS"
    exit 1
fi

echo "✅ Backup status: Completed"

# Check for errors
ERRORS=$(velero backup describe "$BACKUP_NAME" --details | grep -i errors | awk '{print $2}')

if [[ "$ERRORS" != "0" ]]; then
    echo "⚠️  Backup has $ERRORS errors:"
    velero backup logs "$BACKUP_NAME" | grep -i error
    exit 1
fi

echo "✅ No errors"

# Verify volumes backed up
VOLUMES=$(velero backup describe "$BACKUP_NAME" --details | grep -A 20 "Restic Backups" | grep "Completed: " | awk '{print $2}')

echo "✅ Volumes backed up: $VOLUMES"

# Check backup size in MinIO
echo ""
echo "Backup stored in MinIO (velero bucket)"

Run monthly:

./verify-backup.sh media-daily-20260208020000

Troubleshooting
#

Backup Stuck in Progress
#

Symptom: Backup never completes.

velero backup describe <backup-name>

Common causes:

  • Restic timeout - Large volumes take time. Increase timeout:

    # values.yaml
    configuration:
      fsBackupTimeout: 4h  # Default 1h
  • PVC not found - Velero can’t access PVC. Check node agent pods:

    kubectl get pods -n velero -l name=node-agent
    kubectl logs -n velero -l name=node-agent

Restore Fails with “Already Exists”
#

Symptom: velero restore fails because resources already exist.

Fix: Delete the conflicting resources and retry. If Service nodePorts collide, add --preserve-nodeports=false so new ports are assigned on restore.

kubectl delete namespace media
velero restore create media-restore-$(date +%s) --from-backup <backup> --wait
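If deleting the namespace isn't an option, Velero (1.9+) also supports an existing-resource policy that patches conflicting resources in place instead of skipping them:

```shell
# Update resources that already exist rather than failing on conflicts
velero restore create media-restore-$(date +%s) \
    --from-backup <backup> \
    --existing-resource-policy=update \
    --wait
```

Note that update mode patches only what differs from the backup; it won't remove resources created after the backup was taken.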

MinIO Connection Refused
#

Symptom: Velero logs show connection refused to MinIO.

Check:

kubectl get svc -n velero minio
kubectl logs -n velero -l app.kubernetes.io/name=velero | grep -i minio

Fix: Verify the MinIO service is running and reachable from inside the cluster. The Velero image may not ship wget or curl, so use a throwaway pod instead:

kubectl run -n velero minio-check --rm -it --restart=Never \
    --image=curlimages/curl -- \
    curl -sS -o /dev/null -w '%{http_code}\n' \
    http://minio.velero.svc.cluster.local:9000/minio/health/live

Backup Storage Location Unavailable
#

Symptom: velero backup-location get shows Unavailable.

Check:

velero backup-location describe default

Common causes:

  • Wrong credentials - Verify velero-credentials secret
  • MinIO not running - Check kubectl get pods -n velero
  • Bucket doesn’t exist - Create velero bucket in MinIO console
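The bucket can also be created from the MinIO client instead of the console. A sketch assuming the `mc` CLI is installed locally and the credentials from the MinIO values above:

```shell
# Reach the in-cluster MinIO from your workstation
kubectl port-forward -n velero svc/minio 9000:9000 &

# Register the endpoint and create the bucket Velero expects
mc alias set homelab http://localhost:9000 velero velero-secret-key
mc mb homelab/velero
mc ls homelab
```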

Resource Usage
#

Tested on 2-worker cluster (2 vCPU, 4 GB RAM per worker):

  • Velero server: 50 MB RAM, <1% CPU (idle)
  • Node agent (per node): 100 MB RAM, <5% CPU (during backup)
  • MinIO: 200 MB RAM, <5% CPU

Backup times:

  • Media namespace (8 apps, 40 GB PVCs): ~15 minutes
  • Full cluster (3 namespaces, 60 GB total): ~25 minutes

Storage:

  • Daily media backups: ~8 GB each (with compression)
  • 7-day retention: ~56 GB
  • Weekly full cluster: ~15 GB each
  • 30-day retention: ~120 GB total

Provision 200 GB on your NAS for Velero backups.


What I Learned
#

1. Test Restores, Not Just Backups
#

Backups are worthless until proven. I schedule a quarterly “chaos day” where I delete a namespace and restore from backup. Found three issues this way before they mattered:

  • Velero couldn’t restore ingress due to missing CRDs
  • PVC restore failed because StorageClass disappeared
  • Secrets with immutable fields broke updates

Now I fix these proactively.

2. Separate App-Level and Cluster-Level Backups
#

Velero backs up Kubernetes state. Per-app backups back up internal databases. You need both.

Example: Sonarr’s Kubernetes resources (Deployment, Service, PVC) exist, but the SQLite database inside is corrupted. Velero restores the PVC (empty or old data). App-level backup restores the database.

3. MinIO Adds Latency but Simplifies Ops
#

Considered backing up directly to NFS (Velero’s filesystem plugin). MinIO adds a hop, but:

  • S3 API is Velero’s native interface (less buggy)
  • MinIO console makes browsing backups easy
  • Portable - switch to real S3/Backblaze B2 later with zero config change

The 50 MB of extra memory is worth the operational simplicity.

4. Backup Everything, Restore Selectively
#

Full cluster backups sound expensive. They’re cheap (15 GB). Storage is cheaper than reconstruction time.

I back up everything weekly. Restore only what’s needed. Deleted the wrong namespace? Restore it. Entire cluster? Restore everything. Having options reduces stress.

5. Retention Policies Save Disk Space
#

First month, I kept every backup forever. Hit 500 GB. Set TTLs on schedules:

  • Daily: 7 days
  • Weekly: 30 days
  • Monthly: 1 year

Now 200 GB covers everything. Auto-pruning prevents “I’ll clean this up later” debt.
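Since Schedule is just a CRD, retention can be adjusted on a live schedule without redeploying anything. A sketch using `kubectl patch` against the media-daily schedule from earlier (the new TTL applies to future backups, not existing ones):

```shell
# Shorten media-daily retention to 3 days
kubectl patch schedule media-daily -n velero --type merge \
    -p '{"spec":{"template":{"ttl":"72h0m0s"}}}'
```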


What’s Next
#

You have disaster recovery for your Kubernetes cluster. Restore individual apps or rebuild from scratch.

Optional enhancements:

  • Off-site backups - Sync MinIO to Backblaze B2 or AWS S3 for geographic redundancy
  • Pre/post hooks - Quiesce databases before backup (flush writes, snapshot consistency)
  • Monitoring integration - Alert on failed backups via Uptime Kuma
  • Immutable backups - Enable S3 object lock to prevent ransomware deletion
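Pre/post hooks are configured with pod annotations that Velero reads at backup time. A hedged sketch for quiescing a volume before snapshot; the container name and `fsfreeze` command are illustrative, and the right quiesce step depends on the app:

```yaml
# Pod template annotations understood by Velero's backup hooks
metadata:
  annotations:
    pre.hook.backup.velero.io/container: sonarr
    pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/config"]'
    post.hook.backup.velero.io/container: sonarr
    post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/config"]'
```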

The core setup is production-ready. Sleep better knowing you can rebuild in minutes.


References
#
