Pod and node autoscaling for a homelab Talos cluster. HPA for horizontal scaling, VPA for right-sizing, and a custom Terraform-based node autoscaler for Proxmox.
“My Plex transcode pod needs 4 cores right now, but 0.1 cores at 3am. I’m not manually scaling this.”
The Problem
Static resource allocation wastes capacity. You size workloads for peak load, then spend 90% of the day running at 20% utilization. On a homelab with limited RAM and CPU, that overhead means fewer apps or bigger hardware.
I ran everything with fixed requests and limits for months. Sonarr got 512Mi of RAM — it used 180Mi. Plex got 2 CPU cores — it used 0.3 cores unless someone was transcoding. Meanwhile, qBittorrent was OOMKilled weekly because its 512Mi limit was too low during batch imports.
The fix isn’t guessing better numbers. It’s letting the cluster measure actual usage and adjust.
Architecture
Three layers of autoscaling, each solving a different problem:
┌────────────────────────────────────────────────────────┐
│ Layer 3: Node Autoscaler (custom) │
│ Monitors: kubectl top nodes │
│ Adjusts: worker_count in terraform.tfvars │
│ Result: More/fewer Proxmox VMs │
├────────────────────────────────────────────────────────┤
│ Layer 2: HPA (Horizontal Pod Autoscaler) │
│ Monitors: Pod CPU/memory metrics │
│ Adjusts: Replica count │
│ Result: More/fewer pods │
├────────────────────────────────────────────────────────┤
│ Layer 1: VPA (Vertical Pod Autoscaler) │
│ Monitors: Pod resource usage over time │
│ Adjusts: Resource requests/limits │
│ Result: Right-sized pods │
├────────────────────────────────────────────────────────┤
│ Foundation: Metrics Server │
│ Provides: CPU/memory metrics via Metrics API │
│ Required by: HPA, VPA, kubectl top │
└────────────────────────────────────────────────────────┘
| Layer | Scales what | Trigger | Speed |
|---|---|---|---|
| VPA | Pod resources (CPU/RAM) | Usage drift from requests | Minutes (recreates pod) |
| HPA | Pod replicas | CPU/memory threshold | Seconds |
| Node autoscaler | Worker VMs | Node-level CPU threshold | Minutes (Terraform apply) |
Full source: k8s-deploy/addons
Metrics Server
Everything depends on metrics. No metrics, no autoscaling.
kubectl apply -f addons/metrics-server/metrics-server.yaml
# Wait for it
kubectl wait --for=condition=ready pod \
-l k8s-app=metrics-server -n kube-system --timeout=120s
Verify:
kubectl top nodes
# NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# talos-cp-1 250m 12% 1200Mi 30%
# talos-w-1 800m 20% 3200Mi 20%
# talos-w-2 600m 15% 2800Mi 17%
kubectl top pods -n media
If kubectl top shows error: Metrics API not available, the metrics server isn’t ready yet. Check its logs: kubectl logs -n kube-system -l k8s-app=metrics-server. Common issue: metrics server can’t verify kubelet certificates — Talos uses self-signed certs, so the metrics server deployment needs --kubelet-insecure-tls.
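If the manifest you applied doesn't already include that flag, one way to add it in place is a JSON patch against the deployment (this assumes metrics-server is the container at index 0):
# Append --kubelet-insecure-tls to the container args, then wait for the rollout
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p '[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'
kubectl rollout status deployment/metrics-server -n kube-system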
Vertical Pod Autoscaler (VPA)
VPA watches actual pod resource consumption over time and recommends (or applies) better requests and limits. It’s the “stop guessing” autoscaler.
Deploy
kubectl apply -f addons/vpa/vpa.yaml
kubectl get pods -n vpa-system
VPA Modes
| Mode | Behavior | When to use |
|---|---|---|
| Off | Recommends only, doesn’t change anything | Start here. Observe for a week. |
| Initial | Sets requests on pod creation, no restarts | Stateful apps (Plex, Sonarr) |
| Auto | Evicts and recreates pods with new requests | Stateless apps (frontends, APIs) |
Example: VPA for Sonarr
Start in Off mode to see what VPA recommends without disrupting anything:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: sonarr
namespace: media
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: sonarr
updatePolicy:
updateMode: "Off"
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 50m
memory: 128Mi
maxAllowed:
cpu: 2000m
memory: 4Gi
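Apply it and confirm the object exists (assuming you saved the manifest as sonarr-vpa.yaml):
kubectl apply -f sonarr-vpa.yaml
kubectl get vpa -n media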
After a week, check what VPA thinks:
kubectl describe vpa sonarr -n media
You’ll see:
Recommendation:
Container Recommendations:
Container Name: app
Lower Bound: Cpu: 80m, Memory: 160Mi
Target: Cpu: 150m, Memory: 280Mi
Upper Bound: Cpu: 400m, Memory: 600Mi
Uncapped Target: Cpu: 150m, Memory: 280Mi
This tells you Sonarr actually needs ~150m CPU and ~280Mi RAM — half of what I’d guessed. Apply those numbers to your Helm values or switch the VPA to Initial mode to let it set requests on new pods.
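If you apply the numbers by hand rather than handing control to VPA, kubectl set resources does it in one line (the values below are the targets from the recommendation above; pick your own limits):
# Set requests on the Sonarr deployment to roughly the VPA target
kubectl set resources deployment sonarr -n media \
  --requests=cpu=150m,memory=280Mi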
VPA for the Whole Media Stack
# Apply VPA to every media app
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: plex
namespace: media
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: plex
updatePolicy:
updateMode: "Initial"
resourcePolicy:
containerPolicies:
- containerName: "*"
minAllowed:
cpu: 100m
memory: 256Mi
maxAllowed:
cpu: 4000m
memory: 8Gi
Don’t use Auto mode on Plex or other media apps that hold long-running connections. VPA in Auto mode evicts pods to apply new resources — killing active transcodes or downloads. Use Initial for media workloads: it only sets resources when the pod is first created.
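Once several of these exist, kubectl describe per app gets tedious. A quick way to eyeball mode and current target across the namespace (these are the standard VPA status fields; only the first container is shown, and the columns stay empty until recommendations exist):
kubectl get vpa -n media \
  -o custom-columns='NAME:.metadata.name,MODE:.spec.updatePolicy.updateMode,TARGET_CPU:.status.recommendation.containerRecommendations[0].target.cpu,TARGET_MEM:.status.recommendation.containerRecommendations[0].target.memory'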
Horizontal Pod Autoscaler (HPA)
HPA adds or removes pod replicas based on metrics. Most useful for stateless workloads that can run multiple instances.
When HPA Makes Sense
| App | HPA useful? | Why |
|---|---|---|
| Overseerr | Yes | Stateless frontend, handles more users with more replicas |
| API services | Yes | Horizontal scaling is natural |
| Plex | No | Single instance owns the media library |
| Sonarr/Radarr | No | SQLite DB, not designed for multi-instance |
| qBittorrent | No | Port mappings, single-instance tracker connections |
Example: HPA for Overseerr
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: overseerr
namespace: media
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: overseerr
minReplicas: 1
maxReplicas: 3
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 60
scaleUp:
policies:
- type: Pods
value: 1
periodSeconds: 30
The behavior section prevents flapping — scale down waits 5 minutes before removing a replica, and only removes one at a time. Scale up is faster: one replica every 30 seconds.
Monitor HPA
kubectl get hpa -n media
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# overseerr Deployment/overseerr 15%/70% 1 3 1
kubectl describe hpa overseerr -n media
HPA Prerequisites
HPA only works if pods have resource requests set. Without requests, there’s nothing to calculate a percentage of.
# This MUST exist for HPA to work
resources:
requests:
cpu: 100m
memory: 128Mi
If HPA shows <unknown>/70%, it means requests aren’t set.
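To check what the target deployment actually requests (empty output means no requests are set; container index 0 is assumed):
kubectl get deployment overseerr -n media \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'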
Don’t use HPA and VPA on the same metric for the same pod. If both try to adjust CPU, they’ll fight. The pattern is: VPA manages requests (right-sizing), HPA manages replica count (horizontal scaling). VPA sets the right size per pod; HPA decides how many pods.
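One way to enforce that split is to let VPA manage only memory while the HPA above keeps sole ownership of CPU. A minimal sketch for Overseerr, assuming the HPA from the previous section stays in place (the min/max values are placeholders):
kubectl apply -f - <<EOF
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: overseerr
  namespace: media
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: overseerr
  updatePolicy:
    updateMode: "Initial"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        # Memory only -- CPU stays under the HPA's control
        controlledResources: ["memory"]
        minAllowed:
          memory: 128Mi
        maxAllowed:
          memory: 1Gi
EOF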
Node Autoscaler
Cloud Kubernetes (EKS, GKE, AKS) has built-in node autoscaling. Bare-metal Proxmox doesn’t. So I built a simple one.
How It Works
A Python service monitors cluster-wide CPU usage. When average CPU exceeds a threshold, it bumps worker_count in terraform.tfvars and runs terraform apply. Talos auto-joins the new node. When CPU drops, it reduces worker_count and Terraform destroys the extra VM.
Monitor (every 60s)
↓
kubectl top nodes → average CPU%
↓
CPU > 80%? → Scale up (worker_count + 1, terraform apply)
CPU < 30%? → Scale down (worker_count - 1, terraform apply)
↓
Cooldown (5 minutes between operations)
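The actual service is the Python implementation in addons/node-autoscaler. As a rough mental model, the decision loop boils down to something like this shell sketch (the paths, tfvars parsing, and hard-coded thresholds here are illustrative, not the real code):
#!/usr/bin/env bash
# Illustrative sketch of the scaling loop -- the real autoscaler is a Python service.
set -euo pipefail
REPO="$HOME/Repos/k8s-deploy"            # assumed repo location
TFVARS="$REPO/terraform.tfvars"

while true; do
  # Cluster-wide average CPU% as reported by the metrics server
  avg=$(kubectl top nodes --no-headers | awk '{gsub(/%/,"",$3); s+=$3; n++} END {printf "%d", s/n}')
  workers=$(awk -F'= *' '/^worker_count/ {print $2}' "$TFVARS")

  if [ "$avg" -gt 80 ] && [ "$workers" -lt 6 ]; then
    sed -i "s/^worker_count.*/worker_count = $((workers + 1))/" "$TFVARS"
    terraform -chdir="$REPO" apply -auto-approve
    sleep 300                            # cooldown after scaling
  elif [ "$avg" -lt 30 ] && [ "$workers" -gt 2 ]; then
    sed -i "s/^worker_count.*/worker_count = $((workers - 1))/" "$TFVARS"
    terraform -chdir="$REPO" apply -auto-approve
    sleep 300                            # cooldown after scaling
  fi
  sleep 60                               # CHECK_INTERVAL
done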
Configuration
# k8s-deploy/addons/node-autoscaler/autoscaler.env
SCALE_UP_THRESHOLD=80
SCALE_DOWN_THRESHOLD=30
MIN_WORKERS=2
MAX_WORKERS=6
CHECK_INTERVAL=60
COOLDOWN_SECONDS=300
| Setting | Value | Why |
|---|---|---|
| MIN_WORKERS=2 | Floor | Workloads need at least 2 nodes for scheduling redundancy |
| MAX_WORKERS=6 | Ceiling | Proxmox host has 64GB RAM; 6 workers × 16GB = 96GB allocated, which already overcommits the host, so 6 is the hard ceiling |
| SCALE_UP_THRESHOLD=80 | Trigger | 80% average CPU means pods are contending |
| SCALE_DOWN_THRESHOLD=30 | Trigger | 30% means wasted capacity |
| COOLDOWN_SECONDS=300 | Guard | Prevents rapid oscillation |
Deploy as Systemd Service
cd ~/Repos/k8s-deploy/addons/node-autoscaler
# Install dependencies
python3 -m venv venv
venv/bin/pip install -r requirements.txt
# Install service
sudo cp autoscaler.service /etc/systemd/system/k8s-node-autoscaler.service
sudo systemctl daemon-reload
sudo systemctl enable --now k8s-node-autoscaler
Monitor it:
sudo systemctl status k8s-node-autoscaler
sudo journalctl -u k8s-node-autoscaler -f
Test Scale-Up
Simulate load to verify the autoscaler responds:
# Deploy a CPU stress test
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
name: cpu-stress
spec:
containers:
- name: stress
image: polinux/stress
command: ["stress"]
args: ["--cpu", "4", "--timeout", "600s"]
resources:
requests:
cpu: "2000m"
EOF
# Watch autoscaler respond
sudo journalctl -u k8s-node-autoscaler -f
# After ~60s: "Average CPU: 85%. Scaling up from 2 to 3 workers"
# Terraform runs, new VM appears, Talos joins it
# Clean up
kubectl delete pod cpu-stress
Test Scale-Down
After the stress test, wait for cooldown (5 minutes), then watch:
# Autoscaler should detect low CPU and scale down
sudo journalctl -u k8s-node-autoscaler -f
# "Average CPU: 22%. Scaling down from 3 to 2 workers"
The node autoscaler runs terraform apply -auto-approve. This is intentional for automation, but it means Terraform state must be consistent. If someone is manually editing terraform.tfvars while the autoscaler runs, you’ll get conflicts. Don’t manually edit worker_count while the autoscaler is active.
Putting It All Together
The recommended setup for a homelab media cluster:
| Component | Scope | Mode |
|---|---|---|
| Metrics Server | Cluster-wide | Always on |
| VPA | All media apps | Off for first week, then Initial |
| HPA | Stateless apps only (Overseerr, Homepage) | 1-3 replicas |
| Node autoscaler | Whole cluster | 2-6 workers, 80/30 thresholds |
Deployment Order
- Metrics Server — Everything depends on this
- VPA — Start in Off mode, observe recommendations
- HPA — Only on apps that support multi-instance
- Node autoscaler — Optional, useful if load varies significantly
# 1. Metrics
kubectl apply -f addons/metrics-server/metrics-server.yaml
kubectl wait --for=condition=ready pod -l k8s-app=metrics-server -n kube-system --timeout=120s
# 2. VPA
kubectl apply -f addons/vpa/vpa.yaml
# 3. Verify
kubectl top nodes
kubectl top pods -A
kubectl get vpa -A
Common Issues
| Symptom | Cause | Fix |
|---|---|---|
| kubectl top shows <unknown> | Metrics server not running | Check kubectl get pods -n kube-system -l k8s-app=metrics-server |
| HPA shows <unknown>/70% | Pod has no resource requests | Add resources.requests to the Deployment |
| VPA recommendations seem wrong | Not enough data | Wait a week. VPA needs usage history to make good recommendations |
| VPA + HPA fighting | Both managing CPU | Use VPA for requests, HPA for replicas. Don’t overlap metrics |
| Node autoscaler flapping | Thresholds too close | Widen the gap: 80% up / 30% down, not 70% up / 50% down |
| Scale-down kills active workloads | Pods not draining | Node autoscaler should drain before removing (see the drain commands below). Check logs |
| Terraform conflict | Manual tfvars edit during autoscale | Stop autoscaler before manual changes: sudo systemctl stop k8s-node-autoscaler |
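For the scale-down row above: removing a worker should look like a normal node retirement, cordon and drain first, Terraform destroy second. If you need to do it by hand, or want to verify what the autoscaler's logs claim it did, the commands are standard (the node name is an example):
kubectl cordon talos-w-3
kubectl drain talos-w-3 --ignore-daemonsets --delete-emptydir-data --timeout=120s
# Only after the drain completes should worker_count drop and terraform apply run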
What I Learned
1. VPA in Off Mode Is Worth More Than You’d Think
I deployed VPA in Off mode “just to see what it says.” A week later, the recommendations changed how I thought about resource allocation. Plex: I gave it 2 CPU / 4 Gi. VPA said it actually used 0.4 CPU / 1.2 Gi — except during transcodes, when it spiked to 3 CPU. Sonarr: I gave it 512Mi. VPA said 180Mi. Radarr: I gave it 512Mi. VPA said 200Mi.
I was over-provisioning everything by 2-3x. On a 2-worker cluster with 32 GB total RAM, that’s the difference between “can’t schedule anything else” and “room for 5 more apps.” Even if you never enable Auto mode, the recommendations alone justify deploying VPA.
2. HPA Doesn’t Make Sense for Most Homelab Apps
I tried HPA on Sonarr. Replicas scaled to 2 during a mass import. Both replicas wrote to the same SQLite database. Corruption. Lost the database, restored from Velero, spent an hour re-importing custom formats.
Most *arr apps and media tools are fundamentally single-instance. They use local databases, write to shared filesystems with single-writer assumptions, and maintain state in memory. HPA is for stateless services — web frontends, APIs, proxies. For a homelab, that’s maybe Overseerr and Homepage. Everything else should stay at 1 replica with VPA for right-sizing.
3. Node Autoscaling Has a Startup Tax
A new Proxmox VM takes about 2 minutes to boot, get a Talos config, start kubelet, and register with the API server. Add 1-2 minutes for pods to schedule and start on the new node. Total: 3-4 minutes from scale-up trigger to usable capacity.
For a homelab, this is fine — media workloads aren’t latency-critical. But it means the autoscaler is reactive, not predictive. If you know you’ll need extra capacity (movie night with the family, batch import), scale up manually before the load hits:
cd ~/Repos/k8s-deploy
# Edit worker_count = 4 in terraform.tfvars
terraform apply
4. Cooldowns Prevent Expensive Oscillation
Without the 5-minute cooldown, the autoscaler would: scale up (terraform apply, 3 minutes), measure CPU (now low because new node), scale down (terraform apply, 3 minutes), measure CPU (now high because node removed), scale up again. Each cycle costs 6 minutes of Terraform applies and Proxmox VM operations.
The 300-second cooldown is the minimum that prevents this. I initially set it to 60 seconds and watched it add and remove a node three times in 20 minutes. The Proxmox host was not amused.
What’s Next
- Talos Day-2 Operations - Node management, etcd, troubleshooting
- Upgrade Talos and Kubernetes - Rolling upgrade procedures
- Deploy the cluster - Initial Terraform setup
- Velero Backups - Disaster recovery