Talos Linux Kubernetes cluster on Proxmox VE via Terraform. Image factory, VM provisioning, bootstrap, credentials - one terraform apply.

“Just SSH in and fix it manually” - Things you’ll never say again with Talos


Problem

I’ve built Kubernetes clusters three different ways. The first time with kubeadm, I documented nothing and forgot half the steps. The second time with Ansible, the playbooks broke on upgrades. The third time I did it right.

You want a cluster you can destroy and rebuild in 15 minutes. One that doesn’t depend on SSH-ing into nodes and running commands you barely remember.

Solution

Talos Linux + Terraform. Talos is immutable and API-only - no SSH, no shell, no “let me just fix this one thing.” Either it’s in your Terraform config or it doesn’t exist. This sounds limiting until you need to rebuild and realize you can.

Terraform manages the VMs. Talos manages everything inside them. One terraform apply gives you a working cluster.

Full source: k8s-deploy

⚠️ Homelab defaults
This uses a single control plane and flat networking. Fine for a homelab. For production, you need 3+ control planes, separate VLANs, and remote state with locking. I’ll point out the differences.

Repo Structure

├── versions.tf          # Terraform + provider version pins
├── providers.tf         # Proxmox and Talos provider config
├── variables.tf         # Inputs with defaults
├── locals.tf            # Computed IPs, node maps, cluster endpoint
├── image.tf             # Talos image factory + download
├── main.tf              # VM resources (control plane + workers)
├── talos.tf             # Machine config, bootstrap, health check, creds
├── outputs.tf           # kubeconfig, talosconfig, node IPs
├── terraform.tfvars.example
└── generated/           # (git-ignored) kubeconfig + talosconfig

Providers

terraform {
  required_version = ">= 1.5.0"
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = ">= 0.69.0"
    }
    talos = {
      source  = "siderolabs/talos"
      version = ">= 0.7.0"
    }
    local = {
      source  = "hashicorp/local"
      version = ">= 2.0.0"
    }
  }
}

bpg/proxmox manages VMs. siderolabs/talos handles image factory, machine configs, bootstrap, and health checks.

The Proxmox provider needs both API and SSH access. API for VM operations, SSH for uploading the Talos image to Proxmox storage. This confused me at first - why SSH? Because the image upload happens via SCP under the hood.

provider "proxmox" {
  endpoint  = var.proxmox_endpoint
  insecure  = var.proxmox_insecure
  api_token = var.proxmox_api_token

  ssh {
    agent    = true
    username = var.proxmox_ssh_user
  }
}
💡 Tip
proxmox_insecure = true works around self-signed certs. I use it because regenerating certs on my homelab Proxmox every year is annoying. Production needs proper certs.

Variables

Only two are required. Everything else has defaults.

# Required
proxmox_endpoint  = "https://<PROXMOX_IP>:8006"
proxmox_api_token = "<USER>@pam!<TOKEN_NAME>=<TOKEN_SECRET>"
💡 Tip
Create a dedicated API token: Datacenter → Permissions → API Tokens. PVEVMAdmin on /. Don’t use root.

Key defaults:

| Variable           | Default       | Notes                         |
|--------------------|---------------|-------------------------------|
| cluster_name       | "talos"       | Prefixed to VM names          |
| talos_version      | "v1.9.2"      | Pinned for reproducibility    |
| controlplane_count | 1             | Set to 3 + cluster_vip for HA |
| worker_count       | 2             | Scale by changing this        |
| network_cidr       | "10.0.0.0/24" | IPs calculated from offsets   |
| cp_ip_offset       | 70            | First CP gets .70             |
| worker_ip_offset   | 80            | First worker gets .80         |
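
Overriding any of these is just another line in terraform.tfvars. A sketch that renames the cluster and adds a worker while keeping the rest at their defaults (values illustrative):

# terraform.tfvars - illustrative overrides
cluster_name = "homelab"   # VM names become homelab-cp-1, homelab-w-1, ...
worker_count = 3
network_cidr = "10.0.0.0/24"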

How IPs Are Computed

The first time I did this, I hardcoded IPs in every resource. Scaling from 2 to 3 workers meant editing six places. Missed one, spent 20 minutes figuring out why the third worker wouldn’t join.

Use maps. Compute everything once:

locals {
  controlplanes = {
    for i in range(var.controlplane_count) :
    "${var.cluster_name}-cp-${i + 1}" => {
      vm_id = var.vm_id_base + i
      ip    = cidrhost(var.network_cidr, var.cp_ip_offset + i)
    }
  }

  workers = {
    for i in range(var.worker_count) :
    "${var.cluster_name}-w-${i + 1}" => {
      vm_id = var.vm_id_base + 10 + i
      ip    = cidrhost(var.network_cidr, var.worker_ip_offset + i)
    }
  }
}

Change worker_count from 2 to 4? Terraform adds two nodes with correct IPs, VM IDs, and hostnames. One variable change.
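
To make the math concrete, here is what local.workers works out to with the defaults (cluster_name = "talos", vm_id_base = 400, worker_ip_offset = 80, worker_count = 2) - just the formulas above evaluated:

# cidrhost("10.0.0.0/24", 80 + i) hands out sequential addresses:
#
# local.workers = {
#   "talos-w-1" = { vm_id = 410, ip = "10.0.0.80" }
#   "talos-w-2" = { vm_id = 411, ip = "10.0.0.81" }
# }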


Image Factory

The Talos image factory builds custom OS images with the extensions you specify. The config below downloads a nocloud image with qemu-guest-agent baked in:

data "talos_image_factory_extensions_versions" "this" {
  talos_version = var.talos_version
  filters = { names = var.talos_extensions }
}

resource "talos_image_factory_schematic" "this" {
  schematic = yamlencode({
    customization = {
      systemExtensions = {
        officialExtensions = data.talos_image_factory_extensions_versions.this.extensions_info[*].name
      }
    }
  })
}

resource "proxmox_virtual_environment_download_file" "talos" {
  content_type            = "iso"
  datastore_id            = var.image_storage
  node_name               = var.proxmox_node
  file_name               = "talos-${var.talos_version}-${var.cluster_name}.img"
  url                     = "${var.talos_factory_url}/image/${talos_image_factory_schematic.this.id}/${var.talos_version}/nocloud-amd64.raw.gz"
  decompression_algorithm = "gz"
  overwrite               = false
  overwrite_unmanaged     = true
}

Terraform downloads this once. Subsequent applies skip the download if the file exists.

💡 Tip
Start with just qemu-guest-agent. Add other extensions when you need them. I added iscsi-tools when I connected to a SAN, tailscale when I wanted remote access. Don’t preemptively install things you might need someday.
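
The extension list is driven by a plain variable (referenced above as var.talos_extensions). Its definition isn't shown here, but it presumably looks something like this, with the default being my assumption:

variable "talos_extensions" {
  description = "Official Talos system extensions to bake into the image"
  type        = list(string)
  default     = ["qemu-guest-agent"]  # assumption: matches the minimal setup described above
}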

VM Provisioning

Both roles use the same module, differing only in sizing and tags:

module "controlplane" {
  source   = "git::https://github.com/jlambert229/terraform-proxmox.git"
  for_each = local.controlplanes

  vm_id     = each.value.vm_id
  name      = each.key
  node_name = var.proxmox_node
  tags      = [var.cluster_name, "controlplane", "talos"]
  on_boot   = true

  disk_image_id = proxmox_virtual_environment_download_file.talos.id
  disk_size_gb  = var.controlplane_disk_gb
  disk_storage  = var.disk_storage
  boot_order    = ["scsi0"]

  cpu_cores     = var.controlplane_cpu
  memory_mb     = var.controlplane_memory_mb
  agent_enabled = local.has_guest_agent

  network_bridge = var.network_bridge
  vlan_id        = var.vlan_id

  initialize                  = true
  initialization_datastore_id = var.disk_storage
  ip_address                  = "${each.value.ip}/${local.network_prefix}"
  gateway                     = var.gateway
  dns                         = var.nameservers
}

Workers: same module, local.workers, worker sizing.
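
For completeness, the worker block is the same call with worker-sized inputs swapped in. Roughly - the worker-specific variable names are assumed to mirror the control plane ones:

module "worker" {
  source   = "git::https://github.com/jlambert229/terraform-proxmox.git"
  for_each = local.workers

  vm_id     = each.value.vm_id
  name      = each.key
  node_name = var.proxmox_node
  tags      = [var.cluster_name, "worker", "talos"]
  on_boot   = true

  disk_image_id = proxmox_virtual_environment_download_file.talos.id
  disk_size_gb  = var.worker_disk_gb   # assumed to mirror controlplane_disk_gb
  disk_storage  = var.disk_storage
  boot_order    = ["scsi0"]

  cpu_cores     = var.worker_cpu       # assumed to mirror controlplane_cpu
  memory_mb     = var.worker_memory_mb # assumed to mirror controlplane_memory_mb
  agent_enabled = local.has_guest_agent

  # network and initialization arguments identical to the control plane block
  network_bridge              = var.network_bridge
  vlan_id                     = var.vlan_id
  initialize                  = true
  initialization_datastore_id = var.disk_storage
  ip_address                  = "${each.value.ip}/${local.network_prefix}"
  gateway                     = var.gateway
  dns                         = var.nameservers
}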

ℹ️ Info
VM module lives in a separate terraform-proxmox repo for reuse across projects.

Default sizing:

| Role          | CPU | Memory | Disk  | VM ID base |
|---------------|-----|--------|-------|------------|
| Control Plane | 2   | 4 GB   | 20 GB | 400        |
| Worker        | 2   | 4 GB   | 50 GB | 410        |

Cluster Configuration and Bootstrap

After VMs boot, Terraform configures them via the Talos API (port 50000). No SSH.

Secrets

resource "talos_machine_secrets" "this" {
  talos_version = var.talos_version
}

Generates cluster PKI (CA, certs, tokens). Stored in Terraform state.

⚠️ Warning
State contains cluster secrets. Use remote state with encryption for anything beyond a lab.
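
In practice that means a remote backend with encryption and locking. A sketch with placeholder values - this is not part of the repo shown above:

terraform {
  backend "s3" {
    bucket         = "my-tf-state"                     # placeholder bucket name
    key            = "talos-cluster/terraform.tfstate"
    region         = "us-east-1"                       # placeholder region
    encrypt        = true                              # encrypt the state file at rest
    dynamodb_table = "my-tf-locks"                     # placeholder table, enables state locking
  }
}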

Machine Configs

data "talos_machine_configuration" "controlplane" {
  cluster_name     = var.cluster_name
  cluster_endpoint = "https://${local.cluster_endpoint}:6443"
  machine_type     = "controlplane"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
  talos_version    = var.talos_version
}

resource "talos_machine_configuration_apply" "controlplane" {
  depends_on = [module.controlplane]
  for_each   = local.controlplanes

  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.controlplane.machine_configuration
  node                        = each.value.ip
  config_patches = [
    yamlencode({
      machine = {
        network = { hostname = each.key }
        install = { disk = var.install_disk }
      }
    })
  ]
}

The base config is generated from cluster name + role + secrets; per-node patches set the hostname and install disk in a single yamlencode block. Workers are identical except for machine_type = "worker".
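
The worker side follows the same pattern: a worker machine configuration plus an apply keyed off local.workers (the health check below references talos_machine_configuration_apply.worker). A sketch, assuming the worker VMs live in module.worker:

data "talos_machine_configuration" "worker" {
  cluster_name     = var.cluster_name
  cluster_endpoint = "https://${local.cluster_endpoint}:6443"
  machine_type     = "worker"
  machine_secrets  = talos_machine_secrets.this.machine_secrets
  talos_version    = var.talos_version
}

resource "talos_machine_configuration_apply" "worker" {
  depends_on = [module.worker]
  for_each   = local.workers

  client_configuration        = talos_machine_secrets.this.client_configuration
  machine_configuration_input = data.talos_machine_configuration.worker.machine_configuration
  node                        = each.value.ip
  config_patches = [
    yamlencode({
      machine = {
        network = { hostname = each.key }
        install = { disk = var.install_disk }
      }
    })
  ]
}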

Client Configuration

data "talos_client_configuration" "this" {
  cluster_name         = var.cluster_name
  client_configuration = talos_machine_secrets.this.client_configuration
  endpoints            = [for _, cp in local.controlplanes : cp.ip]
}

Generates the talosconfig used by talosctl and referenced by the health check.

Bootstrap

resource "talos_machine_bootstrap" "this" {
  depends_on           = [talos_machine_configuration_apply.controlplane]
  node                 = local.first_cp_ip
  endpoint             = local.first_cp_ip
  client_configuration = talos_machine_secrets.this.client_configuration
}

“Is it done yet?” - Just wait for the health check instead of refreshing frantically

Runs once on the first control plane node. Initializes etcd and the Kubernetes API.

Health Check

data "talos_cluster_health" "this" {
  depends_on = [
    talos_machine_bootstrap.this,
    talos_machine_configuration_apply.controlplane,
    talos_machine_configuration_apply.worker,
  ]

  client_configuration = data.talos_client_configuration.this.client_configuration
  control_plane_nodes  = [for _, cp in local.controlplanes : cp.ip]
  worker_nodes         = [for _, w in local.workers : w.ip]
  endpoints            = data.talos_client_configuration.this.endpoints

  timeouts = { read = "10m" }
}

Blocks until all nodes are healthy or 10 minutes elapse. If a node doesn’t join, apply fails explicitly instead of producing a half-working cluster.


Credentials

“Treat your infrastructure like cattle, not pets. Except for the credentials - guard those like the Crown Jewels.” - Cloud Native Wisdom
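
The kubeconfig content referenced just below comes from a talos_cluster_kubeconfig resource that isn't shown earlier. A minimal sketch of how it's presumably wired:

resource "talos_cluster_kubeconfig" "this" {
  depends_on           = [talos_machine_bootstrap.this]
  client_configuration = talos_machine_secrets.this.client_configuration
  node                 = local.first_cp_ip
}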

resource "local_sensitive_file" "kubeconfig" {
  depends_on = [data.talos_cluster_health.this]

  content         = talos_cluster_kubeconfig.this.kubeconfig_raw
  filename        = "${path.module}/generated/kubeconfig"
  file_permission = "0600"
}

resource "local_sensitive_file" "talosconfig" {
  depends_on = [data.talos_cluster_health.this]

  content         = data.talos_client_configuration.this.talos_config
  filename        = "${path.module}/generated/talosconfig"
  file_permission = "0600"
}

Written to generated/ (0600, git-ignored). Or pull from outputs:

terraform output -raw kubeconfig > ~/.kube/talos.kubeconfig
terraform output -raw talosconfig > ~/.talos/config
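
Those outputs live in outputs.tf; they presumably look something like this (sensitive = true is needed because the underlying values are marked sensitive):

output "kubeconfig" {
  value     = talos_cluster_kubeconfig.this.kubeconfig_raw
  sensitive = true
}

output "talosconfig" {
  value     = data.talos_client_configuration.this.talos_config
  sensitive = true
}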

Usage

Deploy

cp terraform.tfvars.example terraform.tfvars
# Set proxmox_endpoint and proxmox_api_token

terraform init
terraform plan
terraform apply

The apply sequence:

  1. Build Talos schematic, download image to Proxmox
  2. Create control plane + worker VMs
  3. Generate secrets, apply machine configs via Talos API
  4. Bootstrap etcd + Kubernetes API on first control plane
  5. Wait for all nodes healthy (up to 10m)
  6. Write kubeconfig + talosconfig to generated/

Verify

export KUBECONFIG=generated/kubeconfig
kubectl get nodes

export TALOSCONFIG=generated/talosconfig
talosctl health
💡 Tip

talosctl dashboard gives you a real-time TUI showing CPU, memory, network, and service status across all nodes. It’s the fastest way to confirm the cluster is healthy after a fresh deploy or upgrade:

talosctl dashboard --nodes <CP_IP>,<W1_IP>,<W2_IP>

For a single node deep-dive, talosctl dmesg -n <NODE_IP> --follow streams kernel logs in real time - useful for debugging boot issues or hardware problems.

Scale Workers

worker_count = 4

terraform apply. New VMs created, configured, joined. Existing nodes untouched.

HA Control Plane

controlplane_count = 3
cluster_vip        = "10.0.0.69"

The VIP floats between control plane nodes. API stays reachable if a node goes down.
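
In Talos the VIP is handled by the control plane nodes themselves; in config terms it's just one more patch on the control plane machine config. A sketch of that patch, assuming the first NIC inside the VM is eth0:

# Extra entry for talos_machine_configuration_apply.controlplane's
# config_patches when cluster_vip is set (sketch only):
yamlencode({
  machine = {
    network = {
      interfaces = [{
        interface = "eth0"                   # assumption: NIC name inside the VM
        dhcp      = false
        vip       = { ip = var.cluster_vip } # floating control plane address
      }]
    }
  }
})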

Upgrade Talos

terraform output talos_schematic_id

talosctl upgrade \
  --image factory.talos.dev/installer/<SCHEMATIC_ID>:<NEW_VERSION> \
  --preserve \
  --nodes <NODE_IP>

Schematic ID preserves your extensions. --preserve keeps the data partition. Upgrade workers first, control plane last.

Tear Down

terraform destroy

Common Issues

| Symptom                          | Cause                                  | Fix                                                          |
|----------------------------------|----------------------------------------|--------------------------------------------------------------|
| apply hangs on machine config    | VM didn't boot or wrong IP             | Check Proxmox console, verify network/VLAN                   |
| Health check times out           | Node can't reach API server            | Check firewall rules between nodes, verify cluster_endpoint  |
| connection refused on port 50000 | Talos not ready yet                    | Wait. First boot takes 1-2 minutes. Re-run apply.            |
| Image download fails             | Proxmox can't reach factory.talos.dev  | Check DNS and outbound HTTPS on Proxmox node                 |

What’s Next

You have a working Kubernetes cluster. It's API-managed end to end, so everything you add next stays declarative.

