
I Built a 3‑VPS HA K3s Cluster With Traefik, cert‑manager, and Longhorn — Here’s the Playbook

The moment I knew it was time to grow up my cluster

It started with a very quiet Tuesday. One of those days when everything feels calm… until it doesn’t. I was sipping coffee, poking at some logs, when a simple kernel update rebooted my single VPS and my tiny Kubernetes playground vanished for eight long minutes. No alarms, no pagers, just the slow-motion realization that my “good enough” setup wasn’t actually good enough. Clients noticed. I noticed. And that was the day I promised myself I’d stop treating production like a side project.

If you’re here, chances are you’ve had that moment too. Maybe your app is getting traction. Maybe you’ve got a couple of microservices, a database that shouldn’t disappear, and users who expect your domain to behave like a grown-up. So let’s build a grown-up platform—without the drama. I’m going to walk you through a production-ready, three-VPS, high-availability K3s cluster with Traefik as ingress, cert‑manager for automated TLS, and Longhorn for persistent storage. We’ll talk architecture, installation, real-world gotchas, and the calm way to run the thing day to day.

By the end, you’ll have a clear mental model, a practical plan, and the confidence to ship on a cluster that doesn’t blink just because a single VM sneezed.

Why three VPS nodes make everything feel calmer

Think of a three-VPS cluster like a three-legged stool. Two legs can wobble. Four legs are nice, but sometimes you don’t have room. Three legs? You can sit down and exhale. That’s quorum—two nodes can disagree, but the third breaks ties and keeps the cluster consistent. In K3s land, that “brain” is etcd. When we run K3s in HA mode with embedded etcd, each node carries a piece of the truth. Lose one node? You can still write to the cluster, deploy workloads, renew certificates, the whole show.

Here’s the mental picture that clicked for me: one public domain pointing to a stable entrypoint (we’ll talk about how to make that truly stable), three small-but-capable VPS instances (2–4 vCPU and 4–8 GB RAM each is a nice starting point), and a cluster that knows how to keep going if any single box goes dark. K3s gives you the lightweight control plane. Traefik takes incoming traffic and routes it politely. cert‑manager keeps the locks on the doors with auto-renewing certificates. Longhorn spreads your persistent volumes across nodes, so a single outage doesn’t take your data with it.

It’s also worth mentioning that this stack doesn’t demand a hyperscaler. I’ve run it on modest providers and it just hums. The trick is to be intentional about networking, storage prerequisites, and small-but-important guardrails like PodDisruptionBudgets and node taints. We’ll cover those as we go.

The plan: clean base, simple network, tight doors

Before we type a single install command, a few groundwork pieces make life much easier. In my experience, there are three that matter most: a clean OS baseline, a simple private network between nodes, and a firewall stance that defaults to “nope” from the internet and “everything you need” on your private mesh.

On the OS front, I like to start from a minimal image and harden it gently. Nothing fancy, just the basics: patching, SSH keys only, a non-root user with sudo, and a handful of sensible defaults. If you want a calm, practical walkthrough that pairs nicely with what we’re building, I wrote about this in The Calm, No‑Drama Guide: How to Secure a VPS Server. Grab that mindset and keep it handy.

For networking, the easiest path is giving your three nodes a private way to talk—either the provider’s internal network or a tiny WireGuard mesh you control. K3s uses an internal CNI (Flannel by default) to route pod traffic, but you still want node-to-node transport that’s reliable. I tend to allow “any-to-any” on the private interface and lock down the public interface to just what Traefik, SSH, and the K3s API need (usually 80/443 for HTTP/HTTPS, 22 for SSH, and 6443 if you’ll be managing the cluster remotely). Longhorn replicates blocks between nodes, so that private path keeps the heavy lifting off your public NICs and out of sight.
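
As a rough sketch, here’s what that stance can look like with ufw (the 10.0.0.0/24 private range is a placeholder for whatever your provider or WireGuard mesh hands you):

sudo ufw default deny incoming
sudo ufw default allow outgoing
# Public side: SSH, HTTP/HTTPS for Traefik, and the K3s API if you manage the cluster remotely
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw allow 6443/tcp
# Private side: let the three nodes talk freely (CNI traffic, etcd, Longhorn replication)
sudo ufw allow from 10.0.0.0/24
sudo ufw enable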

One more thing about the entrypoint. On three VPS instances, you might not have a cloud Load Balancer. That’s okay. I’ve had good results with a small floating IP managed by keepalived, or simply pointing DNS at a single “active” node with health-checked failover that flips quickly if it goes down. If you want to sleep particularly well during migrations, I shared how I run resilient DNS in How I Run Multi‑Provider DNS with octoDNS. That approach plays beautifully with a K3s ingress endpoint.

K3s in HA mode with embedded etcd (the easy-on-the-brain setup)

Now for the fun part: making the cluster. The K3s team has an excellent guide for this setup—if you like reading the source, check out the official K3s HA with embedded etcd guide. The flow is surprisingly simple. You’ll install the first node as a server with embedded etcd, grab a token, and bring up the other two nodes as peers.

On the first node, something like this gets you going:

curl -sfL https://get.k3s.io | sh -s - \
  server \
  --cluster-init \
  --write-kubeconfig-mode 644 \
  --disable servicelb \
  --disable local-storage

Why disable those two? K3s bundles a tiny ServiceLB for LoadBalancer Services and a local-path storage provisioner. For a production-ish cluster, I’d rather handle the entrypoint myself (the floating IP or DNS-failover approach we’ll get to) and let Longhorn own storage, so I turn the bundled bits off. After the first node stabilizes, get your cluster join token:

sudo cat /var/lib/rancher/k3s/server/node-token

On the second and third nodes, join as servers (peers) so you have three control-plane nodes sharing etcd:

curl -sfL https://get.k3s.io | K3S_URL=https://<FIRST-NODE-IP>:6443 \
  K3S_TOKEN=<THE-TOKEN-YOU-JUST-COPIED> \
  sh -s - server \
  --write-kubeconfig-mode 644 \
  --disable servicelb \
  --disable local-storage

Give it a minute or two and check from any node:

sudo k3s kubectl get nodes -o wide

You should see three “Ready” nodes, each marked as control-plane. Unlike kubeadm-style clusters, K3s doesn’t taint its server nodes by default, so workloads can schedule onto them right away. For a tidier production feel, I like to keep system components and light workloads there and push heavier apps onto explicit worker nodes when I have them. With three VPS nodes only, you can absolutely run your apps on these nodes—just set resource requests and PodDisruptionBudgets thoughtfully so upgrades don’t juggle everything at once.

One last thing: snapshots. K3s can periodically snapshot etcd for you. If you keep snapshots local and copy them to object storage, you’ve suddenly got a realistic recovery path. It’s not glamorous, but it’s the difference between an “oops” and a rebuild.
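
If you want the schedule and the off-box copies handled for you, the relevant K3s settings can live in the server config file. A sketch, assuming an S3-compatible bucket you already have (the values are placeholders; double-check the option names against the K3s docs for your version):

# /etc/rancher/k3s/config.yaml on each server node
etcd-snapshot-schedule-cron: "0 */6 * * *"
etcd-snapshot-retention: 12
etcd-s3: true
etcd-s3-endpoint: s3.example.com
etcd-s3-bucket: my-k3s-snapshots
etcd-s3-access-key: <ACCESS-KEY>
etcd-s3-secret-key: <SECRET-KEY>

For an on-demand copy before anything risky, sudo k3s etcd-snapshot save --name pre-upgrade does the trick.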

Traefik as your front door (without overthinking it)

K3s often includes Traefik by default, but depending on your version and flags, you might be installing it yourself. Either way, I like Traefik because it’s easy to reason about. It handles HTTP/HTTPS, it plays nicely with standard Kubernetes Ingress objects, and it respects annotations for most things you’d want—timeouts, headers, middlewares—without a lot of YAML yoga.

The philosophy here is to keep Ingress definitions boring and predictable. Something like this (as an example) is clean:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello
  namespace: default
spec:
  ingressClassName: traefik
  rules:
  - host: hello.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hello-svc
            port:
              number: 80
  tls:
  - hosts:
    - hello.example.com
    secretName: hello-tls

We’ll let cert‑manager manage the TLS secret, and Traefik will serve it. For environments where you need to route multiple hosts, apply HSTS, or sneak in basic auth for an internal tool, Traefik’s middleware stack is straightforward. Start simple, then iterate.
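
To make that concrete, here’s a sketch of a Traefik middleware that applies HSTS (the apiVersion is traefik.io/v1alpha1 on recent Traefik releases; older bundled versions use traefik.containo.us/v1alpha1, so check which CRDs your cluster actually has):

apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: hsts
  namespace: default
spec:
  headers:
    stsSeconds: 31536000
    stsIncludeSubdomains: true

Then reference it from the Ingress with the annotation traefik.ingress.kubernetes.io/router.middlewares: default-hsts@kubernetescrd and Traefik stitches it into the route.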

What about the external IP? On three VPS instances without a managed LB, I’ve used a few strategies: a floating IP with keepalived that follows a healthy node; DNS failover that points to whichever node is currently “active”; or, if your provider allows it, a small, single-node MetalLB IP pool. The first two options are usually plenty, and they keep your design compact.

Certificates that renew themselves (cert‑manager is worth the small learning curve)

If you’ve ever renewed a certificate at 2 a.m., cert‑manager will feel like magic. It watches your Ingress hosts and renews secrets automatically using ACME. The most flexible approach is DNS‑01, which lets you issue wildcards like *.example.com without worrying about HTTP challenges. The official docs are clear and approachable—bookmark the cert‑manager installation guide.

The flow I use is: install cert‑manager via Helm or YAML, create a ClusterIssuer with DNS credentials to your DNS provider, and annotate Ingress resources to request certificates. If you’re running a SaaS or multi-tenant system and want to scale auto-SSL across customer domains, I wrote a friendly deep dive in Bring Your Own Domain, Get Auto‑SSL: DNS‑01 ACME. The same principles apply beautifully in Kubernetes.
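
For the install itself, the Helm path is the one I reach for. A sketch (the token is a placeholder, and newer chart versions prefer crds.enabled=true over installCRDs):

helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
  --namespace cert-manager --create-namespace \
  --set installCRDs=true

# The ClusterIssuer below expects a scoped Cloudflare API token in this secret
kubectl -n cert-manager create secret generic cloudflare-api-token \
  --from-literal=token=<YOUR-CLOUDFLARE-API-TOKEN>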

A minimal ClusterIssuer for DNS‑01 might look like this (Cloudflare as an example):

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns
spec:
  acme:
    email: [email protected]
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: acme-account-key
    solvers:
    - dns01:
        cloudflare:
          apiTokenSecretRef:
            name: cloudflare-api-token
            key: token

Then you annotate your Ingress or specify tls.secretName and a Certificate resource, and cert‑manager takes it from there. If you want to go deeper on operational resilience—fallback CAs, rate-limit strategies, and how CAA records interact with ACME automation—pair this with understanding multi-CA approaches and DNS discipline. My notes on running robust, real‑world DNS and ACME setups in that earlier article will save you hours when traffic grows.
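
If you prefer the explicit route over annotations, a wildcard Certificate is a small amount of YAML. A sketch against the ClusterIssuer above (the names and the example.com domain are placeholders):

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-example-com
  namespace: default
spec:
  secretName: wildcard-example-com-tls
  issuerRef:
    name: letsencrypt-dns
    kind: ClusterIssuer
  dnsNames:
  - "example.com"
  - "*.example.com"

Point tls.secretName in your Ingress at wildcard-example-com-tls and every host under that domain rides the same certificate.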

State that survives reboots (Longhorn, the surprisingly friendly storage layer)

When I first adopted Longhorn, I expected grief. Distributed block storage always sounds like a headache. But here’s the thing: for a three-node K3s cluster, Longhorn is the right kind of boring. It handles replica scheduling, rebuilds after failures, gives you a simple UI for visibility, and integrates with PersistentVolumeClaims like any good CSI driver should. Start with the Longhorn documentation for install steps and prerequisites.

There are a few must-dos that make Longhorn hum. First, install open-iscsi on each node and make sure it starts at boot. Longhorn uses iSCSI for attaching volumes to pods. Second, give the nodes enough local disk—SSD if you can swing it—because replica writes still hit local storage before being replicated. Third, set a replica count of two for most workloads in a three-node cluster. It’s the sweet spot between resilience and resource usage. Losing a node still leaves you with two replicas to continue serving data and a rebuild path when the node returns.
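
Here’s what that prep plus the install looks like in practice, assuming Debian or Ubuntu nodes and a Helm-based install (the Longhorn docs list other paths too):

# On every node
sudo apt-get install -y open-iscsi
sudo systemctl enable --now iscsid

# From wherever you run kubectl and helm
helm repo add longhorn https://charts.longhorn.io
helm repo update
helm install longhorn longhorn/longhorn \
  --namespace longhorn-system --create-namespace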

Here’s a simple StorageClass you can use as a default once Longhorn is installed:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-default
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: Immediate

With that in place, you can declare PersistentVolumeClaims in your apps and not think about it too hard. Longhorn will place replicas on different nodes and figure out attachment automatically. There’s a dashboard too. I don’t live there, but when something smells off (like a node with a flaky disk), it’s nice to have a human-friendly view.
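
A claim against that class is as boring as it should be. Something like this works (the name and size are placeholders), and since the class is the default you could drop storageClassName entirely:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
  namespace: default
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: longhorn-default
  resources:
    requests:
      storage: 5Gi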

One more pro-tip: snapshots and backups. Longhorn can snapshot locally and back up to S3-compatible storage. If your database matters even a little, set that up early. I’ve had a few “whew” moments thanks to those backups when a schema migration went sideways during a late-night deploy.
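
Once you’ve pointed Longhorn’s backup target at an S3-compatible bucket (a setting in the UI or in the Helm values), a recurring job can do the remembering for you. A sketch with the schedule and retention as placeholders; double-check the field names against the Longhorn docs for your version:

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"
  task: "backup"
  groups:
  - default
  retain: 7
  concurrency: 2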

Exposing the cluster cleanly: DNS, IPv6, and a steady entrypoint

Let’s talk addresses. You’ll likely have a single hostname fronting your cluster, with Traefik presenting TLS there and routing traffic internally. Make that hostname a first-class citizen in your DNS. If your provider offers a floating IP, use keepalived to move it between nodes when you need to. If not, lean on DNS health checks to flip quickly when a node goes silent.

Also, don’t sleep on IPv6. Many providers now give each VPS a v6 address for free, and users on modern networks will reach you that way. I wrote the story of how I made v6 adoption painless in The Calm Sprint to IPv6. The gist: enable it where you can, terminate TLS properly, and let dual-stack work for you rather than against you.

For a cluster entrypoint pattern I like: an A record to your active node’s IPv4 and an AAAA record to its IPv6, DNS health checks that switch to a backup node if needed, and Traefik listening on 80/443 with redirects to HTTPS. It’s not fancy. It’s not fragile. It just works. And when combined with multi‑provider DNS that you can migrate without breaking a sweat, you get resilience that scales with you.

Day 2 reality: health, upgrades, and the quiet guardrails that save you

This is the part most guides skip, but it’s where clusters either feel gentle or chaotic. Start by giving Kubernetes the hints it needs to treat your apps kindly. Resource requests keep the scheduler honest. Liveness and readiness probes tell it when to stop sending traffic. PodDisruptionBudgets ensure rollouts and node drains don’t take all replicas down at once. I usually start with a PDB that allows one replica to be down and set deployment replicas to at least two. That alone prevents a whole category of self-inflicted outages.
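
Here’s what that conservative PDB looks like for the hello app from earlier (the label is whatever your Deployment actually uses):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hello-pdb
  namespace: default
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: hello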

Upgrades are straightforward once you respect the rhythm. For K3s, drain one node at a time, upgrade, uncordon, watch it rejoin, then move on. The embedded etcd spreads the risk; you’re never touching quorum by upgrading a single node. Traefik rolls cleanly. cert‑manager barely notices. Longhorn will detach and reattach volumes as needed, though on very write-heavy workloads I like to pause briefly during the switchover.
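
The rhythm per node is short enough to keep in your head (the node name is a placeholder, and older kubectl versions spell the last flag --delete-local-data):

kubectl drain <NODE-NAME> --ignore-daemonsets --delete-emptydir-data
# patch the OS or rerun the K3s install script on that node, reboot if needed
kubectl uncordon <NODE-NAME>
kubectl get nodes -o wide   # wait for Ready before touching the next node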

Monitoring and logs don’t have to be a project either. Even a basic Prometheus + Grafana stack and a simple log pipeline give you eyes. Watch node pressure (CPU, memory, disk), watch Traefik 5xxs, and keep an eye on Longhorn’s replica health. That’s 90% of the “is it happy?” question answered.
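
If you want the quick version of that stack, the community Helm chart bundles Prometheus, Grafana, and sensible default dashboards. A sketch (the release and namespace names are up to you):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace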

Networking notes you’ll thank yourself for later

Flannel’s default VXLAN works fine for most three-node clusters. If you crave advanced policy or eBPF toys, you can explore other CNIs, but don’t feel pressured. What matters is making sure node-to-node traffic is unhindered on your private network. The ports change depending on the CNI; the simplest approach is allowing all on that private interface and guarding the public side tightly. If you’ll connect kubectl from your laptop, expose the API at 6443 on your entrypoint and restrict it with security groups or your firewall to your IP ranges.

One more gentle nudge: tune your basics. A deeper TCP backlog, sensible TIME_WAIT and reuse settings, and a comfortable file descriptor limit prevent head-scratching under load. If you’re curious how to keep that tuning pragmatic and safe, you might enjoy my notes in The Calm Guide to Linux TCP Tuning for High‑Traffic Apps. Those little tweaks often matter more than exotic Kubernetes flags.
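
As a starting point rather than gospel, the handful of knobs I usually touch fit in one sysctl drop-in (the values are placeholders; test under your own load):

# /etc/sysctl.d/99-k3s-tuning.conf
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_tw_reuse = 1
fs.file-max = 1048576

Apply it with sudo sysctl --system, and pair it with a raised nofile limit in systemd or limits.conf for anything that holds a lot of connections.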

Common gotchas (and how I learned to dodge them)

I’ll never forget the first time Longhorn refused to attach a volume because open-iscsi wasn’t running on one node. It felt mysterious until I remembered the prerequisite. Double-check that service is enabled and healthy on every VPS. Another classic: cert‑manager stuck waiting on a DNS challenge because I mis-scoped an API token. The fix was simply giving the token permission to edit TXT records in the right zone. When in doubt, watch the cert‑manager and challenge logs; they’re chatty in a helpful way.
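
When either of those bites, a couple of commands usually point straight at the culprit (these assume cert‑manager lives in its default namespace):

# Is the iSCSI daemon actually running on this node?
systemctl is-active iscsid

# What is cert-manager stuck on?
kubectl get challenges -A
kubectl -n cert-manager logs deploy/cert-manager --tail=100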

Traefik timeouts can also bite. If a service sits behind a slow upstream (say, a database query that sometimes spikes), it’s okay to bump your timeout annotations a bit. Just don’t hide real performance issues behind huge timeouts. Keep an eye on your upstream services and let autoscaling or better queries do the heavy lifting.

And the one we all step on once: draining a node without a PodDisruptionBudget on your single-replica stateful app. Kubernetes will do exactly what you asked—evict the only pod—and your users will watch an hourglass. Make your future self proud and add a PDB early, even if it’s conservative. It pays you back the first time you patch a kernel without holding your breath.

Your step-by-step, no-drama install checklist

Here’s how I tee this up in practice, keeping it simple and repeatable:

First, prep the VPSs: patch, set SSH keys, create a non-root user, and lock down inbound firewall rules. Give each node a private interface or a WireGuard mesh so they can talk freely out of the public eye. If this part makes you nervous, lean on the mindset in my calm VPS security guide and you’ll be fine.

Second, install K3s in HA mode as we covered, one server with cluster-init and two more joining as servers using the token. Confirm the three nodes are Ready. While you’re here, set up etcd snapshots to a safe location; even a daily copy to object storage replaces fear with confidence.

Third, install Traefik if you’re not using the bundled one, and make sure your DNS or floating IP points at your active node. Test a simple Ingress for a hello-world service over HTTP first. Then bring in cert‑manager, create your ClusterIssuer for DNS‑01, and flip the Ingress to TLS. Watch cert‑manager work; that first automatic certificate is a small victory every time.

Fourth, install Longhorn and its prerequisites, set the StorageClass default, and deploy a small StatefulSet that writes a bit of data. Move pods around by draining a node and verify Longhorn reattaches volumes where you expect. That hands-on test removes a lot of stress when you do it later under pressure.

If you like to keep the official references close, bookmark the K3s HA install guide, the cert‑manager docs, and the Longhorn docs. They’re short, friendly, and honest about edge cases.

Tuning the last mile: readiness, topology, and small luxuries

There are a few finishing touches that make your cluster feel smooth. Use readiness probes that reflect actual readiness—if an app needs a warm cache, test the endpoint that proves it. Add topology spread constraints so replicas don’t pile onto a single node after a drain. And give yourself a couple of luxuries: a small maintenance page in Traefik you can toggle during major changes, and a canary flavor of your app for safe testing behind an alternate host.
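
Pulling those touches together, here’s a sketch of how they land in a Deployment (the hello app and the nginx image are stand-ins for your own workload):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello
  namespace: default
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello
  template:
    metadata:
      labels:
        app: hello
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: hello
      containers:
      - name: hello
        image: nginx:stable
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
        resources:
          requests:
            cpu: 100m
            memory: 128Mi

Pair it with the PDB from earlier and a node drain will always leave one replica serving.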

On small clusters, it’s also worth setting gentle resource requests for system components so application pods don’t crowd them out under load. K3s is light, but kube-proxy, CoreDNS, Traefik, cert‑manager, and Longhorn all deserve predictable CPU and RAM. That predictability shows up as stability when Friday traffic rolls in.

What about growth? When three nodes aren’t enough anymore

Here’s the funny part: a well-tuned three-node K3s cluster can carry more than you’d expect. When it’s time to grow, you’ve got choices. You can add worker nodes to offload heavy workloads while keeping your three-node control plane steady. You can scale storage capacity by adding nodes with bigger disks and letting Longhorn rebalance. Or you can split concerns—run databases on managed platforms and use the cluster for stateless services. The point is, this foundation doesn’t paint you into a corner. It gives you optionality without forcing a migration on a deadline.

If you adopt more hostnames and customer-facing domains, the same ACME + DNS‑01 ideas scale cleanly. That playbook is the core of what I shared in Bring Your Own Domain, Get Auto‑SSL: DNS‑01 ACME, and it’s exactly how I keep certificate automation boring even as domains multiply.

A few words on confidence and calm operations

One of my clients once told me, “I don’t want the fanciest cluster. I want the cluster I forget about.” That stuck with me. The stack we’ve walked through—K3s HA, Traefik, cert‑manager, Longhorn—aims at that feeling. It’s minimal on moving parts, friendly to debug, and forgiving when a single VPS has a bad day. You don’t get everything you’d get from a hyperscale platform, but you get something arguably more valuable: a setup you can hold in your head and run with a small team, on a sensible budget, without drama.

Over time, you’ll add your own touches. Maybe you’ll toss in a GitOps flow for manifests. Maybe you’ll teach Traefik a few more tricks. Maybe you’ll refine your DNS approach with health checks and, if you’re curious about the deeper DNS rabbit hole, strategies like multi-provider authority and stable migration techniques from my octoDNS guide. None of that is required on day one. The beauty is you can grow into it, one calm improvement at a time.

Wrap-up: your 3‑VPS HA K3s cluster, quietly dependable

So there you have it: a three-VPS K3s cluster that doesn’t flinch when a box reboots, accepts traffic gracefully through Traefik, learns certificates automatically with cert‑manager, and keeps your data safe with Longhorn. The pieces play well together. They don’t ask for heroics. And when something does go wrong, the failure modes are understandable—fixable in minutes, not hours.

If you take one thing with you, let it be this: keep the design simple and consistent. Use DNS and a steady entrypoint instead of wrestling with exotic LBs. Let Kubernetes guard the health of your apps with probes and PDBs. Give Longhorn enough room to breathe and back up what matters. And don’t forget the basics—clean OS, private node-to-node network, a firewall stance you can explain to a friend. If you want a refresher on safe, grounded VPS habits, I keep pointing folks to this calm VPS security guide because it removes the anxiety at the edges.

Hope this helped you sketch your own “grown-up” cluster. If you spin one up and run into a weird edge case, I’d love to hear the story. We all get better by sharing the small wins and the odd surprises. Until next time—ship calmly, sleep better, and let your cluster be the boring, dependable engine under the hood.

Frequently Asked Questions

How do I expose the cluster to the internet without a cloud load balancer?

Great question! On three VPS nodes, the simplest wins. Use a floating IP with keepalived so the address follows a healthy node, or lean on DNS health checks that fail over to a secondary node quickly when the first goes dark. Point your domain to that active node, let Traefik listen on 80/443, and keep TLS handled by cert‑manager. If you want extra resilience on the DNS side, running multi‑provider authority and scripted failovers with something like octoDNS is a calm, proven approach.

Can I run one K3s server and two workers instead of three servers?

You can run 1 server + 2 workers, but you’ll lose the big benefit: HA for the control plane and etcd. With a single control-plane node, if that instance dies, your workloads can keep running for a while, but you can’t deploy, scale, or renew certs until the server comes back. Three K3s servers with embedded etcd gives you quorum, write availability during a failure, and a far calmer upgrade story. In a small cluster, the tiny overhead is worth the resilience.

How should I handle backups for the cluster and for application data?

For the control plane, enable K3s’s etcd snapshots and ship them off the box daily to object storage. That covers cluster state. For application data, use Longhorn’s built-in backup to S3-compatible storage and schedule it per volume. Practice a small restore before you need it—spin up a test namespace, restore a snapshot, and verify the app comes back. That fifteen‑minute drill turns a worst‑case day into a minor detour.