Private Overlay Networks with Tailscale/ZeroTier: Multi‑Cloud Mesh

If you’ve ever stitched together workloads across DigitalOcean, Hetzner, OVH, Linode, and a sprinkling of Lightsail or bare metal, you know the pain of inconsistent east–west network paths. In this post, we’ll build and harden Private Overlay Networks with Tailscale/ZeroTier to deliver a site‑to‑site mesh across multi‑provider VPS. This isn’t a fantasy architecture. It’s the model we turned to after a nasty incident that burned 17.3 minutes of a 99.95% monthly error budget and sent the on‑call through a Saturday that felt like three. We’ll walk discovery → mitigation → prevention, with metrics, CLI snippets, and runbook steps.

The Incident That Triggered the Mesh

It started as a garden‑variety blip. p95 API latencies from US‑East to EU‑Central edged from 120 ms to 420 ms over six hours, with sporadic 1–3% packet loss between DO (NYC) and Hetzner (FSN). Our dashboards showed SYN retransmits climbing, particularly on services pinned to public IPs with provider firewalls. East–west calls retried through a mishmash of NATs and middleboxes. We weren’t down, but we were wobbling: 2.3% of requests exceeded our 300 ms SLO in the hottest path. That’s a budget you can’t spend for long.

By 13:40 UTC, we saw a pattern: most failures clustered on cross‑provider traffic when reverse paths crossed CGNAT. Our infra was “cloud‑agnostic,” but the network clearly was not. We needed a private, stable address space and a predictable, encrypted path between sites—without backhauling all traffic through a single chokepoint.

The decision: deploy an overlay mesh—first with Tailscale (WireGuard‑based) for a quick win, and, in a parallel lane, ZeroTier for teams that needed L2‑like semantics and controller‑level policy. Both had to be represented in IaC, observable, and survivable when a provider or region had a bad day.

What Is a Private Overlay Network?

A private overlay is a virtual network that rides over the existing internet (or any IP network). Nodes keep their normal public/private interfaces, but they also join a secure mesh with stable addresses. Traffic between nodes is encrypted end‑to‑end and, when possible, flows directly via NAT traversal (hole‑punching). When direct paths fail, traffic relays through a middle layer (DERP in Tailscale, relays/planets/moons in ZeroTier).
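
Both tools make it easy to see whether a given peer is on a direct path or has fallen back to a relay; a quick check from a gateway might look like this (the peer address is a placeholder):

# Tailscale: list peers and whether traffic flows direct or via a DERP relay
tailscale status
# Probe one peer; the output reports the path taken (direct endpoint or DERP)
tailscale ping 100.100.23.10

# ZeroTier: list peers with latency and DIRECT/RELAY path information
sudo zerotier-cli peers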

Tailscale vs. ZeroTier: Pragmatic Differences

Capability       | Tailscale                                          | ZeroTier
-----------------|----------------------------------------------------|-------------------------------------------------------------
Core protocol    | WireGuard (ChaCha20-Poly1305)                      | Custom overlay; encryption comparable to WG
NAT traversal    | Direct UDP when possible; DERP relay fallback      | Direct when possible; planet/relay fallback; optional moons
Addressing       | Stable 100.x.x.x (CGNAT block) per node; MagicDNS  | Private network CIDRs; can simulate L2 or L3
Site-to-site     | Subnet routers (advertise-routes), exit nodes      | Managed routes, optional bridging
Policy           | ACL file; identity-centric; SSO/SCIM friendly      | Controller rules; member auth; tags
Control plane    | Hosted; self-host with Headscale possible          | Hosted controller; self-host controller; moons
MTU defaults     | Conservative (~1280)                               | Higher virtual MTU; adjust to path
Client ecosystem | Strong across OSs/containers; lightweight          | Strong; good for embedded/L2 scenarios

They’re both excellent. If you want identity‑driven ACLs and a dead‑simple path to subnet routing, Tailscale is fast to land. If your use case leans L2 adjacency, custom controllers, or you already live in ZeroTier, it’s equally viable.

Reference Architecture: Site‑to‑Site Mesh Across Multi‑Provider VPS

Our baseline topology per provider/region:

       +---------------------+            +----------------------+
       |  DO - NYC1         |            |  Hetzner - FSN1      |
       |  gw-do-nyc1 (GW)   |            |  gw-hz-fsn1 (GW)     |
       |  10.10.10.1/24     |            |  10.20.10.1/24       |
       |  app/db nodes      |            |  app/db nodes        |
       +----------+---------+            +-----------+----------+
                  |                                      |
             [Overlay Interface]                    [Overlay Interface]
                  |                                      |
                  +------------------ Mesh ----------------+
                             (Direct UDP when possible)

Each region gets at least two gateway nodes (for HA) that:

  • Participate in the overlay as regular nodes.
  • Act as subnet routers advertising local RFC1918 ranges to the mesh.
  • Enforce ACLs so east–west is least‑privilege by default.

Addressing and routes (example):

  • DO‑NYC1: 10.10.10.0/24
  • Hetzner‑FSN1: 10.20.10.0/24
  • OVH‑GRA: 10.30.10.0/24

We keep per‑region /24s and reserve /16 per provider for growth. Overlay MTU is set conservatively (1280) to avoid fragmentation across the internet path.

Tailscale Implementation

Step 1 — Org setup and guardrails

  • Enable SSO and device approval.
  • Short key expiry (30–90 days) for servers; ephemeral keys for CI hosts.
  • MagicDNS with split DNS for service discovery (e.g., db.service.tailnet.yourcorp).
  • Tailnet policy: default‑deny; explicit allows between service tags.
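
For the ephemeral CI keys in the guardrails list above, one option is to mint them from the Tailscale API instead of the admin UI. A rough sketch, with the payload abbreviated and tag:ci as an example tag — verify the fields against the current API reference:

# Sketch: create a short-lived, ephemeral, pre-authorized key tagged for CI
curl -s -u "${TS_API_KEY}:" \
  -X POST "https://api.tailscale.com/api/v2/tailnet/-/keys" \
  -H "Content-Type: application/json" \
  -d '{
        "capabilities": {
          "devices": {
            "create": {
              "reusable": false,
              "ephemeral": true,
              "preauthorized": true,
              "tags": ["tag:ci"]
            }
          }
        },
        "expirySeconds": 3600
      }'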

Step 2 — Install and enroll nodes

On Debian/Ubuntu gateways:

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up \
  --authkey=<TSKEY-PREAUTH> \
  --hostname=gw-do-nyc1 \
  --accept-dns=false

Enable IP forwarding and basic forwarding rules:

sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1
# Persist
sudo bash -c 'cat >> /etc/sysctl.d/99-overlay.conf <<EOF
net.ipv4.ip_forward=1
net.ipv6.conf.all.forwarding=1
EOF'

Step 3 — Advertise routes (subnet router)

On gw‑do‑nyc1:

sudo tailscale up \
  --authkey=<TSKEY-PREAUTH> \
  --advertise-routes=10.10.10.0/24 \
  --advertise-exit-node=false \
  --hostname=gw-do-nyc1

On gw‑hz‑fsn1:

sudo tailscale up \
  --authkey=<TSKEY-PREAUTH> \
  --advertise-routes=10.20.10.0/24 \
  --hostname=gw-hz-fsn1

Approve the routes in the Tailscale admin UI (or via API) to make them active.
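
If you prefer scripting the approval, the device routes API can enable the advertised CIDRs; a sketch (device ID, API key, and payload shape should be checked against the current API docs):

# Sketch: enable the advertised subnet routes for a device
curl -s -u "${TS_API_KEY}:" \
  -X POST "https://api.tailscale.com/api/v2/device/${DEVICE_ID}/routes" \
  -H "Content-Type: application/json" \
  -d '{"routes": ["10.10.10.0/24"]}'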

Step 4 — ACL policy: identity‑first access

A minimal ACL that allows app → db across sites without opening the world:

{
  "tagOwners": {
    "tag:gateway": ["group:netops"],
    "tag:app": ["group:platform"],
    "tag:db": ["group:dba"]
  },
  "acls": [
    { "action": "accept", "src": ["tag:app"], "dst": ["tag:db:*:5432"] },
    { "action": "accept", "src": ["group:netops"], "dst": ["*:*"] }
  ],
  "ssh": [
    { "action": "check", "src": ["group:netops"], "dst": ["tag:gateway"], "users": ["root"] }
  ]
}

Tag nodes at enrollment time with --advertise-tags (or bake the tags into the pre-auth key):

sudo tailscale up --authkey=<TSKEY-PREAUTH> --advertise-tags=tag:gateway

Step 5 — Observability and SLOs

Key overlay metrics we chart weekly:

  • p95 overlay RTT per site pair (derived from periodic ICMP/TCP checks over tailnet IPs).
  • Packet loss per site pair.
  • Handshake time distribution (from service logs or synthetic checks).
  • Route health: subnet route availability, last change timestamp.

Example: Prometheus blackbox checks between gateways (tailnet IPs):

# ICMP probes between gateways over tailnet IPs (blackbox exporter)
probe_success{target="100.100.23.10"}
probe_icmp_duration_seconds{phase="rtt", ...}
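
A minimal scrape job for those probes might look like the sketch below; paths, the module name, and the exporter address are assumptions to adapt to your blackbox exporter setup:

# Sketch: Prometheus job probing gateway tailnet IPs via a local blackbox exporter
cat > /tmp/overlay-scrape.yml <<'EOF'
# Merge this job into the scrape_configs section of prometheus.yml
scrape_configs:
  - job_name: overlay
    metrics_path: /probe
    params:
      module: [icmp]            # ICMP module defined in blackbox.yml
    static_configs:
      - targets: ['100.100.23.10', '100.77.12.5']   # gateway tailnet IPs
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: target
      - target_label: __address__
        replacement: 127.0.0.1:9115                 # blackbox exporter
EOF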

We also scrape host metrics for WireGuard/Tailscale processes (CPU, RSS) to track encryption overhead. Under our load (200–400 Mbps bursts), CPU stayed under 6% on 2 vCPU gateways with AES‑NI/AVX support.

Step 6 — Terraform the basics

We keep routes, tags, and ACLs in code. An example using the Tailscale provider:

terraform {
  required_providers {
    tailscale = {
      source = "tailscale/tailscale"
      version = "~> 0.16"
    }
  }
}

provider "tailscale" {}

resource "tailscale_acl" "tailnet" {
  acl = file("./acl.json")
}

resource "tailscale_device_subnet_routes" "gw_do_nyc1" {
  device_id = var.gw_do_nyc1_device_id
  routes    = ["10.10.10.0/24"]
}

resource "tailscale_device_tags" "gw_do_nyc1" {
  device_id = var.gw_do_nyc1_device_id
  tags      = ["tag:gateway"]
}

Observed outcomes (Tailscale)

  • p95 handshake time dropped from 220 ms (public IPs + NAT retries) to 26 ms across DO‑NYC1 ↔ Hetzner‑FSN1.
  • Packet loss on inter‑service calls fell from 0.7% to 0.05% during peak.
  • Throughput on a noisy pair improved from 340 Mbps to 760 Mbps after direct UDP was established; DERP fallback rarely engaged (<1% of flows).
  • Error budget burn for 99.95% SLO cut from 17.3 min/month to 1.9 min/month over the next quarter, mostly from removing path flakiness.

ZeroTier Implementation

Step 1 — Install and join

curl -s https://install.zerotier.com | sudo bash
sudo zerotier-cli join <NETWORK_ID>

Authorize members in the controller. Assign managed IPs (e.g., 10.42.0.0/16). For site‑to‑site, configure managed routes to your on‑host subnets:

  • DO‑NYC1: route 10.10.10.0/24 via gw‑do‑nyc1
  • Hetzner‑FSN1: route 10.20.10.0/24 via gw‑hz‑fsn1
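
Once the controller has authorized a member and pushed managed routes, a quick check on a gateway confirms it took effect (network ID and CIDRs are placeholders):

# Confirm the node joined and received its managed IP
sudo zerotier-cli info
sudo zerotier-cli listnetworks
# Managed routes from the controller should now appear in the routing table
ip route show | grep -E '10\.(10|20)\.10\.0/24'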

Step 2 — Enable forwarding and routing on gateways

sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1
# Linux: identify ZeroTier interface, usually zt<id>
ip -br a | grep zt

Ensure your provider firewall allows overlay‑initiated traffic (you can keep public ingress closed). East–west will ride the overlay interface.

Step 3 — Optional moons for predictable relay locality

In some geographies, we improved relay fallback latency by deploying a moon near our regions.

# On a stable VM with a static public IP (the prospective moon)
sudo zerotier-idtool initmoon /var/lib/zerotier-one/identity.public > moon.json
# Edit moon.json: set stableEndpoints to the moon's public IP and port (9993)
sudo zerotier-idtool genmoon moon.json
# Copy the generated .moon file into /var/lib/zerotier-one/moons.d/ on the moon
# itself (then restart the service), and have members orbit it:
sudo zerotier-cli orbit <moonid> <moonid>

Result: when direct paths fail, relay fallbacks landed closer to traffic sources, trimming p95 relay RTT from ~180 ms to ~92 ms in EMEA.

Step 4 — Flow rules (policy)

ZeroTier rules let you express network policy at L2/L3. Below is a simplified, default‑deny sketch allowing ICMP, SSH, and Postgres between tagged members (the tag‑match predicates are illustrative; check the ZeroTier rules engine reference for exact syntax before deploying):

# Allow only IPv4, IPv6, and ARP frames
drop
  not ethertype ipv4
  and not ethertype arp
  and not ethertype ipv6
;

# Allow ICMP (IP protocol 1) for health checks
accept
  ethertype ipv4
  and ipprotocol 1
;

# Allow SSH between members tagged ops=1
accept
  ipprotocol tcp
  and dport 22
  and teq ops 1
;

# Allow Postgres between members tagged db=1 (app and db nodes both carry it)
accept
  ipprotocol tcp
  and dport 5432
  and teq db 1
;

# No trailing blanket accept — anything not matched above is dropped
Tagging members in the controller (e.g., ops=1 on admin hosts, db=1 on app and db nodes) gates access. Keep rules human‑readable; they’re your audit trail during incidents.

Observed outcomes (ZeroTier)

  • Direct path success >98% after first minute; relay fallback rare.
  • p95 TCP connect time stabilized at 35–45 ms for NYC1 ↔ FSN1.
  • Overlay throughput kept pace with Tailscale for our workloads (400–700 Mbps bursts on 2 vCPU gateways).

Performance Tuning and Observability

MTU and fragmentation

We default to 1280 MTU on overlay interfaces to avoid PMTU gotchas through the public internet. If you control both edges and can validate, you can probe higher MTUs—just don’t trade consistency for a few extra Mbps on paper.

# Tailscale (Linux)
sudo ip link set dev tailscale0 mtu 1280

# ZeroTier interface discovery and MTU set
IF=$(ip -o link | awk -F': ' '/zt[0-9a-f]+/ {print $2; exit}')
sudo ip link set dev "$IF" mtu 1280

Throughput and CPU

Quick checkpoints we log in runbooks:

  • iperf3 between gateways (both directions).
  • Per‑core CPU on encryption threads.
  • IRQ balance and offload settings (make sure virtio/net offloads aren’t neutered).

# Server
iperf3 -s
# Client
iperf3 -c 100.100.23.10 -P 4 -t 30
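
For the offload check in the list above, ethtool shows whether the common virtio/NIC offloads are still enabled (interface name is an example):

ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-segmentation-offload|generic-receive-offload|scatter-gather'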

Sample numbers from a DO‑NYC1 ↔ Hetzner‑FSN1 pair (2 vCPU, 2 GB):

  • Before (public IPs + NAT flakiness): 340–430 Mbps, 0.6–0.9% loss spikes.
  • After (overlay direct): 620–780 Mbps sustained, loss <0.1%.

Latency and SLOs

We measure:

  • p95/p99 overlay RTT
  • p95 TCP handshake time
  • Route availability (did we lose a subnet router?)

PromQL sketches for blackbox probes:

overlay_rtt_p95_ms = histogram_quantile(0.95, sum(rate(probe_icmp_duration_seconds_bucket{job="overlay"}[5m])) by (le, target)) * 1000

handshake_p95_ms = histogram_quantile(0.95, sum(rate(tcp_connect_duration_seconds_bucket{job="overlay"}[5m])) by (le, target)) * 1000
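
The same probes feed alerting. A sketch of the rules we page on, assuming overlay_rtt_p95_ms is materialized as a recording rule and with thresholds as placeholders:

# Sketch: alert rules (referenced from rule_files: in prometheus.yml)
sudo tee /etc/prometheus/rules/overlay.yml >/dev/null <<'EOF'
groups:
  - name: overlay
    rules:
      - alert: OverlayPeerUnreachable
        expr: probe_success{job="overlay"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Overlay probe to {{ $labels.target }} failing for 5 minutes"
      - alert: OverlayRTTHigh
        expr: overlay_rtt_p95_ms > 150
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "p95 overlay RTT to {{ $labels.target }} above 150 ms"
EOF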

Operationally, we reset expectations with product on Day 1: the overlay is a reliability lever, not a speed cheat code. When it improves latency, it’s usually because we eliminated retransmits and middlebox weirdness, not because encryption made packets go faster.

Security, Compliance, and Governance

Key hygiene

  • 30–90 day key expiry for servers; alarms for “expiring within 7 days.”
  • Ephemeral keys for CI runners and canaries (auto‑expiry within hours).
  • Device approval required; no auto‑admit to production overlays.
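
To back the expiry alarms above, you can poll the device list from the API and feed it to whatever alerts on "expiring within 7 days." A rough sketch — field names should be verified against the current API docs:

# Sketch: dump device names and node-key expiry timestamps, soonest first
curl -s -u "${TS_API_KEY}:" \
  "https://api.tailscale.com/api/v2/tailnet/-/devices" |
  jq -r '.devices[] | [.expires, .name] | @tsv' | sort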

Least privilege by default

  • Segment by service role: app → db only on required ports.
  • Block inter‑region by default; allow per service dependency.
  • Keep a blocklist for known risky ports; only open with change control.

Auditability

  • Log joins/leaves, route advertisements, policy changes (ship to your SIEM).
  • Daily diff of ACLs/rules in Git; pull request reviews required.
  • Quarterly key rotation fire drills.

Runbooks: From Zero to Mesh and Back Again

Runbook A — Bring up a new region (Tailscale)

  1. Provision two small gateways (2 vCPU, 2–4 GB) behind provider firewalls.
  2. Install Tailscale, enable IP forwarding.
  3. tailscale up --authkey=<preauth> --hostname=gw-<prov>-<reg>
  4. Advertise routes: --advertise-routes=<cidr>
  5. Approve routes in admin; tag gateways.
  6. Validate connectivity from other regions: ping, traceroute, iperf3.
  7. Update ACLs with least‑privilege rules for new services.
  8. Push Terraform changes for routes/tags/ACLs; peer review before apply.
  9. Set alerts: route withdrawal, device offline >5 min, key expiry in 7 days.

Runbook B — Bring up a new region (ZeroTier)

  1. Provision two gateways and join them to the network ID.
  2. Authorize members, assign managed IPs.
  3. Add managed routes to the region CIDR; map to the gateways.
  4. Enable IP forwarding; confirm ZeroTier interface name.
  5. Apply flow rules granting the minimum required access.
  6. Connectivity tests and baseline measurements.
  7. Commit controller changes to Git (exported JSON/rules) for audit.

Runbook C — Common failure modes and mitigations

  • Symptom: Route advertised but unreachable.

    Checks: tailscale status --peers or zerotier-cli listpeers; ensure forwarding enabled; confirm ACL/rule allows path.

    Fix: Re‑announce routes; bounce service; verify provider firewall doesn’t block overlay interface traffic.
  • Symptom: Sudden fall back to relays; throughput tanks.

    Checks: NAT type change (provider reboot?); packet loss spike on public path.

    Fix: Restart overlay processes; verify UDP allowed outbound; consider a local relay (DERP region pin or ZeroTier moon). See the triage sketch after this list.
  • Symptom: Key expiry mid‑deploy.

    Checks: Node event logs; CI failures.

    Fix: Rotate keys; use ephemeral keys for short‑lived nodes; alerting with 7‑day headroom.
  • Symptom: Route blackhole (two gateways advertise same CIDR, asymmetric path).

    Checks: Route tables, overlay peer choice.

    Fix: Standardize route priority; in Tailscale, use primary route selection; in ZeroTier, consolidate managed routes.
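
A quick triage sketch for the relay‑fallback symptom above (peer IP is a placeholder):

# Tailscale: check NAT traversal health, UDP reachability, and DERP latencies
tailscale netcheck
# Is a specific peer currently direct or relayed?
tailscale ping 100.100.23.10
tailscale status --peers

# ZeroTier: look for peers stuck on RELAY instead of DIRECT
sudo zerotier-cli peers | grep -v DIRECT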

Operational Metrics Before/After

Across a quarter after rollout on three regions:

  • p95 TCP handshake: 180–240 ms → 24–44 ms
  • Packet loss: 0.4–0.9% → 0.03–0.08%
  • Failed deploys due to flaky cross‑region calls: 7.1% → 0.6%
  • Error budget burn (99.95% SLO): 17.3 min → 1.9 min

We also saw developer cycle time improve. Our CI jobs that hit dependencies across sites used to run with guard‑timers and retries; with the overlay, median runtime dropped by 14–22% depending on the job graph.

Cost and Capacity Planning

Compute overhead

  • Gateway cost: small VMs (2 vCPU) were enough up to ~800 Mbps.
  • Per‑workload CPU overhead for overlay daemons was negligible (<1–2%) on general servers.

Network egress

  • Direct overlay traffic still pays provider egress; we avoided central backhaul to keep costs near the theoretical minimum.
  • Relay fallback can add surprise egress; we monitored relay usage and optimized NAT paths to keep it <1%.

Licensing and ops time

  • Both tools have generous free/paid tiers; the real spend is your time hardening policy and observability.

When Not to Use an Overlay

Overlays are powerful, but not always the right tool. Consider alternatives when:

  • You can bring native interconnects online (e.g., private interconnects, IPSec with BGP between DCs) with predictable latency and SLAs.
  • You need deterministic L2 semantics with strict broadcast controls—ZeroTier can do L2‑ish, but at scale, it’s easier to use L3 with clear routes or dedicated WAN.
  • Regulatory requirements mandate specific control planes you can’t meet without self‑hosting; in that case, plan for Headscale (Tailscale) or self‑hosted ZeroTier controller + moons.

Culture and On‑Call Health

After we shipped the mesh, we wrote down two promises to ourselves:

  1. No heroics. If the overlay misbehaves, we roll forward or back using the runbook, not wizardry at 03:00.
  2. Blameless learning. Every incident gets the same respect—timeline, facts, metrics, and one thing we’ll do to make it boring next time.

Team burnout usually hides in the glue code between systems. Overlays remove a lot of that glue. But the real antidote is steady instrumentation, guardrails in code, and the psychological safety to say “I don’t know yet” on a call.

Appendix: Concrete Config and Snippets

Systemd health checks for gateways

# /etc/systemd/system/overlay-health.service
[Unit]
Description=Overlay Health Probe
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/overlay-health.sh
Restart=always
RestartSec=15

[Install]
WantedBy=multi-user.target

# /usr/local/bin/overlay-health.sh
#!/usr/bin/env bash
set -euo pipefail
TARGETS=(100.100.23.10 100.77.12.5)
while true; do
  for t in "${TARGETS[@]}"; do
    if ! ping -c1 -W1 "$t" >/dev/null; then
      logger -t overlay-health "WARN: overlay target $t unreachable"
    fi
  done
  sleep 10
done
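
Wiring the two files together (paths as above):

sudo chmod +x /usr/local/bin/overlay-health.sh
sudo systemctl daemon-reload
sudo systemctl enable --now overlay-health.service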

nftables baseline to protect gateways

table inet overlay {
  chain input {
    type filter hook input priority 0; policy drop;
    iif lo accept
    ct state established,related accept
    iifname "tailscale0" accept
    iifname "zt*" accept
    tcp dport {22} ct state new accept
  }
  chain forward {
    type filter hook forward priority 0; policy drop;
    ct state established,related accept
    iifname "tailscale0" oifname != "tailscale0" accept
    iifname "zt*" oifname != "zt*" accept
  }
}

Connectivity smoke test script

#!/usr/bin/env bash
set -euo pipefail
PEERS=(100.64.0.10 100.80.1.20 10.10.10.10 10.20.10.20)
for p in "${PEERS[@]}"; do
  echo "Testing $p"
  if ! timeout 2 bash -c ">/dev/tcp/$p/22" 2>/dev/null; then
    echo "FAIL: $p:22"
  else
    echo "OK: $p:22"
  fi
  ping -c2 -W1 "$p" || true
  traceroute -n -w1 -q1 "$p" || true
  echo
done

Key Takeaways

  • Overlays give you stable addressing, encrypted paths, and policy you can reason about across providers.
  • Tailscale is a fast path to L3 site‑to‑site via subnet routers and identity‑based ACLs.
  • ZeroTier shines when you want controller‑driven networks and flexible L2/L3 behavior.
  • Keep MTU conservative, measure p95/p99s, and alert on route health and key expiry.
  • Codify everything: routes, ACLs/rules, device tags, and health checks belong in Git.
  • Practice failure: relay fallbacks, key rotations, and route withdrawals should be boring drills.

Closing

We didn’t adopt overlays to be clever. We adopted them because they let us say “yes” to multi‑provider without trading away reliability. With Private Overlay Networks with Tailscale/ZeroTier, you can ship a site‑to‑site mesh in days, observe it in hours, and stop apologizing for the internet in front of your SLOs. Start small, tag ruthlessly, measure honestly, and make your post‑mortems a little shorter this quarter.

Frequently Asked Questions

Do we need dedicated gateway nodes in every region?

You don’t strictly need them, but two small gateway VMs per region make subnet routing, failover, and policy enforcement clean. They also isolate overlay upgrades from app nodes.

Will an overlay make cross‑region traffic faster?

It often stabilizes latency by avoiding NAT retries and middleboxes. Sometimes p95 improves; don’t bank on miracles. The main win is reliability and predictable, encrypted paths.

How do we keep the overlay auditable?

Put ACLs/rules, routes, and tags in Git; require PR reviews; ship join/leave and policy-change logs to your SIEM; set key-expiry alerts; and run quarterly rotation/fire-drill tests.