{"id":2076,"date":"2025-11-18T18:33:41","date_gmt":"2025-11-18T15:33:41","guid":{"rendered":"https:\/\/www.dchost.com\/blog\/private-overlay-networks-with-tailscale-zerotier-multi%e2%80%91cloud-mesh\/"},"modified":"2025-11-18T18:33:41","modified_gmt":"2025-11-18T15:33:41","slug":"private-overlay-networks-with-tailscale-zerotier-multi%e2%80%91cloud-mesh","status":"publish","type":"post","link":"https:\/\/www.dchost.com\/blog\/en\/private-overlay-networks-with-tailscale-zerotier-multi%e2%80%91cloud-mesh\/","title":{"rendered":"Private Overlay Networks with Tailscale\/ZeroTier: Multi\u2011Cloud Mesh"},"content":{"rendered":"<div class=\"dchost-blog-content-wrapper\"><div id=\"toc_container\" class=\"toc_transparent no_bullets\"><p class=\"toc_title\">Contents<\/p><ul class=\"toc_list\"><li><a href=\"#Private_Overlay_Networks_with_TailscaleZeroTier_MultiCloud_Mesh\"><span class=\"toc_number toc_depth_1\">1<\/span> Private Overlay Networks with Tailscale\/ZeroTier: Multi\u2011Cloud Mesh<\/a><ul><li><a href=\"#The_Incident_That_Triggered_the_Mesh\"><span class=\"toc_number toc_depth_2\">1.1<\/span> The Incident That Triggered the Mesh<\/a><\/li><li><a href=\"#What_Is_a_Private_Overlay_Network\"><span class=\"toc_number toc_depth_2\">1.2<\/span> What Is a Private Overlay Network?<\/a><ul><li><a href=\"#Tailscale_vs_ZeroTier_Pragmatic_Differences\"><span class=\"toc_number toc_depth_3\">1.2.1<\/span> Tailscale vs. 
ZeroTier: Pragmatic Differences<\/a><\/li><\/ul><\/li><li><a href=\"#Reference_Architecture_SitetoSite_Mesh_Across_MultiProvider_VPS\"><span class=\"toc_number toc_depth_2\">1.3<\/span> Reference Architecture: Site\u2011to\u2011Site Mesh Across Multi\u2011Provider VPS<\/a><\/li><li><a href=\"#Tailscale_Implementation\"><span class=\"toc_number toc_depth_2\">1.4<\/span> Tailscale Implementation<\/a><ul><li><a href=\"#Step_1_Org_setup_and_guardrails\"><span class=\"toc_number toc_depth_3\">1.4.1<\/span> Step 1 \u2014 Org setup and guardrails<\/a><\/li><li><a href=\"#Step_2_Install_and_enroll_nodes\"><span class=\"toc_number toc_depth_3\">1.4.2<\/span> Step 2 \u2014 Install and enroll nodes<\/a><\/li><li><a href=\"#Step_3_Advertise_routes_subnet_router\"><span class=\"toc_number toc_depth_3\">1.4.3<\/span> Step 3 \u2014 Advertise routes (subnet router)<\/a><\/li><li><a href=\"#Step_4_ACL_policy_identityfirst_access\"><span class=\"toc_number toc_depth_3\">1.4.4<\/span> Step 4 \u2014 ACL policy: identity\u2011first access<\/a><\/li><li><a href=\"#Step_5_Observability_and_SLOs\"><span class=\"toc_number toc_depth_3\">1.4.5<\/span> Step 5 \u2014 Observability and SLOs<\/a><\/li><li><a href=\"#Step_6_Terraform_the_basics\"><span class=\"toc_number toc_depth_3\">1.4.6<\/span> Step 6 \u2014 Terraform the basics<\/a><\/li><li><a href=\"#Observed_outcomes_Tailscale\"><span class=\"toc_number toc_depth_3\">1.4.7<\/span> Observed outcomes (Tailscale)<\/a><\/li><\/ul><\/li><li><a href=\"#ZeroTier_Implementation\"><span class=\"toc_number toc_depth_2\">1.5<\/span> ZeroTier Implementation<\/a><ul><li><a href=\"#Step_1_Install_and_join\"><span class=\"toc_number toc_depth_3\">1.5.1<\/span> Step 1 \u2014 Install and join<\/a><\/li><li><a href=\"#Step_2_Enable_forwarding_and_routing_on_gateways\"><span class=\"toc_number toc_depth_3\">1.5.2<\/span> Step 2 \u2014 Enable forwarding and routing on gateways<\/a><\/li><li><a 
href=\"#Step_3_Optional_moons_for_predictable_relay_locality\"><span class=\"toc_number toc_depth_3\">1.5.3<\/span> Step 3 \u2014 Optional moons for predictable relay locality<\/a><\/li><li><a href=\"#Step_4_Flow_rules_policy\"><span class=\"toc_number toc_depth_3\">1.5.4<\/span> Step 4 \u2014 Flow rules (policy)<\/a><\/li><li><a href=\"#Observed_outcomes_ZeroTier\"><span class=\"toc_number toc_depth_3\">1.5.5<\/span> Observed outcomes (ZeroTier)<\/a><\/li><\/ul><\/li><li><a href=\"#Performance_Tuning_and_Observability\"><span class=\"toc_number toc_depth_2\">1.6<\/span> Performance Tuning and Observability<\/a><ul><li><a href=\"#MTU_and_fragmentation\"><span class=\"toc_number toc_depth_3\">1.6.1<\/span> MTU and fragmentation<\/a><\/li><li><a href=\"#Throughput_and_CPU\"><span class=\"toc_number toc_depth_3\">1.6.2<\/span> Throughput and CPU<\/a><\/li><li><a href=\"#Latency_and_SLOs\"><span class=\"toc_number toc_depth_3\">1.6.3<\/span> Latency and SLOs<\/a><\/li><\/ul><\/li><li><a href=\"#Security_Compliance_and_Governance\"><span class=\"toc_number toc_depth_2\">1.7<\/span> Security, Compliance, and Governance<\/a><ul><li><a href=\"#Key_hygiene\"><span class=\"toc_number toc_depth_3\">1.7.1<\/span> Key hygiene<\/a><\/li><li><a href=\"#Least_privilege_by_default\"><span class=\"toc_number toc_depth_3\">1.7.2<\/span> Least privilege by default<\/a><\/li><li><a href=\"#Auditability\"><span class=\"toc_number toc_depth_3\">1.7.3<\/span> Auditability<\/a><\/li><\/ul><\/li><li><a href=\"#Runbooks_From_Zero_to_Mesh_and_Back_Again\"><span class=\"toc_number toc_depth_2\">1.8<\/span> Runbooks: From Zero to Mesh and Back Again<\/a><ul><li><a href=\"#Runbook_A_Bring_up_a_new_region_Tailscale\"><span class=\"toc_number toc_depth_3\">1.8.1<\/span> Runbook A \u2014 Bring up a new region (Tailscale)<\/a><\/li><li><a href=\"#Runbook_B_Bring_up_a_new_region_ZeroTier\"><span class=\"toc_number toc_depth_3\">1.8.2<\/span> Runbook B \u2014 Bring up a new region 
(ZeroTier)<\/a><\/li><li><a href=\"#Runbook_C_Common_failure_modes_and_mitigations\"><span class=\"toc_number toc_depth_3\">1.8.3<\/span> Runbook C \u2014 Common failure modes and mitigations<\/a><\/li><\/ul><\/li><li><a href=\"#Operational_Metrics_BeforeAfter\"><span class=\"toc_number toc_depth_2\">1.9<\/span> Operational Metrics Before\/After<\/a><\/li><li><a href=\"#Cost_and_Capacity_Planning\"><span class=\"toc_number toc_depth_2\">1.10<\/span> Cost and Capacity Planning<\/a><ul><li><a href=\"#Compute_overhead\"><span class=\"toc_number toc_depth_3\">1.10.1<\/span> Compute overhead<\/a><\/li><li><a href=\"#Network_egress\"><span class=\"toc_number toc_depth_3\">1.10.2<\/span> Network egress<\/a><\/li><li><a href=\"#Licensing_and_ops_time\"><span class=\"toc_number toc_depth_3\">1.10.3<\/span> Licensing and ops time<\/a><\/li><\/ul><\/li><li><a href=\"#When_Not_to_Use_an_Overlay\"><span class=\"toc_number toc_depth_2\">1.11<\/span> When Not to Use an Overlay<\/a><\/li><li><a href=\"#Culture_and_OnCall_Health\"><span class=\"toc_number toc_depth_2\">1.12<\/span> Culture and On\u2011Call Health<\/a><\/li><li><a href=\"#Appendix_Concrete_Config_and_Snippets\"><span class=\"toc_number toc_depth_2\">1.13<\/span> Appendix: Concrete Config and Snippets<\/a><ul><li><a href=\"#Systemd_health_checks_for_gateways\"><span class=\"toc_number toc_depth_3\">1.13.1<\/span> Systemd health checks for gateways<\/a><\/li><li><a href=\"#nftables_baseline_to_protect_gateways\"><span class=\"toc_number toc_depth_3\">1.13.2<\/span> nftables baseline to protect gateways<\/a><\/li><li><a href=\"#Connectivity_smoke_test_script\"><span class=\"toc_number toc_depth_3\">1.13.3<\/span> Connectivity smoke test script<\/a><\/li><\/ul><\/li><li><a href=\"#Key_Takeaways\"><span class=\"toc_number toc_depth_2\">1.14<\/span> Key Takeaways<\/a><\/li><li><a href=\"#Closing\"><span class=\"toc_number toc_depth_2\">1.15<\/span> Closing<\/a><\/li><\/ul><\/li><\/ul><\/div>\n<h1><span 
id=\"Private_Overlay_Networks_with_TailscaleZeroTier_MultiCloud_Mesh\">Private Overlay Networks with Tailscale\/ZeroTier: Multi\u2011Cloud Mesh<\/span><\/h1>\n<p>If you\u2019ve ever stitched together workloads across DigitalOcean, Hetzner, OVH, Linode, and a sprinkling of Lightsail or bare metal, you know the pain of inconsistent east\u2013west network paths. In this post, we\u2019ll build and harden <strong>Private Overlay Networks with Tailscale\/ZeroTier<\/strong> to deliver a site\u2011to\u2011site mesh across multi\u2011provider <a href=\"https:\/\/www.dchost.com\/vps\">VPS<\/a>. This isn\u2019t a fantasy architecture. It\u2019s the model we turned to after a nasty incident that burned 17.3 minutes of a 99.95% monthly error budget and sent the on\u2011call through a Saturday that felt like three. We\u2019ll walk discovery \u2192 mitigation \u2192 prevention, with metrics, CLI snippets, and runbook steps.<\/p>\n<h2><span id=\"The_Incident_That_Triggered_the_Mesh\">The Incident That Triggered the Mesh<\/span><\/h2>\n<p>It started as a garden\u2011variety blip. p95 API latencies from US\u2011East to EU\u2011Central edged from 120 ms to 420 ms over six hours, with sporadic 1\u20133% packet loss between DO (NYC) and Hetzner (FSN). Our dashboards showed SYN retransmits climbing, particularly on services pinned to public IPs with provider firewalls. East\u2013west calls retried through a mishmash of NATs and middleboxes. We weren\u2019t down, but we were wobbling: 2.3% of requests exceeded our 300 ms SLO in the hottest path. That\u2019s a budget you can\u2019t spend for long.<\/p>\n<p>By 13:40 UTC, we saw a pattern: most failures clustered on cross\u2011provider traffic when reverse paths crossed CGNAT. Our infra was \u201ccloud\u2011agnostic,\u201d but the network clearly was not. 
We needed a private, stable address space and a predictable, encrypted path between sites\u2014without backhauling all traffic through a single chokepoint.<\/p>\n<p>The decision: deploy an overlay mesh\u2014first with Tailscale (WireGuard\u2011based) for a quick win, and, in a parallel lane, ZeroTier for teams that needed L2\u2011like semantics and controller\u2011level policy. Both had to be represented in IaC, observable, and survivable when a provider or region had a bad day.<\/p>\n<h2><span id=\"What_Is_a_Private_Overlay_Network\">What Is a Private Overlay Network?<\/span><\/h2>\n<p>A private overlay is a virtual network that rides over the existing internet (or any IP network). Nodes keep their normal public\/private interfaces, but they also join a secure mesh with stable addresses. Traffic between nodes is encrypted end\u2011to\u2011end and, when possible, flows directly via NAT traversal (hole\u2011punching). When direct paths fail, traffic relays through a middle layer (DERP in Tailscale, relays\/planets\/moons in ZeroTier).<\/p>\n<h3><span id=\"Tailscale_vs_ZeroTier_Pragmatic_Differences\">Tailscale vs. 
ZeroTier: Pragmatic Differences<\/span><\/h3>\n<table>\n<thead>\n<tr>\n<th>Capability<\/th>\n<th>Tailscale<\/th>\n<th>ZeroTier<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Core protocol<\/td>\n<td>WireGuard (ChaCha20\u2011Poly1305)<\/td>\n<td>Custom overlay; encryption comparable to WG<\/td>\n<\/tr>\n<tr>\n<td>NAT traversal<\/td>\n<td>Direct UDP when possible; DERP relay fallback<\/td>\n<td>Direct when possible; planet\/relay fallback; optional moons<\/td>\n<\/tr>\n<tr>\n<td>Addressing<\/td>\n<td>Stable 100.x.x.x (CGNAT block) per node; MagicDNS<\/td>\n<td>Private network CIDRs; can simulate L2 or L3<\/td>\n<\/tr>\n<tr>\n<td>Site\u2011to\u2011site<\/td>\n<td>Subnet routers (advertise\u2011routes), exit nodes<\/td>\n<td>Managed routes, optional bridging<\/td>\n<\/tr>\n<tr>\n<td>Policy<\/td>\n<td>ACL file; identity\u2011centric; SSO\/SCIM friendly<\/td>\n<td>Controller rules; member auth; tags<\/td>\n<\/tr>\n<tr>\n<td>Control plane<\/td>\n<td>Hosted; self\u2011host with Headscale possible<\/td>\n<td>Hosted controller; self\u2011host controller; moons<\/td>\n<\/tr>\n<tr>\n<td>MTU defaults<\/td>\n<td>Conservative (~1280)<\/td>\n<td>Higher virtual MTU; adjust to path<\/td>\n<\/tr>\n<tr>\n<td>Client ecosystem<\/td>\n<td>Strong across OSs\/containers; lightweight<\/td>\n<td>Strong; good for embedded\/L2 scenarios<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>They\u2019re both excellent. If you want identity\u2011driven ACLs and a dead\u2011simple path to subnet routing, Tailscale is fast to land. 
If your use case leans L2 adjacency, custom controllers, or you already live in ZeroTier, it\u2019s equally viable.<\/p>\n<h2><span id=\"Reference_Architecture_SitetoSite_Mesh_Across_MultiProvider_VPS\">Reference Architecture: Site\u2011to\u2011Site Mesh Across Multi\u2011Provider VPS<\/span><\/h2>\n<p>Our baseline topology per provider\/region:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">       +---------------------+            +----------------------+\n       |  DO - NYC1         |            |  Hetzner - FSN1      |\n       |  gw-do-nyc1 (GW)   |            |  gw-hz-fsn1 (GW)     |\n       |  10.10.10.1\/24     |            |  10.20.10.1\/24       |\n       |  app\/db nodes      |            |  app\/db nodes        |\n       +----------+---------+            +-----------+----------+\n                  |                                      |\n             [Overlay Interface]                    [Overlay Interface]\n                  |                                      |\n                  +------------------ Mesh ----------------+\n                             (Direct UDP when possible)\n<\/code><\/pre>\n<p>Each region gets at least two gateway nodes (for HA) that:<\/p>\n<ul>\n<li>Participate in the overlay as regular nodes.<\/li>\n<li>Act as subnet routers advertising local RFC1918 ranges to the mesh.<\/li>\n<li>Enforce ACLs so east\u2013west is least\u2011privilege by default.<\/li>\n<\/ul>\n<p>Addressing and routes (example):<\/p>\n<ul>\n<li>DO\u2011NYC1: 10.10.10.0\/24<\/li>\n<li>Hetzner\u2011FSN1: 10.20.10.0\/24<\/li>\n<li>OVH\u2011GRA: 10.30.10.0\/24<\/li>\n<\/ul>\n<p>We keep per\u2011region \/24s and reserve \/16 per provider for growth. 
Overlay MTU is set conservatively (1280) to avoid fragmentation across the internet path.<\/p>\n<h2><span id=\"Tailscale_Implementation\">Tailscale Implementation<\/span><\/h2>\n<h3><span id=\"Step_1_Org_setup_and_guardrails\">Step 1 \u2014 Org setup and guardrails<\/span><\/h3>\n<ul>\n<li>Enable SSO and device approval.<\/li>\n<li>Short key expiry (30\u201390 days) for servers; ephemeral keys for CI hosts.<\/li>\n<li>MagicDNS with split DNS for service discovery (e.g., <code>db.service.tailnet.yourcorp<\/code>).<\/li>\n<li>Tailnet policy: default\u2011deny; explicit allows between service tags.<\/li>\n<\/ul>\n<h3><span id=\"Step_2_Install_and_enroll_nodes\">Step 2 \u2014 Install and enroll nodes<\/span><\/h3>\n<p>On Debian\/Ubuntu gateways:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">curl -fsSL https:\/\/tailscale.com\/install.sh | sh\nsudo tailscale up \\\n  --authkey=&lt;TSKEY-PREAUTH&gt; \\\n  --hostname=gw-do-nyc1 \\\n  --accept-dns=false\n<\/code><\/pre>\n<p>Enable IP forwarding and persist the setting:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">sudo sysctl -w net.ipv4.ip_forward=1\nsudo sysctl -w net.ipv6.conf.all.forwarding=1\n# Persist\nsudo bash -c 'cat &gt;&gt; \/etc\/sysctl.d\/99-overlay.conf &lt;&lt;EOF\nnet.ipv4.ip_forward=1\nnet.ipv6.conf.all.forwarding=1\nEOF'\n<\/code><\/pre>\n<h3><span id=\"Step_3_Advertise_routes_subnet_router\">Step 3 \u2014 Advertise routes (subnet router)<\/span><\/h3>\n<p>On gw\u2011do\u2011nyc1:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">sudo tailscale up \\\n  --authkey=&lt;TSKEY-PREAUTH&gt; \\\n  --advertise-routes=10.10.10.0\/24 \\\n  --advertise-exit-node=false \\\n  --hostname=gw-do-nyc1\n<\/code><\/pre>\n<p>On gw\u2011hz\u2011fsn1:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">sudo tailscale up \\\n  --authkey=&lt;TSKEY-PREAUTH&gt; \\\n  --advertise-routes=10.20.10.0\/24 \\\n  
--hostname=gw-hz-fsn1\n<\/code><\/pre>\n<p>Approve the routes in the Tailscale admin UI (or via API) to make them active.<\/p>\n<h3><span id=\"Step_4_ACL_policy_identityfirst_access\">Step 4 \u2014 ACL policy: identity\u2011first access<\/span><\/h3>\n<p>A minimal ACL that allows app \u2192 db across sites without opening the world:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">{\n  &quot;tagOwners&quot;: {\n    &quot;tag:gateway&quot;: [&quot;group:netops&quot;],\n    &quot;tag:app&quot;: [&quot;group:platform&quot;],\n    &quot;tag:db&quot;: [&quot;group:dba&quot;]\n  },\n  &quot;acls&quot;: [\n    { &quot;action&quot;: &quot;accept&quot;, &quot;src&quot;: [&quot;tag:app&quot;], &quot;dst&quot;: [&quot;tag:db:5432&quot;] },\n    { &quot;action&quot;: &quot;accept&quot;, &quot;src&quot;: [&quot;group:netops&quot;], &quot;dst&quot;: [&quot;*:*&quot;] }\n  ],\n  &quot;ssh&quot;: [\n    { &quot;action&quot;: &quot;check&quot;, &quot;src&quot;: [&quot;group:netops&quot;], &quot;dst&quot;: [&quot;tag:gateway&quot;], &quot;users&quot;: [&quot;root&quot;] }\n  ]\n}\n<\/code><\/pre>\n<p>Tag nodes at enrollment time:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">sudo tailscale up --authkey=&lt;TSKEY-PREAUTH&gt; --advertise-tags=tag:gateway\n<\/code><\/pre>\n<h3><span id=\"Step_5_Observability_and_SLOs\">Step 5 \u2014 Observability and SLOs<\/span><\/h3>\n<p>Key overlay metrics we chart weekly:<\/p>\n<ul>\n<li>p95 overlay RTT per site pair (derived from periodic ICMP\/TCP checks over tailnet IPs).<\/li>\n<li>Packet loss per site pair.<\/li>\n<li>Handshake time distribution (from service logs or synthetic checks).<\/li>\n<li>Route health: subnet route availability, last change timestamp.<\/li>\n<\/ul>\n<p>Example: Prometheus blackbox checks between gateways (tailnet IPs):<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\"># probe_icmp 
overlays\nprobe_success{target=&quot;100.100.23.10&quot;}\nprobe_icmp_duration_seconds{phase=&quot;rtt&quot;, target=&quot;100.100.23.10&quot;}\n<\/code><\/pre>\n<p>We also scrape host metrics for WireGuard\/Tailscale processes (CPU, RSS) to track encryption overhead. Under our load (200\u2013400 Mbps bursts), CPU stayed under 6% on 2 vCPU gateways with AVX support (WireGuard\u2019s ChaCha20\u2011Poly1305 relies on SIMD, not AES\u2011NI).<\/p>\n<h3><span id=\"Step_6_Terraform_the_basics\">Step 6 \u2014 Terraform the basics<\/span><\/h3>\n<p>We keep routes, tags, and ACLs in code. An example using the Tailscale provider:<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">terraform {\n  required_providers {\n    tailscale = {\n      source = &quot;tailscale\/tailscale&quot;\n      version = &quot;~&gt; 0.16&quot;\n    }\n  }\n}\n\nprovider &quot;tailscale&quot; {}\n\n# Device ID of the gateway, as shown in the admin console or API\nvariable &quot;gw_do_nyc1_device_id&quot; {\n  type = string\n}\n\nresource &quot;tailscale_acl&quot; &quot;tailnet&quot; {\n  acl = file(&quot;.\/acl.json&quot;)\n}\n\nresource &quot;tailscale_device_subnet_routes&quot; &quot;gw_do_nyc1&quot; {\n  device_id = var.gw_do_nyc1_device_id\n  routes    = [&quot;10.10.10.0\/24&quot;]\n}\n\nresource &quot;tailscale_device_tags&quot; &quot;gw_do_nyc1&quot; {\n  device_id = var.gw_do_nyc1_device_id\n  tags      = [&quot;tag:gateway&quot;]\n}\n<\/code><\/pre>\n<h3><span id=\"Observed_outcomes_Tailscale\">Observed outcomes (Tailscale)<\/span><\/h3>\n<ul>\n<li>p95 handshake time dropped from 220 ms (public IPs + NAT retries) to 26 ms across DO\u2011NYC1 \u2194 Hetzner\u2011FSN1.<\/li>\n<li>Packet loss on inter\u2011service calls fell from 0.7% to 0.05% during peak.<\/li>\n<li>Throughput on a noisy pair improved from 340 Mbps to 760 Mbps after direct UDP was established; DERP fallback rarely engaged (&lt;1% of flows).<\/li>\n<li>Error budget burn for 99.95% SLO cut from 17.3 min\/month to 1.9 min\/month over the next quarter, mostly from removing path flakiness.<\/li>\n<\/ul>\n<h2><span id=\"ZeroTier_Implementation\">ZeroTier Implementation<\/span><\/h2>\n<h3><span id=\"Step_1_Install_and_join\">Step 1 \u2014 
Install and join<\/span><\/h3>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">curl -s https:\/\/install.zerotier.com | sudo bash\nsudo zerotier-cli join &lt;NETWORK_ID&gt;\n<\/code><\/pre>\n<p>Authorize members in the controller. Assign managed IPs (e.g., 10.42.0.0\/16). For site\u2011to\u2011site, configure managed routes to your on\u2011host subnets:<\/p>\n<ul>\n<li>DO\u2011NYC1: route 10.10.10.0\/24 via gw\u2011do\u2011nyc1<\/li>\n<li>Hetzner\u2011FSN1: route 10.20.10.0\/24 via gw\u2011hz\u2011fsn1<\/li>\n<\/ul>\n<h3><span id=\"Step_2_Enable_forwarding_and_routing_on_gateways\">Step 2 \u2014 Enable forwarding and routing on gateways<\/span><\/h3>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">sudo sysctl -w net.ipv4.ip_forward=1\nsudo sysctl -w net.ipv6.conf.all.forwarding=1\n# Linux: identify ZeroTier interface, usually zt&lt;id&gt;\nip -br a | grep zt\n<\/code><\/pre>\n<p>Ensure your provider firewall allows overlay\u2011initiated traffic (you can keep public ingress closed). 
East\u2013west will ride the overlay interface.<\/p>\n<h3><span id=\"Step_3_Optional_moons_for_predictable_relay_locality\">Step 3 \u2014 Optional moons for predictable relay locality<\/span><\/h3>\n<p>In some geographies, we improved relay fallback latency by deploying a moon near our regions.<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\"># On a stable VM with static IP\nzerotier-idtool initmoon identity.public &gt; moon.json\n# Edit moon.json to set a stable reachable address\nzerotier-idtool genmoon moon.json\n# Distribute the .moon file and orbit from members\necho &quot;zerotier-cli orbit &lt;moonid&gt; &lt;moonid&gt;&quot;\n<\/code><\/pre>\n<p>Result: when direct paths fail, relay fallbacks landed closer to traffic sources, trimming p95 relay RTT from ~180 ms to ~92 ms in EMEA.<\/p>\n<h3><span id=\"Step_4_Flow_rules_policy\">Step 4 \u2014 Flow rules (policy)<\/span><\/h3>\n<p>ZeroTier rules let you express network policy at L2\/L3. A simple L3\u2011only policy that denies new TCP connections by default and allows SSH and Postgres between tagged members (the rules engine is stateless, so replies pass through the final <code>accept<\/code>):<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\"># Minimal rules. Tag names and ids are examples; ids only need to be unique.\ntag ops id 1 default 0;\ntag app id 2 default 0;\ntag db id 3 default 0;\n\n# Only IPv4, IPv6, and ARP frames may cross the network\ndrop\n  not ethertype ipv4 and not ethertype ipv6 and not ethertype arp;\n# Allow SSH from ops members\naccept\n  ipprotocol tcp and dport 22 and tseq ops 1;\n# Allow Postgres from app members to db members\naccept\n  ipprotocol tcp and dport 5432 and tseq app 1 and treq db 1;\n# Block any other new TCP connection (SYN without ACK)\nbreak\n  chr tcp_syn and not chr tcp_ack;\n# Accept the rest: ARP, ICMP health checks, replies, and non-TCP traffic\naccept;\n<\/code><\/pre>\n<p>Tagging members in the controller (e.g., <code>ops=1<\/code>, <code>app=1<\/code>, <code>db=1<\/code>) gates access. 
Keep rules human\u2011readable; they\u2019re your audit trail during incidents.<\/p>\n<h3><span id=\"Observed_outcomes_ZeroTier\">Observed outcomes (ZeroTier)<\/span><\/h3>\n<ul>\n<li>Direct path success &gt;98% after first minute; relay fallback rare.<\/li>\n<li>p95 TCP connect time stabilized at 35\u201345 ms for NYC1 \u2194 FSN1.<\/li>\n<li>Overlay throughput kept pace with Tailscale for our workloads (400\u2013700 Mbps bursts on 2 vCPU gateways).<\/li>\n<\/ul>\n<h2><span id=\"Performance_Tuning_and_Observability\">Performance Tuning and Observability<\/span><\/h2>\n<h3><span id=\"MTU_and_fragmentation\">MTU and fragmentation<\/span><\/h3>\n<p>We default to 1280 MTU on overlay interfaces to avoid PMTU gotchas through the public internet. If you control both edges and can validate, you can probe higher MTUs\u2014just don\u2019t trade consistency for a few extra Mbps on paper.<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\"># Tailscale (Linux)\nsudo ip link set dev tailscale0 mtu 1280\n\n# ZeroTier interface discovery and MTU set\nIF=$(ip -o link | awk -F': ' '\/zt[0-9a-f]+\/ {print $2; exit}')\nsudo ip link set dev &quot;$IF&quot; mtu 1280\n<\/code><\/pre>\n<h3><span id=\"Throughput_and_CPU\">Throughput and CPU<\/span><\/h3>\n<p>Quick checkpoints we log in runbooks:<\/p>\n<ul>\n<li>iperf3 between gateways (both directions).<\/li>\n<li>Per\u2011core CPU on encryption threads.<\/li>\n<li>IRQ balance and offload settings (make sure virtio\/net offloads aren\u2019t neutered).<\/li>\n<\/ul>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\"># Server\niperf3 -s\n# Client\niperf3 -c 100.100.23.10 -P 4 -t 30\n<\/code><\/pre>\n<p>Sample numbers from a DO\u2011NYC1 \u2194 Hetzner\u2011FSN1 pair (2 vCPU, 2 GB):<\/p>\n<ul>\n<li>Before (public IPs + NAT flakiness): 340\u2013430 Mbps, 0.6\u20130.9% loss spikes.<\/li>\n<li>After (overlay direct): 620\u2013780 Mbps sustained, loss &lt;0.1%.<\/li>\n<\/ul>\n<h3><span 
id=\"Latency_and_SLOs\">Latency and SLOs<\/span><\/h3>\n<p>We measure:<\/p>\n<ul>\n<li>p95\/p99 overlay RTT<\/li>\n<li>p95 TCP handshake time<\/li>\n<li>Route availability (did we lose a subnet router?)<\/li>\n<\/ul>\n<p>PromQL sketches, assuming the probes are exported as histograms with a <code>target<\/code> label (stock blackbox_exporter exposes gauges, so adapt the metric names to your exporter):<\/p>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\"># overlay_rtt_p95_ms\nhistogram_quantile(0.95, sum(rate(probe_icmp_duration_seconds_bucket{job=&quot;overlay&quot;}[5m])) by (le, target)) * 1000\n\n# handshake_p95_ms\nhistogram_quantile(0.95, sum(rate(tcp_connect_duration_seconds_bucket{job=&quot;overlay&quot;}[5m])) by (le, target)) * 1000\n<\/code><\/pre>\n<p>Operationally, we reset expectations with product on Day 1: the overlay is a reliability lever, not a speed cheat code. When it improves latency, it\u2019s usually because we eliminated retransmits and middlebox weirdness, not because encryption made packets go faster.<\/p>\n<h2><span id=\"Security_Compliance_and_Governance\">Security, Compliance, and Governance<\/span><\/h2>\n<h3><span id=\"Key_hygiene\">Key hygiene<\/span><\/h3>\n<ul>\n<li>30\u201390 day key expiry for servers; alarms for \u201cexpiring within 7 days.\u201d<\/li>\n<li>Ephemeral keys for CI runners and canaries (auto\u2011expiry within hours).<\/li>\n<li>Device approval required; no auto\u2011admit to production overlays.<\/li>\n<\/ul>\n<h3><span id=\"Least_privilege_by_default\">Least privilege by default<\/span><\/h3>\n<ul>\n<li>Segment by service role: app \u2192 db only on required ports.<\/li>\n<li>Block inter\u2011region by default; allow per service dependency.<\/li>\n<li>Keep a blocklist for known risky ports; only open with change control.<\/li>\n<\/ul>\n<h3><span id=\"Auditability\">Auditability<\/span><\/h3>\n<ul>\n<li>Log joins\/leaves, route advertisements, policy changes (ship to your SIEM).<\/li>\n<li>Daily diff of ACLs\/rules in Git; pull request reviews required.<\/li>\n<li>Quarterly key rotation fire drills.<\/li>\n<\/ul>\n<h2><span 
id=\"Runbooks_From_Zero_to_Mesh_and_Back_Again\">Runbooks: From Zero to Mesh and Back Again<\/span><\/h2>\n<h3><span id=\"Runbook_A_Bring_up_a_new_region_Tailscale\">Runbook A \u2014 Bring up a new region (Tailscale)<\/span><\/h3>\n<ol>\n<li>Provision two small gateways (2 vCPU, 2\u20134 GB) behind provider firewalls.<\/li>\n<li>Install Tailscale, enable IP forwarding.<\/li>\n<li><code>tailscale up --authkey=&lt;preauth&gt; --hostname=gw-&lt;prov&gt;-&lt;reg&gt;<\/code><\/li>\n<li>Advertise routes: <code>--advertise-routes=&lt;cidr&gt;<\/code><\/li>\n<li>Approve routes in admin; tag gateways.<\/li>\n<li>Validate connectivity from other regions: <code>ping<\/code>, <code>traceroute<\/code>, <code>iperf3<\/code>.<\/li>\n<li>Update ACLs with least\u2011privilege rules for new services.<\/li>\n<li>Push Terraform changes for routes\/tags\/ACLs; peer review before apply.<\/li>\n<li>Set alerts: route withdrawal, device offline &gt;5 min, key expiry in 7 days.<\/li>\n<\/ol>\n<h3><span id=\"Runbook_B_Bring_up_a_new_region_ZeroTier\">Runbook B \u2014 Bring up a new region (ZeroTier)<\/span><\/h3>\n<ol>\n<li>Provision two gateways and join them to the network ID.<\/li>\n<li>Authorize members, assign managed IPs.<\/li>\n<li>Add managed routes to the region CIDR; map to the gateways.<\/li>\n<li>Enable IP forwarding; confirm ZeroTier interface name.<\/li>\n<li>Apply flow rules granting the minimum required access.<\/li>\n<li>Connectivity tests and baseline measurements.<\/li>\n<li>Commit controller changes to Git (exported JSON\/rules) for audit.<\/li>\n<\/ol>\n<h3><span id=\"Runbook_C_Common_failure_modes_and_mitigations\">Runbook C \u2014 Common failure modes and mitigations<\/span><\/h3>\n<ul>\n<li>\n    <strong>Symptom:<\/strong> Route advertised but unreachable.<br \/>\n    <br \/><strong>Checks:<\/strong> <code>tailscale status --peers<\/code> or <code>zerotier-cli listpeers<\/code>; ensure forwarding enabled; confirm ACL\/rule allows path.<br \/>\n    <br 
\/><strong>Fix:<\/strong> Re\u2011announce routes; bounce service; verify provider firewall doesn\u2019t block overlay interface traffic.\n  <\/li>\n<li>\n    <strong>Symptom:<\/strong> Sudden fall back to relays; throughput tanks.<br \/>\n    <br \/><strong>Checks:<\/strong> NAT type change (provider reboot?); packet loss spike on public path.<br \/>\n    <br \/><strong>Fix:<\/strong> Restart overlay processes; verify UDP allowed outbound; consider local relay (DERP region pin or ZeroTier moon).\n  <\/li>\n<li>\n    <strong>Symptom:<\/strong> Key expiry mid\u2011deploy.<br \/>\n    <br \/><strong>Checks:<\/strong> Node event logs; CI failures.<br \/>\n    <br \/><strong>Fix:<\/strong> Rotate keys; use ephemeral keys for short\u2011lived nodes; alerting with 7\u2011day headroom.\n  <\/li>\n<li>\n    <strong>Symptom:<\/strong> Route blackhole (two gateways advertise same CIDR, asymmetric path).<br \/>\n    <br \/><strong>Checks:<\/strong> Route tables, overlay peer choice.<br \/>\n    <br \/><strong>Fix:<\/strong> Standardize route priority; in Tailscale, use primary route selection; in ZeroTier, consolidate managed routes.\n  <\/li>\n<\/ul>\n<h2><span id=\"Operational_Metrics_BeforeAfter\">Operational Metrics Before\/After<\/span><\/h2>\n<p>Across a quarter after rollout on three regions:<\/p>\n<ul>\n<li>p95 TCP handshake: 180\u2013240 ms \u2192 24\u201344 ms<\/li>\n<li>Packet loss: 0.4\u20130.9% \u2192 0.03\u20130.08%<\/li>\n<li>Failed deploys due to flaky cross\u2011region calls: 7.1% \u2192 0.6%<\/li>\n<li>Error budget burn (99.95% SLO): 17.3 min \u2192 1.9 min<\/li>\n<\/ul>\n<p>We also saw developer cycle time improve. 
Our CI jobs that hit dependencies across sites used to run with guard\u2011timers and retries; with the overlay, median runtime dropped by 14\u201322% depending on the job graph.<\/p>\n<h2><span id=\"Cost_and_Capacity_Planning\">Cost and Capacity Planning<\/span><\/h2>\n<h3><span id=\"Compute_overhead\">Compute overhead<\/span><\/h3>\n<ul>\n<li>Gateway cost: small VMs (2 vCPU) were enough up to ~800 Mbps.<\/li>\n<li>Per\u2011workload CPU overhead for overlay daemons was negligible (&lt;1\u20132%) on general servers.<\/li>\n<\/ul>\n<h3><span id=\"Network_egress\">Network egress<\/span><\/h3>\n<ul>\n<li>Direct overlay traffic still pays provider egress; we avoided central backhaul to keep costs near the theoretical minimum.<\/li>\n<li>Relay fallback can add surprise egress; we monitored relay usage and optimized NAT paths to keep it &lt;1%.<\/li>\n<\/ul>\n<h3><span id=\"Licensing_and_ops_time\">Licensing and ops time<\/span><\/h3>\n<ul>\n<li>Both tools have generous free\/paid tiers; the real spend is your time hardening policy and observability.<\/li>\n<\/ul>\n<h2><span id=\"When_Not_to_Use_an_Overlay\">When Not to Use an Overlay<\/span><\/h2>\n<p>Overlays are powerful, but not always the right tool. Consider alternatives when:<\/p>\n<ul>\n<li>You can bring native interconnects online (e.g., private interconnects, IPSec with BGP between DCs) with predictable latency and SLAs.<\/li>\n<li>You need deterministic L2 semantics with strict broadcast controls\u2014ZeroTier can do L2\u2011ish, but at scale, it\u2019s easier to use L3 with clear routes or dedicated WAN.<\/li>\n<li>Regulatory requirements mandate specific control planes you can\u2019t meet without self\u2011hosting; in that case, plan for Headscale (Tailscale) or self\u2011hosted ZeroTier controller + moons.<\/li>\n<\/ul>\n<h2><span id=\"Culture_and_OnCall_Health\">Culture and On\u2011Call Health<\/span><\/h2>\n<p>After we shipped the mesh, we wrote down two promises to ourselves:<\/p>\n<ol>\n<li>No heroics. 
If the overlay misbehaves, we roll forward or back using the runbook, not wizardry at 03:00.<\/li>\n<li>Blameless learning. Every incident gets the same respect: timeline, facts, metrics, and one thing we\u2019ll do to make it boring next time.<\/li>\n<\/ol>\n<p>Team burnout usually hides in the glue code between systems. Overlays remove a lot of that glue. But the real antidote is steady instrumentation, guardrails in code, and the psychological safety to say \u201cI don\u2019t know yet\u201d on a call.<\/p>\n<h2><span id=\"Appendix_Concrete_Config_and_Snippets\">Appendix: Concrete Config and Snippets<\/span><\/h2>\n<h3><span id=\"Systemd_health_checks_for_gateways\">Systemd health checks for gateways<\/span><\/h3>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\"># \/etc\/systemd\/system\/overlay-health.service\n[Unit]\nDescription=Overlay Health Probe\nAfter=network.target\n\n[Service]\nType=simple\nExecStart=\/usr\/local\/bin\/overlay-health.sh\nRestart=always\nRestartSec=15\n\n[Install]\nWantedBy=multi-user.target\n<\/code><\/pre>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">#!\/usr\/bin\/env bash\n# \/usr\/local\/bin\/overlay-health.sh (the shebang must stay on the first line)\nset -euo pipefail\nTARGETS=(100.100.23.10 100.77.12.5)\nwhile true; do\n  for t in &quot;${TARGETS[@]}&quot;; do\n    if ! 
ping -c1 -W1 &quot;$t&quot; &gt;\/dev\/null; then\n      logger -t overlay-health &quot;WARN: overlay target $t unreachable&quot;\n    fi\n  done\n  sleep 10\ndone\n<\/code><\/pre>\n<h3><span id=\"nftables_baseline_to_protect_gateways\">nftables baseline to protect gateways<\/span><\/h3>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">table inet overlay {\n  chain input {\n    type filter hook input priority 0; policy drop;\n    iif lo accept\n    ct state established,related accept\n    iifname &quot;tailscale0&quot; accept\n    iifname &quot;zt*&quot; accept\n    tcp dport {22} ct state new accept\n  }\n  chain forward {\n    type filter hook forward priority 0; policy drop;\n    # Let replies to overlay-initiated flows back through; without this the drop policy discards them\n    ct state established,related accept\n    iifname &quot;tailscale0&quot; oifname != &quot;tailscale0&quot; accept\n    iifname &quot;zt*&quot; oifname != &quot;zt*&quot; accept\n  }\n}\n<\/code><\/pre>\n<h3><span id=\"Connectivity_smoke_test_script\">Connectivity smoke test script<\/span><\/h3>\n<pre class=\"language-bash line-numbers\"><code class=\"language-bash\">#!\/usr\/bin\/env bash\nset -euo pipefail\nPEERS=(100.64.0.10 100.80.1.20 10.10.10.10 10.20.10.20)\nfor p in &quot;${PEERS[@]}&quot;; do\n  echo &quot;Testing $p&quot;\n  if ! 
timeout 2 bash -c &quot;&gt;\/dev\/tcp\/$p\/22&quot; 2&gt;\/dev\/null; then\n    echo &quot;FAIL: $p:22&quot;\n  else\n    echo &quot;OK: $p:22&quot;\n  fi\n  ping -c2 -W1 &quot;$p&quot; || true\n  traceroute -n -w1 -q1 &quot;$p&quot; || true\n  echo\ndone\n<\/code><\/pre>\n<h2><span id=\"Key_Takeaways\">Key Takeaways<\/span><\/h2>\n<ul>\n<li>Overlays give you stable addressing, encrypted paths, and policy you can reason about across providers.<\/li>\n<li>Tailscale is a fast path to L3 site\u2011to\u2011site via subnet routers and identity\u2011based ACLs.<\/li>\n<li>ZeroTier shines when you want controller\u2011driven networks and flexible L2\/L3 behavior.<\/li>\n<li>Keep MTU conservative, measure p95\/p99s, and alert on route health and key expiry.<\/li>\n<li>Codify everything: routes, ACLs\/rules, device tags, and health checks belong in Git.<\/li>\n<li>Practice failure: relay fallbacks, key rotations, and route withdrawals should be boring drills.<\/li>\n<\/ul>\n<h2><span id=\"Closing\">Closing<\/span><\/h2>\n<p>We didn\u2019t adopt overlays to be clever. We adopted them because they let us say \u201cyes\u201d to multi\u2011provider without trading away reliability. With a private overlay network built on Tailscale or ZeroTier, you can ship a site\u2011to\u2011site mesh in days, observe it in hours, and stop apologizing for the internet in front of your SLOs. Start small, tag ruthlessly, measure honestly, and make your post\u2011mortems a little shorter this quarter.<\/p>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>\u0130&ccedil;indekiler1 Private Overlay Networks with Tailscale\/ZeroTier: Multi\u2011Cloud Mesh1.1 The Incident That Triggered the Mesh1.2 What Is a Private Overlay Network?1.2.1 Tailscale vs. 
ZeroTier: Pragmatic Differences1.3 Reference Architecture: Site\u2011to\u2011Site Mesh Across Multi\u2011Provider VPS1.4 Tailscale Implementation1.4.1 Step 1 \u2014 Org setup and guardrails1.4.2 Step 2 \u2014 Install and enroll nodes1.4.3 Step 3 \u2014 Advertise routes (subnet router)1.4.4 [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":2077,"comment_status":"","ping_status":"","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[26],"tags":[],"class_list":["post-2076","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-teknoloji"],"_links":{"self":[{"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/posts\/2076","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/comments?post=2076"}],"version-history":[{"count":0,"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/posts\/2076\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/media\/2077"}],"wp:attachment":[{"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/media?parent=2076"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/categories?post=2076"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.dchost.com\/blog\/en\/wp-json\/wp\/v2\/tags?post=2076"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}