
Geo and Weighted DNS Without Tears: How I Route Traffic with Cloudflare, Route 53, and a Dash of Split‑Horizon

So there I was, late one Thursday, staring at a map of pings like it was a strange weather forecast. Europe was a neat green, Asia looked like a storm moving in, and US West was the weird patch of yellow nobody could explain. Ever had that moment when your site is fast here but mysteriously sluggish there? That was me, sipping cold coffee and trying to nudge traffic to the right regions without breaking anything else. The fix wasn’t one magic switch. It was a handful of routing ideas working together: geo DNS to send people to the closest region, weighted DNS to control the rollout like a dimmer switch, and split‑horizon to keep internal traffic private and predictable.

In this post, I’ll walk you through how I think about advanced DNS routing using Cloudflare and Amazon Route 53. We’ll chat about geo and weighted policies, where they shine and where they bite, plus a practical view on split‑horizon that doesn’t make you want to run from the room. I’ll share the missteps I’ve made and the tricks I keep reusing. If DNS ever felt like a black box, I promise it can feel more like a friendly control panel by the end of this.

The mental model: DNS as traffic director, not just a phonebook

Years ago, I used to think DNS was just a phonebook mapping names to IPs. Then global audiences happened. You spin up a second region, add a CDN, sprinkle in a health check, and suddenly DNS is less phonebook and more air traffic controller with a headset. Here’s the thing: even though DNS is simple at its core, modern providers let you layer logic on top. It’s like telling the controller, ‘If the runway in Frankfurt is busy, land in Amsterdam; but if we’re testing new engines, send 10% to Dublin first.’

In my experience, the first lightbulb moment is understanding where the decision is made. Authoritative DNS (Cloudflare DNS, Route 53) decides what answer to give, but that decision is based on signals it sees, like the resolver’s location or your policy weights. Clients rarely talk to your authoritative DNS directly; their local resolver (often their ISP or a public resolver) does. That’s why geolocation can be a little approximate: we’re often guessing a user’s location based on their resolver. Many resolvers pass helpful hints via EDNS Client Subnet, but you should assume some fuzziness and test from multiple points.

And then there’s TTL. Set it too long and changes take ages to show up, because resolvers keep serving the cached answer until it expires. Set it too short and you increase query load and make caches less helpful. I usually find a rhythm: use shorter TTLs during migrations or rollouts, then relax slightly once things stabilize. DNS is all about finding your balance of speed, safety, and sanity.

Geo DNS: steer users to the closest healthy home

Geo DNS is the friendly bouncer at a club who says, ‘No need to cross town, there’s a great spot right here.’ The idea is simple: answer DNS queries with region‑specific IPs so users land closest to your infrastructure. With Cloudflare’s load balancing and geo steering, you can define pools for regions and route based on the resolver’s location. Route 53 offers geolocation policies plus geoproximity routing, which routes on the distance between users and your resources and lets you nudge the borders with a bias value if you want to pull traffic toward one region.
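To make that concrete, here’s a minimal sketch of geolocation records in Route 53 using boto3. The zone ID, IPs, and record name are hypothetical placeholders; the shape of the call is the part that matters.

```python
import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0123456789EXAMPLE"  # hypothetical hosted zone ID

def geo_record(location, ip, identifier):
    """Build one geolocation record for app.example.com."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "SetIdentifier": identifier,
            "GeoLocation": location,
            "TTL": 60,  # keep it short while you're still tuning boundaries
            "ResourceRecords": [{"Value": ip}],
        },
    }

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={
        "Changes": [
            geo_record({"ContinentCode": "EU"}, "198.51.100.10", "eu"),
            geo_record({"ContinentCode": "NA"}, "203.0.113.10", "us"),
            # Always define a default ("*") record so resolvers that
            # Route 53 can't geolocate still get an answer.
            geo_record({"CountryCode": "*"}, "203.0.113.10", "default"),
        ]
    },
)
```

That default record matters more than it looks: without it, users whose resolvers can’t be placed on the map get no answer at all.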

Let me tell you about a migration I did for a content-heavy site with audiences in Europe, the US, and Southeast Asia. We had three clusters behind regional load balancers. Without geo DNS, some users were bouncing across continents because of how their ISP resolvers behaved. We turned on geo routing, created pools, and set up health checks. Most of the audience snapped to their closest region. The surprises came from corporate networks where the resolver sat in a different country than the actual user. That’s when EDNS Client Subnet hints helped, but not every resolver passes it along, so you should assume there will always be a small fraction who get a slightly less optimal path.

If you’re using Cloudflare for DNS and want to keep using origin load balancers in each region, their load balancing layer is straightforward to set up and comes with health checks and session affinity options. If you’re on AWS, Route 53’s geolocation policy works nicely with regional ALBs or NLBs, and pairing it with Route 53 health checks gives you graceful failover. Either way, keep your TTLs reasonable during rollout, look at the analytics for the first week, and expect to tweak region boundaries at least once.

One more thing I see a lot: mixing geo DNS with a CDN and assuming the CDN’s own edge logic will save you from a bad DNS answer. Sometimes it will, sometimes it won’t. If your DNS points users across the ocean, the CDN might still do a decent job, but you’ll carry extra latency to the origin that you could have avoided. The cleanest approach is to let each layer do its job: DNS steers traffic to the right neighborhood, and the CDN makes it snappy at the curb.

If you’re curious about nuts and bolts, check the provider docs for their exact terms and settings. Cloudflare has a clear overview of their load balancing and geo steering features here: Cloudflare Load Balancing docs. For AWS folks, the patterns live under Route 53’s routing policies: Amazon Route 53 routing policies.

Weighted DNS: the dimmer switch for rollouts and traffic shaping

Weighted DNS is my favorite tool for canaries and blue/green deploys. It’s basically saying, ‘Send 5% here, 95% there,’ and gradually ramping until you’re confident. I remember a launch where we suspected a subtle memory leak in a new API path. Instead of bracing for impact, we funneled 10% of traffic to the new pool via weighted routing and let it run for a day. We watched metrics, bumped to 25%, and only when it behaved did we go to 50% and beyond. Nobody noticed, and we slept great.

On Cloudflare, you can assign weights to origin pools within a load balancer. On Route 53, it’s a weighted routing policy—multiple records for the same name, each with a weight, and health checks to automatically remove bad ones from the rotation. The joy of this setup is the control. The gotcha is state. If your app maintains sticky sessions (either at the load balancer or in the app), weighted DNS won’t respect stickiness across queries. You may find users hopping between backends if the rest of your stack isn’t designed for it. I either disable stickiness temporarily during weighted rollouts or use a layer that supports consistent routing for the duration of a session.
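Here’s what that looks like on Route 53, again as a hedged sketch with hypothetical zone and health check IDs. Two records share the same name; the Weight values are relative rather than percentages, so 95 and 5 behave like a 95/5 split.

```python
import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0123456789EXAMPLE"  # hypothetical

def weighted_record(identifier, ip, weight, health_check_id=None):
    rrset = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": identifier,
        "Weight": weight,  # relative value; 95 and 5 approximate a 95%/5% split
        "TTL": 60,         # keep short while you're moving the dial
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        # Unhealthy members drop out and traffic redistributes automatically.
        rrset["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": rrset}

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID,
    ChangeBatch={"Changes": [
        weighted_record("stable", "203.0.113.10", 95, "hc-stable-id"),
        weighted_record("canary", "203.0.113.20", 5, "hc-canary-id"),
    ]},
)
```

Ramping up is just re-running the same UPSERT with new weights, which is what makes this such a comfortable dial to turn.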

Weighted DNS is also a lifesaver for capacity management. Imagine you stood up a new region and don’t yet trust it at 100%. Give it 20% weight to warm caches and catch issues in daylight hours. Or if your east‑coast region needs breathing room during a promo, tip a bit more traffic to central. It’s a dial you can turn, not a switch that shocks.

One practical tip: treat TTL as part of your rollout plan. If you need near‑real‑time control, keep TTLs short while you’re adjusting weights. After you land on a stable split, extend TTLs a little so resolvers can help you with caching. And always keep health checks tight. Weighted routing with bad health checks is like cruise control with a broken speedometer—you’ll be slow to react when a node or a region starts wobbling.

Split‑horizon DNS: public outside, private inside, and zero drama in between

Split‑horizon (sometimes called split‑view) DNS is one of those ideas that makes you feel fancy until you actually need it, and then you realize it’s just a sensible way to show different answers to internal and external users. Picture this: employees in the office should use a private IP for the app so traffic stays on the internal network. Everyone else on the internet should hit the public IP behind a reverse proxy or CDN. You could try to make the network do acrobatics, or you could let DNS present different records based on who’s asking.

In AWS, Route 53 private hosted zones make this clean. Your VPC’s resolvers will answer with private IPs for internal users, while the public hosted zone serves public records to everyone else. It’s a split personality that behaves. On Cloudflare, classic public DNS is their bread and butter, while private names can be handled through their Zero Trust stack, with Gateway resolving internal hostnames for connected devices and networks. In the self‑managed world, I’ve used BIND and CoreDNS to maintain separate views: one view for internal networks, another for public. The trick is operational discipline: keep records aligned and automate the boring parts.
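On the Route 53 side, the private view is just a second hosted zone attached to a VPC. A rough sketch, with a hypothetical VPC ID and a name that also exists in the public zone with a public address:

```python
import time
import boto3

route53 = boto3.client("route53")

# Create the internal view: a private hosted zone bound to one VPC.
private_zone = route53.create_hosted_zone(
    Name="example.com",
    CallerReference=str(time.time()),  # must be unique per request
    HostedZoneConfig={"Comment": "internal view", "PrivateZone": True},
    VPC={"VPCRegion": "eu-west-1", "VPCId": "vpc-0abc123example"},
)

# Inside the VPC, app.example.com resolves to a private RFC 1918 address;
# everyone else keeps getting the public record from the public zone.
route53.change_resource_record_sets(
    HostedZoneId=private_zone["HostedZone"]["Id"],
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "app.example.com",
            "Type": "A",
            "TTL": 60,
            "ResourceRecords": [{"Value": "10.0.12.34"}],
        },
    }]},
)
```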

I once helped a team that had a staging cluster only reachable via VPN. They wanted devs on the VPN to resolve app.example.com to a private RFC1918 address, while everyone else saw the production site. They tried to hack it with hosts files (we’ve all done it), and predictably, it became a mess. We set up a private hosted zone for the internal view and kept production in the public zone. Devs connected to the VPN got clean, private responses. The rest of the world was none the wiser.

Watch out for recursive resolver behavior, though. If a laptop sometimes uses the office DNS and sometimes the hotel Wi‑Fi DNS, caches can linger. Short TTLs help during transitions, and split‑DNS works best when devices are clearly on one side or the other. Also be mindful with DNSSEC; signing public zones is routine, but private zones and split views need careful planning so you don’t create validation headaches outside your trust boundary.

If you’re at the stage of running your own nameservers and need a refresher on the fundamentals, I wrote a friendly step‑by‑step that many readers found helpful: The Friendly Guide to Private Nameservers and Glue Records: Step‑by‑Step Setup. It pairs nicely with split‑horizon thinking because you’ll understand exactly where your answers are coming from.

Blending geo, weighted, and failover: the patterns I keep using

Now for the fun part—putting it all together. Real systems don’t live in neat boxes. Traffic surges, regions wobble, and someone launches a marketing campaign without telling you. The best routing setups I’ve seen and used tend to blend policies in layers, keeping each layer focused and predictable.

My go‑to pattern for global apps is geo routing as the primary layer. Each region gets its own pool, and health checks mark members in or out. Inside each region, I like a second layer at the load balancer that can shift between old and new backends gradually—basically weighted routing at the origin or application gateway level. If I want even finer control globally, I’ll add weighted DNS on top of geo for a short time. For example, say you want to launch a new EU cluster and start at 10%. Keep geo boundaries intact so Europeans still hit Europe, but shift a small percentage within Europe to the new cluster using weights. When you’re happy, dial to 100% within that region. You’re changing just one variable at a time, which keeps it sane.
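In Route 53 terms, that layering can be expressed by chaining records: a geolocation CNAME keeps Europeans on a region-level name, and weighted records under that name do the split. Everything below (zone ID, names, IPs, the 90/10 ratio) is a hypothetical sketch of the pattern, not a prescription.

```python
import boto3

route53 = boto3.client("route53")
ZONE_ID = "Z0123456789EXAMPLE"  # hypothetical

changes = [
    # Layer 1: geo keeps Europeans in Europe, pointing at a region-level name.
    {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "app.example.com", "Type": "CNAME",
        "SetIdentifier": "geo-eu",
        "GeoLocation": {"ContinentCode": "EU"},
        "TTL": 60, "ResourceRecords": [{"Value": "eu.app.example.com"}],
    }},
    # Layer 2: inside Europe, weights split the old and new clusters 90/10.
    {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "eu.app.example.com", "Type": "A",
        "SetIdentifier": "eu-old", "Weight": 90,
        "TTL": 60, "ResourceRecords": [{"Value": "198.51.100.10"}],
    }},
    {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "eu.app.example.com", "Type": "A",
        "SetIdentifier": "eu-new", "Weight": 10,
        "TTL": 60, "ResourceRecords": [{"Value": "198.51.100.20"}],
    }},
]

route53.change_resource_record_sets(
    HostedZoneId=ZONE_ID, ChangeBatch={"Changes": changes}
)
```

Dialing the new cluster up later means touching only the two weights, which is exactly the one-variable-at-a-time property you want.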

I did this during a holiday sale for an e‑commerce client. Their US‑East region was melting from a viral TikTok that none of us saw coming. We kept US users landing in the US via geo DNS, but shifted 25% of that traffic to a US‑Central pool that had headroom. Meanwhile, we left Europe and Asia untouched. We paired this with application‑level circuit breakers so if a database shard got grumpy, it gave us backpressure instead of smoke. The result: the sale ran, orders flowed, and we watched dashboards with the weird combination of panic and pride we all know.

Failover is the safety net underneath all of this. Whether you’re in Cloudflare or Route 53, make your health checks honest. Probe the real dependency chain—don’t just check a 200 on a ‘/healthz’ endpoint that doesn’t touch the database. If you’re doing geo, define fallback pools. If you’re doing weighted, let failed members drop out automatically and redistribute traffic quickly. And don’t forget the human factor: document the runbook. Nothing is worse than waking up at 3 a.m. and trying to recall which weight to flip where.
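As an illustration of an ‘honest’ health endpoint, here’s a tiny sketch using Python’s standard library. The database check is a hypothetical stand-in; the point is that the endpoint only returns 200 when the real dependency works, so your DNS health check fails over on genuine outages rather than cosmetic ones.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

def database_is_reachable() -> bool:
    """Hypothetical check: replace with a cheap real query, e.g. SELECT 1."""
    try:
        # e.g. run "SELECT 1" against your primary here
        return True
    except Exception:
        return False

class HealthzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_response(404)
            self.end_headers()
            return
        # Report healthy only when the dependency chain actually works.
        status = 200 if database_is_reachable() else 503
        self.send_response(status)
        self.end_headers()
        self.wfile.write(b"ok" if status == 200 else b"db unreachable")

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthzHandler).serve_forever()
```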

Testing, gotchas, and the craftsmanship mindset

It’s tempting to set policies, take a victory lap, and go to bed. But DNS has quirks, and most of them show up during testing. I like to think in layers. Layer one: validate the answer. Use dig or kdig against your authoritative servers and public resolvers, and compare. Check a few places: a workstation on your network, a cloud VM in another region, and at least one mobile network. Layer two: validate the path. Curl the endpoint, look at headers, and check any X‑trace or region indicators you expose. Layer three: validate the user experience. Load the site on a real device on a real network, because the last 10% of weirdness only shows up there.
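For layer one, a trick I like with weighted records is sampling the authoritative server directly and tallying the answers; over enough queries the distribution should roughly match your weights. A sketch with dnspython, using a hypothetical nameserver IP and record name:

```python
from collections import Counter

import dns.resolver  # pip install dnspython

# Point straight at one of your authoritative nameservers so every query
# is answered fresh instead of being served from a recursive cache.
resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["205.251.192.1"]  # hypothetical: a Route 53 NS for your zone

tally = Counter()
for _ in range(200):
    answer = resolver.resolve("api.example.com", "A")
    tally[answer[0].address] += 1

for ip, count in tally.most_common():
    print(f"{ip}: {count / 200:.0%}")  # should roughly match your weights
```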

EDNS Client Subnet is a classic source of confusion. Some resolvers pass a client subnet hint to your authoritative DNS so it can make a better geo decision. Others don’t. And some public resolvers make different choices depending on privacy modes. The net effect: expect a little noise in your geo distribution. If you want a deeper dive, Cloudflare has a solid explainer on the topic that I often share with teams: EDNS Client Subnet — the good, the bad, the unexpected.
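You can also probe ECS behavior yourself by sending the subnet hint explicitly. This dnspython sketch asks the same (hypothetical) authoritative server the same question while pretending the client sits in different subnets, which is a quick way to see whether your geo answers actually change with the hint:

```python
import dns.edns
import dns.message
import dns.query  # pip install dnspython

AUTH_NS = "205.251.192.1"  # hypothetical authoritative nameserver IP

# Two hypothetical client prefixes, e.g. one European, one Australian.
for prefix in ["81.2.69.0", "203.119.101.0"]:
    ecs = dns.edns.ECSOption(prefix, 24)  # announce a /24 client subnet
    query = dns.message.make_query(
        "app.example.com", "A", use_edns=0, options=[ecs]
    )
    response = dns.query.udp(query, AUTH_NS, timeout=5)
    answers = [r.to_text() for rrset in response.answer for r in rrset]
    print(prefix, "->", answers)
```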

Here are the snags I see most often and how I tiptoe around them:

First, CNAMEs at the apex. The DNS standard doesn’t allow a CNAME at a zone’s root, so providers work around it. Route 53 solves this with alias records to AWS resources. Cloudflare does CNAME flattening at the apex: you configure what looks like a CNAME, and their servers resolve the target themselves and answer with an A or AAAA. The key is understanding what your provider does under the hood so you get the convenience without breaking standards.
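A quick sketch of the Route 53 alias pattern, with hypothetical IDs. Note that the AliasTarget zone ID belongs to the load balancer’s region, not to your own hosted zone:

```python
import boto3

route53 = boto3.client("route53")

# Alias the zone apex to an ALB. The record stays a standards-compliant A
# record; Route 53 resolves the target's current IPs at query time.
route53.change_resource_record_sets(
    HostedZoneId="Z0123456789EXAMPLE",  # hypothetical: your zone
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "example.com",
            "Type": "A",
            "AliasTarget": {
                "HostedZoneId": "Z32O12XQLNTSW2",  # the ALB's zone ID, not yours
                "DNSName": "my-alb-1234567890.eu-west-1.elb.amazonaws.com",
                "EvaluateTargetHealth": True,
            },
        },
    }]},
)
```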

Second, negative caching. If you delete a record or flip policies, some resolvers will cache the non‑existence (NXDOMAIN) for a while, governed by your zone’s SOA record (per RFC 2308, the lower of the SOA’s own TTL and its MINIMUM field). That can make it feel like your new record isn’t showing up when, in fact, a resolver is confidently remembering that you once told it ‘no such name.’ During migrations, keep an eye on that and perhaps stage changes rather than yank records abruptly.
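If you want to know how long resolvers may remember a ‘no’, the SOA tells you. A small dnspython check:

```python
import dns.resolver  # pip install dnspython

# The negative-caching window per RFC 2308 is the lower of the SOA
# record's own TTL and its MINIMUM field.
answer = dns.resolver.resolve("example.com", "SOA")
soa = answer[0]
print("SOA TTL:", answer.rrset.ttl)
print("MINIMUM field:", soa.minimum)
print("NXDOMAIN cached up to:", min(answer.rrset.ttl, soa.minimum), "seconds")
```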

Third, latency mismatches with edge and origin. If you’re using a CDN or Cloudflare’s reverse proxy, you might see lovely TTFB locally but random spikes for users in specific countries. Often that isn’t the edge; it’s the path to the origin. Geo DNS that points users to an origin region near their edge location helps a lot. If you’re proxying through a global network, try aligning your DNS steering so the edge and the origin don’t argue about geography.

Fourth, testing from your desk. It sounds silly, but local DNS caches, VPNs, and corporate resolvers will give you a skewed view. I keep a small stable of test environments: a cheap VM in another continent, a tethered phone, and a friend in a different country who is weirdly happy to copy‑paste dig commands for me. There are online tools to check propagation, and they’re useful sanity checks, but I prefer to validate with a few hands‑on endpoints too.

Finally, observability. Both Cloudflare and AWS give you insights. Cloudflare’s load balancing analytics show pool health and steering. Route 53 can log queries to CloudWatch, and health checks give colorful graphs that tell you if your app went peach‑colored or full red. Marry those with your application metrics, and you’ll diagnose routing problems faster than you think. Good logs and good dashboards aren’t optional once you start using advanced DNS routing—they’re your headlights.

One more practical thought: don’t forget IPv6. Modern resolvers and clients will happily use AAAA records if you publish them. If your infra supports it (and it probably does more than you think), turning up IPv6 alongside IPv4 makes routing and performance more resilient. If you want a no‑drama refresher on the topic, I’ve written about gently rolling out AAAA and testing it end‑to‑end—worth a weekend project before your next big launch.
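A quick end-to-end sanity check is easy to script: confirm the AAAA record exists, then confirm the address actually accepts IPv6 connections. A sketch with dnspython, where the hostname is a placeholder and the connect test needs working IPv6 wherever you run it:

```python
import socket

import dns.resolver  # pip install dnspython

HOST = "app.example.com"  # hypothetical name

for rdata in dns.resolver.resolve(HOST, "AAAA"):
    addr = rdata.address
    print("AAAA:", addr)
    # Is the published address actually reachable over IPv6 on port 443?
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_STREAM) as s:
            s.settimeout(5)
            s.connect((addr, 443))
            print("  reachable over IPv6")
    except OSError as exc:
        print("  not reachable:", exc)
```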

A friendly, realistic workflow you can reuse

When I help teams set this up, the workflow is usually the same pattern, tuned to their stack. It goes a little like this:

Start by writing down your regions and what you want to achieve. If it isn’t on paper, it’s a guess. Are you trying to reduce latency for APAC, or are you trying to make rollouts safer, or both? Once you know the goal, decide your primary knob: geo routing for speed and locality, weighted for control and safety. You can use both, but one should lead.

Define health checks with real signals. Have them touch the database if that’s your bottleneck, or the external API if that often flakes. Set frequency and thresholds that match your tolerance; too sensitive and you’ll flap, too lazy and you’ll burn users before failover kicks in.
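In Route 53 those knobs map directly onto the health check config. A hedged sketch with a hypothetical endpoint; a 30-second interval with a failure threshold of 3 means roughly a minute and a half before a member is marked unhealthy, so tighten or relax to match your tolerance:

```python
import time

import boto3

route53 = boto3.client("route53")

hc = route53.create_health_check(
    CallerReference=str(time.time()),  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "eu.app.example.com",  # hypothetical
        "Port": 443,
        "ResourcePath": "/healthz",  # make this endpoint touch the database
        "RequestInterval": 30,       # seconds between probes (10 or 30)
        "FailureThreshold": 3,       # consecutive failures before "unhealthy"
    },
)
print("health check id:", hc["HealthCheck"]["Id"])
```

Attach the returned ID to your weighted or geo records via HealthCheckId, and failed members drop out of rotation on their own.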

Turn it on in a small way first. For geo, start with one region’s audience and see if your analytics show the intended shift. For weighted, begin with 5% to the new pool and watch. If your users are global, test from multiple eyeballs. I like to run a short‑TTL phase for a few days, then raise TTLs when I’m confident.

Document the dials you might turn at 3 a.m. If US‑East is hot, which weight do you change? If Europe goes dark, which fallback pool takes over? If your CDN throws tantrums in one country, what’s your temporary DNS workaround? The more familiar the knobs feel, the calmer you’ll be when you need them.

And because we’re all human: schedule a quiet, deliberate revisit a week later. Look at metrics, look at costs, look at incident logs. Do you need to rebalance? Are there hidden long‑tail latencies somewhere odd? Small course corrections after the first week are cheap and save a ton of headaches later.

Wrap‑up: make DNS your friend, not a rumor mill

The longer I work with global apps, the more I treat DNS like a helpful colleague—one who’s not flashy but keeps everything flowing. Geo DNS gets users to the closest healthy region without you shouting in Slack. Weighted DNS lets you roll out changes with confidence instead of crossed fingers. Split‑horizon keeps internal paths private and fast while public traffic goes where it should. Put those together, and your routing story starts to feel calm, even on the days when traffic does something exciting.

If you’re just getting into this, start with one simple win. Maybe you enable a weighted canary for your next release and keep TTLs short. Or you set up geo routing for a region that’s always been a little laggy. Or you finally give your internal users a private split‑view so they stop hair‑pinning out to the internet and back. Each small improvement translates into happier users and fewer late‑night mysteries.

And hey, if you go down this road and hit an odd detour, you’re in good company. We’ve all had the moment where the map of pings looks like modern art. The difference is, once you’ve got geo, weights, and split‑horizon in your toolkit, you’ll know exactly which dial to turn. Hope this was helpful! See you in the next post—and may your health checks stay green and your TTLs be exactly as short as you need.

Frequently Asked Questions

What’s the difference between geo DNS and weighted DNS?

Great question. Geo DNS chooses an answer based on where the resolver appears to be, steering users to a nearby region for lower latency. Weighted DNS ignores location and splits traffic by percentages, which is perfect for gradual rollouts or load shaping. In practice, I use geo to send people to the closest healthy home and weights to control risk when I’m deploying or balancing capacity.

Do I still need geo DNS if I’m already using a CDN?

They solve different problems. Geo DNS gets users to the right origin region. A CDN accelerates delivery by caching content at the edge and smoothing out last‑mile hiccups. If your app is dynamic, geo DNS still helps reduce origin latency, while a CDN helps with assets and TLS termination. I like using both, but I let DNS do the steering and the CDN do the accelerating.

How do I test geo or weighted routing changes safely?

I keep TTLs short during tests, then I validate from multiple vantage points: a different region VM, a mobile network, and a corporate network if possible. For weighted, start tiny, 5% or 10%, and watch app metrics closely. For geo, check that users in a target region actually land in the intended pool. Use provider analytics and application logs together so you can see both the DNS answer and the real user experience.