Technology

The Playbook I Use to Keep a VPS Calm: Prometheus + Grafana + Node Exporter for Alerts That Actually Help

So picture this: it’s late, I’m half a mug deep into a lukewarm coffee, and a client’s site is crawling like it’s stuck in syrup. We’ve all been there—tabs open everywhere, htop running, and that quiet panic of not knowing what changed. Ever had that moment when a VPS feels moody, and you swear nothing’s different… but everything is slower? That night reminded me why I stopped guessing and started measuring. Not with random scripts, but with a setup I trust: Prometheus, Grafana, and Node Exporter. It’s like putting a stethoscope on your server, except it actually talks back.

In this guide, I’ll walk you through how I set up a lightweight monitoring stack for CPU, RAM, Disk I/O, and uptime alerts that don’t spam me into ignoring them. I’ll show you the exact Prometheus rules I use, how I wire Alertmanager for notifications, and how I shape Grafana dashboards so they read like a story, not an eye chart. The goal isn’t just to have pretty graphs—it’s to get calm, timely alerts and clear visibility so you can fix problems before users feel them. Let’s make your VPS a little less mysterious.

Why Monitoring Before You Need It Is the Best Kind of Insurance

I remember a project where page loads sporadically jumped from half a second to five seconds. It wasn’t constant, and nothing obvious showed up in logs. You know that feeling: you’re staring at a screen thinking, ‘Is it CPU, memory, disk… or the network?’ Here’s the thing—without metrics, troubleshooting becomes a guessing game. With metrics, the story writes itself. High CPU ready or steal? That points to noisy neighbors or under-provisioned vCPUs. RAM pressure and swap creeping up? That’s your application telling you it’s hungry. Disk I/O flooded or iowait spiking? Your database or a backup script probably took a big bite. And if the server just disappears from the map, you want to know instantly and confidently, not ten minutes later because a customer emailed first.

Prometheus, Grafana, and Node Exporter are my go-to trio because they’re simple, honest, and fast. Prometheus pulls metrics in plain text. Node Exporter exposes what the host is feeling—CPU, memory, disks, filesystems, and more. Grafana turns those metrics into visual cues your brain can digest in a second. Think of it like a car dashboard: a glance tells you your speed, fuel, and temperature. You don’t need a thesis, you need a nudge at the right time. That’s what a good monitoring setup does—it nudges, it doesn’t nag.

Meet the Stack: Prometheus, Node Exporter, Alertmanager, and Grafana

Here’s the quick mental model I use. Prometheus is the historian. Every few seconds it asks your VPS how it’s doing and writes the answers as time‑stamped data. Node Exporter is the translator living on the VPS, speaking in straightforward numbers about CPU, RAM, disk, and network. Alertmanager is the messenger—when a rule fires, it knows who to notify and how to keep things sane with grouping and silence windows. And Grafana is the storyteller, giving you clear, customizable dashboards so your eyes can spot trends before they become fires.

I like to keep Prometheus and Grafana together on a small monitoring VM. Node Exporter runs on each VPS you care about. Prometheus scrapes them over your private network or a firewall‑pinned port. It’s lightweight enough that even modest servers barely notice it’s there, and it scales surprisingly well for most small to mid‑sized fleets. If you ever want long retention or heavy historical analysis, that’s when you look at external storage or remote write—save that thought for later. Start small, start clean, and let your alerts pay for themselves in peace of mind.

If you want to go deeper later, the official docs are clear and practical. I often keep the Prometheus alerting docs, the Grafana documentation, and the Node Exporter repository close at hand while I’m setting things up.

Installing Node Exporter on Your VPS (The Gentle Way)

Let’s start where the data lives—your VPS. Node Exporter is the tiny agent that lets Prometheus read system metrics. The rhythm I follow is simple: create a system user, drop the binary, run it as a service, and make sure only your monitoring server can talk to it. Keep it boring and secure.

Step 1: Create a user and install Node Exporter

I usually grab the latest release from the official repo and set up a systemd service. It looks like this:

# As root or with sudo
useradd --no-create-home --shell /usr/sbin/nologin nodeexp
mkdir -p /opt/node_exporter
# Download the latest release tarball for your architecture
# Example shown for Linux x86_64; check the repo for current version
cd /tmp
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.8.1/node_exporter-1.8.1.linux-amd64.tar.gz
tar -xzf node_exporter-1.8.1.linux-amd64.tar.gz
mv node_exporter-1.8.1.linux-amd64/node_exporter /opt/node_exporter/
chown -R nodeexp:nodeexp /opt/node_exporter

Step 2: Create a systemd service

I like to be explicit about which collectors run. Most defaults are safe, and you can adjust later if you want extra metrics.

cat > /etc/systemd/system/node_exporter.service <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=nodeexp
Group=nodeexp
Type=simple
ExecStart=/opt/node_exporter/node_exporter \
  --web.listen-address=:9100 \
  --collector.textfile.directory=/var/lib/node_exporter/textfile

Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

mkdir -p /var/lib/node_exporter/textfile
chown -R nodeexp:nodeexp /var/lib/node_exporter
systemctl daemon-reload
systemctl enable --now node_exporter
systemctl status node_exporter

Step 3: Firewall the port

Prometheus will scrape port 9100. Don’t expose it to the world. Allow only your monitoring server’s IP.

# Example with ufw
ufw allow from <PROMETHEUS_IP> to any port 9100 proto tcp
# Or with firewalld
firewall-cmd --permanent --add-rich-rule='rule family="ipv4" source address="<PROMETHEUS_IP>" port port="9100" protocol="tcp" accept'
firewall-cmd --reload

That’s it for the agent. Keep it simple and forget it’s even there. If you prefer containers, Node Exporter runs great under Docker too—same idea, just map the host namespaces and the 9100 port, then firewall accordingly.
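
If you go that route, here’s a minimal sketch along the lines of what the Node Exporter README suggests; treat the image tag and mount flags as things to double-check against the current docs:

docker run -d --name node_exporter \
  --net="host" --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
# Still firewall port 9100 so only your Prometheus server can reach it.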

Prometheus and Alertmanager: Turning Raw Numbers into Calm, Useful Alerts

Now we give those numbers a home and a voice. Prometheus scrapes, stores, and evaluates alert rules. Alertmanager sends the messages and keeps them sane. I’m a fan of installing them on a small dedicated VM so they don’t compete with your app.

Step 1: Install Prometheus

Create a user and directories, drop the Prometheus binaries into place, add a config, and run it as a service. Here’s a minimal but friendly configuration that scrapes itself and a couple of VPS instances:

mkdir -p /etc/prometheus /var/lib/prometheus
useradd --no-create-home --shell /usr/sbin/nologin prometheus
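
# Grab the Prometheus release tarball and put the binaries in /opt/prometheus.
# Version shown is an example; check the Prometheus releases page for the current one.
cd /tmp
curl -LO https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar -xzf prometheus-2.53.0.linux-amd64.tar.gz
mkdir -p /opt/prometheus
mv prometheus-2.53.0.linux-amd64/prometheus prometheus-2.53.0.linux-amd64/promtool /opt/prometheus/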

# prometheus.yml
cat > /etc/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alerts/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'nodes'
    static_configs:
      - targets: ['10.0.0.11:9100']
        labels:
          instance: 'web-1'
      - targets: ['10.0.0.12:9100']
        labels:
          instance: 'db-1'
EOF

Make sure file ownerships belong to the Prometheus user, then run it with a systemd service. Retention is worth choosing deliberately; the unit below keeps 15 days, which is plenty while you tune alerts, and you can bump it later if you want longer history.
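
For the ownership part, something like this covers the paths used above (adjust if you placed the binary somewhere else):

chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus /opt/prometheus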

cat > /etc/systemd/system/prometheus.service <<'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --web.listen-address=:9090 \
  --storage.tsdb.retention.time=15d

Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now prometheus

Step 2: Write alert rules that mean something

This is the heart of it. Alerts should announce a problem you can act on, not a curiosity. I like to start with CPU saturation, memory pressure, disk space, disk I/O wait, and host down. Time windows matter—adding a ‘for’ window keeps flapping to a minimum by making the condition hold for a while before the alert fires. One structural note: Prometheus expects rule files to be wrapped in a groups block, so that’s how the file below starts.

mkdir -p /etc/prometheus/alerts

cat > /etc/prometheus/alerts/host.yml <<'EOF'
groups:
  - name: host-alerts
    rules:
      # CPU saturation (warning level: non-idle CPU high)
      - alert: HighCPU
        expr: avg by (instance) (1 - rate(node_cpu_seconds_total{job="nodes",mode="idle"}[5m])) > 0.7
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: 'High CPU on {{ $labels.instance }}'
          description: 'Avg CPU usage > 70% over 5m. Check processes and load.'

      # Same signal, critical threshold
      - alert: HighCPUCritical
        expr: avg by (instance) (1 - rate(node_cpu_seconds_total{job="nodes",mode="idle"}[5m])) > 0.85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'CPU near saturation on {{ $labels.instance }}'
          description: 'Non-idle CPU > 85% for 5m.'

      # Memory pressure (available below a threshold)
      - alert: LowMemory
        expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Low memory on {{ $labels.instance }}'
          description: 'Available memory < 10% for 10m. Consider leaks, caches, or limits.'

      # Disk space (free below threshold)
      - alert: LowDiskSpace
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay",mountpoint!~"/run"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay",mountpoint!~"/run"}) < 0.1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: 'Disk space low on {{ $labels.instance }}'
          description: 'Less than 10% free on one or more filesystems.'

      # Disk I/O wait (host spending too much time waiting on disk)
      - alert: HighIOWait
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'High iowait on {{ $labels.instance }}'
          description: 'I/O wait > 20% over 10m. Check storage load or queries.'

      # Host down (node exporter scrape failed)
      - alert: HostDown
        expr: up{job="nodes"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: '{{ $labels.instance }} is not responding'
          description: 'Prometheus cannot scrape node exporter for 1m.'

      # Recent reboot (uptime too low) – useful to notice unexpected restarts
      - alert: RecentReboot
        expr: (time() - node_boot_time_seconds) < 600
        for: 5m
        labels:
          severity: info
        annotations:
          summary: '{{ $labels.instance }} restarted'
          description: 'Host uptime < 10m. If not planned, investigate dmesg/journal.'
EOF
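
Before reloading anything, I let promtool (it ships in the Prometheus tarball) sanity-check both the rules and the main config; it catches indentation slips long before an alert silently fails to load:

/opt/prometheus/promtool check rules /etc/prometheus/alerts/host.yml
/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml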

That’s a starting point. Tune thresholds to your environment. For databases, I’ll often soften CPU alerts but be much stricter with iowait and disk space. For app servers, I’ll be more sensitive to memory and swap. The official alerting guide is worth skimming as you refine your rules.

Step 3: Wire up Alertmanager for notifications

Prometheus fires alerts, but Alertmanager decides who hears about them and when. I like to group by instance and severity so a small storm becomes a single message with context, not twenty notifications at once.

# /etc/alertmanager/alertmanager.yml
route:
  receiver: 'team-default'
  group_by: ['instance', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h

receivers:
  - name: 'team-default'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.example.com:587'
        auth_username: '[email protected]'
        auth_password: 'REDACTED'

# Add Slack, Telegram, or PagerDuty as needed with their configs

Point Prometheus to Alertmanager in your prometheus.yml, reload, and send a test. One of the best quality‑of‑life moves is using silence windows during maintenance. Fifteen quiet minutes can save your sanity when patching multiple hosts.
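
Here’s roughly what that wiring looks like, assuming Alertmanager runs on the same VM on its default port 9093. The curl at the end pushes a throwaway alert straight to Alertmanager so you can confirm the notification path before a real incident does it for you:

# Append to /etc/prometheus/prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

# Pick up the change (a SIGHUP also works if you prefer not to restart)
systemctl restart prometheus

# Fire a test alert directly at Alertmanager
curl -XPOST http://localhost:9093/api/v2/alerts -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"web-1"}}]'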

Grafana: The Part Your Brain Loves

Dashboards don’t fix problems, but they make diagnosis fast. With Grafana, I try to build panels that answer one question each. Is CPU healthy? Is memory stable? Is disk happy? Are we moving packets as expected? When panels carry a single idea, the entire dashboard becomes effortlessly scannable.

Step 1: Install Grafana and add Prometheus as a data source

Install Grafana using your distro’s repo or a container, then log in, head to Data Sources, and add Prometheus at http://<prometheus_ip>:9090. The Grafana docs walk you through the clicks if you need a refresher.
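
If you’d rather keep that step in version control, Grafana can also provision the data source from a file. A minimal sketch (the file name is arbitrary; the directory is Grafana’s standard provisioning path):

# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://<prometheus_ip>:9090
    isDefault: true

Restart Grafana and the data source shows up already configured.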

Step 2: Build a ‘Server Overview’ dashboard that tells a story

I start with a row for CPU, a row for memory, then disk, network, and finally uptime and status. Three or four panels per row usually feels right. Here are the PromQL snippets I reach for:

CPU usage:

avg by (instance) (1 - rate(node_cpu_seconds_total{mode='idle'}[5m]))

Pair that with a repeating panel showing per‑core usage if you like detail. If this graph creeps up over time, something changed—deploy logs or cron jobs often tell the story.
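
For the per-core view, drop the avg and let the cpu label split the series (filter by instance in the panel or with a dashboard variable):

1 - rate(node_cpu_seconds_total{mode='idle'}[5m])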

Memory availability:

node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes

Scale it to percent. Watch for step changes after releases or traffic spikes. It’s a great early warning for memory leaks or runaway caches.

Disk space:

node_filesystem_avail_bytes{fstype!~'tmpfs|overlay',mountpoint!~'/run'} / 
node_filesystem_size_bytes{fstype!~'tmpfs|overlay',mountpoint!~'/run'}

Use Grafana’s thresholds to turn the panel yellow at 20% and red at 10%. That visual nudge is surprisingly effective.

Disk I/O wait:

avg by (instance) (rate(node_cpu_seconds_total{mode='iowait'}[5m]))

Short pops happen; sustained iowait is a smell. If this stays high, look at database queries, backup schedules, or noisy storage neighbors.

Uptime and host reachability:

time() - node_boot_time_seconds

Turn that into a friendly number of hours or days. Add a stat panel or gauge for up{job='nodes'} as well—1 is reachable, 0 is down. When that gauge flips, you want an alert that tells you quickly and calmly.

Don’t over‑decorate the dashboard. A few annotations for deployments and maintenance windows go a long way. When I ship something big, I drop a note so future‑me can correlate a spike to a release without digging through git logs.

Uptime Alerts Without the Noise (And How to Avoid Crying Wolf)

Alerts are easy to write and hard to love. The trick is intention. I ask myself two questions before adding any alert: will I take action if this fires, and will I ignore it if it fires too often? If the answer to the second question is yes, I either turn it into a dashboard visualization or I add a ‘for’ to de‑noise it.

For host reachability, the up metric is your friend. When Prometheus can’t scrape Node Exporter, up becomes 0. That might mean the host is down, the network is split, or the firewall changed. I set the HostDown alert to wait one minute before firing. If it’s a blip, it disappears; if it’s real, I know fast. For service‑level uptime (like ‘Is my homepage returning 200?’), you can add a blackbox exporter later to probe HTTP endpoints. It’s lightweight, and it answers the question users actually care about: can they reach your site?
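
When you do add the blackbox exporter, the scrape config looks roughly like this, assuming it listens on its default port 9115 on the monitoring VM; the relabeling is what turns each URL into a probe target:

# Added under scrape_configs: in prometheus.yml
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets: ['https://example.com/']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115

An alert on probe_success == 0 with a short ‘for’ gives you the ‘is the homepage actually answering?’ check.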

Uptime alerts aren’t only about down vs. up, though. I’ve seen ‘RecentReboot’ catch accidental restarts after a kernel update. That alert is less urgent, but it’s a great breadcrumb: if numbers look off, and you see a reboot annotation, now you know why. Similarly, I’ll sometimes add a ‘NoScrapes’ or ‘NoSamples’ alert to catch silent failures where metrics look oddly flat. A calm monitoring system feels like a helpful colleague tapping your shoulder, not a fire alarm every five minutes.

One more tip: escalate carefully. I’ll send warnings to email or a quiet chat channel and reserve paging for critical outages that need eyes now. In Alertmanager, grouping by instance keeps your phone from exploding when multiple rules fire for the same host. Maintenance silences, even for 30 minutes, are worth their weight in gold during patch days.
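
You can click silences into the Alertmanager UI, but amtool (it ships with Alertmanager) makes it a one-liner. A sketch, assuming Alertmanager on its default port:

amtool silence add instance="web-1" \
  --alertmanager.url=http://localhost:9093 \
  --comment="Patching web-1" --duration=30m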

Security and Sanity: Keep Metrics Private and Names Clear

Metrics are like a diary—helpful to you, too revealing for strangers. Keep Node Exporter firewalled to the Prometheus server only. Don’t put it behind a public reverse proxy unless you secure it first. Grafana should sit behind HTTPS with a strong password or SSO if you have it. If you’re already comfortable with Let’s Encrypt, set it up and auto‑renew so that’s one less thing to remember.

Names matter. In Prometheus, label instances with something meaningful: ‘web‑1’, ‘db‑prod‑eu’, or ‘queue‑east’. That way, when an alert fires, you know exactly where to look. If you manage multiple environments, add a ‘env’ label like ‘staging’ or ‘prod’ and route alerts differently. I’ve avoided so many late‑night goose chases just by labeling cleanly and grouping alerts the way my brain actually triages issues.
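
Routing on that label is just a child route in Alertmanager. A sketch, with a hypothetical ‘pager’ receiver standing in for whatever loud channel you use:

# Added inside the top-level route: block of alertmanager.yml
  routes:
    - matchers:
        - env="prod"
        - severity="critical"
      receiver: 'pager'  # hypothetical paging receiver; everything else falls through to team-default
# Older Alertmanager versions use match: instead of matchers: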

For resilience, monitoring shouldn’t become your single point of failure. If Grafana goes down, your app should keep running; if Prometheus restarts, you’ll lose a short slice of data, not your sanity. Back up your configs, export dashboards, and write down the ‘how we silence alerts’ steps where your team can find them. Monitoring is part tools, part habits.

Real‑World Tuning: From Noisy to Trustworthy

When I first wire a new host, I expect a little noise for a day or two. It’s normal—thresholds don’t match reality yet. I watch the graphs and adjust. If CPU hovers around 60% most of the day, I bump the alert to 85% with a longer ‘for’. If I see memory dip under 10% for minutes at a time during backups, I nudge the threshold or schedule the task differently. Your server has its own heartbeat. Tune to the rhythm, not the theory.

Disk I/O is the sneaky one. A database that sings during business hours might be crushed by a midnight report or a nightly dump. If I see iowait recurring around the same time each day, I either move the task, increase IOPS where possible, or tune queries and indexes to reduce pressure. Sometimes the biggest fix is outside the server: moving cached content to a CDN, trimming log verbosity, or cutting chatty debug features can ease the load without touching CPU or RAM at all. If uptime is your obsession (and for most production shops, it is), pairing good monitoring with smart redundancy is a winning combo. It’s why I often talk about how Anycast DNS and automatic failover keep your site up when everything else goes sideways—monitoring tells you something broke; failover helps users never notice.

One of my clients once complained about ‘random slowness’ every Friday afternoon. Graphs told the real story: CPU was fine, memory was healthy, but iowait spiked right before the team left the office. Turns out, a weekly export script was hammering the disk. We split the job into smaller chunks and shifted it later in the evening. Problem gone, morale restored. That’s the beauty of good monitoring—it gives you the honest clues without drama.

Troubleshooting the Setup: When Things Don’t Line Up

Every setup has a moment where a graph is empty or an alert won’t fire. My routine is simple. First, open Prometheus and use the ‘Targets’ page to confirm Node Exporter is being scraped. If it’s down, check the firewall or the node_exporter service status. Next, try a raw query like up or node_uname_info to confirm samples exist. If the data is there but the alert isn’t firing, paste your PromQL into the Prometheus expression browser and verify the value and labels match your rule. Sometimes a label mismatch—like using job=’node’ vs job=’nodes’—is the whole issue.

When Grafana panels look wrong, I toggle the panel’s ‘Inspect’ to see the query and response. Half the time I spot a missing label or an interval mismatch. If graphs look jagged, try aligning your step with the scrape interval. If alerts feel chatty, extend the ‘for’ window or tighten the condition. Remember, the goal isn’t to catch every blip; it’s to catch every problem that matters.

Going a Little Further (Only If You Need To)

Once the basics hum, you can layer in more insight without making things complicated. The textfile collector lets you expose custom metrics from your app with a tiny script—things like queue depth, cache hits, or order rates. If you want to monitor HTTP uptime from the outside, a blackbox exporter probes URLs and ports and reports the result as metrics, which is perfect for ‘is the homepage actually answering?’ checks. And if storage is your bottleneck, consider watching disk latency and the read/write operations rate per device to see which mount points need love.
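
For the textfile collector, the whole trick is writing a small .prom file into the directory we configured in the node_exporter unit and renaming it into place atomically. Here’s a hypothetical cron script for a queue-depth gauge; the metric name and the command that counts the queue are stand-ins for whatever your app actually exposes:

#!/bin/sh
# Hypothetical example: expose a queue depth via node_exporter's textfile collector.
# The directory matches --collector.textfile.directory from the unit earlier.
DIR=/var/lib/node_exporter/textfile
DEPTH=$(ls /var/spool/myapp/queue 2>/dev/null | wc -l)  # stand-in for your real queue check
TMP=$(mktemp "${DIR}/myapp.prom.XXXXXX")
cat > "$TMP" <<METRICS
# HELP myapp_queue_depth Jobs currently waiting in the queue.
# TYPE myapp_queue_depth gauge
myapp_queue_depth ${DEPTH}
METRICS
chmod 0644 "$TMP"
mv "$TMP" "${DIR}/myapp.prom"  # the rename is atomic, so node_exporter never reads a half-written file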

For long‑term retention, remote_write to a dedicated time‑series backend is a great step later on, but only if you genuinely need months of history at full resolution. Most of us only need a few weeks of detail and summarized trends, which Prometheus handles just fine.
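
When that day comes, the config side is genuinely small. A sketch with a placeholder endpoint; the real URL and auth depend on whichever backend you choose:

# In prometheus.yml
remote_write:
  - url: https://tsdb.example.com/api/v1/write
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_write_password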

Wrap‑Up: Less Guessing, More Knowing

Let’s bring it home. Monitoring that earns your trust is simple, tidy, and tuned to your reality. Prometheus pulls the truth from your VPS every few seconds. Node Exporter gives it a clear voice about CPU, RAM, disk I/O, and uptime. Grafana arranges those truths so your eyes instantly know what changed. And Alertmanager turns them into a handful of alerts you’ll actually act on, not a chorus you’ll mute.

If you’re just getting started, begin with one host and a small set of rules. Watch the graphs for a week. Adjust thresholds until the alerts feel like helpful nudges instead of nagging sensations. Then add another host. Before long, you’ll know how your VPSs behave on a good day, and you’ll spot the bad days from a mile away. That’s the quiet confidence good monitoring gives you: fewer surprises, faster fixes, and more time to work on the parts you truly enjoy.

Hope this was helpful! If you want me to dig into dashboard templates or share a ‘drop‑in’ set of rules for databases and caches, let me know. I’ve got a bunch of proven bits I’d be happy to share in a future post. Until then, may your alerts be calm and your graphs tell a clear story.

Frequently Asked Questions

Do I really need Alertmanager, or can Prometheus send alerts on its own?

Great question! Prometheus evaluates rules, but Alertmanager handles routing, grouping, and silences. You could hack together notifications without it, sure, but Alertmanager is what keeps your inbox sane when multiple alerts fire. It’s the difference between a usable setup and an overwhelming one.

What thresholds should I start with for CPU, memory, and disk alerts?

I like CPU warning around 70–80% and critical around 85–90% with a 5-minute window, memory available below 10–15% for 10 minutes, and disk free below 15% (warning) and 10% (critical). But tune to your reality. If your app normally sits at 65% CPU, bump the threshold. The right value is the one that catches real issues without nagging about normal behavior.

How do I keep uptime alerts from crying wolf over every little blip?

Use the ‘for’ clause so alerts wait a minute or two before firing, and group by instance in Alertmanager. Add silences during maintenance. If you want to test external reachability, add a blackbox probe for your homepage, but keep HostDown for infrastructure health. Done together, you’ll catch real outages and ignore momentary blips.