Technology

VPS Monitoring and Alerts Without Tears: Getting Started with Prometheus, Grafana, and Uptime Kuma

The Quiet Panic That Made Me Take Monitoring Seriously

It wasn’t a dramatic outage. No sirens. No frantic calls. Just a quiet Monday morning, a lukewarm coffee, and a weirdly slow site that refused to speed up. I remember staring at a graph that didn’t exist yet—because I hadn’t set one up. That was the moment I realized how blindly I was flying. Ever had that sinking feeling when something’s off, but you don’t know where to look? That was me. CPU felt fine, memory looked okay when I checked manually, and yet requests were queuing somewhere in the dark.

Here’s the thing about VPS monitoring: it’s not for when things go wrong—it’s for knowing things are about to go wrong and catching them quietly, before anyone else notices. The trio that changed my day-to-day is dead simple: Prometheus for metrics, Grafana for dashboards and alerts, and Uptime Kuma for external checks and friendly status pages. In this post, I’ll walk you through how I set them up, how I keep the alerts friendly (not noisy), and the small tweaks that make a big difference. Think of it like showing a friend how to keep their house cozy without obsessing over every draft.

What We’re Building: The Trio That Keeps a VPS Calm

When folks ask me where to start with VPS monitoring and alerts, I always suggest two views of reality: the inside view and the outside view. Prometheus and Grafana give you the inside—CPU, memory, disks, network, app behavior—while Uptime Kuma gives you the outside—can people reach your site, do APIs respond quickly, is TLS valid? Together, they feel like finally turning on the lights in a room you’ve been walking through in the dark.

Prometheus is your metric collector. It “scrapes” numbers from exporters (like Node Exporter on your VPS) and stores time-series data. It’s fast, reliable, and amazingly honest. If there’s a spike in IOwait or a sneaky memory leak, Prometheus doesn’t just tell you—it draws the picture. If you want a friendly deep dive into Node Exporter and why it’s worth the extra minute to install, I’ve shared a complete playbook in the stack I trust: Prometheus + Grafana + Node Exporter.

Grafana sits on top like your mission control. I like it because it doesn’t make you feel dumb. You can build panels that just make sense: CPU usage with a shaded average, RAM with a subtle threshold line, disk latency with a red “danger zone.” The new alerting system lets you send notifications where you actually live—Slack, Telegram, email—and adjust rules so you’re not waking up for nothing. The biggest win is designing dashboards that are calm by default and urgent when needed.

Then there’s Uptime Kuma, the friendliest uptime monitor I’ve ever rolled out. It checks your sites and services from the outside, pings your ports, measures response times, and even handles push-based checks if you want your app to say “I’m alive” on a schedule. And the status page? It’s like a little window you can show your team or clients so they don’t have to ask what’s going on—they can see it.

Put these together and you’ve got a layered safety net. Metrics tell you why, uptime tells you whether, and alerts tie it into action. And if you’re running a busy store or an app with bursty traffic, pairing this stack with good capacity planning is a lifesaver. I’ve seen it over and over, especially with shops that suddenly hit a promotion wave—if this is you, take a peek at my thoughts on right-sizing vCPU, RAM, and IOPS for WooCommerce without guesswork.

If you’ve ever wondered how storage plays into this, you’re not alone. Disk performance is sneaky, and IOwait numbers can look like ghosts until you track them properly. If you’re curious, I’ve unpacked what really moves the needle in my NVMe VPS hosting guide—where speed actually comes from. Seeing IO wait time alongside CPU usage in Grafana tells a story you can act on.

Prometheus Setup: The Simple Path That Actually Works

The mental model

Prometheus works by scraping endpoints that expose metrics in plain text. Your VPS will run Node Exporter to expose system metrics. Prometheus itself can live on the same server for a small setup, or on a separate monitoring box if you want to scale later. Start small, keep it simple, and you’ll be fine.
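
If you’ve never seen the exposition format, it helps to know how simple it is: plain text, one metric per line, labels in braces, value at the end. The metric names below are real Node Exporter metrics; the values are invented for illustration.

# Roughly what Prometheus sees when it scrapes Node Exporter (values made up)
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_memory_MemAvailable_bytes 2.147483648e+09
node_filesystem_avail_bytes{device="/dev/vda1",mountpoint="/"} 1.5e+10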

Install Node Exporter on the VPS you want to monitor

I usually start with Node Exporter because it’s lightweight and instantly useful. It exposes CPU loads, memory, disks, filesystems, network, and even systemd status. On Debian/Ubuntu:

# Create a user
sudo useradd --no-create-home --shell /usr/sbin/nologin node_exporter

# Download latest (check GitHub releases for the newest version)
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz

tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/

# Systemd service
sudo tee /etc/systemd/system/node_exporter.service >/dev/null << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter

# Default port is 9100
# Confirm it works
curl http://127.0.0.1:9100/metrics | head -n 5

Open your firewall if needed. I usually allow port 9100 only to the Prometheus server’s IP, not the whole internet. A tiny bit of paranoia goes a long way.
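
With ufw, that restriction is one rule; the 10.0.0.5 address below is a stand-in for your Prometheus server’s IP:

# Allow only the Prometheus server (10.0.0.5 is a placeholder) to reach Node Exporter
sudo ufw allow from 10.0.0.5 to any port 9100 proto tcp

# Double-check nothing broader exposes 9100 publicly
sudo ufw status numbered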

Install Prometheus

You can run Prometheus on the same machine while testing, or spin up a small VPS just for monitoring. For a single VPS, same-box is fine. For a few servers or anything customer-facing, I separate it.

# Create user and directories
sudo useradd --no-create-home --shell /usr/sbin/nologin prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Download (check releases for the latest)
wget https://github.com/prometheus/prometheus/releases/download/v2.53.1/prometheus-2.53.1.linux-amd64.tar.gz

tar -xzf prometheus-2.53.1.linux-amd64.tar.gz
sudo cp prometheus-2.53.1.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.53.1.linux-amd64/promtool /usr/local/bin/
sudo cp -r prometheus-2.53.1.linux-amd64/consoles /etc/prometheus/
sudo cp -r prometheus-2.53.1.linux-amd64/console_libraries /etc/prometheus/

# Basic config
sudo tee /etc/prometheus/prometheus.yml >/dev/null << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['127.0.0.1:9100']
EOF

sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus

# Systemd service
sudo tee /etc/systemd/system/prometheus.service >/dev/null << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus \
  --storage.tsdb.retention.time=15d \
  --web.listen-address=0.0.0.0:9090

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now prometheus

# Visit http://<server-ip>:9090 to confirm

A note on retention: Prometheus defaults to 15 days, and I like setting it explicitly (as in the unit file above) so the disk footprint stays predictable. If you have a beefy disk or remote storage plans, stretch it. If you’re on a tiny VPS, keep it conservative.
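
If you’re unsure what your retention window actually costs, it’s easy to check how much the TSDB is using:

# See how much disk the Prometheus TSDB currently occupies
du -sh /var/lib/prometheus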

Scrape multiple VPS nodes

To add more servers, install Node Exporter on each, and update Prometheus:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.0.11:9100', '10.0.0.12:9100', '10.0.0.13:9100']

Or if you like organizing by role, add labels:

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['10.0.0.11:9100']
        labels:
          role: 'web'
      - targets: ['10.0.0.12:9100']
        labels:
          role: 'db'

Labels are your future self’s best friend. A month from now you’ll be grateful you can filter dashboards by role or environment.
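
Whenever you edit the config, validate it before reloading. promtool ships alongside Prometheus, and the server re-reads its config on SIGHUP, so no restart is needed:

# Catch YAML mistakes before they take the server down
promtool check config /etc/prometheus/prometheus.yml

# Tell the running server to reload its configuration
sudo kill -HUP $(pidof prometheus)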

Optional: HTTP checks from the inside with Blackbox Exporter

If you want Prometheus to probe endpoints (HTTP, TCP, ICMP) from inside the network, add Blackbox Exporter:

# Download and install similarly to Node Exporter
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
...

# Example Prometheus config
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9115  # blackbox_exporter
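
The http_2xx module referenced here ships in Blackbox Exporter’s default blackbox.yml, so you usually don’t need to touch it; if you do want to define it yourself, a minimal sketch looks like this:

# Minimal blackbox.yml sketch; the stock config already includes http_2xx
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4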

Prometheus is the backbone here. If you’re curious about the core philosophy and what makes it tick, the official docs are short and sweet: the Prometheus overview explains the model clearly.

Grafana: Paint the Picture and Make Alerts You Actually Trust

Connect Grafana to Prometheus

Grafana installation is straightforward. Most distros have a package, but the downloads page is also easy to follow. Once it’s running, add Prometheus as a data source by pointing Grafana at your Prometheus URL, usually http://YOUR_PROM_SERVER:9090. That’s it—now you can query metrics using PromQL.
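
If you prefer files over clicking through the UI, Grafana can provision the data source at startup. A minimal sketch, assuming Grafana’s standard provisioning directory and a same-box Prometheus:

# /etc/grafana/provisioning/datasources/prometheus.yml (a sketch)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://127.0.0.1:9090
    isDefault: true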

In my experience, the first panel I build is CPU load averaged over five minutes, per instance. Then a memory usage panel that subtracts cache and buffers, because raw “used memory” is misleading. Disk IO time and IOwait follow, and finally network throughput. Each panel gets a threshold or two, but I keep the colors gentle. The idea is to keep the dashboard calm. You should feel your shoulders drop when you open it.

Starter panels that tell a real story

Here’s a simple set of queries I reach for:

CPU usage (per instance):

100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory used (excluding cache/buffers):

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

Disk IO time (per device):

avg by (instance, device) (irate(node_disk_io_time_seconds_total[5m]) * 100)

Network receive/transmit:

sum by (instance) (irate(node_network_receive_bytes_total[5m]))
sum by (instance) (irate(node_network_transmit_bytes_total[5m]))
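
One more worth a panel, since disk space is the classic silent killer: filesystem usage as a percentage, skipping pseudo-filesystems.

(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100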

Once you have these, you’ll start spotting patterns. Maybe CPU is fine, but IO time spikes during backups. Or memory usage creeps up daily until a service restart resets it. It’s like having timestamps on your headaches—you can finally explain them.

Alerts that whisper until they need to shout

Alerting is where good intentions go to die if you’re not careful. Too many alerts and you’ll start ignoring all of them. Too few and you’ll miss the early warnings. I prefer a layered approach: warnings that nudge you (Slack, email), and critical alerts that break through (pager, SMS, Telegram). Set sensible durations: CPU over 90% for 10 minutes is a warning; over 95% for 20 minutes is critical. Spikes happen; sustained pressure is the danger.

Here’s a sample CPU alert in Grafana’s new alerting system (you can also define alerts in Prometheus, but Grafana’s workflow is a little friendlier if you’re just starting):

Expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
For: 10m
Labels: severity = 'warning'
Annotations: summary = 'High CPU on {{ $labels.instance }}', description = 'CPU over 90% for 10m'

Duplicate it, bump the threshold and duration, and you’ve got your critical version. Do the same for memory (e.g., over 90%), disk space (e.g., 80% warning, 90% critical), and IOwait. Don’t forget the “dead man’s switch”—an alert that fires if no data comes in for a few minutes. That one catches the “Prometheus died” scenario, which is easy to miss.
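
If you’d rather keep rules in Prometheus itself, here’s a sketch of a rules file using the thresholds above. Note that up == 0 catches dead exporters and unreachable hosts; if Prometheus itself dies, nothing inside it can tell you, which is exactly why the external Uptime Kuma check (even a simple TCP check on port 9090) matters.

# /etc/prometheus/rules.yml (a sketch; reference it via rule_files in prometheus.yml)
groups:
  - name: vps-basics
    rules:
      # No data from a node for 5 minutes: dead exporter or unreachable host
      - alert: TargetDown
        expr: up{job="node"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: 'No metrics from {{ $labels.instance }} for 5 minutes'
      # Disk space warning at 80% used, skipping pseudo-filesystems
      - alert: DiskSpaceWarning
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: 'Disk over 80% on {{ $labels.instance }} ({{ $labels.mountpoint }})'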

For documentation and nuances like contact points, mute times, and silences, I find the official guide genuinely helpful. If you’re bringing teammates on board, share Grafana’s alerting and dashboard docs—they’re readable and practical.

Keep the dashboards calm

It’s tempting to build the dashboard equivalent of a cockpit with a thousand blinking lights. Resist. Start with a single row per host, one row for storage, one row for networking, and a top row for a heatmap-style overview. Color is for urgency; everything else should be easy on the eyes. Your future self, opening this at 2 a.m., will thank you.

By the way, if you’re running Laravel, WordPress, or any framework that has its own quirks, blend app-level metrics into your system dashboard when possible. Seeing queue depth next to CPU, or cache hit rate next to disk IO, connects the dots. If you’re optimizing PHP-FPM, OPcache, Horizon, or Redis, the “then this, then that” chain becomes obvious. I covered a lot of that practical tuning in a real-world story here: the Laravel production tune-up I do on every server. Even if Laravel isn’t your stack, the logic applies.

Uptime Kuma: External Checks and a Status Page People Actually Read

Why you want the outside view too

Internal metrics tell you why a server is struggling, but they can’t answer the simplest user question: can I reach it? That’s where Uptime Kuma shines. It feels like a friendly craftsman tool—easy to deploy, easy to use, and surprisingly capable. I often run it on a small separate VPS, because outside-in checks should come from, well, the outside. If your DNS breaks or a firewall rule goes rogue, you’ll catch it fast.

Quick install and first monitors

You can run Uptime Kuma via Docker or as a standalone Node.js app. Docker is my default because upgrades are easy:

docker run -d \
  --name uptime-kuma \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  louislam/uptime-kuma:latest
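
Upgrades really are painless: pull the new image and recreate the container; the named volume keeps all your monitors and history.

docker pull louislam/uptime-kuma:latest
docker stop uptime-kuma && docker rm uptime-kuma
docker run -d \
  --name uptime-kuma \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  louislam/uptime-kuma:latest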

Open the web UI at http://YOUR_MONITORING_VPS:3001, set your admin account, and add your first monitor. Start with HTTP(s) checks to your website, your admin panel, and any critical APIs. Then add TCP/Port checks for MySQL, Redis, or any service you care about. If you have a cron-driven health endpoint, add an HTTP keyword check to make sure specific text is present—perfect for “readiness” or “I’m healthy” messages.

For push-based scenarios (where a job must check in on time), Uptime Kuma’s “Push” monitor is incredibly handy. Your backup job can hit a URL after it completes. If Uptime Kuma doesn’t receive the push within the interval you set, it flags it. I once caught a backup job stuck on a permissions issue this way—without the push, I might not have noticed until it was too late.
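
Wiring a job to a Push monitor is one line of cron. The token in the URL below is a made-up stand-in; copy the real push URL from the monitor’s settings.

# Run the nightly backup, then check in with Uptime Kuma on success
# (the abc123 token is hypothetical; Uptime Kuma generates one per monitor)
0 3 * * * /usr/local/bin/backup.sh && curl -fsS "http://YOUR_MONITORING_VPS:3001/api/push/abc123?status=up&msg=backup-ok" >/dev/null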

Notifications that fit your life

Set up notifications where you actually look. Telegram is fast and reliable; Slack is great if your team lives there; email is a good fallback. Keep priority in mind: low-urgency issues can be Slack-only; critical issues might ping your phone. And remember maintenance windows: schedule downtime in Uptime Kuma so you don’t get spurious alerts during planned work.

Status pages that reduce support pings

I love plain-language status pages. “Europe API degraded” is better than a wall of metrics—include short updates and action steps only if necessary. Uptime Kuma lets you curate which monitors appear on a public page, so you can share the right level of detail. It’s not about oversharing; it’s about being helpful. If you manage client sites, this can reduce “Is it down?” messages by a lot.

If you want to peek under the hood, the project is open-source and easy to follow. Here’s the repo: Uptime Kuma on GitHub. I send folks here when they want to automate deploys or tweak advanced settings.

Wrap-Up: A Calm, Helpful Monitoring Setup You’ll Actually Keep

So that quiet Monday morning? It doesn’t scare me anymore. With Prometheus collecting the story, Grafana drawing it clearly, and Uptime Kuma watching from the outside, I feel like I know my servers the way a barista knows their espresso machine—by the sound, the timing, the tiny shifts that tell you when it needs a little love. That’s the real win of a practical VPS monitoring and alerts setup: fewer surprises, faster fixes, and more confidence when traffic gets weird.

If you’re just starting, keep it simple. Install Node Exporter, stand up Prometheus, add Grafana, and build four panels you understand at a glance. Add three alerts that match your life: CPU sustained, disk space, and downtime. Then expand slowly—IOwait, queue depth, HTTP probe latency, status pages for your team. When you tighten the screws later (alert thresholds, mute times, silences, service-level burn alerts), you’ll be tuning something that already works, not fixing chaos.

And don’t forget the bigger picture: metrics aren’t the goal; calmer days are. If you’re dealing with DDoS noise or bot swarms on a WordPress site, monitoring pairs beautifully with smart edge protection. I’ve shared my approach to layering Cloudflare, ModSecurity, and Fail2ban in the layered shield I trust for real projects. It all works together: watch carefully, defend wisely, and act early.

Wherever you start, start. Set the first panel, wire the first alert, and let the data teach you. Hope this was helpful! If you’ve got a story from your own setup—or a dashboard you’re proud of—I’d love to hear it next time. Until then, keep your graphs calm and your pages fast.

Frequently Asked Questions

Should I run Prometheus and Grafana on the same VPS as my sites, or on a separate one?

Great question! For a single VPS or a tiny cluster, you can run Prometheus and Grafana on the same box to keep things simple. As you add more nodes or need higher retention, move them to a small dedicated monitoring VPS. You can migrate the data later—start small, iterate, and keep it simple.

Which alerts should I set up first?

I like to begin with CPU sustained high usage, disk space nearing full, and an external uptime check. Those three catch 80% of the surprises. Make the warning gentle and the critical strict, and add a dead-man’s switch so you’re alerted if metrics stop flowing. Start there, then layer in memory and IOwait once you’re comfortable.

Do I still need Prometheus and Grafana if I’m already running Uptime Kuma?

They do different jobs. Uptime Kuma tells you if the outside world can reach your site and how fast it responds. Prometheus and Grafana tell you why performance changes—CPU, memory, disk, network, app behavior. Use both. For a good mental model of metrics, the official overviews for Prometheus and Grafana are worth a skim once you’ve got the basics running.