Homelab Infrastructure

October 2024

🔍 The Challenge

I wanted to learn Kubernetes properly—not from tutorials or labs, but by running real workloads that I actually depend on. The problem? Most learning environments are temporary. I needed something that would keep running 24/7 and teach me what breaks in production.

⚡ The Solution

Built a production-grade K3s cluster on Raspberry Pis in my homelab. It's been running for months, handling real traffic, and forcing me to learn everything from distributed storage to observability when things actually go wrong.

📊 Impact & Results

  • 99.9%+ uptime: monitored continuously, with real downtime events that taught me about resilience
  • Months of runtime: this isn't a weekend project; it's been running continuously
  • 70% cost savings: compared to running equivalent infrastructure in the cloud
  • Zero security incidents: monitoring and automation catch issues before they become problems

Why I Started This

I wanted to learn Kubernetes properly—not just from tutorials, but by running real workloads 24/7. So I built a production-grade cluster on Raspberry Pis in my homelab. Turns out, running Prometheus at 3am when pods crash teaches you a lot about observability.

The challenge wasn’t just installing K3s. It was building something resilient enough that I could actually depend on it. That meant dealing with power outages, network issues, and hardware failures—the stuff that doesn’t happen in a clean lab environment.

The Challenge

Running production workloads at home is different. You don’t have a dedicated ops team, a support contract, or clean hardware. You have:

  • Raspberry Pis that sometimes decide they don’t want to boot
  • Power outages that test your backup strategies
  • Network issues that break ingress routing
  • Hardware failures that require actual disaster recovery

I needed a setup that could handle real-world problems, not just a perfect demo environment.

What I Built

The Cluster

A K3s cluster running on Raspberry Pis. K3s because it’s lightweight enough for ARM, but full-featured enough to run real workloads. I started with three nodes and learned the hard way why distributed systems need redundancy.

Core Stack:

  • K3s - Lightweight Kubernetes that actually works on ARM
  • Cilium - CNI that handles networking properly (no more flannel headaches)
  • Longhorn - Distributed storage because I needed persistent volumes that survive node failures
  • Traefik - Ingress controller because I got tired of fighting with nginx configurations
  • Cert-Manager - Automated TLS because manually renewing certificates is for masochists (see the sample manifest after this list)
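
To make the stack concrete, here is roughly what exposing a service looks like once Traefik and cert-manager are in place. This is a minimal sketch, not my actual config: the hostname, service name, and issuer name are placeholders, and it assumes K3s's bundled Traefik registered as the "traefik" ingress class.

```yaml
# Hypothetical example: expose a service through Traefik with a
# Let's Encrypt certificate issued automatically by cert-manager.
# Names (app.example.com, my-app, letsencrypt-prod) are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod  # which issuer cert-manager should use
spec:
  ingressClassName: traefik          # K3s ships Traefik as its default ingress controller
  tls:
    - hosts:
        - app.example.com
      secretName: my-app-tls         # cert-manager stores the issued certificate here
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```

cert-manager watches the annotation, completes the ACME challenge, and keeps my-app-tls renewed; Traefik terminates TLS and routes the rest.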

Observability (Because You Can’t Debug What You Can’t See)

After my first incident where I spent hours trying to figure out why pods were crashing, I realized monitoring isn’t optional. I built:

  • Prometheus - Metrics collection because you need to know what’s happening
  • Grafana - Dashboards that actually help when things go wrong
  • Uptime Kuma - External monitoring because you can’t trust internal monitoring to tell you about network issues
  • AlertManager - So I actually get notified when things break

The cool part? Building dashboards that show me what matters. Not every metric—just the ones that help me debug actual problems.
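
As one example of "the right things": a single rule for pods stuck in a crash loop catches a surprising share of failures. Here is a sketch of a Prometheus alerting rules file; the threshold and labels are illustrative, and the restart-count metric assumes kube-state-metrics is running.

```yaml
# Sketch of a Prometheus alerting rules file (loaded via rule_files in
# prometheus.yml). Requires kube-state-metrics for the restart metric.
groups:
  - name: workloads
    rules:
      - alert: PodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
        for: 5m                       # only fire if it keeps happening
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.namespace }}/{{ $labels.pod }} is restarting repeatedly"
```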

Security (Because Defaults Aren’t Secure)

I learned quickly that Kubernetes defaults aren’t production-ready. So I implemented:

  • TLS everywhere - Automated cert management with cert-manager and Let’s Encrypt
  • Network policies - Pod-to-pod communication control (surprisingly tricky to get right; see the sketch after this list)
  • RBAC - Role-based access because running everything as admin is a bad idea
  • Pi-hole - DNS filtering because security starts at the network layer
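
The "tricky to get right" part starts with the baseline. A default-deny policy like the one below blocks all traffic in a namespace until you explicitly allow it; this is a generic sketch with a placeholder namespace, not my exact manifest.

```yaml
# Default-deny for a namespace: selects every pod, allows nothing.
# Anything not explicitly permitted by another policy is dropped.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: my-app          # placeholder namespace
spec:
  podSelector: {}            # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
    - Egress
```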

What I Learned

Kubernetes on ARM is tricky but doable. Most guides assume x86, so you learn to read documentation carefully and test everything.

Distributed storage requires careful planning. Longhorn is great, but setting up replication and understanding when volumes will be unavailable taught me a lot about stateful workloads.

Monitoring isn’t optional. When you’re responsible for keeping things running, you need visibility. But good observability isn’t about collecting everything—it’s about collecting the right things.

Automation pays off fast. Every manual step is a step that will eventually break at 3am. The time I spent automating certificate renewals was worth it the first time a cert expired.

Documentation is infrastructure. If I can’t remember how something works in six months, it doesn’t count as working. I document everything, even the mistakes.

Real Challenges I Faced

The time pods couldn’t reach each other: Network policies seemed simple until I realized default deny means nothing works until you explicitly allow it. Spent a weekend debugging before realizing the problem was my own policies.
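
The specific lesson from that weekend: under default deny, even DNS is blocked, so services "can't reach each other" before they can even resolve names. A rule along these lines is the kind of thing that has to exist before anything works again; labels and ports here are illustrative, and the kube-dns labels assume K3s's CoreDNS defaults.

```yaml
# Sketch: under default deny, pods also lose DNS unless egress to
# kube-dns is allowed explicitly. Namespace and labels are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress
  namespace: my-app            # placeholder namespace
spec:
  podSelector: {}              # apply to all pods in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```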

The storage migration that broke everything: Decided to reorganize Longhorn volumes. Forgot that persistent volumes are stateful. Learned the hard way why you don’t mess with running systems on a whim.

The certificate expiration: Thought manual cert renewal would be fine “just this once.” Three months later, found out why automation exists. Now cert-manager handles everything.

The power outage that tested backups: UPS ran out faster than expected. Cluster came back up, but some volumes needed manual intervention. That’s when I learned the difference between “backed up” and “tested restore procedure.”
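
Since then, backups run on a schedule rather than on memory. Roughly what that looks like with Longhorn's RecurringJob resource, assuming a backup target (NFS or S3) is already configured in Longhorn; the field names follow the RecurringJob CRD but are worth checking against the installed version.

```yaml
# Sketch of a nightly Longhorn backup job for volumes in the "default"
# group. Assumes a backup target is configured; verify fields against
# the Longhorn version in use.
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: nightly-backup
  namespace: longhorn-system
spec:
  cron: "0 3 * * *"    # every night at 03:00
  task: backup
  groups:
    - default          # volumes in the default group get this job
  retain: 7            # keep the last 7 backups
  concurrency: 2       # back up at most two volumes at a time
```

Scheduling the backup turned out to be the easy half; the restore is the part that has to be rehearsed.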

The Architecture

Core Infrastructure

K3s Kubernetes Cluster - Five Raspberry Pi nodes running K3s with Cilium as the CNI. Why five? Redundancy and headroom. With three nodes, one failure takes out a third of the capacity and leaves little room to reschedule the displaced pods; with five, losing a single node barely registers.
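
For reference, swapping the bundled flannel for Cilium on K3s roughly means telling the server not to ship its own CNI or network-policy controller, then installing Cilium on top (for example via its Helm chart). A sketch of the server config; exact keys are worth checking against the K3s and Cilium docs for the versions in use.

```yaml
# /etc/rancher/k3s/config.yaml (server nodes) - sketch, not my exact file.
# Disables the bundled flannel CNI and the built-in network-policy
# controller so Cilium can take over both roles.
flannel-backend: "none"
disable-network-policy: true
```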

Longhorn Storage - Distributed block storage with 3x replication. Every volume exists on three nodes, so one node failure doesn’t lose data. Setting this up taught me more about storage than any tutorial.
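
The replication factor lives in the StorageClass, so it's worth showing. This is a generic sketch with parameter names from the Longhorn docs and illustrative values, not a dump of my manifest.

```yaml
# Sketch of a Longhorn StorageClass with 3-way replication.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-replicated
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "3"         # each volume is kept on three different nodes
  staleReplicaTimeout: "2880"   # minutes to wait before treating a replica on a dead node as gone
```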

Traefik Ingress - Reverse proxy that handles TLS termination, routing, and load balancing. The configuration is complex, but it handles everything I need.

Monitoring Stack

Prometheus - Scrapes metrics from every service and node. Stores time-series data that Grafana queries.
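
"Scrapes metrics from every service and node" comes down to Kubernetes service discovery in the Prometheus config. A trimmed sketch; job names and relabeling are illustrative, and a real config carries more jobs plus TLS and auth settings for the kubelet targets.

```yaml
# Fragment of prometheus.yml: discover nodes, and scrape any pod that
# opts in via an annotation. Illustrative only.
scrape_configs:
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node               # real kubelet scraping also needs TLS/auth settings
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that carry the annotation prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```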

Grafana - Custom dashboards for cluster health, pod status, resource usage. Built dashboards that show me what matters when debugging.

Uptime Kuma - External uptime monitoring. Runs outside the cluster so it can tell me if the cluster itself is down.

Security & Automation

Cert-Manager - Automated TLS certificate management. Watches ingresses, requests certificates from Let’s Encrypt, renews them automatically. Never think about certs anymore.
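
The issuer is the piece that makes it hands-off. A minimal sketch of a Let's Encrypt ClusterIssuer using the HTTP-01 challenge through Traefik; the email and names are placeholders, and it matches the letsencrypt-prod name used in the Ingress sketch earlier.

```yaml
# Sketch of a cert-manager ClusterIssuer for Let's Encrypt (production).
# Email, secret name, and ingress class are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com               # expiry and problem notifications go here
    privateKeySecretRef:
      name: letsencrypt-prod-account-key   # ACME account key stored in this Secret
    solvers:
      - http01:
          ingress:
            class: traefik                 # solve challenges through the Traefik ingress
```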

Pi-hole - DNS filtering at the network level. Blocks ads and known malicious domains before they reach the cluster.

Argo CD - GitOps deployment. I push to Git, and Argo CD syncs the cluster to match. Infrastructure as code that actually works.
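
In practice that means one Application per stack component. A representative sketch; the repo URL, path, and namespace are placeholders, not my actual repo.

```yaml
# Sketch of an Argo CD Application: watch a path in a Git repo and keep
# the cluster in sync with it. Repo URL/path/namespace are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/homelab.git
    targetRevision: main
    path: monitoring
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true        # delete resources removed from Git
      selfHeal: true     # revert manual drift back to what Git declares
    syncOptions:
      - CreateNamespace=true
```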

Live Infrastructure

This isn't a demo; it's running right now.

The Result

A cluster that’s been running for months with 99.9%+ uptime. It’s not perfect—I’ve had outages, data loss scares, and plenty of “why isn’t this working” moments. But those failures taught me more than any certification or tutorial could.

The real win? Running infrastructure that doesn’t require constant babysitting. The best infra is invisible—it just works. I’m not there yet, but I’m closer than I was when I started.

What’s Next

Always learning, always improving:

  • Service Mesh - Exploring Istio to understand advanced networking patterns
  • Runtime Security - Falco for detecting suspicious behavior in real-time
  • Multi-Cluster - Planning federation for disaster recovery
  • Performance Tuning - Getting more out of the hardware I have

The infrastructure keeps evolving because the problems keep changing. That’s the fun part.


This isn’t just a project—it’s how I learned Kubernetes. By breaking things, fixing them, and learning from every failure. The best infrastructure lessons come from running things that actually matter.