Homelab 3.0



I’ve been making some improvements/changes to the homelab, and I would now consider this to be its third iteration. In the Dickensian tradition, I’ll go through the past, present, and future states of my homelab.

Ghosts of Homelabs Past

The first iteration was an ever-growing pile of NUCs running Debian. Initially, the idea was to run Kubernetes orchestrated by Pulumi, but the overall complexity of the system was unmanageable, and I also made some odd architectural decisions, such as writing persistent volume claims to SMB stores (yikes). As a result, it worked - but only, like, 80% of the time. At some point, I must have accidentally deleted the private key for that cluster, because I couldn’t get kubectl to connect to the other nodes.

I don’t count this as a separate iteration (at least, from a hardware perspective), but where this iteration ended up was using docker-compose to deploy various services to dedicated Debian hosts. It worked well enough, but was sometimes unreliable because my physical network uses mesh wifi backhaul to bridge the physical beachheads in my network - namely, my office and my living room. I was also in the habit of setting up node-local reverse proxies, so in order to create or change a proxy setting, I had to remember which physical node that service was being deployed from. Hard to manage. I also didn’t have much experience with provisioning and managing SSL certs, so I doggedly ran everything over HTTP, which is perfectly safe since it’s not exposed to the internet, but it was hostile to mobile browsers, and a few systems (notably, MySQL) actually refuse to run over HTTP. Fair enough. Finally, I was accumulating NUCs based on whatever deal I could find on eBay, and this meant that I was inevitably buying very old hardware - 4th or 5th generation Core i5s, and one of them was even a 4th gen Core i3 ($20!). That was not ideal from the perspective of power consumption, and it was also too slow to run some applications. In particular, I remember running a PVR service that encoded video from a physical antenna, and the poor NUCs really chugged when processing even 1080p video.

The turning point between Homelabs 1.0 and 2.0 was adding a firewall appliance - a Sophos XG-105 rev 3 running pfSense. This let me run pfBlockerNG instead of Pi-hole - and then, it wasn’t so important to have an always-up docker host. In general, a lot of the services I was running in Homelab 1.0 were toys at best. For example, running Gitea and Drone CI was a fun exercise in architecture, but was not particularly sustainable. The other thing I’ve come to realize about myself is that I don’t consume a ton of media, nor do I play a lot of games, so things like Navidrome or Audiobookshelf were pretty underutilized. Instead, I find myself playing with code projects more often than not, and sharing them with others. So, back to GitHub with me (although perhaps Codeberg is in my future).

Homelab 2.0 was simpler and taller than 1.0 - a single very large ThinkStation P720 compute node running Debian and an HPE ProLiant MicroServer Gen8 running TrueNAS (aka MEDIA-NAS). The GPU in the P720 let me deploy LLMs locally in such a way that they don’t drink all the water on the planet while sending my data into the ether for Google/Meta/OpenAI to train on. I was still using docker-compose to deploy, but because it’s a single node, organizing all the different compose templates was easier, and made for more rational organization in git. This was a great setup, to be honest. Highly stable - I had something like 6 months of uptime on the P720 at one point, until it was taken down by a power outage. There was only one hardware failure to speak of - the SSD (a Mushkin NVMe SSD) died while I was out of town (inconvenient, but it’s good to know my computers have comedic timing). A quick replace and restore from git - back up and running! The NAS was and remains stable and reliable.

The stability of Homelab 2.0 really helped me to identify ways to improve the stability of Precis, too. Up until about a week ago, Precis had been up for about 6 months (not including brief downtimes when deploying new images, of course), without any errors or unexpected crashes. However, while I was in HK two weeks ago, there was a brief power outage, and Precis didn’t come back up when power returned (I think because it depends on network connectivity to grab RSS feeds, and it checks feeds on startup, but that node came up a little faster than my firewall). Using Tailscale, I was able to ssh into the box and start it up without issue (from a moving train, no less! Tailscale truly is magic, sometimes).
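The startup race described above (a service that checks feeds on boot, on a node that boots faster than the firewall) is commonly handled by waiting for the network with a backoff before the first fetch. Here is a minimal sketch of that idea. It is not how Precis actually works; the function names, the DNS-based probe, and the `example.com` host are all assumptions made for illustration.

```python
import socket
import time
from typing import Callable


def wait_for_network(
    probe: Callable[[], bool],
    max_attempts: int = 8,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it succeeds, backing off exponentially between tries.

    Returns True as soon as a probe succeeds, or False once max_attempts
    probes have all failed.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if probe():
            return True
        if attempt < max_attempts:
            # Sleep before the next try, doubling the wait up to max_delay.
            sleep(delay)
            delay = min(delay * 2, max_delay)
    return False


def dns_probe(host: str = "example.com") -> bool:
    """Hypothetical probe: treat the network as up once DNS resolves."""
    try:
        socket.getaddrinfo(host, 443)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False
```

A service could call `wait_for_network(dns_probe)` before its first feed check and fall back to its regular polling schedule if that returns False, rather than exiting. A restart policy on the container would be another way to get a similar effect.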

The Ghost of Homelab Present

Homelab 3.0 is all about virtualization and networking. It took me a while, but I finally got around to playing with virtualization, and I get it now. Being able to abstract away the compute is super useful for setting resource boundaries, and it’s much easier to manage compute from a UI than from different SSH sessions, especially as your node count grows. As such, I’ve converted both the extra-large P720 and my less richly provisioned P520 into Proxmox hosts. Both nodes have GPUs, too - right now the RTX 2080 on the P520 is passed through to a Windows VM that I use for the rare occasion when I feel like gaming. In the long term, I plan to pass through the RTX 3060 on the P720 to Open WebUI and Ollama again, and use that for local LLM serving for both Precis and Continue (code completion).

Both of these Proxmox hosts are located in my office (someday, I’ll have a proper server room…), and therefore have to use wireless backhaul to get to the internet. So - I make a point of not deploying anything too important on these nodes, or at least nothing that my network at large depends on. Namely, they run a bunch of toy Linux VMs, plus an SSD-backed TrueNAS instance that serves as a hot backup for the main NAS, which is… a new Beelink SER7 (AMD Ryzen 7 7840HS w/ 32GB of RAM), also running Proxmox! This node is physically networked to my modem and firewall, and it’s my intention to maximize the availability of this node. As such, I have an Ubuntu Server VM that serves as a basic general-purpose docker host on this node, which currently serves Nginx Proxy Manager, as well as Precis.

The other VM on this node is also Ubuntu Server, but running CasaOS. I’ve always been interested in having a network-level homepage/directory, which might be especially useful as services and servers proliferate. What’s nice about CasaOS is that it bundles a Portainer-like Docker management UI with SMB file hosting and system management. To that end, I’ve added 1TB of a 2TB Sabrent Rocket NVMe SSD (a screaming deal for $80-90, by the way) as an LVM thin pool to use as primary data storage, and exposed it over SMB.

When it comes to networking, I’m prioritizing hardware that has 2.5 Gbps or 10 Gbps networking. I upgraded my wi-fi to use a pair of Deco BE63 Wi-Fi 7 units, which both have 4x 2.5Gbps RJ45 ports. I’m waiting to receive a pair of Dell-made NICs that use the Intel X550-T2 chipset for 2x 10Gbps RJ45. I’m also going to implement a Tenda TEM2007X unmanaged switch. This one has a pair of 10Gbps SFP+ ports, as well as five 2.5Gbps RJ45 ports - I’m a little concerned about thermal management for this switch, as I will have to use 10Gbps SFP+ to RJ45 transceivers to connect the Proxmox nodes. However - in the worst case, it is quite inexpensive to replace the NICs with ones that take SFP+ and use a DAC instead. Starting out, I plan to run just one of these switches, and provide a direct connection between my two Proxmox nodes as a separate network interface, but I may also try adding another such switch and bonding the two ports on each NIC so that each port goes through a different switch (it’ll have to be failover though, as the switch itself is quite cheap and supports neither static nor dynamic link aggregation).

The other theme of Homelab 3.0 is doing things conventionally, or put another way, following best practices. As I mentioned before, I love a good hack, but sometimes this results in taking the path of least resistance in a bad way. Take, for example, the SSL cert thing - it’s certainly true that my network environment doesn’t necessitate HTTPS, but that resulted in other complications. As it turns out, using HTTPS also lets you play some games with compression, so it sometimes even turns out to be faster.

Another example - when it comes to implementing backup between CasaOS and the Proxmox-backed TrueNAS instance, my first instinct was to write a shell script and run a cron job. But - as it turns out, the most sustainable way to build this is actually to use something like Syncthing; all of the eventualities that I would encounter have already been solved for, because everyone runs into the same eventualities. The resilience of Syncthing justifies its complexity, and I think it’s fair to say that the interface does a good enough job of making the process comprehensible (i.e., hiding the complexity). I’ve got big feelings about this, but that’s a loooooong post for another time.

The last component is storage - as I mentioned before, I’m using CasaOS with an LVM thin pool to provide SMB shares. I then use Syncthing to sync those shares with a TrueNAS VM, which serves as a hot backup for that storage. As you should with TrueNAS, I take snapshots of the data on a regular basis. I then periodically sync those snapshots to MEDIA-NAS, which will be living out its retirement as a cold backup. My data is pretty slowly changing, so weekly snapshots with 3 months of retention are my current starting point.
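To make the retention arithmetic concrete: weekly snapshots kept for roughly 3 months works out to about 13 snapshots on hand at any time. The sketch below only illustrates that calculation. TrueNAS handles snapshot expiry itself through its periodic snapshot tasks, so the function name, the 90-day window, and the dates here are all assumptions for illustration.

```python
from datetime import date, timedelta


def snapshots_to_prune(snapshot_dates, today, retention_days=90):
    """Return the snapshot dates older than the retention window, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in snapshot_dates if d < cutoff)


# Twenty weekly snapshots, the newest taken on a hypothetical "today".
today = date(2024, 6, 30)
weekly = [today - timedelta(weeks=n) for n in range(20)]

expired = snapshots_to_prune(weekly, today)
kept = [d for d in weekly if d not in expired]

print(f"kept {len(kept)} snapshots, pruned {len(expired)}")
```

With a 90-day window, the oldest snapshot that survives is 12 weeks (84 days) old, and the 13-week-old one (91 days) is the first to expire.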

Ghosts of Homelabs Yet to Come

Of course, the work is never done. In the medium term, there are a few loose threads that I should chase down:
