Updating a Proxmox Cluster from 4.4 to 5.4 the safe way

Most of us have had a production virtualization cluster that can’t be updated for whatever reason: stability, fear of screwing up, orders from above, or scared co-workers who think that once something is in production you shouldn’t (just don’t) touch anything that might change the intended behaviour of the system.

However, there are some things that do need to be upgraded, especially in cases where a bug that has been pestering you has been fixed on a newer version or a new functionality has been added. Now, I’m not one of those guys that likes to have absolute bleeding-edge software on production servers (obvious reasons might include: it just hasn’t been tested long enough), but having software on production that might soon reach End Of Life or that isn’t behaving up to current standards, that, I have a problem with.

I started working at a new company not long ago, and on my first day I found a few Proxmox clusters running 4.4. That isn’t that bad in itself (don’t touch production servers, remember?), but when I tested them I ran into bugs I hadn’t seen in years (including issues with the noVNC connections to the VMs), so I decided to do something about it and convinced my manager to upgrade the clusters.

At the time of writing, the current version of Proxmox is 6.0. I’ve had bad experiences upgrading to the first release of a new major version of Proxmox before, so I decided to upgrade to 5.4, at least until a more polished 6.x release comes around.

The procedure is pretty simple, if you have classical shared storage (NFS, iSCSI disks, etc.).

Beware CEPH Users

{c:red}This article doesn’t cover the steps for a CEPH upgrade, so please be careful{/c}

So, we start with a cluster. We need to balance out the whole cluster so we end up with one empty node. That means migrating all of its VMs to other nodes, all while making sure the target nodes don’t run out of RAM and, in my case, making sure the CPU load doesn’t top out (production servers, remember?).
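A quick way to sanity-check the RAM side before moving a VM is to ask the target node how much memory it has to spare. This is only a rough sketch: `has_headroom` is a hypothetical helper name, it is meant to be run on the target node, and it only looks at `MemAvailable` in `/proc/meminfo`, which says nothing about CPU load or about what the other guests may claim later.

```shell
#!/bin/sh
# Hypothetical helper: succeed if at least $1 MiB of memory is available.
# The optional second argument is a meminfo-style file (default: /proc/meminfo).
has_headroom() {
    need_mib=$1
    meminfo=${2:-/proc/meminfo}
    # MemAvailable is reported in kB
    avail_kb=$(awk '/^MemAvailable:/ {print $2}' "$meminfo")
    [ -n "$avail_kb" ] && [ $((avail_kb / 1024)) -ge "$need_mib" ]
}

# Example: is there room for a VM configured with 8 GiB of RAM?
# has_headroom 8192 && echo "looks ok to migrate"
```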

The migration procedure is pretty straightforward:

Right click on a VM -> Migrate -> Select the target node -> Get a cup of coffee
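If you prefer the terminal, the same thing can be done with `qm migrate`. The sketch below does not migrate anything by itself: `plan_migrations` is a hypothetical helper that reads `qm list` output and only prints one `qm migrate ... --online` command per running VM, so you can review the list first. The node name `pve2` is a placeholder, and sending every VM to the same target is an assumption you will probably want to change.

```shell
#!/bin/sh
# Print (not run) one "qm migrate" command per running VM.
# Reads `qm list` output on stdin; $1 is the target node name.
plan_migrations() {
    target=$1
    # Skip the header line, keep rows whose STATUS column says "running"
    awk -v target="$target" 'NR > 1 && $3 == "running" {
        printf "qm migrate %s %s --online\n", $1, target
    }'
}

# Usage on the node being emptied ("pve2" is a placeholder):
# qm list | plan_migrations pve2
```

Printing the commands instead of running them keeps a human in the loop, which fits the cautious approach of doing one node at a time.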

After you finish this process you can go ahead and upgrade the node.

First off, start with a check from the terminal to make sure there are no VMs still running on the node:

qm list

Once we make sure there are no running VMs and we’re sure the node is empty (go ahead, check again.) we can proceed.
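If you don’t trust your eyes, the check can be scripted. A small sketch, assuming the STATUS value is the third column of the `qm list` output; `node_is_empty` is a hypothetical helper name:

```shell
#!/bin/sh
# Succeeds only if the `qm list` output on stdin shows no running VM.
node_is_empty() {
    # awk exits 0 when it saw a running VM, 1 otherwise; "!" inverts that
    ! awk 'NR > 1 && $3 == "running" { found = 1 } END { exit !found }'
}

# Usage on the node you are about to upgrade:
# qm list | node_is_empty && echo "node is empty, safe to continue"
```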

First step is to do an update on the repos with:

apt update

Then we can upgrade and reboot the current system with the newest kernel for 4.4:

apt dist-upgrade -y && reboot

Once the node comes back up we need to point APT at the new repos for the upgrade and add the GPG key for the Proxmox 5.x repos:

sed -i 's/jessie/stretch/g' /etc/apt/sources.list
sed -i 's/jessie/stretch/g' /etc/apt/sources.list.d/pve-enterprise.list
wget http://download.proxmox.com/debian/proxmox-ve-release-5.x.gpg -O /etc/apt/trusted.gpg.d/proxmox-ve-release-5.x.gpg
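Before going any further it’s worth confirming that nothing under `/etc/apt` still points at jessie, for example a repo file with a different name than the ones edited above. A minimal sketch; `stale_sources` is a hypothetical helper name, and it assumes all of your repo definitions live in `*.list` files:

```shell
#!/bin/sh
# Print every active (non-commented) APT source line that still mentions jessie.
# The optional argument is the directory to scan (default: /etc/apt).
stale_sources() {
    dir=${1:-/etc/apt}
    find "$dir" -name '*.list' -exec grep -Hn '^[^#]*jessie' {} +
}

# Usage: no output means every repo has been switched to stretch.
# [ -z "$(stale_sources)" ] && echo "all sources point at stretch"
```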

Then we proceed with updating the repos on APT:

apt update

Now, this is very important: we need to remove sysv-rc, openrc, etc. from the system so we’re able to get the latest systemd goodness. DO NOT REBOOT THE SYSTEM after this step:

apt purge insserv sysv-rc initscripts openrc

DO NOT REBOOT THE SYSTEM. Seriously, unless you wanna spend hours fixing the system, or just formatting it and adding it back as a new node (which in hindsight isn’t such a bad idea).

After the removal is done we can finally upgrade the node to the 5.4 version of Proxmox:

apt-get dist-upgrade

This step can take a while, and it will prompt you a couple of times about configuration files. You can keep your current versions or install the package maintainer’s versions, it’s up to you. I keep mine as-is in case I need to recover anything.
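If you already know you want to keep your own configuration files every time, dpkg can be told so up front instead of asking. This is a sketch of that variant, not what I ran on the cluster: `upgrade_keep_configs` is a hypothetical wrapper name, while `--force-confdef` and `--force-confold` are standard dpkg options. With them, dpkg keeps your modified files and saves the packaged version next to them with a `.dpkg-dist` suffix.

```shell
#!/bin/sh
# Hypothetical wrapper: run the dist-upgrade while keeping the existing
# versions of modified configuration files, without prompting for each one.
upgrade_keep_configs() {
    apt-get \
        -o Dpkg::Options::="--force-confdef" \
        -o Dpkg::Options::="--force-confold" \
        dist-upgrade "$@"
}

# Usage (on the emptied node only):
# upgrade_keep_configs
```

Answering the prompts by hand is still the more careful route, since it lets you look at each diff before deciding.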

Finally, you can reboot the system:

reboot

Et voilà!
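To confirm the node really came back on 5.4, you can look at what `pveversion` reports. A small sketch, assuming its output starts with `pve-manager/<version>-<release>/`; `is_pve_version` is a hypothetical helper name:

```shell
#!/bin/sh
# Succeeds if the pveversion line on stdin reports the version given as $1.
is_pve_version() {
    IFS= read -r line
    case $line in
        "pve-manager/$1-"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on the freshly rebooted node:
# pveversion | is_pve_version 5.4 && echo "node is on 5.4"
```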

Now you can migrate the VMs back to the node and start all over again with the next node. This process can take a while and you might be able to automate it. However, I’ve heard stories, both in real life and online, from people who had bad luck automating it and ended up breaking the cluster in one way or another.

Good Luck!