Fixing a non-obvious LVM “missing volume” error

I have a love-hate relationship with LVM. When it’s working, it’s awesome. Every time something fails, though, it’s a nightmare and I hate it. So far it has always restored everything without data loss, which is what counts, but it never makes that easy. Today I lost a physical volume used by my primary volume group, got the disk back, and then had some trouble getting the volume group back online. This is the story of how I did so, since I couldn’t find any easy instructions or others with quite the same problem.

First off, let’s establish that the fundamental problem was PEBKAC, or in this case PEBCAF: Problem Exists Between Computer and Floor. I was replacing a fan in the front of my case that had started groaning, and I chose to do it without powering down the computer. In the process, I knocked the power cable off one of the drives in my volume group, causing LVM to mark it “missing” (as it should have). The drive’s port regrettably wasn’t marked hot-swappable in the BIOS, so plugging the cable back in didn’t restore the drive to working order. Not thinking much about it, I just did a graceful reboot, expecting it to come right back up.

It didn’t. When the computer booted, it failed to mount a number of volumes, because one or more of their mirrors were stored on a missing PV. Except that the PV wasn’t missing: pvck showed the volume metadata fine, the disk showed up in the device table, and pvs listed the disk, but with the missing flag set. I tried a variety of obvious commands to straighten things out (pvck, vgck, vgchange -ay, and so on), but every one of them ultimately failed with a report of a missing volume.

Then I did something else stupid, which I was fortunately prevented from completing: I tried to use pvcreate (with -ff, no less!) to restore the volume metadata from backup. LVM refused, complaining that the volume was in use. I think that if it had gone through, it might have destroyed my chances of getting the volume back online, so I’m glad it didn’t work.

One of the commands I had tried that I really expected to work, since the disk actually was there, was vgcfgrestore. It failed too, complaining about the same missing PV. At some point it occurred to me to look at the volume group metadata backup file that vgcfgrestore was restoring from. Lo and behold, what did I find but this:

                pv1 {
                        id = "71713b2f-46b7-45c2-8b33-8ef8ccd901ea"
                        device = "/dev/sda1"    # Hint only

                        status = ["ALLOCATABLE"]
                        flags = ["MISSING"]
                        dev_size = 1953521664   # 931.512 Gigabytes
                        pe_start = 2048
                        pe_count = 238466       # 931.508 Gigabytes
                }

To make a long story short, after removing the MISSING flag, vgcfgrestore did exactly what it was supposed to do, and the volume group was restored to full functionality.
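
For anyone wanting to script that edit, here is a minimal sketch. It assumes the backup lives at LVM’s default location of /etc/lvm/backup/<vgname> and that MISSING is the only flag on the affected PV, as it was in my file. The function name clear_missing_flag, the volume group name vg0, and the /root/vg0.fixed path are placeholders of mine, not anything LVM provides. It writes an edited copy rather than touching the backup itself, so the original is still around if something goes wrong.

```shell
# Hypothetical helper (not part of LVM): write a copy of a metadata backup
# in which a lone MISSING flag on any PV is replaced by an empty flag list.
# Assumes the flags line contains only "MISSING", as in the excerpt above.
clear_missing_flag() {
    # $1 = source backup file, $2 = destination for the edited copy
    sed 's/flags[[:space:]]*=[[:space:]]*\["MISSING"\]/flags = []/' "$1" > "$2"
}

# Example usage, as root. "vg0" and the paths are placeholders:
#   grep -n 'MISSING' /etc/lvm/backup/vg0        # confirm the flag is there
#   clear_missing_flag /etc/lvm/backup/vg0 /root/vg0.fixed
#   diff /etc/lvm/backup/vg0 /root/vg0.fixed     # review the change first
#   vgcfgrestore -f /root/vg0.fixed vg0          # restore from the edited copy
#   vgchange -ay vg0                             # reactivate the logical volumes
```

The commented commands are left for you to run by hand on purpose. I would want to look at the diff before letting vgcfgrestore anywhere near a real volume group.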

I’m not sure, but I think the problem is that the clean shutdown with a missing volume saved a backup that thought the volume was missing, so I was trying to restore a backup with the same volume missing. That doesn’t seem like it’s the behavior that anyone would ever want, but there it is. If you ever have this problem, check your volume group configuration backup.

Since versions and such matter for this sort of advice, this was on Debian Jessie with LVM 2.02.111-2.2+deb8u1.

Musings from kb8ojh.net

Fri, 06 Jan 2017
