Re: qcow2 images are being corrupted

Hello, I ran into this once in the past as well. I bet it's closely
connected to qemu snapshots.

> Dear colleagues,
> I'm posting as an anonymous user because there's something that concerns
> me a little, and I'd like to share my experience with you; maybe some of
> you have run into the same thing. ACS is amazing; it has been solving my
> tasks for 6 years, and I'm running a few ACS-backed clouds that contain
> hundreds and hundreds of VMs. I enjoy ACS very much, but there's one thing
> that scares me sometimes.
> It happens pretty seldom, but the more VMs you have, the higher the chance
> that you run into this glitch. It usually happens silently, and you don't
> get any error messages in the log files of your cloudstack-management
> server or cloudstack-agent, so you don't even know that something has
> happened until you see that a virtual machine is having major problems. If
> you're lucky, you see it on the same day it happens; if you aren't, you
> won't suspect anything unusual for a week, and then at some moment you
> realize that the filesystem has become a mess and you can't do anything to
> restore it. You try to restore it from a snapshot, but if you don't have a
> snapshot that was created before the incident, your snapshots won't
> help. :-(
> I have experienced it about 5-7 times during the last 5-6 years, and there
> are a few conditions that are always present:
>  * it happens on KVM-based hosts (I experienced it with CentOS 6 and
> CentOS 7) with qcow2 images (both 0.10 and 1.1 versions);
>  * it happens on primary storages running different filesystems (I
> experienced it with local XFS and network-based GFS2 and NFS);
>  * it happens when a volume snapshot is being made, according to the
> log files inside the VM (the guest operating system's kernel starts
> complaining about filesystem errors);
>  * at the same time, as I wrote before, there are NO error messages in the
> log files outside of the VM whose disk image is corrupted;
>  * but when you run `qemu-img check ...` to check the image, you may see a
> lot of leaked clusters (that's why I'd strongly advise checking each and
> every image on each and every primary storage at least once per hour with
> a script run by your monitoring system, something like `for
> imagefile in $(find /var/lib/libvirt/images -maxdepth 1 -type f); do {
> /usr/bin/qemu-img check "${imagefile}"; if [[ ${?} -ne 0 ]]; then { ... }
> fi; } done`);
>  * when it happens you can also find a record in the snapshot_store_ref
> table that refers to the snapshot on a primary storage (see an example here
> https://pastebin.com/BuxCXVSq) - this record should have been removed
> when the snapshot's state changes from "BackingUp" to "BackedUp", but it
> isn't removed in this case. At the same time, this snapshot isn't listed
> in the output of `qemu-img snapshot -l ...`, which is why I suppose that
> the image gets corrupted when ACS deletes the snapshot that has been
> backed up (it tries to delete the snapshot, something goes wrong, and the
> image is corrupted, but ACS thinks that everything's fine and changes the
> status to "BackedUp" without a qualm);
>  * if you try to restore this VM's image from the same snapshot that
> caused the damage, or from any other snapshot made after that, you'll
> find the same corrupted filesystem inside, even though the snapshot's
> image stored in your secondary storage doesn't show anything wrong when
> you run `qemu-img check ...` (so you can restore your image only if you
> have a snapshot that was created AND stored before the incident).
> As I wrote, I have seen this several times in different environments and
> with different versions of ACS. I'm pretty sure that I'm not the only one
> who has had the bad luck to experience this glitch, so let's share our
> stories. Maybe together we'll find out why it happens and how to prevent
> it in the future.
> Thanks in advance,
> An Anonymous ACS Fan
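
For anyone who wants to try the hourly check suggested in the quoted message, here is a slightly more defensive sketch of that loop. It is only an illustration, not a tested monitoring plugin: the function name check_images, the QEMU_IMG override variable and the "PROBLEM ..." report line are my own assumptions, and only regular, non-hidden files directly inside the given directory are checked. The exit codes in the comment are the ones I remember from qemu-img(1), so please verify them against your version. Also keep in mind that on QEMU versions with image locking, checking an image that a running guest has open may fail unless `--force-share` (`-U`) is passed, and a check of a live image may report transient inconsistencies.

```shell
# Path to the qemu-img binary; the override is an assumption for illustration.
QEMU_IMG="${QEMU_IMG:-qemu-img}"

# check_images DIR
# Runs `qemu-img check` on every regular file directly inside DIR.
# Prints one line per image that does not check clean and returns 1 if at
# least one such image was found, 0 otherwise.
# The body runs in a subshell, so its variables do not leak to the caller.
check_images() (
    failed=0
    for image in "$1"/*; do
        [ -f "$image" ] || continue   # skip directories and an unmatched glob
        rc=0
        # As far as I know, qemu-img check exits with 0 = clean,
        # 1 = check not completed, 2 = corruption found,
        # 3 = leaked clusters only, 63 = format does not support checks.
        "$QEMU_IMG" check "$image" >/dev/null 2>&1 || rc=$?
        if [ "$rc" -ne 0 ]; then
            echo "PROBLEM rc=$rc $image"
            failed=1
        fi
    done
    exit "$failed"
)
```

A monitoring job could then call something like `check_images /var/lib/libvirt/images || <raise an alert>`, with the path adjusted to each primary storage mount point.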

