Sorry for the recent downtime

Sorry for the recent downtime
by koitsu on 2010-10-09 (#68399)

I wanted to take a moment to apologise to site visitors over the past 24-48 hours. Some of you may have seen periods of time where the site (any Parodius-hosted site for that matter) wasn't responding -- or last night/today, might have responded but just sat there.

We performed some OS upgrades on Thursday and have been experiencing an ZFS + OS interoperability bug of some sort (we're not the only one). I mitigated the issue by removing use of ZFS on FreeBSD and replacing it with gmirror. I simply cannot tolerate such problems (gotta keep things up and running!), hence the mitigation.

Anyway, enough excuses aside -- just wanted to apologise for the interruptions and so on.

by blargg on 2010-10-09 (#68403)

This glimpse behind the scenes shows that it's not just keeping the server connected to electricity and the internet, as it seems to people like me who are ignorant of how web servers operate.

by Bregalad on 2010-10-09 (#68406)

Quote:
as it seems to people like me who are ignorant of how web servers operate.

Count me as one of them. But there is people around using the Internet who doesn't even have an idea about what a server is.

by Banshaku on 2010-10-09 (#68412)

The joy and pain of upgrading server... So there was a reason behind the downtime and not just some some admin "fairies" tripping on a power cord at the DC

by Dwedit on 2010-10-09 (#68416)

Is ZFS worth using yet? I've just been using ReiserFS up to now.

by cartlemmy on 2010-10-09 (#68420)

Damn admin faeries, they ruin everything.

Seriously though, it's good to see an explanation for down time. This is quickly becoming my favorite message board.

by koitsu on 2010-10-09 (#68421)

Banshaku wrote:
The joy and pain of upgrading server... So there was a reason behind the downtime and not just some some admin "fairies" tripping on a power cord at the DC ;)

I'm the only one who has access to the datacenter (I have two other folks who help me out with work, but they can't get in without me physically being there). So if someone tripped on a cable, it was probably me! ;-)

Thursday night's scheduled maintenance was an OS bare-metal install -- and I did this 100% remotely (2nd time I've done such; probably why I've written a fairly thorough document on the procedure). I ran into some snags (reboot, "Oh wait a minute, crap, this isn't going to work, I forgot to..."), which caused me to go into a frenzy/angry panic ("I'm going to have to go the datacenter to fix this"), followed by pure anger because I couldn't find my datacenter badge or cage key. Since I'm in the process of moving, all sorts of boxes and crap are scattered throughout my flat -- so I literally tore the place up (it looked like a typhoon hit it, and I'm in no way exaggerating), yadda yadda... I was super upset/irritated. Only until I calmed down did I realise I didn't have a badge/key any longer because the datacenter had upgraded to biometric (hand) scan and a pin + cage keypad locks. Sigh.

After I got things back up and working -- with ZFS in the picture -- I went to bed and thought all was well. Twelve hours later (Friday evening) I get a highly erratic call from our junior admin who didn't really do a good job of troubleshooting the problem and just wanted to reboot the box (which didn't work because of the actual problem, some kernel thread/operation was flat out hung), and he didn't have access to the remote rebooter (that's my fault). I was groggy given that I had taken Nyquil + melatonin to sleep.

Once he described the problem slowly, I was like "...this sounds familiar, I think someone on freebsd-fs posted something like this recently". Yup, the situation we experienced was identical to another guy running on completely different hardware, totally different software configuration, etc.. His workaround was to remove ZFS from the picture.

So that's exactly what we did (migrated from ZFS to using a block/sector-level RAID-1 implementation called gmirror, with standard UFS2 filesystems in use). System's been stable so far, with no signs of processes being deadlocked waiting for internal ZFS operations to complete (since ZFS isn't in use).

And no, none of this is a hardware problem or OS/hardware incompatibility. It's purely a FreeBSD 8.1-STABLE software bug, and absolutely 100% related to ZFS.

Dwedit wrote:
Is ZFS worth using yet? I've just been using ReiserFS up to now.

ZFS is worth using only if you're running on Solaris or OpenSolaris. Linux's ZFS port uses FUSE, which means performance is going to suck. There are also two ZFS kernel-level ports (super-duper-insane-patches) in progress, but I don't know anything about them. ZFS on the Solaris' literally just works -- no tuning, nada. On FreeBSD, it's a complete nightmare, as this whole situation proves.

At my day/night job, we use Solaris 10 extensively across thousands of machines (all different hardware revisions/models/specs), with absolutely zero problem. It's wonderful.

As for ReiserFS, I have no personal experience with it, but I do have a colleague who has literally had to hex edit a hard disk to recover pieces of a ReiserFS filesystem that exploded horribly due to some software bug. He could comment on its stability. If I was using Linux, I'd probably stick to using md and nothing more (I'm a KISS admin). However, on Linux, Btrfs is looking very, very nice though, and will definitely give ZFS a run for its money. Thumbs up to positive evolution that keeps it simple.

by lidnariq on 2010-10-09 (#68423)

Dwedit wrote:
Is ZFS worth using yet? I've just been using ReiserFS up to now.
This is totally a tangent. Sorry...

Years ago, when all the journalling filesystems were being pushed into the linux kernel, my friends and I went and did the legwork to figure out what was best one at the time. Conclusions then:
* ReiserFS3 was occasionally fastest, but the most likely to explode and eat all your data. (ReiserFS4 was going to be unequivocally fastest, but also basically guaranteed to explode and eat your data)
* XFS was best at big files and good otherwise
* JFS was slowest for most use cases but unlike the rest didn't degrade at gazillions of files in one directory (I think?)
* ext3 was backwards compatible and the only option that supported journalling of data (which saves you in case of a power outage)
Since then, ZFS, ext4, and btrfs all appeared and seem better. (e.g. ext4 added a number of features from XFS and research papers) and meanwhile, XFS seems to have bit rotted a little. When I install my new machine, I'm just going to use ext4 because I suspect btrfs won't be finalized yet.

by tepples on 2010-10-10 (#68454)

lidnariq wrote:
ReiserFS3 was occasionally fastest, but the most likely to explode and eat all your data. (ReiserFS4 was going to be unequivocally fastest, but also basically guaranteed to explode and eat your data)

So hide your data, hide your kids, hide your wife. (And hide your husband.) As they say on Slashdot, that's one killer file system.

Quote:
Since then, ZFS, ext4, and btrfs all appeared and seem better.

I haven't had any problems with ext4 on my laptop running Ubuntu Lucid. (Still waiting for Maverick Meerkat McCain to hit mirrors.) But then servers tend to put different demands on file systems.