Eleanor Blair (lnr) wrote,


It is *amazing* how much more relaxed I feel now our mail server is working again!

Boring technical detail:

  • Monday
  • Network has huge problems at weekend
  • Mail server's actual mail delivery (exim) and IMAP location (dovecot) are on an external disk, mounted via iSCSI
  • External disk ends up mounted read-only so mail hangs horribly
  • Reboot machine
  • Machine is a virtual machine - normally these come up pretty quickly, but in this case there's a big timeout related to the iscsi disk and it takes ages
  • We think it's hung so we go look at the virtual hosts
  • The actual main virtual machine maintainer is busy with other things, so we try muddle through
  • It doesn't seem to be running on any of the virtual hosts, so we pick one and start it
  • All seems well briefly, then things start going horribly wrong
  • Turns out we ran two copies of it at the same time and that has trashed /var - time to reinstall
  • Reinstall base system using kickstart, and then I try get the actual mail servers running
  • We have a copy of /etc from the old machine, so I rsync this across to the new one
  • Start to get services back up, then reboot, machine reboots in emergency mode
  • The network falling down again for a couple of hours in the middle of this did *not* help
  • Give up and go home
  • Tuesday
  • Log in at emergency boot screen and look through the journal
  • I've definitely upset it with that rsync - now it can't mount half its file systems at all
  • Unpicking whatever damage I've just done is going to be difficult
  • Start over and reinstall base system again
  • Decide to mount disc over NFS this time rather than iscsi
  • Mount disc, set up servers (yesterday's notes really help)
  • Dovecot seems to be working OK now!
  • Fix some symlinks so mail will actually deliver (oops I'd caused lots of bounces)
  • Realise disc is mounted as NFSv4
  • Change it to mount the disc as NFSv3 - it's all nearly sorted, but there's something weird going on, so try a reboot
  • Machine doesn't come back up. Oh.
  • John helps get it partially up with some fiddling of boot options and chroot (this is magic to me)
  • Fix fstab to not try mount the disk until the network is up - reboot now succeeds
  • Both exim and dovecot now start up happily, mail is being delivered and people can log in
  • Trouble is once you've logged in mail clients stall "waiting for server" if you try view a folder
  • Network maintenance is due at 5pm so shut down and go home
  • Wednesday
  • Boot machine up, wonder why dovecot didn't start, til I remember we disabled it on purpose
  • Fiddle with dovecot options, fiddle with NFS options
  • Turn on debug logging, but it doesn't help much
  • Bizarrely I can *sometimes* view folders and messages, but even then it's very very very slow
  • We narrow it down to almost certainly being a locking issue
  • Resort to strace - which reveals it's stalling on fcntl() calls
  • Tell dovecot to use flock and mount the disk nolock - suddenly everything is working again!
  • Hurray!
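
For anyone curious about the locking bit at the end: the two kinds of lock are different system calls, and a quick probe shows which one is slow on a given mount. This is only a rough sketch in Python, not what we actually ran (we used strace on dovecot itself). The `time_lock` helper and the probe file name are made up for illustration. Python's `fcntl.lockf` takes an fcntl()-style lock and `fcntl.flock` takes a flock()-style one.

```python
import fcntl
import os
import tempfile
import time


def time_lock(path, style):
    """Take and release an exclusive lock on path; return the seconds taken.

    style is "fcntl" (POSIX record lock, the kind strace showed stalling)
    or "flock" (BSD-style whole-file lock). Both calls block, so on a mount
    with broken locking this may hang, which is rather the point.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        start = time.monotonic()
        if style == "fcntl":
            fcntl.lockf(fd, fcntl.LOCK_EX)   # fcntl()-based lock
            fcntl.lockf(fd, fcntl.LOCK_UN)
        elif style == "flock":
            fcntl.flock(fd, fcntl.LOCK_EX)   # flock()-based lock
            fcntl.flock(fd, fcntl.LOCK_UN)
        else:
            raise ValueError("style must be 'fcntl' or 'flock'")
        return time.monotonic() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # A temporary directory is used here; point it at the mail mount instead
    # to probe that file system.
    with tempfile.TemporaryDirectory() as directory:
        probe = os.path.join(directory, "lock-probe")
        for style in ("fcntl", "flock"):
            print(f"{style}: {time_lock(probe, style):.4f}s")
```

On a healthy local disk both should return almost instantly; a large gap between the two on the NFS mount would point the same way strace did.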

I still have a few bits of housekeeping to sort out. And this all needs to be properly documented and in some cases packaged, and added to kickstart post-install scripts. But it's good to be back. And boy have I learned a lot in 3.5 days!
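
One thing that could go into a post-install check is the fstab lesson from Tuesday. Here is a minimal sketch, assuming the fix is the `_netdev` mount option (one common way to make a mount wait for the network; other approaches exist). The function name, the set of file system types and the sample hosts and paths are all hypothetical.

```python
# File system types treated as network mounts; an assumption, extend as needed.
NETWORK_FS = {"nfs", "nfs4", "cifs"}


def network_mounts_missing_netdev(fstab_text):
    """Return mount points of network file systems that lack _netdev."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        if fstype in NETWORK_FS and "_netdev" not in options.split(","):
            missing.append(mountpoint)
    return missing


if __name__ == "__main__":
    # Hypothetical entries, not our real configuration.
    sample = """
    /dev/sda1                 /          ext4  defaults            0 1
    filer.example:/export/a   /srv/mail  nfs   vers=3,nolock       0 0
    filer.example:/export/b   /srv/home  nfs   vers=3,_netdev      0 0
    """
    print(network_mounts_missing_netdev(sample))
```

Run against the contents of /etc/fstab, an empty list would mean every network mount listed is marked to wait for the network.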
