Eleanor Blair (lnr) wrote,

Phew

It is *amazing* how much more relaxed I feel now our mail server is working again!

Boring technical detail:

  • Monday
  • Network has huge problems at weekend
  • Mail server's actual mail delivery (exim) and IMAP location (dovecot) are on an external disk, mounted via iSCSI
  • External disk ends up mounted read-only so mail hangs horribly
  • Reboot machine
  • Machine is a virtual machine - normally these come up pretty quickly, but in this case there's a big timeout related to the iSCSI disk and it takes ages
  • We think it's hung so we go look at the virtual hosts
  • The actual main virtual machine maintainer is busy with other things, so we try to muddle through
  • It doesn't seem to be running on any of the virtual hosts, so we pick one and start it
  • All seems well briefly, then things start going horribly wrong
  • Turns out we ran two copies of it at the same time and that has trashed /var - time to reinstall
  • Reinstall base system using kickstart, and then I try to get the actual mail servers running
  • We have a copy of /etc from the old machine, so I rsync this across to the new one
  • Start to get services back up, then reboot: machine comes back up in emergency mode
  • The network falling down again for a couple of hours in the middle of this did *not* help
  • Give up and go home
  • Tuesday
  • Log in at emergency boot screen and look through the journal
  • I've definitely upset it with that rsync - now it can't mount half its file systems at all
  • Unpicking whatever damage I've just done is going to be difficult
  • Start over and reinstall base system again
  • Decide to mount disk over NFS this time rather than iSCSI
  • Mount disk, set up servers (yesterday's notes really help)
  • Dovecot seems to be working OK now!
  • Fix some symlinks so mail will actually deliver (oops I'd caused lots of bounces)
  • Realise disk is mounted as NFSv4
  • Change it to mount the disk as NFSv3 - it's all nearly sorted, but there's something weird going on, so try a reboot
  • Machine doesn't come back up. Oh.
  • John helps get it partially up with some fiddling of boot options and chroot (this is magic to me)
  • Fix fstab to not try to mount the disk until the network is up - reboot now succeeds
  • Both exim and dovecot now start up happily, mail is being delivered and people can log in
  • Trouble is, once you've logged in, mail clients stall "waiting for server" if you try to view a folder
  • Network maintenance is due at 5pm so shut down and go home
  • Wednesday
  • Boot machine up, wonder why dovecot didn't start, till I remember we disabled it on purpose
  • Fiddle with dovecot options, fiddle with NFS options
  • Turn on debug logging, but it doesn't help much
  • Bizarrely I can *sometimes* view folders and messages, but even then it's very very very slow
  • We narrow it down to almost certainly being a locking issue
  • Resort to strace - which reveals it's stalling on fcntl() calls
  • Tell dovecot to use flock and mount the disk nolock - suddenly everything is working again!
  • Hurray!
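
For anyone hitting the same thing, the last two fixes (waiting for the network before mounting, and the flock/nolock change) probably look something like the sketch below. This is an illustration rather than a copy of the real config: the server name, export path and mount point are made up, and I'm assuming "not until the network is up" means the `_netdev` mount option and "use flock" means dovecot's `lock_method` setting.

```
# /etc/fstab (hypothetical server, export path and mount point)
#   _netdev  wait for the network before trying to mount
#   vers=3   NFSv3 rather than NFSv4
#   nolock   no NFS lock manager; locks only apply on this client
nfs.example.org:/export/mail  /srv/mail  nfs  vers=3,nolock,_netdev  0  0

# dovecot.conf (or conf.d/10-mail.conf)
# Lock index files with flock() instead of the default fcntl()
lock_method = flock
```

One caveat: with `nolock`, locks are not visible to other NFS clients, so this is only reasonable when a single machine accesses the mail store.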

I still have a few bits of housekeeping to sort out. And this all needs to be properly documented and in some cases packaged, and added to the kickstart post-install scripts. But it's good to be back. And boy have I learned a lot in 3.5 days!
