Upgrade News and System Downtime

The upgrade is finally, I think, complete – at least from the point of view of DearDiary users. There's some system admin crud that still needs to be finalised, and no doubt over the next few days we'll find bits and bobs that need sorting.

I am glad to say, however, that no data was lost as a result of anything, and no emails went astray (though it's possible that some bounced – please resend if you had something bounce on you!). None went into a black hole, never to be seen again.

I can tell you I never, ever want to upgrade a server again. I have NEVER experienced so much stress as this. Not wanting to be one to hide when I make a mistake, I include a link to an entry titled DearDiary SUX here.

Firstly, I apologise again to everyone affected by the downtime. It has been a long, hard slog to get this right. I'll start from the beginning:

The day before yesterday, 3rd April 2000, in the morning, I decided enough messing around had been done with the new server and it was time to shift everyone across. The timing was chosen so that as few people as possible were affected (it's not easy – the site is on the go pretty much 24 hours a day, so whenever you choose, someone is affected). The system HAD to be shut down completely in order to safely migrate all the data, otherwise entries could go missing, comments could get lost or emails could go astray.

To that end the server was taken down. It was all brought back up again within about 20 minutes or so; however, as a final test, a reboot was done to ensure that the box would return to functionality if such a thing should occur 'naturally'. After the reboot one of the processors was missing!!! It's a dual-processor box, running on only one processor – a bit like your 4-cylinder car only running on 2. Things get a bit lumpy and you wonder why you paid all that money.

So, now it was time to investigate. First, everything was switched back to the old server and ‘normal’ (if slow!) service was resumed… So that was the first major outage for no good reason. Now I had to investigate just why the second processor was missing. I did some detective work, and Matt did some too and we made about a dozen changes to the Linux kernel configuration to try to get the second processor noticed. No joy.
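For anyone chasing the same ghost: on Linux, `/proc/cpuinfo` lists one `processor` stanza per CPU the running kernel has actually brought online, so the quickest sanity check after each reboot is just to count them. A minimal sketch:

```shell
# Count the CPUs the running kernel can see.
# On a healthy dual-processor box this prints 2; after the
# bad reboot, a box like ours would print only 1.
grep -c '^processor' /proc/cpuinfo
```

The kernel boot messages (`dmesg` right after boot) will usually also say why an SMP kernel fell back to a single CPU, which is worth checking before blaming the kernel configuration.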

I called the box provider. They took a look and said 'Oh, must be a processor failure, let me switch that out for you real quick'. Which, to their credit, they did VERY quickly. The machine returned and rather splendidly it had two processors. Neat. Not being fooled a second time, I immediately rebooted the box (a 'warm' reboot – i.e. not one where the power is off for a while) and checked.

AAAARGH!! Where is my bloody second processor? Imagine picking your car up from the garage and it's running lovely. You get it home, have a coffee and decide to nip to the shop. When you go to start the car, it's running rough as all hell again… You're not pleased, are you?

So I call the provider again. This time they tell me it's the software that's the problem, because they changed the hardware (the processors) and so it can't be those. I agree it can't be those, but suggest it's the motherboard, because I don't believe Linux is at fault. However, I can't prove it yet…

So now I go on a hunt around the net to find ANYTHING about missing processors. I’m beginning to doubt my own ability when, after drawing a blank everywhere, I get an email from the provider (after I sent an email stating I was 99.5% convinced it was hardware related) stating that after consideration of the facts, they agreed it was hardware…

So where do I send the bill for my wasted day? (That one's still under discussion :)).

So I went to bed. But at least the site was up for most of the day. That was Tuesday.

Wednesday comes along and the server has new hardware. A reboot later and wowsers, both processors fired up too. Splendid. We have working hardware. So I start the migration AGAIN. More downtime (sorry guys!).

Things get a little weird from here, and I really can't remember much. Suffice to say it didn't go well, and the site was up and down on a regular basis throughout the day. It's now Thursday at 1:40am and we've just got most of it running.

Things that went wrong:

  • DearDiary.Net itself was up and down as a result of nameserver problems, and my provider using a transparent nameserver proxy AND httpd proxy. Hence, I often saw no problems when in fact the thing was on its face. Consequently I ended up doing 8 different things at once and getting REALLY stressed.

  • Lots of STANDARD packages not installed by the provider. Such as IMAP server, POP3 server etc.

  • Bringing on www.openfiction.com broke DearDiary.Net. Some of you may have noticed the OpenFiction pages when you tried to access DearDiary.Net.

  • Clearing things up so that DearDiary.Net resolved correctly broke OpenFiction. Back to square one.

  • NFS is royally broken under Linux 2.4, so I struggled for an hour with that this morning before giving up and dropping the box back to 2.2.19. We need to reboot into 2.4 at some point as it's FAR faster than 2.2.19 under pressure.

  • Our provider stated we could not use one of the IP addresses on the box, so the secondary nameserver (ns2.atomic-systems.com) was configured to use the first virtual IP address. However, this server would not present that IP address when notifying of new names, and so the secondary rejected the whole zone. Thus the deardiary.net domain was resolving to the wrong server on and off, intermittently… This was fixed by saying screw the provider and using the non-virtual address… This may bite us in the butt in future, but it's quick to fix. Another hour or so gone while I reconfigured the Cobalt RaQ to accept a different master server.
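For the record, if the master nameserver is BIND (as on most boxes of this vintage), there is a less drastic fix for a multi-homed master sending NOTIFYs from the wrong address: BIND lets you pin the source address explicitly. A sketch only – the address below is a documentation placeholder, not one of ours, and assumes a reasonably recent BIND:

```
// named.conf on the master (192.0.2.10 is a placeholder address)
options {
    // Force outgoing NOTIFY messages to leave from this address,
    // so the secondary sees the master IP it has configured and
    // doesn't reject the zone.
    notify-source 192.0.2.10;
};
```

With that in place the secondary's `masters` entry and the NOTIFY source line up, so the zone transfers go through without abandoning the virtual IP.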

That's just a list of the things that went WRONG today. What with the incidental things like the log analyzer being a royal pain to reinstall, and the things that did actually go right, I'm a little tired.

And I was very irritable – right about the time someone decided to mail in and criticise my server, something I take pretty personally… Apologies to PorcelainDoll for that!

I think I can probably go to bed soon!

It's worth noting that this server would still be down now if Matt hadn't come along, calmed me down, dug around for some stuff and got stuck in to help – mostly by pointing out the blisteringly obvious that had eluded me for 2 hours in my rage! Thanks Matt. And thanks to Päivi for putting up with the profanity, shouting, cursing and damn near crying that took place today too. I love you Ummush! 🙂

Steve.
