This was a bad weekend

This weekend was supposed to be a nice, quiet, calm time … I had planned to spend it getting some files organized in the two filing cabinets I purchased from MKS (getting rid of old stuff in anticipation of our move to new digs).

Friday was the company holiday party … it’s always been enjoyable … even if I don’t like dancing (Ginny ends up dragging me onto the dance floor a few times). This time, unfortunately, Ginny’s dinner didn’t agree with her… so we headed home early.

Once home, I made sure Ginny was comfortable … and went downstairs to dink around with the computers.

I then made my fatal error.

I decided to upgrade the main Linux system to Fedora Core 3. I figured it would be fairly straightforward to upgrade from Red Hat 8 to FC3.

I ran the upgrade and it appeared to go OK … but after I restarted, the system got progressively slower and slower.

I checked the ‘top’ display and found that the ‘kswapd0’ process was eating up more and more CPU … and then other processes started getting killed because no memory was available.

Then I noticed this error …

Dec 3 23:20:24 linux kernel: Losing too many ticks!
Dec 3 23:20:24 linux kernel: TSC cannot be used as a timesource.
Dec 3 23:20:24 linux kernel: Possible reasons for this are:
Dec 3 23:20:24 linux kernel: You're running with Speedstep,
Dec 3 23:20:24 linux kernel: You don't have DMA enabled for your hard disk (see hdparm),
Dec 3 23:20:24 linux kernel: Incorrect TSC synchronization on an SMP system (see dmesg).
Dec 3 23:20:24 linux kernel: Falling back to a sane timesource now.

I had never seen this before … and it was true that one of my drives didn’t have DMA enabled for it. So I tried turning on DMA for the one drive again. Still had memory problems.

I tried to reboot … same thing happened.

Did a few Google searches … couldn’t find anything definitive.

Ok, I decided that this upgrade was a botch … time to restore from the backup.

My last backup had been the previous morning … but nothing in the critical applications had been damaged, so I backed up the critical stuff as best I could and tried to restore from my tape backup.

It was now around 1am … and I don’t operate too well when I’m tired. I was very careful and…

  1. Booted to the Fedora Core 3 rescue CD
  2. Renamed the directories I wanted replaced so they would just get restored (/bin, /etc, /var, and so forth)
  3. Backed up the application directories (/usr/local)
  4. Started the restore from tape
  5. Went to bed

When I woke up in the morning, the restore had completed … I checked things over and everything looked OK and then rebooted.

System wouldn’t boot up properly … I got the following messages …

Freeing unused kernel memory
Attempting to read beyond end of device
03:06: rw=0 want 1219858068, limit 19743853
Attempting to read beyond end of device
03:06: rw=0 want 1219858068, limit 19743853
Kernel panic: No init found. Try passing init= option to kernel.

I ran fsck on all my partitions and they all checked out fine.

I tried reinstalling grub with no change.

I tried booting and adding init=3 to the kernel line with no change.

This is not good … not good at all.

I called Steve for advice … we discussed it a bit (on and off … he was running errands). I asked him to come over and help … two minds are better than one. I even offered him food. He agreed (Thanks again Steve).

It was going to take him a bit of time to get here … so I took the opportunity to do some more backups to a spare drive. Took me three tries, but I got the important stuff backed up.

After Steve got here we discussed it a bit more … we decided to do two things. I was going to try to actually migrate the applications over to the new system, using the backup I had just made, and Steve was going to scratch the screwed-up system, install a bare-minimum Red Hat 8, and then restore from the tape backup.

The scratch and minimal reinstall went OK … and we got the tape restore going.

We then started looking at the new system to see how hard it was going to be to migrate … we played around with it a bit … and decided that it was more of a chore than either of us was up to at the moment. Too many changes between RH8 and FC3.

After the restore started, we started thinking about dinner … Steve’s friend Fred (who we liked a lot) was in town, so we decided to invite him over for dinner and geeky talk (Fred’s a Mac geek like Steve … so much of a Mac geek, he works for Apple).

After a bit of coordination with Ruth (Steve’s girlfriend), we all went to dinner at Babaluci.

After dinner … we came back. The restore was complete … and I was able to boot up into single user mode.

I transferred the backup drive from the new server back to the old one … and restored the backed-up data to a temporary directory. I had to be somewhat judicious with the restore.

It was then about 9:30 … so I decided I was done for the day. I would finish the restore in the morning.

Steve and I discussed memory upgrades for the new servers we had bought … and then he went home.

I watched TV for a bit longer and went to bed.


<Yawn!> Good morning!

Ok … I’ve had some breakfast and coffee … time to get this puppy back on its feet.

I decided which directories needed to be restored (mostly mailman-related stuff). Did the appropriate backups before copying the directories over, made sure everything looked OK, then copied them over.

Rebooted the system in normal mode … and it seemed to work OK. A few glitches that I had to deal with, but I was able to get those taken care of …

  1. Log files didn’t get restored the way I thought they would … lucky thing I had them triple backed up
  2. MySQL database didn’t get restored … because it’s not part of the normal backup. I had to restore it from the SQL dump that I do before the normal tape backup. Ginny may have lost a blog entry or two. I think I’m going to change the MySQL backup to run more frequently than just before the tape backup. During the MySQL restore, I encountered some errors that were kind of disturbing. The file created by mysqldump contained SQL statements that the mysql client couldn’t handle. I’ll have to do some research on that. I was able to manually fix the dump file and create all the critical databases.
  3. NFS exports were kind of screwed up … not sure why. This might have been a pre-existing condition. Nonetheless, fairly easy to fix.
  4. A few other minor issues that weren’t hard to fix.
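For the more frequent MySQL dumps mentioned in item 2, here is a minimal sketch of what a dump job could look like if it were run from cron several times a day. It is not what I actually run. The output directory and file naming are hypothetical, and it assumes mysqldump is on the PATH and picks up its credentials from an option file.

```python
import datetime
import os
import subprocess


def dump_command(out_dir, now=None):
    """Build a mysqldump command that writes to a timestamped file.

    out_dir and the file name pattern are hypothetical choices.
    """
    now = now or datetime.datetime.now()
    out_file = os.path.join(out_dir, now.strftime("mysql-%Y%m%d-%H%M%S.sql"))
    # --all-databases dumps every database; --result-file names the output.
    return ["mysqldump", "--all-databases", "--result-file=" + out_file]


def run_dump(out_dir):
    """Run the dump and raise if mysqldump exits with an error."""
    subprocess.run(dump_command(out_dir), check=True)
```

Each run leaves its own file, so a bad dump never overwrites the last good one. Old dumps would still need pruning, which this sketch leaves out.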

As of now, I think the system is back up and running OK.

I suspect I will be able to fix anything that isn’t working properly without too much trouble.

I think next time I try to do this kind of upgrade, I’m going to do an image backup with something like Norton Ghost. The restore is far easier.


Monday morning update: I woke up this morning and checked the system … immediately noticed that the tape I had put in the tape drive hadn’t ejected after the backup (the tape status lights were not on, so the backup was not still going).

I checked my email and didn’t find my normal email telling me the status of the backup.

I figured the backup hadn’t run … I then did a few other checks and noticed I had zero space available on my main drive.

Uh oh … this is NOT good.

I dug a bit further, searching for huge files … and finally found one … /dev/tape. Huh?
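For anyone wanting to repeat that hunt, here is a rough sketch of one way to list the largest regular files on a single file system. It is not what I typed that morning, and the function name and defaults are my own. It only counts regular files, which is exactly why a plain file sitting in /dev would show up while real device nodes would not.

```python
import heapq
import os
import stat


def _on_device(path, dev):
    """True if path exists and lives on the file system with device id dev."""
    try:
        return os.lstat(path).st_dev == dev
    except OSError:
        return False


def largest_files(root, count=10):
    """Return up to count (size_in_bytes, path) pairs, largest first."""
    root_dev = os.lstat(root).st_dev
    sizes = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune directories on other file systems, similar to find -xdev.
        dirnames[:] = [d for d in dirnames
                       if _on_device(os.path.join(dirpath, d), root_dev)]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)
            except OSError:
                continue  # file vanished or is unreadable
            # Skip symlinks, device nodes, sockets and so on.
            if stat.S_ISREG(st.st_mode):
                sizes.append((st.st_size, path))
    return heapq.nlargest(count, sizes)
```

Calling `largest_files("/", 5)` as root would walk the whole root file system, so expect it to take a while.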

Then I realized … since we had done a scratch install of Red Hat 8, and the /dev file system isn’t backed up (or restored), the symlink I had initially created from /dev/tape to /dev/st0 wasn’t there … so the system tried to back up the entire file system to the regular file /dev/tape.

I deleted the file, created the necessary symlink, and started a backup manually … when I got home that evening, the backup had completed and the tape ejected.
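A guard like the following would have caught this before the disk filled up. It is only a sketch of the idea, not part of my actual backup script, and the function names are made up. It checks that the backup target, after following any symlink, is really a character device (as a tape drive such as /dev/st0 is) rather than a plain file.

```python
import os
import stat


def is_char_device(path):
    """True only if path, after following symlinks, is a character device."""
    try:
        mode = os.stat(path).st_mode  # os.stat follows symlinks
    except OSError:
        return False  # missing path or dangling symlink
    return stat.S_ISCHR(mode)


def check_tape_target(path="/dev/tape"):
    """Raise instead of letting a backup write to a regular file."""
    if not is_char_device(path):
        raise RuntimeError(path + " is not a character device; refusing to back up")
```

Calling `check_tape_target()` at the top of the backup job would make it fail loudly (and, with any luck, send the failure email) instead of quietly writing a giant file.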

Phew!

One thought on “This was a bad weekend”

  1. D

    Holy crap on a stick. I’ve been out geeked…seriously I’m such a linux newbie I know nothing at all except for the feeling of dread I got just from reading about all this :0
