Sandia LabNews

E-mail failure, subsequent recovery lead to some hard-won lessons learned


Sometimes "lessons learned" come easy. Sometimes they come hard. The more-than-week-long e-mail outage that directly affected some 2,200 Sandians was a lesson learned the hard way.

The e-mail failure, which occurred on Nov. 5 and lasted for some Sandians until Nov. 16, showed just how critical this relatively new technology is for the Labs.

As Chief Information Officer Pace VanDevender (4010) puts it, "We learned from this how really vital e-mail is to our business — much more so than we had realized before."

Pace, as overseer of the Labs’ computer networks, says he apologizes to Sandians who were without e-mail for an extended period. But he wants people to know that his team labored mightily through what he likened to "the fog of war" to get the system up and running again as quickly as possible.

"I want to commend the heroes of this story — Bob Pastorek, Kelly Rogers, Bill Chambers, and Mark Stilwell — who worked up to 22 hours a day, sustained, for days on end, to figure this out and bring us all back to service," he says.

With essential help from Microsoft, developer of the Exchange e-mail software and the Windows NT operating system, and from Compaq, maker of the computer hardware the software was running on, full service was finally restored for all Sandians. (Exchange is the server side of the Exchange/Outlook/Netscape Communicator client-server e-mail system recently adopted by Sandia.)

A fateful 20 minutes

Bill Chambers (4911), team leader for the SEEMS e-mail implementation effort, was chief wrangler for the recovery effort. As the man perhaps closest to the action, he tells the story as he saw it develop.

"The problem occurred Thursday, Nov. 5. One of the servers in the new [Exchange] system started showing errors," Bill says. "Over the course of about 20 minutes, the database for about 2,200 people came to a full halt. We were left with about 2,200 people without e-mail."

This wasn’t good, not by any stretch, but it wasn’t cause for panic, either. There are tools and procedures for dealing with just such situations. But…

"We used the standard, prescribed procedures to recover the database," Bill says. "It didn’t come up."

Time to get on the horn to Microsoft.

"We have a premier support contract with them," Bill says. "We worked with both their Exchange experts and their [Windows] NT experts."

Rick Harris, Manager of Dept. 4911, interjects a point: "At this time," Rick says, "our goal was to, as rapidly as possible, recover the e-mail that people had already received and to restore the capability for sending and receiving. We felt that combination was very important."

Bill continues: "At this point we were relying on telephone support. One of our folks spent five hours on the line with Microsoft. There were a lot of ideas and we tried various things. We worked through the night to get the system back."

Working from backup tapes, the recovery team was able to get the database back up on the server.

That seemed to be that. A bit of a hassle, but still a pretty much routine recovery effort. The appearance that everything was back to normal, alas, was only a cruel illusion. The database ran okay through the weekend, lulling the team into thinking its worries were over. Then, about 10 a.m. on Monday, Nov. 9, the database crashed again.

Luckily, the recovery team, conservative by nature and wary by experience, had hedged its bets a little bit and had started moving some users off the suspect server on Sunday. When the database crashed again, there were 1,600 customers affected, rather than 2,200.

When the team ran integrity checks on the database, it found the files had been corrupted, so it restored the database from backup tapes again.

Situation escalates to Level A priority

Meanwhile, Microsoft was perplexed by the problem at Sandia and was as eager to find a solution as Sandia was. On Monday afternoon, at the urging of Sandia, the company escalated the situation to its Level A priority status. That meant that any technical person working the problem at Microsoft was not to work on anything else until the Sandia problem was solved.

While this was going on, the Sandia recovery team was nearing sheer physical exhaustion. Microsoft, which had already expressed the highest possible professional regard for Sandia’s in-house capabilities, agreed that it needed to send its own person on-site. (Eventually, it sent two people.)

Bill resumes his narrative: "After the second crash, we decided we really needed to get people off that system as rapidly as possible. We didn’t know what the problem was. We felt we needed to just as quickly as possible move people across to a different hardware box."

After recovering the database — again — the recovery team moved people as quickly as it could to an additional server. "By Tuesday afternoon, about 1,300 people who had originally been on the first server had been moved over to an additional, physically different hardware box."

Rick interjects: "We wanted to believe it was hardware, we really did. Hardware is straightforward to repair or replace. Finding root causes in software can be difficult."

Another fine mess

With some 1,300 people moved to the backup hardware, Bill says the team decided it should do an on-line backup of that new system. Standard procedure.

During that on-line backup, the new system crashed.

Yes.

"This left us in a horrible state," Bill says, "because it left us with a database that was split and we couldn’t really recover either system properly." As frustrating as it was, though, the crash of the second system was a blessing in disguise.

"That probably led us toward the clue for the solution to the real problem," Bill says. "At first we thought that pointed very strongly to strictly a software problem." That’s a reasonable assumption — you wouldn’t expect the same hardware problem to afflict two different machines. However, as is often the case in the world where software and hardware interface, the reality was more complex.

By conducting an analysis with some very sophisticated diagnostic tools, Bill says, the recovery team found that there were some exceedingly subtle problems with a network interface card design. The problems with the card were technically complex, but in simple terms, the way it was receiving data from the network was not correct.

"This caused a problem that cascaded up through the machine," Bill says, "and led to a loss of a critical network service to the database, which caused the database to crash."

Before the team had figured this out, though, it was still looking at a number of different potential causes.

Bits flying through the system

"By this time," Rick says, "we had people from Microsoft talking with people from Compaq at the design engineer level. They were thinking something on our network was mangled. In fact, we found that there was one PC on the network that had a problem that caused it to constantly send requests to the e-mail system.

"We did a — you can think of it as a snapshot of the bits flying through the network," he says. "We were able to capture some of the information and send it to Microsoft for analysis. They were looking to see if everything was interacting correctly. These are incredibly tough jobs. The complexity is high. This is heavy, heavy-duty debugging. Only a few people in the organization know how to do it. This is work at the fundamental level."

Bill explains what this so-called sniffer technology can do: "With the information, you can trace back what machine sent what information. There had been a machine that was sending out bad information constantly, so we thought maybe it was causing the problem. It turned out it wasn’t."
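
In rough outline, that kind of capture-and-trace can be sketched in a few lines of modern Python. The example below is illustrative only: it assumes the scapy packet-capture library and a made-up server address, and it is not the tooling Sandia or Microsoft actually used in 1998.

    # Illustrative sketch of a packet "snapshot": count which machines are
    # talking to the mail server. The library (scapy) and the server address
    # are assumptions for this example, not the team's real tools or addresses.
    from collections import Counter
    from scapy.all import IP, sniff

    MAIL_SERVER = "10.0.0.25"   # hypothetical address of the Exchange server
    talkers = Counter()         # packets seen, keyed by sending machine

    def tally(packet):
        """Count every captured packet addressed to the mail server."""
        if IP in packet and packet[IP].dst == MAIL_SERVER:
            talkers[packet[IP].src] += 1

    # Capture 10,000 packets off the wire, then report the noisiest senders,
    # which is the question the team was asking: what keeps hammering the server?
    sniff(prn=tally, count=10000, store=False)
    for src, hits in talkers.most_common(5):
        print(f"{src} sent {hits} packets to the mail server")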

Corruption in low places

By Wednesday, Nov. 11, a number of possible causes of the error had been eliminated. The Microsoft Exchange expert on site showed the recovery team how it could go in, before attempting to restart the database, and find and repair corrupted log files.

What had been happening was that the last log file had been corrupted.

"That’s what was killing our database every time we tried to restore it after it stopped," Bill says. "We were causing the problem without knowing it."

"Well, this [Microsoft] fellow showed us how to go in and determine where a log was corrupt, or suspect, cut the bad parts out, read the database, and restore it up to that point. He also brought us a set of tools that would allow us to merge the split database back to one server — we had to put people on one server to restore the data. So we did that."

These were new tools and new procedures, Rick notes. "Some of the new tools were about ready to be released but hadn’t been generally released at that point, which shows the uniqueness of this problem, really."
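
For readers curious about the underlying idea, the sketch below illustrates, in Python, the general notion of replaying a transaction log only up to the last record that passes its integrity check. The record layout, checksum, and apply() call are hypothetical stand-ins; Exchange’s actual log format and Microsoft’s repair tools are far more involved.

    # Conceptual sketch: replay a transaction log only as far as the last
    # intact record. The 8-byte header (length plus CRC32) and the database
    # apply() method are hypothetical; Exchange's real log format differs.
    import struct
    import zlib

    def read_valid_records(log_path):
        """Yield records until one fails its checksum or the file ends."""
        with open(log_path, "rb") as log:
            while True:
                header = log.read(8)        # 4-byte length, 4-byte CRC32
                if len(header) < 8:
                    return                  # clean end of the log
                length, expected_crc = struct.unpack("<II", header)
                payload = log.read(length)
                if len(payload) < length or zlib.crc32(payload) != expected_crc:
                    return                  # corruption found: stop replaying here
                yield payload

    def replay(database, log_path):
        """Apply every intact record; the corrupt tail is simply cut off."""
        for record in read_valid_records(log_path):
            database.apply(record)          # hypothetical apply() on the database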

A red herring — and success

With the databases merged back together, Bill says, "we got the system up and we had another failure."

Again.

This threw the team another curve, but it turned out to be — really — unrelated. By using some sophisticated tools, Microsoft experts in Redmond were able to analyze an 800 megabyte "core dump" to determine what had happened and why. It had to do with a bug in the database engine that corrupted the database when a rare set of events happened simultaneously. The bug had been corrected by an Exchange service pack upgrade, but the corruption had occurred at some previous time and was just waiting to bring things to a halt. Not a difficult fix, but one that took another day.

By now, through a combination of perspiration, inspiration, professional excellence, and persistence, the recovery team had homed in on the real culprit — that funky network interface card that was causing errors under high load conditions. After devising and applying some vital workarounds, doing some component change-outs, and some other corrective measures, the recovery team brought the database back up again.

And it stayed up. And running.

By Monday, Nov. 16, at 7 a.m., full service was restored.

As a result of the e-mail failure, says Herb Pitts, Director of Computing and Communication Systems Center 4900, the situation will be handled differently in the future. The paradigm at the beginning of this process was to go for full recovery — meaning full, simultaneous restoration of all data.

"That’s not going to be the case in the future," Herb says. "We’re going to restore send-and-receive capability to everyone as soon as we can and then work on the problem of full data recovery."

Pace says the CIO organization has a clear challenge before it: to make the highly complex client-server computer network as reliable and transparent as the phone system.

"People expect the reliability of the network to be as good as the telephone and it just isn’t," he says. "Making it so will have to be a priority if e-mail is really going to be our heart and soul for doing business."

Because the systems are so complex — it has been said that nothing in the history of technology is as complicated as a computer network — partnerships will become increasingly important, Pace says. "The relationships with Microsoft and Compaq were essential. They brought us new tools and new capabilities that were not available to anyone before."

One more note: While lots of Sandians were frustrated waiting for their e-mail to come back, the vast majority of phone calls handled by Pace’s office, Herb’s office, and Rick’s office were highly supportive of the recovery team’s efforts.

And that support meant a lot, Pace says.

Adds Bill, "We even had a customer who brought in a plate of cookies to us."