Trust No One


Something happened at work this week that put a fine point on the theme running through most of my articles: preparedness, or being able to recover from a disaster.

I currently support two business units which use M2M, separated by approximately 1400 miles. Yesterday morning we started getting reports of computers at our remote company going down. Approximately half of them simply dropped off the network. Normally, this kind of thing would not be my issue, but since the remote company uses M2M from our base via the internet, and the ERP system is absolutely vital, I was involved.

The computers would boot up, but would not “see” the network. They also would not allow System Restore, and other Windows functionality was crippled. What do you do in a situation like this? Well, I see it this way.

  1. Don’t panic, no matter who is yelling at you to fix it. Take a deep breath and form a plan before doing anything. You must remain focused to fix the problem and agitation will only slow you down.
  2. Ask yourself this simple question: “What’s changed?” A related question is, “What is the difference between the computers which failed and those still working?” If 30+ computers were working fine yesterday and are dead today, something widespread has changed.
  3. Resist the urge to “try things” at random, without regard to whether they are a logical cause. This wastes time, and you may very well break something else while looking for a fix for the original problem.
  4. Don’t forget to check error logs, if applicable.
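
The second question above can be treated as a simple set comparison. The following is only a minimal sketch: the machine names and trait labels are invented for illustration, and it assumes you have some inventory of what is known about each computer. It looks for traits shared by every failed machine that appear on no working machine.

```python
def distinguishing_traits(failed, working):
    """Return traits found on every failed machine and on no working machine."""
    if not failed:
        return set()
    # Traits that all failed machines have in common.
    common_to_failed = set.intersection(*(set(t) for t in failed.values()))
    # Traits seen on at least one machine that still works.
    seen_on_working = set().union(*(set(t) for t in working.values()))
    return common_to_failed - seen_on_working


# Hypothetical inventory: names and traits are assumptions, not real data.
failed = {
    "remote-desktop-01": {"os:xp", "site:remote", "booted_after_av_update"},
    "remote-desktop-02": {"os:xp", "site:remote", "booted_after_av_update"},
    "base-laptop-01": {"os:xp", "site:base", "booted_after_av_update"},
}
working = {
    "remote-desktop-03": {"os:xp", "site:remote"},
    "base-desktop-01": {"os:xp", "site:base"},
}

suspects = distinguishing_traits(failed, working)
print(suspects)
```

Anything left in the result is worth investigating first; an empty result means the inventory does not yet record the trait that matters.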

So, what did we do? Following those principles, we came to the following conclusions.

  1. It wasn’t an internet issue as half of the company was still working perfectly.
  2. Why was the remote company heavily hit while my site only had one machine die? This is not a hard and fast observation because the machine which died on my site was a laptop and all of the remote machines were desktops. It might have just been a coincidence. However, it turned out to be a crucial clue. It’s important to note that the remote company is two hours behind us.
  3. We eliminated the possibility of network switches being bad. We had someone at the remote site swap a bad computer and a good computer with each other, and the results stayed with the machines: the bad one was still bad and the good one was still good.
  4. We took one of the bad machines and re-imaged it. This didn’t take long, and proved the problem was software.
  5. We started looking for viruses, though we have enterprise virus protection. Since half of the company “got it” it sure looked like virus activity. We found none.
  6. We started looking at our Windows Update server to verify that an update didn’t coincide with the outage.
  7. We checked our McAfee anti-virus software and found out that an update had recently been pushed.
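
Steps 6 and 7 amount to asking which updates landed shortly before the outage began. Here is a minimal sketch of that check, assuming you can gather update events from your various servers into one list. The source names and timestamps below are placeholders, not the actual times from this incident.

```python
from datetime import datetime, timedelta


def updates_before(outage_start, update_log, window_hours=24):
    """Return (time, source) pairs for updates pushed within the window before the outage."""
    window = timedelta(hours=window_hours)
    return sorted(
        (when, source)
        for source, when in update_log
        if timedelta(0) <= outage_start - when <= window
    )


# Placeholder data for illustration only.
update_log = [
    ("windows-update", datetime(2000, 1, 1, 3, 0)),
    ("anti-virus", datetime(2000, 1, 3, 7, 30)),
    ("erp-client", datetime(1999, 12, 20, 9, 0)),
]
outage_start = datetime(2000, 1, 3, 9, 0)

for when, source in updates_before(outage_start, update_log):
    print(when.isoformat(), source)
```

All timestamps should be in the same time zone before comparing them, which matters when the sites are two hours apart.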

Well, that update was the culprit: the anti-virus software attacked essential services in Windows XP. People who read this blog should know that I am really anal about updates. I don’t apply M2M, Crystal, or SQL Server updates without testing. However, one doesn’t expect to have to guard against one’s own anti-virus software.

What can be learned from this?

  • Trust no one. Suspect every piece of software no matter how rock solid the reputation.
  • Be prepared for anything. Most of us have backups of our servers, but are they any good if we lose our desktops? We have images for every computer we own and can roll them out at a moment’s notice, even 1400 miles away.

How long would it take you to re-image (or install manually) all of your desktops? We have around 75, and the time to re-image… about an hour.

8 comments to Trust No One

  • Judy

    Good article – and amazing because I got hit with a similar problem this morning. My laptop was “attacked” by a software update. Really crunched into my morning routine, but recovery was easy — because I’m a backup fanatic.

  • roleki

    Finally – a tangible benefit to running Trend Micro’s OLE!!! Antivirus.

  • Kim

    I’m confused. Why was one company affected and the other was fine?

  • As Roleki said – move to TrendMicro!!!

  • I’m sorry Kim, I should have explained. If you look deeper into that article, the user only has the problem if they boot or re-boot their systems with the specific virus update installed. McAfee pushed out a corrective update a few hours after their mistake.

    Since the remote company is 2 hours behind, the bad patch was active when they were coming into the office to start their day. As soon as they turned on their computers, they were hosed. Only 1 person at my site had the problem because everyone had logged in before the bad patch was distributed.

    Avi, we are but cogs in a giant machine, and the AV software purchasing decision is not mine, unfortunately.

  • Nice tidbit Dave, could have used Fox Mulder and Dana Scully on this one. Feel your pain as I lost a server to MS monthly’s back in the day. I had McAfee update issues and tossed it way back. Truth be told you are on the money “Trust no one”. Oh and don’t forget “The truth is out there”.

  • Curt

    Dave,

    Your statement intrigues me, “We have images for every computer we own and can roll them out at a moment’s notice, even 1400 miles away. ”

    What do you use for imaging and how do you restore images from “1400 miles away”?

    Curt

  • Curt, I’m sorry but that isn’t my expertise. Our networking technician handles it. It involves Microsoft technology which facilitates quick and easy desktop roll out after it has been set up properly.

    I’ve seen it in action, and it works. However, since we’re an Enterprise Agreement customer, it may be something that is out of reach of the typical company which uses M2M.
