Pop Quiz Hotshot! Your Server Dies! What do you do? What do you do?

Dennis Hopper - Speed

You get a frantic phone call at 3am on a Sunday. It’s your boss, and there’s been a fire. The server room, and perhaps the whole building, is still smoldering. He asks you the following questions:

  • Is our data safe?
  • Is it off site? Is it secure or on someone’s kitchen table?
  • When was the last time you checked the backups? You are checking them, right?
  • Are you sure you can restore the M2M files? You have been performing test restores, right?
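One way to keep an honest answer to "when was the last time you checked the backups?" is to let a small script ask the question every day. Below is a minimal sketch, not a definitive implementation. It assumes your backups land as `*.bak` files in a single folder and that anything older than 24 hours is a problem. The function name, the pattern, and the threshold are all my own placeholders, so adjust them to your setup.

```python
import time
from pathlib import Path


def check_backups(backup_dir, pattern="*.bak", max_age_hours=24.0, now=None):
    """Return a list of problems found with the newest backup file.

    An empty list means the newest file matching `pattern` exists,
    is not empty, and is no older than `max_age_hours`.
    The pattern and threshold are assumptions, not M2M requirements.
    """
    now = time.time() if now is None else now

    # Sort matching files by modification time, oldest first.
    files = sorted(Path(backup_dir).glob(pattern), key=lambda p: p.stat().st_mtime)
    if not files:
        return [f"no files matching {pattern} in {backup_dir}"]

    newest = files[-1]
    info = newest.stat()
    problems = []

    age_hours = (now - info.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(
            f"{newest.name} is {age_hours:.1f} hours old (limit {max_age_hours})"
        )
    if info.st_size == 0:
        problems.append(f"{newest.name} is empty")
    return problems
```

You could call `check_backups(r"D:\Backups")` (a made-up path) from a scheduled task and send yourself an email whenever the list is not empty. Keep in mind that a fresh, non-empty file only tells you the backup job ran. It says nothing about whether the file can be restored, which is why the test restores still matter.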

There are many blog articles on the web about the importance of disaster recovery, and you’ve probably read a few of them. I won’t re-hash everything here, but do you have the answers to all of those questions? If not (and you value your job), stop reading this and address them immediately. I’ll wait….

Good! Now that the above are covered, let’s address some M2M-specific issues.

  • Can you build your SQL Server and install everything without help? Servers seem to prefer to die on weekends, and M2M Support is not available late at night or on weekends and holidays.
  • Where is your software? Latest Service Packs? Software keys? At the bare minimum, an M2M server requires Windows Server, Visual FoxPro, SQL Server, and Made2Manage.
  • Do you have documentation handy with all of your M2M and other settings?
  • Do you have instructions ready to install all of your optional modules? Do you have the latest install files for them? I recommend keeping all of this (including the install files for everything) in electronic format with your back-ups if possible.
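The checklist above is easier to trust if something checks it for you. Here is a rough sketch of a manifest audit for a disaster recovery kit: it confirms that each required file is present and, where you have recorded a SHA-256 checksum, that the file has not been corrupted. Every file name in `EXAMPLE_MANIFEST` is hypothetical and only illustrates the idea. Your installers, keys, and documents will have different names.

```python
import hashlib
from pathlib import Path

# Hypothetical kit contents: relative path -> expected SHA-256 hex digest,
# or None to check only that the file exists.
EXAMPLE_MANIFEST = {
    "installers/windows_server.iso": None,
    "installers/sql_server.iso": None,
    "installers/visual_foxpro_setup.exe": None,
    "installers/made2manage_install.zip": None,
    "docs/software_keys.txt": None,
    "docs/m2m_settings.txt": None,
}


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large ISO images do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_kit(kit_dir, manifest):
    """Return a list of problems: missing files or checksum mismatches."""
    problems = []
    for rel_path, expected in manifest.items():
        path = Path(kit_dir) / rel_path
        if not path.is_file():
            problems.append(f"missing: {rel_path}")
        elif expected is not None and sha256_of(path) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Running `audit_kit` against the folder that travels with your backups, with a manifest like the one above, gives you a short list of what is missing before the disaster rather than during it. I left the checksums optional because recording them for every installer is extra work, but they are the only part of this sketch that catches a file that is present and damaged.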

I remember watching a movie where special forces soldiers spent hours assembling and disassembling their rifles blindfolded. The point was that they became intimately familiar with their weapon and knew everything about it. Ultimately, this made them more effective soldiers.

Recently I wrote disaster recovery instructions for SQL Server, Made2Manage, and the associated modules for my current employer. One of my fellow employees performed the same process on a Domain Controller and Exchange Server. We then verified them by restoring everything from bare hardware, back-ups, and our instructions in an isolated room without access to the internet. Our manager also put us under stress by telling us our results were going to be public and that we had better not let him down.

Why a closed room and stress? It simulates a disaster. Can you be assured that you will have access to the internet in a disaster? Even if you could, would your boss want to wait while you download service packs? Since we had all of the files and had practiced, we built and tested our servers in less than 4 hours.

What are the benefits of a complete test?

  • To prove we could do it.
  • I learned exactly what makes M2M work.
  • I’m now confident that I can fix anything on that server.

Incidentally, with the exception of our database server, the rest of our servers are replicated in two other sites in the United States. In the near future, our SQL Server will be replicated the same way. So we should never actually be “under the gun” so to speak.

So Hotshot, what do you do?

18 comments to Pop Quiz Hotshot! Your Server Dies! What do you do? What do you do?

  • David,
    Question on the practice restore. If it doesn’t work and you are practicing on your production server (say, on the weekend), what do you do? If not on your production server, then is it valid? I back up server data to a Buffalo TeraStation located in a separate building on site. The strategy assumes only one of the buildings would be destroyed. In my boss’s words, if both buildings were destroyed, there would be no need to worry about the data….

  • Forget the problem of both of the buildings being *destroyed* – think smaller. What would happen if both buildings had no power for two weeks? No internet connectivity? These are completely legit problems – I faced them during the aftermath of Hurricane Wilma in Miami and Hurricane Ike in Houston. Nothing in my buildings got touched, but without power or internet, I needed another plan – fast.

  • Brent,
    You make a valid point. However, as a manufacturer we simply do not operate when there is no power. Several years ago there was a meltdown that left a good part of the Midwest and parts of the East Coast without power. The culprit was supposedly FirstEnergy, located in our neck of the woods. We had to close down for a few days, as without power we could not operate. It is simply not practical for a manufacturer to maintain duplicate facilities.

  • Rick – you don’t have to maintain duplicate facilities, you just have to have a plan. This is why there are offsite tape vendors like Iron Mountain that will come out to your location and pick up backup tapes on a regular basis. If you can’t afford that, you can rotate tapes out yourself, or log ship over the wire to a colo file server. The important thing is just to be able to recover – and every business needs to be able to recover.

  • scott

    I agree with Rick’s original point, only I have it even worse. I don’t have a backup or test server to practice on. I can’t afford to take down our production server on a weekend and risk not having it working again Monday morning. The budget simply does not allow for the purchase of an additional server.

  • Scott,
    Exactly. It is something I have never heard explained. By the way, the president of my company says that in the event we lose two or more buildings at the site, the owners will all be heading to warmer climes as the property is put up for sale. So I guess not all businesses need a continuation plan…

  • Rick, I personally would not use my production server for testing. In fact, I don’t do any development or customization on it if I can help it. Everyone should have a Test Server. If you (with your recovery plan) can build a server from scratch and install all the necessary components, there’s no reason to think that it won’t work flawlessly with your production Server.

    Scott, as far as cost, you can build one of these out of an old, decommissioned desktop computer and the only cost you have is for SQL Server Developer edition. The cost is about $50.

  • scott

    What do you mean the only cost is $50?

    For starters I doubt any of my old desktop PCs could run everything my server does (even at a drastically reduced speed).

    Secondly I would also have to buy the following software to do it legally.

    Windows Server 2003
    Visual FoxPro

  • Scott,

    You don’t need to install Visual FoxPro on the server itself. Therefore you could use your company PC for that. Also, you can use Windows Server for a 60-day trial period at no cost. Of course, you can complete disaster recovery exercises in less than 60 days.

    In addition, a test server (for custom code and such) only needs Windows XP, not Windows Server, and most office PCs come with a licensed copy, so there’s no extra cost there either.

  • Andrew

    I read your Test Server article. Is a test server running on XP, actually a valid test? This kind of goes along with Rick’s question. How do you know that code you write will work on your production server if you aren’t running the same operating system on your test box?

  • Andrew & Richard – great questions, but here’s the thing: some testing is better than no testing. If you’re sitting around waiting for the company to buy you an environment that’s an exact duplicate for production as your very first test environment, you’re working for someone with a lot more money than most. Your job counts on your ability to restore your data, and if your company hasn’t bought you a testing environment that’s robust, you have to start somewhere. David’s article is about getting started with fire drill restores, and sadly, I just see way too many “DBAs” who have never tested their restores. That’s just dangerous.

  • Andrew

    Thanks Brent. David, what do you use for a Test Server? Operating system? SQL Version?

  • I’m using SQL 2008 on Windows Server 2003. I know that’s not the ideal situation. I should be using SQL 2000 on my development/test machine, but I am experimenting and learning 2008. So far I have not run into any inconsistencies.

  • I created a “test/Report” SQL server to provide access for Excel users to do as they please without affecting the prod environment. It runs without taking M2M offline – using a snapshot publication initially.

    /* This procedure exists on the Report M2M server. */
    /* I also have a code snippet that confirms that there are no open transactions or users on the table prior to the copy. */
    /* I drop the report server tables, then move only the core M2M tables over, then run any SQL stored procedures on the report server, thus taking the load off the production server and generating whatever they need via a scheduled task using VB.Net. */

    /* Drop the tables from the report server. The report server name is M2MRPTSQLServer; the database is 'RPTM2MDATA01'. */
    IF EXISTS (SELECT * FROM RPTM2MDATA01.INFORMATION_SCHEMA.TABLES WHERE TABLE_CATALOG = 'RPTM2MDATA01' AND TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'apitem')
        DROP TABLE RPTM2MDATA01.dbo.apitem
    IF EXISTS (SELECT * FROM RPTM2MDATA01.INFORMATION_SCHEMA.TABLES WHERE TABLE_CATALOG = 'RPTM2MDATA01' AND TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'apmast')
        DROP TABLE RPTM2MDATA01.dbo.apmast
    IF EXISTS (SELECT * FROM RPTM2MDATA01.INFORMATION_SCHEMA.TABLES WHERE TABLE_CATALOG = 'RPTM2MDATA01' AND TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'apvend')
        DROP TABLE RPTM2MDATA01.dbo.apvend

    /*repeat for each table you may need, like SOMAST, INMAST etc – Follow the same for other companies.*/

    /* Copy the tables from the production M2M SQL Server instance (M2MProdSQL) into the report database 'RPTM2MDATA01'. */
    /* M2MProdSQL is the production M2M SQL server; the four-part names below require it to be set up as a linked server. */
    SELECT * INTO RPTM2MDATA01.dbo.apitem FROM M2MProdSQL.m2mdata01.dbo.apitem
    SELECT * INTO RPTM2MDATA01.dbo.apmast FROM M2MProdSQL.m2mdata01.dbo.apmast
    SELECT * INTO RPTM2MDATA01.dbo.apvend FROM M2MProdSQL.m2mdata01.dbo.apvend

    Do you want to know how many connections you have to each SQL database? I find this useful, as every now and then I find old connections that should be cleared but hung for some reason or other.

    select db_name(dbid), count(*) from master..sysprocesses
    where spid > 49 -- everything below 50 is SQL Server itself
    group by db_name(dbid)

    database connections
    M2MDATA01 6
    M2MDATA50 2
    master 1

    Then run:
    exec sp_who2

    Just a little snippet if anyone needs

  • Larry, I glanced at your code and it’s interesting. However, there are much easier ways to make a copy of the M2M database for reporting. The easiest I can think of is simply restoring a backup of your live data to another server for reporting. That way you don’t have to worry about users being out of the system or other issues.

    BTW, you really should look into the new M-Data Analytics project. I think we could use your skills and you seem like someone who would like to go further with SQL Server as well.

    I’ll be blogging prolifically about it in the coming weeks.

  • Good article, David, and I know I’m late to respond here. I think that Brent is right, you have to think smaller. In 20 years in this business, I’ve dealt with hurricanes and large disasters, but the most common ones are small. An internet line gets cut. A fire in the building that requires us to move out for 2 weeks. A server dropped on another server and we have to rebuild both.

    I agree that most of the time companies don’t want duplicate facilities, and they rarely, in my experience, purchase hardware that mirrors production, but you still have to plan out, think about, and test some other hardware. I need to have test servers that I can set up to take some of the load from production. These days, in 2012, so much desktop hardware can run a decent load. Even if you have a 64-core production server with 512GB of RAM, I’m sure that a quad-core, 16GB server will be able to run the system. It can’t support the load, but it can run, and you can test to see if you’ve set it up properly.

    Preparing for DR is the biggest issue, and I love the analogy of rifle disassembly/reassembly. You should be able to do this stuff easily, though I’m not sure we’re ever without the Internet in some form these days.

  • Travis Foschini

    How did the Exchange / DC recovery go? Did you publish the instructions for any or all?
