Sunday, May 22, 2011

Recoveries are no fun for IT staff; lessons from my own career, and for Blogger

During the second week of May 2011, Blogger encountered problems that required a rollback of some new posts, apparently a database recovery, and restoration of the posts and comments.  I wrote about this on my “BillBoushka” blog on May 14. I’ll add that since then, on a few occasions, “draft” copies of restored blog posts have reappeared on my dashboard.

The bigger technical issue is how companies manage “disaster recovery” of databases once they are corrupted. (I presume Blogger has some sort of SQL-based relational database underneath; WordPress uses MySQL.)
I had a few brushes with these situations in my mainframe career.

Back in the 1980s, we had to rerun three days of daily billing at Chilton Credit Reporting after we found a member master had been incorrectly restored. Fortunately, each run took only about an hour, even on 1980s Amdahl technology.

Most mainframe shops have “incrementals” and “full backups” for everything (including reporting packages like SAR and Dispatch).  One place I worked in the early 1990s used to do full backups and “compactions” every Saturday night, taking everything down at 4 PM Saturday.

The ultimate nightmare for a senior programmer-analyst who “owns” a major application is to find corruption and the need for a major recovery.

Recoveries usually consist of going back to the last “full” and then applying “incrementals” successively.  If there was actual application corruption, cycles would have to be rerun instead. In the mid-1990s we almost had another catastrophe involving IDMS VSAM-transparency and the backup GCVEXPORT, etc., but a techie figured out the problem and bailed us out.  Recoveries are not fun for support staff, anywhere.
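
For readers who have not lived through one, here is a minimal sketch in Python of the "last full plus incrementals" idea. It is only an illustration, not how any mainframe utility or Blogger actually works. The `Backup` class, the `restore` function, the integer `taken_at` sequence numbers, and the convention that an incremental stores `None` for a deleted record are all my own assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Backup:
    """Hypothetical backup record; the field names are illustrative only."""
    taken_at: int   # assumed ordering key, e.g. a sequence number or day
    kind: str       # "full" or "incremental"
    # Full backup: every record. Incremental: only changed records,
    # with None marking a record deleted since the previous backup.
    records: Dict[str, Optional[str]] = field(default_factory=dict)


def restore(backups: List[Backup], as_of: int) -> Dict[str, Optional[str]]:
    """Rebuild the data as of `as_of` from the latest full plus incrementals."""
    # Only backups taken at or before the recovery point are usable.
    usable = sorted(
        (b for b in backups if b.taken_at <= as_of),
        key=lambda b: b.taken_at,
    )
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup at or before the requested point")

    # Step 1: go back to the last "full".
    base = fulls[-1]
    state = dict(base.records)

    # Step 2: apply each later incremental in order, oldest first.
    for b in usable:
        if b.kind == "incremental" and b.taken_at > base.taken_at:
            for key, value in b.records.items():
                if value is None:
                    state.pop(key, None)   # record was deleted
                else:
                    state[key] = value     # record was added or changed
    return state


if __name__ == "__main__":
    history = [
        Backup(1, "full", {"acct-1": "open", "acct-2": "open"}),
        Backup(2, "incremental", {"acct-1": "closed"}),
        Backup(3, "incremental", {"acct-2": None, "acct-3": "open"}),
    ]
    print(restore(history, as_of=3))
```

The order matters: if an incremental is skipped or applied out of sequence, the restored data silently differs from what was really there, which is one reason a bad restore can go unnoticed until a later run.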

In 1999, I participated in a twenty-four-hour disaster recovery dress rehearsal offsite near Minneapolis.

Update: June 6

Blogger has a detailed explanation of its incident on Blogger Status, dated May 31, 2011, link.  It's titled "Blogger Incident Report", authored by Eddie Kessler, Tech Lead/Manager, Blogger.  I urge people who work in large IT production environments, even financial institutions, to read it (I don't know if you need a Blogger account to see it), as it gives perspective on what can go wrong in companies with many servers and how harrowing recoveries can be. It's educational!
