Monday, September 17, 2007

Remembering Y2K: When have you tested everything? What data do you keep? What data covers everything?

Remember Y2K? Back in 1999, we had a big workplace debate over what, from a Philosophy 201 perspective, constituted satisfactory evidence that all of our systems (in a life and annuity company) would perform properly on and after Saturday, Jan. 1, 2000 (and, for that matter, Monday, Jan. 1, 2001), once the year had been expanded from two digits to four. (It’s a bit more involved than that with some systems, but that was the idea.) There were several questions. What jobs should be run? Which cycles should be run? (End of month? End of year?) What printouts or files should be saved? (File-to-file compares? Reports? Test data?) What production data should be extracted? How would it be collected and stored? In the fall of 1999, we did wind up boxing a lot of JCL, reports and screen prints and shipping them to an official warehouse. Y2K came and went without a hitch.
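
For readers who never lived through it, the two-digit year problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample policy issued in 1985 are hypothetical and are not taken from any real system.

```python
from datetime import date


def years_elapsed_2digit(start_yy: int, end_yy: int) -> int:
    """Elapsed years when both years are stored as two digits (hypothetical helper)."""
    return end_yy - start_yy


def years_elapsed_4digit(start_yyyy: int, end_yyyy: int) -> int:
    """Elapsed years after the year field is expanded to four digits (hypothetical helper)."""
    return end_yyyy - start_yyyy


# Hypothetical policy issued in 1985, evaluated on Jan. 1, 2000.
broken = years_elapsed_2digit(85, 0)       # two-digit arithmetic goes negative
fixed = years_elapsed_4digit(1985, 2000)   # four-digit arithmetic is correct

# date.weekday() returns 0 for Monday through 6 for Sunday.
rollover_2000 = date(2000, 1, 1).weekday()  # 5, a Saturday
rollover_2001 = date(2001, 1, 1).weekday()  # 0, a Monday

print(broken, fixed, rollover_2000, rollover_2001)
```

A policy duration of minus 85 years is the kind of result the test cycles were meant to catch before production did.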

We had a similar exercise early in the year with a disaster recovery fire drill (at a company called Comdisco) that I remember well. What data and files do you collect, and what do you run on the backup site to prove that it all got copied?

Back in the early 1990s we had philosophical discussions of this sort. One had to make sure that all possible situations were covered by test cases, by extracted production data, or by selected production cycles (using production data is a bigger issue now than it was then, because of privacy considerations). Before any elevation, there would be the exercise of running parallel cycles, doing file-to-file compares, and saving evidence in the form of printouts, screen prints, and sometimes just files on disk or offloaded to diskettes (maybe copied to the “LAN,” which then was a real innovation). Because of “personal responsibility,” I kept quite a library of test runs in big black three-ring binders, low-tech. This way it was possible to prove that the system was tested properly if something ever went wrong. That may sound like a lack of confidence.
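
The file-to-file compare mentioned above is simple in principle. Here is a minimal Python sketch, assuming each output file has already been read into a list of records. The function name `compare_records` and the sample policy records are hypothetical; mainframe shops used their own compare utilities, not this code.

```python
from itertools import zip_longest


def compare_records(baseline, candidate):
    """Return (record_number, baseline_record, candidate_record) for each mismatch.

    A record missing from the shorter file is reported as None.
    """
    mismatches = []
    pairs = zip_longest(baseline, candidate)
    for number, (old, new) in enumerate(pairs, start=1):
        if old != new:
            mismatches.append((number, old, new))
    return mismatches


# Hypothetical output of the production cycle and of the parallel test cycle.
production_output = ["POL0001 1500.00", "POL0002  275.50", "POL0003   98.10"]
parallel_output = ["POL0001 1500.00", "POL0002  275.55", "POL0003   98.10"]

differences = compare_records(production_output, parallel_output)
for number, old, new in differences:
    print(f"record {number}: expected {old!r}, got {new!r}")
```

An empty list of differences is the evidence worth saving. A non-empty list tells you exactly which records to desk check.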

There was also the issue, emerging then, that source management software (then CA-Librarian, today usually CA-Endevor) had to be used properly to guarantee source-load module integrity.

In fact, as far back as early 1989, I essentially “saved” a small health care consulting business (small then, big now) by keeping a huge paper library of test runs. I spent three weeks desk checking numbers in a windowless office with no PC terminal. When a major client questioned our numbers, I was able to prove we had run everything properly. Re-examination of Federal Register specs and of COBOL code from a federal program showed a discrepancy within the government’s own work. When I replicated the federal code in our system, we quickly got the results that the client had expected after running the model and simulations.

There is a lesson in all of this. Remember that undergraduate Philosophy 101 course where the professor asks, “How do you know what you believe?” or something like that? Remember those essay questions on epistemology? (I got a B on that.) Systems testing and quality assurance are all about that, when a system must run in production and process millions of client transactions daily, perfectly. It’s volume, buddy, along with absolute perfection. That’s what mainframe culture was all about.

It seems that one can blow this kind of question up to cover the major issues of today. How do we know that we have collected all of the relevant data or cases, and that what we have collected is right?
