Healthcare.gov: Armchair Analysis

An attempt to draw out the various parts of HealthCare.gov's tech system, based on the testimony of its contractors. -- Elise Hu, NPR
Healthcare.gov may need to be redesigned to just collect and store the application data, then say, "Thank you very much, you'll get an email within 48 hours with a link to your options."

The data would go into a queue database that separate back-end software would process sequentially, producing a results dataset with whatever is needed to present a list of plans and prices for that applicant. The applicant logs back in a couple of days later to start comparing plans.


RATIONALE
I used the Chromium browser's JavaScript Console to look at some of the source files for the site front-end when I signed up. Can't say I've ever seen a site with so much code.

From that, and the tech articles I've read, it looks like the top-level designers had never worked on "big data" systems where you've got to dramatically simplify the operations in order to get good performance. Sounds a lot like the most vexing issues we faced in a large state agency Y2K redesign I was a DBA on, except on a massively larger scale.

Most websites today can do complex, interactive operations the way client-server systems do, such as querying a database while the user waits a second or two. (I use "query" here to stand for all database accesses of any kind, though most are queries.) The front-end code typically assumes timely completion of database accesses, so no special coding for failed queries gets written.

On our project, though, we found that with 3,000 simultaneous connections, even the bleeding-edge hardware and software we were using couldn't keep up. During peak loads (mid-morning and mid-afternoon), the databases were so taxed that queries would time out before completion. Writing front-end code "exceptions" that can gracefully handle such situations is very complex, and not many programmers have ever had to do it. It was new to me at the time, and I'd been programming for 10 years.

The most obvious response to a timeout is to try again. Turns out that it's also the absolute worst thing to do!

Timeouts in such a system are almost always due to overload, with failure of network connection being the second most common cause. In both situations, once one query fails, all other queries start failing too. If every query is resubmitted, only to immediately fail again, you have a "flood" of exceptions that can drive back-end system CPU usage to 100% in seconds. The same kind of thing is called a DDoS (Distributed Denial of Service) attack when it's network traffic flooding a website. Different source, but the same kind of effect on the computers. Think of a water-wheel on a babbling brook, and then the dam upstream bursts.
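For contrast with the retry-immediately reflex, here is a sketch of one common mitigation: a capped number of retries, each delayed by an exponentially growing, randomized wait. This technique is not something described above and is offered only as an illustration. The function names, the default delays, and the use of Python's built-in TimeoutError as the failure signal are all assumptions.

```python
import random
import time

def backoff_delays(max_attempts, base_delay=0.5, max_delay=30.0, rng=random.random):
    """Yield the wait before each retry: doubling, capped, then randomized.

    Yields max_attempts - 1 values, because the first attempt needs no wait.
    """
    for attempt in range(max_attempts - 1):
        ceiling = min(max_delay, base_delay * (2 ** attempt))
        # Scale by a random factor so clients do not all retry in lockstep.
        yield rng() * ceiling

def call_with_backoff(operation, max_attempts=4, sleep=time.sleep, **kwargs):
    """Run operation(); on TimeoutError, wait and try again.

    After max_attempts failed tries the last TimeoutError is re-raised,
    so the caller can report the failure instead of retrying forever.
    Extra keyword arguments are passed through to backoff_delays.
    """
    delays = backoff_delays(max_attempts, **kwargs)
    while True:
        try:
            return operation()
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                raise  # out of retries: surface the original timeout
            sleep(delay)
```

The cap on attempts matters as much as the delay. A client that gives up adds no further load, while a client that retries forever is part of the flood.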

Healthcare.gov is getting millions of simultaneous connections. Hardware performance may be several times better now than it was in 2000, but software performance is worse.

Yes, I said it, though it may be blasphemy. The sluggish responses we endure as the norm today would have been ridiculed as unacceptable just 10 years ago. As the complexity of software has advanced dramatically, the efficiency and performance have plummeted. Only Moore's law of computing hardware advancement has kept system responsiveness even tolerable.

It looks like Healthcare.gov is trying to query, not LAN-connected databases, but remote computer systems at several businesses like Experian (for financial history) and other agencies like the IRS (for income verification). That is a far slower process than in-house client-server. Even with today's lower expectations for responsiveness, there's no way for it to keep up.



The "major redesign" that has been whispered of is most likely something like what we had to do in 1999. Load testing had hinted at timeout problems with the original client-server design. That sent us tweaking our little hearts out, but it was never enough to support 100% interactive responsiveness. We had to make choices. We had to redesign the workflow to allow some operations to be non-interactive.
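One way to picture a workflow that allows non-interactive operations: answer interactively when the lookup finishes within a deadline, and otherwise hand the request to a batch queue and tell the user to come back later. The sketch below is only an illustration under stated assumptions. The names handle_request and batch_queue, the 2-second default deadline, and the in-memory queue standing in for a batch-job input table are all hypothetical, and it is not a description of what our project actually built.

```python
import queue
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical stand-in for the input table of a nightly batch job.
batch_queue = queue.Queue()

def handle_request(executor, request, lookup, deadline_seconds=2.0):
    """Try the interactive path first; defer to batch if it is too slow.

    lookup is a hypothetical function that performs the slow query.
    """
    future = executor.submit(lookup, request)
    try:
        result = future.result(timeout=deadline_seconds)
        return {"status": "done", "result": result}
    except FutureTimeout:
        # cancel() only helps if the lookup has not started yet. A lookup
        # that is already running keeps going in its worker thread, so a
        # real system would also need a way to abandon or dedupe that work.
        future.cancel()
        batch_queue.put(request)
        return {"status": "deferred"}
```

Note that the timeout path does not retry. It converts the request into queued work, which is the choice that keeps a slow back end from being hit a second time by the same user.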

The hardest part was convincing management that those "antiquated batch jobs" they had been railing against for years and had so proudly told everyone were being eliminated in the new system had to stay.

A team was formed to rewrite the data-access portions of the batch jobs to talk to the new SQL databases instead of the old mainframe ones. And one lone, black mainframe cabinet got a reprieve from the scrap heap (as well as some upgrades) and may still be running COBOL batch jobs in support of the shiny new system to this day as far as I know.
