Some lessons from a major outage

By Tony Collins

One of the main reasons for remote hosting is that you don’t have to worry about security and up-time is guaranteed. Support is 24x7x365. State-of-the-art data centres offer predictable, affordable, monthly charges.

In the UK more hospitals are opting for remote hosting of business-critical systems. Royal Berkshire NHS Foundation Trust and Homerton University Hospital NHS Foundation Trust are among those taking remote hosting from Cerner, their main systems supplier.

More trusts are expected to do the same, for good reasons: remote hosting from Cerner will give Royal Berkshire a single point of contact to deal with on technical problems without the risks and delay of ascertaining whether the cause is hardware, third party software or application related.

But what happens when the network goes down – across the country and possibly internationally?

With remote hosting of business-critical systems becoming more widespread it’s worth looking at some of the implications of a major outage.

A failure at CSC’s Maidstone data centre in 2006 was compounded by problems with its recovery data centre in Royal Tunbridge Wells. Knock-on effects extended to information services in the North and West Midlands. The outage affected 80 trusts that were moving to CSC’s implementation of Lorenzo under the NPfIT.

An investigation later found that CSC had been over-optimistic when it informed NHS Connecting for Health that the situation was under control. Chris Johnson, a professor of computing science at Glasgow University, has written an excellent case study on what happened and how the failure knocked out primary and secondary levels of protection. What occurred was a sequence of events nobody had predicted.

Cerner outage

Last week Cerner had a major outage across the US. Its international customers might also have been affected.

InformationWeek Healthcare reported that Cerner’s remote hosting service went down for about six hours on Monday, 23 July. It hit “hospital and physician practice clients all over the country”. InformationWeek said the unusual outage “reportedly took down the vendor’s entire network” and raised “new questions about the reliability of cloud-based hosting services”.

A Cerner spokesperson, Kelli Christman, told InformationWeek:

“Cerner’s remote-hosted clients experienced unscheduled downtime this week. Our clients all have downtime procedures in place to ensure patient safety. The issue has been resolved and clients are back up and running. A human error caused the outage. As a result, we are reviewing our training protocol and documented work instructions for any improvements that can be made.”

Christman did not respond to a question about how many Cerner clients were affected. HIStalk, a popular health IT blog, reported that hospital staff resorted to paper but it is unclear whether they would have had access to the most recent information on patients.

One Tweet by @UhVeeNesh said “Thank you Cerner for being down all day. Just how I like to start my week…with the computer system crashing for all of NorCal.”

Another by @wsnewcomb said “We have not charted any pts [patients] today. Not acceptable from a health care leader.”

Cerner Corp tweeted “Our apologies for the inconvenience today. The downtime should be resolved at this point.”

One HIStalk reader praised Cerner communications. Another didn’t:

“Communication was an issue during the downtime as Cerner’s support sites was down as well. Cerner unable to give an ETA on when systems would be back up. Some sites were given word of possible times, but other sites were left in the dark with no direction. Some sites only knew they were back up when staff started logging back into systems.

“Issue appears to have something to do with DNS entries being deleted across RHO network and possible Active Directory corruption. Outage was across all North America clients as well as some international clients.”

Colleen Becket, chairman and co-CEO of Vurbia Technologies, a cloud computing consultancy, told InformationWeek Healthcare that NCH Healthcare System, which includes two Tampa hospitals, had no access to its Cerner system for six hours. The outage affected the facilities and NCH’s ambulatory-care sites.

Lessons?

A HIStalk reader said Cerner has two electronic back-up options for remote hosted clients. Read-only access would have required the user to be able to log into Cerner’s systems, which wouldn’t have been possible with the DNS servers out of action last week.

Another downtime service downloads patient information to local computers, perhaps at least one on each floor, at regularly scheduled intervals, updated say every five minutes. “That way, even if all connection with [Cerner’s data centre] is lost, staff have information (including meds, labs and more) locally on each floor which is accurate up to the time of the last update”.
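The local downtime copy described above can be sketched in a few lines. This is only an illustration of the general idea, not Cerner’s implementation: the class name `DowntimeCache`, the `fetch` callable and the shape of the data are all assumptions made for the example. The point it shows is that a failed refresh must leave the last good copy in place, and that staff need to know how old that copy is.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Snapshot:
    data: dict          # patient summaries keyed by a hypothetical patient ID
    fetched_at: float   # time of the last successful refresh (epoch seconds)


class DowntimeCache:
    """Hypothetical local cache holding the last good copy of remote data."""

    def __init__(self, fetch: Callable[[], dict],
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch      # assumed: a function that pulls data from the host
        self._clock = clock      # injectable so the behaviour can be tested
        self._snapshot: Optional[Snapshot] = None

    def refresh(self) -> bool:
        """Try to pull a fresh copy; keep the old one if the host is unreachable."""
        try:
            data = self._fetch()
        except OSError:          # covers connection and name-resolution failures
            return False
        self._snapshot = Snapshot(data=data, fetched_at=self._clock())
        return True

    def read(self) -> Optional[Snapshot]:
        """Return the last good copy, or None if no refresh has ever succeeded."""
        return self._snapshot

    def age_seconds(self) -> Optional[float]:
        """How stale the local copy is, so staff can judge how far to trust it."""
        if self._snapshot is None:
            return None
        return self._clock() - self._snapshot.fetched_at
```

In use, a scheduler on each floor’s machine would call `refresh()` at the chosen interval (every five minutes in the reader’s example) and the screen would show `age_seconds()` next to the data, so nobody mistakes an old copy for a live record.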

Finally, says the HIStalk commentator, “since this outage was due to a DNS problem, anyone logged into the system at the time it went down was able to stay logged in. This allowed many floors to continue to access the production system even while most of the terminals couldn’t connect.”
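The reason logged-in users stayed connected is that a name is normally looked up only when a connection is opened; once a session exists it no longer needs DNS. The sketch below illustrates one way a client could lean on the same fact, by remembering the last address that resolved successfully. It is an assumption-laden example rather than anything the article’s sources describe: `LastKnownGoodResolver` is a hypothetical name, and reusing a stale address carries its own risks if the server has genuinely moved.

```python
import socket
from typing import Callable, Dict


def default_resolver(host: str) -> str:
    """Look up a host name using the system's DNS configuration."""
    return socket.gethostbyname(host)


class LastKnownGoodResolver:
    """Hypothetical wrapper that falls back to the last address that resolved."""

    def __init__(self, resolve: Callable[[str], str] = default_resolver):
        self._resolve = resolve
        self._last_good: Dict[str, str] = {}

    def lookup(self, host: str) -> str:
        try:
            addr = self._resolve(host)
        except socket.gaierror:
            # DNS is unavailable: reuse the remembered address if there is one.
            if host in self._last_good:
                return self._last_good[host]
            raise
        self._last_good[host] = addr
        return addr
```

A terminal that had connected at least once before the outage would still get an address back; one that had never connected would fail exactly as before, which matches the split the HIStalk commentator describes between floors that stayed up and terminals that could not connect.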

But could the NHS afford a remote hosted service, and a host of on-site back-up systems?

Common factors in health IT implementation failures

In its discussion on the Cerner outage, HIStalk put its finger on the common causes of hospital IT implementation failures. It says the main problems are usually:

– a lack of customer technical and implementation resources;
– poorly developed, self-deceiving project budgets that don’t support enough headcount, training, and hardware to get the job done right;
– letting IT run the project without getting users properly involved;
– unreasonable and inflexible timelines as everybody wants to see something light up quickly after spending millions; and
– expecting that just implementing new software means clearing away all the bad decisions (and indecisions) of the past and forcing a fresh corporate agenda on users and physicians, with the supplier being the convenient whipping boy for any complaints about ambitious and sometimes oppressive changes that the culture just can’t support.

Cerner hosting outage raises concerns

HIStalk on Cerner outage

Case study on CSC data centre crash in 2006


2 responses to “Some lessons from a major outage”

  1. Pingback: Massive Health IT Outage: But, Of Course, Patient Safety Was Not Compromised » H2EONLINE.ORG

  2. BLAH BLAH BLAH

    Sorry to appear rude but the reality is plain and simple. Remote hosting means that the single point of failure is the network and all the research and learned comments cannot get around the fact that if the network goes down anywhere the Trust has no paddle for the creek it is in.

    Now we can wax lyrical about triangulated networks at Trusts (how many actually have them?) along with multiple lines coming into the datacentre but that is still a single source of failure. The triangulation has to apply to the datacentre as well i.e. at least two datacentres connected to each other and each connected to the trust via separate dedicated links…..hmmm this seems to be getting quite expensive. To this we need to add the unpredictable JCB digging up the road for whatever reason and the house of cards comes crashing down. Hosting is less of a problem when you’re selling books or DVDs. When however your enterprise information system is so supplied one needs to do a proper risk analysis on what happens if we lose it for 1 min, 1 hour, 1 day or longer. Depending on where a Trust’s pain point is, this will determine if hosting is right for them or not. There is no other consideration to be made. Suppliers’ claims about resilience and stability at this level are irrelevant. It’s all about risk to Trust, risk to staff and let’s not forget risk to patient.
