When “life and death” NHS IT goes down

By Tony Collins

Almost unnoticed outside the NHS, an email was circulated by health officials last weekend about a national “severity 1” incident involving the Electronic Prescription Service, running on BT’s data Spine.

“The EPS [electronic prescriptions service] database is currently experiencing severe degradation of performance. … BT engineers [are] currently investigating with the database application support team,” said the email.

A severity 1 or 2 incident, which involves a temporary loss of, or disruption to, the Spine or another national NHS system, is not unusual, according to a succession of emails forwarded to Campaign4Change.

The Department of Health defines a severity 1 incident as a failure that has the potential to:

— have a significant adverse impact on the provision of the service to a large number of users; or

— have a significant adverse impact on the delivery of patient care to a large number of patients; or

— cause significant financial loss and/or disruption to NHS Connecting for Health [now the Health and Social Care Information Centre], or the NHS; or

— result in any material loss or corruption of health data, or in the provision of incorrect data to an end user.

The Health and Social Care Information Centre, which manages BT’s Spine and other former NPfIT contracts, reports that Spine availability is 99.9% or 100%. But the HSCIC’s emails tell a story of service outage or disruption that is almost routine.

If the Spine and other national services are really available 99.9% of the time, is that good enough for the NHS, especially when ministers and officials increasingly expect clinicians and nurses to depend on electronic patient records and electronic prescriptions? In short, are national NHS IT systems up to the job?
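To make the 99.9% figure concrete, it can be converted into the downtime it would permit. The sketch below is illustrative arithmetic only: it assumes availability is measured around the clock over a 365-day year, which the HSCIC's reporting does not confirm, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year of round-the-clock service


def allowed_downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime that an availability figure permits over a period."""
    return period_hours * (1 - availability_percent / 100)


if __name__ == "__main__":
    for percent in (99.9, 100.0):
        hours = allowed_downtime_hours(percent)
        print(f"{percent}% availability permits {hours:.2f} hours of downtime a year")
```

On these assumptions, 99.9% availability leaves room for 8.76 hours of downtime a year, or roughly 44 minutes a month.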

NHS staff access the Spine tens of millions of times every month, often to trace patients before accessing their electronic records. The Spine is pivotal to the use of patient records held on Rio and Cerner Millennium systems in London. It is critical to the operation of Choose and Book, the Summary Care Record, Electronic Prescription Service pharmacy systems, GP2GP, iPM/Lorenzo, and the Personal Demographics Service.

According to a Department of Health letter sent to the Public Accounts Committee, payments to BT for the Spine totalled £1.08bn by March 2013.

BT says on its website that its 10-year NHS Spine contract involves developing systems and software to support more than 899,000 registered NHS users. The HSCIC says the Spine is used and supported 24 hours a day, 365 days a year.

“There is a huge amount of industrial-strength robustness, availability, disaster recovery, that you cannot get someplace else,” said a BT executive when he appeared before MPs in May 2011.

Life and death  

Sir David Nicholson spoke of the importance of the Spine and other national NHS systems at a hearing of the Public Accounts Committee in 2011. He said they were

“providing services that literally mean life and death to patients today … So the Spine, and all those things, provides really, really important services for our patients…”

When Croydon Health Services NHS Trust went live with a Cerner Millennium patient records system at the end of September, “significant network downtime” on BT’s N3 network affected patients.

A trust board paper, dated 25 November 2013, says:

“CRS Millennium (Cerner) Deployment -Network downtime – Week 1.  In particular, the significant network downtime in week 1 (BT N3 problem) led to no electronic access to Pathology and Radiology which resulted in longer waits for patients in the Emergency Department (ED) leading to a large number of breaches. This was a BT N3 problem which has been rectified with BT …”

Below are some of the emails passed to Campaign4Change in the past four months. Written by the Health and Social Care Information Centre, the emails alert NHS users to outages or disruption to GP or national NHS IT systems.

Some HSCIC messages of disruption to service

October 2013

Severity 2
HSCIC
National
CQRS has not received a number of participation status messages.
Also affecting: GPES
USER IMPACT:
CQRS Users are not able to manually submit specific information, this will impact the users’ business process for entry of achievement data.
ACTION BEING TAKEN:
Following a configuration change by the GPES Business Unit a specific code has now been added to the GPET-Q Database. We are currently awaiting confirmation that the addition of the relevant code has been successful. Discussions are taking place regarding the re-submission of status messages. HSCIC conference calls are on-going.

[A severity 2 service failure is a failure [that] has the potential to:

– have a significant adverse impact on the provision of the service to a small or moderate number of service users; or

– have a moderate adverse impact on the delivery of patient care to a significant number of service users; or

– have a significant adverse impact on the delivery of patient care to a small or moderate number of patients; or

– have a moderate adverse impact on the delivery of patient care to a high number of patients; or

– cause a financial loss and/or disruption … which is more than trivial but less severe than the significant financial loss described in the definition of a Severity 1 service failure.]

**

Severity 2
HSCIC
BT Spine
National
Intermittent performance issues on TSPINE.
T-Spine
RESOLUTION:
BT Spine have confirmed that the incident has been resolved and users are able to perform routine business processes without delays.

November 2013

Severity 1
BT Spine
HSCIC
National
Users are unable to log into PDS.
USER IMPACT:
All sites are currently unable to access PDS, this is causing a delay to normal services.
ACTION BEING TAKEN:
BT Spine are working to restore service.

**

Severity 2
BT Spine
HSCIC
National EPS users.
Slow performance on reliable and unreliable messages for EPS.
USER IMPACT:
This is causing delays to routine business processes as some users may be experiencing slow performance with the EPS service.
BT investigating.

**

Severity 2
BT Spine
HSCIC
National
Slow performance on EPS Messaging.
USER IMPACT:
This is causing delays to routine business processes as some users may be experiencing slow performance with the EPS service.
ACTION BEING TAKEN:
BT moved the database to an alternate node following application server restarts. This temporarily restored normal message response times however performance has started to degrade again. BT Investigation continues.

**
Severity 1
Atos
HSCIC
National
Multiple users were unable to log in to the Choose & Book application.
ATOS made some network configuration changes overnight 19th/20th November which restored service. After a period of monitoring throughout the day yesterday the service has remained stable and at expected levels. Further activities and investigation will be carried out by several resolver teams which will be scheduled through change management.

**
Severity 2
BT Spine
HSCIC
National
Slow performance on EPS Messaging.
No further issues of slow response times with EPS messaging have occurred today. BT Spine to continue root cause investigation.

**

Severity 2
Cegedim RX
HSCIC
National
Cegedim RX – Users are experiencing slow performance in EPS 1 and EPS 2.
USER IMPACT:
Users are experiencing slow performance and delays to routine business processes when using EPS 1 and EPS 2.
ACTION BEING TAKEN:
Following a restart of application services, traffic has improved for all new EPS messages. However there is a backlog of EPS messages which may cause delays to routine business processes. Cegedim RX to continue to investigate.

**

December 2013

Severity 1
BT Spine
HSCIC
National
Performance issues have been detected with the transaction messaging system (TMS).
Also affecting: Choose and Book, GP2GP
USER IMPACT:
This may cause delays to routine business processes. This may have an effect on all Spine related systems. This includes PDS, Choose and Book, PSIS, SCR, ACF Services.
ACTION BEING TAKEN:
This has been resolved but BT are currently monitoring performance. Further investigation is required by BT into the root cause.

**

Severity 2
GDIT – CQRS
HSCIC
National
DTS has not processed a CQRS payment file.
CQRS
Also affecting: GPES
USER IMPACT:
This is causing delays to routine business processes.
ACTION BEING TAKEN:
GDIT are currently developing a fix which will be rolled out tomorrow evening, pending successful testing.

January 2014

Severity 1
BT Spine
HSCIC
National
TMS reliable messaging unavailable.
USER IMPACT:
TMS reliable messaging unavailable and users having to implement manual workarounds.
ACTION BEING TAKEN:
Issues experienced due to a planned change overrunning, BT Spine continue to implement the transition activity in order to restore service.

**

Severity 2
BT Spine
HSCIC
National
Users have experienced intermittent issues with the creation and cancellation of smartcards in CMS [Card Management Service for managing smartcards].
CMS
USER IMPACT:
This is intermittently causing delays to routine business processes as some users have been unable to create, cancel, cut or print cards in CMS.
ACTION BEING TAKEN:
Users may experience issues with the creation and cancellation of cards in CMS. BT have identified a fix for the issue which is currently undergoing testing prior to deployment into the live environment.

**

Severity 2
BT Spine
HSCIC
National
The maternity browser was unavailable within NN4B.
RESOLUTION:
BT identified a problematic server which was recycled to restore system functionality.

**

Spine scheduled outage for essential maintenance activity.

During critical work to migrate to a new storage solution on Spine an issue was experienced on the Transaction Messaging Service (TMS) in September of this year. The issue resulted in BT failing over the TMS database from its usual site on Live B to Live A to restore service. The failover was completed well within the Service Level Agreement and no detrimental long term impacts to the service were incurred.

On the 15th January 2014 between approximately 22:00-23:30, HSCIC, in conjunction with BT, are planning to relocate the TMS database back to Live B, this is for several critical reasons:

  1. The issues experienced, which prompted the failover, are fully resolved and will not be experienced again as the storage migration work is now complete.
  2. The Spine service is designed to operate with all databases running on Live B so this work supports the optimum configuration for the service.
  3. Most critically the transition for all data on Spine to Spine2 has been designed to operate from a standby site with no live databases on it. Therefore to support the Spine2 transition this work is absolutely essential.

In order to facilitate a safe relocation of the database a 1.5 hour outage is required to TMS. The impact of this to Spine is significant and results in effectively an outage for Spine and its interfaces to connecting systems for that period. The time and date is aimed at the lowest times of utilisation for Spine, to minimise impact to end users, as well as not impacting critical batch processing and Choose & Book slot polls.
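For context, the 1.5-hour outage can be set against the downtime that a 99.9% availability figure would permit. This is a rough sketch, not the HSCIC's method: it assumes planned outages count against availability (the notice does not say, and service level agreements may exclude them), a 365-day measurement year and a 31-day January. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumed measurement period: a non-leap year
JANUARY_HOURS = 31 * 24    # 744 hours in the month of the planned outage


def share_of_annual_allowance(outage_hours, availability_percent=99.9):
    """Fraction of a year's permitted downtime that a single outage uses up."""
    allowance = HOURS_PER_YEAR * (1 - availability_percent / 100)
    return outage_hours / allowance


def availability_for_period(outage_hours, period_hours):
    """Availability, as a percentage, if the outage were the only downtime in the period."""
    return 100 * (1 - outage_hours / period_hours)


if __name__ == "__main__":
    print(f"Share of annual allowance: {share_of_annual_allowance(1.5):.1%}")
    print(f"January availability: {availability_for_period(1.5, JANUARY_HOURS):.2f}%")
```

If it were counted, this one outage would use roughly 17% of the 8.76 hours a year that 99.9% availability allows, and would put availability for January at about 99.8% even with no other incident.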

 

Date & Time

Change Start: 15/01/2014 22:00
Change Finish: 15/01/2014 23:30
Services Affected: Transaction Messaging Service (TMS)
Outage Duration: 1.5 hours
Service: Impact Description

Choose and Book: The Choose and Book service will be available but functionality will be limited until the TMS database has switched over. Users of the web application will experience limited retrievals during the outage window. The system will not be able to create shared-secret for patients who have not been referred via Choose and Book before. Service Providers will be unable to:

  • Perform clinic re-structures and re-arrange appointments for patients for directly bookable services
  • Send DNA messages to Choose and Book.

For directly bookable services the following functionality will be unavailable:

  • Booking appointments
  • Rearranging appointments
  • Creating new patient accounts

Choose & Book systems will need to queue the messages and resend to Spine once the TMS service is enabled.

Due to the timing of the outage slot polls will not be affected.

Summary Care Record application (SCRa): The SCRa application will be available but functionality will be limited until the TMS database has switched over. Simple traces can be completed on PDS data but users will be unable to perform any PSIS updates (e.g. GP summary updates).

DSA: The DSA application will be available but functionality will be limited until the TMS database has switched over. Simple traces can be completed on PDS data but users will be unable to perform any PSIS updates (e.g. GP summary updates).

Electronic Prescription Service (EPS) Pharmacy Systems: Reliable messaging will be unavailable for the duration of the switchover work as the TMS service will be suspended dual site. All messages received from EPS systems will be rejected and not go into retry. EPS systems will need to queue the messages and resend to Spine once the TMS service is enabled.

EPS Batch: The PPA response for any “claim” messages will not be sent to PPA/PPD. However, EPS will send those response(s) again when the retry jobs are re-activated after the switchover exercise is over. Response for any “claim” messages will not be received until after the switchover. Retry jobs will resend the responses once the TMS service is enabled.

Existing Service Providers (ESPs): There will be varying impacts depending on the product, release version and Spine compliant modules of the solution. ESP systems will need to queue the messages and resend to Spine once the TMS service is enabled.

GP2GP: GP2GP will be unavailable until the TMS database has switched over. GP2GP systems will need to queue the messages and resend to Spine once the TMS service is enabled.

GP Extraction Service (GPES): GPES functionality will be unavailable until the TMS database has switched over. Messages will be queued on Spine and processed once the TMS service is restored.

GP Systems: Functionality for Choose & Book, EPS and GP2GP, SCR will be limited until the TMS database has switched over. For Choose & Book directly bookable services the following functionality will be unavailable:

  • Booking appointments
  • Rearranging appointments
  • Creating new patient accounts

Systems will need to queue the messages and resend to Spine once the TMS service is enabled.

iPM/Lorenzo: The real-time connection to Spine will be unavailable during the TMS outage. However both systems can be disconnected from Spine and operate without synchronised PDS data. iPM/Lorenzo will need to queue the messages and resend to Spine once the TMS service is enabled.

Millennium: An outage of PDS reliable messaging will impact Millennium users. Users will be unable to:

  • trace patients
  • register new patients on PDS
  • book or reschedule appointments

Millennium will need to queue the messages and resend to Spine once the TMS service is enabled.

NN4B: Trusts will need to be aware that during the outage NHS numbers cannot be generated, new-births cannot be registered and blood-spot labels cannot be generated and should plan accordingly. All birth notifications will be queued and processed once the TMS service is enabled.

Personal Demographics Service (PDS): Simple traces can be completed on PDS data. PDS reliable messaging will be unavailable until the TMS database has switched over.

RiO: Users will be unable to:

  • trace patients
  • register new patients
  • book or reschedule appointments

The RiO system will need to queue the messages and resend to Spine once the TMS service is enabled.

TMS Event Service (TES): The majority of TES functionality will be unavailable during the outage. Trusts will need to be aware EPS, Death notifications, and Patient Care Provision Notifications (change of pharmacy) will be queued and sent to the receiving systems once the TMS service is restored. Any impacted notifications will be queued and sent to the receiving systems once TMS is restored.

TMS Batch (DBS, CHRIS, ONS): DBS will be unavailable until the TMS database has switched over (DBS processing will be suspended for the duration of the exercise). As the TMS switchover will be scheduled to start at 22:00, CHRIS batch should complete before the outage starts (CHRIS batch runs at 20:00 nightly). ONS processing will start at 18:00 nightly. If it doesn’t complete before 22:00, the messages will be queued and processed once the TMS service is restored.
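The notice repeatedly asks connected systems to "queue the messages and resend to Spine once the TMS service is enabled", and says rejected EPS messages will not go into retry, so the sending system has to hold them. The notice does not say how suppliers implement this. The following is only a minimal sketch of a store-and-forward queue under that assumption: `StoreAndForwardQueue`, `FakeSpine` and the `send` callback are hypothetical names for illustration, not part of any Spine interface.

```python
from collections import deque


class StoreAndForwardQueue:
    """Holds outbound messages while the remote service is down, then resends them in order."""

    def __init__(self, send):
        # `send` is a hypothetical transport callback: True if accepted, False if rejected.
        self._send = send
        self._pending = deque()

    def submit(self, message):
        """Send immediately if possible; otherwise keep the message for later."""
        # If anything is already waiting, queue behind it so ordering is preserved.
        if self._pending or not self._send(message):
            self._pending.append(message)
            return False
        return True

    def flush(self):
        """Resend queued messages oldest first; stop at the first rejection."""
        delivered = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break  # service still unavailable; keep the rest queued
            self._pending.popleft()
            delivered += 1
        return delivered

    @property
    def pending(self):
        return list(self._pending)


class FakeSpine:
    """Stand-in for the remote service, for demonstration only."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, message):
        if not self.up:
            return False  # mimics a rejected message during the outage
        self.received.append(message)
        return True


if __name__ == "__main__":
    spine = FakeSpine()
    queue = StoreAndForwardQueue(spine.send)
    spine.up = False                     # outage window begins
    queue.submit("prescription-1")
    queue.submit("prescription-2")
    spine.up = True                      # service restored
    print("resent:", queue.flush(), "received:", spine.received)
```

A first-in, first-out queue is used here so that messages reach the service in the order they were created. A real system would also need to persist the queue so that messages survive a restart, which this sketch omits.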

**

Severity 2
BT Spine
HSCIC
National
Users are unable to grant worklist items within UIM.
USER IMPACT:
This is causing delays to routine business processes as users are unable to complete their worklist items within the UIM application.
ACTION BEING TAKEN:
BT investigating.

**

Severity 1
BT Spine
HSCIC
National
The EPS database is currently experiencing severe degradation of performance.
USER IMPACT:
Delays to routine business processes.
ACTION BEING TAKEN:
BT engineers currently investigating with the database application support team.

Comment

David Nicholson is right. The NHS has become dependent on systems such as the Spine. But can doctors ever trust any aspect of the safety of patients to systems that are not available 24×7 as they need to be in a national health service?

It appears that BT and other suppliers have not been in breach of service level agreements, and the HSCIC has a good relationship with the companies. But does the HSCIC have too great an interest in not finding fault with its suppliers or the contracts, given that finding fault could draw attention to any defects in a service for which the HSCIC is responsible?

Do national NHS IT suppliers have a strong enough commercial or reputational interest in avoiding a disruption or loss of service, so long as they keep within their service level agreements?

If nobody sees anything wrong with the reliability of existing national NHS IT services, improvements are unlikely. Diane Vaughan’s book on the culture and organisation of NASA shows that experts in a big organisation can do everything right according to the rules and procedures, and still have a disastrous outcome.
