Category Archives: Disaster recovery

Is HMRC’s RTI project really a success?

By Tony Collins

On Eddie Mair’s “PM” programme on R4, I suggested that HMRC’s real-time information project was not the failure many had expected it to be.

“Even some hawk-eyed critics of government IT projects like journalist Tony Collins think that HMRC may have something of a success on its hands,” said BBC reporter Chris Vallance who produced the RTI item.

I was quoted as saying that many had expected RTI to become another government IT disaster. “But given that there are millions of PAYE employees who are on the system at the moment, if there were any major difficulties we’d expect to have seen them by now.”

Now an HMRC expert has questioned whether my comments were justified. He says parts of RTI are in chaos. He doesn’t want to be named. He writes:

“The RTI system was intended to report on a weekly or monthly basis the same information as had previously been reported by employers on an annual basis. Although details of pay and tax would be forwarded to HMRC far more frequently the same core logic applied. Details of the statutory deductions by the employer would have to be reconciled with payments made, and details of the income and tax paid recorded against the employee’s PAYE record.

“What appears to have happened is that HMRC has designed a system that takes details of employees’ earned or pension income, and statutory PAYE deductions, and then makes various illogical assumptions.

“For instance it would appear that where an employee receives no earnings in a particular pay period, the RTI system assumes that no information is “transmitted” for this employee, indicating that the employee has “left the employment”.

“Similarly where an employer undertakes a re-order of the pay identities (codes on the payroll system called Works Numbers that identify employees), the fact that payroll information is transmitted to HMRC with a Works Number different to that used previously triggers an assumption that the employee has two employments, with the same employer.

“This has the consequence of allowing the NPS (New PAYE Computer System – costing in excess of £400 million) to assume that the employee’s estimated income for the tax year has doubled. The NPS then looks to see if the employee has any part-time or other employment, and in many cases it changes the PAYE code number of these part-time employments from Basic Rate, which deducts tax at 20%, to Code D0, which deducts tax at 40%. All because of an incorrect and invalid assumption.
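To illustrate the Works Number problem the expert describes, here is a minimal Python sketch. It is not HMRC's code: the `Submission` class, its field names, the employer reference and the pay figures are all hypothetical, chosen only to show how the choice of matching key can double an income estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    employer_ref: str         # hypothetical PAYE scheme reference
    works_number: str         # payroll identity, which an employer may re-order
    nino: str                 # National Insurance number, assumed stable per person
    annual_pay_estimate: int  # estimated pay for the tax year, in pounds


def estimated_income(submissions, key):
    """Sum one pay estimate per distinct employment, as identified by `key`."""
    employments = {}
    for s in submissions:
        # A later submission for the same key replaces the earlier one.
        employments[key(s)] = s.annual_pay_estimate
    return sum(employments.values())


# The same person, before and after the employer re-orders its Works Numbers.
subs = [
    Submission("123/AB456", "0001", "QQ123456C", 30000),
    Submission("123/AB456", "0457", "QQ123456C", 30000),
]

# Keyed on Works Number, the re-order looks like a second employment.
by_works_number = estimated_income(subs, lambda s: (s.employer_ref, s.works_number))

# Keyed on the person, it is recognised as one employment.
by_person = estimated_income(subs, lambda s: (s.employer_ref, s.nino))

print(by_works_number, by_person)
```

Keyed on Works Number the estimate comes out at twice the true figure, which is the kind of jump that could push a second job on to a higher-rate code. Matching on the person alone is not a complete answer either, since someone can genuinely hold two employments with one employer.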

“Similarly, this failure to understand how PAYE and payroll interact has led to the situation where an employee who leaves an employment that has attracted a PAYE coding deduction for Car Benefit in Kind and starts another Employment, has the PAYE Coding Deduction removed. The fact that the new employment may well involve a company car is completely ignored, with the result that the employee is more than likely to have a large underpayment of income tax at the year end, despite being on PAYE.

“This failure to understand the basic operation and logic of PAYE would appear to be due to the fact that HMRC has been influenced by those who have an understanding of data flows and cash transfers. The rush to modernise PAYE and move away from “a 1940’s system” has completely omitted the fact that basic operations for employments, tax and NI deductions and the accountability of these remain exactly the same, even if the calculations are undertaken electronically rather than with a quill pen and large ledger book.

“The old PAYE system had as part of its reporting system two components: the forms P14 detailing each individual’s pay, tax etc. and a summary of all employees’ information, the form P35. The P14 passed information to the NPS system and the P35 information was passed to the accounts computer systems. This allowed HMRC to determine the income of the employee and calculate if sufficient income tax etc had been paid. It also allowed HMRC to match the figure of tax / NIC due and payable on the P35 with the amount actually received from the employer. RTI has failed to comply with this basic logic and chaos is ensuing, to the extent that the National Audit Office recently commented:

‘The financial and accounting systems supporting RTI are not yet fully accredited. Financial accreditation is a formal requirement of HMRC’s Change Programme and provides assurance that any new systems are acceptable for accounting and financial control purposes. The RTI systems went live on the basis that action would be taken to resolve identified financial design issues by 31 October 2013.

‘These issues do not affect an employer’s ability to submit data to HMRC but do weaken HMRC’s ability to produce and report financial information on PAYE. HMRC is currently undertaking work to understand the impact of these issues and how best to address them.’

“Why the RTI system was not designed in the same logical manner is of great concern.
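The two-way check the expert attributes to the old P14 and P35 returns can be sketched in a few lines of Python. This is an illustration only: the real forms carry many more fields, and the function name, dictionary keys and amounts below are hypothetical.

```python
def reconcile(p14_records, p35_total_due, amount_received):
    """Compare per-employee deductions with the employer summary and the payment.

    All amounts are in pence. `p14_records` is a list of dicts with
    hypothetical "tax" and "nic" keys, one dict per employee.
    """
    p14_total = sum(r["tax"] + r["nic"] for r in p14_records)
    return {
        # Do the individual records add up to the employer's summary figure?
        "summary_matches_detail": p14_total == p35_total_due,
        # Has the employer actually paid what the summary says is due?
        "shortfall": p35_total_due - amount_received,
    }


result = reconcile(
    p14_records=[{"tax": 4000, "nic": 2500}, {"tax": 6000, "nic": 3100}],
    p35_total_due=15600,
    amount_received=15000,
)
print(result)
```

In this made-up case the detail agrees with the summary but the payment falls 600 pence short, so the check flags an amount still owed rather than an error in any employee's record.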

“The system failures that are occurring are not due to computer components or programs not being fit for purpose. Indeed the processing of the PAYE data streamed to HMRC as a result of the RTI system could reasonably be compared to any other large commercial organisation, albeit that the NAO has concerns over the fall-back planning HMRC has in place should there be any hardware failures and commented:

‘The resilience needed to maintain the RTI service if there is a major technical failure is not in place. Online and time-sensitive system implementations are usually developed with formal technical resilience and disaster recovery capability.

‘HMRC chose not to pay for full resilience because of the cost implications and because PAYE could be operated in an emergency without RTI. However, although RTI has the potential to be used by other government departments, the lack of full resilience may inhibit its use in areas of activity where a temporary disruption to service cannot be tolerated.

‘Data submissions can be held temporarily in a queue but this would not provide continuity of service in the event of a catastrophic failure. The RTI service failing at a critical processing time could increase the volume of customer communications and lead to more effort for employers.’

“The RTI system is a very clear example of basic failures to properly prepare a Business Analysis Requirement for a system which in essence does no more than increase the number of times payroll information is passed to HMRC. Claims for the reinvention of PAYE for the 21st Century are as invalid as the claim that the ability to write has been done away with due to email and electronic communication. There has been a flawed reliance on the thoughts and views of those who have little or no experience in PAYE or payroll.”

On the PM programme, Ruth Owen, Director General of Personal Tax at HMRC, accepted that all was not perfect. She said:

“We have had over 1.4 million PAYE schemes come into Real-Time Information [each PAYE scheme may have many employees on it] and that exceeds our expectations at this point in the year. But there’s still more to do. We have got to get everybody on and there are still people who need our help to get on.”

She added: “We have had a small number of difficult issues… We have had issues where people have got the wrong tax codes.”

Owen said the links between RTI and Universal Credit were “going well” but conceded that there have been only a tiny number of UC claimants so far.

“We have had around 100 claimants who we have helped DWP identify income stream data for. So it’s going to plan at the moment.”

Chris Vallance concluded the item by saying that some of the largest employers have yet to be added to RTI. “It’s only when it works at scale that we will really know how good real-time information really is,” he said.

Update: Chartered accountant Baker Tilly says on its website that thousands of people have been issued wrong tax codes as a result of RTI-related problems.

Audio of PM programme item on RTI – 4 July 2013 (approx 5 mins)

Cerner US-wide outage – what went wrong?

By Tony Collins

Hospital EMR and EHR has an account of what caused Cerner’s outage which affected its client sites across the USA and some international customers, according to reports.

It makes the point that the problem had little or nothing to do with the cloud.

The Los Angeles Times reports that within minutes of the outage, doctors and nurses reverted to writing orders and notes by hand, but in many cases they no longer had access to previous patient information saved remotely with Cerner.

“That information isn’t typically put on paper charts when records are kept electronically,” said the newspaper which quoted Dr. Brent Haberman, a pediatric pulmonologist in Memphis as saying, “If you can’t get to all the patient notes and all the previous data, you can imagine it’s very confusing and mistakes could be made…A new doctor comes on shift and doesn’t have access to what happened the past few hours or days.”

Hospital EMR and EHR

Some hospitals cope well through the outage

Some lessons from a major outage

By Tony Collins

One of the main reasons for remote hosting is that you don’t have to worry about security and up-time is guaranteed. Support is 24x7x365. State-of-the-art data centres offer predictable, affordable, monthly charges.

In the UK more hospitals are opting for remote hosting of business-critical systems. Royal Berkshire NHS Foundation Trust and Homerton University Hospital NHS Foundation Trust are among those taking remote hosting from Cerner, their main systems supplier.

More trusts are expected to do the same, for good reasons: remote hosting from Cerner will give Royal Berkshire a single point of contact to deal with on technical problems without the risks and delay of ascertaining whether the cause is hardware, third party software or application related.

But what happens when the network goes down – across the country and possibly internationally?

With remote hosting of business-critical systems becoming more widespread it’s worth looking at some of the implications of a major outage.

A failure at CSC’s Maidstone data centre in 2006 was compounded by problems with its recovery data centre in Royal Tunbridge Wells. Knock-on effects extended to information services in the North and West Midlands. The outage affected 80 trusts that were moving to CSC’s implementation of Lorenzo under the NPfIT.

An investigation later found that CSC had been over-optimistic when it informed NHS Connecting for Health that the situation was under control. Chris Johnson, a professor of computing science at Glasgow University, has written an excellent case study on what happened and how the failure knocked out primary and secondary levels of protection. What occurred was a sequence of events nobody had predicted.

Cerner outage

Last week Cerner had a major outage across the US. Its international customers might also have been affected.

InformationWeek Healthcare reported that Cerner’s remote hosting service went down for about six hours on Monday, 23 July. It hit “hospital and physician practice clients all over the country”. InformationWeek said the unusual outage “reportedly took down the vendor’s entire network” and raised “new questions about the reliability of cloud-based hosting services”.

A Cerner spokesperson, Kelli Christman, told InformationWeek:

“Cerner’s remote-hosted clients experienced unscheduled downtime this week. Our clients all have downtime procedures in place to ensure patient safety. The issue has been resolved and clients are back up and running. A human error caused the outage. As a result, we are reviewing our training protocol and documented work instructions for any improvements that can be made.”

Christman did not respond to a question about how many Cerner clients were affected. HIStalk, a popular health IT blog, reported that hospital staff resorted to paper but it is unclear whether they would have had access to the most recent information on patients.

One Tweet by @UhVeeNesh said “Thank you Cerner for being down all day. Just how I like to start my week…with the computer system crashing for all of NorCal.”

Another by @wsnewcomb said “We have not charted any pts [patients] today. Not acceptable from a health care leader.”

Cerner Corp tweeted “Our apologies for the inconvenience today. The downtime should be resolved at this point.”

One HIStalk reader praised Cerner communications. Another didn’t:

“Communication was an issue during the downtime as Cerner’s support sites was down as well. Cerner unable to give an ETA on when systems would be back up. Some sites were given word of possible times, but other sites were left in the dark with no direction. Some sites only knew they were back up when staff started logging back into systems.

“Issue appears to have something to do with DNS entries being deleted across RHO network and possible Active Directory corruption. Outage was across all North America clients as well as some international clients.”

Colleen Becket, chairman and co-CEO of Vurbia Technologies, a cloud computing consultancy, told InformationWeek Healthcare that NCH Healthcare System, which includes two Tampa hospitals, had no access to its Cerner system for six hours. The outage affected the facilities and NCH’s ambulatory-care sites.


A HIStalk reader said Cerner has two electronic back-up options for remote hosted clients. Read-only access would have required the user to be able to log into Cerner’s systems, which wouldn’t have been possible with the DNS servers out of action last week.

Another downtime service downloads patient information to local computers, perhaps at least one on each floor, at regularly scheduled intervals, updated say every five minutes. “That way, even if all connection with [Cerner’s data centre] is lost, staff have information (including meds, labs and more) locally on each floor which is accurate up to the time of the last update”.

Finally, says the HIStalk commentator, “since this outage was due to a DNS problem, anyone logged into the system at the time it went down was able to stay logged in. This allowed many floors to continue to access the production system even while most of the terminals couldn’t connect.”
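The second back-up option the HIStalk reader describes, a local copy refreshed at regular intervals, can be sketched in Python as follows. This is not Cerner's implementation: the `DowntimeCache` class, the `fake_fetch` function and the patient record are hypothetical stand-ins, and the clock is simulated so the example runs without a network.

```python
import time


class DowntimeCache:
    """Keep the last good copy of remote data for use while the host is down."""

    def __init__(self, fetch, clock=time.time):
        self._fetch = fetch      # callable that pulls data from the remote host
        self._clock = clock      # injectable clock, so the sketch can be tested
        self._snapshot = None
        self._taken_at = None

    def refresh(self):
        """Try to pull a fresh copy; keep the old one if the host is unreachable."""
        try:
            data = self._fetch()
        except ConnectionError:
            return False
        self._snapshot = data
        self._taken_at = self._clock()
        return True

    def read(self):
        """Return the last snapshot and its age in seconds."""
        if self._snapshot is None:
            raise LookupError("no snapshot has been taken yet")
        return self._snapshot, self._clock() - self._taken_at


# Simulated remote host and clock (both hypothetical).
now = [0.0]
host_up = [True]


def fake_fetch():
    if not host_up[0]:
        raise ConnectionError("remote host unreachable")
    return {"patient-001": {"meds": ["salbutamol"]}}


cache = DowntimeCache(fake_fetch, clock=lambda: now[0])
first_ok = cache.refresh()    # host is up, snapshot stored

host_up[0] = False            # outage begins
now[0] = 300.0                # five minutes later
second_ok = cache.refresh()   # refresh fails, old snapshot is kept
snapshot, age = cache.read()
print(first_ok, second_ok, age)
```

Staff reading from such a cache would see data that is accurate only up to the last successful refresh, which is why the age is returned alongside the snapshot.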

But could the NHS afford a remote hosted service, and a host of on-site back-up systems?

Common factors in health IT implementation failures

In its discussion on the Cerner outage, HIStalk put its finger on the common causes of hospital IT implementation failures. It says the main problems are usually:

– a lack of customer technical and implementation resources;
– poorly developed, self-deceiving project budgets that don’t support enough headcount, training, and hardware to get the job done right;
– letting IT run the project without getting users properly involved;
– unreasonable and inflexible timelines as everybody wants to see something light up quickly after spending millions; and
– expecting that just implementing new software means clearing away all the bad decisions (and indecisions) of the past and forcing a fresh corporate agenda on users and physicians, with the supplier being the convenient whipping boy for any complaints about ambitious and sometimes oppressive changes that the culture just can’t support.

Cerner hosting outage raises concerns

HIStalk on Cerner outage

Case study on CSC data centre crash in 2006