How a Single Flawed Data Entry Took Down a Telephone Network

The cause of the outage that largely took out Spark’s mobile network yesterday evening has been traced to a corrupted data file sent in error by an Asian phone company.

Customers across the country had only about a one-in-four to one-in-three chance of placing calls or texts during the worst of the outage between 5.30pm and 8.30pm, chief operating officer David Havercroft said.

Customers from Hamilton north were the worst affected by the fault, which also prevented 111 calls from being made and took out mobile broadband.

Havercroft said the fault had been traced to a corrupted automated message sent by a large, “reputable” Asian carrier. The message was intended to update Spark on which of the carrier’s customers should be allowed to roam in New Zealand on Spark’s network.

Spark received hundreds of such messages from phone companies around the world each day that were processed automatically by the network’s core computer system, he said.

However, the corrupted message had a missing digit, and a piece of equipment supplied by Cisco was swamped by error messages when it tried to process the command.

“The number range it was telling the system to cancel off the network was truncated and the system got into a loop trying to take hundreds of thousands of numbers off the network.”
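The kind of protection that could catch a truncated range before it is acted on can be sketched in a few lines. This is only an illustration under stated assumptions: the article does not describe the message format or how the equipment processes it, so the function name `validate_range`, the string-based bounds and the `MAX_RANGE_SIZE` limit are all hypothetical.

```python
# Hypothetical sanity check for a "remove this number range" request.
# Nothing here reflects the actual Spark or Cisco systems.

MAX_RANGE_SIZE = 10_000  # assumed limit; a real value would be operator policy


def validate_range(start: str, end: str, max_size: int = MAX_RANGE_SIZE) -> int:
    """Return how many numbers lie in [start, end], or raise ValueError."""
    for bound in (start, end):
        if not (bound.isascii() and bound.isdigit()):
            raise ValueError("range bounds must contain digits only")
    if len(start) != len(end):
        # A dropped digit in one bound shows up as a length mismatch.
        raise ValueError("range bounds differ in length; possible truncation")
    size = int(end) - int(start) + 1
    if size < 1:
        raise ValueError("range end precedes range start")
    if size > max_size:
        # Refuse implausibly large cancellations instead of looping over them.
        raise ValueError(f"range covers {size} numbers, above limit {max_size}")
    return size
```

A request rejected this way could be set aside for manual review rather than processed automatically, so one malformed message would not trigger hundreds of thousands of removals.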

That had a domino effect on another piece of equipment, which shed New Zealand customers intermittently in an effort to “protect the network”, Havercroft said.

Havercroft believed the fault was unprecedented. Cisco was working on a patch, which he hoped would be in place within days.

Spark was unlikely to seek compensation from Cisco or the Asian carrier and would only consider compensating its own customers on a case-by-case basis if they complained, he said.

Havercroft said he was pleased with the way Cisco had responded. “We will be having a conversation with them about why there weren’t other protections built in to stop this happening.”

While Spark was “massively disappointed”, it could not guarantee 100 per cent up-time, he said.

Customers could expect on average a couple of hours of disruption each year, he said.

The outage rekindled memories of the errors that struck the network in 2009 and 2010 when it was called XT. They prompted Spark, then Telecom, to ditch the XT brand and film a series of public televised apologies that featured former chief executive Paul Reynolds fly fishing and appealing for a “second chance”.

Originally published on stuff.co.nz