Tuesday 6th November 2018

Optimus Communications - SIP Trunks Outage Incident Report

Priority: 1 - Location: Nationwide - Reference: 486252

Start Date/Time: 05/11/2018 09:20 - Resolution Date/Time: 05/11/2018 16:15 - Duration: 6 hours 55 minutes

Event Description: [Restored] Voice services were offline.

Customer Impact: Customers would have experienced inbound and outbound call failures.

Summary of Incident: The Optimus Systems voice network has several parts. The affected part was CVS (Cloud Voice Switch), which hosts our interconnections and terminates calls to and from landlines and mobiles. All calls in and out of CVS failed. Most of our customers rely on CVS to some degree, as it connects at the root level back to our PBX instances.

Timeline of Incident:

  • At 9.20am, our upstream provider received a DDoS (distributed denial-of-service) attack against one of their web-hosting customers in their Christchurch data centre.
  • Whilst CVS runs on 8 database servers in different locations around New Zealand for redundancy, the DDoS attack caused a corrupted update to propagate to all of them. Instead of 8 good database servers, our provider had 8 corrupted database servers across New Zealand.
  • They have backups of the database servers, but restoring from them takes time.
  • After restoring the database servers, they experienced load issues because so many customers' IP phones were trying to register against the SIP servers at the same time.
  • Because of the load issues, they took the opportunity to restart all the database servers. Many had been up for 900+ days and had been reliable, but they wanted to rule out any unknown memory issues.
  • They then restored services briefly, but unfortunately received a second and third DDoS attack against the same customer in Christchurch, on different IP addresses.
  • Whilst the voice and data networks run on separate infrastructure, this particular DDoS attack was so damaging because the customer under attack and the provider's voice databases were on the same 1Gbit trunk port. They had never had any load issues on this trunk port before, but in this instance traffic levels were high enough to cause cascading issues.
  • They have completed moving Voice services in Christchurch to a separate 10Gbit trunk port.
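
For readers wondering how a registration surge like the one described above is commonly softened: one widely used technique is for each phone to wait a randomised, exponentially growing delay before retrying, so that retries spread out instead of arriving together. The sketch below is illustrative only. It is not our upstream provider's implementation, and the function names, base delay and cap are assumptions chosen for the example.

```python
import random

def reregistration_delay(attempt, base=2.0, cap=300.0, rng=random):
    """Seconds a phone waits before retry number `attempt` (0-based).

    Exponential backoff with "full jitter": the ceiling doubles on each
    attempt up to `cap`, and the actual delay is drawn uniformly from
    [0, ceiling]. `base` and `cap` are illustrative values (assumptions).
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    return rng.uniform(0.0, ceiling)

def schedule(num_phones, attempt, seed=0):
    """Delays for a hypothetical fleet of phones all on the same retry attempt."""
    rng = random.Random(seed)
    return [reregistration_delay(attempt, rng=rng) for _ in range(num_phones)]

if __name__ == "__main__":
    delays = schedule(num_phones=10, attempt=3)
    print(sorted(round(d, 1) for d in delays))
```

With these example values, phones on their fourth attempt (attempt=3) retry at scattered points within a 16 second window rather than all at the same instant, and no phone ever waits longer than the 300 second cap.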

Post event actions:

  • We realise that this outage, along with the unrelated outage on Friday 26th October, is completely unacceptable to our customers.
  • Optimus Systems are unhappy with the recent outages on what has been a stable system. We are conducting our own investigation into alternate upstream interconnects. Even though there are technical explanations, being unable to supply a service for almost a whole day is not acceptable.
  • We are continuing to investigate this issue and working to defend against similar issues in future, and we will keep you informed of our progress.