Communicating with our community is vitally important to us, especially when it comes to information about our services and operations. Currently, we keep the community updated about the status of ARIN services at our Public Policy and Members Meetings and directly through announcements. To keep you better informed throughout the year, we are starting an operations-focused blog series that will cover the development of new services, highlight our existing services, and share any plans to modify our overall registry service offerings. This series will feature posts from several different members of the ARIN team to keep you current on operations at your Regional Internet Registry.
An important goal of this blog series is to openly address operational topics, rather than serve as a marketing tool. As an example, today’s post covers an operations mishap that affected our services at the end of December last year.
One of the things we pride ourselves on at ARIN is our ability to keep our services up and running, that is, available to our customers whenever they need them. When we do have to take services offline for any reason, we communicate our plans to the community via email and announcements posted on our website, and we do all that we can to finish within the work window we committed to in those announcements. Sadly, we had a serious mishap in December 2019 that we would like to explain.
On 23 December 2019, at 12:35 PM EST, ARIN Operations received multiple alerts from monitoring systems concerning multiple virtual machines that supported customer-facing ARIN services. Operations staff attempted to manually force the failover of the virtual machines to other hardware that was on standby. However, this also failed and we lost hosting of the virtual machines in that part of our network at 12:55 PM, at which point both our website and our ARIN Online customer application were offline and unavailable.
After troubleshooting, working with our virtualization vendor, and rebuilding the virtualization nodes, we determined that a problem with access to the shared storage platform used by the virtual machine cluster was the likely culprit. We worked with our storage system vendor on the diagnosis, but they were unable to resolve the storage problem in a timely manner. At 3:30 PM, given the length of the outage and the uncertainty about the time to repair, we decided to swing ARIN’s website to our disaster recovery site, a capability we maintain and test so that it is available for such situations. We restored access to the ARIN website by 4:00 PM and to our customer application by 5:10 PM.
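For readers who prefer durations to clock times, the timeline above can be turned into minutes with a few lines of Python. This is only an illustrative sketch: the times are the ones reported in this post (all EST on 23 December 2019), while the helper and variable names are made up for this example and do not correspond to any ARIN tooling.

```python
from datetime import datetime


def at(clock: str) -> datetime:
    """Parse a 12-hour clock time reported for 23 December 2019 (EST)."""
    return datetime.strptime(f"2019-12-23 {clock}", "%Y-%m-%d %I:%M %p")


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end."""
    return int((end - start).total_seconds() // 60)


# Times as stated in the post; the names are illustrative only.
first_alert = at("12:35 PM")
services_lost = at("12:55 PM")
failover_decision = at("3:30 PM")
website_restored = at("4:00 PM")
arin_online_restored = at("5:10 PM")

alert_to_outage = minutes_between(first_alert, services_lost)
website_downtime = minutes_between(services_lost, website_restored)
arin_online_downtime = minutes_between(services_lost, arin_online_restored)
decision_to_website = minutes_between(failover_decision, website_restored)

print(f"Alert to loss of service:     {alert_to_outage} min")       # 20 min
print(f"Website unavailable:          {website_downtime} min")      # 185 min
print(f"ARIN Online unavailable:      {arin_online_downtime} min")  # 255 min
print(f"Decision to website restored: {decision_to_website} min")   # 30 min
```

Measured from the loss of hosting at 12:55 PM, that works out to roughly three hours for the website and a little over four hours for ARIN Online.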
While ARIN’s services were operational, we continued to work with our storage system vendor through the Christmas holiday to determine the underlying problem. By 2 January, they had replaced the failed component, and we were able to put the impacted cluster back in operation. Because we knew that swinging the website back to the primary data center would require an outage, we elected to perform this work during a previously planned maintenance window on 25 January.
So, what went wrong? Our virtualization cluster is based on a high-availability configuration with redundancy throughout the system. Our after-action review found that a hardware failure coupled with a system misconfiguration on the storage appliance caused the virtualization engine to fail. The hardware has since been replaced, the corrected configuration has been validated by vendor support, and everything has been put back into normal operation.
We are sorry for the inconvenience caused by the unplanned outage, but we are also proud of our team for their quick thinking and hard work to get us back online. We are committed to keeping the community informed about the status of our services, and we hope you find this operations-focused blog series a helpful window into operations at ARIN.