Starting at approximately 3 AM UTC today, BunnyCDN suffered a system-wide outage that lasted about one hour. It was the result of a chain of events that started with a DDoS attack aimed at our backend and API server, combined with a bug in our DNS network. We would like to apologize to everyone affected and assure you that we are already hard at work on preventive measures so that this does not happen again.
So what exactly went wrong?
As always, we want to provide full insight into what happened.
Starting at 3 AM UTC, we began receiving notifications about a problem with our network. After further investigation, we found that a DDoS attack was targeting the BunnyCDN API and backend server. The attack was being filtered, but we later found out that it prevented our backend server from performing DNS queries against Google's nameservers, which it needs in order to connect to the database. Although our CDN system was designed to run completely independently of the backend server, the resulting request timeouts triggered a memory leak bug in our DNS software. Over the course of approximately 15 minutes, this gradually took down all of the global DNS nodes and resulted in a worldwide outage.
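To illustrate the general failure mode, where requests that never complete pile up in memory, here is a minimal Python sketch of a bounded table of in-flight requests. It is a generic illustration and not the actual BunnyCDN DNS software; the class name PendingRequests and its parameters are hypothetical.

```python
import time
from collections import OrderedDict


class PendingRequests:
    """Tracks in-flight requests with a hard size cap and a per-request timeout."""

    def __init__(self, max_pending, timeout_s, clock=time.monotonic):
        self.max_pending = max_pending
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the behavior can be tested without sleeping
        self._deadlines = OrderedDict()  # request_id -> deadline, oldest first

    def _evict_expired(self):
        # The timeout is constant, so insertion order equals deadline order
        # and we only ever need to look at the oldest entry.
        now = self.clock()
        while self._deadlines:
            _, deadline = next(iter(self._deadlines.items()))
            if deadline > now:
                break
            self._deadlines.popitem(last=False)

    def add(self, request_id):
        """Register a request; return False if the table is full."""
        self._evict_expired()
        # Re-adding an existing id moves it to the back with a fresh deadline.
        self._deadlines.pop(request_id, None)
        if len(self._deadlines) >= self.max_pending:
            return False
        self._deadlines[request_id] = self.clock() + self.timeout_s
        return True

    def complete(self, request_id):
        """Forget a request once its answer has arrived."""
        self._deadlines.pop(request_id, None)

    def __len__(self):
        return len(self._deadlines)
```

With a cap and an eviction step like this, a dependency that stops answering causes new requests to be rejected rather than memory to grow without limit.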
We immediately attempted to restart the backend server; however, due to the DDoS attack, Google's DNS was still unreachable and the database still appeared to be unresponsive. To remedy the situation, we attempted a physical restart of the machine, but for a reason that is still unknown, it failed to boot multiple times, which caused additional delays. At around 4 AM UTC, we finally managed to bring the web server back up and switched the DNS resolver to one provided by our hosting partner that was not being filtered out by the DDoS protection. This did the trick and brought the system back online. We immediately restarted the DNS network and all the software running on the edge nodes, and traffic began flowing through the edge nodes again. Finally, the sudden surge of data that had accumulated while the CDN system was offline crashed our metrics server, but it was quickly restarted and brought back online. At this point, the whole system was back up and running again.
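The manual resolver switch described above could in principle be automated by trying several resolvers in order. The following Python sketch shows the idea under stated assumptions: resolve_with_fallback and ResolutionError are hypothetical names, and the query argument stands in for whatever lookup function a real system would use. It is not how the BunnyCDN backend is implemented.

```python
from typing import Callable, Sequence


class ResolutionError(Exception):
    """Raised when no resolver in the list returns an answer."""


def resolve_with_fallback(
    hostname: str,
    resolvers: Sequence[str],
    query: Callable[[str, str], str],
) -> str:
    """Try each resolver in order and return the first successful answer.

    query(resolver, hostname) is a hypothetical lookup function that returns
    an address string or raises OSError (TimeoutError is a subclass of it).
    """
    errors = []
    for resolver in resolvers:
        try:
            return query(resolver, hostname)
        except OSError as exc:
            # Remember why this resolver failed, then move on to the next one.
            errors.append(f"{resolver}: {exc}")
    raise ResolutionError(
        f"all resolvers failed for {hostname}: " + "; ".join(errors)
    )
```

A caller would list the preferred public resolver first and a resolver on a different network path second, so that filtering or an outage affecting one does not block the lookup entirely.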
What are we doing to prevent this in the future?
We want to learn from our mistakes and improve based on this experience. The first thing we are doing is finding out why the DNS servers were not independent of the backend system during the incident, and making sure that they are completely independent from now on. Our system was designed to have no single point of failure, and we want to make sure this remains true under all circumstances. We will also be moving our internal API to a different server that will not be publicly exposed.
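One common way to keep a DNS node independent of a backend is to have it serve from its last known good data whenever a refresh fails. The Python sketch below illustrates that pattern only; ZoneCache and the fetch callable are hypothetical, and this is not a description of the actual fix.

```python
from typing import Callable, Dict, Optional


class ZoneCache:
    """Serves records from the last successful fetch, even if later fetches fail."""

    def __init__(self, fetch: Callable[[], Dict[str, str]]):
        self._fetch = fetch  # hypothetical call that loads records from the backend
        self._records: Dict[str, str] = {}

    def refresh(self) -> bool:
        """Try to load fresh records; keep the old ones if the backend is unreachable."""
        try:
            fresh = self._fetch()
        except Exception:
            # Backend unavailable: leave the existing records untouched.
            return False
        self._records = dict(fresh)
        return True

    def lookup(self, name: str) -> Optional[str]:
        """Answer from local data only; never calls the backend."""
        return self._records.get(name)
```

Because lookup never touches the backend, a backend outage can make the data stale but cannot stop the node from answering.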
Finally, we would like to apologize to everyone for the downtime and thank everyone for their patience while the incident was underway. We take any downtime very seriously and are constantly trying to improve the service. We will take the knowledge gathered from this incident and use it to make BunnyCDN even better.