Unfortunately, CollegeSource recently had the opportunity to test the truth of this aphorism when we experienced two related service outages, the first on Friday, January 15th, and the second on Monday, January 25th. Here is what happened and what we have learned from the outages.
Event #1
At 9am PST / 12pm EST on Friday, January 15th, a number of our hosted services unexpectedly went down. The services affected were TES, Transferology, Transferology Lab, and CollegeSource On-line. As you might expect, this event incited a flurry of activity, and we quickly uncovered a hardware failure severe enough to render all of our automated fail-over scenarios moot, yet not drastic enough to warrant one of our longer-term recovery scenarios.
Let me try to explain that further.
CollegeSource maintains its own virtualized server array. We conduct multiple, redundant, and geographically disparate backups of data. We run parallel virtual machines that work both in tandem and independently to ensure uninterrupted service if any single machine fails. Additionally, we maintain redundant hardware platforms from which we can run the services, should we need to abandon one platform entirely and relaunch the services on another. The solution we deploy depends very much on the type of problem we experience.
Quick Fail-Over Solutions
The vast majority of issues we encounter happen at a level where we can fail over, automatically or by intervention, to another virtual drive, meaning the user never experiences any interruption of service. That is the more trivial end of the spectrum of problems, though no problem seems trivial when you have as many clients as we do relying on the services.
Stack Replacement Solutions
On the other end of the spectrum is catastrophic failure, the kind that renders a hardware platform useless: a long-term power grid outage, a fire, an earthquake. For any outage that involves an unrecoverable technology stack, we commit to moving the services to one of our backup platforms and redirecting the domains to those servers. Because of the way domain-to-address mappings propagate across the Internet, however, it takes hours (sometimes more than a day) before all traffic is correctly re-routed. Therefore, when we believe we can restore a platform, especially when we believe we can get it up and running within a few hours, we prefer to avoid this drastic measure.
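For the technically curious, the delay comes from DNS caching: every DNS record carries a time-to-live (TTL), and resolvers around the world keep serving the old address until that TTL expires, so a redirected domain only "catches up" gradually. Below is a minimal sketch, in Python with the third-party dnspython package, of how one might check the TTL advertised for a record; the domain shown is a placeholder for illustration, not one of our actual service addresses.

# Minimal sketch: see how long resolvers may keep serving a cached DNS answer.
# Requires the third-party dnspython package (pip install dnspython).
# "example.com" is a placeholder domain used purely for illustration.
import dns.resolver

def advertised_ttl(domain: str) -> int:
    """Return the TTL (in seconds) on the domain's A record, i.e. the maximum
    time a resolver may cache the current address before asking again."""
    answer = dns.resolver.resolve(domain, "A")
    return answer.rrset.ttl

if __name__ == "__main__":
    ttl = advertised_ttl("example.com")
    print(f"Resolvers may serve the old address for up to {ttl} seconds after a change.")

Lowering a record's TTL ahead of a planned move shortens that window, but in an unplanned failure the older, longer TTL is already cached across the Internet, which is why full re-routing can take a day or more.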
Everything In-Between
The failure we experienced on Friday was about as crippling as you can imagine, precisely because the hardware failed at a level beyond all of our quick-recovery solutions, yet not at a level that warranted a full relocation of the services.
The entire first outage lasted an unprecedented 29 hours. Typing that makes me cringe, even though I know most users only experienced interrupted service for 5-8 business hours, depending on their time zone. Partial functionality was restored by Saturday morning, but full restoration of the services was not achieved until 2pm PST / 5pm EST on Saturday, January 16th. Worse, some of the more recent backup files were corrupted during the failure. Our last reliable backup was from 3am PST / 6am EST Friday morning, meaning some clients lost data they had entered or created during the six hours between that backup and the time the services went down.
Event #2
As we met to debrief on how such an error occurred and how it could be prevented in the future, it became clear that our first step was to replace one of the units that had caused the problem and was functioning at about 70% capacity. Availability of the part forced us to choose between starting this repair on Sunday afternoon, January 24th, and waiting another week with the risk of the part failing at a critical time. The timing wasn't optimal; we prefer to start such processes on a Friday night. However, the repair was projected to take only two hours, and the other option (wait another week and pray the part didn't fail completely) was unacceptable. We began work Sunday afternoon with the goal of being back up and running that evening, after an outage of only a few hours, but once again things did not go our way and problems with the repair stretched into Monday morning. We engaged all of our resources, including consultants from our hardware and software vendors, to resolve the problem quickly, but the result was an extension of the planned outage until about 9am PST / 12pm EST Monday.
For the combined outages, we believe customers suffered 9-10 business hours of interrupted service. We know this is not acceptable. We know we can do better.
Lessons
Failure is never a pleasant thing to experience, but in every failure there is the chance to learn, re-prioritize, and grow. We have already reconfigured our backup scenarios. We are researching, and are committed to implementing, a completely new primary hardware stack with more reliable internal mechanical and virtual configurations. We are looking into DNS proxy services that might provide better opportunities for messaging or a quicker fail-over to a new hardware platform. We will make all of these changes and more.
But perhaps the thing we needed to be reminded of most is that people really need our products to do their jobs!
During the outage, one person wrote to us and said, “I miss TES. I don’t like when she’s not around.” This, of course, made us feel amazing and terrible at the same time. We love that you rely on us, and we feel it keenly when we disappoint you.
Thank You
The people who work at CollegeSource care… a lot. While the restoration efforts fell on a small group of people, everyone else, including our clients, was understanding, supportive, and appreciative of the work that had to be done. Positive attitudes go a long way in stressful situations. Many of you sent emails and made phone calls to our staff to let us know you had lost access, to ask what was wrong and when it would be made right, and to patiently encourage us during a difficult process. Thank you for your support! We are honored to make tools that you love, and we feel the responsibility of keeping them available to you.
And thank you for noticing our absence. We promise to do our very best never to be far from you again.
A mass email informing us of any interruption in service would be greatly appreciated, unless the failure also keeps you from sending any emails. 🙂
Thanks, Carol. Yes, we are learning how we can communicate better if a similar event occurs in the future. In this case, we sent emails to Transferology administrators because we had recently run a report and had them on hand. Our other lists were a little harder to get to: the same people who were involved in the restoration would have had to stop and pull lists from database backups, and we obviously didn’t want to slow them down. We are reviewing our notification procedures and trying to make some sensible adjustments. The first is a Managed DNS service that will allow us to put up an updatable “interruption of service” page no matter how catastrophic an outage is. (Though, of course, no system is perfect. If the Managed DNS service were out, it would cut off the route to our services even if they were up and running. The assumption is that the Managed DNS provider would be more distributed and less vulnerable than the services themselves, but … that’s why we are still just “investigating.”) We would not likely email every user if we could get such a page up reliably, but we will probably email TES and Transferology administrators in the event of any future unplanned outage of significant length.
I reached out to Shelly Jackson, who is always accessible and reliable. In fact, you are all a wonderful, responsive group of people and a pleasure to work with. Yes, my job was a bit more difficult without TES, especially since I am working on a college-wide project involving the database. But Shelly’s usual quick response (despite the time zone difference) was honest and reassuring. I appreciate Troy’s frank and easy-to-understand explanation of the problems that precipitated the outages and the measures the company is taking to remedy these problems. Keep doing what you do. You are the best!