PlumberSurplus.com Ecommerce and Entrepreneurship Blog

Communicating Service Outages to Your Team

Posted on July 22, 2010 by josh

If you're in charge of keeping critical systems going for your organization, it will be necessary, from time to time, to explain to your management team (or your customers, if you operate a service) why a service outage occurred. Whether it's your network, your phone system, your payment/checkout system, your website, or your {fill in one of the dozens of services you manage}, it's critical that you communicate what happened, who caused it, when it happened, where it occurred in your infrastructure, and how it affected your operation.

Below, I've created a simple sample of one of these communications (for demonstration purposes only):


Hello Team,

This communication is intended to provide you with a formal explanation for the service interruption that we experienced on Wednesday, July 21, 2010, which resulted in downtime on our website.

Background
Our website was down from 9:48pm to 9:59pm. No pages could be reached during this period.

Escalation Taken
Alert messages reporting a potential downtime were sent to the alert team at 9:48pm. The tech team mobilized and immediately began testing systems to determine the cause.

Root Cause
Upon investigation, it was found that a family of pygmy mouse lemurs had taken refuge in our data center. One of the adorable animals had pressed the power button on our web server, as it was shiny. This turned off our web server.

Corrective Actions
We turned the web server back on by pressing the power button, and the website resumed normal activity within a minute or so. Sadly, we also had to evict the cute but culpable creatures.

Next Steps
In the coming week, we will be installing a mesh screen on the server's rack to deter any new intruders. We've also begun plans to scale our infrastructure to support load balancing and redundancy across multiple web servers in multiple locations to help mitigate future issues like this.

Should you have any questions, please do not hesitate to email or call us. Sorry for any inconvenience!

Best regards,

Your IT Guy
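
If you send notices like this often, it may help to generate the skeleton from a handful of facts so that no section gets skipped. Here is a minimal sketch in Python that mirrors the section headings of the sample above and works out the length of the outage for you. The names (OutageReport, format_time, render_notice) are hypothetical ones I made up for illustration, and the wording of the generated text is only an assumption about what you might want to say.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OutageReport:
    """Hypothetical container for the facts an outage notice needs."""
    service: str
    start: datetime
    end: datetime
    impact: str
    escalation: str
    root_cause: str
    corrective_actions: str
    next_steps: str

    def downtime_minutes(self) -> int:
        # Whole minutes between the start and the end of the outage.
        return int((self.end - self.start).total_seconds() // 60)


def format_time(t: datetime) -> str:
    # Render a 12-hour clock time such as "9:48pm", as in the sample notice.
    hour = t.hour % 12 or 12
    suffix = "am" if t.hour < 12 else "pm"
    return f"{hour}:{t.minute:02d}{suffix}"


def render_notice(r: OutageReport) -> str:
    # Section titles follow the sample notice; the wording is illustrative.
    background = (
        f"Our {r.service} was down from {format_time(r.start)} to "
        f"{format_time(r.end)} ({r.downtime_minutes()} minutes). {r.impact}"
    )
    sections = [
        ("Background", background),
        ("Escalation Taken", r.escalation),
        ("Root Cause", r.root_cause),
        ("Corrective Actions", r.corrective_actions),
        ("Next Steps", r.next_steps),
    ]
    body = "\n\n".join(f"{title}\n{text}" for title, text in sections)
    return f"Hello Team,\n\n{body}\n\nBest regards,\n\nYour IT Guy"


report = OutageReport(
    service="website",
    start=datetime(2010, 7, 21, 21, 48),
    end=datetime(2010, 7, 21, 21, 59),
    impact="No pages could be reached during this period.",
    escalation="Alerts were sent to the alert team at 9:48pm.",
    root_cause="The web server was powered off.",
    corrective_actions="We turned the web server back on.",
    next_steps="We will install a mesh screen on the server's rack.",
)
notice = render_notice(report)
print(notice)
```

Keeping the facts in one place like this also makes it easy to double-check the times before you hit send.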

Issues are going to happen, and they're not always in your control. The key is to communicate quickly and accurately. As soon as you have the situation under control and you have the facts, send them out. You want to address concerns quickly because your team may be wondering why sales dropped between 9:48pm and 9:59pm. You want to be as accurate as possible because you may not realize that your marketing team had scheduled a report to be uploaded to FTP (which lives on another server that may also have been affected) at 9:45pm.

Don't worry. Chances are, your team (or your customer) cares less about the downtime itself than about how you respond and how well similar issues are handled in the future. Try not to use overly technical language, as you may be communicating with non-technical personnel. Be ready to answer lots of questions, and make yourself available. You may have resolved the issue to your own satisfaction, but others may still be left without closure. Do a great job of communicating what was, what is, and what will be. Your team will appreciate it immensely.

 
