Post Mortem: Skype Redux?
The dust has settled after Skype's mid-August meltdown. Millions of users around the world lost service for up to two days, producing much wailing and gnashing of teeth. Everything appears to be back to normal now, barring the odd Skype-borne virus. But we wondered.
What really happened when the lights went out in Skype-world? Could it happen again? Could it last longer next time? Can we still trust Skype?
I talked to the company's director of operations, Michael Jackson, and for a reality check, polled Martin Geddes, the evangelist of Teleapocalypse and chief analyst with UK-based STL Partners, the consulting and research firm behind the Telco2.0 initiative. Skype also took a crack at explaining everything on one of its blogs.
To its credit, the company takes full responsibility for the outage, despite early reports that tried to make Microsoft the villain, and it appears to appreciate the impact the incident had on users.
"Two days is a long time; we're the first to admit that," Jackson says. "Clearly there are businesses and people who depend on this service. They're wondering, is it going to happen again? Can we rely on Skype in the future? So I guess we have to regain that trust point, the same as any company that lets its customers down. We're going to do our absolute utmost to try and make sure it doesn't happen again."
The company did give paying customers (SkypeOut, SkypeIn, Skype Pro, and voicemail users) an extra week of service, though initially it appeared it was only giving credit for the period of the outage. Geddes felt this was the one false step in the company's post-meltdown public relations effort.
"That was the finance department speaking," he says of the initial offer. "It's not anything from the heart, it's not an apology. They should have said, 'Here's a week's credit,' or a month's. It would be better if they offered nothing. This is an insult, really."
Skype users apparently didn't feel that way. According to Jackson, usage numbers very quickly bounced back, with log-ons on the following Tuesday about the same as the previous week. The seasonal upswing with school starting in September also exactly mirrored the previous year, he says.
What caused the outage?
It's much clearer now what happened and why.
Early on the morning of Thursday, August 16, Microsoft launched a mass online update of Windows computers to add security patches and other bug fixes. Soon after, Skype noticed an unusual number of users were having trouble logging in.
Skype users don't really log in the way users in a conventional client-server network do (in a peer-to-peer network, there are no central servers), but they do have to validate their Skype clients and credentials against the network.
The problem in this case was a dearth of supernodes, the user computers the company commandeers to manage the peer-to-peer network and specifically the validation process. Without them users can't log in.
The software agreement you sign when you install Skype client software gives the company permission to use some of your computer's processing and bandwidth capacity. Each supernode handles about 300 nearby users. Skype configures five in each cell for redundancy. So with upwards of nine million users online, it takes something like 150,000 supernodes to make Skype work.
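The 150,000 figure follows from simple arithmetic on the numbers quoted above: roughly 300 users per cell, five supernodes per cell. A minimal sketch in Python; the function and parameter names are illustrative only and are not taken from Skype's software.

```python
def supernodes_needed(users_online: int,
                      users_per_cell: int = 300,
                      supernodes_per_cell: int = 5) -> int:
    """Estimate the supernode count from the figures quoted in the article."""
    # Round up so that a partly filled cell still gets its full set of supernodes.
    cells = -(-users_online // users_per_cell)
    return cells * supernodes_per_cell


print(supernodes_needed(9_000_000))  # 150000
```

Nine million users divided into cells of 300 gives 30,000 cells, and five supernodes in each yields the 150,000 quoted.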
The software automatically selects the most reliable computers with the fastest Internet connections to be supernodes. The trouble is, when a supernode goes away temporarily, as thousands did when Microsoft automatically rebooted them after the patch, it no longer qualifies to be a supernode, at least until it proves its reliability all over again.
So millions of Skype users' computers were rebooting after the update and most were trying to reconnect to Skype. The few supernodes left standing couldn't handle the traffic. Geddes compares it to a denial of service attack on a conventional network.
"There's some truth in that," Jackson says. "It's a combination of a lack of availability of [super]nodes (they were all full) and the fact you can't become a supernode until you log on to the network. And there aren't enough clients available to become nodes because they can't log on. So it's more a catch-22 than a [denial of service]."
But why did this Microsoft update "catalyze," as Jackson puts it, such a catastrophic reaction in the Skype network? Microsoft regularly updates and automatically reboots users' computers.
"This patch caught a larger percentage of computers and it was a deeper reset," Jackson says. "We hadn't seen this before. We'd seen perturbations in the network [after other Microsoft updates], but put them down to just that, perturbations. We never thought it could be this kind of a domino effect."
The internal gremlin
The other factor (the real culprit, Skype now says) was a resource allocation algorithm in the client software that could not adapt to such a set of circumstances. Instead of backing off on their attempts to validate on the network when supernodes weren't immediately available, and waiting for the ship to right itself, clients kept hammering away, trying to log in.
"We just never thought that supernodes could ever not be available to this level," Jackson says. Once engineers could see that that's what had happened, it took "about eight minutes to repair [the offending piece of code]."
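Skype has not published the repaired code, so what "backing off" looks like in its client is not known. One common way to stop clients from hammering a struggling service is exponential backoff with random jitter, sketched below in Python. This is an illustration of that general technique, not Skype's algorithm; every name and default value here (`backoff_delays`, `validate_with_backoff`, `try_validate`, the one-second base, the five-minute cap) is an assumption made for the example.

```python
import random
import time
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 1.0, cap: float = 300.0, attempts: int = 8,
                   rng: Optional[random.Random] = None) -> Iterator[float]:
    """Yield one randomized wait (in seconds) per attempt, growing each time."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        # The ceiling doubles with every failure, up to the cap.
        ceiling = min(cap, base * 2 ** attempt)
        # Jitter: a random wait below the ceiling keeps clients from retrying in lockstep.
        yield rng.uniform(0, ceiling)


def validate_with_backoff(try_validate: Callable[[], bool],
                          sleep: Callable[[float], None] = time.sleep,
                          **backoff_options) -> bool:
    """Try to validate; wait after every failed attempt, then give up."""
    for delay in backoff_delays(**backoff_options):
        if try_validate():
            return True
        sleep(delay)
    return False
```

A caller would pass in whatever function performs the real validation attempt. The point of the design is that each failure makes the client wait longer, and the randomness spreads millions of retries out over time instead of letting them arrive together.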
Could it happen again?
Fixing the code should prevent the same thing happening again under similar circumstances, but the network actually righted itself on its own, he points out.
Should Skype have known something like this could happen? Jackson says yes. The Microsoft update and reboot was a legitimate action, he says, and the way of the world. So Skype should have been prepared.
Some might dispute this. Why does Microsoft automatically shut down computers at all, given the risk, at the very least, of unsaved user data being lost? Why not perform the update and pop up a message that users would find when they came back to their computers, instructing them to reboot to complete the process?
But Jackson goes out of his way to absolve Microsoft and even praise the company on two counts. It initially took seriously the possibility of its own culpability, that something in the patch was preventing the Skype network from recovering, he says. And it was very responsive to Skype, including convening a SWAT team at 8 a.m. on the Thursday morning to help troubleshoot.
"It wasn't anything they did," Jackson says. "But they were hugely helpful. I was really impressed."
Geddes has an interesting take. One of the trade-offs with peer-to-peer networks, compared to client-server networks, he points out, is that they give up manageability, especially the ability to manage endpoints, in exchange for scalability. It's the nature of the beast. He also notes that P2P is still a new, immature technology and that Skype is feeling its way forward, much as early Internet service providers had to do.
"The Skype folks couldn't do a test of millions of users re-logging in," Geddes says. "So Microsoft did the test for them, and it failed."
Jackson insists that a simple fix to the resource allocation algorithm, which will force clients to wait when they encounter a similar situation and re-validate in an orderly fashion, will prevent the same thing happening. "[The network] wouldn't break. The time period [outage] would be some minutes rather than hours."
The August meltdown was a wake-up call for Skype, though, he says. Following an in-depth post mortem of its own, the company has assigned engineers the task of anticipating other potential network-wrecking circumstances, and figuring out ways to prevent them.
As for regaining users' trust, Jackson is candid and realistic. "We screwed up," he says. "Everybody gets a second chance. We just can't abuse it."