Spotify experienced a global outage on March 8, caused by problems with its cloud-hosted service discovery system. The company became aware of login issues at 18:12 UTC / 13:12 ET and began applying updates to key systems at 18:39 UTC / 13:39 ET.
What was the issue?
Spotify’s backend is made up of many microservices that communicate with one another, and those microservices locate each other through several service discovery technologies. Most of Spotify’s services use a DNS-based service discovery system, but some use Traffic Director, an xDS-based traffic control plane and discovery system.
Google Cloud’s Traffic Director service suffered an outage on March 8. That outage, combined with a bug in a gRPC client library, caused the Spotify outage, which affected many customers: users who logged out of the Spotify app were unable to log back in.
“As soon as the problem was discovered, we rolled out configuration changes to revert our affected systems back to use our DNS-based service discovery and saw it recover gradually,” said Spotify in a blog post.
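The mitigation described in the quote, switching affected services from the control plane back to DNS through a configuration change, can be sketched roughly as follows. This is a minimal illustration, not Spotify's actual code: `DiscoveryClient`, `use_control_plane`, `ControlPlaneUnavailable`, and `dns_resolver` are hypothetical names, and the control-plane lookup is represented by a plain function rather than a real xDS client.

```python
import socket
from dataclasses import dataclass
from typing import Callable, List

# A resolver maps a service name to a list of endpoint addresses.
Resolver = Callable[[str], List[str]]


class ControlPlaneUnavailable(Exception):
    """Hypothetical error raised when the discovery control plane is down."""


def dns_resolver(service: str) -> List[str]:
    """Resolve a service name to IP addresses through ordinary DNS."""
    infos = socket.getaddrinfo(service, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address.
    return sorted({info[4][0] for info in infos})


@dataclass
class DiscoveryClient:
    control_plane: Resolver         # stand-in for an xDS-based lookup (assumption)
    dns: Resolver                   # DNS-based lookup
    use_control_plane: bool = True  # hypothetical configuration flag

    def resolve(self, service: str) -> List[str]:
        # The flag decides which discovery mechanism serves the lookup.
        if self.use_control_plane:
            return self.control_plane(service)
        return self.dns(service)
```

In this sketch, reverting to DNS is a matter of setting `use_control_plane` to `False` and redeploying the configuration. The client does not fall back automatically when the control plane fails; that would be a different design from the manual rollback the quote describes.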
Spotify is working with Google Cloud to better understand how the Traffic Director problems led to an outage on this scale. The company said it will improve its monitoring and alerting so that similar service discovery issues are caught sooner in the future.
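Spotify has not published what that monitoring will look like. As one hedged illustration of the general idea, the sketch below probes a discovery lookup and raises an alert after a run of consecutive failures. `DiscoveryMonitor`, `probe`, and the default threshold of 3 are hypothetical choices, and `alert` is any function that accepts a message.

```python
from typing import Callable, List


class DiscoveryMonitor:
    """Counts consecutive failed lookups and alerts once a threshold is reached."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]],
        alert: Callable[[str], None],
        threshold: int = 3,  # hypothetical default
    ) -> None:
        self.resolve = resolve
        self.alert = alert
        self.threshold = threshold
        self.consecutive_failures = 0

    def probe(self, service: str) -> bool:
        """Run one lookup; return True if it produced at least one endpoint."""
        try:
            endpoints = self.resolve(service)
        except Exception:
            # Treat any lookup error the same as an empty result.
            endpoints = []
        if endpoints:
            self.consecutive_failures = 0
            return True
        self.consecutive_failures += 1
        # Alert once, at the moment the threshold is crossed, not on every failure.
        if self.consecutive_failures == self.threshold:
            self.alert(f"service discovery failing for {service}")
        return False
```

Alerting only when the counter reaches the threshold keeps a single transient failure from paging anyone, while a successful lookup resets the counter so a later incident triggers a fresh alert.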