Björn Rabenstein is a production engineer at SoundCloud and a Prometheus developer. Previously, Björn was a site reliability engineer at Google and a science number cruncher.
His talk for Codemotion Berlin, “About SRE – and how (not) to apply it”, begins by explaining that “SRE is what happens when you ask a software engineer to design an operations team”.
Björn also spoke about his role – and that of his teammates.
“Production engineer is probably a term used by many companies and it’s very specific at SoundCloud. I sometimes call us the Cloud Native team because we’re in charge of Kubernetes and Prometheus… but it’s an infrastructure team. We try to create an infrastructure where developers can build their own systems. On the other hand, we’re also developing things on our own.”
Björn’s talk then dives deeper into what SRE (“Site Reliability Engineering”) means – and then takes a step back. SRE was originally created internally at Google to solve the challenges of running Google’s production systems at Google scale. Björn asks: “how can that even work for you, as you are not Google? It won’t! Unless you know how to apply Google’s lessons to your possibly very different organization”. He adds that following SoundCloud’s five-year mission towards a reliable site should give you plenty of inspiration.
According to Björn, SoundCloud runs a complex microservice architecture to serve a great diversity of features to a large user base. All of this is done by a relatively small number of engineers, under constant pressure to innovate in the not exactly easy market of music streaming. While this might appear quite similar to the situation of many other startups, SoundCloud is a rather extreme example. As such, it makes a good case study for how to tackle this tech-debt-prone situation.
About six years ago, with the microservice migration in full swing, site reliability became more and more problematic at SoundCloud. At about the same time, SoundCloud happened to employ a handful of ex-Google SREs. Naively, one might have expected they would simply wave their magic G-wands and make the site reliable again.
However, simply copying Google-style SRE and applying it to an organization very different in scale and culture was doomed to fail.
In Berlin, Björn added that studying the exact reasons for the failure, and SoundCloud’s subsequent mission to find its own implementation of SRE, is a helpful exercise for many smaller organizations facing the similar challenge of sustainably running a diverse set of services.