1
Handling Streaming Data in Spotify Using the Cloud
My name is Igor Maravić and I'm a software engineer at Spotify. For the past year I've been working in the team that maintains and develops Spotify's event delivery system. In this talk my colleague Neville and I will talk about how Spotify handles streaming data by leveraging Google Cloud. The talk is divided into two parts. In the first part I'll talk about Spotify's event delivery: where we are right now and where we're going to be in the near future. In the second part Neville will present the data analysis tool that we're building to simplify the life of our data engineers. I'll be back at the end of the talk to share the lessons we've learned from moving Spotify's event delivery into the cloud. Igor Maravić, Software Engineer, and Neville Li.
2
Current Event Delivery System
Spotify has had phenomenal user growth over the past years. We recently announced that we now have 100 million monthly active users. With user growth came data growth: currently we deliver more than 90 billion events to our Hadoop cluster every day. All of these events are delivered through our current event delivery system…
3
Current event delivery system
[Architecture diagram: clients in any data centre and in the Hadoop data centre; gateways, syslog producers and consumers, ACK, brokers, groupers, realtime brokers, checkpoint monitor, liveness monitor, service discovery, ETL job, Hadoop]
…which is based on Kafka 0.7. The events being delivered are generated directly on the clients in response to user actions, such as creating a playlist, subscribing to a playlist or playing a song. All of these events are delivered to the Hadoop cluster. This system worked fine up to a certain scale, but with the increased load we started experiencing more outages. A big problem that we observed with this system was...
4
Complex
...complexity. Most of the components of the system are tightly coupled. During outages it's hard to pinpoint the actual problem, since changes to one component can have catastrophic effects on its downstream components. To add to this, most of the system's components have limited test coverage and are operationally immature.
5
Stateless
The other problem is that our current event delivery system is stateless: state is kept only outside of it. Events are persisted in syslog files before they are sent into the event delivery system, and the next place where they are safely persisted is Hadoop, only after they have passed through the whole pipeline. If the Hadoop cluster is down, event delivery stops working.
6
Delivered data growth
When you mix those problems with the stellar data growth Spotify has had, you get a recipe for disaster. On the graph you can clearly see the point in time when we realised that the event delivery system couldn't keep up with the growing load. To keep running, we started filtering out less important events at that point.
7
Redesigning Event Delivery
Given the design shortcomings of the current system, the fastest way forward to keep up with Spotify's future growth, as we saw it, was a complete rebuild of the system.
8
Redesigning event delivery
[Diagram: clients → syslog → File Tailer → Event Delivery Service → Reliable Persistent Queue → ETL → Hadoop, spanning any data centre and the Hadoop data centre]
The new event delivery system needs to keep the same function as the old one: all events from the clients need to be delivered through it to Hadoop.
9
Same API
To have an easier migration path we need to keep the same API on both the producing and the consuming side. The producer API is defined as text lines persisted to local files on service machines; those files are tailed by the File Tailer, which sends them line by line to our event delivery service. The consumer API is defined by the location where the data is stored in Hadoop and the format in which it is stored: all data delivered via our event delivery system is written in Avro format on HDFS, in an hourly partitioned directory structure, or buckets as we like to call them. The consumer and producer APIs of our event delivery service are a relic from the past, the legacy of the first event delivery system, which was based on the scp command and copied hourly syslog files from all the servers to Hadoop.
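To make the producer API concrete, here is a minimal Scala sketch of a service appending an event as a text line to a local syslog-style file that the File Tailer would then pick up. The file path, field layout and helper name are hypothetical, not Spotify's actual producer code.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}

// Hypothetical producer-side helper: append one event as a tab-separated
// text line to a local log file that the File Tailer follows.
object EventLineWriter {
  private val logFile = Paths.get("/var/log/events/playlist_created.log") // hypothetical path

  def emit(timestampMs: Long, eventType: String, userId: String): Unit = {
    val line = s"$timestampMs\t$eventType\t$userId\n"
    Files.write(logFile, line.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }
}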
10
Dedicated event streams
In the new event delivery system we wanted to split the firehose stream into multiple event streams, and to do this as close to the data producers as possible. Having independent event streams is important for us, since it gives us full isolation between the consumers of different streams and visibility into the resources used for each stream.
11
Persistence
We wanted our new event delivery service to be stateful, with reliable persistent state, implemented as part of a reliable persistent queue, sitting between the service machines and Hadoop. In case of Hadoop problems, event delivery can keep working without any visible issues, which in turn means that we can collect data from all the service machines as fast as possible.
12
Keep it simple
Finally, we took great care to keep the system design as simple as possible. The simpler the system, the easier it is to operate and scale. Simple as that. Once we had the basics laid down, it was time to get our hands dirty.
13
Choosing Reliable Persistent Queue
First on the agenda was the reliable persistent queue.
14
Kafka 0.8 Not surprisingly, we started our adventure with Kafka 0.8.
Kafka is a well known technology which has been used by many companies around the globe. It has a strong community built around it, and it's a first class citizen in many other open source projects that we use at Spotify. Lastly, it has been designed to be a reliable persistent queue, which is a big improvement over Kafka 0.7, which we're using in our current event delivery system.
15
Event delivery with Kafka 0.8
[Diagram: clients → gateways → syslog → File Tailer → Event Delivery Service → brokers → Mirror Makers → brokers in the Hadoop data centre → Camus (ETL) → Hadoop]
Kafka is well documented and putting the system in place was relatively simple. Nevertheless this system had a flaw…
16
Event delivery with Kafka 0.8
As soon as we started pushing production volumes through it, the system started breaking apart. We had issues with the Mirror Makers, which would drop data if the destination brokers were unavailable; they would also, from time to time, get into a confused state and stop mirroring data. We had issues with the Kafka producer library, which we used in the Event Delivery Service: it would enter an unrecoverable broken state if we restarted or removed one or more brokers from the cluster. (Some of these issues were addressed in newer Kafka versions.) At this point we were at a crossroads: should we invest significantly in Kafka and make it work for us, or should we try something else? Since Spotify had started experimenting with Google Cloud products…
17
Cloud Pub/Sub
…Cloud Pub/Sub looked like a good alternative to Kafka.
According to the docs, Cloud Pub/Sub retains undelivered data for 7 days per subscription and has at-least-once delivery semantics. It's a globally available service, which means that events published in one region can be consumed directly from another region. This is important for us, since we can skip flaky transatlantic internet links when publishing events from the US and consuming them in Europe. Lastly, Cloud Pub/Sub is a fully managed service, which has the nice side effect that we don't carry the operational responsibility for it. On the other hand, this is a double-edged sword if Cloud Pub/Sub doesn't deliver what it promises. Since we still wanted to use it, we needed to take additional steps to verify that it would deliver what it promises. Considering Spotify's data growth over the past years, we were mostly interested in the scalability of Cloud Pub/Sub.
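As an illustration of the publishing side, here is a minimal Scala sketch using the google-cloud-pubsub Java client; the project and topic names are made up, and this uses the client's current public API rather than the exact code or client version used in our Event Delivery Service.

import com.google.cloud.pubsub.v1.Publisher
import com.google.protobuf.ByteString
import com.google.pubsub.v1.{PubsubMessage, TopicName}

// Publish one event payload to a Cloud Pub/Sub topic.
object PublishSketch {
  def main(args: Array[String]): Unit = {
    val topic = TopicName.of("example-project", "client-events") // hypothetical names
    val publisher = Publisher.newBuilder(topic).build()
    try {
      val message = PubsubMessage.newBuilder()
        .setData(ByteString.copyFromUtf8("""{"event":"song_played","user":"123"}"""))
        .build()
      publisher.publish(message) // returns an ApiFuture[String] carrying the message id
    } finally {
      publisher.shutdown()
    }
  }
}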
18
2M QPS published to Pub/Sub
Based on our load at the time, which was 700,000 events per second, we decided to stress test Cloud Pub/Sub by publishing and consuming 2M events per second. As you can see from this graph, the stress test worked like a charm: we were able to seamlessly publish and consume all the events we wanted. Based on this test, and a few other functional tests, we decided to proceed with Cloud Pub/Sub.
[Graph: published message rate over time, y-axis from 0/s to 2M/s]
19
Event delivery with Cloud Pub/Sub
[Diagram: clients → syslog → File Tailer → Event Delivery Service → Cloud Pub/Sub → ETL → Cloud Storage and Hadoop]
From the design perspective, Cloud Pub/Sub fitted right in. Putting it in place was simple, since the Event Delivery Service was already in place from our Kafka experiments; we just swapped the Kafka API calls for calls to the Cloud Pub/Sub API. You could say that Cloud Pub/Sub was a drop-in replacement for Kafka. Once we had put Cloud Pub/Sub to good use, we started to think about…
20
ETL
… writing the ETL. For the ETL we have strict requirements, based on the way our current event delivery system works.
21
Event time based hourly buckets
Our ETL needs to write events to Cloud Storage and HDFS grouped into hourly buckets. Which bucket an event ends up in is determined by its event timestamp, which is assigned when the event is written to syslog. Data written to the hourly buckets needs to be in Avro format.
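As a small sketch of that bucket assignment, the snippet below derives an hourly bucket path from an event's syslog timestamp; the directory layout and event type are illustrative, not our exact HDFS layout.

import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Map an event timestamp (millis since epoch, taken when the event hit syslog)
// to an hourly bucket directory, e.g. .../2016-03-22/01 for 01:xx UTC.
object HourlyBucket {
  private val hourFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd/HH").withZone(ZoneOffset.UTC)

  def pathFor(eventType: String, eventTimeMs: Long): String =
    s"hdfs:///events/$eventType/${hourFormat.format(Instant.ofEpochMilli(eventTimeMs))}" // illustrative layout
}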
22
Incremental bucket fill
As the data arrives, we want to fill the buckets gradually, so that we achieve lower latency of data delivery.
23
Bucket completeness (example buckets: 2016-03-21 23H, 2016-03-22 00H, 2016-03-22 01H)
Before downstream jobs can consume the data from a bucket, the bucket needs to be marked as complete and all the data contained in it needs to be de-duplicated. To detect whether a bucket is complete or not, we use the notion of a watermark: once the watermark passes the hour boundary, we can mark the bucket as complete.
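A minimal sketch of that check, assuming we already have a watermark value (the oldest event time that can still arrive): a bucket is complete once the watermark has passed the end of its hour.

import java.time.Instant
import java.time.temporal.ChronoUnit

// A bucket covering [hourStart, hourStart + 1h) is complete once the watermark
// has moved past the end of that hour.
object BucketCompleteness {
  def isComplete(hourStart: Instant, watermark: Instant): Boolean =
    !watermark.isBefore(hourStart.plus(1, ChronoUnit.HOURS))
}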
24
Late data handling
After a bucket has been closed we can't write late data to it. This is because most Spotify jobs are batch jobs that consume data from an hourly bucket only once, so any backfilled data would effectively be lost. Since we can't backfill closed buckets, we need another way of gracefully handling late data. In our case we write late data into the currently open bucket, as shown in the sketch below. That way the data is skewed with respect to its event timestamp, but nothing is lost.
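The routing itself can be sketched like this: an event whose own hour is already closed goes into the earliest bucket that is still open, so it is skewed in time but never dropped. The bucket representation here is hypothetical.

import java.time.Instant
import java.time.temporal.ChronoUnit

// Pick the bucket for an event: its own hour if that bucket is still open,
// otherwise the earliest hour that has not been closed yet.
object LateDataRouting {
  def bucketFor(eventTime: Instant, earliestOpenHour: Instant): Instant = {
    val eventHour = eventTime.truncatedTo(ChronoUnit.HOURS)
    if (eventHour.isBefore(earliestOpenHour)) earliestOpenHour else eventHour
  }
}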
25
Experimentation with Dataflow
To implement the ETL job we experimented with Dataflow, since it looked like a perfect fit. The Dataflow SDK is a framework which provides a high level API for writing data pipelines, designed around a unified streaming and batch model. To achieve the best performance we wanted to use Dataflow in streaming mode, since events are constantly flowing through our event delivery system. With Dataflow's high level API, and especially its windowing and watermark concepts, we easily implemented all the logic that we needed. We achieved promising results with the prototype in terms of latency, but… we had stability issues. We were hitting hard-to-debug problems which required help from Google engineers to track down, and even though the Google engineers were extremely helpful, this was slowing us down.
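The shape of that prototype can be sketched with Scio, which Neville introduces in the second half of this talk: read from Pub/Sub, assign events to hourly event-time windows, and let the watermark close each window. The subscription name is made up and this is a sketch of the idea, not the actual job; in particular, the real prototype wrote hourly Avro buckets rather than debug output.

import com.spotify.scio._
import org.joda.time.Duration

// Sketch: hourly event-time windows over a Pub/Sub stream; the watermark
// decides when each hourly window is final.
object HourlyWindowPrototype {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, _) = ContextAndArgs(cmdlineArgs)

    sc.pubsubSubscription[String]("projects/example/subscriptions/events") // hypothetical subscription
      .withFixedWindows(Duration.standardHours(1)) // hourly buckets keyed by event time
      .countByValue()                              // a per-window aggregation as a stand-in
      .debug()                                     // the real job would write windowed Avro files

    sc.run() // sc.close() in older Scio versions
  }
}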
26
ETL as a set of micro-services
[Diagram: Cloud Pub/Sub → Consumer → Cloud Storage (intermediate) → Deduper → Cloud Storage and Hadoop, with the Completionist as the control plane]
After rethinking our approach, we decided to implement our ETL as a set of micro-services. In the big picture, the ETL consumes all the published data from Cloud Pub/Sub and then writes de-duplicated data into the hourly buckets on Hadoop and Cloud Storage. To implement that, we broke the ETL into three simple components: the Consumer, the Completionist and the Deduper.
27
Consumer
The Consumer is the service that consumes all events from Cloud Pub/Sub and writes them to files on the intermediate storage. Before an event is considered successfully consumed, the file it was written to needs to be successfully reported to the Completionist; only after the Completionist acknowledges the written file do we consider the event successfully consumed from Pub/Sub. Our current plan is to have an independent Consumer cluster per event stream. That way we have full isolation between all event streams, and we can scale the clusters independently for different event types to achieve the optimal quality of service for each of them.
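The important part is the ordering: write the file, register it with the Completionist, and only then acknowledge Pub/Sub. Below is a Scala sketch with hypothetical interfaces standing in for the real services and client libraries.

import scala.util.{Success, Try}

// Hypothetical interfaces, not the real service APIs.
trait IntermediateStorage { def write(events: Seq[String]): Try[String] }              // returns the file path
trait CompletionistClient { def register(file: String, writtenAtMs: Long): Try[Unit] }
trait PulledBatch         { def events: Seq[String]; def ack(): Unit; def nack(): Unit }

// Acknowledge the Pub/Sub batch only after the file is durably written and the
// Completionist has recorded it; otherwise let Pub/Sub redeliver the batch.
class ConsumerSketch(storage: IntermediateStorage, completionist: CompletionistClient) {
  def handle(batch: PulledBatch): Unit = {
    val outcome = for {
      file <- storage.write(batch.events)
      _    <- completionist.register(file, System.currentTimeMillis())
    } yield ()
    outcome match {
      case Success(_) => batch.ack()
      case _          => batch.nack() // redelivery may create duplicates; the Deduper removes them later
    }
  }
}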
28
Completionist
The Completionist is the ETL's control plane. It answers queries from the Deduper about which files on the intermediate storage belong to a completed hour. To be able to answer this, it tracks all the files written by the Consumer and the times at which they were written. To decide whether a specific hour is complete or not, it uses simple time based logic which is executed at query time. This is the simplest logic we could think of, and we still need to verify that it's good enough. The current implementation is backed by Cloud Datastore, a highly replicated NoSQL data store with support for transactions. We plan to have a single Completionist cluster for all event types, since the queries are cheap from the service's point of view.
29
Deduper
The Deduper is the last component of our ETL. It takes the intermediate data, de-duplicates it and writes it to the hourly buckets in their final destinations. It's implemented as an Apache Crunch job executed on a Dataproc cluster. The job is a batch job which runs over a full hourly window of data; to know whether an hour is complete and which intermediate files belong to it, it talks to the Completionist.
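The real job is Apache Crunch on Dataproc, but the core de-duplication step can be sketched in Scio terms; the Event case class and its id field are hypothetical stand-ins for however events are actually keyed.

import com.spotify.scio.values.SCollection

// Hypothetical event record; real events carry a unique identifier.
case class Event(id: String, payload: String)

object DeduperSketch {
  // Keep a single occurrence per event id within the completed hour's data.
  def dedupe(hourOfEvents: SCollection[Event]): SCollection[Event] =
    hourOfEvents
      .keyBy(_.id)
      .reduceByKey((a, _) => a) // duplicates carry the same payload, so any one of them will do
      .values
}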
30
Where are we right now?
The ETL we're building is still not in production; so far we have the skeleton up and running. The achieved latency of exposing processed hours of events is worse than with the Dataflow streaming job that we built as a proof of concept. This is to be expected, since we now have not only the latency of consuming the data, but also the added latency of scheduling the batch job, submitting it to Dataproc and then processing the full hourly window of data. With the Dataflow streaming job, we were able to process data as it was coming in. The higher latency is still acceptable for us, since we're trading it for easier management of the system. With the ETL structured like this, we're able to leverage all of Spotify's existing infrastructure tools, which allows us to use a well defined CI/CD process and to have good monitoring and alerting for all of the components.
31
Scio
A Scala API for Google Cloud Dataflow
32
Origin Story
Scalding and Spark
ML, recommendations, analytics
50+ users, 400+ unique jobs
Growing rapidly
33
Moving to Google Cloud
Early Dataflow Scala hack project
34
Why not Scalding on GCE: Pros
Community: Twitter, eBay, Etsy, Stripe, LinkedIn, …
Stable and proven
35
Why not Scalding on GCE: Cons
Hadoop cluster operations
Multi-tenancy resource contention and utilization
No streaming mode (Summingbird?)
36
Why not Spark on GCE: Pros
Batch, streaming, interactive and SQL
MLlib, GraphX
Scala, Python, and R support
Zeppelin, spark-notebook, Hue
37
Why not Spark on GCE: Cons
Hard to tune and scale
Cluster lifecycle management
38
Why Dataflow with Scala
Hosted solution, no operations
Ecosystem: GCS, BigQuery, Pub/Sub, Bigtable, …
Unified batch and streaming model
39
Why Dataflow with Scala
High level DSL, easy transition for developers
Reusable and composable code via FP
Numerical libraries: Breeze, Algebird
41
Scio Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.
42
Core API similar to spark-core
Some ideas from Scalding
github.com/spotify/scio
43
WordCount (almost identical to the Spark version)

val sc = ScioContext()
sc.textFile("shakespeare.txt")
  .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
  .countByValue()
  .saveAsTextFile("wordcount.txt")
44
PageRank

def pageRank(in: SCollection[(String, String)]) = {
  val links = in.groupByKey()
  var ranks = links.mapValues(_ => 1.0)
  for (i <- 1 to 10) {
    val contribs = links.join(ranks).values
      .flatMap { case (urls, rank) =>
        val size = urls.size
        urls.map((_, rank / size))
      }
    ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
  }
  ranks
}
45
Spotify Running
60 million tracks
30M users * 10 tempo buckets * 25 tracks
Audio: tempo, energy, time signature, …
Metadata: genres, categories, …
Latent vectors from collaborative filtering
50
Personalized new releases
Pre-computed weekly on Hadoop (on-premise cluster)
100GB recommendations from HDFS to Bigtable in US+EU
250GB Bloom filters from Bigtable to HDFS
200 LOC
51
User conversion analysis
For marketing and campaign strategies
Track user transitions through products
Aggregated for simulation and projection
150GB BigQuery in and out
52
Demo Time!
53
What's next?
Migrating internal teams
BigQuery SQL-2011 dialect
Apache Beam migration
Better streaming support
PRs and issues welcome!
54
Learnings
Migrating event delivery from the current system to the new one, while maintaining the current system, is a huge project, as you might imagine. We've come a long way, but we still have a lot to cover. The ETL is currently the only thing blocking us from fully migrating to the new event delivery system. Nevertheless, even though the new reliable event delivery is not yet fully in place, having events published to Cloud Pub/Sub is already paying off, since it has enabled us to move much faster on real-time use cases. So far we have used it to replace Kafka in a few of our Storm topologies, which do real-time ad targeting, with great success.
55
Blog posts @ labs.spotify.com
Spotify's Event Delivery - The Road To The Cloud, Part I, Part II, Part III. If you're interested in reading more about our event delivery, I'd like to point you to the blog posts on labs.spotify.com.
56
Thank you! Igor Maravić <igor@spotify.com>
Thanks for listening! Igor Maravić Neville Li