Handling Streaming Data in Spotify Using the Cloud

Presentation transcript:

Handling Streaming Data in Spotify Using the Cloud
My name is Igor Maravić and I’m a software engineer at Spotify. For the past year I’ve been working in the team that maintains and develops Spotify’s event delivery system. In this talk my colleague Neville and I will describe how Spotify handles streaming data by leveraging Google Cloud. The talk is divided into two parts. In the first part I’ll talk about Spotify’s event delivery: where we are right now and where we’re going to be in the near future. In the second part Neville will present the data analysis tool that we’re building to simplify the lives of our data engineers. I’ll be back at the end to share what we have learned from moving Spotify’s event delivery into the cloud.
Igor Maravić <igor@spotify.com>, Software Engineer
Neville Li <neville@spotify.com>

Current Event Delivery System
Spotify has had phenomenal user growth in recent years. We recently announced that we now have 100 million monthly active users. With the user growth came data growth. Currently we’re delivering more than 90 billion events to our Hadoop cluster every day. All the events are delivered through our current event delivery system…

Current event delivery system
…which is based on Kafka 0.7. The events being delivered are generated directly on the clients in response to certain user actions. These actions can be playlist creation, playlist subscription or playing a song. All of these events are delivered to the Hadoop cluster. This system worked OK up to a certain scale. With the increased load on the system we started experiencing more outages. A big problem that we observed with this system was...
[Diagram labels: Client, Gateway, Syslog, Syslog Producer, Brokers, Syslog Consumers, Groupers, Realtime Brokers, ETL job, Hadoop, Service Discovery, Liveness Monitor, Checkpoint Monitor, ACK; spanning “Any data centre” and “Hadoop data centre”.]

Complex
...complexity. Most of the components of the system are tightly coupled. During outages it’s hard to pinpoint the actual problem, since a change in one component might have catastrophic effects on its downstream components. To add to this problem, most of the system components have limited test coverage and are operationally immature.

Stateless
The other problem is that our current event delivery system is stateless. State is kept only outside of the event delivery system. Events are persisted in syslog files before they are sent to the event delivery system. The next place where events are safely persisted is Hadoop, and that happens only after they have passed through the whole event delivery pipeline. If the Hadoop cluster is down, event delivery stops working.

Delivered data growth
When you mix those problems with the stellar data growth that Spotify has had, you get a recipe for disaster. On the graph you can clearly see the point in time when we realised that the event delivery system couldn’t keep up with the growing load. To keep running, we started filtering out less important events at that time.

Redesigning Event Delivery
Considering the design shortcomings of the current system, the fastest way to keep up with Spotify’s future growth, as we saw it, was a complete rebuild of the system.

Redesigning event delivery
The new event delivery system needs to serve the same function as the old system. That means that all the events from the clients need to be delivered through it to Hadoop.
[Diagram labels: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Reliable Persistent Queue, ETL, Hadoop; spanning “Any data centre” and “Hadoop data centre”.]

Same API
To have an easier migration path we need to keep the same API, on both the producing and the consuming side. The producer API is defined as text lines persisted to local files on service machines. Those files are tailed by the File Tailer, which sends them line by line to our event delivery service. The consumer API is defined as the location where the data is stored in Hadoop and the format in which it is stored. All the data delivered via our event delivery system is written in Avro format on HDFS. Delivered data is stored in an hourly partitioned directory structure, or buckets as we like to call them. The consumer and producer APIs of our event delivery service are a relic from the past. They are the legacy of the first event delivery system, which was based on the scp command and copied hourly syslog files from all the servers to Hadoop.

Dedicated event streams
In the new event delivery system we wanted to split the firehose stream into multiple event streams. We wanted to do this as close to the data producers as possible. Having independent event streams is important for us since it gives us full isolation between the consumers of different streams and visibility of the resources used for each stream.
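To make the idea of splitting the firehose concrete, here is a minimal sketch in plain Scala. It is not Spotify’s implementation: the `Event` shape, the `events-<type>` topic naming and the in-memory grouping are all assumptions made for illustration.

```scala
object EventStreams {
  // Hypothetical event shape: an event type plus the raw text line that was produced.
  final case class Event(eventType: String, line: String)

  // Assumed naming scheme: one dedicated stream (topic) per event type.
  def topicFor(event: Event): String = s"events-${event.eventType}"

  // Split a mixed firehose into independent per-type streams, keeping the order within each stream.
  def splitFirehose(firehose: Seq[Event]): Map[String, Seq[String]] =
    firehose.groupBy(topicFor).map { case (topic, events) => topic -> events.map(_.line) }
}
```

Because every event type ends up in its own stream, the consumers of one stream never have to read or skip the events of another, which is the isolation property described above.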

Persistence
We wanted our new event delivery service to be stateful. We wanted to have reliable persistent state, implemented as part of a reliable persistent queue, in between Hadoop and the service machines. In case of Hadoop problems, event delivery can continue working without any visible problems. This in turn means that we’re going to be able to collect data from all the service machines as fast as possible.

Keep it simple
Finally, we took great care to keep the system design as simple as possible. The simpler the system is, the easier it is to operate and to scale. Simple as that. Once we had the basics laid down, it was time to get our hands dirty.

Choosing Reliable Persistent Queue
First on the agenda was the reliable persistent queue.

Kafka 0.8
Not surprisingly, we started our adventure with Kafka 0.8. Kafka is a well-known technology which is used by many companies around the globe. It has a strong community built around it. It’s a first-class citizen in many other open source projects which we use at Spotify. Lastly, it has been designed to be a reliable persistent queue. This is a big improvement over Kafka 0.7, which we’re using in our current event delivery system.

Event delivery with Kafka 0.8
Kafka is well documented and putting the system in place was relatively simple. Nevertheless, this system had a flaw…
[Diagram labels: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Brokers, Mirror Makers, Brokers, Camus (ETL), Hadoop; spanning “Any data centre” and “Hadoop data centre”.]

Event delivery with Kafka 0.8
As soon as we started pushing production volumes through it, the system started breaking apart. We had issues with Mirror Makers, which would drop data if the destination brokers were unavailable. Mirror Makers would also from time to time get into a confused state and stop mirroring data. We had issues with the Kafka Producer library, which we used in the Event Delivery Service: it would enter an unrecoverable broken state if we restarted or removed one or more brokers from the cluster. (Some of these issues were addressed in newer versions.) Ah... At this point we were at a crossroads. Should we invest significantly in Kafka and make it work for us, or should we try something else? Since Spotify had started experimenting with Google Cloud products…

Cloud Pub/Sub
…Cloud Pub/Sub looked like a good alternative to Kafka. According to the docs, Cloud Pub/Sub retains undelivered data for 7 days per created subscription and it has at-least-once delivery semantics. It’s a globally available service, which means that events published in one region can be consumed directly from another region. This is important for us since we can skip flaky cross-Atlantic internet links when publishing events from the US and consuming them in Europe. Lastly, Cloud Pub/Sub is a fully managed service, which has the nice side effect that we don’t have operational responsibility for it. On the other hand, this is a double-edged sword if Cloud Pub/Sub doesn’t deliver what it promises. Since we still wanted to use it, we needed to take additional steps to verify that Cloud Pub/Sub was going to deliver what it promises. Considering Spotify’s data growth in past years, we were mostly interested in the scalability of Cloud Pub/Sub.

2M QPS published to Pub/Sub
Based on our load at the time, which was 700,000 events per second, we decided to stress test Cloud Pub/Sub by publishing and consuming 2M events per second. As you can see from this graph, stress testing Pub/Sub worked like a charm: we were able to seamlessly push and consume all the events we wanted. Based on this test, and a few other functional tests, we decided to proceed with Cloud Pub/Sub.
[Graph: events per second published to Pub/Sub, y-axis from 0/s to 2M/s.]

Event delivery with Cloud Pub/Sub
From the design perspective, Cloud Pub/Sub fitted right in. Putting it in place was simple, since the Event Delivery Service was already in place from our Kafka experiments. We just replaced the Kafka API calls with calls to the Cloud Pub/Sub API. We could say that Cloud Pub/Sub was a drop-in replacement for Kafka. Once we had put Cloud Pub/Sub to good use, we started to think about…
[Diagram labels: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Cloud Pub/Sub, ETL, Cloud Storage, Hadoop; spanning “Any data centre” and “Hadoop data centre”.]

ETL
… writing the ETL. For the ETL, we have strict requirements which are based on the way our current event delivery system works.

Event time based hourly buckets
Our ETL needs to write events to Cloud Storage and HDFS grouped into hourly buckets. Which bucket an event ends up in is determined by the event timestamp, which is assigned when the event is written to syslog. Data written to hourly buckets needs to be in Avro format.
[Diagram: hourly buckets 2016-03-21 23H, 2016-03-22 00H, 2016-03-22 01H, 2016-03-22 02H, 2016-03-22 03H, 2016-03-22 04H.]
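The mapping from an event timestamp to its hourly bucket can be sketched as below. The label format copies the labels on the slide; using UTC and the names `HourlyBuckets` and `bucketFor` are assumptions for illustration, not part of the system described in the talk.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object HourlyBuckets {
  // Label format taken from the slide (e.g. "2016-03-22 02H"); the UTC time zone is an assumption.
  private val bucketFormat =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC)

  // The bucket is chosen purely from the event time (when the event was written to syslog),
  // not from the time at which the ETL happens to process it.
  def bucketFor(eventTime: Instant): String = bucketFormat.format(eventTime)
}
```

For example, an event written to syslog at 2016-03-22T02:15:30Z belongs to bucket `2016-03-22 02H`, no matter when it reaches the ETL.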

Incremental bucket fill
As the data arrives we want to fill the buckets gradually. We want to do this to achieve lower latency when delivering data.

Bucket completeness
Before downstream jobs can consume the data from a bucket, the bucket needs to be marked as complete and all the data contained in it should be de-duplicated. To detect whether a bucket is complete, we’re using the notion of a watermark. Once the watermark passes the hour boundary we can mark the bucket as complete.
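The completeness rule can be written down as a small predicate. This is only a sketch of the rule as stated on the slide: how the watermark itself is computed is not covered here, and treating a watermark exactly on the hour boundary as “passed” is an assumption.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object BucketCompleteness {
  // A bucket covers [bucketStart, bucketStart + 1 hour).
  // Assumption: the watermark is a timestamp before which no further events are expected.
  def isComplete(bucketStart: Instant, watermark: Instant): Boolean = {
    val bucketEnd = bucketStart.plus(1, ChronoUnit.HOURS)
    !watermark.isBefore(bucketEnd) // complete once the watermark reaches or passes the boundary
  }
}
```

With a bucket starting at 23:00, a watermark of 23:59:59 leaves the bucket open, while a watermark of 00:00:00 on the next day closes it.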

Late data handling
After a bucket has been closed we can’t write late data to it. This is because most Spotify jobs are batch jobs that consume data from an hourly bucket only once, so any backfilled data could be considered lost. Since we can’t backfill closed buckets, we need another way of gracefully handling late data. In our case we want to write late data into the currently open bucket. That way we skew the data, as observed by the event timestamp, but we have no data loss.
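A minimal sketch of this routing rule follows. Buckets are identified by the instant at which their hour starts; the names `LateDataRouting` and `targetBucket`, and passing the set of closed buckets in as a parameter, are assumptions for illustration.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

object LateDataRouting {
  // Returns the start of the bucket the event should be written to.
  def targetBucket(eventTime: Instant, closedBuckets: Set[Instant], openBucket: Instant): Instant = {
    val naturalBucket = eventTime.truncatedTo(ChronoUnit.HOURS)
    // Late event: its own bucket is already closed, so it goes to the currently open bucket.
    // This skews the data by event time but avoids dropping the event.
    if (closedBuckets.contains(naturalBucket)) openBucket else naturalBucket
  }
}
```

An event stamped 23:40 that arrives after the 23H bucket was closed is written to whichever bucket is open at that moment, for example 01H.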

Experimentation with Dataflow
To implement the ETL job we experimented with Dataflow, since it looked like a perfect fit. The Dataflow SDK is a framework which provides a high-level API for writing data pipelines. It was designed around a unified model for stream and batch processing. To achieve the best performance we wanted to use Dataflow in streaming mode, since events are constantly flowing through our event delivery system. With Dataflow’s high-level API, and especially its windowing and watermark concepts, we easily implemented all the logic that we needed. We achieved promising results with the prototype in terms of latency, but… we had stability issues. We were hitting hard-to-debug issues which required help from Google engineers to track down. Even though the Google engineers were extremely helpful, this was slowing us down.

ETL as a set of micro-services
After rethinking our approach, we decided to implement our ETL as a set of micro-services. In the big picture, the ETL consumes all the published data from Cloud Pub/Sub and then writes de-duplicated data into the hourly buckets on Hadoop and Cloud Storage. To implement that we broke the ETL into three simple components: Consumer, Completionist and Deduper.
[Diagram labels: Cloud Pub/Sub, Consumer, Cloud Storage, Completionist, Deduper, Cloud Storage, Hadoop (in the Hadoop data centre).]

Consumer
Consumer is the service that consumes all events from Cloud Pub/Sub and writes them to files on the intermediate storage. Before an event is considered successfully consumed, the file to which it was written needs to be successfully communicated to the Completionist. Only after the Completionist acknowledges the written file do we consider the event to be successfully consumed from Pub/Sub. Our current plan is to have an independent Consumer cluster per event stream. That way we have full isolation between all event streams. Because of this we can scale clusters independently for different event types to achieve optimal quality of service for each of them.
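The ordering described above (write the file, register it with the Completionist, and only then acknowledge the messages) can be sketched as follows. Every name here (`Message`, `Completionist.registerFile`, `consumeBatch`, the `writeFile` callback) is hypothetical; the real services and the Pub/Sub client API are not shown.

```scala
object ConsumerSketch {
  // Hypothetical message shape: an acknowledgement id plus the event payload.
  final case class Message(ackId: String, payload: String)

  // Hypothetical view of the Completionist: returns true once it has recorded the file.
  trait Completionist {
    def registerFile(path: String): Boolean
  }

  // Writes a batch to intermediate storage and returns the ack ids that are safe to acknowledge.
  // If the Completionist does not confirm the file, nothing is acknowledged, so the queue
  // redelivers the messages later (at-least-once delivery).
  def consumeBatch(
      batch: Seq[Message],
      writeFile: Seq[String] => String, // assumed to return the path of the written file
      completionist: Completionist): Seq[String] = {
    val path = writeFile(batch.map(_.payload))
    if (completionist.registerFile(path)) batch.map(_.ackId) else Seq.empty
  }
}
```

A consequence of this ordering is that a failure between writing the file and registering it leads to redelivery, and therefore to possible duplicates, which is why a de-duplication step is needed later in the pipeline.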

Completionist
Completionist is the ETL’s control plane. It answers queries from the Deduper about which files on the intermediate storage belong to a completed hour. To be able to answer this question it tracks all files written by the Consumer and the time when they were written. To calculate whether a specific hour is complete, it uses simple time-based logic which is executed at query time. This is the simplest logic we could think of and we still need to verify that it’s good enough. The current implementation is backed by Cloud Datastore, which is a highly replicated NoSQL data store with support for transactions. We plan to have a single Completionist cluster for all event types, since the queries are cheap from the service’s point of view.

Deduper
Deduper is the last component of our ETL. It’s the component which takes the intermediate data, de-duplicates it and writes it to hourly buckets in its final destinations. It’s implemented as an Apache Crunch job, which is executed on a Dataproc cluster. This job is a batch job which runs on the full hourly window of data. To know whether an hour is complete and which intermediate files belong to the completed hour, it talks to the Completionist.
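The de-duplication step itself can be illustrated in plain Scala. This is an in-memory sketch, not the Apache Crunch job from the talk, and it assumes that every event carries a unique id that can serve as the de-duplication key; the talk does not say how duplicates are identified.

```scala
import scala.collection.mutable

object DeduperSketch {
  // Hypothetical event shape: the id field is an assumed de-duplication key.
  final case class Event(id: String, payload: String)

  // Keeps the first occurrence of each id and preserves the original order.
  def dedupe(events: Seq[Event]): List[Event] = {
    val firstSeen = mutable.LinkedHashMap.empty[String, Event]
    events.foreach(event => firstSeen.getOrElseUpdate(event.id, event))
    firstSeen.values.toList
  }
}
```

Running this over the full hourly window, as the batch job does, means duplicates are caught regardless of how far apart within the hour the redelivered copies arrived.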

Where are we right now?
The ETL that we’re building is still not in production. So far we have the skeleton up and running. The latency of exposing processed hours of events is worse than with the Dataflow streaming job that we built as a proof of concept. This is to be expected, since in this case we have not only the latency of consuming the data, but also the added latency of scheduling the batch job, submitting it to Dataproc and then processing the full hourly window of data. In the Dataflow streaming job we were able to process data as it was coming in. Higher latency is still acceptable for us, since we’re trading it for easier management of the system. With the ETL structured like this, we’re able to leverage all of Spotify’s existing infrastructure tools. This allows us to use a well-defined CI/CD process and to have good monitoring and alerting for all of the components.

Scio
A Scala API for Google Cloud Dataflow

Origin Story
Scalding and Spark; ML, recommendations, analytics; 50+ users, 400+ unique jobs; growing rapidly.

Moving to Google Cloud
Early 2015 - Dataflow Scala hack project

Why not Scalding on GCE: Pros
Community (Twitter, eBay, Etsy, Stripe, LinkedIn, …); stable and proven.

Why not Scalding on GCE: Cons
Hadoop cluster operations; multi-tenancy (resource contention and utilization); no streaming mode (Summingbird?).

Why not Spark on GCE: Pros
Batch, streaming, interactive and SQL; MLlib, GraphX; Scala, Python, and R support; Zeppelin, spark-notebook, Hue.

Why not Spark on GCE: Cons
Hard to tune and scale; cluster lifecycle management.

Why Dataflow with Scala
Hosted solution, no operations; ecosystem (GCS, BigQuery, Pub/Sub, Bigtable, …); unified batch and streaming model.

Why Dataflow with Scala
High-level DSL, easy transition for developers; reusable and composable code via FP; numerical libraries: Breeze, Algebird.

Scio Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] Verb: I can, know, understand, have knowledge.

Core API similar to spark-core; some ideas from Scalding.
github.com/spotify/scio

WordCount
Almost identical to the Spark version:
    val sc = ScioContext()
    sc.textFile("shakespeare.txt")
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue()
      .saveAsTextFile("wordcount.txt")

PageRank
    def pageRank(in: SCollection[(String, String)]) = {
      val links = in.groupByKey()
      var ranks = links.mapValues(_ => 1.0)
      // Ten iterations of rank propagation
      for (i <- 1 to 10) {
        val contribs = links.join(ranks).values
          .flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map((_, rank / size))
          }
        // Damping factor of 0.85
        ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
      }
      ranks
    }

Spotify Running
60 million tracks; 30m users * 10 tempo buckets * 25 tracks; audio: tempo, energy, time signature ...; metadata: genres, categories, …; latent vectors from collaborative filtering.

Personalized new releases
Pre-computed weekly on Hadoop (on-premise cluster); 100GB recommendations from HDFS to Bigtable in US+EU; 250GB Bloom filters from Bigtable to HDFS; 200 LOC.

User conversion analysis
For marketing and campaigning strategies; track user transitions through products; aggregated for simulation and projection; 150GB BigQuery in and out.

Demo Time!

What’s next?
Migrating internal teams; BigQuery SQL-2011 dialect; Apache Beam migration; better streaming support; PRs and issues welcome!

Learnings
Migrating event delivery from the current system to the new one, while maintaining the current system, is a huge project, as you might imagine. We have come a long way, but we still have a lot to cover. The ETL is currently the only thing blocking us from fully migrating to the new event delivery system. Nevertheless, even though the new reliable event delivery is still not fully in place, having events published to Cloud Pub/Sub is already paying off, since it has enabled us to move much faster on real-time use cases. So far we have used it, with great success, to replace Kafka in a few of our Storm topologies which do real-time ad targeting.

Blog posts @ labs.spotify.com
Spotify’s Event Delivery - The Road To The Cloud: Part I, Part II, Part III
If you’re interested in reading more about our event delivery, I would like to point you to the blog posts on labs.spotify.com.

Thank you!
Thanks for listening!
Igor Maravić <igor@spotify.com>
Neville Li <neville@spotify.com>