Streaming data processing using Spark


Streaming data processing using Spark: Spark Streaming

Agenda: What is Spark Streaming? Why Spark Streaming? How Spark Streaming works. A Twitter example. Spark Streaming alternatives. Q & A. Demo.

Motivation Many important applications must process large streams of live data and provide results in near real time: social network trends, website statistics, intrusion detection systems, etc. These workloads require large clusters and latencies of a few seconds.

What is Spark Streaming? A framework for large-scale stream processing. It scales to hundreds of nodes, can achieve second-scale latencies, integrates with Spark's batch and interactive processing, and provides a simple batch-like API. It can absorb live data streams from Kafka, Flume, ZeroMQ, etc.

What is Spark Streaming? Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets.
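As a concrete illustration of one of these sources, below is a minimal sketch of Kafka ingestion using the classic receiver-based `KafkaUtils.createStream` API. The ZooKeeper address, consumer group, and topic name are placeholders, not values from the slides:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Placeholder names throughout: "zk-host:2181", "demo-group", "tweets".
// createStream returns a DStream of (key, message) pairs.
val kafkaStream = KafkaUtils.createStream(
  ssc,                 // an existing StreamingContext
  "zk-host:2181",      // ZooKeeper quorum (placeholder)
  "demo-group",        // consumer group id (placeholder)
  Map("tweets" -> 1)   // topic -> number of receiver threads
)
val messages = kafkaStream.map(_._2)  // keep the message values only
```

This sketch requires the spark-streaming-kafka artifact and a running Kafka broker, so it is illustrative rather than standalone-runnable.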

How Spark Streaming works Spark Streaming receives live input data streams and divides the data into batches. Each batch is then processed by the Spark engine to generate the final stream of results.
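To make "divides the data into batches" concrete, here is a minimal sketch of a StreamingContext with a 1-second batch interval. The app name, master, host, and port are placeholder assumptions:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Batch interval of 1 second: incoming records are grouped into a new
// micro-batch (an RDD) every second, then processed by the Spark engine.
val conf = new SparkConf().setMaster("local[2]").setAppName("BatchDemo")
val ssc = new StreamingContext(conf, Seconds(1))

// Placeholder source: lines arriving on a TCP socket.
val lines = ssc.socketTextStream("localhost", 9999)
lines.count().print()   // per-batch record count

ssc.start()
ssc.awaitTermination()
```

This sketch needs a Spark installation and a process writing to the socket, so it is shown for shape rather than as a runnable snippet.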

Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs
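The idea can be illustrated with plain Scala (a toy simulation, not the Spark API): treat the stream as a sequence of small batches and run the same deterministic word count on each one.

```scala
// Toy simulation of discretized streams: each element of `batches`
// stands for the data received in one batch interval.
val batches: Seq[Seq[String]] = Seq(
  Seq("spark streaming", "spark"),  // batch @ t
  Seq("streaming systems"),         // batch @ t+1
  Seq("spark")                      // batch @ t+2
)

// The streaming computation is just the same small, deterministic
// batch job applied to every interval's data.
val perBatchCounts: Seq[Map[String, Int]] = batches.map { batch =>
  batch.flatMap(_.split(" ")).groupBy(identity).map { case (w, ws) => (w, ws.size) }
}
perBatchCounts.foreach(println)
```

Because each batch job is deterministic, a lost batch result can simply be recomputed, which is the basis of the fault-tolerance story.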

Example: Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

saveAsHadoopFiles is an output operation: it pushes data to external storage. [Diagram: each batch interval (t, t+1, t+2) yields an RDD in the tweets DStream; flatMap produces the corresponding RDD of the hashTags DStream, which is then saved.] Every batch can be saved to HDFS, written to a database, used to update an analytics UI, or whatever is appropriate.

Example (Scala and Java)

Scala:
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

Java:
JavaDStream<Status> tweets = ssc.twitterStream();
JavaDStream<String> hashTags = tweets.flatMap(new Function<...> { ... });

Combinations of Batch and Streaming Computations Combine a batch RDD with a streaming DStream. Example: join incoming tweets with a spam file on HDFS to filter out bad tweets:

tweets.transform(tweetsRDD => {
  tweetsRDD.join(spamHDFSFile).filter(...)
})
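A slightly fuller sketch of the same idea, keying tweets by user and joining each micro-batch against a blacklist RDD. The `getUser` helper is hypothetical (like `getTags` in the earlier example), and the HDFS path is elided as in the slides:

```scala
// Assumed: `ssc` is the StreamingContext and `tweets` is the DStream above.
// Load the spam list once as a batch (handle, flag) pair RDD.
val spamRDD = ssc.sparkContext
  .textFile("hdfs://...")        // path elided, as in the slides
  .map(handle => (handle, true))

val cleanTweets = tweets
  .map(status => (getUser(status), status))       // getUser: hypothetical helper
  .transform(rdd => rdd.leftOuterJoin(spamRDD))   // join each batch with the blacklist
  .filter { case (_, (_, spam)) => spam.isEmpty } // keep only non-blacklisted users
  .map { case (_, (status, _)) => status }
```

leftOuterJoin keeps every tweet and marks blacklisted users with a defined flag, so filtering on `spam.isEmpty` retains exactly the clean tweets.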

Comparison with similar systems Spark Streaming achieves higher per-node throughput than Storm. Spark Streaming: 670k records/second/node. Storm: 115k records/second/node. Apache S4: 7.5k records/second/node.

Thank You…