Streaming Data Processing Using Spark: Spark Streaming
Agenda
- What is Spark Streaming?
- Why Spark Streaming?
- How Spark Streaming works
- A Twitter example
- Spark Streaming alternatives
- Q & A
- Demo
Motivation
- Many important applications must process large streams of live data and provide results in near-real-time: social network trends, website statistics, intrusion detection systems, etc.
- These workloads require large clusters and latencies of a few seconds.
What is Spark Streaming?
- A framework for large-scale stream processing
- Scales to hundreds of nodes
- Can achieve second-scale latencies
- Integrates with Spark's batch and interactive processing
- Provides a simple batch-like API
- Can absorb live data streams from Kafka, Flume, ZeroMQ, etc.
What is Spark Streaming?
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput processing of live data streams. Data can be ingested from many sources such as Kafka, Flume, Twitter, ZeroMQ, Kinesis, or TCP sockets.
How Spark Streaming works
Spark Streaming receives live input data streams and divides the data into batches. Each batch is processed by the Spark engine to generate the final stream of results.
Discretized Stream Processing Run a streaming computation as a series of very small, deterministic batch jobs
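As a rough illustration of this model, the following minimal Scala sketch chops a TCP text stream into 1-second batches and runs a word count on each one (the local master URL, host, port, and word-count logic are illustrative choices, not part of the original example):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // A 1-second batch interval: live input is chopped into 1-second
    // batches, and each batch runs as a small deterministic Spark job.
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))

    // DStream of text lines from a TCP socket (host/port illustrative)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Each transformation is applied independently to every batch
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()             // begin receiving and processing
    ssc.awaitTermination()  // run until stopped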
Example – Get hashtags from Twitter

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

saveAsHadoopFiles is an output operation: it pushes data to external storage.

[Diagram: the tweets DStream is a sequence of batches (batch @ t, batch @ t+1, batch @ t+2); flatMap maps each batch to the corresponding batch of the hashTags DStream, which is then saved.]

Every batch can be saved to HDFS, written to a database, used to update an analytics UI, or whatever is appropriate.
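For reference, a self-contained sketch of this pipeline against the later API, where the Twitter source moved into the spark-streaming-twitter module as TwitterUtils.createStream (it reads Twitter4J OAuth credentials from system properties when no authorization is passed); the hashtag extraction and HDFS path here are illustrative:

    import org.apache.spark.streaming.twitter.TwitterUtils

    // Twitter source from the spark-streaming-twitter module; passing
    // None makes it pick up Twitter4J credentials from system properties
    val tweets = TwitterUtils.createStream(ssc, None)

    // Extract hashtag texts from each status via the Twitter4J API
    val hashTags = tweets.flatMap(status => status.getHashtagEntities.map(_.getText))

    // Output operation: write each batch under an illustrative HDFS prefix
    hashTags.saveAsTextFiles("hdfs://namenode:8020/hashtags/batch")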
Example (Scala and Java)

Scala:

    val tweets = ssc.twitterStream()
    val hashTags = tweets.flatMap(status => getTags(status))
    hashTags.saveAsHadoopFiles("hdfs://...")

Java:

    JavaDStream<Status> tweets = ssc.twitterStream();
    JavaDStream<String> hashTags = tweets.flatMap(
        new FlatMapFunction<Status, String>() {
          public Iterable<String> call(Status status) {
            return getTags(status);
          }
        });
Combinations of Batch and Streaming Computations

Combine a batch RDD with a streaming DStream. Example: join incoming tweets against a file of known spam accounts in HDFS to filter out bad tweets:

    tweets.transform(tweetsRDD => {
      tweetsRDD.join(spamHDFSFile).filter(...)
    })
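A fuller, hedged sketch of the same pattern, continuing with the tweets DStream from the earlier example (the spam-user file path, the choice of screen name as join key, and the filter predicate are illustrative stand-ins):

    // Batch RDD of known spammer screen names (path illustrative),
    // loaded once and keyed for joining
    val spamUsers = ssc.sparkContext
      .textFile("hdfs://namenode:8020/data/spam-users.txt")
      .map(user => (user, true))

    // transform() exposes the RDD behind each batch, so ordinary
    // RDD operations like join can be mixed into the stream
    val cleanTweets = tweets.transform { tweetsRDD =>
      tweetsRDD
        .map(status => (status.getUser.getScreenName, status)) // key by author
        .leftOuterJoin(spamUsers)                              // tag known spammers
        .filter { case (_, (_, spam)) => spam.isEmpty }        // drop matches
        .map { case (_, (status, _)) => status }               // back to statuses
    }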
Similar components

Spark Streaming delivers higher per-node throughput than comparable systems:
- Spark Streaming: 670k records/second/node
- Storm: 115k records/second/node
- Apache S4: 7.5k records/second/node
Thank You…