Introduction to Spark Streaming for Real-Time Data Analysis


1 Introduction to Spark Streaming for Real-Time Data Analysis
Ellen Kraffmiller, Technical Lead
Robert Treacy, Senior Software Architect
The Institute for Quantitative Social Science at Harvard University

2 Agenda
Overview of Spark Streaming
Spark Components and Architecture
Streaming technologies: Storm, Kafka, Flume, Kinesis
Spark Streaming
Structured Streaming
Demo

3 Introductions Who we are Who is the audience

4 Related Presentations
BOF Distinguish Pop Music from Heavy Metal with Apache Spark MLlib
CON3165 – Introduction to Machine Learning with Apache Spark MLlib
CON5189 – Getting Started with Spark
CON Analyzing Streaming Video with Apache Spark
BOF Big Data Processing with Apache Spark: Scala or Java?
CON Turning Relational Database Tables into Spark Data Sources
CON4998 – Java EE 7 with Apache Spark for the World's Largest Credit Card Core Systems

5 More Related Presentations
CON7867 – Data Pipeline for a Hyperscale, Real-Time Applications Using Kafka and Cassandra
CON2234 – Interactive Data Analytics and Visualization with Collaborative Documents
CON7682 – Adventures in Big Data Scaling Bottleneck Hunt
CON1191 – Kafka Streams + Tensor Flow + H2o.ai – Highly Scalable Deep Learning
CON7368 – The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners

6 Apache Spark
A fast and general engine for large-scale data processing
Latest version: 2.2
100x faster than Hadoop MapReduce in memory, 10x faster on disk
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG) for lineage tracking

7 Spark Components
Spark SQL, Spark Streaming, MLlib, GraphX, Spark Core
Runs on: Standalone Cluster, YARN, Mesos
Spark Core – basic functions: task scheduling, memory management, fault recovery, storage system interaction
Spark SQL – work with structured data; query with SQL and Hive Query Language (HQL); supports various data sources: Hive tables, Parquet, JSON
Spark Streaming – processing live streams of data
MLlib – machine learning: clustering, classification, regression, collaborative filtering; model evaluation, data import; lower-level ML primitives
GraphX – manipulate graphs and perform graph-parallel computations; subgraph, mapVertices; PageRank and triangle counting algorithms

8 Streaming Technologies
Storm Kafka Flume Kinesis

9 Apache Storm
Low latency, discrete events
Real-time processing – not a queue
Core Storm: events may be processed more than once
Trident: batch processing and exactly-once semantics

10 Apache Kafka
Scalable, fault-tolerant message queue
Exactly-once semantics
Pull-based

11 Apache Flume
Push-based message passing
Part of the Hadoop ecosystem
No event replication – can lose messages

12 Kinesis
AWS managed, pull-based
Simpler than Kafka, but also slower – partly because of replication

13 Spark Streaming
Stream processing and batch processing use the same API
Two types of streams: DStreams and Structured Streams (Datasets/DataFrames)
Kafka is just a message queue – it does no data processing
DStreams: the original API
Structured Streaming: the newer API, allowing SQL-style queries over Datasets/DataFrames

14 DStreams (Discretized Streams)
A series of RDDs; an RDD can hold any type of object
Basic sources: file systems, socket connections
Advanced sources: Kafka, Flume, Kinesis, ...
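The micro-batch idea behind DStreams can be sketched in plain Python (this is a conceptual illustration, not the Spark API): a stream is treated as a series of small batches, each processed with the same logic you would apply to a static dataset.

```python
# Conceptual sketch of DStream micro-batching (plain Python, not Spark):
# each micro-batch is a small collection processed like a static dataset.
from collections import Counter

def process_microbatch(lines):
    """Word count over one micro-batch, standing in for an RDD transform."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

# A "stream" arriving as discrete micro-batches (e.g. one per second).
stream = [
    ["spark streaming", "spark core"],
    ["structured streaming"],
]

totals = Counter()
for batch in stream:
    totals.update(process_microbatch(batch))

print(totals["spark"])      # 2
print(totals["streaming"])  # 2
```

In real Spark Streaming the same shape appears: transformations are declared once and applied to each RDD in the DStream as it arrives.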

15 Structured Streaming
Built on the Spark SQL engine; Dataset/DataFrame API
Late events are aggregated correctly
Exactly-once semantics:
the stream source must track progress (e.g. Kafka offsets)
the stream sink must be idempotent
Rows and columns, with an event timestamp
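The event-time windowing that lets late events land in the correct aggregate can be sketched in plain Python (illustrative only; Structured Streaming does this with its window() function, and the numbers here are made up):

```python
# Plain-Python sketch of event-time windowing: each event is assigned
# to a window by its own timestamp, so a late arrival still updates
# the window it belongs to.
from collections import Counter

WINDOW_SECONDS = 600  # 10-minute windows

def window_start(epoch_seconds):
    """Align an event timestamp to the start of its 10-minute window."""
    return epoch_seconds - (epoch_seconds % WINDOW_SECONDS)

# (timestamp, hashtag) events; one arrives late but is timestamped early.
events = [
    (1000, "#spark"),
    (1550, "#kafka"),
    (700,  "#spark"),   # late arrival belonging to the 600-1199 window
    (1250, "#spark"),
]

counts = Counter((window_start(ts), tag) for ts, tag in events)
print(counts[(600, "#spark")])   # 2  (event at 1000 plus the late 700)
print(counts[(1200, "#kafka")])  # 1
```

Processing-time systems would have miscounted the late event; keying the aggregate by event time is what makes the correction possible.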

16 Programming Model (source: https://spark.apache.org)
The live data stream is treated as a table that is continuously appended to

17 Word count example: the stream arrives and is appended to the unbounded input table, and the result table is updated with the query results
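The unbounded-table model in the word count example above can be sketched in plain Python (a conceptual stand-in, not the Spark API; real Structured Streaming computes the result incrementally rather than rescanning the table):

```python
# Plain-Python sketch of the Structured Streaming programming model:
# new rows are appended to an unbounded input table, and the result
# table (here a word-count Counter) reflects a query over the whole table.
from collections import Counter

input_table = []  # the unbounded input table

def append_and_query(new_rows):
    """Append new rows, then return the updated word-count result table."""
    input_table.extend(new_rows)
    result = Counter()
    for row in input_table:
        result.update(row.split())
    return result

r1 = append_and_query(["cat dog"])
r2 = append_and_query(["dog owl"])
print(r2["dog"])  # 2 - the result table reflects both micro-batches
```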

18 Spark ML Pipelines
A Pipeline consists of a sequence of PipelineStages, which are either Transformers or Estimators
Transformer: transform() converts one DataFrame to another DataFrame
Estimator: fit() accepts a DataFrame and produces a Model
A Model is a Transformer
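The Transformer/Estimator relationship can be shown with a minimal plain-Python sketch (it mirrors the spark.ml design rather than using the pyspark classes; the MeanScaler example is invented for illustration):

```python
# Minimal sketch of the spark.ml Transformer/Estimator pattern.

class Transformer:
    def transform(self, rows):
        raise NotImplementedError

class Estimator:
    def fit(self, rows):
        raise NotImplementedError  # returns a Model, which is a Transformer

class MeanScalerModel(Transformer):
    """A fitted Model is itself a Transformer."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]

class MeanScaler(Estimator):
    """fit() learns the mean from the data, producing a MeanScalerModel."""
    def fit(self, rows):
        return MeanScalerModel(sum(rows) / len(rows))

model = MeanScaler().fit([1.0, 2.0, 3.0])  # Estimator -> Model
print(model.transform([4.0]))  # [2.0]     # Model acts as a Transformer
```

Because a fitted Model is a Transformer, a trained Pipeline can be applied to new data with a single transform() call.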

19 Naïve Bayes Model
Very fast, simple
P(c|x) = P(x|c) P(c) / P(x)
We will be using a Naïve Bayes model: predict the probability of outcome c given condition x by multiplying the probability of observing x when the outcome is c by the overall probability of outcome c, all divided by the overall probability of condition x.
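A small worked example of the formula (the probabilities below are invented for illustration, not taken from the demo data):

```python
# Bayes' rule: P(c|x) = P(x|c) * P(c) / P(x)
# c = "positive sentiment", x = tweet contains the word "great".
p_x_given_c = 0.20   # P("great" | positive): fraction of positive tweets with "great"
p_c = 0.50           # P(positive): overall fraction of positive tweets
p_x = 0.125          # P("great"): overall fraction of tweets with "great"

p_c_given_x = p_x_given_c * p_c / p_x
print(p_c_given_x)  # 0.8 - seeing "great" raises P(positive) from 0.5 to 0.8
```

The "naïve" part is that, with many words, each word's P(x|c) is multiplied independently, which keeps the model very fast.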

20 Demo: Analyze Twitter Streams with Spark ML and Spark Streaming
Demo 1: Create and save a Naïve Bayes model with Spark batch processing, using training data
Demo 2: Sentiment analysis of a live Twitter stream using the Naïve Bayes model and Spark DStreams
Demo 3: Sentiment analysis of one hour of Twitter data, sent to a Kafka topic
Demo 4: Structured Streaming example – read the Twitter stream from Kafka and aggregate trending hashtags in 10-minute windows using the Twitter timestamp

21 Thank you! Questions?


Download ppt "Introduction to Spark Streaming for Real Time data analysis"

Similar presentations


Ads by Google