Introduction to Spark Streaming for Real-Time Data Analysis
Ellen Kraffmiller – Technical Lead
Robert Treacy – Senior Software Architect
The Institute for Quantitative Social Science at Harvard University
Agenda
Overview of Spark Streaming
Demo
Spark Components and Architecture
Streaming Technologies: Storm, Kafka, Flume, Kinesis
Spark Streaming
Structured Streaming
Demo
Introductions
Who we are
Who is the audience
Related Presentations
BOF5810 – Distinguish Pop Music from Heavy Metal with Apache Spark MLlib
CON3165 – Introduction to Machine Learning with Apache Spark MLlib
CON5189 – Getting Started with Spark
CON4219 – Analyzing Streaming Video with Apache Spark
BOF1337 – Big Data Processing with Apache Spark: Scala or Java?
CON1495 – Turning Relational Database Tables into Spark Data Sources
CON4998 – Java EE 7 with Apache Spark for the World's Largest Credit Card Core Systems
More Related Presentations
CON7867 – Data Pipeline for Hyperscale, Real-Time Applications Using Kafka and Cassandra
CON2234 – Interactive Data Analytics and Visualization with Collaborative Documents
CON7682 – Adventures in Big Data: Scaling Bottleneck Hunt
CON1191 – Kafka Streams + TensorFlow + H2O.ai – Highly Scalable Deep Learning
CON7368 – The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
Apache Spark
A fast and general engine for large-scale data processing
Latest version: 2.2
Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk
Resilient Distributed Dataset (RDD)
Directed Acyclic Graph (DAG) – lineage
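As a minimal illustration of the RDD API (a sketch, not code from the talk; the input path is a placeholder), here is a Java word count. Each transformation adds a node to the DAG lineage; nothing executes until an action such as collect() runs:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class RddWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // "input.txt" is a hypothetical path. Each transformation below only
        // extends the DAG; execution is deferred until the collect() action.
        JavaRDD<String> lines = sc.textFile("input.txt");
        JavaPairRDD<String, Integer> counts = lines
            .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
            .mapToPair(word -> new Tuple2<>(word, 1))
            .reduceByKey(Integer::sum);

        counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));
        sc.stop();
    }
}
```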
Spark Components
Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
Engine: Spark Core
Cluster managers: Standalone, YARN, Mesos

Spark Core – basic functions: task scheduling, memory management, fault recovery, storage system interaction
Spark SQL – work with structured data; query with SQL and Hive Query Language (HQL); supports various data sources: Hive tables, Parquet, JSON
Spark Streaming – processing live streams of data
MLlib – machine learning: clustering, classification, regression, collaborative filtering; model evaluation, data import; lower-level ML primitives
GraphX – manipulate graphs and perform graph-parallel computations; subgraph, mapVertices; PageRank and triangle counting algorithms
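To illustrate the Spark SQL component, a minimal Java sketch (the people.json file and its columns are assumptions): register a JSON source as a temporary view and query it with SQL:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("SqlExample")
            .master("local[*]")
            .getOrCreate();

        // "people.json" is a hypothetical input; Spark infers its schema.
        Dataset<Row> people = spark.read().json("people.json");
        people.createOrReplaceTempView("people");

        // Query the view with plain SQL.
        Dataset<Row> adults = spark.sql("SELECT name, age FROM people WHERE age >= 18");
        adults.show();
        spark.stop();
    }
}
```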
Streaming Technologies
Storm
Kafka
Flume
Kinesis
Apache Storm
Low latency, discrete events
Real-time processing – not a queue
Core Storm – events may be processed more than once
Trident – batch processing and exactly-once semantics
Apache Kafka
Scalable, fault-tolerant message queue
Exactly-once semantics
Pull-based consumers
Apache Flume
Message passing – push
Hadoop ecosystem
No event replication – can lose messages
Kinesis
AWS, pull-based
Simpler than Kafka, but also slower – partly because of replication
Spark Streaming
Stream processing and batch processing use the same API
Two types of streams: DStreams and Structured Streams (Datasets/DataFrames)
Kafka is only a message queue – no data processing
DStreams – the original API
Structured Streams – newer API allowing SQL-style queries with Datasets/DataFrames
DStreams (Discretized Streams)
A series of RDDs; the RDDs can hold any type of object
Basic sources – file systems, socket connections
Advanced sources – Kafka, Flume, Kinesis, ...
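A minimal DStream sketch in Java, using a socket connection as one of the basic sources above (host and port are placeholders); each batch interval produces one RDD in the stream:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import java.util.Arrays;

public class DStreamWordCount {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("DStreamWordCount").setMaster("local[2]");
        // Every 5-second batch becomes one RDD in the DStream.
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        JavaReceiverInputDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        JavaDStream<String> words = lines.flatMap(l -> Arrays.asList(l.split("\\s+")).iterator());

        // Count word occurrences within each batch and print a sample.
        words.countByValue().print();

        jssc.start();
        jssc.awaitTermination();
    }
}
```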
Structured Streaming
Built on the Spark SQL engine
Dataset/DataFrame API
Late events are aggregated correctly
Exactly-once semantics: the stream source must track progress (e.g., Kafka offsets) and the stream sink must be idempotent
Data is modeled as rows and columns, with an event timestamp
Programming model: a live data stream is a table that is continuously being updated
Word count example – the structured stream appends incoming data to the unbounded input table, and the result table is updated with the query results
Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
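In Java, the guide's word-count idea looks roughly like this (a minimal sketch assuming a socket source on localhost:9999; file and Kafka sources work the same way):

```java
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import java.util.Arrays;

public class StructuredWordCount {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("StructuredWordCount").master("local[*]").getOrCreate();

        // Each new line arriving on the socket is appended to the
        // "unbounded input table".
        Dataset<Row> lines = spark.readStream()
            .format("socket").option("host", "localhost").option("port", 9999).load();

        Dataset<String> words = lines.as(Encoders.STRING())
            .flatMap((FlatMapFunction<String, String>) l ->
                Arrays.asList(l.split("\\s+")).iterator(), Encoders.STRING());

        // The result table, recomputed incrementally as new rows arrive.
        Dataset<Row> counts = words.groupBy("value").count();

        StreamingQuery query = counts.writeStream()
            .outputMode("complete").format("console").start();
        query.awaitTermination();
    }
}
```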
Spark ML Pipelines
A Pipeline consists of a sequence of PipelineStages, which are either Transformers or Estimators
Transformer – transform() converts one DataFrame to another DataFrame
Estimator – fit() accepts a DataFrame and produces a Model; a Model is a Transformer
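A minimal pipeline sketch in Java in the spirit of the Naïve Bayes demo (not the demo's actual code; the column names "text" and "label" and the training DataFrame are hypothetical). Two Transformers feed a NaiveBayes Estimator, and fit() returns a PipelineModel, itself a Transformer:

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.NaiveBayes;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class PipelineSketch {
    public static PipelineModel train(Dataset<Row> training) {
        // Transformer: raw text -> array of words
        Tokenizer tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words");
        // Transformer: words -> term-frequency feature vectors
        HashingTF tf = new HashingTF().setInputCol("words").setOutputCol("features");
        // Estimator: fit() on the assembled pipeline produces a Model
        NaiveBayes nb = new NaiveBayes().setLabelCol("label").setFeaturesCol("features");

        Pipeline pipeline = new Pipeline()
            .setStages(new PipelineStage[]{tokenizer, tf, nb});
        return pipeline.fit(training);  // PipelineModel is a Transformer
    }
}
```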
Bayes Model
Very fast, simple
We will be using a Naïve Bayes model

P(c | x) = P(x | c) P(c) / P(x)

Predict the probability of outcome c given condition x by multiplying the probability that x was observed when the outcome was c, times the overall probability of outcome c, all divided by the overall probability of condition x
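To make the formula concrete, a worked example with purely illustrative numbers (not from the demo data): suppose 40% of training tweets are positive, the word "great" appears in 30% of positive tweets, and "great" appears in 15% of all tweets. Then

P(positive | "great") = P("great" | positive) × P(positive) / P("great") = (0.30 × 0.40) / 0.15 = 0.80

so a tweet containing "great" would be classified as positive with probability 0.8. The "naïve" part: with multiple words, their conditional probabilities are simply multiplied, treating the words as independent given the class.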
Demo: Analyze Twitter Streams with Spark ML and Spark Streaming
https://github.com/ekraffmiller/SparkStreamingDemo
Demo 1: Create and save a Naïve Bayes model with Spark batch processing, using training data
Demo 2: Sentiment analysis of a live Twitter stream using the Naïve Bayes model and Spark DStreams
Demo 3: Sentiment analysis for one hour of Twitter data, sent to a Kafka topic
Demo 4: Structured Streaming example – read the Twitter stream from Kafka and aggregate trending hashtags in 10-minute windows using the Twitter timestamp
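A sketch of Demo 4's general shape in Java (the real code is in the GitHub repo above; the topic name "tweets", the bootstrap server, and the columns are assumptions, and the Kafka source additionally requires the spark-sql-kafka package). For simplicity this sketch windows on the Kafka message timestamp rather than the Twitter timestamp:

```java
import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WindowedHashtags {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("WindowedHashtags").master("local[*]").getOrCreate();

        // Assumed Kafka topic "tweets"; the message value holds the tweet
        // text, and "timestamp" is the Kafka message timestamp.
        Dataset<Row> tweets = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "tweets")
            .load()
            .selectExpr("CAST(value AS STRING) AS text", "timestamp");

        // Split the text into words, keep hashtags, count per 10-minute window.
        Dataset<Row> counts = tweets
            .select(explode(split(col("text"), "\\s+")).as("word"), col("timestamp"))
            .filter(col("word").startsWith("#"))
            .groupBy(window(col("timestamp"), "10 minutes"), col("word"))
            .count();

        counts.writeStream().outputMode("complete").format("console").start()
              .awaitTermination();
    }
}
```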
Thank you! Questions?