DSLab The Data Science Lab Data Science Lab – Spring 2019.

Stream Processing. Eric; Tao, Marc, Ramtin & Olivier; Sofiane. Spring 2019 - week #11

Stream Processing Module
Objectives: review concepts of stream processing; experiment with typical tools for data ingestion and processing.
Week 10: concepts; experiments.
Week 11: advanced topics; operations on streaming data (joins); time constraints; homework.

Reminder: Concepts
Event time vs processing time.
Windowing: fixed; sliding; count-based vs time-based.
Transformations: stateful / stateless operations.
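
The difference between fixed and sliding time-based windows can be sketched in plain Python. This is only an illustration: the helper names `fixed_window` and `sliding_windows` are hypothetical, timestamps are plain integers, and window starts are assumed to be aligned to multiples of the window size (fixed) or of the slide interval (sliding).

```python
def fixed_window(t, size):
    """Return the single [start, end) fixed window that contains timestamp t."""
    start = (t // size) * size
    return (start, start + size)


def sliding_windows(t, size, slide):
    """Return every [start, end) sliding window that contains timestamp t.

    Window starts are assumed to fall on multiples of `slide`.
    """
    windows = []
    start = (t // slide) * slide      # latest window start that is <= t
    while start > t - size:           # the window still covers t
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


# An event at t=12 falls in one fixed window of size 10,
# but in two sliding windows of size 10 that slide every 5.
print(fixed_window(12, 10))
print(sliding_windows(12, 10, 5))
```

With a fixed window each event belongs to exactly one window; with a sliding window whose slide is shorter than its size, an event is counted in several overlapping windows.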

Processing Time vs Event Time: an event is processed with some delay after it occurs, and events are not necessarily processed in the order in which they occurred. Event-time order: 1, 2, 3, 4, 5, 6, … Processing-time order: 2, 3, 1, 6, 5, 4, …

Processing Time vs Event Time: events can arrive out of order, with a variable time skew between event time and processing time. Watermark: sets a boundary on lateness; data arriving later than the watermark is dropped (this ensures some completeness). Credits: Tyler Akidau (et al.), Streaming Systems, O’Reilly Media, 2018.
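
A minimal sketch of the watermark idea, in plain Python. It is a simplification under stated assumptions: the function name `split_by_watermark` is hypothetical, event times are integers, and the watermark is recomputed after every event as (largest event time seen so far) minus (allowed lateness). Real engines such as Spark typically update the watermark per micro-batch rather than per event.

```python
def split_by_watermark(event_times, allowed_lateness):
    """Process event times in arrival order; return (accepted, dropped).

    An event is dropped when its event time is older than the current
    watermark, i.e. max event time seen so far minus allowed_lateness.
    """
    accepted, dropped = [], []
    max_seen = None
    for t in event_times:
        watermark = None if max_seen is None else max_seen - allowed_lateness
        if watermark is not None and t < watermark:
            dropped.append(t)            # too late: beyond the watermark
        else:
            accepted.append(t)
            max_seen = t if max_seen is None else max(max_seen, t)
    return accepted, dropped


# Arrival order taken from the earlier slide: 2, 3, 1, 6, 5, 4
arrival_order = [2, 3, 1, 6, 5, 4]
print(split_by_watermark(arrival_order, allowed_lateness=1))
print(split_by_watermark(arrival_order, allowed_lateness=2))
```

A larger allowed lateness keeps more late events at the cost of holding state longer; a smaller one bounds state and latency but drops more data.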

In Spark Streaming: count on a sliding window 10 minutes long with a 5-minute sliding interval.
    from pyspark.sql.functions import window
    # Count events per 10-minute window, sliding every 5 minutes
    windowedCountsDF = (
        eventsDF
        .groupBy(window("eventTime", "10 minutes", "5 minutes"))
        .count()
    )

In Spark Streaming: count on a sliding window 10 minutes long with a 5-minute sliding interval and a 10-minute watermark (allowed lateness).
    from pyspark.sql.functions import window
    # Events more than 10 minutes late (in event time) are dropped
    windowedCountsDF = (
        eventsDF
        .withWatermark("eventTime", "10 minutes")
        .groupBy(window("eventTime", "10 minutes", "5 minutes"))
        .count()
    )

Transformations: How it works
Spark Streaming receives live input data streams and divides the data into (micro) batches, which are then processed by the Spark engine to generate the final stream of results in batches.
DStream: a continuous stream of data; created from inputs (e.g. Kafka) or derived from other DStreams; supports transformations like RDDs (map, count, join, etc.).
Image credits: https://spark.apache.org/docs/latest/streaming-programming-guide.html
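
The micro-batch model can be illustrated without Spark. The sketch below is plain Python, not the DStream API: the function `micro_batches` and the sample records are hypothetical, arrival times are integers, and intervals in which nothing arrives are simply skipped.

```python
from collections import defaultdict


def micro_batches(records, batch_interval):
    """Group (arrival_time, value) records into batches by arrival time.

    Records whose arrival time falls in the same interval of length
    batch_interval end up in the same batch. Empty intervals are skipped.
    """
    batches = defaultdict(list)
    for arrival_time, value in records:
        batches[arrival_time // batch_interval].append(value)
    return [batches[k] for k in sorted(batches)]


stream = [(0, "a"), (1, "b"), (2, "a"), (5, "c"), (6, "a")]
batches = micro_batches(stream, batch_interval=2)

# Each batch is then processed like a small static dataset,
# here with a map (upper-casing) followed by a count.
mapped = [[v.upper() for v in batch] for batch in batches]
counts = [len(batch) for batch in batches]
print(mapped)
print(counts)
```

The stream of per-batch results (`counts` here) is itself a stream, which is why transformations can be chained from one DStream to the next.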

Transformations: Joins
On two streams over a common key. Example: ad monetization (joining ad impressions with ad clicks).
Source: Databricks Engineering Blog, by Tathagata Das and Joseph Torres, https://bit.ly/2I58Ve7

Transformations: Joins
    from pyspark.sql.functions import expr
    # Define watermarks
    impressionsWithWatermark = (
        impressions
        .selectExpr("adId AS impressionAdId", "impressionTime")
        .withWatermark("impressionTime", "10 seconds")  # watermark: max 10 seconds late
    )
    clicksWithWatermark = (
        clicks
        .selectExpr("adId AS clickAdId", "clickTime")
        .withWatermark("clickTime", "20 seconds")  # watermark: max 20 seconds late
    )
    # Inner join with time range conditions
    impressionsWithWatermark.join(
        clicksWithWatermark,
        expr("""
            clickAdId = impressionAdId AND
            clickTime >= impressionTime AND
            clickTime <= impressionTime + interval 1 minutes
        """)  # time range: click at most 1 minute after the impression
    )
Source: Databricks Engineering Blog, by Tathagata Das and Joseph Torres, https://bit.ly/2I58Ve7
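
The join condition itself (same key, click within a bounded time after the impression) can be sketched in plain Python on small in-memory lists. This is a batch illustration of the predicate only, not how Spark executes a stream-stream join: the function `time_range_join` and the sample data are hypothetical, times are integer seconds, and no watermark or state cleanup is modelled.

```python
def time_range_join(impressions, clicks, max_delay):
    """Inner-join (ad_id, time) records on ad_id with a time-range condition.

    A click matches an impression of the same ad when
    impression_time <= click_time <= impression_time + max_delay.
    """
    impressions_by_ad = {}
    for ad_id, imp_time in impressions:
        impressions_by_ad.setdefault(ad_id, []).append(imp_time)

    joined = []
    for ad_id, click_time in clicks:
        for imp_time in impressions_by_ad.get(ad_id, []):
            if imp_time <= click_time <= imp_time + max_delay:
                joined.append((ad_id, imp_time, click_time))
    return joined


impressions = [("ad1", 0), ("ad2", 10), ("ad1", 100)]
clicks = [("ad1", 30), ("ad2", 90), ("ad1", 130), ("ad3", 5)]

# Keep clicks that happen at most 60 seconds after the impression.
result = time_range_join(impressions, clicks, max_delay=60)
print(result)
```

In a streaming setting both inputs are unbounded, so the engine has to buffer past impressions and clicks; the watermarks together with the time-range condition are what allow it to decide when buffered rows can no longer match and can be discarded.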

Homework: GitLab project URL: https://git-dslab.epfl.ch/dslab2019/homework4