Data science laboratory (DSLAB)

Slides:



Advertisements
Similar presentations
Spark Streaming Large-scale near-real-time stream processing
Advertisements

MapReduce Online Tyson Condie UC Berkeley Slides by Kaixiang MO
Spark Streaming Large-scale near-real-time stream processing
Big Data Open Source Software and Projects ABDS in Summary XIV: Level 14B I590 Data Science Curriculum August Geoffrey Fox
© 2014 Pivotal Introducing Spring XD Mark Pollack, Sr. Software Engineer, Pivotal.
Spark Streaming Large-scale near-real-time stream processing UC BERKELEY Tathagata Das (TD)
Programming Models for IoT and Streaming Data IC2E Internet of Things Panel Judy Qiu Indiana University.
Spark in the Hadoop Ecosystem Eric Baldeschwieler (a.k.a. Eric14)
Lecture 18-1 Lecture 17-1 Computer Science 425 Distributed Systems CS 425 / ECE 428 Fall 2013 Hilfi Alkaff November 5, 2013 Lecture 21 Stream Processing.
Discretized Streams: Fault-Tolerant Streaming Computation at Scale Wenting Wang 1.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
1 Distributed Online Simultaneous Fault Detection for Multiple Sensors Ram Rajagopal, Xuanlong Nguyen, Sinem Ergen, Pravin Varaiya EECS, University of.
Improving the Accuracy of Continuous Aggregates & Mining Queries Under Load Shedding Yan-Nei Law* and Carlo Zaniolo Computer Science Dept. UCLA * Bioinformatics.
Pulsar Realtime Analytics At Scale Tony Ng, Sharad Murthy June 11, 2015.
Stream Clustering CSE 902. Big Data Stream analysis Stream: Continuous flow of data Challenges ◦Volume: Not possible to store all the data ◦One-time.
Department of Information Engineering The Chinese University of Hong Kong A Framework for Monitoring and Measuring a Large-Scale Distributed System in.
Efficient Elastic Burst Detection in Data Streams Yunyue Zhu and Dennis Shasha Department of Computer Science Courant Institute of Mathematical Sciences.
Spark Streaming Large-scale near-real-time stream processing
History • Created by Nathan BackType • Open sourced on 19th September, 2011 Documentation at Contribution
Streaming Analytics with Spark 1 Magnoni Luca IT-CM-MM 09/02/16EBI - CERN meeting.
Big Data Infrastructure Week 12: Real-Time Data Analytics (2/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Apache Kafka A distributed publish-subscribe messaging system
Apache Beam: The Case for Unifying Streaming API's Andrew Psaltis HDF / IoT Product Solution June 13, 2016 HDF / IoT Product Solution.
Continuous Monitoring of Distributed Data Streams over a Time-based Sliding Window MADALGO – Center for Massive Data Algorithmics, a Center of the Danish.
AMSA TO 4 Advanced Technology for Sensor Clouds 09 May 2012 Anabas Inc. Indiana University.
Microsoft Ignite /28/2017 6:07 PM
Activiti in an Event- driven architecture Robin Bramley Chief Scientific Officer, Ixxus.
András Benczúr Head, “Big Data – Momentum” Research Group Big Data Analytics Institute for Computer.
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
MillWheel Fault-Tolerant Stream Processing at Internet Scale
Pilot Kafka Service Manuel Martín Márquez. Pilot Kafka Service Manuel Martín Márquez.
Connected Infrastructure
Big Data Infrastructure
PROTECT | OPTIMIZE | TRANSFORM
Fast, Interactive, Language-Integrated Cluster Computing
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Streaming Analytics & CEP Two sides of the same coin?
Introduction to Spark Streaming for Real Time data analysis
International Conference on Data Engineering (ICDE 2016)
Original Slides by Nathan Twitter Shyam Nutanix
Scaling Apache Flink® to very large State
Data Mining, Distributed Computing and Event Detection at BPA
Connected Infrastructure
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
Klara Nahrstedt Spring 2010
Apache Spark Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing Aditya Waghaye October 3, 2016 CS848 – University.
Northwestern Lab for Internet and Security Technology (LIST) Yan Chen Department of Computer Science Northwestern University.
Apache Flink and Stateful Stream Processing
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Reliability testing for Spark Streaming
Benchmarking Modern Distributed Stream Processing Systems
COS 518: Advanced Computer Systems Lecture 11 Michael Freedman
Mining Dynamics of Data Streams in Multi-Dimensional Space
Introduction to Spark.
Capital One Architecture Team and DataTorrent
COS 518: Advanced Computer Systems Lecture 11 Daniel Suo
CS110: Discussion about Spark
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Zaharia, et al (2012)
The Dataflow Model.
COS 418: Distributed Systems Lecture 19 Wyatt Lloyd
Lesson 13 - Cleaning Data Lesson 14 - Creating Summary Tables
Data Mining, Distributed Computing and Event Detection at BPA
Afonso Mukai Scientific Software Engineer
Data science laboratory (DSLAB)
Data science laboratory (DSLAB)
Data science laboratory (DSLAB)
Streaming data processing using Spark
COS 518: Advanced Computer Systems Lecture 12 Michael Freedman
DSLab The Data Science Lab Data Science Lab – Spring 2019.
Presentation transcript:

Data science laboratory (DSLAB) Julien Eberle, Sofiane Sarni Swiss Data Science Center EPFL & ETH Zurich Data Science Lab – Spring 2018

Today’s lab week #10 Real-time data acquisition and processing

This Module Review concepts of streaming data acquisition and processing Experiment with stream processing tools for Data acquisition and ingestion Real-time data processing

Why Real-Time data acquisition and processing? Reminder from module 2 (Hadoop) Batch vs Stream You want an updated answer as more information becomes available? streams AKA: Data in motion, or Fast data Continuous computation that never stop, process infinite amount of data on the fly Designed to keep size of in-memory state bounded, regardless of how much data is processed Update the answer as more data becomes available Operate on small time windows Can wait until all information is available for a more accurate answer? batch AKA: Data at rest Operates on finite size data sets, and terminate when all data has been processed

Applications of Stream Processing Log analysis Statistics, DoS attacks, scaling services Sensor data weather, manufacturing, transportation Real-time detections Fraud, intrusion Trend analysis Social media Physics / Astrophysics …

Concepts Data: unordered, unbound Event Time vs Processing Time Windowing Fixed Sliding Count-based vs time-based Transformations Stateful / Stateless

Concepts Data: unordered, unbound Time flow

Concepts Event Time Processing Time delay

Concepts Windowing (fixed) Time flow

Concepts Windowing (sliding) Time flow

Concepts Windowing (time-based) t = 10:00 t = 10:05 t = 10:10 Time flow t = 10:00 t = 10:05 t = 10:10

Concepts Transformations (window based) Aggregations Filtering … Sums, averages, counts, maximum, … Filtering By type, IDs, … …

Concepts Transformations (window based) Stateful vs Stateless Aggregations Sums, averages, counts, maximum, … Filtering By type, IDs, … … Stateful vs Stateless Stateful: need to memorize records or partial results e.g. compute average value on stream Stateless: rely only on information within the window e.g. latest temperature

Ecosystem

Ecosystem

Kafka Messaging system Real-time Publish & Subscribe Real-time Distributed and Scalable (large data volumes) Low latency Fault tolerant

Spark Streaming Stream processing Integrated with Spark API High level abstractions (DStreams) Fault tolerant Works through micro-batches Can read from multiple sources Kafka, HDFS, Flume, ZeroMQ, Twitter Batch Stream Micro-batch