Data science laboratory (DSLAB)

Presentation transcript:

Data science laboratory (DSLAB)
Julien Eberle, Sofiane Sarni
Swiss Data Science Center, EPFL & ETH Zurich
Data Science Lab – Spring 2018

Today’s lab (week #10): Real-time data acquisition and processing

This module
• Review concepts of streaming data acquisition and processing
• Experiment with stream processing tools for:
  ◦ Data acquisition and ingestion
  ◦ Real-time data processing

Why real-time data acquisition and processing?
Reminder from module 2 (Hadoop): batch vs. stream
• Do you want an updated answer as more information becomes available? Use streams.
  ◦ AKA: data in motion, or fast data
  ◦ Continuous computation that never stops and processes an unbounded amount of data on the fly
  ◦ Designed to keep the size of in-memory state bounded, regardless of how much data is processed
  ◦ Updates the answer as more data becomes available
  ◦ Operates on small time windows
• Can you wait until all information is available for a more accurate answer? Use batch.
  ◦ AKA: data at rest
  ◦ Operates on finite-size data sets and terminates when all data has been processed
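The contrast between the two modes can be sketched in a few lines of plain Python. This is only an illustration, not part of the lab material: the function names and the sample readings are made up, and no streaming framework is involved.

```python
from typing import Iterable, Iterator, Sequence


def batch_mean(values: Sequence[float]) -> float:
    """Batch: needs the complete, finite data set before it can answer."""
    return sum(values) / len(values)


def streaming_mean(values: Iterable[float]) -> Iterator[float]:
    """Stream: emit an updated answer after every record.

    Only two numbers (count and running sum) are kept in memory,
    so the state stays bounded however long the stream runs.
    """
    count = 0
    total = 0.0
    for v in values:
        count += 1
        total += v
        yield total / count


# Hypothetical sensor readings, used only for illustration.
readings = [20.0, 22.0, 21.0, 25.0]
print(batch_mean(readings))            # 22.0
print(list(streaming_mean(readings)))  # [20.0, 21.0, 21.0, 22.0]
```

The streaming version gives a provisional answer early and refines it; the batch version gives a single answer once all data has been read.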

Applications of stream processing
• Log analysis: statistics, DoS attacks, scaling services
• Sensor data: weather, manufacturing, transportation
• Real-time detection: fraud, intrusion
• Trend analysis: social media
• Physics / astrophysics
• …

Concepts
• Data: unordered, unbounded
• Event time vs. processing time
• Windowing
  ◦ Fixed
  ◦ Sliding
  ◦ Count-based vs. time-based
• Transformations
  ◦ Stateful / stateless

Concepts: Data is unordered and unbounded (figure; axis: time flow)

Concepts: Event time vs. processing time (figure: the delay between event time and processing time)
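A minimal sketch of the distinction, assuming each record carries both timestamps as integer seconds. The `Event` class and the sample values are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    sensor: str
    event_time: int       # second at which the measurement happened
    processing_time: int  # second at which the system received it

    @property
    def delay(self) -> int:
        """Delay between the event happening and the event being processed."""
        return self.processing_time - self.event_time


# Hypothetical records, listed in arrival (processing-time) order.
arrivals = [
    Event("a", event_time=10, processing_time=12),
    Event("b", event_time=8, processing_time=13),   # happened first, arrived later
    Event("a", event_time=11, processing_time=14),
]

print([e.delay for e in arrivals])  # [2, 5, 3]

# Arrival order and event order differ: the stream is unordered in event time.
in_event_order = sorted(arrivals, key=lambda e: e.event_time)
print([e.sensor for e in in_event_order])  # ['b', 'a', 'a']
```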

Concepts: Windowing, fixed (figure; axis: time flow)

Concepts: Windowing, sliding (figure; axis: time flow)
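As an illustration, a count-based sliding window can be sketched with a bounded `deque`: the window always holds the last `size` records and moves forward by one record at a time. The function name and the sample values are assumptions made for this example.

```python
from collections import deque
from typing import Iterable, Iterator


def sliding_average(values: Iterable[float], size: int) -> Iterator[float]:
    """Average over the last `size` records, updated on every new record."""
    window = deque(maxlen=size)  # the oldest record is dropped automatically
    for v in values:
        window.append(v)
        if len(window) == size:  # emit only once the window is full
            yield sum(window) / size


print(list(sliding_average([1, 2, 3, 4, 5], size=3)))  # [2.0, 3.0, 4.0]
```

Consecutive windows overlap here ([1, 2, 3], [2, 3, 4], [3, 4, 5]); with fixed windows each record would belong to exactly one window.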

Concepts: Windowing, time-based (figure; axis: time flow, with window boundaries at t = 10:00, t = 10:05 and t = 10:10)
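A possible sketch of fixed, time-based windows with the 5-minute boundaries used on the slide (10:00, 10:05, 10:10). Each event is assigned to the window containing its event time; windows are assumed to be half-open, so an event at exactly 10:05 falls into the 10:05 window. The date and the event times are invented for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def window_start(event_time: datetime, size: timedelta) -> datetime:
    """Start of the fixed window [start, start + size) containing event_time."""
    return event_time - (event_time - EPOCH) % size


def count_per_window(event_times, size: timedelta) -> dict:
    """Number of events per window, ordered by window start."""
    counts = Counter(window_start(t, size) for t in event_times)
    return dict(sorted(counts.items()))


# Hypothetical event times, deliberately out of order.
day = datetime(2018, 5, 1)
events = [
    day.replace(hour=10, minute=1),
    day.replace(hour=10, minute=7),
    day.replace(hour=10, minute=4),
    day.replace(hour=10, minute=5),
]

for start, n in count_per_window(events, timedelta(minutes=5)).items():
    print(start.strftime("%H:%M"), n)
# 10:00 2
# 10:05 2
```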

Concepts: Transformations (window-based)
• Aggregations: sums, averages, counts, maximum, …
• Filtering: by type, IDs, …
• …
• Stateful vs. stateless
  ◦ Stateful: needs to memorize records or partial results, e.g. computing the average value on a stream
  ◦ Stateless: relies only on information within the window, e.g. the latest temperature
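The difference can be sketched as follows. The record layout (dictionaries with `type` and `id` keys) and all names are assumptions made for this example, not something prescribed by the lab.

```python
from collections import Counter


def is_temperature(record: dict) -> bool:
    """Stateless filter: the decision depends only on the record itself."""
    return record["type"] == "temperature"


class CountPerSensor:
    """Stateful aggregation: partial results are remembered between records."""

    def __init__(self) -> None:
        self.counts = Counter()

    def update(self, record: dict) -> int:
        """Record one more event for this sensor and return its running count."""
        self.counts[record["id"]] += 1
        return self.counts[record["id"]]


# Hypothetical stream of sensor records.
stream = [
    {"id": "s1", "type": "temperature", "value": 21.5},
    {"id": "s2", "type": "humidity", "value": 40.0},
    {"id": "s1", "type": "temperature", "value": 22.0},
]

counter = CountPerSensor()
running = [counter.update(r) for r in stream if is_temperature(r)]
print(running)  # [1, 2]
```

If the process restarts, the filter loses nothing, while the counter loses its partial results unless they are saved somewhere. This is why stateful transformations are the ones that need care for fault tolerance.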

Ecosystem


Kafka
• Messaging system
• Real-time publish & subscribe
• Distributed and scalable (large data volumes)
• Low latency
• Fault tolerant
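To make the publish & subscribe idea concrete, here is a toy in-memory model. It is not the Kafka API and leaves out everything that makes Kafka distributed, scalable and fault tolerant; the `ToyBroker` class and all of its method names are invented for this sketch. It only shows the core idea that messages are appended to a per-topic log and each subscriber reads from its own position.

```python
from collections import defaultdict


class ToyBroker:
    """Toy publish/subscribe broker (hypothetical, single process, in memory)."""

    def __init__(self) -> None:
        self._logs = defaultdict(list)    # topic -> append-only list of messages
        self._offsets = defaultdict(int)  # (topic, subscriber) -> next index to read

    def publish(self, topic: str, message: str) -> None:
        """Append a message to the end of the topic's log."""
        self._logs[topic].append(message)

    def poll(self, topic: str, subscriber: str) -> list:
        """Return the messages this subscriber has not seen yet."""
        log = self._logs[topic]
        start = self._offsets[(topic, subscriber)]
        self._offsets[(topic, subscriber)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("sensors", "t=21.5")
broker.publish("sensors", "t=22.0")

first = broker.poll("sensors", "dashboard")   # both messages
other = broker.poll("sensors", "archiver")    # the same two: subscribers are independent
broker.publish("sensors", "t=22.4")
second = broker.poll("sensors", "dashboard")  # only the new message
print(first, other, second)
```

Publishers and subscribers never talk to each other directly, so either side can be added or removed without changing the other.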

Spark Streaming
• Stream processing
• Integrated with the Spark API
• High-level abstractions (DStreams)
• Fault tolerant
• Works through micro-batches
• Can read from multiple sources: Kafka, HDFS, Flume, ZeroMQ, Twitter
(Figure: batch vs. stream vs. micro-batch)
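The micro-batch idea can be sketched in plain Python, without Spark: the stream is cut into small batches by time interval, and an ordinary batch computation runs on each one. The function name, the record layout and the interval are assumptions for this example; unlike a real engine, this sketch simply skips intervals in which nothing arrived.

```python
from collections import Counter
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple


def micro_batches(
    records: Iterable[Tuple[int, str]], interval: int
) -> Iterator[Tuple[int, List[str]]]:
    """Group (timestamp_seconds, value) records into consecutive batches.

    Assumes records arrive with non-decreasing timestamps.
    Yields (batch_start_time, values) for every non-empty interval.
    """
    for batch_id, group in groupby(records, key=lambda r: r[0] // interval):
        yield batch_id * interval, [value for _, value in group]


# Hypothetical stream of (arrival second, word) records.
stream = [(0, "spark"), (1, "kafka"), (1, "spark"), (2, "flume"), (5, "spark")]

for start, words in micro_batches(stream, interval=2):
    # An ordinary batch computation (a word count) applied to each small batch.
    print(start, dict(Counter(words)))
# 0 {'spark': 2, 'kafka': 1}
# 2 {'flume': 1}
# 4 {'spark': 1}
```

A shorter interval lowers latency but leaves less data per batch to amortize the per-batch overhead.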