Scalable Realtime Analytics with declarative, SQL like, Complex Event Processing Scripts Srinath Perera Director, Research WSO2 Apache Member

(Batch) Analytics
- Scientists have been doing this for 25 years, using MPI (1991) on special hardware.
- It took off with Google's MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem followed.
- It was successful, so we are here!
- But processing takes time.

The Value of Some Insights Degrades Fast!
- For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time.
  - E.g. stock markets and the speed of light
- We need technology that can produce outputs fast:
  - Static queries that need very fast output (alerts, realtime control)
  - Dynamic and interactive queries (data exploration)

History
- Realtime analytics is not new either!
  - Active databases (2000+)
  - Stream processing (Aurora, Borealis (2005+), and later Storm)
  - Distributed streaming operators (e.g. a database research topic around 2005)
  - CEP vendor roadmap (from tooling-market-survey-2014/)

Realtime Analytics Tools

I. Stream Processing
- Program a set of processors and wire them up; data flows through the graph.
- A middleware framework handles data flow, distribution, and fault tolerance (e.g. Apache Storm, Samza).
- Processors may run on the same machine or on multiple machines.

II. Complex Event Processing

III. Micro Batch
- Process data in small batches, then combine the partial results into the final result (e.g. Spark).
- Works for simple aggregates, but is tricky for complex operations (e.g. event sequences).
- Can be done with MapReduce as well if the deadlines are not too tight.
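A minimal sketch of the micro-batch idea, assuming a plain Python list stands in for the incoming stream and `batch_size` is a hypothetical tuning knob. It is not Spark's API. Each batch produces a partial (sum, count), and the partials are merged into the final average.

```python
from typing import Iterable, Iterator, List, Tuple


def micro_batches(events: Iterable[float], batch_size: int) -> Iterator[List[float]]:
    """Cut the stream into consecutive batches of at most batch_size events."""
    batch: List[float] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the last, possibly shorter, batch
        yield batch


def partial_aggregate(batch: List[float]) -> Tuple[float, int]:
    """Per-batch partial result: (sum, count)."""
    return sum(batch), len(batch)


def running_average(events: Iterable[float], batch_size: int) -> float:
    """Merge the per-batch partials into one final average."""
    total, count = 0.0, 0
    for batch in micro_batches(events, batch_size):
        batch_sum, batch_count = partial_aggregate(batch)
        total += batch_sum
        count += batch_count
    return total / count if count else 0.0


temps = [20.0, 22.0, 24.0, 26.0, 28.0]
print(running_average(temps, batch_size=2))  # 24.0
```

An average merges cleanly because sums and counts add up across batches. An event sequence that straddles a batch boundary has no such simple merge step, which is why such operations are tricky in this model.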

IV. OLAP Style In Memory Computing
- Usually done to support interactive queries.
- Index the data to make it readily accessible, so you can respond to queries fast (e.g. Apache Drill).
- Tools like Druid, VoltDB, and SAP HANA can do this with all data in memory to make things really fast.

Realtime Analytics Patterns
- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)
- Preprocessing: filtering, transformations (e.g. data cleanup)
- Alerts, thresholds (e.g. alarm on high temperature)
- Data correlation, detecting missing events, detecting erroneous data (e.g. detecting failed sensors)
- Joining event streams (e.g. detect a hit on a soccer ball)
- Merge with data in a database, collect, update data conditionally

Realtime Analytics Patterns (contd.)
- Detecting event sequence patterns (e.g. a small transaction followed by a large transaction)
- Tracking: follow some related entity's state in space, time, etc. (e.g. location of airline baggage or a vehicle, tracking wildlife)
- Detecting trends: rise, turn, fall, outliers, and complex trends like a triple bottom (e.g. algorithmic trading, SLA, load balancing)
- Learning a model (e.g. predictive maintenance)
- Predicting the next value and taking corrective actions (e.g. automated car)

Apache Hive
- A SQL-like data processing language.
- Since many people understand SQL, Hive made large-scale (Big Data) processing accessible to many.
- Expressive, short, and sweet.
- Defines core operations that cover 90% of problems.
- Lets experts dig in when they like!

(Batch Processing, Hive) (Realtime Analytics, X) What is X?

CEP = SQL for Realtime Analytics
- Easy to follow from SQL.
- Expressive, short, and sweet.
- Defines core operations that cover 90% of problems.
- Lets experts dig in when they like!
Let's look at the core operations.

Operators: Filters
- Assume a temperature stream.
- Here weather:convertFtoC() is a user-defined function; such functions are used to extend the language.
    define stream TempStream (ts long, roomNo int, temp double);

    from TempStream[weather:convertFtoC(temp) > 30.0 and roomNo != 2043]
    select roomNo, temp
    insert into HotRoomsStream;
- Use cases:
  - Alerts, thresholds (e.g. alarm on high temperature)
  - Preprocessing: filtering, transformations (e.g. data cleanup)
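To make the filter semantics concrete, here is a rough Python equivalent of the query on this slide. It is only an illustration of what the operator computes, not how a CEP engine runs it; `convert_f_to_c` is a stand-in for the user-defined function, and events are assumed to be dictionaries carrying `roomNo` and `temp` (in Fahrenheit).

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def convert_f_to_c(fahrenheit: float) -> float:
    """Stand-in for the user-defined function weather:convertFtoC()."""
    return (fahrenheit - 32.0) * 5.0 / 9.0


def hot_rooms(stream: Iterable[Event]) -> Iterator[Event]:
    """Keep events hotter than 30 C, skip room 2043, project roomNo and temp."""
    for event in stream:
        if convert_f_to_c(event["temp"]) > 30.0 and event["roomNo"] != 2043:
            yield {"roomNo": event["roomNo"], "temp": event["temp"]}


events = [
    {"ts": 1, "roomNo": 101, "temp": 95.0},    # 35 C: kept
    {"ts": 2, "roomNo": 102, "temp": 68.0},    # 20 C: dropped
    {"ts": 3, "roomNo": 2043, "temp": 104.0},  # 40 C, but the excluded room
]
print(list(hot_rooms(events)))
```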

Operators: Windows and Aggregation
- Supports many window types:
  - Batch windows, sliding windows, custom windows
- Use cases:
  - Simple counting (e.g. failure count)
  - Counting with windows (e.g. failure count every hour)
    from TempStream#window.time(1 min)
    select roomNo, avg(temp) as avgTemp
    group by roomNo
    insert into HotRoomsStream;
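A small sketch of what a sliding time window with an average does, assuming timestamps are in milliseconds and the average is kept per room. The class name and the choice to expire old events only when a new event arrives for the same room are simplifications for illustration; a real engine would also expire events on a timer.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class SlidingTimeAverage:
    """Per-room average of temp over the last window_ms of event time."""

    def __init__(self, window_ms: int) -> None:
        self.window_ms = window_ms
        self.events: Dict[int, Deque[Tuple[int, float]]] = defaultdict(deque)
        self.sums: Dict[int, float] = defaultdict(float)

    def add(self, ts: int, room_no: int, temp: float) -> float:
        """Add one event and return the room's current windowed average."""
        window = self.events[room_no]
        window.append((ts, temp))
        self.sums[room_no] += temp
        # Expire events that have slid out of the window.
        while window and window[0][0] <= ts - self.window_ms:
            _, old_temp = window.popleft()
            self.sums[room_no] -= old_temp
        return self.sums[room_no] / len(window)


avg = SlidingTimeAverage(window_ms=60_000)
print(avg.add(0, 1, 20.0))       # 20.0
print(avg.add(30_000, 1, 30.0))  # 25.0
print(avg.add(70_000, 1, 40.0))  # 35.0 (the event at ts=0 has expired)
```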

Operators: Patterns
- Models a followed-by relation, e.g. event A followed by event B.
- A very powerful tool for tracking and detecting patterns.
    from every (a1 = TempStream) -> a2 = TempStream[temp > a1.temp + 5]
        within 1 day
    select a2.ts as ts, a2.temp - a1.temp as diff
    insert into HotDayAlertStream;
- Use cases:
  - Detecting event sequence patterns
  - Tracking
  - Detecting trends
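The followed-by pattern can be approximated in a few lines of Python. This is a sketch under assumptions: events are (ts, temp) tuples in timestamp order, timestamps are in milliseconds, and every event opens a new partial match that completes on the first later event more than 5 degrees hotter. The exact matching rules of a real CEP engine may differ in the details.

```python
from typing import Iterable, Iterator, List, Tuple

DAY_MS = 24 * 60 * 60 * 1000


def hot_day_alerts(
    stream: Iterable[Tuple[int, float]], within_ms: int = DAY_MS
) -> Iterator[Tuple[int, float]]:
    """Yield (ts, diff) whenever a later event is > 5 degrees above an earlier one."""
    pending: List[Tuple[int, float]] = []  # partial matches waiting for their a2
    for ts, temp in stream:
        still_pending: List[Tuple[int, float]] = []
        for a1_ts, a1_temp in pending:
            if ts - a1_ts > within_ms:
                continue  # 'within 1 day' elapsed: drop the partial match
            if temp > a1_temp + 5:
                yield ts, temp - a1_temp  # a1 -> a2 completed
            else:
                still_pending.append((a1_ts, a1_temp))
        still_pending.append((ts, temp))  # 'every': each event starts a new match
        pending = still_pending


readings = [(0, 20.0), (1_000, 22.0), (2_000, 28.0)]
print(list(hot_day_alerts(readings)))  # [(2000, 8.0), (2000, 6.0)]
```

The list of pending partial matches is the state the engine has to keep, and it is what makes patterns more expensive than filters.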

Operators: Joins
- Join two data streams based on a condition and windows.
- Use cases:
  - Data correlation, detecting missing events, detecting erroneous data
  - Joining event streams
    from TempStream[temp > 30.0]#window.time(1 min) as T
        join RegulatorStream[isOn == false]#window.length(1) as R
        on T.roomNo == R.roomNo
    select T.roomNo, R.deviceID, 'start' as action
    insert into RegulatorActionStream;
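One way to picture the windowed join is as two small buffers that are matched whenever either side receives an event. The sketch below assumes millisecond timestamps and hypothetical method names (`on_temp`, `on_regulator`); it keeps the hot temperature events of the last minute on one side and only the most recent switched-off regulator event on the other, mirroring the 1-minute time window and the length-1 window.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple

Action = Dict[str, object]


class HotRoomRegulatorJoin:
    def __init__(self, window_ms: int = 60_000) -> None:
        self.window_ms = window_ms
        self.hot_temps: Deque[Tuple[int, int]] = deque()  # (ts, roomNo), temp > 30
        self.last_off: Optional[Tuple[int, str]] = None   # (roomNo, deviceID)

    def _expire(self, now: int) -> None:
        while self.hot_temps and self.hot_temps[0][0] <= now - self.window_ms:
            self.hot_temps.popleft()

    @staticmethod
    def _action(room_no: int, device_id: str) -> Action:
        return {"roomNo": room_no, "deviceID": device_id, "action": "start"}

    def on_temp(self, ts: int, room_no: int, temp: float) -> List[Action]:
        self._expire(ts)
        if temp <= 30.0:
            return []  # filtered out before the window
        self.hot_temps.append((ts, room_no))
        if self.last_off is not None and self.last_off[0] == room_no:
            return [self._action(room_no, self.last_off[1])]
        return []

    def on_regulator(self, ts: int, room_no: int, device_id: str, is_on: bool) -> List[Action]:
        self._expire(ts)
        if is_on:
            return []  # filtered out before the window
        self.last_off = (room_no, device_id)
        return [self._action(room_no, device_id)
                for _, hot_room in self.hot_temps if hot_room == room_no]


join = HotRoomRegulatorJoin()
join.on_regulator(0, 7, "dev-7", is_on=False)
print(join.on_temp(1_000, 7, 35.0))
```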

Operators: Access Data from the Disk
- Event tables allow users to map a database to a window and join a data stream with that window.
- Use cases:
  - Merge with data in a database, collect, update data conditionally
    define stream TempStream (ts long, temp double);
    define table HistTempTable (day long, avgT double);

    from TempStream#window.length(1) join HistTempTable
        on getDayOfYear(ts) == HistTempTable.day and temp > avgT
    select ts, temp
    insert into PurchaseUserStream;
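A rough Python picture of joining a stream with an event table, with a dictionary standing in for the database table. Two assumptions are baked in and worth stating: `ts` is taken to be milliseconds since the Unix epoch in UTC, and the join condition is read as "the reading is above the historical average for that day of the year".

```python
from datetime import datetime, timezone
from typing import Dict, Iterable, Iterator, Tuple


def day_of_year(ts_ms: int) -> int:
    """Stand-in for getDayOfYear(ts); assumes epoch milliseconds in UTC."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).timetuple().tm_yday


def above_historical_average(
    stream: Iterable[Tuple[int, float]], hist_temp_table: Dict[int, float]
) -> Iterator[Tuple[int, float]]:
    """Emit (ts, temp) when temp exceeds the table's average for that day."""
    for ts, temp in stream:
        avg_t = hist_temp_table.get(day_of_year(ts))  # table lookup = the join
        if avg_t is not None and temp > avg_t:
            yield ts, temp


DAY_MS = 86_400_000
hist_temp_table = {1: 5.0, 2: 6.5}  # day of year -> historical average
stream = [(0, 7.0), (DAY_MS, 6.0), (2 * DAY_MS, 9.0)]
print(list(above_historical_average(stream, hist_temp_table)))  # [(0, 7.0)]
```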

Revisit Patterns
- Merge with data in a database, collect, update data conditionally
- Detecting event sequence patterns
- Detecting trends
- Learning a model
- Predicting the next value and corrective actions
- Simple counting
- Counting with windows
- Preprocessing: filtering, transformations
- Alerts, thresholds
- Data correlation, detecting missing events
- Joining event streams
- Tracking

Predictive Analytics
- Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2).
- Build models using R, export them as PMML, and use them within WSO2 CEP.
- Call R scripts from CEP queries.
- Regression and anomaly detection operators in CEP.

Case Study: Realtime Soccer Analysis Watch at:

TfL Traffic Analysis
- Built using TfL (Transport for London) open data feeds.

Great, Does it Scale?

Idea I: Network of CEP Nodes
- For scaling, we arrange CEP processing nodes in a graph, as in stream processing.
- The graph can be implemented using a stream processing engine like Apache Storm.

Idea II: Compile SQL-like Queries to a Network of CEP Nodes
    from TempStream[temp > 33]
    insert into HighTempStream;

    from HighTempStream#window.time(1 hour)
    select max(temp) as max
    insert into HourlyMaxTempStream;

How Do We Partition the Data to Scale Up the Analysis?
- Let's follow MapReduce.
- MapReduce does not scale by itself; it asks users to break the problem into many small, independent problems.

Idea III: Let the Users Specify Parallelism
- The language includes parallel constructs: partitions, pipelines, distributed operators.
- Assign each partition to a different node, and partition the data accordingly.
    define partition on TempStream.region {
        from TempStream[temp > 33]
        insert into HighTempStream;
    }

    from HighTempStream#window.time(1 hour)
    select max(temp) as max
    insert into HourlyMaxTempStream;
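The partitioning step can be sketched as a stable hash on the partition key, so that all events of one region land on the same node. This is an illustration only: the node count, the dictionary-shaped events, and the function names are assumptions, and window bookkeeping is left out so that the example stays focused on routing.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Optional
from zlib import crc32

Event = Dict[str, Any]


def node_for(region: str, num_nodes: int) -> int:
    """Stable hash partitioning (crc32, since hash() is salted per process)."""
    return crc32(region.encode("utf-8")) % num_nodes


def partition(events: Iterable[Event], num_nodes: int) -> Dict[int, List[Event]]:
    """Route each event to the node that owns its region."""
    shards: Dict[int, List[Event]] = defaultdict(list)
    for event in events:
        shards[node_for(event["region"], num_nodes)].append(event)
    return shards


def high_temp(shard: Iterable[Event]) -> List[Event]:
    """The partitioned query: runs independently on every node."""
    return [event for event in shard if event["temp"] > 33]


def max_temp(high_temp_events: Iterable[Event]) -> Optional[float]:
    """The downstream, non-partitioned query over the merged output."""
    return max((event["temp"] for event in high_temp_events), default=None)


events = [
    {"region": "north", "temp": 35.0},
    {"region": "south", "temp": 31.0},
    {"region": "north", "temp": 40.0},
    {"region": "east", "temp": 34.0},
]
shards = partition(events, num_nodes=4)
merged = [event for shard in shards.values() for event in high_temp(shard)]
print(max_temp(merged))  # 40.0
```

Because the filter needs no state shared across regions, each node can run it without talking to the others; only the final max has to see the merged stream.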

Handling Ordering
- When data is processed in parallel, output might be generated out of order.
- Because there is no global time, we cannot trigger windows and other time-sensitive constructs.
- Solution: the current time needs to be propagated through the graph.
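One common way to propagate time through a graph is for every upstream partition to report its latest timestamp and for the downstream node to treat the minimum of those as "now". The sketch below illustrates that idea and is not necessarily how WSO2 CEP does it. It assumes each partition emits non-decreasing timestamps, and the class and method names are made up for the example.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class MergedClock:
    """Safe current time = the minimum of the latest time seen per source."""

    def __init__(self, sources: Iterable[str]) -> None:
        self.latest: Dict[str, Optional[int]] = {s: None for s in sources}

    def advance(self, source: str, ts: int) -> Optional[int]:
        current = self.latest[source]
        if current is None or ts > current:
            self.latest[source] = ts
        seen = list(self.latest.values())
        if any(v is None for v in seen):
            return None  # some source has not reported yet
        return min(v for v in seen if v is not None)


class TumblingMax:
    """Max temp per tumbling window, fired only when the merged clock passes it."""

    def __init__(self, sources: Iterable[str], window_ms: int) -> None:
        self.clock = MergedClock(sources)
        self.window_ms = window_ms
        self.buckets: Dict[int, float] = {}  # window start -> max temp so far

    def on_event(self, source: str, ts: int, temp: float) -> List[Tuple[int, float]]:
        start = ts - ts % self.window_ms
        self.buckets[start] = max(temp, self.buckets.get(start, temp))
        return self._fire(self.clock.advance(source, ts))

    def on_heartbeat(self, source: str, ts: int) -> List[Tuple[int, float]]:
        """A time-only message: lets idle partitions still move the clock."""
        return self._fire(self.clock.advance(source, ts))

    def _fire(self, now: Optional[int]) -> List[Tuple[int, float]]:
        if now is None:
            return []
        ready = sorted(s for s in self.buckets if s + self.window_ms <= now)
        return [(s, self.buckets.pop(s)) for s in ready]


w = TumblingMax(["a", "b"], window_ms=1_000)
w.on_event("a", 100, 34.0)
w.on_event("a", 1_200, 36.0)       # "a" is already in the next window
w.on_event("b", 900, 40.0)         # arrives later, belongs to the first window
print(w.on_heartbeat("b", 1_500))  # [(0, 40.0)]
```

The first window is held back until partition "b" has also moved past its end, so the late-arriving 40.0 is still counted.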

Putting Everything Together

WSO2 CEP & Big Data Platform

CEP = SQL for Realtime Analytics
- Easy to follow from SQL.
- Expressive, short, sweet, and fast!
- Defines core operations that cover 90% of problems.
- Lets experts dig in when they like!
And it scales!

Questions? Visit us at Booth 1025: http://wso2.com/landing/strata-hadoop-world-ca-2015/