Apache Beam: The Case for Unifying Streaming APIs
Andrew Psaltis, HDF / IoT Product Solution
June 13, 2016
Today, if a byte of data were a gallon of water, we could fill an average house in 10 seconds; by 2020 it will take only 2 seconds.
< 1% of the data generated by the 30,000 sensors on an offshore oil rig is currently used. SOURCE: McKinsey Global Institute analysis
Your Big Data Application, Simplified: Current Architecture
Distributed stream processing engines: Apache Apex | Quarks | Gearpump | Concord | Heron | Pulsar | Spark | SQLStream | EsperTech | Google Cloud Dataflow | S4 | Flink | Storm | Samza | Impetus | Kafka Streams | Amazon Kinesis | MemSQL | Bottlenose Nerve Center | Striim | Azure Stream Analytics | Informatica | Oracle Stream Analytics | WSO2 Complex Event Processor | IBM InfoSphere Streams | Cisco Connected Streaming Analytics | SAS Event Stream Processing | TIBCO StreamBase | Software AG Apama Streaming Analytics | MongoDB | Ignite | XAP
What if we wanted to add support for Apache Apex?
The Evolution of Apache Beam: MapReduce → BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, MillWheel → Google Cloud Dataflow → Apache Beam. Slide by Frances Perry & Tyler Akidau, April 2016
The Apache Beam Vision
1. End users: want to write pipelines in a language that's familiar.
2. SDK writers: want to make Beam concepts available in new languages.
3. Runner writers: have a distributed processing environment and want to support Beam pipelines.
[Diagram: Beam Java and Beam Python SDKs (and other languages) construct pipelines against the Beam Model; Fn Runners execute them on Apache Flink, Apache Spark, Cloud Dataflow, or locally.]
Slide by Frances Perry & Tyler Akidau, April 2016
The Beam Model: Asking the Right Questions
- What results are calculated?
- Where in event time are results calculated?
- When in processing time are results materialized?
- How do refinements of results relate?
Slide by Frances Perry & Tyler Akidau, April 2016
Customizing What / Where / When / How: 1. Classic Batch; 2. Windowed Batch; 3. Streaming; 4. Streaming + Accumulation. Slide by Frances Perry & Tyler Akidau, April 2016
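The four questions map directly onto Beam's windowing API. The following is a minimal sketch against the Beam Java SDK (class and key names are illustrative, and the exact trigger configuration shown is an assumption, not taken from the slides): fixed windows answer "where", a watermark trigger with early and late firings answers "when", accumulating panes answer "how", and a per-key sum answers "what".

```java
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.transforms.windowing.AfterPane;
import org.apache.beam.sdk.transforms.windowing.AfterProcessingTime;
import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class FourQuestions {
  // input: per-key scores carrying event-time timestamps (illustrative).
  static PCollection<KV<String, Long>> answer(PCollection<KV<String, Long>> input) {
    return input
        // Where in event time: fixed two-minute windows.
        .apply(Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(2)))
            // When in processing time: speculative early firings, an on-time
            // firing at the watermark, then one firing per late element.
            .triggering(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(Duration.standardMinutes(1)))
                .withLateFirings(AfterPane.elementCountAtLeast(1)))
            .withAllowedLateness(Duration.standardHours(1))
            // How refinements relate: later panes accumulate earlier ones.
            .accumulatingFiredPanes())
        // What is calculated: a sum per key.
        .apply(Sum.longsPerKey());
  }
}
```

Removing the `triggering`/`accumulatingFiredPanes` calls degrades this to windowed batch; removing `Window.into` as well gives classic batch, which is exactly the customization the slide enumerates.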
PCollection: represents a collection of data, which could be bounded or unbounded in size.
PTransform: represents a computation that transforms input PCollections into output PCollections.
Pipeline: manages a directed acyclic graph of PTransforms and PCollections that is ready for execution.
PipelineRunner: specifies where and how the pipeline should execute.
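The four abstractions fit together roughly as below. This is a minimal sketch against the Beam Java SDK (the class name and example data are illustrative); the runner is chosen through `PipelineOptions` and defaults to the local DirectRunner.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamConcepts {
  public static void main(String[] args) {
    // PipelineOptions select the PipelineRunner (where and how to execute).
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

    // A Pipeline manages the DAG of PTransforms and PCollections.
    Pipeline p = Pipeline.create(options);

    // A PCollection is a (bounded or unbounded) collection of elements.
    PCollection<String> lines = p.apply(Create.of("hello beam", "hello world"));

    // A PTransform turns input PCollections into output PCollections.
    PCollection<Integer> lengths = lines.apply(
        MapElements.into(TypeDescriptors.integers()).via(String::length));

    p.run().waitUntilFinish();
  }
}
```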
Word Count using GearPump
Word Count using Kafka Streams
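The original slide's code listing did not survive extraction; the following is a sketch of the canonical Kafka Streams word count against the current `StreamsBuilder` DSL (the 0.10-era DSL shown in 2016 used `KStreamBuilder` and `countByKey` instead). Topic names and the bootstrap address are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class KafkaStreamsWordCount {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> lines = builder.stream("text-lines");

    KTable<String, Long> counts = lines
        // Split each line into lowercase words.
        .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
        // Re-key by word and count occurrences.
        .groupBy((key, word) -> word)
        .count();

    counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

    new KafkaStreams(builder.build(), props).start();
  }
}
```

Note that Kafka Streams is a library embedded in your application, not a cluster you submit to; the topology above runs wherever this JVM runs.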
Word Count using Apache Beam
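The original slide's code listing also did not survive extraction; the following is a sketch of the classic Beam word count against a recent Beam Java SDK (the 2016 incubating SDK spelled I/O as `TextIO.Read.from(...)`). File paths are illustrative.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    p.apply(TextIO.read().from("input.txt"))
        // Split each line into lowercase words.
        .apply(FlatMapElements.into(TypeDescriptors.strings())
            .via(line -> Arrays.asList(line.toLowerCase().split("\\W+"))))
        // Count occurrences of each word.
        .apply(Count.perElement())
        // Format KV<String, Long> results as text.
        .apply(MapElements.into(TypeDescriptors.strings())
            .via(kv -> kv.getKey() + ": " + kv.getValue()))
        .apply(TextIO.write().to("wordcounts"));

    p.run().waitUntilFinish();
  }
}
```

Nothing in this pipeline names an execution engine: the same code runs in batch over a file or, with an unbounded source swapped in, as a streaming job. That is the contrast with the engine-specific word counts on the preceding slides.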
What if we want to change the runner for Beam?
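Switching runners changes no pipeline code; it is a matter of pipeline options. A sketch, assuming the chosen runner's artifact is on the classpath and using the runner flag names from recent Beam releases:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerSwap {
  public static void main(String[] args) {
    // Same pipeline code; only the command line changes, e.g.:
    //   --runner=DirectRunner    (local testing, the default)
    //   --runner=FlinkRunner
    //   --runner=SparkRunner
    //   --runner=DataflowRunner --project=... --tempLocation=gs://...
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // ... apply transforms as before ...
    p.run().waitUntilFinish();
  }
}
```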
Your Big Data Application, Simplified: Future Architecture with Beam
Apache Beam, on top of: Apache Apex | Quarks | Gearpump | Concord | Heron | Pulsar | Spark | SQLStream | EsperTech | Google Cloud Dataflow | S4 | Flink | Storm | Samza | Impetus | Kafka Streams | Amazon Kinesis | MemSQL | Bottlenose Nerve Center | Striim | Azure Stream Analytics | Informatica | Oracle Stream Analytics | WSO2 Complex Event Processor | IBM InfoSphere Streams | Cisco Connected Streaming Analytics | SAS Event Stream Processing | TIBCO StreamBase | Software AG Apama Streaming Analytics | MongoDB | Ignite | XAP
Why Apache Beam?
- Unified: one model handles batch and streaming use cases.
- Portable: pipelines can be executed on multiple execution environments, avoiding lock-in.
- Extensible: supports user- and community-driven SDKs, runners, transformation libraries, and I/O connectors.
Slide by Frances Perry & Tyler Akidau, April 2016
Learn More!
- Apache Beam (incubating)
- The World Beyond Batch 101
- Why Apache Beam? A Google Perspective
- Join the mailing lists! User discussions - Development discussions -
- On Twitter
Slide by Frances Perry & Tyler Akidau, April 2016
The future of streaming and batch is Apache Beam. The choice of runner is up to you. Source:
Thank You
© Hortonworks Inc – All Rights Reserved