Some practical information

Presentation transcript:

Some practical information
Network name: Flink Forward 2016
Password: #flinkforward16
Twitter handle: @flinkforward
Hashtag: #ff16
Group photo today at 3.30 pm
All talks will be recorded and can be found on our YouTube channel “Apache Flink Berlin” after the conference
FlinkFest today at Palais starting at 6.10 pm
Attention: Some last-minute changes to the program, please consult the online schedule

The Venue

A big thanks to our sponsors!

A big thanks to our program committee! Tyler Akidau (Google), Stephan Ewen (data Artisans), Jamie Grier (data Artisans), Vasia Kalavri (KTH), Neha Narkhede (Confluent)

A big thanks to our speakers!

The data streaming ecosystem and Apache Flink®: present and future
Kostas Tzoumas, Stephan Ewen
Flink Forward, September 12, 2016

Founded by the original creators of Apache Flink®, our goal is to make stream processing accessible to the enterprise
Contributing and helping the Flink community grow
Providing enterprise support and services

Streaming is the biggest change in data infrastructure (Flink Forward 2015)
Streaming is a rapidly growing and maturing market category of its own

The Flink community has been at the center of this journey. And there is innovation and convergence in all parts of the stack: message transport, compute engine, programming paradigm.

Why? Streaming technology is enabling the obvious: continuous processing on data that is continuously produced.
Hint: you already have streaming data

Data streaming adoption patterns
Real-time products and business monitoring
Robust continuous applications
Decentralized architecture
Unify real-time and historical data
Uber, Netflix, Alibaba, Zalando, King

Retail, e-commerce: better product recommendations, process monitoring, inventory management
Finance: differentiation via tech, push-based products, fraud detection
Telco, IoT, Infrastructure: infrastructure monitoring, anomaly detection
Internet & mobile: personalization, user behavior monitoring, analytics

Largest job has > 20 operators, runs on > 5000 vCores in a 1000-node cluster, processes millions of events per second
Complex jobs of > 30 operators running 24/7, processing 30 billion events daily, maintaining state of 100s of GB with exactly-once guarantees
30 Flink applications in production for more than one year. 10 billion events (2TB) processed daily

What is Flink's unique role in the streaming data ecosystem?

Before Flink, users had to make hard choices between volume, latency, and accuracy

Flink eliminates these tradeoffs
10s of millions of events per second for stateful applications
Sub-second latency, as low as single-digit milliseconds
Accurate computation results

A broader definition of accuracy: the results that I want when I want them
Accurate under failures and downtime
Accurate under out of order data
Results when you need them
Accurate modeling of the world

1. Failures and downtime: checkpoints & savepoints, exactly-once guarantees
2. Out of order and late data: event time support, watermarks
3. Results when you need them: low latency, triggers
4. Accurate modeling: true streaming engine, sessions and flexible windows
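
To make point 2 concrete, here is a minimal plain-Python sketch of event-time tumbling windows driven by a watermark. It is not Flink code: the function name, the `max_out_of_orderness` parameter, and the choice to silently drop events that arrive after their window has fired are all assumptions made for illustration.

```python
from collections import defaultdict


def tumbling_event_time_windows(events, window_size, max_out_of_orderness):
    """Yield (window_start, count) once the watermark passes a window's end.

    `events` is an iterable of (event_time, payload) in ARRIVAL order,
    which may differ from event-time order.
    """
    counts = defaultdict(int)      # window_start -> events counted so far
    watermark = float("-inf")      # "no older event is expected anymore"
    for event_time, _payload in events:
        window_start = event_time - (event_time % window_size)
        if window_start + window_size <= watermark:
            continue               # late event: its window already fired (dropped in this sketch)
        counts[window_start] += 1
        # Assumed watermark rule: trail the largest timestamp seen by a fixed bound.
        watermark = max(watermark, event_time - max_out_of_orderness)
        for start in sorted(counts):            # sorted() copies the keys, so pop is safe
            if start + window_size <= watermark:
                yield start, counts.pop(start)
    for start in sorted(counts):                # end of input: flush what is left
        yield start, counts[start]


arrivals = [(1, "a"), (4, "b"), (2, "c"), (12, "d"), (3, "e"), (15, "f"), (27, "g")]
results = list(tumbling_event_time_windows(arrivals, window_size=10, max_out_of_orderness=5))
```

The event with timestamp 3 arrives after the event with timestamp 12 but is still counted in the first window, because the watermark has not yet passed that window's end.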

5. Batch + streaming: one engine, dedicated APIs
6. Reprocessing: high throughput, event time support, and savepoints (flink run -s <savepoint> <job>)
7. Ecosystem: rich connector ecosystem and 3rd party packages
8. Community support: one of the most active projects with over 200 contributors

What are the next steps for Flink?

Provide state of the art streaming capabilities (✔)
Operate in the largest infrastructures of the world
Open up to a wider set of enterprise users
Broaden the scope of stream processing

Apache Flink today The Apache Flink community has pushed the boundaries of open source stream processing.

Flink's unique combination of features
Areas: Performance; Event Time; Stateful Streaming; APIs; Libraries
Features: Low latency; Out-of-order events; Consistency; High throughput; Works on real-time and historic data; Well-behaved flow control (back pressure); Fluent API; Windows & user-defined state; Complex Event Processing; Exactly-once semantics for fault tolerance; Flexible windows (time, count, session, roll-your-own); Savepoints (replays, A/B testing, upgrades, versioning); Fast and large out-of-core state

Flink v1.1: Metric System; Library enhancements; Connectors; (Stream) SQL; Session Windows
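
As an illustration of what a session window groups together, the following is an offline plain-Python sketch, not Flink's implementation. It assumes all timestamps for one key are available up front, and the `gap` parameter name and the rule that a pause of exactly `gap` still belongs to the same session are choices made for this example.

```python
def session_windows(timestamps, gap):
    """Group event timestamps into sessions split by more than `gap` of inactivity.

    Returns a list of (first_timestamp, last_timestamp, event_count) per session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)      # still within the current session
        else:
            sessions.append([ts])        # inactivity exceeded the gap: new session
    return [(s[0], s[-1], len(s)) for s in sessions]


clicks = [1, 2, 3, 10, 11, 30]
sessions = session_windows(clicks, gap=5)
```

Unlike fixed-size windows, the boundaries here depend on the data itself, which is why sessions are listed separately from time and count windows.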

Flink v1.1 + current threads
Areas: Operations; Ecosystem; Application Features; Broader Audience
Threads: Dynamic Scaling; Metric System; Metrics & Visualization; Dynamic Resource Management; Fine grained recovery; Savepoint compatibility; Checkpoints to savepoints; Large state Maintenance; Authentication; Security; Library enhancements; Mesos & others; Side in-/outputs; Connectors; Window DSL; Queryable State; Stream SQL; Windows; Session Windows; More connectors; (Stream) SQL
More details in the talk "The Future of Apache Flink" (Monday, 11:00)

Security / Authentication
No unauthorized data access
Secured clusters with Kerberos-based authentication: Kafka, ZooKeeper, HDFS, YARN, HBase, …
No unencrypted traffic between Flink processes: RPC, Data Exchange, Web UI
Prevent malicious users from hooking into Flink jobs
See talk "Flink Security Enhancements" (Tuesday, 11.45)
Largely contributed by

Checkpoints / Savepoints
Recover a running job into a new job
Recover a running job onto a new cluster
Application state backwards compatibility
Flink 1.0 made the APIs backwards compatible
Now making the savepoints backwards compatible
Applications can be moved to newer versions of Flink even when state backends or internals change (v1.x, v1.y, v2.0)
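
The recovery idea behind checkpoints and savepoints can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Flink's mechanism: the input is a replayable in-memory log, a snapshot stores the read offset together with the operator state, and `CountingJob`, `run_with_failure`, and the `version` field are hypothetical names introduced for the example.

```python
import copy


class CountingJob:
    """Toy stateful job: counts occurrences of each key in a replayable log."""

    def __init__(self, log):
        self.log = log        # replayable input (assumption: can be re-read from any offset)
        self.offset = 0       # next position to read
        self.counts = {}      # operator state: key -> count

    def step(self):
        key = self.log[self.offset]
        self.counts[key] = self.counts.get(key, 0) + 1
        self.offset += 1

    def snapshot(self):
        # Offset and state are captured together so that they stay consistent.
        return {"version": 1, "offset": self.offset, "counts": copy.deepcopy(self.counts)}

    @classmethod
    def restore(cls, log, snap):
        job = cls(log)
        job.offset = snap["offset"]
        job.counts = copy.deepcopy(snap["counts"])
        return job


def run_with_failure(log, checkpoint_at, fail_at):
    """Run to the end of the log, simulating one crash at offset `fail_at`."""
    job = CountingJob(log)
    snap = job.snapshot()
    while job.offset < len(log):
        if job.offset == checkpoint_at:
            snap = job.snapshot()
        if job.offset == fail_at:
            job = CountingJob.restore(log, snap)   # crash: roll back and replay
            fail_at = None                         # fail only once
            continue
        job.step()
    return job.counts


log = ["a", "b", "a", "c", "a"]
recovered = run_with_failure(log, checkpoint_at=2, fail_at=4)
undisturbed = run_with_failure(log, checkpoint_at=2, fail_at=None)
```

Because the state is rolled back to the same point from which the input is replayed, the records between the snapshot and the crash are counted exactly once in the final state. Restoring the same snapshot into a fresh `CountingJob` is also how "recover a running job into a new job" looks in this toy model.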

Dynamic scaling
Changing load brings changing resource requirements
Need to adjust parallelism of running streaming jobs
Re-scaling stateless operators is trivial
Re-scaling stateful operators is hard (windows, user state)
Efficiently re-shard state
Re-scaling Flink jobs preserves exactly-once guarantees
[Figure: workload and resources over time]
See talk "Dynamic scaling: How Apache Flink adapts to changing workloads" (Tuesday, 14.45)
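
One way to make re-sharding keyed state cheap is to hash keys into a fixed number of groups and move whole groups between parallel instances. The plain-Python sketch below illustrates that idea only; the constant `MAX_PARALLELISM`, the CRC32 hash, and all function names are assumptions for the example, and the talk itself does not spell out the mechanism.

```python
import zlib

MAX_PARALLELISM = 128   # assumed fixed number of key groups


def key_group(key):
    """Map a key to a stable group id (CRC32 is an arbitrary stable hash here)."""
    return zlib.crc32(key.encode("utf-8")) % MAX_PARALLELISM


def owner(group, parallelism):
    """Assign contiguous ranges of key groups to parallel instances."""
    return group * parallelism // MAX_PARALLELISM


def shard_state(state, parallelism):
    """Split a key -> value state dict into one dict per parallel instance."""
    shards = [dict() for _ in range(parallelism)]
    for key, value in state.items():
        shards[owner(key_group(key), parallelism)][key] = value
    return shards


def rescale(shards, new_parallelism):
    """Redistribute existing shards to a new parallelism without losing state."""
    merged = {}
    for shard in shards:
        merged.update(shard)
    return shard_state(merged, new_parallelism)


state = {f"user-{i}": i for i in range(1000)}
before = shard_state(state, parallelism=4)
after = rescale(before, new_parallelism=6)
```

A key's group never changes, so re-scaling only changes which instance owns each group. That is what lets every piece of state end up in exactly one place after the change.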

Cluster management
Series of improvements to seamlessly interoperate with various cluster managers: YARN, Mesos, Docker, Standalone, …
Proper isolation of jobs, clean support for multi-job sessions
Dynamic acquire/release of resources
Using mixed container sizes
See talk "Running Flink Everywhere" (Monday, 16.45)
See talk "Introducing Flink on Mesos" (Tuesday, 11.30)
Driven by and Mesos integration contributed by

Stream SQL
SQL is the standard high-level query language
A natural way to open up streaming to more people
Problem: There is no Streaming SQL standard, at least beyond the basic operations
Challenging: Incorporate windows and time semantics
Flink community working with users and with Apache Calcite to draft a new model
See talk "Streaming SQL" (Monday, 11:00)
See talk "Taking a look under the hood of Apache Flink’s relational APIs" (Monday, 16.45)

Looking further

Streaming and batch
The separation of batch and streaming has been largely technology driven (not by use cases) and is quite artificial
People are approaching Flink for batch processing as well
In fact, several talks here are about batch processing

Streaming and batch
[Figure: a log of hourly partitions running from 2016-3-1 12:00 am to 2016-3-12 3:00 am, with regions labeled Stream (low latency), Stream (high latency), and Batch (bounded stream)]
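
The "batch is a bounded stream" view can be illustrated with a small plain-Python sketch. It is an analogy rather than Flink API usage: `running_totals`, `final_totals`, and `endless_clicks` are hypothetical names, and Python iterables stand in for streams.

```python
import itertools


def running_totals(stream):
    """Streaming logic: consume (key, value) pairs and emit the running total per key."""
    totals = {}
    for key, value in stream:
        totals[key] = totals.get(key, 0) + value
        yield key, totals[key]


def final_totals(bounded_stream):
    """Batch view: run the same streaming logic until a bounded input ends."""
    result = {}
    for key, total in running_totals(bounded_stream):
        result[key] = total          # keep only the last update per key
    return result


def endless_clicks():
    """An unbounded source that never terminates on its own."""
    for i in itertools.count():
        yield ("even" if i % 2 == 0 else "odd", 1)


batch_result = final_totals([("a", 1), ("b", 2), ("a", 3)])
first_updates = list(itertools.islice(running_totals(endless_clicks()), 4))
```

The same `running_totals` function serves both cases. The only difference is whether the input ends, which decides whether a final result exists or only a sequence of updates.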

Why use dedicated batch processors, or Flink's DataSet API, at all now?
Missing primitives (example: BSP iterations): possible to add to DataStream API
Resource elasticity / efficiency
Cost of fault tolerance and accuracy
Deeper integration between batch and streaming techniques

Some batch proof points: TeraSort; Classic Batch Jobs; Relational Join; Linear Algebra; Graph Processing

State in stream processing
Stateless Streaming (Apache Storm)
Stateful Streaming (Apache Samza)
Accurate Stateful Streaming (Apache Flink)
State sizes in Flink today (my assessment): 10s of gigabytes per operator
How to scale this to many terabytes?
Queryable State
Data driven triggers over large state

Large-state streaming
How to scale the stream processor state, maintain fast checkpoint intervals, and have very fast recovery on machine failures?
More and more database techniques coming into Flink

…in conclusion
Flink is running in some of the largest streaming setups
Community is working on adding many state-of-the-art operational features
Available to broader audiences, via Stream SQL
Streaming has even more potential to subsume batch and will hold more and more application state

Enjoy the conference!