co-founder / data Artisans

Name: co-founder / data Artisans
Uploaded: 2017-08-28T17:05:38+00:00
Duration: PTM15S28
Channel: Dustin Gilmore
Description: co-founder / data Artisans

Apache Flink
Stephan Ewen
Flink committer, co-founder / CTO @ data Artisans
@StephanEwen

Looking back one year

April 16, 2014

Stratosphere stack: Pact API (Java), DataSet API (Scala), Stratosphere Optimizer, Stratosphere Runtime (Local, Remote)
Batch processing on a pipelining engine, with iterations …
Iterations, YARN support, local execution, accumulators, web frontend, HBase, JDBC, Windows compatibility, mvn central

Looking at now…

What is Apache Flink?
Inputs: real-time data streams (event logs; Kafka, RabbitMQ, ...) and historic data (HDFS, JDBC, ...)
Flink (master)
Streaming: low latency, windowing, aggregations, ...
Batch: ETL, graphs, machine learning, relational, …

What is Apache Flink?
Libraries and compatibility layers: Python, Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R
Core APIs: DataSet (Java/Scala), DataStream (Java/Scala)
Translation: Flink Optimizer, Stream Builder
Flink Dataflow Runtime
Deployment: Local, Remote, Yarn, Tez, Embedded
Sources and sinks: HDFS, HCatalog, HBase, JDBC, Kafka, RabbitMQ, Flume

Batch / Streaming APIs
    case class Word (word: String, frequency: Int)
DataSet API (batch):
    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .groupBy("word").sum("frequency")
      .print()
DataStream API (streaming):
    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap {line => line.split(" ")
        .map(word => Word(word,1))}
      .window(Count.of(1000)).every(Count.of(100))
      .groupBy("word").sum("frequency")
      .print()

Technology inside Flink
    case class Path (from: Long, to: Long)
    val tc = edges.iterate(10) {
      paths: DataSet[Path] =>
        val next = paths
          .join(edges)
          .where("to")
          .equalTo("from") {
            (path, edge) => Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
    }
Pre-flight (Client): Program, Type extraction stack, Dataflow Graph, Cost-based optimizer
Optimized plan (diagram): DataSource orders.tbl, Filter, Map, lineitem.tbl, Join (Hybrid Hash; buildHT, probe; hash-part [0]), GroupRed (sort, forward)
Master: Task scheduling, Recovery metadata, track intermediate results, deploy operators
Workers: Memory manager, Out-of-core algos, Batch & Streaming, State & Checkpoints

Flink by Feature / Use Case

Data Streaming Analysis

Life of data streams
Create: create streams from event sources (machines, databases, logs, sensors, …)
Collect: collect and make streams available for consumption (e.g., Apache Kafka)
Process: process streams, possibly generating derived streams (e.g., Apache Flink)

Stream Analysis in Flink
More at:

Defining windows in Flink
Trigger policy: when to trigger the computation on the current window
Eviction policy: when data points should leave the window; defines the window width/size
E.g., count-based policy: evict when #elements > n; start a new window every n-th element
Built-in: Count, Time, Delta policies
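The interplay of the two policies can be illustrated without Flink at all. The following is a minimal sketch on plain Scala collections, assuming a count-based eviction policy (keep at most `size` elements) and a count-based trigger policy (fire every `slide` elements). The object `CountWindowSketch` and the helper `countWindows` are hypothetical names for illustration, not part of any Flink API.

```scala
object CountWindowSketch {
  /** Hypothetical helper, not a Flink API: emits the current window contents
    * every `slide` elements (trigger) and keeps at most `size` elements (eviction). */
  def countWindows[T](input: Seq[T], size: Int, slide: Int): Vector[Vector[T]] = {
    require(size > 0 && slide > 0, "size and slide must be positive")
    var window = Vector.empty[T]
    var sinceLastTrigger = 0
    val emitted = Vector.newBuilder[Vector[T]]
    for (element <- input) {
      window = window :+ element
      // Eviction policy: drop the oldest elements once the window exceeds `size`.
      if (window.size > size) window = window.takeRight(size)
      sinceLastTrigger += 1
      // Trigger policy: fire the computation on every `slide`-th element.
      if (sinceLastTrigger == slide) {
        emitted += window
        sinceLastTrigger = 0
      }
    }
    emitted.result()
  }

  def main(args: Array[String]): Unit = {
    // Window of the last 4 elements, evaluated every 2 elements.
    val sums = countWindows(1 to 10, size = 4, slide = 2).map(_.sum)
    println(sums) // Vector(3, 10, 18, 26, 34)
  }
}
```

With `size = 1000` and `slide = 100` this corresponds in spirit to a window of the last 1000 elements that is evaluated every 100 elements.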

Checkpointing / Recovery
Flink acknowledges batches of records
  Less overhead in the failure-free case
  Currently tied to fault-tolerant data sources (e.g., Kafka)
Flink operators can keep state
  State is checkpointed
  Checkpointing and record acks go together
Exactly-once semantics for state

Checkpointing / Recovery
Chandy-Lamport algorithm for consistent asynchronous distributed snapshots
Pushes checkpoint barriers through the data flow
Before barrier = part of the snapshot
After barrier = not in snapshot (backup till next snapshot)
Diagram states: operator checkpoint starting, checkpoint in progress, checkpoint done
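The "before barrier / after barrier" rule can be shown with a toy model. This is a minimal sketch assuming a single operator with one input channel that keeps a running sum as its state; a real distributed snapshot also has to align barriers across several input channels, which is left out here. The names `BarrierSnapshotSketch`, `Record`, `Barrier`, and `runSummingOperator` are hypothetical and not taken from Flink.

```scala
object BarrierSnapshotSketch {
  sealed trait StreamElement
  final case class Record(value: Int) extends StreamElement
  final case class Barrier(checkpointId: Long) extends StreamElement

  /** Runs a summing operator over the stream and returns
    * (final state, snapshot of the state taken at each barrier). */
  def runSummingOperator(stream: Seq[StreamElement]): (Int, Map[Long, Int]) =
    stream.foldLeft((0, Map.empty[Long, Int])) {
      // A record only updates the operator state.
      case ((sum, snapshots), Record(v)) => (sum + v, snapshots)
      // A barrier snapshots the state: it reflects exactly the records seen before it.
      case ((sum, snapshots), Barrier(id)) => (sum, snapshots + (id -> sum))
    }

  def main(args: Array[String]): Unit = {
    val stream = Seq(Record(1), Record(2), Barrier(1L), Record(3), Barrier(2L), Record(4))
    val (state, snapshots) = runSummingOperator(stream)
    println(s"state=$state snapshots=$snapshots")
  }
}
```

In this model, recovery after a failure would restore the state from the latest snapshot and replay only the records that followed its barrier, which is why the source has to keep them until the next snapshot.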

Heavy ETL Pipelines

Heavy Data Pipelines
Complex ETL programs
Apology: Graph had to be blurred for online slides, due to confidentiality

Memory Management
Flink contains its own memory management stack. Memory is allocated, de-allocated, and used strictly using an internal buffer pool implementation. To do that, Flink contains its own type extraction and serialization components.
Unmanaged: user code objects, e.g. public class WC { public String word; public int count; }
Managed: pool of memory pages (empty pages) used for sorting, hashing, caching, shuffling, broadcasts
More at:
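The idea of working on serialized records inside a fixed pool of pages can be sketched as follows. This is a simplified illustration under stated assumptions, not Flink's memory manager: `PagePool`, `write`, and `read` are hypothetical names, pages are plain `java.nio.ByteBuffer`s, and a record is laid out as [word length][UTF-8 bytes][count].

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import scala.collection.mutable

object MemoryPoolSketch {
  final case class WC(word: String, count: Int)

  /** Hypothetical fixed-size page pool; all pages are allocated once, up front. */
  final class PagePool(numPages: Int, pageSize: Int) {
    private val free = mutable.Queue.fill(numPages)(ByteBuffer.allocate(pageSize))
    def available: Int = free.size
    /** Hands out a page, or None when the pool is exhausted (a caller would then go to disk). */
    def request(): Option[ByteBuffer] = if (free.isEmpty) None else Some(free.dequeue())
    /** Resets the page and returns it to the pool for reuse. */
    def release(page: ByteBuffer): Unit = { page.clear(); free.enqueue(page) }
  }

  /** Serializes a record into the page; returns false if it does not fit. */
  def write(page: ByteBuffer, wc: WC): Boolean = {
    val bytes = wc.word.getBytes(StandardCharsets.UTF_8)
    if (page.remaining() < 4 + bytes.length + 4) false
    else { page.putInt(bytes.length); page.put(bytes); page.putInt(wc.count); true }
  }

  /** Deserializes the next record from the page's current position. */
  def read(page: ByteBuffer): WC = {
    val bytes = new Array[Byte](page.getInt())
    page.get(bytes)
    WC(new String(bytes, StandardCharsets.UTF_8), page.getInt())
  }
}
```

Because the number of pages is fixed, running out of pages is an explicit, recoverable condition rather than a garbage-collection or out-of-memory problem, which is the property the slide attributes to managed memory.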

Smooth out-of-core performance
Single-core join of 1KB Java objects beyond memory (4 GB)
Blue bars are in-memory, orange bars (partially) out-of-core
More at:

Benefits of managed memory
More reliable and stable performance (less GC effects, easy to go to disk)

Table API
    val customers = env.readCsvFile(…).as('id, 'mktSegment)
      .filter( 'mktSegment === "AUTOMOBILE" )
    val orders = env.readCsvFile(…)
      .filter( o => dateFormat.parse(o.orderDate).before(date) )
      .as('orderId, 'custId, 'orderDate, 'shipPrio)
    val items = orders
      .join(customers).where('custId === 'id)
      .join(lineitems).where('orderId === 'id)
      .select('orderId,'orderDate,'shipPrio,
        'extdPrice * (Literal(1.0f) - 'discount) as 'revenue)
    val result = items
      .groupBy('orderId, 'orderDate, 'shipPrio)
      .select('orderId, 'revenue.sum, 'orderDate, 'shipPrio)

Iterations in Data Flows: Machine Learning Algorithms

Iterate by looping
for/while loop in client submits one job per iteration step
Data reuse by caching in memory and/or disk
Diagram: Client; Step, Step, Step, Step, Step

Iterate in the Dataflow

Large-Scale Machine Learning
Factorizing a matrix with 28 billion ratings for recommendations (Scale of Netflix or Spotify) More at:

State in Iterations: Graphs and Machine Learning

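Iterate natively with deltas
Diagram: the initial workset and the initial solution set enter the delta iteration; the step functions (A, B, X, Y) read the workset, the partial solution, and other datasets; the new workset replaces the old one ("Replace"), and the delta set is merged into the solution set ("Merge deltas"); the solution set is the iteration result.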
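The workset / solution set mechanics can be sketched on plain Scala collections. The example below is an illustration only, assuming connected components as the algorithm (each vertex ends up with the smallest vertex id in its component); `DeltaIterationSketch` and `connectedComponents` are hypothetical names and nothing here uses Flink's iteration operators.

```scala
object DeltaIterationSketch {
  /** Connected components with a workset (recently changed vertices)
    * and a solution set (vertex -> component id). */
  def connectedComponents(vertices: Set[Long], edges: Set[(Long, Long)]): Map[Long, Long] = {
    // Treat edges as undirected and build adjacency lists.
    val neighbors: Map[Long, Set[Long]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }
    val allVertices = vertices ++ edges.flatMap { case (a, b) => Set(a, b) }

    var solution: Map[Long, Long] = allVertices.map(v => v -> v).toMap // initial solution set
    var workset: Map[Long, Long] = solution                            // initial workset
    while (workset.nonEmpty) {
      // Step function: only changed vertices propose their component id to neighbors.
      val candidates = for {
        (v, comp) <- workset.toSeq
        n <- neighbors.getOrElse(v, Set.empty[Long]).toSeq
      } yield n -> comp
      // Delta: vertices whose component id actually becomes smaller.
      val delta = candidates
        .groupBy(_._1)
        .map { case (v, cs) => v -> cs.map(_._2).min }
        .filter { case (v, comp) => comp < solution(v) }
      solution = solution ++ delta // merge deltas into the solution set
      workset = delta              // replace the workset
    }
    solution
  }
}
```

The loop stops as soon as an iteration produces no updates, and each step touches only the vertices that changed in the previous one, so the work per iteration shrinks as the computation converges.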

Effect of delta iterations…
Plot: # of elements updated (y-axis) per iteration (x-axis)

… very fast graph analysis
Performance competitive with dedicated graph analysis systems
… and mix and match ETL-style and graph analysis in one program
More at:

Closing

Flink Roadmap for 2015
Out-of-core state in Streaming
Monitoring and scaling for streaming
Streaming Machine Learning with SAMOA
More additions to the libraries
  Batch Machine Learning
  Graph library additions (more algorithms)
SQL on top of expression language
Master failover

Flink community
#unique contributor ids by git commits
dev list: messages/month. record 1000 messages on

flink.apache.org @ApacheFlink

Backup

Cornerpoints of Flink Design
Flexible Data Streaming Engine
Robust Algorithms on Managed Memory
No OutOfMemory Errors
Scales to very large JVMs
Efficient and robust processing
Low Latency Stream Proc.
Highly flexible windows
High-level APIs, beyond key/value pairs
Pipelined Execution of Batch Programs
Java/Scala/Python (upcoming)
Relational-style optimizer
Better shuffle performance
Scales to very large groups
Active Library Development
Native Iterations
Very fast Graph Processing
Stateful Iterations for ML
Graphs / Machine Learning
Streaming ML (coming)

Program optimization

A simple program
    val orders = …
    val lineitems = …
    val filteredOrders = orders
      .filter(o => dateFormat.parse(o.shipDate).after(date))
      .filter(o => o.shipPrio > 2)
    val lineitemsOfOrders = filteredOrders
      .join(lineitems)
      .where("orderId").equalTo("orderId")
      .apply((o,l) => new SelectedItem(o.orderDate, l.extdPrice))
    val priceSums = lineitemsOfOrders
      .groupBy("orderDate").sum("l.extdPrice")

Two execution plans
Best plan depends on relative sizes of input files
Both plans: DataSource orders.tbl with Filter and DataSource lineitem.tbl with Map feed a Hybrid Hash Join (buildHT / probe), followed by a sorted GroupRed
Shipping strategies shown: broadcast and forward into the join, then Combine and hash-part [0,1] before the GroupRed; or hash-part [0] on both join inputs, then forward into the GroupRed
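Why the best plan depends on input sizes can be seen with a deliberately crude cost model. The sketch below is hypothetical and is not Flink's optimizer: it assumes cost is only the number of bytes sent over the network, that broadcasting an input sends a full copy to each of `parallelism` workers, and that repartitioning sends every record of both inputs once. All names (`JoinStrategySketch`, `ShipStrategy`, `cost`, `choose`) are made up for illustration.

```scala
object JoinStrategySketch {
  sealed trait ShipStrategy
  case object BroadcastLeft extends ShipStrategy
  case object BroadcastRight extends ShipStrategy
  case object RepartitionBoth extends ShipStrategy

  /** Toy network-cost estimate in bytes; an assumption for illustration only. */
  def cost(strategy: ShipStrategy, leftBytes: Long, rightBytes: Long, parallelism: Int): Long =
    strategy match {
      case BroadcastLeft   => leftBytes * parallelism  // every worker receives all of left
      case BroadcastRight  => rightBytes * parallelism // every worker receives all of right
      case RepartitionBoth => leftBytes + rightBytes   // every record is shipped once
    }

  /** Picks the cheapest strategy under the toy cost model. */
  def choose(leftBytes: Long, rightBytes: Long, parallelism: Int): ShipStrategy =
    Seq[ShipStrategy](BroadcastLeft, BroadcastRight, RepartitionBoth)
      .minBy(cost(_, leftBytes, rightBytes, parallelism))

  def main(args: Array[String]): Unit = {
    println(choose(leftBytes = 10L, rightBytes = 10000L, parallelism = 8))   // BroadcastLeft
    println(choose(leftBytes = 5000L, rightBytes = 10000L, parallelism = 8)) // RepartitionBoth
  }
}
```

A tiny filtered input next to a large one favors broadcasting the small side, while two inputs of similar size favor hash-partitioning both, which mirrors the two plans on the slide.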

Examples of optimization
Task chaining: coalesce map/filter/etc. tasks
Join optimizations: broadcast/partition, build/probe side, hash or sort-merge
Interesting properties: re-use partitioning and sorting for later operations
Automatic caching: e.g., for iterations

Visualization

Visualization tools
