1
Apache Flink
Stephan Ewen, Flink committer, co-founder / CTO @ data Artisans, @StephanEwen
2
Looking back one year
3
April 16, 2014
4
Stack: Pact API (Java), DataSet API (Scala), Stratosphere Optimizer, Stratosphere Runtime (Local, Remote)
Features: iterations, Yarn support, local execution, accumulators, web frontend, HBase, JDBC, Windows compatibility, mvn central
Batch processing on a pipelining engine, with iterations …
5
Looking at now…
6
What is Apache Flink?
Flink (master)
- Real-time data streams (event logs: Kafka, RabbitMQ, ...): low latency, windowing, aggregations, ...
- Historic data (HDFS, JDBC, ...): ETL, graphs, machine learning, relational, …
7
What is Apache Flink? (stack diagram)
- Libraries and compatibility layers: Python, Gelly, Table, ML, SAMOA, Dataflow, Hadoop M/R
- APIs: DataSet (Java/Scala), DataStream (Java/Scala)
- Translation: Flink Optimizer, Stream Builder
- Runtime: Flink Dataflow Runtime
- Deployment: Local, Remote, Yarn, Tez, Embedded
- Storage and connectors: HDFS, HCatalog, HBase, JDBC, Kafka, RabbitMQ, Flume
8
Batch / Streaming APIs

    case class Word (word: String, frequency: Int)

DataSet API (batch):

    val lines: DataSet[String] = env.readTextFile(...)
    lines.flatMap {line => line.split(" ")
                               .map(word => Word(word,1))}
         .groupBy("word").sum("frequency")
         .print()

DataStream API (streaming):

    val lines: DataStream[String] = env.fromSocketStream(...)
    lines.flatMap {line => line.split(" ")
                               .map(word => Word(word,1))}
         .window(Count.of(1000)).every(Count.of(100))
         .groupBy("word").sum("frequency")
         .print()
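To make the result of the batch program concrete without a Flink cluster, here is a minimal sketch of the same word count on plain Scala collections. It is not Flink code: `WordCountSketch` and `count` are hypothetical names, and the standard-library `groupBy` stands in for the DataSet `groupBy("word").sum("frequency")`.

```scala
object WordCountSketch {
  final case class Word(word: String, frequency: Int)

  /** Counts words in memory; mirrors flatMap -> groupBy -> sum from the slide. */
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split(" ").filter(_.nonEmpty).map(w => Word(w, 1)))
      .groupBy(_.word)                                       // one group per distinct word
      .map { case (w, ws) => w -> ws.map(_.frequency).sum }  // sum the frequencies per group
}
```

Calling `WordCountSketch.count(Seq("to be or", "not to be"))` yields one entry per distinct word with its total frequency.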
9
Technology inside Flink
    case class Path (from: Long, to: Long)

    val tc = edges.iterate(10) { paths: DataSet[Path] =>
      val next = paths
        .join(edges).where("to").equalTo("from") {
          (path, edge) => Path(path.from, edge.to)
        }
        .union(paths)
        .distinct()
      next
    }

Pre-flight (Client): Program, Type extraction stack, Dataflow Graph, Cost-based optimizer
Master: Task scheduling, Recovery metadata (deploys operators, tracks intermediate results)
Workers: Memory manager, Out-of-core algos, Batch & Streaming, State & Checkpoints
Example plan (figure labels): DataSource orders.tbl, Filter, Map, DataSource lineitem.tbl, Join (Hybrid Hash, buildHT / probe), hash-part [0], forward, GroupRed (sort)
10
Flink by Feature / Use Case
11
Data Streaming Analysis
12
Life of data streams
- Create: create streams from event sources (machines, databases, logs, sensors, …)
- Collect: collect and make streams available for consumption (e.g., Apache Kafka)
- Process: process streams, possibly generating derived streams (e.g., Apache Flink)
13
Stream Analysis in Flink
14
Defining windows in Flink
Trigger policy
- When to trigger the computation on the current window
Eviction policy
- When data points should leave the window
- Defines window width/size
E.g., count-based policy
- evict when #elements > n
- start a new window every n-th element
Built-in: Count, Time, Delta policies
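The interplay of the two policies can be sketched in plain Scala, independent of Flink. This is a simplified model under stated assumptions, not Flink's implementation: `CountWindowSketch`, `slidingCountWindows`, `size`, and `slide` are hypothetical names, eviction is applied before the trigger check, and the window is emitted every `slide` elements.

```scala
object CountWindowSketch {
  /**
   * Count-based eviction: keep at most `size` elements in the window.
   * Count-based trigger: emit the current window after every `slide` elements.
   */
  def slidingCountWindows[A](input: Seq[A], size: Int, slide: Int): Vector[Vector[A]] = {
    require(size > 0 && slide > 0, "size and slide must be positive")
    var window = Vector.empty[A]
    var sinceTrigger = 0
    val out = Vector.newBuilder[Vector[A]]
    for (x <- input) {
      window = window :+ x
      // Eviction policy: drop the oldest elements once #elements > size.
      if (window.length > size) window = window.drop(window.length - size)
      sinceTrigger += 1
      // Trigger policy: fire the computation every `slide`-th element.
      if (sinceTrigger == slide) {
        out += window
        sinceTrigger = 0
      }
    }
    out.result()
  }
}
```

With `size = 1000` and `slide = 100`, this corresponds roughly to the `.window(Count.of(1000)).every(Count.of(100))` call shown in the streaming example.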
15
Checkpointing / Recovery
Flink acknowledges batches of records
- Less overhead in the failure-free case
- Currently tied to fault-tolerant data sources (e.g., Kafka)
Flink operators can keep state
- State is checkpointed
- Checkpointing and record acks go together
Exactly-once semantics for state
16
Checkpointing / Recovery
Pushes checkpoint barriers through the data flow
- Before barrier = part of the snapshot (backup till next snapshot)
- After barrier = not in snapshot
Operator states shown in the figure: checkpoint starting, checkpoint in progress, checkpoint done
Chandy-Lamport algorithm for consistent asynchronous distributed snapshots
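The "before barrier / after barrier" rule can be illustrated with a deliberately small model: a single operator with one input channel that keeps a running sum as its state. This sketch is an assumption-laden simplification and not Flink's runtime; `BarrierSketch`, `Record`, `Barrier`, and `run` are hypothetical names, and barrier alignment across multiple input channels is left out.

```scala
object BarrierSketch {
  sealed trait Event
  final case class Record(value: Int) extends Event
  final case class Barrier(checkpointId: Long) extends Event

  /**
   * Processes the stream in order. Records update the operator state (a sum).
   * A barrier snapshots the state as it is at that point, so only records
   * that arrived before the barrier are reflected in that snapshot.
   * Returns the final state and the snapshots keyed by checkpoint id.
   */
  def run(events: Seq[Event]): (Int, Map[Long, Int]) =
    events.foldLeft((0, Map.empty[Long, Int])) {
      case ((sum, snapshots), Record(v))   => (sum + v, snapshots)
      case ((sum, snapshots), Barrier(id)) => (sum, snapshots + (id -> sum))
    }
}
```

After a failure, an operator in this model would restore the latest snapshot and the source would replay the records that followed the corresponding barrier.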
17
Heavy ETL Pipelines
18
Heavy Data Pipelines
Complex ETL programs
Apology: Graph had to be blurred for online slides, due to confidentiality
19
Memory Management
Flink contains its own memory management stack. Memory is allocated, de-allocated, and used strictly through an internal buffer pool implementation. To do that, Flink contains its own type extraction and serialization components.

Unmanaged: user code objects, e.g.

    public class WC {
      public String word;
      public int count;
    }

Managed: pool of memory pages (empty pages; pages used for sorting, hashing, caching; pages used for shuffling, broadcasts)
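The idea of serializing objects into pages taken from a fixed pool can be sketched as follows. This is an illustrative model only, not Flink's memory manager or serializers: `PagePoolSketch`, `PagePool`, `write`, and `readAll` are hypothetical names, and the record layout (length-prefixed UTF-8 word followed by the count) is an assumption made for the example.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets
import scala.collection.mutable

object PagePoolSketch {
  final case class WC(word: String, count: Int)

  /** A fixed pool of equally sized pages; nothing is allocated after construction. */
  final class PagePool(numPages: Int, pageSize: Int) {
    private val free = mutable.ArrayBuffer.fill(numPages)(ByteBuffer.allocate(pageSize))
    def available: Int = free.length
    def request(): Option[ByteBuffer] =
      if (free.isEmpty) None else Some(free.remove(free.length - 1))
    def release(page: ByteBuffer): Unit = {
      page.clear() // reset position and limit so the page can be reused
      free += page
    }
  }

  /** Serializes a record into the page; returns false if it does not fit. */
  def write(page: ByteBuffer, wc: WC): Boolean = {
    val bytes = wc.word.getBytes(StandardCharsets.UTF_8)
    if (page.remaining() < 8 + bytes.length) false
    else {
      page.putInt(bytes.length)
      page.put(bytes)
      page.putInt(wc.count)
      true
    }
  }

  /** Deserializes all records written so far without disturbing the page. */
  def readAll(page: ByteBuffer): Vector[WC] = {
    val view = page.duplicate()
    view.flip()
    val out = Vector.newBuilder[WC]
    while (view.hasRemaining()) {
      val bytes = new Array[Byte](view.getInt())
      view.get(bytes)
      out += WC(new String(bytes, StandardCharsets.UTF_8), view.getInt())
    }
    out.result()
  }
}
```

In this model, a `false` from `write` or a `None` from `request` is the point where an algorithm would spill a full page to disk rather than fail with an out-of-memory error.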
20
Smooth out-of-core performance
Single-core join of 1KB Java objects beyond memory (4 GB)
Blue bars are in-memory, orange bars (partially) out-of-core
21
Benefits of managed memory
More reliable and stable performance (fewer GC effects, easy to go to disk)
22
Table API

    val customers = env.readCsvFile(…)
      .as('id, 'mktSegment)
      .filter( 'mktSegment === "AUTOMOBILE" )

    val orders = env.readCsvFile(…)
      .filter( o => dateFormat.parse(o.orderDate).before(date) )
      .as('orderId, 'custId, 'orderDate, 'shipPrio)

    val items = orders
      .join(customers).where('custId === 'id)
      .join(lineitems).where('orderId === 'id)
      .select('orderId, 'orderDate, 'shipPrio,
              'extdPrice * (Literal(1.0f) - 'discount) as 'revenue)

    val result = items
      .groupBy('orderId, 'orderDate, 'shipPrio)
      .select('orderId, 'revenue.sum, 'orderDate, 'shipPrio)
23
Iterations in Data Flows Machine Learning Algorithms
24
Iterate by looping
- for/while loop in client submits one job per iteration step
- Data reuse by caching in memory and/or disk
Figure: a client submitting a sequence of Step jobs
25
Iterate in the Dataflow
26
Large-Scale Machine Learning
Factorizing a matrix with 28 billion ratings for recommendations (scale of Netflix or Spotify)
27
State in Iterations Graphs and Machine Learning
28
Iterate natively with deltas
Figure: delta iteration. Labels: initial workset, workset (replaced in each step), operators A, B, X, Y, other datasets, initial solution set, partial solution set, delta (merge deltas), iteration result.
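The workset / solution set pattern can be sketched on plain Scala collections with connected components as the example. This is an in-memory model under stated assumptions and not Flink's delta iteration operator: `DeltaIterationSketch` and `connectedComponents` are hypothetical names, edges are treated as undirected, and each vertex ends up labeled with the smallest vertex id in its component.

```scala
object DeltaIterationSketch {
  def connectedComponents(vertices: Set[Long], edges: Set[(Long, Long)]): Map[Long, Long] = {
    val all = vertices ++ edges.flatMap { case (a, b) => Set(a, b) }
    val neighbors: Map[Long, Set[Long]] =
      (edges ++ edges.map(_.swap)).groupBy(_._1).map { case (v, es) => v -> es.map(_._2) }

    // Solution set: current component id per vertex. Workset: vertices changed in the last step.
    var solution: Map[Long, Long] = all.map(v => v -> v).toMap
    var workset: Map[Long, Long] = solution

    while (workset.nonEmpty) {
      // Only vertices in the workset send their component id to their neighbors.
      val candidates = workset.toSeq.flatMap { case (v, comp) =>
        neighbors.getOrElse(v, Set.empty[Long]).toSeq.map(n => n -> comp)
      }
      // Delta: neighbors that would receive a smaller component id than they have now.
      val delta = candidates
        .groupBy(_._1)
        .map { case (n, cs) => n -> cs.map(_._2).min }
        .filter { case (n, comp) => comp < solution(n) }
      solution = solution ++ delta // merge deltas into the solution set
      workset = delta              // replace the workset
    }
    solution
  }
}
```

The workset typically shrinks from step to step, so later steps touch only the few elements that still change; the loop ends when the workset is empty.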
29
Effect of delta iterations…
Figure: number of elements updated, plotted per iteration
30
… very fast graph analysis
Performance competitive with dedicated graph analysis systems
… and mix and match ETL-style and graph analysis in one program
31
Closing
32
Flink Roadmap for 2015
- Out-of-core state in Streaming
- Monitoring and scaling for streaming
- Streaming Machine Learning with SAMOA
- More additions to the libraries
  - Batch Machine Learning
  - Graph library additions (more algorithms)
- SQL on top of expression language
- Master failover
33
Flink community #unique contributor ids by git commits
dev list: messages/month. record 1000 messages on
34
flink.apache.org @ApacheFlink
35
Backup
36
Cornerpoints of Flink Design
Flexible Data Streaming Engine
- Low Latency Stream Proc.
- Highly flexible windows
Robust Algorithms on Managed Memory
- No OutOfMemory Errors
- Scales to very large JVMs
- Efficient and robust processing
High-level APIs, beyond key/value pairs
- Java/Scala/Python (upcoming)
- Relational-style optimizer
Pipelined Execution of Batch Programs
- Better shuffle performance
- Scales to very large groups
Active Library Development
- Graphs / Machine Learning
- Streaming ML (coming)
Native Iterations
- Very fast Graph Processing
- Stateful Iterations for ML
37
Program optimization
38
A simple program

    val orders = …
    val lineitems = …

    val filteredOrders = orders
      .filter(o => dateFormat.parse(o.shipDate).after(date))
      .filter(o => o.shipPrio > 2)

    val lineitemsOfOrders = filteredOrders
      .join(lineitems)
      .where("orderId").equalTo("orderId")
      .apply((o,l) => new SelectedItem(o.orderDate, l.extdPrice))

    val priceSums = lineitemsOfOrders
      .groupBy("orderDate").sum("extdPrice")
39
Two execution plans
Best plan depends on relative sizes of input files
Figure labels (both plans): DataSource orders.tbl, Filter; DataSource lineitem.tbl, Map; Join (Hybrid Hash, buildHT / probe); Combine; GroupRed (sort). Shipping strategies shown: forward, broadcast, hash-part [0], hash-part [0,1].
40
Examples of optimization
Task chaining
- Coalesce map/filter/etc tasks
Join optimizations
- Broadcast/partition, build/probe side, hash or sort-merge
Interesting properties
- Re-use partitioning and sorting for later operations
Automatic caching
- E.g., for iterations
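Why the best join plan depends on the relative input sizes can be shown with a toy cost comparison. The cost model below is an assumption made purely for illustration (network bytes only: a broadcast input is shipped once per parallel instance, a repartitioned input once in total); it is not the cost model of the Flink optimizer, and `JoinStrategySketch`, `choose`, and the strategy names are hypothetical.

```scala
object JoinStrategySketch {
  sealed trait Strategy
  case object BroadcastFirst extends Strategy  // replicate the first input to every instance
  case object BroadcastSecond extends Strategy // replicate the second input to every instance
  case object RepartitionBoth extends Strategy // hash-partition both inputs on the join key

  /** Picks the strategy with the lowest estimated number of shipped bytes. */
  def choose(firstBytes: Long, secondBytes: Long, parallelism: Int): Strategy = {
    val costs: Seq[(Strategy, Long)] = Seq(
      BroadcastFirst  -> firstBytes * parallelism,
      BroadcastSecond -> secondBytes * parallelism,
      RepartitionBoth -> (firstBytes + secondBytes)
    )
    costs.minBy(_._2)._1
  }
}
```

Under this model a tiny filtered `orders` input joined with a large `lineitem` input favors broadcasting the small side, while two inputs of similar size favor repartitioning both.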
41
Visualization
42
Visualization tools
43
Visualization tools
44
Visualization tools