1
NOVA: CONTINUOUS PIG/HADOOP WORKFLOWS
2
The storage & processing stack, from bottom to top:
- scalable file system, e.g. HDFS
- distributed sorting & hashing, e.g. Map-Reduce
- dataflow programming framework, e.g. Pig
- workflow manager, e.g. Nova
3
Nova Overview
Nova: a system for batched incremental processing.
Example scenarios at Yahoo!:
- Ingesting and analyzing user behavior logs
- Building and updating a search index from a stream of crawled web pages
- Processing semi-structured data (news, blogs, etc.)
Key features:
- Two-layer programming model (Nova over Pig)
- Continuous processing
- Independent scheduling
- Cross-module optimization
- Manageability features
4
Continuous Processing
- Nova: an outer workflow-manager layer that deals with graphs of interconnected Pig programs, with data passing through them in a continuous fashion.
- Pig/Hadoop: the inner layer, which merely transforms static input data into static output data.
Nova keeps track of "delta" data and routes it to the workflow components in the right order.
5
Independent Scheduling
Different portions of a workflow may be scheduled at different times and rates:
- Global link-analysis algorithms may be run only occasionally, due to their cost and consumers' tolerance for staleness.
- The components that ingest, tag, and index new news articles need to operate continuously.
6
Cross-module Optimization
Nova can identify and exploit certain optimization opportunities, e.g.:
- Two components read the same input data at the same time (the common scan can be shared).
- Pipelining: the output of one module feeds directly into a subsequent module => avoid materializing the intermediate result.
Manageability Features
- Manage workflow programming and execution.
- Support debugging; keep track of versions of workflow components.
- Capture data provenance and emit notifications of key events.
7
Workflow Model
A workflow has two kinds of vertices: tasks (processing steps) and channels (data containers). Edges connect tasks to channels and vice versa.
[Task] Consumption mode:
- ALL: read a complete snapshot
- NEW: read only the data that is new since the last invocation
[Task] Production mode:
- Base: produce a new complete snapshot
- Delta: produce new data that augments any existing data
8
Workflow Model
[Task] Four common patterns of processing:
- Non-incremental (e.g. template detection): process the data from scratch every time.
- Stateless incremental (e.g. shingling): process new data only; each data item is handled independently.
- Stateless incremental with lookup table (e.g. template tagging): process new data independently, possibly consulting a side lookup table for reference.
- Stateful incremental (e.g. de-duping): process new data while maintaining and referencing state derived from the prior input data.
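To make the contrast concrete, here is a minimal Python sketch of the stateless and stateful incremental patterns. The record shape, the shingling helper, and the seen-key state are illustrative assumptions, not Nova's API.

```python
# Stateless incremental (e.g. shingling): each new record is
# processed on its own; nothing carries over between invocations.
def process_stateless(new_records, k=4):
    out = []
    for r in new_records:
        words = r["text"].split()
        shingles = {tuple(words[i:i + k])
                    for i in range(max(0, len(words) - k + 1))}
        out.append({"url": r["url"], "shingles": shingles})
    return out

# Stateful incremental (e.g. de-duping): the set of keys seen so
# far is maintained across invocations and consulted while
# processing each batch of new data.
def process_stateful(new_records, seen_keys):
    fresh = []
    for r in new_records:
        if r["url"] not in seen_keys:
            seen_keys.add(r["url"])   # update the carried state
            fresh.append(r)
    return fresh, seen_keys
```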
9
Workflow Model (Cont.)
Data and Update Model
Blocks: a channel's data is divided into blocks, which vary in size.
- Blocks are atomic units: either processed entirely or discarded.
- Blocks are immutable.
Base block: contains a complete snapshot of the data on a channel as of some point in time. Base blocks are assigned increasing sequence numbers (B0, B1, B2, ..., Bn).
Delta block: used in conjunction with incremental processing; contains instructions for transforming a base block into a new base block (e.g. Δ0→1 transforms B0 into B1).
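A rough Python data model for channels and blocks (the class and field names are illustrative, not Nova's internals):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)            # blocks are immutable
class BaseBlock:
    seq: int                       # sequence number: B0, B1, ..., Bn
    records: dict                  # primary key -> record (full snapshot)

@dataclass(frozen=True)
class DeltaBlock:
    from_seq: int                  # transforms base B_from_seq ...
    to_seq: int                    # ... into base B_to_seq
    upserts: dict                  # primary key -> record to insert/replace

@dataclass
class Channel:
    name: str
    blocks: list = field(default_factory=list)   # mix of bases and deltas
```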
10
Workflow Model (Cont.)
Data and Update Model
Operators:
- Merging: combine a base block and a delta block (merging B0 with Δ0→1 yields B1).
- Diffing: compare two base blocks to create a delta block (diffing B0 and B1 yields Δ0→1).
- Chaining: combine multiple delta blocks (chaining Δ0→1 and Δ1→2 yields Δ0→2).
Upsert model: leverages the presence of a primary-key attribute to encode updates and inserts in a uniform way. With upserts, a delta block consists of records to be inserted, each one displacing any pre-existing record with the same key => only the most recent record with a given key is retained.
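A minimal sketch of the three operators under the upsert model, using plain Python dicts keyed by the primary key (note that pure upserts cannot express deletions, so `diff` below only emits new or changed records):

```python
def merge(base, delta):
    # Merging: B0 merged with D0->1 yields B1. Each upsert record
    # displaces any pre-existing record with the same key.
    out = dict(base)
    out.update(delta)
    return out

def diff(old_base, new_base):
    # Diffing: compare two bases and emit the upserts that turn
    # the old one into the new one.
    return {k: v for k, v in new_base.items() if old_base.get(k) != v}

def chain(*deltas):
    # Chaining: collapse consecutive deltas; when a key is touched
    # more than once, only the most recent upsert survives.
    out = {}
    for d in deltas:
        out.update(d)
    return out
```

Under these definitions, `merge(B0, chain(d01, d12))` equals `merge(merge(B0, d01), d12)`, which is what allows a channel to be compacted (later slide) without reprocessing its data.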
11
Workflow Model (Cont.)
Task/Data Interface
[Task] Consumption mode:
- ALL: read a complete snapshot
- NEW: read only the data that is new since the last invocation
[Task] Production mode:
- Base: produce a new complete snapshot
- Delta: produce new data that augments any existing data
12
Workflow Model (Cont.)
Workflow Programming and Scheduling
Workflow programming starts with task definitions, which are then composed into "workflowettes". A workflowette has ports to which input and output channels may be connected; with channels attached to its input and output ports, it becomes a bound workflowette.
Three types of trigger can be associated with a bound workflowette:
- Data-based trigger
- Time-based trigger
- Cascade trigger
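The three trigger types can be sketched as simple Python predicates (illustrative only; Nova's actual scheduler interface is not shown on these slides):

```python
import time

def data_trigger_fires(channel, cursor_seq):
    # Data-based: fire when the channel holds data newer than the
    # bound workflowette's cursor position.
    latest = max((getattr(b, "to_seq", getattr(b, "seq", -1))
                  for b in channel.blocks), default=-1)
    return latest > cursor_seq

def time_trigger_fires(last_run_ts, period_s):
    # Time-based: fire when the configured period has elapsed.
    return time.time() - last_run_ts >= period_s

def cascade_trigger_fires(upstream_completed):
    # Cascade: fire when an upstream workflowette has completed,
    # so its fresh output propagates downstream.
    return upstream_completed
```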
13
Workflow Model (Cont.)
Data Compaction and Garbage Collection
Data blocks are immutable, and channels keep accumulating them => a channel can grow without bound.
- If a channel has blocks B0, Δ0→1, Δ1→2, Δ2→3, the compaction operation computes B3 and adds it to the channel.
- After compaction has added B3, if the current consumer cursor is at sequence number 2, then B0, Δ0→1, and Δ1→2 can be garbage-collected (a consumer at cursor 2 still needs Δ2→3, and B3 serves any full-snapshot reads).
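Continuing the block sketch from earlier slides, compaction and the garbage-collection test could look like this (hypothetical helpers, reusing `merge`, `chain`, and the block classes defined above):

```python
def compact(base, deltas):
    # Fold a run of deltas into the base, producing the new base
    # block (e.g. B0 + D0->1 + D1->2 + D2->3 yields B3). The old
    # blocks remain on the channel until garbage collection.
    folded = chain(*(d.upserts for d in deltas))
    return BaseBlock(seq=deltas[-1].to_seq,
                     records=merge(base.records, folded))

def garbage_collectable(block, min_cursor_seq):
    # A block can be reclaimed once no consumer cursor needs it.
    # Assumes a base at or beyond the cursor exists (here B3):
    # with the cursor at 2, B0, D0->1, and D1->2 qualify, while
    # D2->3 and B3 must be kept.
    if isinstance(block, BaseBlock):
        return block.seq < min_cursor_seq
    return block.to_seq <= min_cursor_seq
```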
14
Tying the Model to Pig/Hadoop
- Each data block resides in an HDFS file.
- A metadata layer maintains the block-to-file mapping; the notion of a channel exists only in the metadata.
- Each task is a Pig program.
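A toy illustration of that metadata mapping; the channel name, block identifiers, and HDFS paths are all made up:

```python
# Channel and block structure exists only in metadata; the bytes
# live in ordinary HDFS files. All paths here are hypothetical.
metadata = {
    ("crawled_pages", "B0"):   "/nova/crawled_pages/base-0000",
    ("crawled_pages", "D0-1"): "/nova/crawled_pages/delta-0000-0001",
    ("crawled_pages", "B1"):   "/nova/crawled_pages/base-0001",
}

def hdfs_file_for(channel, block_id):
    return metadata[(channel, block_id)]
```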
16
Nova System Architecture