
NOVA: CONTINUOUS PIG/HADOOP WORKFLOWS

Storage & processing stack, from bottom to top:
- scalable file system, e.g. HDFS
- distributed sorting & hashing, e.g. Map-Reduce
- dataflow programming framework, e.g. Pig
- workflow manager, e.g. Nova

Nova Overview
Nova: a system for batched incremental processing. Scenarios at Yahoo!:
- Ingesting and analyzing user behavior logs
- Building and updating a search index from a stream of crawled web pages
- Processing semi-structured data (news, blogs, etc.)
Key features:
- Two-layer programming model (Nova over Pig)
- Continuous processing
- Independent scheduling
- Cross-module optimization
- Manageability features

Continuous Processing
- Nova: the outer workflow-manager layer. It deals with graphs of interconnected Pig programs, with data passing between them in a continuous fashion.
- Pig/Hadoop: the inner layer. It merely deals with transforming static input data into static output data.
Nova keeps track of "delta" data and routes it to the workflow components in the right order.
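This division of labor can be sketched as a toy model in Python (hypothetical names, not Nova's actual code): tasks play the Pig/Hadoop role of pure batch functions over static input, while the outer manager tracks a per-task cursor into each channel and routes only the delta blocks a task has not yet consumed.

```python
# Toy sketch of Nova's outer layer (illustrative only): tasks are
# batch functions over static input; the manager tracks which delta
# blocks each task has already seen and routes only the new ones.

class Channel:
    def __init__(self):
        self.blocks = []          # append-only list of delta blocks

def run_workflow(channel, tasks, cursors):
    """Invoke each task on the delta blocks it has not yet consumed."""
    for name, task in tasks.items():
        new = channel.blocks[cursors[name]:]   # "NEW" consumption mode
        if new:
            task(new)                          # inner (Pig-like) layer
            cursors[name] = len(channel.blocks)

seen = []
tasks = {"indexer": lambda deltas: seen.extend(deltas)}
cursors = {"indexer": 0}

ch = Channel()
ch.blocks.append(["page1", "page2"])           # a delta arrives
run_workflow(ch, tasks, cursors)
ch.blocks.append(["page3"])                    # another delta arrives
run_workflow(ch, tasks, cursors)
# the indexer task saw each delta exactly once
```

Repeated invocations over accumulating deltas are what makes the processing "continuous" from the workflow's point of view, even though each inner invocation is an ordinary static batch job.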

Independent Scheduling
Different portions of a workflow may be scheduled at different times/rates:
- Global link-analysis algorithms may be run only occasionally, due to their costly nature and consumers' tolerance for staleness.
- The components that ingest, tag, and index new news articles need to operate continuously.

Cross-module Optimization
Nova can identify and exploit certain optimization opportunities, e.g.:
- Two components read the same input data at the same time.
- Pipelining: the output of one module serves as the input of a subsequent module => avoid materializing the intermediate result.

Manageability Features
- Manage workflow programming and execution.
- Support debugging; keep track of versions of workflow components.
- Capture data provenance and emit notifications of key events.

Workflow Model  Workflow -Two kinds of vertices: tasks (processing steps) and channels (data containers) -Edges connect tasks to channels and vise versa. [Task] Consumption mode: ALL: read a complete snapshot NEW: only new data since the last invocation [Task] Production mode: B: new complete snapshot Delta: new data that augments any existed data

Workflow Model
[Task] Four common patterns of processing:
- Non-incremental (e.g. template detection): process the data from scratch every time.
- Stateless incremental (e.g. shingling): process new data only; each data item is handled independently.
- Stateless incremental with lookup table (e.g. template tagging): process new data independently, possibly consulting a side lookup table for reference.
- Stateful incremental (e.g. de-duping): process new data while maintaining and referencing state derived from prior input data.
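The key distinction among these patterns is whether a task carries state across invocations. A minimal sketch (hypothetical functions, not from the paper) contrasting the stateless and stateful incremental cases:

```python
# Illustrative contrast of incremental processing patterns
# (hypothetical code, not Nova's API).

def stateless_tag(new_records):
    # Stateless incremental: each new record is handled independently,
    # e.g. shingling or template tagging. No state survives the call.
    return [r.upper() for r in new_records]

def stateful_dedupe(new_records, seen):
    # Stateful incremental: 'seen' persists across invocations and
    # encodes state derived from all prior input, e.g. de-duping.
    fresh = [r for r in new_records if r not in seen]
    seen.update(fresh)
    return fresh

seen = set()
out1 = stateful_dedupe(["a", "b"], seen)   # first invocation
out2 = stateful_dedupe(["b", "c"], seen)   # "b" already seen, dropped
tags = stateless_tag(["a", "b"])           # per-record work only
```

The stateful variant is the expensive one to support: its state must be stored somewhere durable between invocations, which is exactly what channels of base and delta blocks provide.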

Workflow Model (Cont.)  Data and Update Model Blocks: A channel’s data is divided into blocks. They vary in size. -Blocks are atomic units (either be processed entirely or discarded) -Blocks are immutable. Contains a complete snapshot of data on a channel as of some point in time Base blocks are assigned increasing sequence numbers(B 0, B 1, B 2…… B n ) Base block Used in conjunction with incremental processing Contains instructions for transforming a base block into a new base block( ) Delta block

Workflow Model (Cont.)  Data and Update Model Operators: -Merging: combine base and delta blocks: -Diffing: Compare 2 base blocks to create a delta block -Chaining: combine multiple delta blocks Upsert model: Leverages the presence of a primary key attribute to encode updates and inserts in a uniform way. With upserts, delta blocks are comprised of records to be inserted, with each one displacing any pre-existing record with the same key => retain only the most recent record with a given key.

Workflow Model (Cont.)  Task/Data Interface: [Task] Consumption mode: ALL: read a complete snapshot NEW: only new data since the last invocation [Task] Production mode: B: new complete snapshot Delta: new data that augments any existed data

Workflow Model (Cont.)  Workflow Programming and Scheduling Workflows programming starts with task definitions, then compose them into “workflowettes”. Workflowettes have ports to which input and output channels they may connect. Channels attached to the input and output ports of a workflowette => bound workflowette. 3 types of trigger associated with a workflowette:  Data-based trigger.  Time-based trigger.  Cascade trigger.

Workflow Model (Cont.)
Data Compaction and Garbage Collection
Data blocks are immutable, and channels accumulate them => a channel can grow without bound.
- If a channel has blocks B0, Δ(0→1), Δ(1→2), Δ(2→3), the compaction operation computes B3 and adds it to the channel.
- After compaction has added B3 to the channel, if the current cursor is at sequence number 2, then B0, Δ(0→1), and Δ(1→2) can be garbage-collected.
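The slide's example can be modeled concretely (hypothetical code; block payloads are stand-in strings and the cursor logic is simplified to a single consumer):

```python
# Compaction and garbage collection on a channel (illustrative).
# Each entry: (sequence number, kind, payload).
channel = [(0, "base", "B0"), (1, "delta", "d01"),
           (2, "delta", "d12"), (3, "delta", "d23")]

def compact(channel):
    """Append a base block equal to the current snapshot; here we only
    record it symbolically -- a real system would merge the payloads."""
    last_seq = channel[-1][0]
    channel.append((last_seq, "base", "B%d" % last_seq))  # adds B3

def garbage_collect(channel, cursor):
    """A consumer whose cursor is at sequence n still needs blocks with
    sequence > n (including the compacted base); older blocks go away."""
    return [b for b in channel if b[0] > cursor]

compact(channel)
survivors = garbage_collect(channel, cursor=2)
# B0, d01, d12 are dropped; d23 and the compacted base B3 remain
```

ALL-mode consumers now read the single compacted base, while the NEW-mode consumer at cursor 2 still finds Δ(2→3) waiting for it.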

Tying the Model to Pig/Hadoop
- Each data block resides in an HDFS file; a metadata layer maintains the mapping.
- The notion of a channel exists only in the metadata.
- Each task is a Pig program.


Nova System Architecture