Interpreting the data: Parallel analysis with Sawzall LIN Wenbin 25 Mar 2014.


OUTLINE
- Introduction
- Motivation
- System model
- Language
- Performance
- Future work & Conclusion

Introduction
Many data sets are too large and too dynamic to analyze on a single machine:
- Files are distributed across many disks on many computers
- An analysis may consume months of CPU time; with a thousand machines, it takes only a few hours of real time
The calculation is therefore broken into two phases:
- Evaluate the analysis on each record individually
- Aggregate the results

GFS and MapReduce provide fault tolerance and reliability, and a powerful framework upon which to implement a large, parallel system for distributed analysis. Sawzall expresses the analysis cleanly and executes it quickly.

Motivation
Google's server logs are:
- Stored as large collections of records (protocol buffers)
- Partitioned over many disks within the Google File System (GFS)
To perform calculations over the logs, engineers write MapReduce programs.

Motivation
- Parallelism: separating out the aggregators and providing a restricted model for distributed processing (one record at a time)
- Clearer, more compact, more expressive programs
- Support for domain-specific types at a lower level
- Easier to write quick scripts

System model
- The Sawzall language is implemented in C++; the compiler and byte-code interpreter are part of the same binary
- Aggregators are implemented by a program called saw
- The system is implemented above MapReduce, running Sawzall in the map phase and the aggregators in the reduce phase
- MapReduce manages the distribution of execution and aggregation machines, locates the computation near the data, and handles machine failures and other faults

System model
1. Input is located on multiple storage nodes
2. The input is divided into pieces to be processed separately
3. A Sawzall interpreter is instantiated for each piece of data
4. The Sawzall program operates on each input record individually
5. The output of the program is a set of intermediate values for each record
6. These intermediate values are sent to further computation nodes running the aggregators
7. After being collated and reduced, the final results are created
(In a typical run, the majority of machines run Sawzall and a smaller fraction run the aggregators.)

System model
The saw command takes several parameters:
- program: the Sawzall source file
- input_files: the input files; standard Unix shell file-name-matching metacharacters may be used
- destination: the names of the output files; a suffix gives the number of files (the number of aggregation machines)

Language
- Example 1
- Overview
- Aggregators
- Indexed aggregators
- Example 2
- Quantifiers
- Example 3

Example 1 Return the number of records, the sum of the values, and the sum of the squares of the values
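This is the first example program in the Sawzall paper, transcribed here (the aggregator names count, total, and sum_of_squares follow the paper):

```
count: table sum of int;
total: table sum of float;
sum_of_squares: table sum of float;

x: float = input;  # each input record is converted to a float

emit count <- 1;
emit total <- x;
emit sum_of_squares <- x * x;
```

Each interpreter instance runs this once per record; the three sum tables are combined in the aggregation phase to give the final counts.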

Overview
Basic types:
- int: signed 64-bit quantity
- float: 64-bit IEEE floating-point value
- bool: Boolean value
- time: unsigned 64-bit quantity recording microseconds
Array-like types:
- bytes: string of 8-bit unsigned bytes
- string: string of 16-bit Unicode characters
Compound types:
- array: an (unspecified) number of elements, all of the same type
- map: key-value pairs
- tuple: a fixed number of members of possibly different types

Overview
- Declarations
- Statements
- emit (sends intermediate values to the aggregators)
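A minimal sketch combining a declaration, a statement, and emit (the table name total_requests is illustrative):

```
total_requests: table sum of int;   # aggregator (table) declaration
x: int = 1;                         # variable declaration
emit total_requests <- x;           # send an intermediate value to the aggregator
```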

Overview
- proto: imports the DDL for a protocol buffer from a file
- static: marks a value as initialized once, avoiding re-initialization for every record
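For instance (the file name and constant are illustrative):

```
proto "querylog.proto"      # import the protocol buffer definition (DDL)

static CUTOFF: int = 100;   # evaluated once, not once per input record
```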

Aggregators
- collection: a list of all the emitted values, duplicates included, in arbitrary order
- sum: the summation of all the emitted arithmetic values
- maximum: the highest-weighted values
- top: the most popular values
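Sketches of the corresponding table declarations (the names and parameters are illustrative):

```
c: table collection of string;                           # all emitted strings
s: table sum of int;                                     # running total
best: table maximum(10) of s: string weight v: int;      # 10 highest-weighted strings
popular: table top(100) of s: string weight count: int;  # 100 most popular strings
```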

Indexed aggregators
- An aggregator can be indexed, which creates a distinct individual aggregator for each unique index value
- Example: find the 1000 most popular requests for each country, for each hour
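Following the slide's example, such a table could be declared as (names are illustrative):

```
# one top(1000) aggregator per (country, hour) pair
popular_requests: table top(1000)[country: string][hour: int]
    of request: string weight count: int;
```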

Example 2 Show how the queries are distributed around the globe

Example 2: query distribution plotted on a world map (figure on slide)
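The program behind this map, following the example in the Sawzall paper (transcribed from the paper; it relies on the locationinfo() intrinsic to map an IP address to a latitude/longitude):

```
proto "querylog.proto"

queries_per_degree: table sum[lat: int][lon: int] of int;

log_record: QueryLogProto = input;
loc: Location = locationinfo(log_record.ip);
emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
```

Each cell of the indexed sum table counts the queries originating from one degree-by-degree patch of the globe.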

Quantifiers
- The when statement defines a quantifier, a variable, and a Boolean condition
- Three quantifier types:
  - some: executes if the condition is true for any value (an arbitrary choice if more than one value qualifies)
  - each: executes for all the values that satisfy the condition
  - all: executes if the condition is true for all valid values
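A minimal sketch, where B and F are hypothetical functions: run F on some element of array a for which B holds:

```
when (i: some int; B(a[i]))
    F(a[i]);
```

The interpreter chooses the index i; the programmer never writes the search loop.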

Example 3 Count the occurrences of certain words, for each day
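A program of this shape, following the word-counting benchmark described in the paper (the keyword list and the helper functions words_from_query() and day_of_query() are illustrative):

```
result: table sum[keyword: string][day: int] of int;

static keywords: array of string =
    { "hitchhiker", "benedict", "vytorin", "itanium", "aardvark" };

querywords: array of string = words_from_query();
day: int = day_of_query();

# for each query word, if it matches some keyword, count it for that day
when (i: each int; j: some int; querywords[i] == keywords[j])
    emit result[keywords[j]][day] <- 1;
```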

Performance
- Test the single-CPU speed of the Sawzall interpreter, comparing with other interpreted languages
- Test how the speed scales by running a program on different numbers of machines

Test the single-CPU speed of the Sawzall interpreter
- Benchmark 1: compute pixel values for displaying the Mandelbrot set, measuring basic arithmetic and loop performance
- Benchmark 2: a recursive function calculating the first 35 Fibonacci numbers, measuring function invocation
- Hardware: a 2.8 GHz x86 desktop machine

Test the single-CPU speed of the Sawzall interpreter: results
- 1.6 times slower than interpreted Java
- 21 times slower than compiled Java
- 51 times slower than compiled C++

Test how the speed scales
- Input: a 450 GB sample of compressed query log data
- Program: count the occurrences of certain words
- Hardware: from 50 Xeon computers upward

Test how the speed scales: the Sawzall benchmark program (shown on slide)

Test how the speed scales: results
- The solid line is elapsed time; the dashed line is the product of machines and elapsed time
- The machine-minutes product degrades only 30%

Future work
- Aggressive compilation: compile once per machine and apply the accelerated binary to each input record
- More complex analyses: an interface to query an external database, suspending the processing of one record while it waits
- Language extensions: multiple passes over the data; join operations to combine data from multiple input sources

Conclusion
- A new interpreted programming language called Sawzall
- A programming model that processes one record at a time
- An interface to a novel set of aggregators
- Programmers write short, clear programs that are guaranteed to work well on thousands of machines in parallel, while knowing nothing about parallel programming

THANK YOU