New Ideas Track: Testing MapReduce-Style Programs

Slides:



Advertisements
Similar presentations
Intro to Map-Reduce Feb 21, map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems…
Advertisements

Introduction to Data Center Computing Derek Murray October 2010.
Parallel Computing MapReduce Examples Parallel Efficiency Assignment
Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.
Paula Ta-Shma, IBM Haifa Research 1 “Advanced Topics on Storage Systems” - Spring 2013, Tel-Aviv University Big Data and.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
Lecture 2 – MapReduce CPE 458 – Parallel Programming, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative.
HADOOP ADMIN: Session -2
SIDDHARTH MEHTA PURSUING MASTERS IN COMPUTER SCIENCE (FALL 2008) INTERESTS: SYSTEMS, WEB.
MapReduce.
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe.
DETECTING NEAR-DUPLICATES FOR WEB CRAWLING Authors: Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma Presentation By: Fernando Arreola.
Google MapReduce Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Presented by Conroy Whitney 4 th year CS – Web Development.
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Parallel Programming Models Basic question: what is the “right” way to write parallel programs –And deal with the complexity of finding parallelism, coarsening.
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat.
1 The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli and some other.
MapReduce – An overview Medha Atre (May 7, 2008) Dept of Computer Science Rensselaer Polytechnic Institute.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
HAMS Technologies 1
CSE 486/586 CSE 486/586 Distributed Systems Data Analytics Steve Ko Computer Sciences and Engineering University at Buffalo.
1 Dryad Distributed Data-Parallel Programs from Sequential Building Blocks Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly of Microsoft.
MapReduce How to painlessly process terabytes of data.
Benchmarking MapReduce-Style Parallel Computing Randal E. Bryant Carnegie Mellon University.
A Map-Reduce System with an Alternate API for Multi-Core Environments Wei Jiang, Vignesh T. Ravi and Gagan Agrawal.
MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.
2007. Software Engineering Laboratory, School of Computer Science S E Web-Harvest Web-Harvest: Open Source Web Data Extraction tool 이재정 Software Engineering.
Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins Pig Latin: A Not-So-Foreign Language For Data Processing Research.
SLIDE 1IS 240 – Spring 2013 MapReduce, HBase, and Hive University of California, Berkeley School of Information IS 257: Database Management.
By Jeff Dean & Sanjay Ghemawat Google Inc. OSDI 2004 Presented by : Mohit Deopujari.
Chapter 5 Ranking with Indexes 1. 2 More Indexing Techniques n Indexing techniques:  Inverted files - best choice for most applications  Suffix trees.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
MapReduce: Simplified Data Processing on Large Clusters By Dinesh Dharme.
CMU SCS KDD '09Faloutsos, Miller, Tsourakakis P8-1 Large Graph Mining: Power Tools and a Practitioner’s guide Task 8: hadoop and Tera/Peta byte graphs.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Hive Big data for CSci 4707 students! Eric Atherton and Henry Hoang.
MapReduce using Hadoop Jan Krüger … in 30 minutes...
Image taken from: slideshare
”Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters” Published In SIGMOD '07 By Yahoo! Senthil Nathan N IIT Bombay.
CS239-Lecture 3 DryadLINQ Madan Musuvathi Visiting Professor, UCLA
Distributed Programming in “Big Data” Systems Pramod Bhatotia wp
The Map-Reduce Framework
Map Reduce.
15-826: Multimedia Databases and Data Mining
Abstract Major Cloud computing companies have started to integrate frameworks for parallel data processing in their product portfolio, making it easy for.
Introduction to MapReduce and Hadoop
A Test Case + Mock Class Generator for Coding Against Interfaces
Central Florida Business Intelligence User Group
Ministry of Higher Education
MapReduce Simplied Data Processing on Large Clusters
New Ideas Track: Testing MapReduce-Style Programs Christoph Csallner, Leonidas Fegaras, Chengkai Li Computer.
Map reduce use case Giuseppe Andronico INFN Sez. CT & Consorzio COMETA
Cse 344 May 4th – Map/Reduce.
Hadoop Technopoints.
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
Distributed System Gang Wu Spring,2018.
Charles Tappert Seidenberg School of CSIS, Pace University
MAPREDUCE TYPES, FORMATS AND FEATURES
Overview of Workflows: Why Use Them?
A Map-Reduce System with an Alternate API for Multi-Core Environments
5/7/2019 Map Reduce Map reduce.
Chapter 2 Lin and Dyer & MapReduce Basics Chapter 2 Lin and Dyer &
MapReduce: Simplified Data Processing on Large Clusters
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Map Reduce, Types, Formats and Features
Presentation transcript:

New Ideas Track: Testing MapReduce-Style Programs Christoph Csallner, Leonidas Fegaras, Chengkai Li Computer Science and Engineering Department University of Texas at Arlington (UTA) European Software Engineering Conference / ACM SIGSOFT Symposium on the Foundations of Software Engineering ESEC/FSE, Szeged, Hungary, Thursday, Sep. 8, 2011 MapReduce, what is it? Google paper What kind of problems is it useful for? Process web-scale data, in parallel How does it work? Program sequential code, system handles parallelism Workflow in detail: map, group-by-key, reduce What can go wrong? Value order into reduce may vary Code that contains bug, reasons for bug How to avoid bug? Satisfy correctness conditions Find bugs automatically: Find one input list and a permutation of it Not an easy task, there are many possible lists, permutation, need specific one Observations: MapReduce program small, good target for DSE How to apply DSE: index leaf nodes by input list length, encode correctness condition Input list length not symbolic, determined by path, controlled by subsequent evaluation of loop-termination (if-) condition Slight restriction / simplification of general DSE How to encode permutation Permutation maps concrete list position (“0”) to possibly different position (“0” or “1” or ..), encode new position as variable List: Make list elements symbolic: Let constraint solver find concrete list elements Permutation: Make list positions symbolic: Let constraint solver pick different list order

Since 2004: Many MapReduce systems, papers & users Google MapReduce [OSDI 2004] > 2,000 cit. Apache/Yahoo! Hadoop http://wiki.apache.org/hadoop/PoweredBy Microsoft Dryad [EuroSys 2007] > 500 cit. http://research.microsoft.com/en-us/projects/dryad/ Apache/Yahoo! Pig [SIGMOD 2008] > 400 cit. https://cwiki.apache.org/confluence/display/PIG/PoweredBy Apache/Facebook Hive [VLDB 2009] https://cwiki.apache.org/confluence/display/Hive/PoweredBy Systems are there from major software product organizations, papers are there, users are there Process “web-scale” data Google: Process 20 PB per day [CACM2008] Run on many machines in parallel 10k programs build search index, process text, graphs, etc. Jeff Dean, Sanjay Ghemawat, Google MapReduce implemented in C++, user programs in C++, Java, etc. Hadoop is implemented in Java; user programs in Java, etc. Dryad is implemented in C++, user programs in different high-level languages Pig is implemented in Java; user programs in Pig Latin, user can extend Pig Latin with user-defined functions implemented in Java, etc. Hive runs on Hadoop, is implemented in Java, user programs in HiveQL (SQL-like ad-hoc data warehouse queries) Links list industrial and academic users Citations according to Google Scholar “Apache” == open source Testing MapReduce-style programs

MapReduce programming model Programmer implements sequential code Two functions: map and reduce For example, in sequential Java code System distributes, schedules, handles faults Invokes map on many nodes in parallel Collects and re-distributes intermediate results Invokes reduce on many nodes in parallel Programmer can focus on problem domain ... From a Software Engineering perspective, one interesting question of course is: What can possibly go wrong? Testing MapReduce-style programs

Testing MapReduce-style programs Map: (key;value)* Group By Key Reduce: avg of first 3 Input Output ...;dept; salary ...; A; 250,000 ...; X; 220,000 ...; F; 220,000 ...; Z; 210,000 ... ...; O; 150,000 ...; T; 150,000 ...; A; 150,000 ...; A; 140,000 ...; E; 100,000 ...; S; 100,000 ...; A; 95,000 ...; C; 95,000 A; 250,000 X; 220,000 F; 220,000 Z; 210,000 ... A; 250,000 A; 150,000 A; 140,000 A; 95,000 ... 180,000 O; 150,000 T; 150,000 A; 150,000 A; 140,000 ... B; ... ... ... C; ... ... ... E; 100,000 S; 100,000 A; 95,000 C; 95,000 ... ... ... Testing MapReduce-style programs

Testing MapReduce-style programs Map: (key;value)* Group By Key Reduce: avg of first 3 Input Output ...;dept; salary ...; A; 250,000 ...; X; 220,000 ...; F; 220,000 ...; Z; 210,000 ... ...; O; 150,000 ...; T; 150,000 ...; A; 150,000 ...; A; 140,000 ...; E; 100,000 ...; S; 100,000 ...; A; 95,000 ...; C; 95,000 A; 250,000 X; 220,000 F; 220,000 Z; 210,000 ... A; 250,000 A; 95,000 A; 150,000 A; 140,000 ... 180,000 165,000 O; 150,000 T; 150,000 A; 150,000 A; 140,000 ... B; ... ... ... C; ... ... ... E; 100,000 S; 100,000 A; 95,000 C; 95,000 ... ... ... Testing MapReduce-style programs

Testing MapReduce-style programs Example bug: /* Report avg of top-3 salaries, if avg>100k */ public void reduce(String dept, Iterator<Integer> salaries) { int sum = 0; int i = 0; while (salaries.hasNext() && i<3) { sum += salaries.next(); i += 1; } emit( (i>0 && sum/i > 100000)? sum/i : -1); Code depends on order of salaries, just uses first-3 Programmer may be confused by order of salaries in input files, that order is not maintained Bug, possibly because MapReduce systems have built-in ordering, but not always use them Testing MapReduce-style programs

User reduce program has to satisfy correctness conditions Reduce must not rely on a particular order: For each input list of values L, for each permutation P: reduce(key, L) == reduce(key, P(L)) Program also has to satisfy other MapReduce-specific correctness conditions Current tools do not check these conditions Testing MapReduce-style programs

Goal: Find such bugs automatically Find an input list of values L and a permutation P: reduce(key, L) ≠ reduce(key, P(L)) Current tools do not find such bugs There are many input lists and permutations Trying all of them is impossible Testing MapReduce-style programs

Testing MapReduce-style programs Example bug: /* Report avg of top-3 salaries, if avg>100k */ public void reduce(String dept, Iterator<Integer> salaries) { int sum = 0; int i = 0; while (salaries.hasNext() && i<3) { sum += salaries.next(); i += 1; } emit( (i>0 && sum/i > 100000)? sum/i : -1); Need specific list of salaries & permutation List of more than 3 elements Average of first 3 elements > 100k Permutation has to swap element at position≤3 with element at position>3 Testing MapReduce-style programs

Testing MapReduce-style programs Observations Example MapReduce programs are typically small and contain few execution paths How do industrial MapReduce programs look like? Dynamic symbolic execution may be a good fit Heavy-weight but precise analysis Systematically explores all execution paths Well-suited for reasoning about few paths reduce(key, L), reduce(key, P(L)) may trigger different execution paths Not enough to analyze one path at a time Testing MapReduce-style programs

Check correctness conditions with dynamic symbolic execution Derive symbolic path condition, return value Maintain them in an indexed execution tree Index leaf nodes by length of input list Sibling(path): Triggered by input list of same length Encode potential violation of correctness condition in constraint system Solving constraints with off-the-shelf constraint solver yields concrete input values L and permutation P Convert solution to test case, run, confirm violation Testing MapReduce-style programs

Encode correctness conditions in symbolic program constraints // Permutation P as a function: 0  p[0], 1  p[1], .. // Symbolic list L = L[0], L[1], .. P(L) = L[p[0]], L[p[1]], .. SymbolicInt[] p  SymbolicIndices; // distinct list positions Assert PathCond; // e.g.: L[0]==5 Assert SubstituteIndices(SiblingPath, p); // e.g.: L[p[0]]==5 // Find a concrete list + a concrete permutation such that: // reduce(key, list) ≠ reduce(key, permutation(list)) Assert Result ≠ SubstituteIndices(SiblingResult, p); Testing MapReduce-style programs

Input length heuristic Pick “representative” input lengths Initially: |L| := 2 For shorter lists: L == P(L) Binary back-off scheme Each subsequent iteration doubles length of L Testing MapReduce-style programs

Testing MapReduce-style programs Conclusions New programming paradigm with new bugs To produce deterministic results, a MapReduce system requires user programs to satisfy certain high-level correctness conditions Neither MapReduce execution systems nor tools check these conditions Proposed approach: Encode MapReduce correctness conditions in symbolic program constraints Check correctness conditions at runtime Testing MapReduce-style programs

Testing MapReduce-style programs References [OSDI 2004] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. 6th USENIX Symposium on Operating Systems Design and Implementation, pages 137—150. [EuroSys 2007] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. 2nd ACM SIGOPS European Conference on Computer Systems, pages 59—72. [SIGMOD 2008] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In Proc. 34th ACM SIGMOD International Conference on Management of Data, pages 1099—1110. [CACM2008] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107—113. [VLDB 2009] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wycko, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endowment, 2(2):1626—1629. Testing MapReduce-style programs

Testing MapReduce-style programs Questions Testing MapReduce-style programs

MapReduce used for variety of jobs Process “web-scale” data (PB = peta-byte = 1015) Run on many machines in parallel Google: Process 20 PB per day [CACM2008] 10k programs build search index, process text, graphs, etc. New York Times: Convert 4TB of articles to PDF http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ Yahoo!: Sort TB in 209 seconds: http://sortbenchmark.org/ “First time that either a Java or an open source program has won this challenge” [http://hadoop.apache.org/] Facebook: Hive-based data warehouse Testing MapReduce-style programs

MapReduce ≠ map-reduce Inspired by functional programming map-reduce But different  For detailed comparison, see: Ralf Lämmel. Google's MapReduce programming model — Revisited. Science of Computer Programming 68(3): 208—237. Oct. 2007. Testing MapReduce-style programs

MapReduce correctness condition 2: Optional combine function Combine: programmer-defined sequential code Similar to map and reduce May be invoked on Map node, after map Locally “pre-reduce” results, by key Reduce transmission overhead to “real reduce” System can invoke combine 0—n times Must not affect semantics Similar approach: Encode in symbolic path condition, result value Testing MapReduce-style programs