4. Scalability and MapReduce Prof. Tudor Dumitraș Assistant Professor, ECE University of Maryland, College Park ENEE 759D | ENEE 459D | CMSC 858Z


Today’s Lecture
Where we’ve been
– How to say “hapax legomenon” and “heteroskedasticity”
– Interpretation of statistics
– Attributes of big data
Where we’re going today
– Threats to validity
– Scalability
– MapReduce
Where we’re going next
– Machine learning

The IROP Keyboard [Zeller, 2011]
To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse

[Figure: reconstructing the lineage of the Korgo worm family – samples C, D, E and variants V1, V2, V3 with unknown relationships, arranged in a tree with nodes D, E, F, C, G, N, ST]
Does this work? What am I measuring? How well does this work in the real world? Will this work tomorrow?

What Am I Measuring: Scalability vs. Latency
Can we make use of 1000s of cheap computers?
Analyzing data in parallel
– To access 1 TB in 1 min, you must distribute the data over 20 disks
– Parallelism is useful for algorithms where the complexity constants matter: N log N operations sequentially => (N log N)/K operations in parallel on K machines
– Scalability: the ability to throw resources at the problem
You can measure scalability
– Scaleup (weak scalability): more resources => solve a proportionally bigger problem with the same latency
– Speedup (strong scalability): more resources => proportionally lower latency for the same problem size
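The two metrics can be checked with a quick calculation (a sketch; the timings below are hypothetical):

```python
def speedup(t_serial, t_parallel):
    """Strong scalability: single-machine latency divided by
    K-machine latency for the SAME problem size; ideal value is K."""
    return t_serial / t_parallel

def scaleup(t_size_n_on_1, t_size_kn_on_k):
    """Weak scalability: latency for a size-N problem on 1 machine,
    divided by latency for a size-K*N problem on K machines;
    ideal value is 1.0."""
    return t_size_n_on_1 / t_size_kn_on_k

# Hypothetical timings: 60 min on 1 machine, 3.2 min on 20 machines.
print(speedup(60.0, 3.2))    # 18.75x on 20 machines (ideal would be 20x)
print(scaleup(60.0, 66.0))   # ~0.91, close to the ideal 1.0
```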

Some Problems Are Embarrassingly Parallel (1)
Task: Convert 405K TIFF images (~4 TB) to PNG
Input: many TIFF images
Distribute the images among K computers
f is a function to convert TIFF to PNG; apply it to every item
Output: a big distributed set of converted images
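A minimal local sketch of this pattern using Python’s multiprocessing (the convert function is a hypothetical stand-in; a real conversion would call an imaging library):

```python
from multiprocessing import Pool

def convert(name):
    """Stand-in for the real TIFF -> PNG conversion.
    Each input is independent of every other, which is what makes
    the problem embarrassingly parallel."""
    return name.replace(".tiff", ".png")

if __name__ == "__main__":
    images = ["scan%04d.tiff" % i for i in range(8)]
    with Pool(processes=4) as pool:            # K = 4 local workers
        converted = pool.map(convert, images)  # apply f to every item
    print(converted[0])  # scan0000.png
```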

Some Problems Are Embarrassingly Parallel (2)
Task: Compute the word frequency of 5M documents
Input: millions of documents
Distribute the documents among K computers
For each document, f returns a set of (word, frequency) pairs
Output: a big distributed list of sets of word frequencies
Adapted from slides by Bill Howe

Some Problems Are Embarrassingly Parallel (3)
Task: Compute the word frequency across all documents
Input: millions of documents
Distribute the documents among K computers
For each document, f returns a set of (word, frequency) pairs
Now what? We don’t want a bunch of little histograms – we want one big histogram

MapReduce
Task: Compute the word frequency across all documents
map: distribute the documents among K computers; for each document, f returns a set of (word, count) pairs – a big distributed list of sets of word frequencies
shuffle: route the pairs so that all the counts for a word are sent to the same host
reduce: add the counts of each word
Output: the distributed histogram
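The shuffle step can be sketched as hash partitioning: a deterministic hash of the word picks a reducer, so every mapper routes the counts for a given word to the same host (a sketch; the pair values and reducer count are illustrative):

```python
from zlib import crc32
from collections import defaultdict

def partition(word, num_reducers):
    """Deterministic hash partitioning: every (word, count) pair for a
    given word lands on the same reducer, no matter which mapper
    emitted it."""
    return crc32(word.encode("utf-8")) % num_reducers

# Pairs emitted by two different mappers:
mapper_outputs = [[("the", 1), ("cat", 1)], [("the", 1), ("hat", 1)]]

# Shuffle: route each pair to its reducer's bucket.
buckets = defaultdict(list)
for output in mapper_outputs:
    for word, count in output:
        buckets[partition(word, 4)].append((word, count))
```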

Hadoop on One Slide [Source: Huy Vo]
MapReduce was invented at Google [Dean & Ghemawat, OSDI ’04]; Hadoop is an open-source implementation
Data is stored on the HDFS distributed file system
– Direct-attached storage
– No schema needed on load
Programmers write Map and Reduce functions
The framework provides automated parallelization and fault tolerance
– Data replication, restarting failed tasks
– Scheduling Map and Reduce tasks on hosts with local copies of their input data

MapReduce Programming Model
Input & Output: each is a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
– Processes an input key/value pair
– Produces a set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
– Combines all intermediate values for a particular key
– Produces a set of merged output values (usually just one)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
Slide source: Google

Example: What Does This Do?

map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: word
  // output_values: ????
  int result = 0;
  for each v in intermediate_values:
    result += v;
  EmitFinal(output_key, result);
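A runnable single-process sketch of this pseudocode in Python, with the shuffle simulated by a dictionary (the function names and sample documents are mine, not the framework’s):

```python
from collections import defaultdict

def map_fn(doc_name, doc_contents):
    # Emit (word, 1) for each word, as in map() above.
    return [(w, 1) for w in doc_contents.split()]

def reduce_fn(word, counts):
    # Sum all intermediate counts for one word.
    return word, sum(counts)

docs = {"d1": "to be or not to be", "d2": "to do"}

# Shuffle: group intermediate values by key.
grouped = defaultdict(list)
for name, text in docs.items():
    for word, count in map_fn(name, text):
        grouped[word].append(count)

histogram = dict(reduce_fn(w, c) for w, c in grouped.items())
print(histogram["to"])  # 3
```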

Big Data in the Security Industry
Booz Allen Hamilton
– Dr. Brian Keller’s colloquium, “Innovating with Analytics”
– Sponsors the Data Science Bowl, October 5th, 1–5:30 pm, CSIC 2117
Symantec
– WINE platform for data analytics in security
Google
– Mines user access patterns to mitigate data loss due to stolen credentials (supplementary to passwords and two-factor authentication)
– Fuzz testing at scale

Big Data for Security: Benefits and Challenges
Benefits
– Ability to analyze data at scale (e.g., the information on the 403 million malware variants created in 2011)
– MapReduce provides a simple programming model, automated parallelization, and fault tolerance; commercial parallel databases (e.g., Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive
Challenges (these illustrate general threats to validity in experimental cyber security)
– Lack of ground truth on malware families
– Lack of contextual data, e.g., date and time of appearance
– Inability to collect some types of data owing to privacy concerns
– Difficulty sharing data (e.g., malware samples are dangerous, and some data sets may include personal information)

Threats to Validity
Construct validity: use metrics that model the hypothesis
Internal validity: establish a causal connection
Content validity: include only and all relevant data
External validity: generalize results beyond the experimental data
Does it work? What am I measuring? Will it work in the real world? Will it work tomorrow?

Review of Lecture
What did we learn?
– Construct, content, internal, external validity
– Programming in MapReduce
– Measuring scalability
What’s next?
– Paper discussion: “Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World”
– Next lecture: machine learning techniques
Deadline reminder
– Pilot project reports due on Wednesday
– Post the report on Piazza