4. Scalability and MapReduce Prof. Tudor Dumitraș Assistant Professor, ECE University of Maryland, College Park ENEE 759D | ENEE 459D | CMSC 858Z


1 4. Scalability and MapReduce Prof. Tudor Dumitraș Assistant Professor, ECE University of Maryland, College Park ENEE 759D | ENEE 459D | CMSC 858Z http://ter.ps/759d https://www.facebook.com/SDSAtUMD

2 Today’s Lecture
Where we’ve been
– How to say “hapax legomenon” and “heteroskedasticity”
– Interpretation of statistics
– Attributes of Big Data
Where we’re going today
– Threats to validity
– Scalability
– MapReduce
Where we’re going next
– Machine learning

3 The IROP Keyboard [Zeller, 2011]
To prevent bugs, remove the keystrokes that predict 74% of failure-prone modules in Eclipse

4 [Figure: Reconstruct Lineage, Korgo worm family. Samples C, D, E; versions V1 ?, V2 ?, V3 ?]
Does this work? What am I measuring? How well does this work in the real world? Will this work tomorrow?

5 What Am I Measuring: Scalability vs. Latency
Analyzing data in parallel
– To access 1 TB in 1 min, must distribute data over 20 disks
– Parallelism is useful for algorithms where complexity constants matter: N log N operations sequentially => (N log N)/K operations in parallel
– Scalability: ability to throw resources at the problem
You can measure scalability
– Scaleup (weak scalability): more resources => solve proportionally bigger problem with same latency
– Speedup (strong scalability): more resources => proportionally lower latency with same problem size
Can we make use of 1000s of cheap computers?
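The two scalability measures on this slide can be computed from timings. The sketch below is only an illustration: the function names and all timing values are hypothetical, and the disk calculation assumes a decimal terabyte (10^12 bytes). It also shows the per-disk read rate implied by the slide's "1 TB in 1 min over 20 disks" figure.

```python
def speedup(t_one_worker: float, t_k_workers: float) -> float:
    """Strong scalability: same problem size, more resources.
    The ideal value equals the number of workers K."""
    return t_one_worker / t_k_workers


def scaleup(t_small_one_worker: float, t_big_k_workers: float) -> float:
    """Weak scalability: problem size and resources both grow by K.
    The ideal value is 1.0 (latency unchanged)."""
    return t_small_one_worker / t_big_k_workers


def per_disk_mb_per_s(total_bytes: float, seconds: float, disks: int) -> float:
    """Read rate each disk must sustain when the data is spread evenly."""
    return total_bytes / seconds / disks / 1e6


# Hypothetical measurements with K = 8 workers.
K = 8
strong = speedup(400.0, 62.5)   # 6.4x faster, ideal would be 8x
weak = scaleup(400.0, 500.0)    # 0.8, ideal would be 1.0

# Rate implied by the slide's example, assuming 1 TB = 1e12 bytes.
disk_rate = per_disk_mb_per_s(1e12, 60, 20)  # about 833 MB/s per disk

print(f"speedup={strong:.1f}, efficiency={strong / K:.2f}")
print(f"scaleup={weak:.2f}")
print(f"per-disk rate={disk_rate:.0f} MB/s")
```

Dividing the speedup by K gives a parallel efficiency, which makes runs with different worker counts comparable.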

6 Some Problems Are Embarrassingly Parallel (1)
Task: Convert 405K TIFF images (~4 TB) to PNG
Input: many TIFF images
Distribute images among K computers
f is a function to convert TIFF to PNG; apply it to every item
Output: a big distributed set of converted images
http://open.blogs.nytimes.com/2008/05/21/the-new-york-times-archives-amazon-web-services-timesmachine/

7 Some Problems Are Embarrassingly Parallel (2)
Task: Compute the word frequency of 5M documents
Input: millions of documents
Distribute documents among K computers
For each document, f returns a set of (word, frequency) pairs
Output: a big distributed list of sets of word freqs.
Adapted from slides by Bill Howe
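The per-document step can be sketched as a function f applied independently to every document. This is a minimal illustration, not the lecture's code: `word_freq` and `parallel_word_freqs` are hypothetical names, whitespace tokenization is an assumption, and a local thread pool stands in for the K computers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def word_freq(document: str) -> Counter:
    """f: turn one document into its own (word, frequency) pairs."""
    return Counter(document.lower().split())


def parallel_word_freqs(documents: List[str], k: int = 4) -> List[Counter]:
    """Apply f to every document using k workers.
    No worker needs any other worker's result, which is what makes
    the problem embarrassingly parallel."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(word_freq, documents))


docs = ["the cat sat", "the dog sat down", "a cat"]
freqs = parallel_word_freqs(docs, k=2)
for doc, freq in zip(docs, freqs):
    print(doc, "->", dict(freq))
```

The result is one small histogram per document, which is exactly the output this slide describes.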

8 Some Problems Are Embarrassingly Parallel (3)
Task: Compute the word frequency across all documents
Input: millions of documents
Distribute documents among K computers
For each document, f returns a set of (word, frequency) pairs
Now what? We don’t want a bunch of little histograms; we want one big histogram

9 MapReduce
Task: Compute the word frequency across all documents
Distribute documents among K computers
map: for each document, f returns a set of (word, frequency) pairs, giving a big distributed list of sets of word freqs.
Shuffle pairs so that all the counts for a word are sent to the same host
reduce: add the counts of each word
Output: the distributed histogram
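The three phases on this slide can be simulated on one machine. The sketch below makes several assumptions: the function names are hypothetical, the K hosts are plain Python dictionaries, and a CRC32 hash of the word decides which host receives it (real frameworks use their own partitioners).

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple


def map_phase(document: str) -> List[Tuple[str, int]]:
    """map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs: List[Tuple[str, int]], k: int) -> List[Dict[str, List[int]]]:
    """Send all counts for a given word to the same host.
    A deterministic hash of the word picks the host."""
    hosts: List[Dict[str, List[int]]] = [defaultdict(list) for _ in range(k)]
    for word, count in pairs:
        host_id = zlib.crc32(word.encode("utf-8")) % k
        hosts[host_id][word].append(count)
    return hosts


def reduce_phase(host: Dict[str, List[int]]) -> Dict[str, int]:
    """reduce: add the counts of each word held by one host."""
    return {word: sum(counts) for word, counts in host.items()}


docs = ["the cat sat", "the dog sat down", "a cat"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
hosts = shuffle(pairs, k=3)
histogram_parts = [reduce_phase(host) for host in hosts]

# Each word lives on exactly one host, so merging the parts is a
# plain union with no further addition needed.
merged: Dict[str, int] = {}
for part in histogram_parts:
    merged.update(part)
print(merged)
```

The output stays distributed as `histogram_parts`; the final merge is only there to inspect the result in one place.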

10 Hadoop on One Slide
Source: Huy Vo
MapReduce was invented at Google [Dean & Ghemawat, OSDI’04]
Hadoop = open source implementation
Data stored on HDFS distributed file system
– Direct-attached storage
– No schema needed on load
Programmers write Map and Reduce functions
Framework provides automated parallelization and fault tolerance
– Data replication, restarting failed tasks
– Scheduling Map and Reduce tasks on hosts with local copies of input data

11 MapReduce Programming Model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
– Processes input key/value pair
– Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
– Combines all intermediate values for a particular key
– Produces a set of merged output values (usually just one)
Inspired by primitives from functional programming languages such as Lisp, Scheme, and Haskell
Slide source: Google
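The two signatures above can be exercised with a small single-process driver. This is a sketch under stated assumptions, not Hadoop's or Google's API: `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names, and grouping by key in a dictionary stands in for the distributed shuffle.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, Iterable, List, Tuple

Mapper = Callable[[str, str], List[Tuple[Hashable, int]]]
Reducer = Callable[[Hashable, List[int]], List[int]]


def run_mapreduce(
    inputs: Iterable[Tuple[str, str]], mapper: Mapper, reducer: Reducer
) -> Dict[Hashable, List[int]]:
    """Apply mapper to every (in_key, in_value), group the intermediate
    values by out_key, then apply reducer once per out_key."""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, intermediate_value in mapper(in_key, in_value):
            groups[out_key].append(intermediate_value)
    return {key: reducer(key, values) for key, values in groups.items()}


def wc_map(doc_name: str, contents: str) -> List[Tuple[str, int]]:
    """map(in_key, in_value) -> list(out_key, intermediate_value)"""
    return [(word, 1) for word in contents.split()]


def wc_reduce(word: str, counts: List[int]) -> List[int]:
    """reduce(out_key, list(intermediate_value)) -> list(out_value)"""
    return [sum(counts)]  # usually just one merged value, as the slide notes


documents = [("doc1", "to be or not to be"), ("doc2", "to do")]
result = run_mapreduce(documents, wc_map, wc_reduce)
print(result)
```

Only `wc_map` and `wc_reduce` are problem-specific; the driver never looks inside the keys or values, which is the separation the programming model relies on.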

12 Example: What Does This Do?

map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
        EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
    // output_key: word
    // output_values: ????
    int result = 0;
    for each v in intermediate_values:
        result += v;
    EmitFinal(output_key, result);

13 Big Data in the Security Industry
Booz Allen Hamilton
– Dr. Brian Keller’s colloquium “Innovating with Analytics”
– Sponsors Data Science Bowl, October 5th 1-5:30 pm CSIC 2117 & 2120 https://www.datasciencebowl.com/
Symantec
– WINE platform for data analytics in security
Google
– Mine user access patterns to mitigate data loss due to stolen credentials (supplementary to passwords and two-factor authentication)
– Fuzz testing at scale

14 Big Data for Security: Benefits and Challenges
Benefits
– Ability to analyze data at scale (e.g., the information on the 403 million malware variants created in 2011)
– MapReduce provides a simple programming model, automated parallelization, and fault tolerance. Commercial parallel DBs (e.g., Vertica, Greenplum, Aster Data) also provide some of these benefits, but they are very expensive
Challenges
– Lack of ground truth on malware families
– Lack of contextual data: e.g., date and time of appearance
– Inability to collect some types of data owing to privacy concerns
– Sharing data (e.g., malware samples are dangerous, some data sets may include personal information)
These challenges illustrate general threats to validity in experimental cyber security

15 Threats to Validity
Construct validity: use metrics that model the hypothesis
Internal validity: establish causal connection
Content validity: include only and all relevant data
External validity: generalize results beyond experimental data
Questions: Does it work? What am I measuring? Will it work in the real world? Will it work tomorrow?

16 Review of Lecture
What did we learn?
– Construct, content, internal, external validity
– Programming in MapReduce
– Measuring scalability
What’s next?
– Paper discussion: ‘Before We Knew It: An Empirical Study of Zero-Day Attacks In The Real World’
– Next lecture: Machine learning techniques
Deadline reminder
– Pilot project reports due on Wednesday
– Post report on Piazza

