Published by Michael Griffin. Modified over 9 years ago.
1
Interpreting the data: Parallel analysis with Sawzall
LIN Wenbin
25 Mar 2014
2
OUTLINE
- Introduction
- Motivation
- System model
- Language
- Performance
- Future work & Conclusion
4
Introduction
- Many data sets are too large and too dynamic to be housed productively in a relational database
- Files are distributed across many disks on many computers
- An analysis may consume months of CPU time, but with a thousand machines it takes only a few hours of real time
- Calculations are broken into two phases:
  1. Evaluate the analysis on each record individually
  2. Aggregate the results
5
- GFS and MapReduce: provide fault tolerance and reliability, and a powerful framework on which to implement a large, parallel system for distributed analysis
- Sawzall: expresses the analysis cleanly and executes it quickly
7
Motivation
- Google's server logs are stored as large collections of records (protocol buffers)
- The records are partitioned over many disks within the Google File System (GFS)
- To perform calculations on them, engineers write MapReduce programs
8
Motivation
- Parallelism
  - Separating out the aggregators
  - Providing a restricted model for distributed processing (one record at a time)
- Clearer, more compact, more expressive programs
- Support for domain-specific types at a lower level
- Easier to write quick scripts
10
System model
- The Sawzall language implementation is written in C++
- The compiler and byte-code interpreter are part of the same binary
- Aggregators are implemented by saw
- The system is implemented above MapReduce: Sawzall runs in the map phase and the aggregators run in the reduce phase
- MapReduce manages the distribution of execution and aggregation machines, locates the computation, and handles machine failures and other faults
11
System model
1. Input is located on multiple storage nodes
2. Input is divided into pieces to be processed separately
3. A Sawzall interpreter is instantiated for each piece of data
4. The Sawzall program operates on each input record individually
5. For each record, the program outputs intermediate values
6. These intermediate values are sent to further computation nodes running the aggregators
7. After the values are collated and reduced, the final results are created
(In a typical run, the majority of machines will run Sawzall and a smaller fraction will run the aggregators)
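The flow described on this slide can be sketched in ordinary Python. This is only an illustration of the data flow, not Sawzall or MapReduce code: the function names, the `piece_size` parameter, and the sample per-record program are all assumptions made for the example, and everything runs in one process rather than on separate machines.

```python
from collections import defaultdict


def split_into_pieces(records, piece_size):
    """Divide the input into pieces that can be processed separately."""
    return [records[i:i + piece_size] for i in range(0, len(records), piece_size)]


def run_program_on_piece(piece, program):
    """Stand-in for one interpreter instance: run the per-record program on
    every record of the piece and collect the intermediate values it emits."""
    emitted = []
    for record in piece:
        emitted.extend(program(record))
    return emitted


def aggregate_sums(emitted_values):
    """Stand-in for the aggregation phase: collate by table name and sum."""
    tables = defaultdict(int)
    for table_name, value in emitted_values:
        tables[table_name] += value
    return dict(tables)


def record_length_program(record):
    # Hypothetical analysis: count the record and add up its length.
    return [("records", 1), ("bytes", len(record))]


def run_job(records, program, piece_size=2):
    intermediate = []
    for piece in split_into_pieces(records, piece_size):
        intermediate.extend(run_program_on_piece(piece, program))
    return aggregate_sums(intermediate)


result = run_job(["alpha", "be", "gamma", "d", "epsilon"], record_length_program)
```

Because the per-record program never looks at more than one record, the final tables do not depend on how the input was split, which is what lets the pieces be handled independently.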
12
System model
  saw          command
  program      Sawzall source file
  input_files  standard Unix shell file-name-matching metacharacters
  destination  names of the output files
  @            the number of files (the number of aggregation machines)
14
Language
- Example 1
- Overview
- Aggregators
- Indexed aggregators
- Example 2
- Quantifiers
- Example 3
15
Example 1
Return the number of records, the sum of the values, and the sum of the squares of the values
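The slide's own program is not preserved in this transcript. As a rough Python stand-in for the same computation, the sketch below emits three values per record and sums them per table. The table names `count`, `total`, and `sum_of_squares` and the helper functions are assumptions chosen for the illustration.

```python
def per_record(x):
    """Per-record phase: emit (table, value) pairs for one input value."""
    yield ("count", 1)
    yield ("total", x)
    yield ("sum_of_squares", x * x)


def summarize(values):
    """Aggregation phase: each table simply sums what was emitted to it."""
    tables = {"count": 0, "total": 0.0, "sum_of_squares": 0.0}
    for x in values:
        for table, value in per_record(x):
            tables[table] += value
    return tables


stats = summarize([1.0, 2.0, 3.0])
```

These three sums are enough to derive the mean and variance afterwards without another pass over the records.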
16
Overview
Basic types
  int      signed 64-bit quantity
  float    64-bit IEEE floating-point value
  bool     Boolean value
  time     unsigned 64-bit quantity recording microseconds
Array-like types
  bytes    string of 8-bit unsigned bytes
  strings  string of 16-bit Unicode characters
Compound types
  arrays   an (unspecified) number of components, all of the same type
  maps     key-value pairs
  tuples   a fixed number of members of possibly different types
17
Overview
- Declarations
- Statements
- emit (sends intermediate values to the aggregators)
18
Overview
- proto (imports the DDL for a protocol buffer from a file)
- static (avoids repeating initialization for every record)
19
Aggregators
- Collection (a list of all the emitted values, including duplicates, in arbitrary order)
- Sum (the summation of all the emitted arithmetic values)
- Maximum (the highest-weighted values)
- Top (the most popular values)
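To make the four aggregator kinds concrete, here is a minimal Python sketch of their behavior. The class names mirror the slide, but the `emit`/`result` interface and the parameter `n` are assumptions for the illustration. In particular, this `Top` counts exactly, whereas a system running at scale might estimate popularity instead.

```python
import heapq
from collections import Counter


class Collection:
    """Keeps every emitted value, duplicates included; order is not significant."""
    def __init__(self):
        self.values = []

    def emit(self, value):
        self.values.append(value)

    def result(self):
        return list(self.values)


class Sum:
    """Adds up all emitted arithmetic values."""
    def __init__(self):
        self.total = 0

    def emit(self, value):
        self.total += value

    def result(self):
        return self.total


class Maximum:
    """Keeps the n values with the highest weights."""
    def __init__(self, n):
        self.n = n
        self.items = []

    def emit(self, value, weight):
        self.items.append((value, weight))

    def result(self):
        best = heapq.nlargest(self.n, self.items, key=lambda item: item[1])
        return [value for value, _ in best]


class Top:
    """Keeps the n most frequently emitted values (exact counts in this sketch)."""
    def __init__(self, n):
        self.n = n
        self.counts = Counter()

    def emit(self, value):
        self.counts[value] += 1

    def result(self):
        return [value for value, _ in self.counts.most_common(self.n)]
```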
20
Indexed aggregators
- An aggregator can be indexed
- Indexing creates a distinct individual aggregator for each unique index value
- Example: find the 1000 most popular requests for each country, for each hour
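One way to picture an indexed aggregator is as a dictionary that creates a fresh aggregator the first time each index value appears. The Python sketch below does this for a "most popular requests per country per hour" table. The class name `IndexedTop`, the tuple index, and the sample log entries are hypothetical, and `n` is set to 1 rather than 1000 to keep the example small.

```python
from collections import Counter, defaultdict


class IndexedTop:
    """One exact-count 'top n' aggregator per distinct index value."""
    def __init__(self, n):
        self.n = n
        self.cells = defaultdict(Counter)  # index -> counts of emitted values

    def emit(self, index, value):
        self.cells[index][value] += 1

    def result(self):
        return {
            index: [value for value, _ in counts.most_common(self.n)]
            for index, counts in self.cells.items()
        }


table = IndexedTop(n=1)
log = [
    ("US", 9, "weather"),
    ("US", 9, "news"),
    ("US", 9, "weather"),
    ("FR", 9, "meteo"),
    ("US", 10, "news"),
]
for country, hour, request in log:
    table.emit((country, hour), request)
popular = table.result()
```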
21
Example 2
Show how the queries are distributed around the globe
22
Example 2
[Figure: query distribution]
23
Quantifiers
- The when statement defines a quantifier, a variable, and a Boolean condition
- Three quantifier types:
  - some: executes if the condition is true for any value (an arbitrary choice is made if more than one value qualifies)
  - each: executes for all the values that satisfy the condition
  - all: executes if the condition is true for all valid values
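The three quantifiers can be approximated in Python as helper functions over a list. This is a sketch of the semantics as summarized on the slide, not Sawzall syntax: the function names are invented, the "arbitrary choice" for `some` is simply the first match here, and the `all` helper treats an empty list as satisfying the condition, which is an assumption.

```python
def when_some(values, condition, body):
    """Run body once, for one index whose value satisfies the condition."""
    for index, value in enumerate(values):
        if condition(value):
            body(index)
            return True
    return False


def when_each(values, condition, body):
    """Run body for every index whose value satisfies the condition."""
    ran = False
    for index, value in enumerate(values):
        if condition(value):
            body(index)
            ran = True
    return ran


def when_all(values, condition, body):
    """Run body once if the condition holds for every value."""
    if all(condition(value) for value in values):
        body()
        return True
    return False


sizes = [3, 8, 5, 10]
some_hits, each_hits, all_hits = [], [], []
when_some(sizes, lambda v: v > 4, some_hits.append)
when_each(sizes, lambda v: v > 4, each_hits.append)
when_all(sizes, lambda v: v > 2, lambda: all_hits.append("every size exceeds 2"))
```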
24
Example 3
Count the occurrences of certain words, for each day
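The slide's program is not preserved here. A rough Python equivalent of the task is sketched below: each record emits a 1 into a sum table indexed by day and word. The record layout (a microsecond timestamp plus a query string), the word list, and the use of UTC days are all assumptions made for the illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

WORDS = ("cat", "dog")  # hypothetical words of interest


def per_record(timestamp_us, query):
    """Emit ((day, word), 1) for each tracked word found in the query."""
    seconds = timestamp_us / 1_000_000  # timestamps are in microseconds
    day = datetime.fromtimestamp(seconds, tz=timezone.utc).date().isoformat()
    for word in query.lower().split():
        if word in WORDS:
            yield ((day, word), 1)


def count_words_per_day(records):
    """Sum table indexed by (day, word)."""
    table = defaultdict(int)
    for timestamp_us, query in records:
        for index, value in per_record(timestamp_us, query):
            table[index] += value
    return dict(table)


sample_log = [
    (0, "cat video"),
    (1_000_000, "Dog and cat"),
    (86_400_000_000, "dog food"),
]
counts = count_words_per_day(sample_log)
```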
26
Performance
- Test the single-CPU speed of the Sawzall interpreter
  - Compare it with that of other interpreted languages
- Test how the speed scales
  - Run a program using different numbers of machines
27
Test the single-CPU speed of the Sawzall interpreter
- A program that computes pixel values for displaying the Mandelbrot set
  - Measures basic arithmetic and loop performance
- A recursive function that calculates the first 35 Fibonacci numbers
  - Measures function invocation
- Run on a 2.8 GHz x86 desktop machine
28
Test the single-CPU speed of the Sawzall interpreter
- 1.6 times slower than interpreted Java
- 21 times slower than compiled Java
- 51 times slower than compiled C++
29
Test how the speed scales
- 450 GB sample of compressed query log data
- Count the occurrences of certain words
- 50 to 600 2.4 GHz Xeon computers
30
Test how the speed scales
[Slide shows the Sawzall program used for the test]
31
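[Figure: elapsed time versus number of machines]
- The solid line is elapsed time
- The dashed line is the product of machines and elapsed time
- The machine-minutes product degrades by only 30% as the number of machines grows from 50 to 600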
33
Future work
- Aggressive compilation
  - More complex analyses
  - Run once per machine, apply the accelerated binary to each input record
- Interface to query an external database
  - Suspend processing of one record
- Language extensions
  - Multiple passes over the data
- Join operations
  - Join data from multiple input sources
34
Conclusion
- A new interpreted programming language called Sawzall
  - Programming model (one record at a time)
  - Interface to a novel set of aggregators
- Users write short, clear programs that are guaranteed to work well on thousands of machines in parallel
- Users need to know nothing about parallel programming
35
THANK YOU