Published by Patience Robbins. Modified over 9 years ago.
1
Terasort Using SAGA-MapReduce Given by: Sharath Maddineni
CCT: Center for Computation & Technology
2
Why Terasort?
Sorting large datasets is a common step in scientific computations.
Google processes around 20 petabytes of data per day using MapReduce.
Sorting huge datasets of web pages can make searching and retrieval faster.
3
Introduction
Sort Benchmark: Google won the 2010 competition, and Yahoo won with Hadoop in 2009.
However, Google's sorting is limited to the Google File System (GFS), and Yahoo's is tied to the Hadoop Distributed File System (HDFS).
SAGA-MapReduce is infrastructure independent.
4
SAGA MapReduce Execution Overview
The master is started from an executable linked against SAGA-MapReduce and creates an advert directory.
The master uses the InputFormat specified in the JobDescription to chunk the input data.
The master spawns workers on the host machines listed in the configuration file using the SAGA Job API.
Each worker writes its status information into the advert directory and communicates with the master through this advert service.
Workers process the chunks assigned by the master using Map() and partition the data according to the partition function.
When all chunks have been mapped, the master moves to the reduce phase.
In the reduce phase, the master assigns sets of partitions to idle workers to be reduced.
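The chunk, map, partition, and reduce steps above can be illustrated with a small single-process sketch. This is not SAGA-MapReduce code: there is no advert service and no SAGA Job API here, and every name (chunk_input, run_map_phase, run_reduce_phase, and the word-count functions) is a hypothetical stand-in chosen for illustration.

```python
from collections import defaultdict


def chunk_input(records, chunk_size):
    """Master step: split the input into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_map_phase(chunks, map_fn, partition_fn, num_partitions):
    """Worker step: map every record, then route each emitted (key, value)
    pair to the partition chosen by the partition function."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for chunk in chunks:
        for record in chunk:
            for key, value in map_fn(record):
                partitions[partition_fn(key, num_partitions)][key].append(value)
    return partitions


def run_reduce_phase(partitions, reduce_fn):
    """Reduce step: runs only after every chunk has been mapped."""
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            results[key] = reduce_fn(key, values)
    return results


# Hypothetical word-count job used only to exercise the flow.
def word_map(line):
    return [(word, 1) for word in line.split()]


def word_reduce(key, values):
    return sum(values)


def first_char_partition(key, num_partitions):
    # Deterministic, so the same key always lands in the same partition.
    return ord(key[0]) % num_partitions


lines = ["saga map reduce", "map reduce sort", "sort"]
chunks = chunk_input(lines, chunk_size=2)
partitions = run_map_phase(chunks, word_map, first_char_partition, num_partitions=3)
counts = run_reduce_phase(partitions, word_reduce)
print(counts)
```

In the real system the chunks and partitions would be handed to separate worker processes, with the advert directory carrying the status and assignment messages that this sketch replaces with ordinary function calls.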
6
Terasort
The Sort Benchmark provides a “Gensort” program to generate data records.
Data format: each record is 100 bytes of ASCII, made up of a 10-byte random key followed by a 90-byte value.
A terabyte of data is 10^10 records of 100 bytes each.
All records are sorted according to the 10-byte key.
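A minimal sketch of the record layout described above: fixed 100-byte records, ordered by their first 10 bytes. The helper make_record is a hypothetical stand-in for real Gensort output (it simply pads the value with a filler byte), and the whole sort runs in memory, which is only reasonable for small samples.

```python
RECORD_SIZE = 100  # bytes per record, as stated on the slide
KEY_SIZE = 10      # leading bytes that form the sort key


def split_records(data: bytes):
    """Cut a byte string into fixed-size 100-byte records."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input length is not a multiple of the record size")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


def record_key(record: bytes) -> bytes:
    """The first 10 bytes of a record are its key."""
    return record[:KEY_SIZE]


def sort_records(data: bytes) -> bytes:
    """Sort whole records by key and concatenate them again."""
    return b"".join(sorted(split_records(data), key=record_key))


def make_record(key: bytes) -> bytes:
    # Hypothetical test helper, not Gensort: key plus 90 filler bytes.
    return key + b"x" * (RECORD_SIZE - KEY_SIZE)


sample = (make_record(b"MMMMMMMMMM")
          + make_record(b"AAAAAAAAAA")
          + make_record(b"ZZZZZZZZZZ"))
sorted_keys = [record_key(r) for r in split_records(sort_records(sample))]

# Size check for the full benchmark: 10^10 records of 100 bytes each.
total_bytes = 10**10 * RECORD_SIZE
print(sorted_keys, total_bytes)
```

The size check confirms the arithmetic on the slide: 10^10 records of 100 bytes give 10^12 bytes, that is, one terabyte.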
7
Terasort SAGA Map-Reduce
Similar to SAGA-MapReduce, except that the partition list is generated before launching the master.
The generated partition list ensures that each key emitted in the map phase goes into the partition that covers its range.
This spreads the keys evenly across all the partitions.
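One way such a partition list could be built is sketched below. The slides do not say how the list is actually generated, so the approach here (taking evenly spaced split points from a sorted sample of keys) is an assumption, and the names build_partition_list and partition_for are hypothetical.

```python
import bisect


def build_partition_list(sample_keys, num_partitions):
    """Pick num_partitions - 1 split points from a sorted sample of keys.
    Assumption: the sample is representative of the full key distribution."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_partitions
    return [ordered[int(step * i)] for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the partition whose key range contains key."""
    return bisect.bisect_right(split_points, key)


sample_keys = [b"d", b"a", b"h", b"f", b"b", b"g", b"c", b"e"]
split_points = build_partition_list(sample_keys, num_partitions=4)

# Count how many sample keys fall into each partition.
sizes = [0] * 4
for key in sample_keys:
    sizes[partition_for(key, split_points)] += 1
print(split_points, sizes)
```

Because every key in partition i sorts before every key in partition i + 1, sorting each partition independently and concatenating the outputs in partition order yields a globally sorted result.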
9
Distributed Workers for Terasort
The Cyder and Cyd01 machines are used as workers.
Prerequisites:
Passwordless SSH login from the master machine to the worker machines.
FUSE-mount the input and output data locations on each machine.
10
Results
Increasing the input data size with a constant number of workers (3); both master and workers on Cyd01.
Operating system: Red Hat 5.5
Architecture: x86_64
Memory: 8 GB
CPU type: Dual-Core AMD Opteron
Compiler version: gcc 4.4.3
Boost version: 1.40
X-axis: data set size in MB
Y-axis: time to solution in seconds
11
Results cont…
Constant input file size (400 MB, 6 chunks, 5 partitions) with an increasing number of workers.
Operating system: Ubuntu 10.04
Architecture: x86_64 AMD
Memory: 63 GB
CPU type: 6-Core AMD Opteron
Compiler version: gcc 4.4.3
Boost version: 1.40
X-axis: number of workers
Y-axis: time to solution in seconds
12
Results cont…
Distributed workers (2 workers, 1 chunk (10 MB), 5 partitions); Cyd01 and Cyder are used.
Case 1: master, workers, and data on the same machine
Case 2: remote master; data and workers on the same machine
Case 3: remote master; remote data for one worker and local data for the other
Case 4: remote master; remote data for all workers
X-axis: cases
Y-axis: time to solution in seconds
13
SAGA Map-Reduce Usability
Usable for users who have some familiarity with C++ and SAGA and prior knowledge of MapReduce.
Sufficiently documented. However, some important details about mounting the input and output locations for distributed runs were missing.
Tested on RHEL 4 and 5 and Ubuntu 10.04, with SAGA 1.5 and Boost version 1.40.
14
Future Work
Currently, SAGA-MapReduce only supports launching workers by forking on localhost and through SSH.
SAGA-BigJob can be used to launch the workers instead.
This helps in running MapReduce distributed over LONI machines.
However, mounting directories is a problem over LONI.
15
Thank You