Terasort Using SAGA-MapReduce
Given by: Sharath Maddineni
CCT: Center for Computation & Technology
Why Terasort?
- Sorting large datasets is a common step in scientific computations.
- Google processes around 20 petabytes of data per day using MapReduce.
- Google may therefore sort huge datasets of web pages, which makes searching and retrieval faster.
Introduction
- Sort Benchmark (http://sortbenchmark.org/): Google won the 2010 competition, and Yahoo's Hadoop won in 2009.
- However, Google's sorting is limited to the Google File System (GFS), and Yahoo's is tied to the Hadoop Distributed File System (HDFS).
- SAGA-MapReduce is infrastructure independent.
SAGA MapReduce Execution Overview
- The master is started from an executable linked against SAGA-MapReduce and creates an advert directory.
- The master uses the InputFormat specified in the JobDescription to chunk the input data.
- The master spawns workers on the host machines listed in the configuration file, using the SAGA Job API.
- Each worker writes its status information into the advert directory and communicates with the master through this advert service.
- Workers process the chunks assigned by the master using Map() and partition the data according to the partition function.
- When all chunks have been mapped, the master moves to the reduce phase.
- In the reduce phase, the master assigns sets of partitions to idle workers to be reduced.
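
The flow above can be sketched in a single process. This is only an illustration of the chunk, map, partition, and reduce ordering, not the SAGA-MapReduce API: the function names (`chunk_input`, `run_job`, and the word-count helpers) are hypothetical, and workers and the advert service are replaced by plain loops.

```python
from collections import defaultdict


def chunk_input(records, chunk_size):
    """Split the input into fixed-size chunks (stand-in for an InputFormat)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def run_job(records, map_fn, partition_fn, reduce_fn, num_partitions, chunk_size):
    """Run map over every chunk, then reduce every partition."""
    partitions = [defaultdict(list) for _ in range(num_partitions)]

    # Map phase: a real master would hand each chunk to a worker;
    # here the chunks are processed one after another.
    for chunk in chunk_input(records, chunk_size):
        for record in chunk:
            for key, value in map_fn(record):
                index = partition_fn(key, num_partitions)
                partitions[index][key].append(value)

    # Reduce phase: starts only after all chunks have been mapped.
    results = {}
    for partition in partitions:
        for key, values in partition.items():
            results[key] = reduce_fn(key, values)
    return results


# Hypothetical word-count job used only to exercise the flow.
def word_map(line):
    return [(word, 1) for word in line.split()]


def word_partition(key, num_partitions):
    # Deterministic across runs, unlike the built-in hash() on strings.
    return sum(key.encode()) % num_partitions


def word_reduce(key, values):
    return sum(values)


counts = run_job(["a b a", "b c", "a"], word_map, word_partition, word_reduce,
                 num_partitions=2, chunk_size=2)
```

Keeping the partition function separate from the map function mirrors the description above: the same map output can be routed differently just by swapping the partitioner, which is what the Terasort variant relies on.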
Terasort
- The Sort Benchmark provides a "Gensort" program to generate data records.
- Data format: each record is 100 bytes of ASCII, consisting of a 10-byte random key followed by a 90-byte value.
- A terabyte of data is 10^10 records of 100 bytes each.
- All records are sorted according to the 10-byte key.
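
As a minimal sketch of this record layout, the following splits a byte buffer into 100-byte records and sorts them by the first 10 bytes. The helper names are hypothetical, and `make_record` only produces records of the right shape for the example; it does not reproduce Gensort's actual output.

```python
RECORD_SIZE = 100  # bytes per record
KEY_SIZE = 10      # leading bytes used as the sort key


def split_records(data: bytes):
    """Cut a buffer into fixed-size 100-byte records."""
    if len(data) % RECORD_SIZE != 0:
        raise ValueError("input length is not a multiple of the record size")
    return [data[i:i + RECORD_SIZE] for i in range(0, len(data), RECORD_SIZE)]


def record_key(record: bytes) -> bytes:
    """Return the 10-byte key at the start of a record."""
    return record[:KEY_SIZE]


def sort_records(data: bytes) -> bytes:
    """Sort all records in the buffer by their 10-byte key."""
    return b"".join(sorted(split_records(data), key=record_key))


def make_record(key: bytes) -> bytes:
    """Build a dummy record: key padded to 10 bytes plus a 90-byte filler value."""
    return key.ljust(KEY_SIZE, b" ") + b"x" * (RECORD_SIZE - KEY_SIZE)


# One terabyte expressed as a record count: 10^10 records of 100 bytes each.
total_bytes = 10**10 * RECORD_SIZE

unsorted = make_record(b"zzz") + make_record(b"aaa") + make_record(b"mmm")
sorted_keys = [record_key(r) for r in split_records(sort_records(unsorted))]
```

Because the key is a fixed-width prefix, comparing the raw bytes is enough; no parsing of the value part is needed.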
Terasort SAGA Map-Reduce
- Similar to standard SAGA-MapReduce, except that the partition list is generated before the master is launched.
- The partition list ensures that each key emitted in the map phase goes into the partition that covers its range.
- This spreads the keys evenly across all partitions.
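
One way such a partition list could work is range partitioning over split points. The sketch below assumes the split points are taken at evenly spaced positions in a sorted sample of keys; the slides do not say how the list is actually generated, so both that choice and the function names are assumptions.

```python
import bisect


def build_partition_list(sample_keys, num_partitions):
    """Pick num_partitions - 1 split points from a sorted sample of keys."""
    ordered = sorted(sample_keys)
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def partition_for(key, split_points):
    """Return the index of the partition whose key range contains key."""
    return bisect.bisect_right(split_points, key)


# Assumed sample of keys, for illustration only.
sample = [b"b", b"d", b"f", b"h", b"j", b"l"]
split_points = build_partition_list(sample, num_partitions=3)

assignments = {key: partition_for(key, split_points)
               for key in (b"a", b"f", b"g", b"j", b"z")}
```

With ranges assigned this way, every key in partition i sorts before every key in partition i + 1, so sorting each partition independently and concatenating the outputs in partition order yields a globally sorted result.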
Distributed Workers for Terasort
- The Cyder and Cyd01 machines are used as workers.
- Prerequisite: passwordless SSH login from the master machine to the worker machines.
- Prerequisite: the input and output data locations are mounted on each machine using FUSE.
Results
- Increasing input data size with a constant number of workers (3); master and workers both on Cyd01.
- Operating system: Red Hat 5.5
- Architecture: x86_64
- Memory: 8 GB
- CPU type: Dual-Core AMD Opteron
- Compiler: gcc 4.4.3; Boost 1.40
- Plot axes: X = data set size in MB; Y = time to solution in seconds
Results (cont.)
- Constant input file size (400 MB, 6 chunks, 5 partitions) with an increasing number of workers.
- Operating system: Ubuntu 10.04
- Architecture: x86_64 AMD
- Memory: 63 GB
- CPU type: 6-Core AMD Opteron
- Compiler: gcc 4.4.3; Boost 1.40
- Plot axes: X = number of workers; Y = time to solution in seconds
Results (cont.)
- Distributed workers (2 workers, 1 chunk of 10 MB, 5 partitions), using Cyd01 and Cyder.
- Case 1: master, workers, and data on the same machine.
- Case 2: remote master; data and workers on the same machine.
- Case 3: remote master; remote data for one worker and local data for the other.
- Case 4: remote master; remote data for all workers.
- Plot axes: X = cases; Y = time to solution in seconds
SAGA Map-Reduce Usability
- Usable by those with some familiarity with C++ and SAGA and prior knowledge of MapReduce.
- Sufficiently documented; however, some important details about mounting the input and output locations for distributed runs were missing.
- Tested on RHEL 4 and 5 and Ubuntu 10.04, with SAGA 1.4.1 and 1.5 and Boost 1.40.
Future Work
- Currently, SAGA-MapReduce only supports launching workers by forking on localhost or over SSH.
- SAGA BigJob could be used to launch the workers instead.
- This would help in running MapReduce distributed over LONI machines.
- However, mounting directories over LONI is a problem.
Thank You