Experiences with Hadoop and MapReduce

Experiences with Hadoop and MapReduce ("Have fun with Hadoop")
Jian Wen, DB Lab, UC Riverside

Outline
- Background on MapReduce
- Summer 09 (freeman?): processing joins using MapReduce
- Spring 09 (Northeastern): NetflixHadoop
- Fall 09 (UC Irvine): distributed XML filtering using Hadoop

Background on MapReduce
- Started in Winter 2009
  - Course work: Scalable Techniques for Massive Data by Prof. Mirek Riedewald
  - Course project: NetflixHadoop
- A short exploration in Summer 2009
  - Research topic: efficient join processing on the MapReduce framework
  - Compared the homogenization and Map-Reduce-Merge strategies
- Continued in California
  - UCI course work: Scalable Data Management by Prof. Michael Carey
  - Course project: XML filtering using Hadoop

MapReduce Join: Research Plan
- Focused on performance analysis of different implementations of join processing in MapReduce:
- Homogenization: add information about the source of each record in the map phase, then do the join in the reduce phase (see the sketch below).
- Map-Reduce-Merge: a new primitive called merge is added to process the join separately.
- Other implementations: the map-reduce execution plans for joins generated by Hive.
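As a minimal sketch of the homogenization strategy (not the project's actual code), a reduce-side join in Hadoop can tag each record with its source relation in the map phase and pair the two sides in the reduce phase. The tab-separated record layout, the file-name test for the source relation, and the use of the newer org.apache.hadoop.mapreduce API are all assumptions here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Map phase (homogenization): prepend a source tag ("R" or "S") to every
// record so the reducer can tell which relation a tuple came from.
public class TaggingJoinMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t", 2);
    if (fields.length < 2) return; // skip malformed lines
    // Assumption: input files of relation R are named "R*", those of S "S*".
    String src = ((FileSplit) ctx.getInputSplit()).getPath().getName()
        .startsWith("R") ? "R" : "S";
    ctx.write(new Text(fields[0]), new Text(src + "|" + fields[1]));
  }
}

// Reduce phase: buffer the R tuples for one join key, then cross them with
// the S tuples sharing that key.
class TaggingJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context ctx)
      throws IOException, InterruptedException {
    List<String> rTuples = new ArrayList<String>();
    List<String> sTuples = new ArrayList<String>();
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("R|")) rTuples.add(s.substring(2));
      else sTuples.add(s.substring(2));
    }
    for (String r : rTuples)
      for (String s : sTuples)
        ctx.write(key, new Text(r + "\t" + s));
  }
}
```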

MapReduce Join: Research Notes
- A cost analysis model for processing latency: the whole map-reduce execution plan is divided into several primitives for analysis.
  - Distribute Mapper: partitions and distributes data onto several nodes.
  - Copy Mapper: duplicates data onto several nodes.
  - MR Transfer: transfers data between mapper and reducer.
  - Summary Transfer: generates statistics of the data and passes them between working nodes.
  - Output Collector: collects the outputs.
- Some basic attempts at theta-joins using MapReduce. Idea: a mapper supporting multi-cast keys (sketched below).
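A hedged sketch of the multi-cast key idea: a single input tuple is emitted under several reducer keys so that a theta predicate can be evaluated locally at each reducer. The grid-partitioning scheme and the constant N below are illustrative assumptions, not necessarily the project's design.

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Multi-cast mapper for the R side of a theta-join over an N x N grid of
// reducer cells: each R tuple picks a random row and is replicated to all N
// columns; an S-side mapper (not shown) would replicate column-wise. The
// reducer owning cell (row, col) applies the theta predicate locally.
public class MultiCastRMapper extends Mapper<Object, Text, IntWritable, Text> {
  private static final int N = 4; // assumed grid width
  private final Random rand = new Random();

  @Override
  protected void map(Object key, Text value, Context ctx)
      throws IOException, InterruptedException {
    int row = rand.nextInt(N);
    for (int col = 0; col < N; col++) {
      ctx.write(new IntWritable(row * N + col), new Text("R|" + value));
    }
  }
}
```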

NetflixHadoop: Problem Definition
- From the Netflix Prize competition.
- Data: 100,480,507 ratings from 480,189 users on 17,770 movies.
- Goal: predict the unknown rating for any given (user, movie) pair.
- Measurement: RMSE, the root mean squared error of the predictions (see below).
- Our approach: Singular Value Decomposition (SVD).
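For reference, the standard definition of RMSE over a test set $T$ of (user, movie) pairs, with true rating $r_{um}$ and predicted rating $\hat{r}_{um}$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} \left( r_{um} - \hat{r}_{um} \right)^2}$$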

NetflixHadoop: SVD Algorithm
- A feature means:
  - for a user: a preference (I like sci-fi, or comedy, ...);
  - for a movie: genre, content, ...;
  - in general: an abstract attribute of the object it belongs to.
- Feature vectors:
  - each user has a user feature vector;
  - each movie has a movie feature vector.
- The rating for a (user, movie) pair can be estimated from a linear combination of the features of the user and the movie, i.e. the dot product of the two vectors (formalized below).
- Algorithm: train the feature vectors to minimize the prediction error!
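In symbols, with user vector $p_u$ and movie vector $q_m$ (this notation is assumed here, not taken from the slides), the prediction and the usual gradient-descent update for one observed rating $r_{um}$, with learning rate $\eta$ and regularization weight $\lambda$, are:

$$\hat{r}_{um} = p_u \cdot q_m, \qquad e_{um} = r_{um} - \hat{r}_{um},$$
$$p_u \leftarrow p_u + \eta\,(e_{um}\, q_m - \lambda\, p_u), \qquad q_m \leftarrow q_m + \eta\,(e_{um}\, p_u - \lambda\, q_m).$$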

NetflixHadoop: SVD Pseudocode
Basic idea:
- initialize the feature vectors;
- then iteratively: calculate the error and adjust the feature vectors (a code sketch follows).
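A minimal single-machine sketch of the update step, assuming the standard incremental SVD recipe above; the learning rate and regularization constants are arbitrary assumptions.

```java
// Incremental SVD update for one (user, movie, rating) observation.
public class SvdStep {
  static final double LEARNING_RATE = 0.001; // assumed value
  static final double REGULARIZER = 0.015;   // assumed value

  // Both feature vectors are adjusted in place.
  static void train(double[] userVec, double[] movieVec, double rating) {
    // Predicted rating = dot product of the two feature vectors.
    double predicted = 0.0;
    for (int f = 0; f < userVec.length; f++) {
      predicted += userVec[f] * movieVec[f];
    }
    double err = rating - predicted;
    // Gradient step with regularization on every feature.
    for (int f = 0; f < userVec.length; f++) {
      double u = userVec[f], m = movieVec[f];
      userVec[f]  += LEARNING_RATE * (err * m - REGULARIZER * u);
      movieVec[f] += LEARNING_RATE * (err * u - REGULARIZER * m);
    }
  }
}
```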

NetflixHadoop: Implementation
Data pre-processing: randomize the order of the records.
- Mapper: for each record, assign a random integer key.
- Reducer: do nothing; simply emit the records (the framework automatically sorts the output by key, which shuffles the original order).
- A customized RatingOutputFormat, derived from FileOutputFormat, removes the key from the output.
A sketch of this job follows.
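A hedged sketch of the randomizing job; the class names are assumptions, and a NullWritable output stands in for the customized RatingOutputFormat mentioned above.

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Tag each rating record with a random key; the sort between map and reduce
// then orders records by that key, randomizing the original sequence.
public class RandomizeMapper extends Mapper<Object, Text, IntWritable, Text> {
  private final Random rand = new Random();

  @Override
  protected void map(Object key, Text record, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(new IntWritable(rand.nextInt(Integer.MAX_VALUE)), record);
  }
}

// Identity reduce: drop the random key and emit the record unchanged.
class RandomizeReducer extends Reducer<IntWritable, Text, Text, NullWritable> {
  @Override
  protected void reduce(IntWritable key, Iterable<Text> records, Context ctx)
      throws IOException, InterruptedException {
    for (Text r : records) ctx.write(r, NullWritable.get());
  }
}
```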

NetflixHadoop: Implementation
Feature vector training:
- Mapper: from an input (user, movie, rating), adjust the related feature vectors and output the vectors for that user and that movie.
- Reducer: for a given user/movie, compute the average of the feature vectors collected from the map phase.
- Challenge: globally sharing the feature vectors!

NetflixHadoop: Implementation
Globally sharing the feature vectors:
- Global variables: fail! Different mappers run in different JVMs, and no variable is shared across JVMs.
- Database (DBInputFormat): fail! Errors in configuration, and poor performance is expected due to frequent updates (race conditions, query start-up overhead).
- Hadoop configuration files: fine! Data can be shared and modified by different mappers, limited only by the main memory of each working node (see the sketch below).
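A hedged illustration of the configuration-based sharing described above: a feature vector is serialized into the job Configuration, which Hadoop ships to every task. The key naming and comma encoding are assumptions.

```java
import org.apache.hadoop.conf.Configuration;

// Helpers to pack a feature vector into the job Configuration so that every
// mapper can read it; updated vectors are written back when the next job is
// configured. Everything lives in each task's memory, which matches the
// memory limit noted on the slide.
public final class VectorConfig {
  private VectorConfig() {}

  public static void put(Configuration conf, String key, double[] vec) {
    StringBuilder sb = new StringBuilder();
    for (double v : vec) {
      if (sb.length() > 0) sb.append(',');
      sb.append(v);
    }
    conf.set(key, sb.toString());
  }

  public static double[] get(Configuration conf, String key) {
    String[] parts = conf.get(key).split(",");
    double[] vec = new double[parts.length];
    for (int i = 0; i < parts.length; i++) {
      vec[i] = Double.parseDouble(parts[i]);
    }
    return vec;
  }
}
```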

NetflixHadoop: Experiments
Experiments using single-threaded, multi-threaded, and MapReduce implementations.
Test environment:
- Hadoop 0.19.1.
- Single-machine, virtualized environment:
  - host: 2.2 GHz Intel Core 2 Duo, 4 GB 667 MHz RAM, Mac OS X;
  - virtual machines: 2 virtual processors and 748 MB RAM each, Fedora 10.
- Distributed environment:
  - 4 nodes (should be... currently 9 nodes);
  - 400 GB hard drive on each node;
  - Hadoop heap size: 1 GB (failed to finish).

NetflixHadoop: Experiments (three slides of result charts; the figures are not preserved in the transcript)

XML Filtering: Problem Definition
- Aimed at a pub/sub system running in a distributed computation environment.
- Pub/sub: the queries are known, and the data are fed into the system as a stream (in a DBMS it is the reverse: the data are known, and the queries are fed in).

XML Filtering: Pub/Sub System (diagram: documents stream into the XML filters, which are built from the XML queries)

XML Filtering: Algorithms
- We use the YFilter algorithm.
- YFilter: the XML queries are indexed as an NFA; the XML data is then fed through the NFA, and matches are reported at the final states.
- Easy to parallelize: the queries can be partitioned and indexed separately.

XML Filtering: Implementations
Three benchmark platforms were implemented in our project:
- single-threaded: directly apply YFilter to the profiles and the document stream;
- multi-threaded: parallelize YFilter across different threads;
- Map/Reduce: parallelize YFilter across different machines (currently in a pseudo-distributed environment).

XML Filtering: Single-Threaded Implementation
- The index (NFA) is built once over the whole set of profiles.
- Documents are then streamed into YFilter for matching.
- Matching results are returned by YFilter.

XML Filtering: Multi-Threaded Implementation
- The profiles are split into parts, and each part is used to build a separate NFA.
- Each YFilter instance listens on a port for incoming documents, then outputs its results through the socket.

XML Filtering: Map/Reduce Implementation
Profile splitting:
- The profiles are read line by line, with the line number as the key and the profile as the value.
- Map: for each profile, assign a new key using (old_key % split_num).
- Reduce: output all profiles with the same key into one file.
- Output: separated profile files, each holding the profiles that share the same (old_key % split_num) value.
A sketch of this job follows.
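A hedged sketch of the profile-splitting job; SPLIT_NUM and the class names are assumptions. Note that with the standard TextInputFormat the map key is the byte offset of the line rather than its line number, so the offset is used as a stand-in identifier here.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Partition the profile list into SPLIT_NUM groups so that each group can
// later build its own YFilter NFA.
public class ProfileSplitMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {
  private static final int SPLIT_NUM = 4; // assumed number of splits

  @Override
  protected void map(LongWritable lineNo, Text profile, Context ctx)
      throws IOException, InterruptedException {
    // Keep the original key with the profile so the matching job can report
    // it later.
    ctx.write(new IntWritable((int) (lineNo.get() % SPLIT_NUM)),
              new Text(lineNo.get() + "\t" + profile));
  }
}

// With SPLIT_NUM reducers, each reduce call sees one group and each output
// file holds exactly the profiles of that group.
class ProfileSplitReducer
    extends Reducer<IntWritable, Text, Text, NullWritable> {
  @Override
  protected void reduce(IntWritable group, Iterable<Text> profiles, Context ctx)
      throws IOException, InterruptedException {
    for (Text p : profiles) ctx.write(p, NullWritable.get());
  }
}
```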

XML Filtering: Map/Reduce Implementation
Document matching:
- The split profiles are read file by file, with the file number as the key and the profiles as the value.
- Map: for each set of profiles, run YFilter on the document (fed in as a configuration of the job), and output the old_key of each matching profile as the key with the file number as the value.
- Reduce: just collect the results.
- Output: all keys (line numbers) of the matching profiles.
A sketch of the matching mapper follows.
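A hedged sketch of the matching mapper. YFilterEngine below is a hypothetical wrapper interface, not YFilter's real API, and the "xml.document" configuration key plus the input format delivering (file number, profiles) pairs are likewise assumptions.

```java
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical wrapper around YFilter: builds an NFA from one profile group
// and reports the ids of the profiles a document matches.
interface YFilterEngine {
  List<Integer> match(String xmlDocument);
}

// Assumes an input format that supplies (file number, profile group) pairs,
// as described on the slide.
public class MatchingMapper
    extends Mapper<IntWritable, Text, IntWritable, IntWritable> {
  private String document;

  @Override
  protected void setup(Context ctx) {
    // The document is shipped to every mapper through the job configuration.
    document = ctx.getConfiguration().get("xml.document");
  }

  @Override
  protected void map(IntWritable fileNo, Text profiles, Context ctx)
      throws IOException, InterruptedException {
    YFilterEngine engine = buildEngine(profiles.toString());
    for (int profileId : engine.match(document)) {
      ctx.write(new IntWritable(profileId), fileNo);
    }
  }

  private YFilterEngine buildEngine(String profileGroup) {
    // Placeholder: wire in the real YFilter NFA construction here.
    throw new UnsupportedOperationException("YFilter integration goes here");
  }
}
```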

XML Filtering: Map/Reduce Implementation (workflow diagram omitted from the transcript)

XML Filtering: Experiments
- Hardware: MacBook, 2.2 GHz Intel Core 2 Duo, 4 GB 667 MHz DDR2 SDRAM.
- Software: Java 1.6.0_17 with a 1 GB heap size; Cloudera Hadoop Distribution (0.20.1) in a virtual machine.
- Data:
  - XML docs: SIGMOD Record (9 files);
  - profiles: 25K and 50K profiles on SIGMOD Record.

Doc:   1       2       3       4       5       6      7      8      9
Size:  478416  415043  312515  213197  103528  53019  42128  30467  20984

XML Filtering: Experiments
- Running out of memory: we encountered this problem in all three benchmarks; however, Hadoop is much more robust against it:
  - the profile splits are smaller;
  - the map-phase scheduler uses memory wisely.
- Race conditions: since the YFilter code we are using is not thread-safe, race conditions corrupt the results in the multi-threaded version. Hadoop works around this through its shared-nothing runtime: separate JVMs are used for different mappers, instead of threads that may share lower-level state.

XML Filtering: Experiments (result chart omitted from the transcript): there were memory failures, and the jobs failed too.

XML Filtering: Experiments (result chart omitted from the transcript): there were memory failures, but the jobs recovered.

Questions?