Advanced Topics on MapReduce with Hadoop. Jiaheng Lu, Department of Computer Science, Renmin University of China. www.jiahenglu.net

Outline: brief review; chaining MapReduce jobs; join in MapReduce; Bloom filter.

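Brief review: MapReduce is a parallel programming framework based on divide and merge. The input data is divided into splits (split0, split1, split2). Each split is processed by a map task (the mappers), the map output is shuffled to the reduce tasks (the reducers), and the reducers write the output data (output0, output1).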
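The data flow above can be imitated in memory with plain Java collections. The following is only a sketch under stated assumptions: the class name MiniMapReduce and the word-count task are illustrative choices, not part of the slides, and no Hadoop classes are used.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only: no Hadoop classes are involved.
class MiniMapReduce {

    // Map task: emit a (word, 1) pair for every word in one input split.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<String, Integer>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle: group the values of all map outputs by key (sorted by key).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce task: sum the values collected for one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run map over every split, shuffle, then reduce each key group.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String split : splits) {
            mapOutput.addAll(map(split));
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(mapOutput).entrySet()) {
            output.put(e.getKey(), reduce(e.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c")));
    }
}
```

In a real cluster the splits, map tasks and reduce tasks run on different machines; here they are ordinary method calls so that the three phases are easy to follow.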

Chaining MapReduce jobs: chaining in a sequence; chaining with complex dependencies; chaining preprocessing and postprocessing steps.

Chaining in a sequence: simple and straightforward. Patterns: [MAP | REDUCE]+ and MAP+ | REDUCE | MAP*. The output of one job is the input to the next, similar to pipes: Job1 -> Job2 -> Job3.

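    // Old (org.apache.hadoop.mapred) API; "in" and "out" are the input and output paths.
    Configuration conf = getConf();
    JobConf job = new JobConf(conf);
    job.setJobName("ChainJob");
    job.setInputFormat(TextInputFormat.class);
    job.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(job, in);
    FileOutputFormat.setOutputPath(job, out);

    // Add the first mapper of the chain, with its own local configuration.
    JobConf map1Conf = new JobConf(false);
    ChainMapper.addMapper(job, Map1.class,
        LongWritable.class, Text.class,   // input key and value types
        Text.class, Text.class,           // output key and value types
        true, map1Conf);                  // pass by value, mapper configuration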

Chaining with complex dependency: jobs are not chained in a linear fashion. Use the addDependingJob() method to add dependency information: x.addDependingJob(y) means that job x will not start until job y has finished. (Diagram: Job1, Job2, Job3.)
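To make the dependency rule concrete, here is a small sketch that only computes a valid execution order. It is not Hadoop's JobControl: the class JobGraphSketch is hypothetical, jobs are plain strings, and the example assumes that Job3 depends on both Job1 and Job2, which is one possible reading of the diagram.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a job controller; jobs are identified by name.
class JobGraphSketch {
    private final Map<String, Set<String>> dependsOn = new LinkedHashMap<>();

    void addJob(String name) {
        dependsOn.putIfAbsent(name, new LinkedHashSet<>());
    }

    // Mirrors x.addDependingJob(y): x may start only after y has finished.
    void addDependingJob(String x, String y) {
        addJob(x);
        addJob(y);
        dependsOn.get(x).add(y);
    }

    // Repeatedly pick every job whose dependencies have all finished.
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        Set<String> done = new HashSet<>();
        boolean progressed = true;
        while (order.size() < dependsOn.size() && progressed) {
            progressed = false;
            for (Map.Entry<String, Set<String>> e : dependsOn.entrySet()) {
                if (!done.contains(e.getKey()) && done.containsAll(e.getValue())) {
                    done.add(e.getKey());
                    order.add(e.getKey());
                    progressed = true;
                }
            }
        }
        if (order.size() < dependsOn.size()) {
            throw new IllegalStateException("cyclic dependency between jobs");
        }
        return order;
    }

    public static void main(String[] args) {
        JobGraphSketch graph = new JobGraphSketch();
        graph.addDependingJob("Job3", "Job1");
        graph.addDependingJob("Job3", "Job2");
        System.out.println(graph.executionOrder());
    }
}
```

Job1 and Job2 have no dependency on each other, so a real controller would be free to run them in parallel; the sketch simply lists them before Job3.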

Chaining preprocessing and postprocessing steps. Example: removing stop words in information retrieval (IR). Approaches: run each step as a separate job (inefficient), or chain the steps into a single job with ChainMapper.addMapper() and ChainReducer.setReducer(), following the pattern Map+ | Reduce | Map*.
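The Map+ | Reduce | Map* pattern can be pictured as ordinary function composition. The sketch below is an in-memory analogy, not the ChainMapper API: the class ChainSketch, the stop-word list and the minimum-count postprocessing step are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class ChainSketch {
    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("a", "an", "the", "of"));

    // Map1 (preprocessing): split a line into lowercase tokens.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String t : line.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Map2 (preprocessing): drop stop words.
    static List<String> removeStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!STOP_WORDS.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    // Reduce: count the occurrences of each remaining word.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Map3 (postprocessing): keep only words seen at least minCount times.
    static Map<String, Integer> dropRare(Map<String, Integer> counts, int minCount) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() >= minCount) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }

    // Map+ | Reduce | Map* run as one pass, with no intermediate files.
    static Map<String, Integer> run(List<String> lines, int minCount) {
        List<String> words = new ArrayList<>();
        for (String line : lines) {
            words.addAll(removeStopWords(tokenize(line)));
        }
        return dropRare(count(words), minCount);
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("The cat of the house", "A cat"), 2));
    }
}
```

Running the steps as separate jobs would write and re-read intermediate output between each step; chaining keeps the records flowing from one step to the next inside one job.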

Join in MapReduce: reduce-side join; broadcast join; map-side filtering combined with a reduce-side join. Map-side filtering can use a given key, a range from the dataset (broadcast), or a Bloom filter.

Reduce-side join: the map output key is the join key, and the map output value is the record tagged with its data source. The reducer does a full cross-product of the values from the different sources and outputs the combined results.

Example

Table x:
a   b
1   ab
1   cd
4   ef

Table y:
a   c
1   b
2   d
4   c

map() output (join key, tag, value):
1   x   ab
1   x   cd
4   x   ef
1   y   b
2   y   d
4   y   c

shuffle() output (key, value list):
1   [x ab, x cd, y b]
2   [y d]
4   [x ef, y c]

reduce() output:
a   b   c
1   ab  b
1   cd  b
4   ef  c
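The example can be replayed with a short in-memory sketch. It assumes each row is a two-column array {join key, value} and uses the tags "x" and "y" for the two sources; the class ReduceSideJoinSketch is hypothetical and stands in for real mapper and reducer classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceSideJoinSketch {

    // map(): key each row by its join key and tag the value with its source.
    // The TreeMap plays the role of the shuffle, grouping values by key.
    static void map(String tag, String[][] rows, Map<String, List<String[]>> shuffle) {
        for (String[] row : rows) {
            shuffle.computeIfAbsent(row[0], k -> new ArrayList<>())
                   .add(new String[] {tag, row[1]});
        }
    }

    // reduce(): split the value list by tag, then emit the full cross-product.
    static List<String> reduce(String key, List<String[]> taggedValues) {
        List<String> fromX = new ArrayList<>();
        List<String> fromY = new ArrayList<>();
        for (String[] tv : taggedValues) {
            if (tv[0].equals("x")) {
                fromX.add(tv[1]);
            } else {
                fromY.add(tv[1]);
            }
        }
        List<String> out = new ArrayList<>();
        for (String xv : fromX) {
            for (String yv : fromY) {
                out.add(key + "," + xv + "," + yv);
            }
        }
        return out;
    }

    static List<String> join(String[][] tableX, String[][] tableY) {
        Map<String, List<String[]>> shuffle = new TreeMap<>();
        map("x", tableX, shuffle);
        map("y", tableY, shuffle);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String[]>> e : shuffle.entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] x = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] y = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        // Prints [1,ab,b, 1,cd,b, 4,ef,c]; key 2 has no partner in table x.
        System.out.println(join(x, y));
    }
}
```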

Broadcast join (replicated join): broadcast the smaller table to every mapper and do the join in map(). The table is distributed with the distributed cache: DistributedCache.addCacheFile().
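A minimal in-memory sketch of the same idea follows. It assumes the smaller table fits in a hash map and represents rows as {join key, value} arrays; the class BroadcastJoinSketch is hypothetical, and the step that would read the cached file in a real job is replaced by a plain method call.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BroadcastJoinSketch {

    // Setup step: load the smaller table into memory, indexed by join key.
    // In a real job this data would come from the file placed in the distributed cache.
    static Map<String, List<String>> loadSmallTable(String[][] smallTable) {
        Map<String, List<String>> cache = new HashMap<>();
        for (String[] row : smallTable) {
            cache.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        return cache;
    }

    // map(): join one record of the larger table against the in-memory table.
    static List<String> map(String[] bigRow, Map<String, List<String>> cache) {
        List<String> out = new ArrayList<>();
        List<String> matches = cache.getOrDefault(bigRow[0], Collections.<String>emptyList());
        for (String match : matches) {
            out.add(bigRow[0] + "," + bigRow[1] + "," + match);
        }
        return out;
    }

    // No shuffle and no reduce phase: the join is finished on the map side.
    static List<String> join(String[][] bigTable, String[][] smallTable) {
        Map<String, List<String>> cache = loadSmallTable(smallTable);
        List<String> out = new ArrayList<>();
        for (String[] row : bigTable) {
            out.addAll(map(row, cache));
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] big = {{"1", "ab"}, {"1", "cd"}, {"4", "ef"}};
        String[][] small = {{"1", "b"}, {"2", "d"}, {"4", "c"}};
        System.out.println(join(big, small));
    }
}
```

The trade-off is memory: every mapper holds a full copy of the smaller table, which is why this approach only suits joins where one side is small.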

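Map-side filtering and reduce-side join. Join key: student IDs from info. Generate an IDs file from info and broadcast it, as in a broadcast join, so that the mappers can filter records before the reduce-side join. What if the IDs file cannot be stored in memory? Use a Bloom filter.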

A Bloom filter: introduction; implementation of a Bloom filter; use in a MapReduce join.

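Introduction to Bloom filters: a Bloom filter is a space-efficient data structure of constant size used to test whether an element belongs to a set. It supports two operations, add() and contains(). It gives no false negatives and a small probability of false positives.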

Implementation of a Bloom filter: allocate a bit array. To add an element, generate k indexes and set those k bits to 1. To test an element, generate k indexes: if all k bits are 1, return true; if not all are 1, return false.

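Example: (1) initial state: all bits are 0. (2) add x(0,2,6): bits 0, 2 and 6 are set. (3) add y(0,3,9): bits 0, 3 and 9 are set. (4) contains m(1,3,9): false (×), because bit 1 is 0. (5) contains n(0,2,9): true (√), although n was never added; this is a false positive.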
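The add and test steps can be sketched with java.util.BitSet. The slides do not say how the k indexes are generated, so the double-hashing scheme based on String.hashCode() below is an assumption, as are the class name BloomFilterSketch and the array size of 10 used to replay the example.

```java
import java.util.BitSet;

class BloomFilterSketch {
    private final BitSet bits;
    private final int size;
    private final int k;

    BloomFilterSketch(int size, int k) {
        this.bits = new BitSet(size);
        this.size = size;
        this.k = k;
    }

    // Generate k indexes for an element by double hashing.
    // Assumption: the slides do not specify the hash functions.
    private int[] indexes(String element) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1;
        int[] result = new int[k];
        for (int i = 0; i < k; i++) {
            result[i] = Math.floorMod(h1 + i * h2, size);
        }
        return result;
    }

    void add(String element) {
        setBits(indexes(element));
    }

    boolean contains(String element) {
        return allSet(indexes(element));
    }

    // Set the given bits to 1.
    void setBits(int... idx) {
        for (int i : idx) {
            bits.set(i);
        }
    }

    // True only if every given bit is 1.
    boolean allSet(int... idx) {
        for (int i : idx) {
            if (!bits.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Replay the example with its explicit indexes.
        BloomFilterSketch filter = new BloomFilterSketch(10, 3);
        filter.setBits(0, 2, 6);                      // add x
        filter.setBits(0, 3, 9);                      // add y
        System.out.println(filter.allSet(1, 3, 9));   // false: bit 1 was never set
        System.out.println(filter.allSet(0, 2, 9));   // true: a false positive
    }
}
```

Because bits are only ever set and never cleared, an element that was added always tests true, which is why false negatives cannot occur.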

Use in a MapReduce join: run a separate subjob to create the Bloom filter. Broadcast the Bloom filter and use it in map() of the join job to drop useless records, then do the join in reduce.
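The two stages can be sketched in memory as follows. This is an illustration under assumptions, not a Hadoop job: the class BloomFilteredJoinSketch, the filter size, the value of k and the hash scheme are all hypothetical, and the broadcast step is reduced to passing the BitSet as an argument.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BloomFilteredJoinSketch {
    static final int SIZE = 1 << 10; // number of bits, chosen arbitrarily
    static final int K = 3;          // number of indexes per key, chosen arbitrarily

    // Assumed hash scheme: double hashing on String.hashCode().
    static int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 >>> 16) | 1;
        return Math.floorMod(h1 + i * h2, SIZE);
    }

    // Subjob 1: build the filter from the join keys of one dataset.
    static BitSet buildFilter(List<String> joinKeys) {
        BitSet filter = new BitSet(SIZE);
        for (String key : joinKeys) {
            for (int i = 0; i < K; i++) {
                filter.set(index(key, i));
            }
        }
        return filter;
    }

    static boolean mightContain(BitSet filter, String key) {
        for (int i = 0; i < K; i++) {
            if (!filter.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // map() of the join job: drop records whose key cannot have a partner.
    // Records that pass may still be false positives; the reduce-side join removes them.
    static List<String[]> mapSideFilter(List<String[]> records, BitSet filter) {
        List<String[]> kept = new ArrayList<>();
        for (String[] record : records) {
            if (mightContain(filter, record[0])) {
                kept.add(record);
            }
        }
        return kept;
    }
}
```

Only the surviving records are shuffled to the reducers, so the filter reduces network traffic without changing the join result.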

References: Chuck Lam, “Hadoop in Action”; Jairam Chandar, “Join Algorithms using Map/Reduce”.

THANK YOU
