An Approach for Processing Large and Non-uniform Media Objects on MapReduce-Based Clusters
Rainer Schmidt and Matthias Rella
Speaker: Lin-You Wu

Outline
1. Problem Description and Strategy
2. Data Decomposition
3. User Defined Functions
4. Application Design
5. Experiment Setup
6. Conclusion

1. Problem Description and Strategy

The application takes m audiovisual (AV) files as input and generates n AV output files. In a first phase, the MapReduce framework creates a set of mappers for each input file and assigns them to file partitions (splits). The mappers subdivide and process the input splits based on interpretable binary chunks (records).

For each record, a mapper creates a set of intermediate output data streams, which are grouped by keys (K) representing (parts of) the input file. Reducers remotely read the intermediate data streams and generate a result file for each key.
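As a rough illustration of this flow (not the authors' actual code), a Hadoop mapper/reducer pair might look like the sketch below. The frame transform, the key choice, and the type parameters are assumptions for illustration only:

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical mapper: receives one decoded frame (record) per call and
// emits it under a key identifying the source file, so that one reducer
// (and hence one result file) is produced per input file.
class FrameMapper extends Mapper<LongWritable, BytesWritable, Text, BytesWritable> {

  private final Text fileKey = new Text();

  @Override
  protected void map(LongWritable frameNo, BytesWritable frame, Context ctx)
      throws IOException, InterruptedException {
    fileKey.set(((FileSplit) ctx.getInputSplit()).getPath().getName());
    ctx.write(fileKey, transcode(frame));   // user-defined transform
  }

  private BytesWritable transcode(BytesWritable frame) {
    return frame;                           // placeholder: identity transform
  }
}

// Hypothetical reducer: concatenates the processed frames of one input file.
// A real implementation would need a secondary sort to guarantee frame order.
class FrameReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
  @Override
  protected void reduce(Text file, Iterable<BytesWritable> frames, Context ctx)
      throws IOException, InterruptedException {
    for (BytesWritable frame : frames) {
      ctx.write(file, frame);
    }
  }
}
```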

2. Data Decomposition
Most binary formats cannot be broken into chunks at arbitrary positions, or do not support splitting at all. When processing sets of relatively small files, one can sidestep this problem by dividing the payload on a per-file basis, as sketched below.
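In Hadoop, the per-file fallback is commonly realized by an input format that vetoes splitting, so each map task receives exactly one whole file. A minimal sketch (the class name and record types are assumptions):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Treat each media file as a single, indivisible split.
class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;   // never split: one map task per file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // A reader that emits the whole file as a single record would go here
    // (omitted for brevity; see the Frame Record Reader sketch below).
    throw new UnsupportedOperationException("reader sketch omitted");
  }
}
```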

3. User Defined Functions
The model has been widely used for analyzing and transforming large-scale data sets such as text documents, database query results, or log files, which can reside on different storage systems. In order to enable the development of such user-defined functions for processing audiovisual content, we have identified two major requirements:

(1) The framework must provide a mechanism to generate meaningful records that can be processed within the map function.
(2) The framework must provide the required abstractions and mechanisms to analyze and transform the records.

This paper, however, primarily concentrates on data decomposition and record generation.

Fig. 2: A data pipeline for extracting records (audio/video frames) from splits of compressed input data. Split borders (s1, s2) are adjusted (s1x, s2x) based on key frames discovered within the multiplexed data stream.

4. Application Design
AV Splittable Compression Codec. One of the most critical issues in the parallel processing of large files is handling compression. Hadoop provides codec implementations for a set of file compression formats, including gzip, bzip2, LZO, and DEFLATE.

It is, however, critical to consider whether a file format supports splitting before processing it with MapReduce. Hadoop provides a specific interface called SplittableCompressionCodec to denote codecs that support the compression/decompression of streams at arbitrary positions. We have implemented AV Splittable Compression Codec, a class that supports the compression, decompression, and splitting of audiovisual files.
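The standard way a Hadoop input format decides splittability is to check whether the file's codec implements this interface; uncompressed files are always splittable. A short sketch of that check:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.SplittableCompressionCodec;

class SplitCheck {
  // A file may be split only if it is uncompressed or its codec supports
  // decompression starting at arbitrary stream positions.
  static boolean isSplittable(Configuration conf, Path file) {
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    return codec == null || codec instanceof SplittableCompressionCodec;
  }
}
```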

AV Input Stream. In order to split a binary stream, it must be possible to detect positions at which the data can be decomposed into blocks. As shown in Figure 2, split boundaries must be repositioned to key-frame positions by the codec in order to support decomposition of the data stream.
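Conceptually, the adjustment of s1 to s1x is a forward scan from the nominal split offset to the next key frame. The sketch below illustrates the idea on an in-memory buffer; the 4-byte marker is a hypothetical placeholder, since real key-frame detection is format-specific and operates on the demuxed packet stream rather than raw bytes:

```java
import java.io.IOException;

class BoundaryAdjuster {
  // Hypothetical sync marker standing in for a real key-frame signature.
  private static final byte[] KEY_FRAME_MARKER = {0x00, 0x00, 0x01, (byte) 0xB6};

  // Returns the offset of the first key frame at or after 'start', i.e. the
  // adjusted boundary s1x for the nominal boundary s1. Operates on an
  // in-memory buffer for simplicity; a codec would scan a seekable stream.
  static long adjust(byte[] data, long start) throws IOException {
    for (int i = (int) start; i + KEY_FRAME_MARKER.length <= data.length; i++) {
      boolean match = true;
      for (int j = 0; j < KEY_FRAME_MARKER.length; j++) {
        if (data[i + j] != KEY_FRAME_MARKER[j]) { match = false; break; }
      }
      if (match) return i;   // split now begins on a decodable frame
    }
    throw new IOException("no key frame found after offset " + start);
  }
}
```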

Frame Record Reader. Record readers are plugged into the input file format of a particular MapReduce job. They typically convert the data provided by the input stream into a set of key/value pairs (called records) that are processed within the map and reduce tasks.

We utilize the concept of packets, which are logical data entities read and uncompressed from the input sequences. Packets are subsequently decoded (optionally error-checked and resampled) into objects of a specific data type. For example, a frame record reader can utilize the concepts described above to obtain audio/video frames from a generic input split.
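A skeleton of such a frame record reader in Hadoop's RecordReader API might look as follows; the FrameDecoder abstraction, the key/value types, and the packet-decoding details are assumptions, not the authors' implementation:

```java
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch: pulls packets from the split, decodes them into frames, and hands
// each frame to the map function as one record.
class FrameRecordReader extends RecordReader<LongWritable, BytesWritable> {

  private final LongWritable frameNo = new LongWritable(-1);
  private final BytesWritable frame = new BytesWritable();
  private FrameDecoder decoder;               // hypothetical packet decoder

  @Override
  public void initialize(InputSplit split, TaskAttemptContext ctx)
      throws IOException {
    decoder = FrameDecoder.open(split);       // seeks to the adjusted boundary
  }

  @Override
  public boolean nextKeyValue() throws IOException {
    byte[] next = decoder.readFrame();        // read + uncompress one packet,
    if (next == null) return false;           // then decode it into a raw frame
    frameNo.set(frameNo.get() + 1);
    frame.set(next, 0, next.length);
    return true;
  }

  @Override public LongWritable getCurrentKey()    { return frameNo; }
  @Override public BytesWritable getCurrentValue() { return frame; }
  @Override public float getProgress() throws IOException {
    return decoder.progress();                // fraction of the split consumed
  }
  @Override public void close() throws IOException { decoder.close(); }

  // Hypothetical decoder abstraction; a real one would wrap a demuxer/codec.
  interface FrameDecoder extends java.io.Closeable {
    static FrameDecoder open(InputSplit split) throws IOException {
      throw new UnsupportedOperationException("format-specific; not shown");
    }
    byte[] readFrame() throws IOException;    // null at end of split
    float progress();
  }
}
```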

Output Generation. Record writers provide the inverse concept to record readers. They write the delivered job outputs in the form of key/value pairs to the file system. Output files are produced per reduce task and might have to be merged in a post-processing stage.
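The inverse side can be sketched as a record writer that appends each processed frame to the task's output stream; container muxing and re-encoding are format-specific and omitted here (types assumed):

```java
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Sketch: writes each processed frame to the per-task output file.
class FrameRecordWriter extends RecordWriter<Text, BytesWritable> {

  private final DataOutputStream out;

  FrameRecordWriter(DataOutputStream out) { this.out = out; }

  @Override
  public void write(Text key, BytesWritable frame) throws IOException {
    out.write(frame.getBytes(), 0, frame.getLength());  // append raw frame bytes
  }

  @Override
  public void close(TaskAttemptContext ctx) throws IOException {
    out.close();   // one output file per reduce task; merge afterwards if needed
  }
}
```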

5. Experiment Setup
The cluster consists of five worker nodes (single-core 1.86 GHz Intel CPU, 1.5 GB RAM) connected through Gigabit Ethernet. We utilize MP3 (48 kHz) and MPEG-4 (25 fps) as compression formats and AVI as a container.

Table 1: Payload data file sizes depend on encoding and duration.

Results and Improvements

6. Conclusion
We have presented an application for the parallel processing of binary media objects based on the MapReduce programming model. Furthermore, we provide insights into interpreting the compressed payload data, as this is highly important for assessing and balancing the application's workload.

Thank you for listening