Map-Reduce and Parallel Computing for Large-Scale Media Processing
Youjie Zhou

Outline
► Motivations
► Map-Reduce Framework
► Large-scale Multimedia Processing Parallelization
► Machine Learning Algorithm Transformation
► Map-Reduce Drawbacks and Variants
► Conclusions

Motivations
► Why do we need parallelization?
  - "Time is money": divide-and-conquer lets independent pieces of work run simultaneously
  - Data is too huge to handle on a single machine
    - Google reported 1 trillion (10^12) unique URLs in 2008
    - single-CPU speed is limited

Motivations
► Why do we need parallelization?
  - Ever-increasing data
    - social networks
    - scalability!
  - "Brute force" over the full data
    - no approximations needed
    - cheap clusters vs. expensive computers

Motivations
► Why do we choose Map-Reduce?
  - Popular
    - a parallelization framework proposed by Google, and Google uses it every day
    - Yahoo and Amazon are also involved
  - Popular = good?
    - "hides" parallelization details from users
    - provides high-level operations that suit the majority of algorithms
  - A good starting point for deeper parallelization research

Map-Reduce Framework
► A simple idea inspired by functional languages (like LISP)
  - map: a type of iteration in which a function is successively applied to each element of a sequence
  - reduce: a function that combines all the elements of a sequence using a binary operation
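
These two primitives already exist in functional languages; here is a tiny illustration in Python, the language used for all sketches in this transcript:

    from functools import reduce

    squared = list(map(lambda x: x * x, [1, 2, 3, 4]))  # apply a function to each element
    total = reduce(lambda a, b: a + b, squared)         # combine elements with a binary operation
    print(squared, total)                               # [1, 4, 9, 16] 30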

Map-Reduce Framework
► Data representation: <key, value> pairs
  - map generates <key, value> pairs
  - reduce combines pairs that share the same key
► A "Hello, world!" example follows

Map-Reduce Framework
[Diagram: the input data is partitioned into splits (split0, split1, split2); map tasks process the splits, and reduce tasks merge their outputs into the final output.]

Map-Reduce Framework
► Count the appearances of each different word in a set of documents

    void map(Document d):
        for each word in d:
            generate <word, 1>

    void reduce(word, CountList):
        int count = 0
        for each number in CountList:
            count += number
        generate <word, count>
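
A minimal runnable simulation of this job in plain Python; the shuffle step that groups values by key is modeled with a dictionary, and no cluster is assumed:

    from collections import defaultdict

    def map_fn(document):
        # emit a <word, 1> pair for every word in the document
        for word in document.split():
            yield word, 1

    def reduce_fn(word, counts):
        # combine all counts emitted for the same word
        return word, sum(counts)

    documents = ["hello world", "hello map reduce"]

    # shuffle: group intermediate values by key, as the framework would
    groups = defaultdict(list)
    for doc in documents:
        for word, one in map_fn(doc):
            groups[word].append(one)

    print(dict(reduce_fn(w, c) for w, c in groups.items()))
    # {'hello': 2, 'world': 1, 'map': 1, 'reduce': 1}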

Map-Reduce Framework
► Different implementations
  - Distributed computing
    - each computer acts as a computing node
    - focuses on reliability over distributed computer networks
    - Google's clusters: closed source; GFS, the Google distributed file system
    - Hadoop: open source; HDFS, the Hadoop distributed file system

Map-Reduce Framework
► Different implementations
  - Multi-core computing
    - each core acts as a computing node
    - focuses on high-speed computing using large shared memories
    - Phoenix++: open source, created at Stanford; map and reduce read and write pairs in a two-dimensional table stored in memory
    - GPU: 10x higher memory bandwidth than a CPU; 5x to 32x speedups on SVM training

Large-scale Multimedia Processing Parallelization
► Clustering
  - k-means
  - spectral clustering
► Classifier training
  - SVM
► Feature extraction and indexing
  - bag-of-features
  - text inverted indexing

Clustering
► k-means
  - basic and fundamental
  - original algorithm:
    1. pick k initial center points
    2. iterate until convergence:
       a. assign each point to the nearest center
       b. recalculate the centers
  - easy to parallelize!

Clustering
► k-means on Map-Reduce
  - a shared file contains the current center points
  - map
    1. for each point, find the nearest center
    2. generate a <center id, point coordinates> pair
  - reduce
    1. collect all points belonging to the same cluster (they share the same key)
    2. average them to obtain the new center
  - iterate (a sketch follows)
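
A single-process sketch of the iteration in the same map/shuffle/reduce style; a real job would read the centers from the shared file rather than a local variable:

    import math
    from collections import defaultdict

    def map_fn(point, centers):
        # find the nearest center and emit <center id, point>
        nearest = min(range(len(centers)),
                      key=lambda i: math.dist(point, centers[i]))
        yield nearest, point

    def reduce_fn(points):
        # average the points assigned to this cluster -> new center
        dim = len(points[0])
        return [sum(p[d] for p in points) / len(points) for d in range(dim)]

    points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8)]
    centers = [(0.0, 0.0), (5.0, 5.0)]

    for _ in range(10):  # iterate (fixed count standing in for a convergence test)
        groups = defaultdict(list)
        for p in points:
            for cid, pt in map_fn(p, centers):
                groups[cid].append(pt)
        centers = [reduce_fn(pts) for _, pts in sorted(groups.items())]

    print(centers)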

Clustering
► Spectral clustering
  - the similarity matrix S is huge: 10^6 points (stored as doubles) need 8 TB
  - sparsify it!
    - retain only S_ij where j is among the t nearest neighbors of i
  - locality-sensitive hashing?
    - it is an approximation; instead, we can compute the neighbors exactly, in parallel

Clustering
► Spectral clustering: computing the distance matrix
  - map
    - emits <group id, point> so that every n/p points share the same key
    - p is the number of nodes in the computer cluster
  - reduce
    - collects points with the same key, so the data is split into p parts, one stored on each node
  - then, for each point in the whole data set, each node finds the t nearest neighbors among its local part
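
A sketch of the partitioning and the per-node neighbor search; p, t, and the consecutive-index grouping are illustrative choices, not taken from the paper:

    import math
    from collections import defaultdict

    p, t = 2, 2  # number of partitions / neighbors to keep per node
    points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.8), (9.0, 9.0), (9.1, 8.9)]

    # map: give every n/p consecutive points the same key
    groups = defaultdict(list)
    for idx, pt in enumerate(points):
        groups[idx // (len(points) // p)].append((idx, pt))

    # on each node (= one group): for every point in the whole data set,
    # keep the t nearest neighbors among the locally stored points
    sparse_rows = {}
    for node, local in groups.items():
        for i, pi in enumerate(points):
            dists = sorted((math.dist(pi, pj), j) for j, pj in local if j != i)
            for d, j in dists[:t]:
                sparse_rows[(i, j)] = d  # a further merge would keep the global t

    print(sparse_rows)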

Clustering
► Spectral clustering: symmetry
  - x_j being in the t-nearest-neighbor set of x_i does not imply x_i is in the t-nearest-neighbor set of x_j
  - map
    - for each nonzero element, generates two <key, value> pairs
    - first: key is the row ID; value is the column ID and the distance
    - second: key is the column ID; value is the row ID and the distance
  - reduce
    - uses the key as a row ID and fills the columns specified by the column IDs in the values
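
A sketch of this symmetrization pass, continuing the same toy sparse representation as above:

    from collections import defaultdict

    # sparse entries: (row, col) -> distance, possibly asymmetric
    entries = {(0, 1): 0.22, (1, 0): 0.22, (2, 3): 0.28}  # (2, 3) has no mirror

    # map: emit each nonzero element under both its row ID and its column ID
    groups = defaultdict(list)
    for (i, j), d in entries.items():
        groups[i].append((j, d))  # key = row ID
        groups[j].append((i, d))  # key = column ID

    # reduce: key is the row ID; fill the columns named in the values
    symmetric = {(i, j): d for i, vals in groups.items() for j, d in vals}
    print(symmetric)  # now contains both (2, 3) and (3, 2)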

Classification
► SVM

Classification
► SVM: SMO (sequential minimal optimization)
  - instead of solving for all alphas together, use coordinate ascent:
    - pick one alpha_i, fix the others
    - optimize over alpha_i

Classification
► SVM: SMO
  - but for the SVM we cannot optimize a single alpha alone, because the constraint sum_i alpha_i y_i = 0 ties the alphas together
  - we must optimize two alphas in each iteration

Classification
► SVM: parallel SMO (sketch below)
  - repeat until convergence:
    - map: given the two chosen alphas, update the optimality information on each data partition
    - reduce: find the two maximally violating alphas for the next iteration
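
A structural sketch of one such round with a linear kernel; the variable names and partitioning are illustrative, and the serial alpha update itself (from the SMO paper [12]) is left out:

    import numpy as np

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
    y = np.array([-1.0, -1.0, 1.0, 1.0])
    alpha = np.zeros(len(y))
    C = 1.0

    def map_fn(part):
        # each partition computes its gradient entries
        # f_i = sum_j alpha_j y_j K(x_i, x_j) - y_i
        K = X[part] @ X.T  # linear-kernel rows for this partition
        return part, K @ (alpha * y) - y[part]

    def reduce_fn(results):
        # scan all partitions for the maximally violating pair
        f = np.empty(len(y))
        for part, vals in results:
            f[part] = vals
        up = (alpha < C) & (y > 0) | (alpha > 0) & (y < 0)   # I_up index set
        low = (alpha < C) & (y < 0) | (alpha > 0) & (y > 0)  # I_low index set
        i = np.where(up)[0][np.argmin(f[up])]
        j = np.where(low)[0][np.argmax(f[low])]
        return i, j

    parts = [np.array([0, 1]), np.array([2, 3])]
    i, j = reduce_fn([map_fn(p) for p in parts])
    print("next working pair:", i, j)
    # the serial SMO step then updates alpha[i], alpha[j] and the loop repeats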

Feature Extraction and Indexing
► Bag-of-features: extract features, cluster them, build histograms
  - feature extraction
    - map: takes images in and outputs features directly (no reduce needed)
  - feature clustering
    - clustering algorithms, such as the k-means job above

Feature Extraction and Indexing
► Bag-of-features: feature quantization into histograms
  - map
    - for each feature of an image, find the nearest feature cluster
    - generates <image id, cluster id>
  - reduce
    - for each feature cluster, update that image's histogram
    - generates <image id, histogram>
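
A sketch of the quantization job; the pair contents above are reconstructed from context, and the cluster centers are assumed to be broadcast to every mapper:

    import math
    from collections import defaultdict, Counter

    clusters = [(0.0, 0.0), (1.0, 1.0)]            # visual "codebook"
    features = {"img1": [(0.1, 0.0), (0.9, 1.1)],  # image id -> local features
                "img2": [(1.0, 0.9)]}

    # map: emit <image id, nearest cluster id> for every feature
    groups = defaultdict(list)
    for img, feats in features.items():
        for f in feats:
            cid = min(range(len(clusters)), key=lambda i: math.dist(f, clusters[i]))
            groups[img].append(cid)

    # reduce: per image, count cluster hits -> bag-of-features histogram
    histograms = {img: Counter(cids) for img, cids in groups.items()}
    print(histograms)  # {'img1': Counter({0: 1, 1: 1}), 'img2': Counter({1: 1})}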

Feature Extraction and Indexing
► Text inverted indexing
  - the inverted index of a term is a list of the documents containing it
  - each item in the document list stores statistical information: frequency, positions, field information
  - map
    - for each term in one document, generates <term, document id>
  - reduce
    - for each document in a term's list, updates the statistical information for that term
    - generates <term, document list>
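
A sketch under those reconstructed pair contents; positions are tracked so the postings carry the statistics the slide mentions:

    from collections import defaultdict

    docs = {"d1": "map reduce map", "d2": "reduce cluster"}

    # map: emit <term, (doc id, position)> for every term occurrence
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split()):
            groups[term].append((doc_id, pos))

    # reduce: build each term's posting list with frequency and positions
    index = {}
    for term, postings in groups.items():
        per_doc = defaultdict(list)
        for doc_id, pos in postings:
            per_doc[doc_id].append(pos)
        index[term] = {d: {"freq": len(p), "positions": p} for d, p in per_doc.items()}

    print(index["map"])  # {'d1': {'freq': 2, 'positions': [0, 2]}}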

Machine Learning Algorithm Transformation
► How can we know whether an algorithm can be transformed into a Map-Reduce fashion? And if so, how?
► Statistical queries and the summation form [1]
  - everything we want to estimate or infer (cluster ids, labels, ...) comes from sufficient statistics (distances between points, point positions, ...)
  - and the statistic computation can be divided across the data

Machine Learning Algorithm Transformation
► Linear regression in summation form (sketch below)
  - the solution is theta = A^(-1) b, where A = sum_i x_i x_i^T and b = sum_i x_i y_i
  - map: each node computes the partial sums of x_i x_i^T and x_i y_i over its share of the data
  - reduce: add the partial sums together and solve for theta
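
A minimal sketch of that job with NumPy, two partitions standing in for the mappers:

    import numpy as np

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])  # with bias column
    Y = np.array([3.1, 4.9, 7.2, 8.8])                              # roughly y = 1 + 2x

    def map_fn(idx):
        # partial sufficient statistics over this node's share of the data
        Xs, Ys = X[idx], Y[idx]
        return Xs.T @ Xs, Xs.T @ Ys

    # reduce: add the partial sums, then solve A theta = b
    parts = [map_fn(np.array([0, 1])), map_fn(np.array([2, 3]))]
    A = sum(a for a, _ in parts)
    b = sum(v for _, v in parts)
    theta = np.linalg.solve(A, b)
    print(theta)  # approximately [1.0, 2.0]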

Machine Learning Algorithm Transformation
► Naïve Bayes in summation form
  - the sufficient statistics are occurrence counts of each (feature value, label) pair and of each label
  - map: count the pairs within each data subset
  - reduce: add the counts, then normalize them into P(x_j | y) and P(y)
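
A counting sketch in the same style; smoothing and the final normalization into full probability tables are left out for brevity:

    from collections import Counter

    data = [({"color": "red"}, "apple"), ({"color": "red"}, "apple"),
            ({"color": "yellow"}, "banana")]

    def map_fn(examples):
        # count (feature, value, label) triples and labels in this subset
        pair_counts, label_counts = Counter(), Counter()
        for features, label in examples:
            label_counts[label] += 1
            for feat, val in features.items():
                pair_counts[(feat, val, label)] += 1
        return pair_counts, label_counts

    # reduce: add the counters from every subset
    pairs, labels = Counter(), Counter()
    for pc, lc in [map_fn(data[:2]), map_fn(data[2:])]:
        pairs += pc
        labels += lc

    # P(color=red | apple) from the aggregated counts
    print(pairs[("color", "red", "apple")] / labels["apple"])  # 1.0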

Machine Learning Algorithm Transformation
► Solution
  - find the statistics-calculation part of the algorithm
  - distribute the calculation over the data using map
  - gather and combine all the statistics in reduce

Map-Reduce Systems Drawbacks
► Batch-based system
  - a "pull" model: reduce must wait for unfinished maps, then "pulls" data from them
  - no direct support for iteration
► Focuses too much on distributed systems and failure tolerance
  - a local computing cluster may not need them

Map-Reduce Variants
► Map-Reduce Online [11]
  - a "push" model: map "pushes" data to reduce as it is produced
  - reduce can also "push" results to the maps of the next job, building a pipeline
► Iterative Map-Reduce, e.g. Twister [10]
  - higher-level schedulers that schedule the whole iteration process

Map-Reduce Variants
► Series Map-Reduce?
[Diagram: several multi-core Map-Reduce instances composed under a higher-level layer; should that layer be Map-Reduce, MPI, or Condor?]

Conclusions
► A good parallelization framework
  - schedules jobs automatically
  - failure tolerance
  - distributed computing support
  - high-level abstraction: easy to port algorithms onto it
► But too "industry"
  - why do we need a large distributed system?
  - why do we need so much data safety?

References
[1] Map-Reduce for Machine Learning on Multicore
[2] A Map Reduce Framework for Programming Graphics Processors
[3] Mapreduce Distributed Computing for Machine Learning
[4] Evaluating MapReduce for Multi-core and Multiprocessor Systems
[5] Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System
[6] Phoenix++: Modular MapReduce for Shared-Memory Systems
[7] Web-scale Computer Vision Using MapReduce for Multimedia Data Mining
[8] MapReduce Indexing Strategies: Studying Scalability and Efficiency
[9] Batch Text Similarity Search with MapReduce
[10] Twister: A Runtime for Iterative MapReduce
[11] MapReduce Online
[12] Fast Training of Support Vector Machines Using Sequential Minimal Optimization
[13] Social Content Matching in MapReduce
[14] Large-scale Multimedia Semantic Concept Modeling Using Robust Subspace Bagging and MapReduce
[15] Parallel Spectral Clustering in Distributed Systems

Thanks!
Q & A