Map-Reduce and Parallel Computing for Large-Scale Media Processing
Youjie Zhou
Outline ► Motivations ► Map-Reduce Framework ► Large-scale Multimedia Processing Parallelization ► Machine Learning Algorithm Transformation ► Map-Reduce Drawbacks and Variants ► Conclusions
Motivations ► Why do we need parallelization? “Time is money” ► process data simultaneously ► divide and conquer Data is too huge to handle ► 1 trillion (10^12) unique URLs on the web in 2008 ► single-CPU speed is limited
Motivations ► Why do we need parallelization? Data keeps increasing ► social networks ► we need scalability! “Brute force” ► no approximations ► cheap clusters vs. expensive computers
Motivations ► Why do we choose Map-Reduce? It is popular ► a parallelization framework proposed by Google and used by Google every day ► Yahoo and Amazon are also involved Is popular the same as good? ► it “hides” parallelization details from users ► it provides high-level operations that suit the majority of algorithms ► a good starting point for deeper research on parallelization
Map-Reduce Framework ► A simple idea inspired by functional languages (such as LISP) map ► a form of iteration in which a function is successively applied to each element of a sequence reduce ► a function that combines all the elements of a sequence using a binary operation
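For readers who have not met these two primitives, a minimal Python illustration (Python standing in for LISP here; the numbers are arbitrary):

from functools import reduce

numbers = [1, 2, 3, 4]

# map: successively apply a function to each element of a sequence
squares = list(map(lambda x: x * x, numbers))        # [1, 4, 9, 16]

# reduce: combine all elements of a sequence with a binary operation
total = reduce(lambda a, b: a + b, squares)          # 30

print(squares, total)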
Map-Reduce Framework ► Data representation map generates <key, value> pairs reduce combines the pairs that share the same key ► the “Hello, world!” example
Map-Reduce Framework ► [Dataflow diagram: the input data is divided into splits (split0, split1, split2), each split is processed by a map task, the map outputs are shuffled to reduce tasks, and the reduce tasks write the final output]
Map-Reduce Framework ► Count the appearances of each distinct word in a set of documents
void map(Document)
    for each word in Document
        generate <word, 1>
void reduce(word, CountList)
    int count = 0
    for each number in CountList
        count += number
    generate <word, count>
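The same word count as a small runnable Python sketch; the in-memory run_mapreduce driver only stands in for the framework’s shuffle-and-sort phase and is not part of Hadoop or Google’s implementation.

from collections import defaultdict

def map_fn(document):
    for word in document.split():
        yield word, 1                      # generate <word, 1>

def reduce_fn(word, counts):
    yield word, sum(counts)                # generate <word, count>

def run_mapreduce(documents, map_fn, reduce_fn):
    groups = defaultdict(list)
    for doc in documents:                  # map phase
        for key, value in map_fn(doc):
            groups[key].append(value)
    results = {}
    for key, values in groups.items():     # "shuffle" then reduce phase
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

docs = ["hello world", "hello map reduce"]
print(run_mapreduce(docs, map_fn, reduce_fn))   # {'hello': 2, 'world': 1, 'map': 1, 'reduce': 1}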
Map-Reduce Framework ► Different implementations Distributed computing ► each computer acts as a computing node ► focuses on reliability over distributed computer networks ► Google’s clusters: closed source; GFS, the Google File System ► Hadoop: open source; HDFS, the Hadoop Distributed File System
Map-Reduce Framework ► Different implementations Multi-core computing ► each core acts as a computing node ► focuses on high-speed computing with large shared memories ► Phoenix++: open source, created at Stanford; map and reduce read and write pairs through a two-dimensional table stored in memory ► GPU: about 10x higher memory bandwidth than a CPU; 5x to 32x speedups reported on SVM training
Large-scale Multimedia Processing Parallelization ► Clustering: k-means, spectral clustering ► Classifier training: SVM ► Feature extraction and indexing: Bag-of-Features, text inverted indexing
Clustering ► k-means Basic and fundamental Original algorithm: 1. pick k initial centers 2. iterate until convergence: 2a. assign each point to its nearest center 2b. recalculate the centers Easy to parallelize!
Clustering ► k-means A shared file contains the current centers map ► for each point, find the nearest center ► generate a pair with key: center id, value: the point’s coordinates reduce ► collect all points belonging to the same cluster (they share the same key) ► average them to obtain the new center Then iterate (see the sketch below)
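A minimal single-process sketch of one such iteration: the in-memory centers list plays the role of the shared center file, the grouping loop stands in for the shuffle phase, and all function names are mine rather than from any real system.

from collections import defaultdict

def kmeans_map(point, centers):
    # find the nearest center and emit <center id, point>
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = min(range(len(centers)), key=lambda c: dist2(point, centers[c]))
    yield nearest, point

def kmeans_reduce(center_id, points):
    # average all points assigned to this cluster -> new center
    dim = len(points[0])
    new_center = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    yield center_id, new_center

def kmeans_iteration(points, centers):
    groups = defaultdict(list)
    for p in points:                                   # map phase
        for cid, pt in kmeans_map(p, centers):
            groups[cid].append(pt)
    new_centers = list(centers)
    for cid, pts in groups.items():                    # reduce phase
        for out_cid, center in kmeans_reduce(cid, pts):
            new_centers[out_cid] = center
    return new_centers

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers = [(0.0, 0.0), (5.0, 5.0)]
print(kmeans_iteration(points, centers))               # [[0.05, 0.1], [5.05, 4.95]]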
Clustering ► Spectral Clustering The similarity matrix S is huge: for 10^6 points, storing S as doubles takes 8 TB Sparsify it! ► retain only S_ij where j is among the t nearest neighbors of i ► Locality-Sensitive Hashing? it is only an approximation ► we can compute the exact neighbors directly, in parallel
Clustering ► Spectral Clustering Calculate the distance matrix ► map creates <block id, point> pairs so that every n/p points share the same key (p is the number of nodes in the computer cluster) ► reduce collects the points with the same key, so the data is split into p parts and each part is stored on one node ► then, on each node, find t nearest neighbors for each point in the whole data set (see the sketch below)
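A toy sketch of just this partitioning-and-local-search step, loosely following the idea in reference [15]; the block-size arithmetic, the choice of returning neighbors per (block, point) pair, and the function names are my own assumptions.

def partition_map(index, point, n, p):
    # every ~n/p consecutive points share one key, so reduce puts them on one node
    block = index // ((n + p - 1) // p)
    yield block, point

def knn_reduce(block_id, block_points, all_points, t):
    # for each point in the whole data set, find its t nearest neighbors
    # among the points stored in this block
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    for i, x in enumerate(all_points):
        neighbors = sorted(block_points, key=lambda y: dist2(x, y))[:t]
        yield (block_id, i), neighbors

pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
blocks = {}
for i, pt in enumerate(pts):                      # map phase
    for key, value in partition_map(i, pt, n=len(pts), p=2):
        blocks.setdefault(key, []).append(value)
for block_id, block_pts in blocks.items():        # reduce phase, one call per node
    print(list(knn_reduce(block_id, block_pts, pts, t=1)))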
Clustering ► Spectral Clustering Symmetry ► x_j being in the t-nearest-neighbor set of x_i does not imply that x_i is in the t-nearest-neighbor set of x_j ► map: for each nonzero element, generate two pairs first: the key is the row ID; the value is the column ID and the distance second: the key is the column ID; the value is the row ID and the distance ► reduce: use the key as the row ID and fill in the columns specified by the column IDs in the values (see the sketch below)
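A minimal sketch of the symmetrization step, with the same toy grouping loop standing in for the shuffle phase; function names are assumed.

from collections import defaultdict

def symmetrize_map(row_id, col_id, distance):
    # emit the nonzero entry under both orientations
    yield row_id, (col_id, distance)
    yield col_id, (row_id, distance)

def symmetrize_reduce(row_id, entries):
    # key is the row ID; fill the columns named in the values
    row = {}
    for col_id, distance in entries:
        row[col_id] = distance
    yield row_id, row

nonzeros = [(0, 1, 0.5), (1, 2, 0.3)]             # (row, col, distance)
groups = defaultdict(list)
for r, c, d in nonzeros:                          # map phase
    for key, value in symmetrize_map(r, c, d):
        groups[key].append(value)
for key, values in groups.items():                # reduce phase
    print(list(symmetrize_reduce(key, values)))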
Classification ► SVM
Classification ► SVM SMO: instead of solving for all the alphas together, use coordinate ascent ► pick one alpha and fix the others ► optimize over alpha_i
Classification ► SVM SMO But we cannot optimize only one alpha for the SVM: the dual constraint sum_i alpha_i y_i = 0 fixes any single alpha once all the others are fixed We need to optimize two alphas in each iteration
Classification ► SVM Repeat until convergence: ► map: given the two chosen alphas, update the optimization information over its data split ► reduce: find the two maximally violating alphas for the next iteration (see the sketch below)
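A rough sketch of the selection step only, combining the maximally-violating-pair idea of SMO (reference [12]) with a map over data splits; the up/low violation scores are treated as already computed for each alpha, so they appear here as placeholder numbers rather than the real KKT expressions, and every name is an assumption.

def violation_map(split_id, alphas_with_scores):
    # alphas_with_scores: list of (alpha index, up_score, low_score) for one split
    best_up = max(alphas_with_scores, key=lambda a: a[1])
    best_low = min(alphas_with_scores, key=lambda a: a[2])
    yield "working_set", (best_up, best_low)

def violation_reduce(_, candidates):
    # combine the per-split candidates into one global working pair
    best_up = max((c[0] for c in candidates), key=lambda a: a[1])
    best_low = min((c[1] for c in candidates), key=lambda a: a[2])
    yield best_up[0], best_low[0]

split_scores = {
    0: [(0, 0.9, 0.1), (1, 0.2, -0.5)],
    1: [(2, 1.3, 0.4), (3, -0.1, -0.9)],
}
candidates = []
for sid, scores in split_scores.items():          # map phase
    for _, pair in violation_map(sid, scores):
        candidates.append(pair)
print(list(violation_reduce("working_set", candidates)))   # [(2, 3)]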
Feature Extraction and Indexing ► Bag-of-Features Pipeline: features → feature clusters → histograms Feature extraction ► map takes images as input and outputs their features directly Feature clustering ► clustering algorithms, such as k-means
Feature Extraction and Indexing ► Bag-of-Features Feature quantization and histogram ► map: for each feature of one image, find the nearest feature cluster and generate <image id, cluster id> ► reduce: for each feature cluster assigned to that image, update the corresponding histogram bin and generate <image id, histogram> (see the sketch below)
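A minimal sketch under that reading (keying by image id is my assumption; the exact pairs on the original slide were lost):

from collections import Counter, defaultdict

def quantize_map(image_id, features, clusters):
    # for each feature, find the nearest cluster and emit <image id, cluster id>
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    for f in features:
        nearest = min(range(len(clusters)), key=lambda c: dist2(f, clusters[c]))
        yield image_id, nearest

def histogram_reduce(image_id, cluster_ids):
    # count how many features fell into each cluster -> the BoF histogram
    yield image_id, Counter(cluster_ids)

clusters = [(0.0, 0.0), (10.0, 10.0)]
images = {"img1": [(0.1, 0.2), (9.8, 10.1), (0.0, 0.3)]}
groups = defaultdict(list)
for img, feats in images.items():                 # map phase
    for key, value in quantize_map(img, feats, clusters):
        groups[key].append(value)
for key, values in groups.items():                # reduce phase
    print(list(histogram_reduce(key, values)))    # [('img1', Counter({0: 2, 1: 1}))]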
Feature Extraction and Indexing ► Text Inverted Indexing Inverted index of a term ► a list of the documents containing the term ► each item in the document list stores statistical information: frequency, positions, field information map ► for each term in one document, generate a <term, per-document statistics> pair reduce ► for each document in a term’s list, update the statistical information for that term ► generate <term, document list> (see the sketch below)
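A toy sketch that tracks only frequency and positions; field information is omitted and the posting format is an assumption.

from collections import defaultdict

def index_map(doc_id, text):
    positions = defaultdict(list)
    for pos, term in enumerate(text.split()):
        positions[term].append(pos)
    for term, pos_list in positions.items():
        # emit <term, (doc id, frequency, positions)>
        yield term, (doc_id, len(pos_list), pos_list)

def index_reduce(term, postings):
    # the document list of the term, one entry per document
    yield term, sorted(postings)

docs = {"d1": "map reduce map", "d2": "reduce everything"}
groups = defaultdict(list)
for doc_id, text in docs.items():                 # map phase
    for term, posting in index_map(doc_id, text):
        groups[term].append(posting)
for term, postings in groups.items():             # reduce phase
    print(list(index_reduce(term, postings)))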
Machine Learning Algorithm Transformation ► How can we know whether an algorithm can be transformed into a Map-Reduce fashion, and if so, how do we do it? ► Statistical Query model and Summation Form All we want is to estimate or infer something ► cluster ids, labels… from sufficient statistics ► distances between points ► point positions The statistics computation can be divided across the data
Machine Learning Algorithm Transformation ► Linear Regression in Summation Form theta = (sum_i x_i x_i^T)^(-1) (sum_i x_i y_i) ► map: each mapper computes the partial sums of x_i x_i^T and x_i y_i over its own data split ► reduce: sum the partial results and solve for theta (see the sketch below)
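A minimal numpy sketch of this summation-form recipe, following the presentation in reference [1]; the split boundaries and function names are mine.

import numpy as np

def lr_map(X_split, y_split):
    # partial sufficient statistics for one data split
    A_part = X_split.T @ X_split          # sum of x_i x_i^T over the split
    b_part = X_split.T @ y_split          # sum of x_i y_i over the split
    return A_part, b_part

def lr_reduce(partials):
    # sum the partial statistics and solve the normal equations
    A = sum(p[0] for p in partials)
    b = sum(p[1] for p in partials)
    return np.linalg.solve(A, b)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
splits = [(X[:50], y[:50]), (X[50:], y[50:])]
theta = lr_reduce([lr_map(Xs, ys) for Xs, ys in splits])
print(theta)                              # close to [1.0, -2.0, 0.5]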
Machine Learning Algorithm Transformation ► Naïve Bayes Training only needs counts ► map: over its data split, count the occurrences of each label and of each (feature value, label) pair ► reduce: sum the partial counts; the estimates of P(y) and P(x_j | y) follow by division (see the sketch below)
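A minimal sketch for discrete features, following the counting description above; Laplace smoothing and log probabilities are omitted, and all names are assumptions.

from collections import Counter

def nb_map(split):
    # split: list of (feature tuple, label); count labels and (feature, value, label) triples
    label_counts = Counter()
    feature_counts = Counter()
    for features, label in split:
        label_counts[label] += 1
        for j, value in enumerate(features):
            feature_counts[(j, value, label)] += 1
    return label_counts, feature_counts

def nb_reduce(partials):
    # sum the partial counts; P(y) and P(x_j | y) follow by division
    label_counts = Counter()
    feature_counts = Counter()
    for lc, fc in partials:
        label_counts.update(lc)
        feature_counts.update(fc)
    return label_counts, feature_counts

split1 = [((1, 0), "spam"), ((0, 0), "ham")]
split2 = [((1, 1), "spam")]
labels, features = nb_reduce([nb_map(split1), nb_map(split2)])
print(labels)                             # Counter({'spam': 2, 'ham': 1})
print(features)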
Machine Learning Algorithm Transformation ► Solution: find the part of the algorithm that computes the statistics, distribute that computation over the data using map, then gather and refine all the statistics in reduce
Map-Reduce Systems Drawbacks ► Batch-based system with a “pull” model ► reduce must wait for unfinished map tasks ► reduce “pulls” data from map ► no direct support for iteration ► Focuses too much on distributed systems and failure tolerance ► a local computing cluster may not need them
Map-Reduce Variants ► MapReduce Online: a “push” model ► map “pushes” data to reduce ► reduce can also “push” its results to the map of the next job, building a pipeline ► Iterative Map-Reduce: higher-level schedulers manage the whole iteration process
Map-Reduce Variants ► Series of Map-Reduce clusters? [Diagram: several multi-core Map-Reduce clusters connected by a higher-level layer; candidates for that connecting layer are Map-Reduce itself, MPI, or Condor]
Conclusions ► A good parallelization framework ► schedules jobs automatically ► failure tolerance ► distributed computing supported ► high-level abstraction: easy to port algorithms onto it ► But too “industry”-oriented ► why do we need such a large distributed system? ► why do we need that much data safety?
References
[1] Map-Reduce for Machine Learning on Multicore
[2] A Map Reduce Framework for Programming Graphics Processors
[3] MapReduce Distributed Computing for Machine Learning
[4] Evaluating MapReduce for Multi-core and Multiprocessor Systems
[5] Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System
[6] Phoenix++: Modular MapReduce for Shared-Memory Systems
[7] Web-Scale Computer Vision Using MapReduce for Multimedia Data Mining
[8] MapReduce Indexing Strategies: Studying Scalability and Efficiency
[9] Batch Text Similarity Search with MapReduce
[10] Twister: A Runtime for Iterative MapReduce
[11] MapReduce Online
[12] Fast Training of Support Vector Machines Using Sequential Minimal Optimization
[13] Social Content Matching in MapReduce
[14] Large-Scale Multimedia Semantic Concept Modeling Using Robust Subspace Bagging and MapReduce
[15] Parallel Spectral Clustering in Distributed Systems
Thanks Q & A