Distributed Approximate Spectral Clustering for Large-Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY: BITA KAZEMI ZAHRANI




Similar presentations
Information Retrieval in Practice

Aggregating local image descriptors into compact codes
Random Forest Predrag Radenković 3237/10
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng.
Similarity Search in High Dimensions via Hashing
DIMENSIONALITY REDUCTION BY RANDOM PROJECTION AND LATENT SEMANTIC INDEXING Jessica Lin and Dimitrios Gunopulos Ângelo Cardoso IST/UTL December
Scaling Distributed Machine Learning with the BASED ON THE PAPER AND PRESENTATION: SCALING DISTRIBUTED MACHINE LEARNING WITH THE PARAMETER SERVER – GOOGLE,
1 Learning to Detect Objects in Images via a Sparse, Part-Based Representation S. Agarwal, A. Awan and D. Roth IEEE Transactions on Pattern Analysis and.
Sparse Solutions for Large Scale Kernel Machines Taher Dameh CMPT820-Multimedia Systems Dec 2 nd, 2010.
ACL, June Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard University of Maryland,
Reverse Hashing for Sketch Based Change Detection in High Speed Networks Ashish Gupta Elliot Parsons with Robert Schweller, Theory Group Advisor: Yan Chen.
Ivory : Ivory : Pairwise Document Similarity in Large Collection with MapReduce Tamer Elsayed, Jimmy Lin, and Doug Oard Laboratory for Computational Linguistics.
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman
Parallel K-Means Clustering Based on MapReduce The Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences Weizhong Zhao, Huifang.
FLANN Fast Library for Approximate Nearest Neighbors
The Chinese University of Hong Kong. Research on Private cloud : Eucalyptus Research on Hadoop MapReduce & HDFS.
Dimension Reduction and Visualization of Large High-Dimensional Data via Interpolation Seung-Hee Bae, Jong Youl Choi, Judy Qiu, and Geoffrey Fox School.
Large-Scale Content-Based Image Retrieval Project Presentation CMPT 880: Large Scale Multimedia Systems and Cloud Computing Under supervision of Dr. Mohamed.
Approximation algorithms for large-scale kernel methods Taher Dameh School of Computing Science Simon Fraser University March 29 th, 2010.
Building Efficient Time Series Similarity Search Operator Mijung Kim Summer Internship 2013 at HP Labs.
Applying Twister to Scientific Applications CloudCom 2010 Indianapolis, Indiana, USA Nov 30 – Dec 3, 2010.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Background: MapReduce and FREERIDE Co-clustering on FREERIDE Experimental.
Venkatram Ramanathan 1. Motivation Evolution of Multi-Core Machines and the challenges Summary of Contributions Background: MapReduce and FREERIDE Wavelet.
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
1 Time & Cost Sensitive Data-Intensive Computing on Hybrid Clouds Tekin Bicer David ChiuGagan Agrawal Department of Compute Science and Engineering The.
Pairwise Document Similarity in Large Collections with MapReduce Tamer Elsayed, Jimmy Lin, and Douglas W. Oard Association for Computational Linguistics,
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.
Intelligent Database Systems Lab 1 Advisor : Dr. Hsu Graduate : Jian-Lin Kuo Author : Silvia Nittel Kelvin T.Leung Amy Braverman 國立雲林科技大學 National Yunlin.
Apache Mahout. Mahout Introduction Machine Learning Clustering K-means Canopy Clustering Fuzzy K-Means Conclusion.
Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
A Novel Local Patch Framework for Fixing Supervised Learning Models Yilei Wang 1, Bingzheng Wei 2, Jun Yan 2, Yang Hu 2, Zhi-Hong Deng 1, Zheng Chen 2.
Job scheduling algorithm based on Berger model in cloud environment Advances in Engineering Software (2011) Baomin Xu,Chunyan Zhao,Enzhao Hua,Bin Hu 2013/1/251.
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
Privacy-preserving rule mining. Outline  A brief introduction to association rule mining  Privacy preserving rule mining Single party  Perturbation.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Advanced Analytics on Hadoop Spring 2014 WPI, Mohamed Eltabakh 1.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
LogTree: A Framework for Generating System Events from Raw Textual Logs Liang Tang and Tao Li School of Computing and Information Sciences Florida International.
Big data Usman Roshan CS 675. Big data Typically refers to datasets with very large number of instances (rows) as opposed to attributes (columns). Data.
Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
Fast Query-Optimized Kernel Machine Classification Via Incremental Approximate Nearest Support Vectors by Dennis DeCoste and Dominic Mazzoni International.
Improving Support Vector Machine through Parameter Optimized Rujiang Bai, Junhua Liao Shandong University of Technology Library Zibo , China { brj,
Database Management Systems, R. Ramakrishnan 1 Algorithms for clustering large datasets in arbitrary metric spaces.
Image Processing A Study in Pixel Averaging Building a Resolution Pyramid With Parallel Computing Denise Runnels and Farnaz Zand.
Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng,
Guided By Ms. Shikha Pachouly Assistant Professor Computer Engineering Department 2/29/2016.
June 12, 2016CITALA'121 Cloud Computing Technology For Large Scale and Efficient Arabic Handwriting Recognition System HAMDI Hassen, KHEMAKHEM Maher
Matrix Multiplication in Hadoop
Item Based Recommender System SUPERVISED BY: DR. MANISH KUMAR BAJPAI TARUN BHATIA ( ) VAIBHAV JAISWAL( )
Image taken from: slideshare
Big Data is a Big Deal!.
Sushant Ahuja, Cassio Cristovao, Sameep Mohta
Distributed Network Traffic Feature Extraction for a Real-time IDS
A modified hyperplane clustering algorithm allows for efficient and accurate clustering of extremely large datasets Ashok Sharma, Robert Podolsky, Jieping.
Parallel Density-based Hybrid Clustering
Cloud Data Anonymization Using Hadoop Map-Reduce Framework With Qos Evaluation and Behaviour analysis PROJECT GUIDE: Ms.S.Subbulakshmi TEAM MEMBERS: A.Mahalakshmi( ).
Author: Ahmed Eldawy, Mohamed F. Mokbel, Christopher Jonathan
Scalable Parallel Interoperable Data Analytics Library
Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights Feng Zhang †⋄, Jidong Zhai ⋄, Xipeng Shen #, Onur Mutlu ⋆, Wenguang.
HPML Conference, Lyon, Sept 2018
Fast and Exact K-Means Clustering
CSE 491/891 Lecture 25 (Mahout).
Minwise Hashing and Efficient Search
Topological Signatures For Fast Mobility Analysis
Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework Yi Wang, Linchuan Chen, Gagan Agrawal The.
Presentation transcript:

Distributed Approximate Spectral Clustering for Large-Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY: BITA KAZEMI ZAHRANI 1

Outline  Introduction  Terminologies  Related Work  Distributed Approximate Spectral Clustering (DASC)  Analysis  Evaluations  Conclusion  Summary  References 2

The idea…  Challenge: high time and space complexity in machine learning algorithms  Idea: propose a novel kernel-based machine learning algorithm to process large datasets  Challenge: kernel-based algorithms suffer from O(n²) scalability  Claim: reduction in computation and memory with insignificant accuracy loss, and control over the accuracy/reduction tradeoff  Developed a spectral clustering algorithm  Deployed on the Amazon Elastic MapReduce service 3

Terminologies  Kernel-based methods: a family of methods for pattern analysis  Based on kernel functions: analyzing the raw representation of data through a similarity function over pairs of points instead of explicit feature vectors  In other words: manipulating the data, or viewing it, in another dimension  Spectral clustering: clustering using the eigenvalues (characteristic values) of the similarity matrix 4

Introduction  Proliferation of data in all fields  Use of machine learning in science and engineering  The paper introduces a class of machine learning algorithms efficient for large datasets  Two main techniques:  Controlled approximation (tradeoff between accuracy and reduction)  Elastic distribution of computation (machine learning techniques harnessing the distributed flexibility offered by clouds)  THE PERFORMANCE OF KERNEL-BASED ALGORITHMS IMPROVES WHEN THE SIZE OF THE TRAINING DATASET INCREASES! 5

Introduction  Kernel-based algorithms -> O(n²) similarity matrix  Idea: design an approximation algorithm for constructing the kernel matrix for large datasets  The steps: I. Design an approximation algorithm for kernel-matrix computation, scalable to all ML algorithms II. Design spectral classification on top of it to show the effect on accuracy III. Implement with MapReduce; test on a lab cluster and on EMR IV. Experiment on both real and synthetic data 6

Distributed Approximate Spectral Clustering (DASC)  Overview  Details  MapReduce Implementation 7

Overview  Kernel-based algorithms: very popular in the past decade  Classification  Dimensionality reduction  Data clustering  Compute the kernel matrix, i.e., the kernelized distance between each pair of data points (see the kernel formula below)  1. Create compact signatures for all data points  2. Group the data with similar signatures (preprocessing)  3. Compute similarity values for the data in each group to construct the similarity matrix  4. Perform spectral clustering on the approximated similarity matrix (SC can be replaced by any other kernel-based approach) 8
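The pairwise similarity in step 3 is conventionally the Gaussian (RBF) kernel; the slides don't spell it out, so take this as the standard choice for spectral clustering, with the bandwidth σ a user-set parameter:

% Gaussian (RBF) similarity between points x_i and x_j (standard choice, assumed here)
A_{ij} = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right)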

Detail  The first step is preprocessing with Locality-Sensitive Hashing (LSH)  LSH -> a probabilistic dimension-reduction technique: hash the input so that similar items map to the same bucket, shrinking the size of the input universe The idea: if two points are within some threshold distance of each other, they collide (receive the same hash) with probability at least p1; if their distance exceeds the threshold by an approximation factor of c, they collide with probability at most p2 < p1 Points with high collision probability fall in the same bucket! 9 Courtesy: Wikipedia
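Stated precisely, this is the standard (r, cr, p1, p2)-sensitivity definition the slide paraphrases: a hash family H is (r, cr, p_1, p_2)-sensitive (with c > 1 and p_1 > p_2) if for all points x, y

d(x, y) \le r  \;\Rightarrow\; \Pr_{h \in H}\left[ h(x) = h(y) \right] \ge p_1
d(x, y) \ge cr \;\Rightarrow\; \Pr_{h \in H}\left[ h(x) = h(y) \right] \le p_2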

Details [slide 10 is a figure only, omitted] Courtesy: Wikipedia

Details  Given N data points  Random-projection hashing produces an M-bit signature for each data point  Choose an arbitrary dimension  Compare the feature value with a threshold  If the feature value in that dimension is larger than the threshold -> 1  Otherwise -> 0  M arbitrary dimensions, M bits  Total cost: O(MN) 11
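A minimal sketch of this signature step (my reconstruction from the slide, not the authors' code; drawing the M dimensions and thresholds at random from the data range is an assumption): one bit per comparison, O(MN) in total.

import numpy as np

def lsh_signatures(X, M, seed=0):
    # M-bit signatures: bit m is 1 iff a point's value in a randomly
    # chosen dimension exceeds a randomly chosen threshold
    rng = np.random.default_rng(seed)
    N, d = X.shape
    dims = rng.integers(0, d, size=M)                       # M arbitrary dimensions
    thresholds = rng.uniform(X.min(0)[dims], X.max(0)[dims])
    bits = X[:, dims] > thresholds                          # (N, M) booleans, O(MN)
    return bits @ (1 << np.arange(M))                       # pack bits -> integer bucket key

# points with equal signatures will share a bucket
X = np.random.rand(1000, 16)
sigs = lsh_signatures(X, M=8)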

Details [slides 12–13 are figures only, omitted]

MapReduce Implementation 14

[slide 15 is a figure only, omitted] Courtesy: reference [1]
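To make the job flow concrete, here is a toy, single-process mimic of the two stages (a sketch under my own assumptions, not the authors' Hadoop code; it omits the multi-level merging and borrows scikit-learn for the per-bucket SC):

from collections import defaultdict
import numpy as np
from sklearn.cluster import SpectralClustering

def dasc_map(X, M, seed=0):
    # map stage: compute an M-bit LSH signature per point, emit (signature, point)
    rng = np.random.default_rng(seed)
    dims = rng.integers(0, X.shape[1], size=M)
    thr = rng.uniform(X.min(0)[dims], X.max(0)[dims])
    sigs = (X[:, dims] > thr) @ (1 << np.arange(M))
    buckets = defaultdict(list)
    for s, x in zip(sigs, X):
        buckets[int(s)].append(x)
    return buckets

def dasc_reduce(bucket, k, sigma=1.0):
    # reduce stage: Gaussian similarity matrix + spectral clustering, per bucket
    X = np.asarray(bucket)
    if len(X) <= k:                        # bucket too small to split into k clusters
        return np.zeros(len(X), dtype=int)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    A = np.exp(-d2 / (2 * sigma ** 2))     # O(b^2) only within the bucket
    return SpectralClustering(n_clusters=k, affinity='precomputed').fit_predict(A)

buckets = dasc_map(np.random.rand(2000, 8), M=6)
labels = {s: dasc_reduce(pts, k=2) for s, pts in buckets.items()}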

Analysis and Complexity  Complexity Analysis  Accuracy Analysis 16

Complexity Analysis  Computation analysis  Memory analysis 17 Courtesy: reference [1]
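A back-of-the-envelope version of the saving (my arithmetic, assuming the N points spread roughly evenly over B buckets):

\underbrace{O(MN)}_{\text{signatures}} \;+\; B \cdot O\!\left( (N/B)^2 \right) \;=\; O(MN) + O\!\left( \frac{N^2}{B} \right)

i.e., the quadratic kernel-matrix cost shrinks by a factor of B, matching the "O(B) reduction" claimed in the conclusion.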

Accuracy analysis  Increasing the number of hash functions decreases the probability that adjacent points are clustered into the same bucket  So incorrect clustering increases  Increasing M decreases the collision probability  More hash functions: more parallelism  Use the value of M to partition  Tuning M controls the tradeoff (see the collision formula below) 18 Courtesy: reference [1]
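Why larger M hurts recall: assuming the M signature bits are drawn independently, the per-bit agreement probabilities multiply, so

\Pr\left[ \mathrm{sig}(x) = \mathrm{sig}(y) \right] = p(x, y)^{M}

which decays with M even for nearby pairs whose per-bit agreement p(x, y) is close to 1.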

Accuracy Analysis  Randomly selected documents  Ratio: correctly classified points to the total number of points  DASC is better than PSC and close to SC  NYST ran in MATLAB [no cluster]  Clustering didn't affect accuracy much 19 Courtesy: reference [1]

Experimental Evaluations  Platform  Datasets  Performance Metrics  Algorithms Implemented and Compared Against  Results for Clustering  Results for Time and Space Complexities  Elasticity and Scalability 20

Platform  Hadoop MapReduce framework  Local cluster and Amazon EMR  Local: 5 machines (1 master / 4 slaves)  Intel Core 2 Duo E6550  1 GB DRAM  Hadoop on Ubuntu, Java  EMR: variable configurations of 16, 32, and 64 machines  1.7 GB memory, 350 GB disk each  Hadoop on Debian Linux 21 Courtesy: reference [1]

Datasets  Synthetic  Real: Wikipedia 22 Courtesy: reference [1]

Job flow on EMR  Job flow: a collection of steps applied to a dataset  First, partition the dataset into buckets with LSH  Each bucket is stored in a file  SC is applied to each bucket  Partitions are processed based on the mappers available  Data-dependent hashing can be used instead of LSH  Locality is achieved through the first step, LSH  Highly scalable: if nodes are added or removed, the framework dynamically allocates more or fewer partitions to each node 23

Algorithms and Performance metrics  Computational and space complexity  Clustering accuracy  Ratio of correctly clustered points to total points  DBI (Davies-Bouldin Index) and ASE (average squared error)  F-norm for the similarity of the original and approximate kernel matrices (sketched below)  Implemented DASC, PSC, SC, and NYST  DASC: modified Mahout library; M = ⌊(log N)/2 − 1⌋ and P = M − 1 to get O(1) comparisons  SC: Mahout  PSC: C++ implementation of SC using the PARPACK library  NYST: implemented in MATLAB 24
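The quality metrics, sketched with standard tools (the exact formulas are my assumptions from the slide's one-line descriptions; davies_bouldin_score is scikit-learn's DBI):

import numpy as np
from sklearn.metrics import davies_bouldin_score   # DBI: lower is better

def fnorm_ratio(K_full, K_approx):
    # similarity of the approximate vs. original kernel (Gram) matrix
    return np.linalg.norm(K_approx, 'fro') / np.linalg.norm(K_full, 'fro')

def clustering_accuracy(pred, truth):
    # fraction of correctly clustered points (label alignment assumed)
    return np.mean(np.asarray(pred) == np.asarray(truth))

def avg_squared_error(X, labels):
    # ASE: mean squared distance from each point to its cluster centroid
    X, labels = np.asarray(X), np.asarray(labels)
    sse = sum(((X[labels == c] - X[labels == c].mean(0)) ** 2).sum()
              for c in np.unique(labels))
    return sse / len(X)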

Results for Clustering Accuracy  Similarity between the approximate and original Gram matrices  Measured using the low-level F-norm  4K to 512K data points  Not able to go beyond 512K: the comparison needs memory quadratic in the input size  Different numbers of buckets: 4 to 4K  The larger the number of buckets, the finer the partitioning  and the lower the accuracy  Increasing #buckets decreases the F-norm ratio (the matrices become less similar)  So #buckets can be increased only up to the point where the F-norm ratio drops 25 Courtesy: reference [1]

Results for Time and Space Complexities  DASC, PSC, and SC comparison:  PSC implemented in C++ with MPI  DASC implemented in Java with the MapReduce framework  SC implemented with the Mahout library  Order of magnitude: DASC runs one order of magnitude faster than PSC  Mahout SC didn't scale to more than … 26 Courtesy: reference [1]

Conclusion  Kernel-based machine learning  Approximation for computing the kernel matrix  LSH reduces the kernel computation  DASC: O(B) reduction in the size of the problem (B = number of buckets)  B depends on the number of bits in the signature  Tested on real and synthetic datasets -> Real: Wikipedia  90% accuracy 27

Critique  The approximation factor depends on the size of the problem  The algorithm extends only kernel-based machine learning methods, which are applicable to just a limited class of datasets  The algorithms compared are each implemented in a different framework; e.g., the MPI implementation might be quite inefficient, so the comparison is not entirely fair! 28

References  [1] Hefeeda, Mohamed, Fei Gao, and Wael Abd-Almageed. "Distributed approximate spectral clustering for large-scale datasets." Proceedings of the 21st International Symposium on High-Performance Parallel and Distributed Computing (HPDC). ACM, 2012.  [2] Wikipedia  [3] Tom M. Mitchell, lecture slides 29