Load Balancing for Partition-based Similarity Search Date : 2014/09/01 Author : Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang Source : SIGIR’14 Advisor.

Slides:



Advertisements
Similar presentations
Sumblr: Continuous Summarization of Evolving Tweet Streams
Advertisements

Diversity Maximization Under Matroid Constraints Date : 2013/11/06 Source : KDD’13 Authors : Zeinab Abbassi, Vahab S. Mirrokni, Mayur Thakur Advisor :
Linking Named Entity in Tweets with Knowledge Base via User Interest Modeling Date : 2014/01/22 Author : Wei Shen, Jianyong Wang, Ping Luo, Min Wang Source.
Clustering and Load Balancing Optimization for Redundant Content Removal Shanzhong Zhu (Ask.com) Alexandra Potapova, Maha Alabduljalil (Univ. of California.
Date : 2013/09/17 Source : SIGIR’13 Authors : Zhu, Xingwei
Sequence Clustering and Labeling for Unsupervised Query Intent Discovery Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: WSDM’12 Date: 1 November,
DOMAIN DEPENDENT QUERY REFORMULATION FOR WEB SEARCH Date : 2013/06/17 Author : Van Dang, Giridhar Kumaran, Adam Troy Source : CIKM’12 Advisor : Dr. Jia-Ling.
Toward Whole-Session Relevance: Exploring Intrinsic Diversity in Web Search Date: 2014/5/20 Author: Karthik Raman, Paul N. Bennett, Kevyn Collins-Thompson.
Advanced Topics in Algorithms and Data Structures Lecture 7.1, page 1 An overview of lecture 7 An optimal parallel algorithm for the 2D convex hull problem,
Mining Query Subtopics from Search Log Data Date : 2012/12/06 Resource : SIGIR’12 Advisor : Dr. Jia-Ling Koh Speaker : I-Chih Chiu.
Person Name Disambiguation by Bootstrapping Presenter: Lijie Zhang Advisor: Weining Zhang.
Advanced Topics in Algorithms and Data Structures Page 1 Parallel merging through partitioning The partitioning strategy consists of: Breaking up the given.
Euripides G.M. PetrakisIR'2001 Oulu, Sept Indexing Images with Multiple Regions Euripides G.M. Petrakis Dept.
Domain decomposition in parallel computing Ashok Srinivasan Florida State University COT 5410 – Spring 2004.
On Sparsity and Drift for Effective Real- time Filtering in Microblogs Date : 2014/05/13 Source : CIKM’13 Advisor : Prof. Jia-Ling, Koh Speaker : Yi-Hsuan.
Tag Clouds Revisited Date : 2011/12/12 Source : CIKM’11 Speaker : I- Chih Chiu Advisor : Dr. Koh. Jia-ling 1.
 Focused Clustering and Outlier Detection in Large Attributed Graphs Author : Bryan Perozzi, Leman Akoglu, Patricia lglesias Sánchez, Emmanuel Müller.
Leveraging Conceptual Lexicon : Query Disambiguation using Proximity Information for Patent Retrieval Date : 2013/10/30 Author : Parvaz Mahdabi, Shima.
Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology Advisor : Dr. Hsu Presenter : Chien-Shing Chen Author: Tie-Yan.
Load Balancing for Partition-based Similarity Search Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang Department of Computer Science University of California.
Ex-MATE: Data-Intensive Computing with Large Reduction Objects and Its Application to Graph Mining Wei Jiang and Gagan Agrawal.
Graph Partitioning and Clustering E={w ij } Set of weighted edges indicating pair-wise similarity between points Similarity Graph.
ROBUST RESOURCE ALLOCATION OF DAGS IN A HETEROGENEOUS MULTI-CORE SYSTEM Luis Diego Briceño, Jay Smith, H. J. Siegel, Anthony A. Maciejewski, Paul Maxwell,
CIKM’09 Date:2010/8/24 Advisor: Dr. Koh, Jia-Ling Speaker: Lin, Yi-Jhen 1.
Load Balancing for Partition-based Similarity Search Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang Department of Computer Science University of California.
Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California.
Date : 2014/01/14 Author : Thanh-Son Nguyen, Hady W. Lauw, Panayiotis Tsaparas Source : CIKM’13 Advisor : Jia-ling Koh Speaker : Shao-Chun Peng.
Binxing Jiao et. al (SIGIR ’10) Presenter : Lin, Yi-Jhen Advisor: Dr. Koh. Jia-ling Date: 2011/4/25 VISUAL SUMMARIZATION OF WEB PAGES.
Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.
Date : 2012/10/25 Author : Yosi Mass, Yehoshua Sagiv Source : WSDM’12 Speaker : Er-Gang Liu Advisor : Dr. Jia-ling Koh 1.
Computer Science and Engineering Parallelizing Defect Detection and Categorization Using FREERIDE Leonid Glimcher P. 1 ipdps’05 Scaling and Parallelizing.
Templated Search over Relational Databases Date: 2015/01/15 Author: Anastasios Zouzias, Michail Vlachos, Vagelis Hristidis Source: ACM CIKM’14 Advisor:
Date : 2013/03/18 Author : Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, Grant Weddell Source : CIKM’12 Speaker : Er-Gang Liu Advisor : Prof. Jia-Ling.
ROCK: A Robust Clustering Algorithm for Categorical Attributes Authors: Sudipto Guha, Rajeev Rastogi, Kyuseok Shim Data Engineering, Proceedings.,
Record Linkage in a Distributed Environment
LOGO Identifying Opinion Leaders in the Blogosphere Xiaodan Song, Yun Chi, Koji Hino, Belle L. Tseng CIKM 2007 Advisor : Dr. Koh Jia-Ling Speaker : Tu.
Date: 2014/05/27 Author: Xiangnan Kong, Bokai Cao, Philip S. Yu Source: KDD’13 Advisor: Jia-ling Koh Speaker: Sheng-Chih Chu Multi-Label Classification.
Euripides G.M. PetrakisIR'2001 Oulu, Sept Indexing Images with Multiple Regions Euripides G.M. Petrakis Dept. of Electronic.
Date: 2012/08/21 Source: Zhong Zeng, Zhifeng Bao, Tok Wang Ling, Mong Li Lee (KEYS’12) Speaker: Er-Gang Liu Advisor: Dr. Jia-ling Koh 1.
Domain decomposition in parallel computing Ashok Srinivasan Florida State University.
1 Adaptive Parallelism for Web Search Myeongjae Jeon Rice University In collaboration with Yuxiong He (MSR), Sameh Elnikety (MSR), Alan L. Cox (Rice),
LINDEN : Linking Named Entities with Knowledge Base via Semantic Knowledge Date : 2013/03/25 Resource : WWW 2012 Advisor : Dr. Jia-Ling Koh Speaker : Wei.
Compact Query Term Selection Using Topically Related Text Date : 2013/10/09 Source : SIGIR’13 Authors : K. Tamsin Maxwell, W. Bruce Croft Advisor : Dr.Jia-ling,
Date: 2013/4/1 Author: Jaime I. Lopez-Veyna, Victor J. Sosa-Sosa, Ivan Lopez-Arevalo Source: KEYS’12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang KESOSD.
CONTEXTUAL SEARCH AND NAME DISAMBIGUATION IN USING GRAPHS EINAT MINKOV, WILLIAM W. COHEN, ANDREW Y. NG SIGIR’06 Date: 2008/7/17 Advisor: Dr. Koh,
Topical Clustering of Search Results Date : 2012/11/8 Resource : WSDM’12 Advisor : Dr. Jia-Ling Koh Speaker : Wei Chang 1.
TO Each His Own: Personalized Content Selection Based on Text Comprehensibility Date: 2013/01/24 Author: Chenhao Tan, Evgeniy Gabrilovich, Bo Pang Source:
Parallel tree search: An algorithmic approach for multi- field packet classification Authors: Derek Pao and Cutson Liu. Publisher: Computer communications.
{ Adaptive Relevance Feedback in Information Retrieval Yuanhua Lv and ChengXiang Zhai (CIKM ‘09) Date: 2010/10/12 Advisor: Dr. Koh, Jia-Ling Speaker: Lin,
 Effective Multi-Label Active Learning for Text Classification Bishan yang, Juan-Tao Sun, Tengjiao Wang, Zheng Chen KDD’ 09 Supervisor: Koh Jia-Ling Presenter:
Finding similar items by leveraging social tag clouds Speaker: Po-Hsien Shih Advisor: Jia-Ling Koh Source: SAC 2012’ Date: October 4, 2012.
CiteData: A New Multi-Faceted Dataset for Evaluating Personalized Search Performance CIKM’10 Advisor : Jia-Ling, Koh Speaker : Po-Hsien, Shih.
1 Munther Abualkibash University of Bridgeport, CT.
ClusCite:Effective Citation Recommendation by Information Network-Based Clustering Date: 2014/10/16 Author: Xiang Ren, Jialu Liu,Xiao Yu, Urvashi Khandelwal,
QUERY-PERFORMANCE PREDICTION: SETTING THE EXPECTATIONS STRAIGHT Date : 2014/08/18 Author : Fiana Raiber, Oren Kurland Source : SIGIR’14 Advisor : Jia-ling.
Date : 2016/08/09 Advisor : Jia-ling Koh Speaker : Yi-Yui Lee
Detecting and Classifying Duplicate Tweets
Optimizing Parallel Algorithms for All Pairs Similarity Search
Distributed Network Traffic Feature Extraction for a Real-time IDS
Parallel Programming By J. H. Wang May 2, 2017.
Parallel Density-based Hybrid Clustering
Accelerating MapReduce on a Coupled CPU-GPU Architecture
Sameh Shohdy, Yu Su, and Gagan Agrawal
Carlos Ordonez, Predrag T. Tosic
Intent-Aware Semantic Query Annotation
Enriching Taxonomies With Functional Domain Knowledge
Date: 2012/11/15 Author: Jin Young Kim, Kevyn Collins-Thompson,
Dynamic improved pixel value ordering reversible data hiding
Connecting the Dots Between News Article
Preference Based Evaluation Measures for Novelty and Diversity
Presentation transcript:

Load Balancing for Partition-based Similarity Search Date : 2014/09/01 Author : Xun Tang, Maha Alabduljalil, Xin Jin, Tao Yang Source : SIGIR’14 Advisor : Jia-ling Koh Speaker : Sheng-Chih Chu

Outline  Introduction  All Pair Similarity Search  Partition-based Similarity Search  Load Assignment Problem  Load Banlancing Optimization  Two-stage algorithm  Data Partition Optimization  Experiments  Conclusion 2

Introduction 3  All Pair Similarity Search(APSS),which identifies similar objects in a dataset.  The complexity of nave APSS can be quadratic to the dataset size. Document s (n item) 1.Compare with pair(n*n) 2.If Sim(di,dj) >= threshold Output pair

Partition-based Similarity Search 4  Divides the dataset into a set of partitions.  Assigns a partition to each task and each task compares this partition with other potentially similar partitions. Document s vector (n item)

Partition With Dissimilarity Detection 5

6

Outline  Introduction  All Pair Similarity Search  Partition-based Similarity Search  Load Assignment Problem  Load Banlancing Optimization  Two-stage algorithm  Data Partition Optimization  Experiments  Conclusion 7

Load Banlancing Optimization 8

Stage1 - Initial Load Assignment 9 G1G2G3G4G5G6G7 P _ P28______ P337 ____ P __ P P61816_____ P784 66___

Stage-2 Assugnment Refinement 10 Step : 1.select P4 (max cost) 2.P4’s incoming neighbors is P1 and P5(selected min cost) 3.Reverse dirction of selected edge. 4.Check P5’s cost(67.1) > P4’s original cost(81.6) true : reject its edge false : continue step1

Data Partition Optimization 11  Dissimilar Detection with Holder’s Inequality.  Step : 1.Sort all vector based on 1-norm and divide its into l layers 2. Subdivide each layer.ex :L 1,1,L 2,1,L 2,2,……L i,j.

12 Ex: d 1,d 2,d 3,d 4 ……d 9 and L1:{d 1,d 2,d 3 } L2:{d 4,d 5,d 6 } L3:{d 7,d 8,d 9 }

Outline  Introduction  All Pair Similarity Search  Partition-based Similarity Search  Load Assignment Problem  Load Banlancing Optimization  Two-stage algorithm  Data Partition Optimization  Experiments  Conclusion 13

DataSet 14  Twitter dataset : 100 million tweets (20 million real data and 80 synyhetic data), feature 18.5 per tweet  ClubWeb : 40 million web pages (ramdomly selected 40M) feature is 320 per web pages.  Yahoo!Music : songs,feature 404.5

Scalability and Comparisons 15  Speedup = sequential time / parallel time  Efficiency = speedup / the number of core

Effectiveness of Two-Stage Load Balance 16

Improved Data Partitioning 17

18

Outline  Introduction  All Pair Similarity Search  Partition-based Similarity Search  Load Assignment Problem  Load Banlancing Optimization  Two-stage algorithm  Data Partition Optimization  Experiments  Conclusion 19

Conclusion 20  The main contribution of this paper is a two-stage load balancing algorithm for eciently executing partition-based 200 similarity search in parallel.  Presents an improved and hierarchical static data partitioning method to detect dissimilarity and even out the partitions sizes.