Clustering and Load Balancing Optimization for Redundant Content Removal

Presentation transcript:

Clustering and Load Balancing Optimization for Redundant Content Removal Shanzhong Zhu (Ask.com) Alexandra Potapova, Maha Alabduljalil (Univ. of California at Santa Barbara) Xin Liu (Amazon.com) Tao Yang (Univ. of California at Santa Barbara)

Redundant Content Removal in Search Engines
Over 1/3 of Web pages crawled are near duplicates. When should near duplicates be removed?
- Offline removal
- Online removal with query-based duplicate removal
[Diagram: offline data processing applies duplicate filtering to crawled Web pages before building the online index; at query time, a user query goes through online index matching and result ranking, then duplicate removal, to produce the final results.]

Tradeoff of online vs. offline removal
- Impact on offline processing: the online-dominating approach has high precision but low recall and removes fewer duplicates; the offline-dominating approach has high precision and high recall, removes most duplicates, and carries a higher offline burden.
- Impact on online serving: the online-dominating approach places more burden on online deduplication; the offline-dominating approach places less.
- Impact on overall cost: the online-dominating approach has a higher serving cost; the offline-dominating approach has a lower serving cost.

Challenges and issues in offline duplicate handling
- Achieve high recall with high precision
  - All-to-all duplicate comparison for complex/deep pairwise analysis is expensive, so it requires parallelism management and elimination of unnecessary computation
- Maintain duplicate groups instead of duplicate pairs (a group-management sketch follows this list)
  - Reduces the storage requirement
  - Aids winner selection for duplicate removal
  - Continuous group update is expensive, which motivates approximation and error handling
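As a minimal sketch of the group-maintenance idea above, the following C++ snippet folds pairwise duplicate relations into groups with union-find and applies a simple score-based winner rule; the class, the winner rule, and all identifiers are illustrative assumptions rather than the production design.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative sketch (not the production code): fold pairwise duplicate
// relations into duplicate groups with union-find, then pick one "winner"
// page per group so that the remaining "losers" can be marked for removal.
class DuplicateGroups {
 public:
  // Record that pages a and b were judged near-duplicates.
  void AddDuplicatePair(uint64_t a, uint64_t b) {
    uint64_t ra = Find(a), rb = Find(b);
    if (ra != rb) parent_[ra] = rb;
  }

  // Materialize the groups, keyed by each group's root representative.
  std::unordered_map<uint64_t, std::vector<uint64_t>> Groups() {
    std::unordered_map<uint64_t, std::vector<uint64_t>> groups;
    for (const auto& kv : parent_) groups[Find(kv.first)].push_back(kv.first);
    return groups;
  }

  // Winner selection (assumed rule): keep the page with the highest score,
  // e.g. a quality or popularity score supplied by the caller.
  static uint64_t PickWinner(const std::vector<uint64_t>& group,
                             const std::unordered_map<uint64_t, double>& score) {
    uint64_t best = group.front();  // group is assumed non-empty
    for (uint64_t page : group)
      if (score.at(page) > score.at(best)) best = page;
    return best;
  }

 private:
  uint64_t Find(uint64_t x) {
    auto it = parent_.find(x);
    if (it == parent_.end()) { parent_.emplace(x, x); return x; }
    if (it->second == x) return x;
    uint64_t root = Find(it->second);
    parent_[x] = root;  // path compression
    return root;
  }

  std::unordered_map<uint64_t, uint64_t> parent_;
};
```

Keeping only a parent pointer per page, plus one winner per group, is what makes groups cheaper to store than the full set of duplicate pairs.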

Optimization for faster offline duplicate handling
- Incremental duplicate clustering and group management (see the sketch below)
  - Approximated transitive relationship
  - Lazy update
- Avoid unnecessary computation while balancing computation among machines
  - Multi-dimensional partitioning of pages
  - Faster many-to-all duplicate comparisons
[Diagram: pages distributed across multiple page partitions.]
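A rough sketch of how these two pieces could fit together, assuming hypothetical types and helpers (PageSignature, SignatureSimilarity, and PartitionForLengthVector are stand-ins, not the platform's API): an updated page is first tested against existing group signatures and, only when no group matches, routed to a length-based partition for many-to-all comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical types standing in for the platform's real data structures.
struct PageSignature { std::vector<uint32_t> minhash; };
struct Page { uint64_t id; PageSignature sig; std::vector<int> length_vec; };

// Fraction of matching minhash positions: a stand-in group-signature test.
double SignatureSimilarity(const PageSignature& a, const PageSignature& b) {
  std::size_t n = std::min(a.minhash.size(), b.minhash.size());
  std::size_t same = 0;
  for (std::size_t i = 0; i < n; ++i) same += (a.minhash[i] == b.minhash[i]);
  return n ? static_cast<double>(same) / n : 0.0;
}

// Crude length-based routing (total length bucketed into fixed-width bins);
// the multi-dimensional mapping is sketched with the later slides.
int PartitionForLengthVector(const std::vector<int>& len, int num_partitions) {
  int total = std::accumulate(len.begin(), len.end(), 0);
  return (total / 100) % num_partitions;
}

// Incremental flow: an updated page is first matched against existing
// duplicate-group signatures (lazy group update); only unmatched pages are
// routed to a length-based partition for a many-to-all comparison pass.
void ProcessUpdatedPage(const Page& page,
                        const std::vector<PageSignature>& group_signatures,
                        double threshold, int num_partitions,
                        std::vector<std::vector<uint64_t>>* partition_queues) {
  for (const PageSignature& g : group_signatures) {
    if (SignatureSimilarity(page.sig, g) >= threshold) {
      // Still similar to an existing group: keep the group unchanged for now
      // and defer re-validation of its members (lazy update).
      return;
    }
  }
  int p = PartitionForLengthVector(page.length_vec, num_partitions);
  (*partition_queues)[p].push_back(page.id);  // compared many-to-all later
}
```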

Two-tier Architecture for Incremental Duplicate Detection

Approximation in Incremental Duplicate Group Management
Example of incremental group merging and splitting.
- Approximation
  - A group is left unchanged when updated pages are still similar to the group signatures
  - Group splitting does not re-validate all pairwise relations
- Error of the transitive relation after a content update
  - If A is a near duplicate of B and B is a near duplicate of C, A is assumed to be a near duplicate of C; this may no longer hold once the content of B is updated
- Error prevention during duplicate filtering (see the sketch below)
  - Double-check the similarity threshold between a winner and a loser
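A small sketch of this double-check, with an assumed token-set page representation and Jaccard similarity standing in for the real pairwise measure: a loser is dropped only if its similarity to the group's winner still clears the threshold.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical page representation used only for this sketch: a page id plus
// a sorted, de-duplicated vector of token hashes.
struct PageContent {
  uint64_t id;
  std::vector<uint32_t> tokens;
};

// Jaccard similarity over the token sets; a stand-in for whatever deeper
// pairwise measure the production pipeline applies.
double Similarity(const PageContent& a, const PageContent& b) {
  std::vector<uint32_t> common;
  std::set_intersection(a.tokens.begin(), a.tokens.end(),
                        b.tokens.begin(), b.tokens.end(),
                        std::back_inserter(common));
  const double uni =
      static_cast<double>(a.tokens.size() + b.tokens.size() - common.size());
  return uni > 0 ? common.size() / uni : 1.0;
}

// Error prevention during duplicate filtering: because group membership rests
// on an approximated transitive relation, a loser page is removed only after
// its similarity to the group's winner is re-checked against the threshold.
bool SafeToRemove(const PageContent& winner, const PageContent& loser,
                  double threshold) {
  return Similarity(winner, loser) >= threshold;
}
```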

Multi-dimensional page partitioning
- Objective
  - Each page is mapped to one unique partition
  - Dissimilar pages are mapped to different partitions
  - Reduce unnecessary cross-partition comparisons
- Partitioning based on document length
  - Outperforms signature-based mapping with higher recall rates
- Multi-dimensional mapping
  - Mitigates the load imbalance caused by a skewed length distribution

Multi-dimensional page partitioning: length-space example
The term dictionary is split into sub-dictionaries, so a page's length becomes a vector. In the 1D length space, page A has length A = (600); with two sub-dictionaries, the same page maps to A = (280, 320) in the 2D length space.
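The sketch below illustrates the mechanics of such a 2D length mapping with two sub-dictionaries; the fixed interval width and the simple cell-to-machine hash are assumptions chosen for illustration, not the exact scheme on the slides.

```cpp
#include <array>

// Two-dimensional length-space mapping with two sub-dictionaries: a page's
// length vector (e.g. (280, 320) for a page of total length 600) is placed in
// the grid cell containing it, and cells are assigned to machines. The
// interval width and the cell-to-machine hash are assumptions for this sketch.
struct GridCell { int i, j; };

GridCell CellForLengthVector(const std::array<int, 2>& len, int interval_width) {
  return GridCell{len[0] / interval_width, len[1] / interval_width};
}

int MachineForCell(const GridCell& cell, int num_machines) {
  // Spreading cells over machines; splitting a skewed 1D length range into a
  // 2D grid yields more, smaller cells and thus a more balanced assignment.
  return (cell.i * 31 + cell.j) % num_machines;
}
```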

When does page A compare with page B?
Page length vectors are A = (A1, A2) and B = (B1, B2). Page A needs to be compared with B only if the length vectors fall within intervals determined by τ, the similarity threshold, and ρ, a fixed interval-enlarging factor.
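The slide's exact inequality is not captured in this transcript, so the check below is an assumption for illustration only: a Jaccard-style per-dimension length filter (tau * A_d <= B_d <= A_d / tau) with both bounds relaxed by the interval-enlarging factor rho, not necessarily the authors' condition.

```cpp
#include <array>

// Illustration only: the slide's exact inequality is not in this transcript.
// This check applies a Jaccard-style length filter, tau*A_d <= B_d <= A_d/tau,
// in each sub-dictionary dimension, with both bounds relaxed by the
// interval-enlarging factor rho; treat the formula itself as an assumption.
bool MayNeedComparison(const std::array<double, 2>& a,
                       const std::array<double, 2>& b,
                       double tau, double rho) {
  for (int d = 0; d < 2; ++d) {
    const double lower = tau * a[d] / rho;    // relaxed lower bound
    const double upper = (a[d] / tau) * rho;  // relaxed upper bound
    if (b[d] < lower || b[d] > upper) return false;  // cannot reach threshold
  }
  return true;  // length-compatible in every dimension: compare A with B
}
```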

Implementation and Evaluations
Implemented in the Ask.com offline platform in C++ for processing billions of documents.
- Impact on relevancy
  - Top query results are monitored continuously
  - The error rate of false removal is tiny
- Impact on cost: two approaches are compared
  - A: Online dominating. Offline removes only 5% of duplicates first; most duplicates are hosted on online tier-2 machines
  - B: Offline dominating

Cost Saving with the Offline-Dominating Approach
With a fixed QPS target and a two-tier online index for 3-8 billion URLs, the offline-dominating approach yields an 8%-26% cost saving:
- Fewer tier-2 machines, since fewer duplicates are hosted
- Online tier-1 machines can answer more queries
- Online messages contain fewer duplicates

Reduction of unnecessary inter-machine communication & comparison
Up to 87% savings when using up to 64 machines.

Effectiveness of 3D mapping
[Charts: load balance factor with up to 64 machines; speedup of processing throughput.]

Benefits of incremental computation
The ratio of non-incremental duplicate detection time to incremental detection time on a 100-million-page dataset shows up to a 24-fold speedup. During a crawling update, 30% of updated pages have signatures similar to existing group signatures.

Accuracy of distributed clustering and duplicate group management
[Charts: relative error in precision and relative error in recall, compared to a single-machine configuration.]

Concluding remarks
- A budget-conscious solution with offline-dominating redundant content removal yields up to a 26% cost saving.
- The approximated incremental scheme for duplicate clustering with error handling achieves up to a 24-fold speedup; undetected duplicates are handled online.
- 3D mapping reduces unnecessary comparisons (up to 87%) while balancing load (a 3+ fold improvement).