May 30, 2016 | Department of Computer Sciences, UT Austin

Slide 1: Using Bloom Filters to Refine Web Search Results
Navendu Jain, Mike Dahlin (University of Texas at Austin)
Renu Tewari (IBM Almaden Research Center)

Slide 2: Motivation
- Google, Yahoo, MSN: significant fraction of near-duplicates in top search results
- Google "emacs manual" query: 7 of 20 results redundant
  - 3 identical pairs
  - 4 similar to one document
- Similar results for Yahoo, MSN, and A9 search engines
  chapter/emacs_toc.html
- 29.2% of data common across 150 million pages (Fetterly'03, Broder'97)

Slide 3: Problem Statement
- Goal: filter near-duplicates in web search results
- Given the search results for a query, identify pages that are either
  - highly similar in content (and link structure), or
  - contained in another page (inclusions with small changes)
- Key constraints
  - Low space overhead: use only a small amount of information per document
  - Low time overhead (latency unnoticeable to the end user): perform fast comparison and matching of documents

Slide 4: Our Contributions
- A novel similarity detection technique using content-defined chunking and Bloom filters to refine web search results
- Satisfies key requirements
  - Compact representation: incurs only about 0.4% extra bytes per document
  - Quick matching: 66 ms for top-80 search results; document similarity computed using a bit-wise AND of the documents' feature representations
  - Easy deployment: attached as a filter over any search engine's result set

Slide 5: Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions

Slide 6: System Overview
- Applying similarity detection to search engines
- Crawl time: the web crawler
  - Step 1: fetches a page and indexes it
  - Step 2: computes and stores per-page features
- Search time: the search engine (or the end user's browser)
  - Step 1: retrieves the top results' metadata for a given query
  - Step 2: runs similarity testing to filter highly similar results
[Figure: documents -> features -> similarity testing (> similarity threshold). Feature set: small space, low complexity. Fast approximate comparison.]
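
The search-time filtering step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the greedy keep-or-drop strategy, the function names `filter_near_duplicates` and `overlap`, the use of plain Python sets as stand-in features, and the 0.7 threshold are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def filter_near_duplicates(
    features: Sequence[T],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> List[int]:
    """Return indices of results to keep, scanning in rank order.

    A result is dropped when its similarity to any already-kept result
    exceeds the threshold (assumed greedy policy).
    """
    kept: List[int] = []
    for i, feat in enumerate(features):
        if all(similarity(feat, features[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


def overlap(a: frozenset, b: frozenset) -> float:
    """Fraction of a's features that also appear in b (toy similarity)."""
    return len(a & b) / len(a) if a else 0.0


# Hypothetical per-page feature sets, in search-rank order.
results = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({7, 8, 9})]
kept = filter_near_duplicates(results, overlap, threshold=0.7)
print(kept)  # [0, 2]
```

The second result shares 3 of its 4 features with the first (0.75 > 0.7), so only the first and third results survive. Any per-page feature representation can be plugged in through the `similarity` argument.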

Slide 7: Feature Extraction and Similarity Testing (1)
Content-defined chunking
- Divide a file into variable-sized blocks (called chunks)
- Use a Rabin fingerprint to compute block boundaries
- Use the SHA-1 hash of each chunk as its feature representation
[Figure: an original document and a modified document with data inserted, each split into chunks at markers. Pipeline: documents -> feature set (small space, low complexity) -> fast approximate comparison (> similarity threshold).]
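
A rough sketch of content-defined chunking along the lines of this slide. It substitutes a simple polynomial rolling hash for a true Rabin fingerprint, and the window size, boundary mask, and function names (`chunk_boundaries`, `chunk_features`) are assumptions chosen for illustration rather than values from the talk.

```python
import hashlib
from typing import List

WINDOW = 16           # sliding-window size in bytes (assumed)
MASK = (1 << 6) - 1   # boundary when the low 6 hash bits are zero (assumed)
BASE = 257            # polynomial base for the rolling hash (assumed)
MOD = (1 << 31) - 1   # modulus for the rolling hash (assumed)


def chunk_boundaries(data: bytes) -> List[int]:
    """Return chunk end offsets chosen from the content of a sliding window."""
    if not data:
        return []
    boundaries: List[int] = []
    top = pow(BASE, WINDOW - 1, MOD)  # weight of the oldest byte in the window
    h = 0
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just left the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        if i + 1 >= WINDOW and (h & MASK) == 0:
            boundaries.append(i + 1)
    if not boundaries or boundaries[-1] != len(data):
        boundaries.append(len(data))  # the last chunk always ends at EOF
    return boundaries


def chunk_features(data: bytes) -> List[str]:
    """SHA-1 hex digest of every chunk, in document order."""
    feats, start = [], 0
    for end in chunk_boundaries(data):
        feats.append(hashlib.sha1(data[start:end]).hexdigest())
        start = end
    return feats


# Synthetic document and a copy with data inserted at the front.
original = bytes((i * 131 + (i >> 3) * 17) % 251 for i in range(2000))
modified = b"inserted data " + original
shared = set(chunk_features(original)) & set(chunk_features(modified))
print(len(chunk_features(original)), "chunks,", len(shared), "shared after the insertion")
```

Because each boundary depends only on the bytes inside the window, an insertion disturbs the chunk it lands in while later chunks keep the same boundaries and therefore the same SHA-1 features.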

Slide 8: Feature Extraction and Similarity Testing (2)
Bloom filter generation
- A Bloom filter is an approximate set representation
  - An array of m bits (initially 0)
  - k independent hash functions
  - Supports Insert(x, S) and Member(y, S)
[Figure: document -> content-defined chunking (markers) -> SHA -> Bloom filter generation via Insert(x, S), Insert(y, S). Pipeline: documents -> feature set (small space, low complexity) -> fast approximate comparison (> similarity threshold).]
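
A minimal Bloom filter sketch matching the description on this slide (m bits, k hash functions, Insert and Member). Deriving the k hash functions from SHA-1 with a one-byte salt is an assumption made here for brevity, as are the class and method names; the talk does not specify how the hash functions are built.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Approximate set: m bits, k hash functions, no false negatives."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = 0  # bit array stored as a Python int, initially all 0

    def _positions(self, item: bytes) -> Iterator[int]:
        # Assumed construction: salt SHA-1 with the hash-function index.
        for i in range(self.k):
            digest = hashlib.sha1(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def member(self, item: bytes) -> bool:
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)
for chunk_hash in (b"chunk-hash-1", b"chunk-hash-2"):  # hypothetical features
    bf.insert(chunk_hash)
print(bf.member(b"chunk-hash-1"))  # True
```

In the setting of the talk, the inserted items would be the SHA-1 chunk hashes of one document, so each document ends up with one fixed-size bit vector.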

Slide 9: Feature Extraction and Similarity Testing (3)
Similarity testing
- Bit-wise AND of Bloom filters A and B (A /\ B)
- Similarity: % of A's set bits matched
[Figure: document -> content-defined chunking (markers) -> SHA -> Bloom filter generation -> similarity testing. Pipeline: documents -> feature set (small space, low complexity) -> fast approximate comparison (> similarity threshold).]
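
The matching step can be written in a few lines. This sketch assumes both Bloom filters have the same size and hash functions and are stored as Python integers; the function name `similarity` and the example bit patterns are illustrative only.

```python
def similarity(a_bits: int, b_bits: int) -> float:
    """Fraction of A's set bits that are also set in B (via bit-wise AND)."""
    set_in_a = bin(a_bits).count("1")
    if set_in_a == 0:
        return 0.0
    return bin(a_bits & b_bits).count("1") / set_in_a


a = 0b1111  # document A: four set bits
b = 0b0011  # document B: two set bits, both also set in A
print(similarity(a, b))  # 0.5
print(similarity(b, a))  # 1.0
```

The measure is asymmetric: all of B's bits are matched by A, which is the pattern one would expect when B's content is contained in A, while only half of A's bits are matched by B.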

Slide 10: Proof-of-concept examples: differentiating between multiple similar documents
- IBM site dataset
  - 20 MB (590 documents)
  - /investor/corpgoverance/index.phtml compared with all pages
  - Similar pages (same base URL): cgcoi.phtml (53%), cgblaws.phtml (69%)
- CVS repository dataset
  - Technical doc. file (17 KB)
  - Extracted 20 consecutive versions from the CVS
    - foo: original document
    - foo.1: first modified version
    - foo.19: last version

Slide 11: Talk Outline
- Motivation
- Our approach
- System Overview
- Bloom Filters for Similarity Testing
- Experimental Evaluation
- Related Work and Conclusions

Slide 12: Evaluation (1): Effect of degree of similarity
- "emacs manual" query on Google
  - 493 results retrieved using GoogleAPI
- Fraction of duplicates: 88% (50% similarity), 31% (90% similarity)
- Larger aliasing of higher-ranked documents
  - Initial result set repeated more frequently in later results
- Similar results observed for other queries
[Figure: percentage of duplicate documents vs. number of top search results retrieved, with curves labeled by similarity threshold, including 50%, 60%, 80%, and 90% similar.]

Slide 13: Evaluation (2): Effect of Search Query Popularity
- Queries: Google Zeitgeist (Nov. 2004)
- High correlation between occurrence of near-duplicates and search query popularity
  - The number of near-duplicates increases with query popularity
  - Most-popular: 24%-44%; Medium-popular: 16%-28%; Random: 18%
[Figure: three panels, (a) Most-popular, (b) Medium-popular, (c) Random, each plotting percentage of duplicate documents vs. number of top search results retrieved. Queries shown: "republican national convention", "national hurricane center", "indian larry", "jon stewart crossfire", "electoral college", "day of the dead", "Olympics 2004 doping", "hawking black hole bet", "x prize spaceship".]

Slide 14: Evaluation (3): Analyzing Response Times
- Top-80 search results for the "emacs manual" query
- Offline computation time (pre-computed and stored)
  - CDC chunks: 80 * 0.3 ms
  - Bloom filter generation: 80 * 14 ms
- Online matching time
  - Bit-wise AND of two Bloom filters: 4 µs
  - Matching and clustering time: 66 ms
- Total (offline + online): 1210 ms
- Online time: 66 ms
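
The total on this slide can be reproduced from the per-document figures. The short calculation below uses only the numbers given above; the variable names are illustrative, and times are held in microseconds so the arithmetic stays in integers.

```python
N_RESULTS = 80            # top-80 search results
CDC_US_PER_DOC = 300      # 0.3 ms of content-defined chunking per document
BLOOM_US_PER_DOC = 14000  # 14 ms of Bloom filter generation per document
ONLINE_US = 66000         # 66 ms of matching and clustering

offline_us = N_RESULTS * (CDC_US_PER_DOC + BLOOM_US_PER_DOC)
total_us = offline_us + ONLINE_US

print(offline_us // 1000, "ms offline")  # 1144 ms offline
print(total_us // 1000, "ms total")      # 1210 ms total
```

The offline part (24 ms of chunking plus 1120 ms of Bloom filter generation) dominates the total, which is why pre-computing it at crawl time leaves only the 66 ms online cost at query time.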

Slide 15: Selected Related Work
- Most prior work based on shingling (many variants)
- Basic idea (Broder'97)
  - Divide document into k-shingles: all sequences of k consecutive words/tokens
  - Represent document by its shingle set
  - Large shingle-set intersection => near-duplicate documents
  - Reduce similarity detection problem to set intersection
- Differences with our technique
  - Document similarity based on feature-set intersection
  - Higher feature-set computational overhead
  - Feature-set size dependent on sampling (Min s, Mod m, etc.)
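
For comparison, the basic shingling idea can be sketched as follows. This covers only the unsampled form (every k-shingle kept, similarity measured as intersection over union of the shingle sets); whitespace tokenization and the function names `shingles` and `resemblance` are assumptions, and the sampling variants mentioned on the slide (Min s, Mod m) are not shown.

```python
from typing import Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, k: int) -> Set[Shingle]:
    """All distinct runs of k consecutive whitespace-separated tokens."""
    tokens = text.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def resemblance(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Size of the shingle-set intersection relative to the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


print(len(shingles("a rose is a rose is a rose", k=4)))  # 3

d1 = shingles("the quick brown fox jumps", k=2)
d2 = shingles("the quick brown fox sleeps", k=2)
print(resemblance(d1, d2))  # 0.6
```

The two example sentences share three of their five distinct 2-shingles. Unlike a fixed-size bit vector, the shingle set grows with the document unless it is sampled, which is the overhead the slide points to.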

Slide 16: Conclusions
- Problem: highly similar matches in search results
  - Popular search engines (Google, Yahoo, MSN): significant fraction of near-duplicates in top results
  - Adversely affects query search performance
- Our solution: a similarity detection technique using CDC and Bloom filters
  - Incurs small metadata overhead: 0.4% bytes per document
  - Performs fast similarity detection: bit-wise AND operations; order of ms
  - Easily deployed as a filter over any search engine's results

Slide 17: For more information: