Do Not Crawl In The DUST: Different URLs Similar Text
Uri Schonfeld, Department of Electrical Engineering, Technion
Joint work with Dr. Ziv Bar-Yossef and Dr. Idit Keidar

Talk Outline
- Problem statement and motivation
- Related work
- Our contribution
- The DustBuster algorithm
- Experimental results
- Concluding remarks

Even the WWW Gets Dusty
DUST – Different URLs, Similar Text. Examples:
- Standard canonization: “…” → “…”
- Domain names and virtual hosts: “…” → “…”
- Aliases and symbolic links: “…” → “…”
- Parameters with little effect on content: Print=1
- URL transformations: “…” → “…”

DUST Rules!
- A DUST rule transforms one URL to another. Example: “index.html” → “”
- Valid DUST rule: r is a valid DUST rule w.r.t. site S if for every URL u ∈ S, r(u) is a valid URL, and r(u) and u have “similar” contents.
- Why similar and not identical? Comments, news, text ads, counters.

DUST is Bad
- Expensive to crawl: we access the same document via multiple URLs.
- Forces us to shingle: an expensive technique used to discover similar documents.
- Ranking algorithms suffer: references to a document are split among its aliases.
- Multiple identical results: the same document is returned several times in the search results.
- Any algorithm based on URLs suffers.

We Want To
Given: a list of URLs from a site S (a crawl log or a web server log)
Want: to find valid DUST rules w.r.t. S
- As many as possible, including site-specific ones
- While minimizing the number of fetches
Applications:
- Site-specific canonization
- More efficient crawling

How do we Fight DUST Today? (1) Standard Canonization
- Domain name aliases
- Standard extensions
- Default file names: index.html, default.htm
- File path canonizations: “dirname/../” → “”, “//” → “/”
- Escape sequences: “%7E” → “~”
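A minimal Python sketch of this kind of standard canonization (an illustration, not the paper's code; the default-name set and the exact rule choices are assumptions):

```python
# Illustrative standard canonization: lowercase host, decode escapes,
# normalize the path, and drop assumed default file names.
from urllib.parse import urlsplit, urlunsplit, unquote
import posixpath

DEFAULT_NAMES = {"index.html", "default.htm"}  # assumed defaults

def canonize(url: str) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower()        # domain names are case-insensitive
    path = unquote(parts.path)         # "%7E" -> "~"
    path = posixpath.normpath(path)    # "dirname/../" -> "", "//" -> "/"
    if path == ".":
        path = "/"
    base = path.rsplit("/", 1)[-1]
    if base in DEFAULT_NAMES:          # drop default file names
        path = path[: -len(base)]
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

print(canonize("HTTP://Example.com/a//b/../%7Euser/index.html"))
# -> http://example.com/a/~user/
```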

Standard Canonization is not Enough
Site-specific DUST:
- “story_” → “story?id=”
- “news.google.com” → “google.com/news”
- “labs” → “laboratories”
This DUST is harder to find.

How do we Fight DUST Today? (2) Shingles
- Shingles are document sketches [Broder, Glassman, Manasse 97], used to compare documents for similarity.
- Pr(shingles are equal) = document similarity, so we compare documents by comparing their shingles.
- Calculating a shingle: take all m-word sequences, hash each with h_i, and choose the minimum; that is your shingle.
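As an illustration, a minimal min-hash shingle sketch in Python; the m = 4 window, the number of hash functions, and the salted-MD5 hash family are assumptions, not the parameters of [Broder, Glassman, Manasse 97]:

```python
# Illustrative min-hash shingle sketch: for each hash h_i, keep the
# minimum hash over all m-word sequences of the document.
import hashlib

def shingle_sketch(text: str, m: int = 4, num_hashes: int = 8) -> tuple:
    words = text.split()
    grams = [" ".join(words[i:i + m]) for i in range(len(words) - m + 1)]
    sketch = []
    for salt in range(num_hashes):     # h_1 .. h_k as salted hashes
        sketch.append(min(
            hashlib.md5(f"{salt}:{g}".encode()).hexdigest() for g in grams
        ))
    return tuple(sketch)

# Two documents are judged similar by the fraction of agreeing positions.
a = shingle_sketch("the quick brown fox jumps over the lazy dog")
b = shingle_sketch("the quick brown fox jumps over a lazy dog")
print(sum(x == y for x, y in zip(a, b)) / len(a))
```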

Shingles are Not Perfect
- Shingles are expensive: they require a fetch, parsing, and hashing.
- Shingles do not find rules, and are therefore not applicable to new pages.

More Related Work
- Mirror detection [Bharat, Broder 99], [Bharat, Broder, Dean, Henzinger 00], [Cho, Shivakumar, Garcia-Molina 00], [Liang 01]
- Identifying plagiarized documents [Hoad, Zobel 03]
- Finding near-replicas [Shivakumar, Garcia-Molina 98], [Di Iorio, Diligenti, Gori, Maggini, Pucci 03]
- Copy detection [Brin, Davis, Garcia-Molina 95], [Garcia-Molina, Gravano, Shivakumar 96], [Shivakumar, Garcia-Molina 96]

Our Contributions
- An algorithm that finds site-specific valid DUST rules while requiring a minimal number of fetches
- Convincing results in experiments
- Benefits to crawling

Types of DUST
Alias DUST: simple substring substitutions
- “story_1259” → “story?id=1259”
- “news.google.com” → “google.com/news”
- “/index.html” → “”
Parameter DUST:
- Standard URL structure: protocol://domain.name/path/name?para=val&pa=va
- Some parameters do not affect content: they can be removed, or changed to a default value.
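A minimal sketch of alias DUST as a substring substitution (illustrative; the example URLs are not from the paper):

```python
# Illustrative alias rule application: substitute one occurrence of
# alpha with beta inside the URL string.
def apply_alias(url: str, alpha: str, beta: str) -> str:
    return url.replace(alpha, beta, 1)

print(apply_alias("http://site.com/story_1259", "story_", "story?id="))
# -> http://site.com/story?id=1259
print(apply_alias("http://news.google.com/", "news.google.com", "google.com/news"))
# -> http://google.com/news/
```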

Our Basic Framework
Input: URL list
1. Detect likely DUST rules
2. Eliminate redundant rules
3. Validate DUST rules using samples: eliminate DUST rules that are “wrong”, and further eliminate duplicate DUST rules
Steps 1 and 2 require no fetch.

How to detect likely DUST rules?
- Large support principle: likely DUST rules have lots of “evidence” supporting them.
- Small buckets principle: ignore evidence that supports many different rules.

Large Support Principle
- A pair of URLs (u, v) is an instance of rule r if r(u) = v.
- Support(r) = all instances (u, v) of r.
Large Support Principle: the support of a valid DUST rule is large.
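As an illustration, support can be counted directly from the URL list; apply_rule below stands in for r, and the example rule and URLs are assumptions:

```python
# Illustrative support counting: Support(r) is the set of pairs (u, v)
# from the log with r(u) = v and both URLs present in the log.
def support(apply_rule, urls):
    url_set = set(urls)
    instances = set()
    for u in urls:
        v = apply_rule(u)
        if v != u and v in url_set:
            instances.add((u, v))
    return instances

urls = ["http://x.com/story_2659", "http://x.com/story?id=2659"]
print(support(lambda u: u.replace("story_", "story?id="), urls))
# -> {("http://x.com/story_2659", "http://x.com/story?id=2659")}
```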

Rule Support: An Equivalent View
- α: a string. Ex: α = “story_”
- u: a URL that contains α as a substring. Ex: u = “…”
- Envelope of α in u: a pair of strings (p, s), where p is the prefix of u preceding α and s is the suffix of u succeeding α. Example: p = “…”, s = “2659”
- E(α): all envelopes of α in URLs that appear in the input URL list.
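A minimal sketch of computing E(α) from a URL list (the example URLs are illustrative):

```python
# Illustrative envelope extraction: for each occurrence of alpha in a
# URL, record the (prefix, suffix) pair around it.
def envelopes(alpha: str, urls: list) -> set:
    env = set()
    for u in urls:
        start = 0
        while True:
            i = u.find(alpha, start)
            if i < 0:
                break
            env.add((u[:i], u[i + len(alpha):]))   # (prefix p, suffix s)
            start = i + 1
    return env

urls = ["http://x.com/story_2659", "http://x.com/story?id=2659"]
print(envelopes("story_", urls))
# -> {("http://x.com/", "2659")}
```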

Envelopes Example

Rule Support: An Equivalent View
- α → β: an alias DUST rule. Ex: α = “story_”, β = “story?id=”
- Lemma: |Support(α → β)| = |E(α) ∩ E(β)|
- Proof: let bucket(p, s) = { α | (p, s) ∈ E(α) }. Observation: (u, v) is an instance of α → β if and only if u = p◦α◦s and v = p◦β◦s for some (p, s). Hence, (u, v) is an instance of α → β iff (p, s) ∈ E(α) ∩ E(β).

Large Buckets
Often there is a large set of substrings that are interchangeable within a given URL while not being DUST:
- page=1, page=2, …
- lecture-1.pdf, lecture-2.pdf
This gives rise to large buckets.

Small Buckets Principle
- Big buckets (popular prefix/suffix pairs) often do not contain similar content. (“I am a DUCK, not DUST.”)
- Big buckets are expensive to process.
Small Buckets Principle: most of the support of valid alias DUST rules is likely to belong to small buckets.

Algorithm – Detecting Likely DUST Rules (no fetch here!)
1. Scan the log and form buckets.
2. Ignore big buckets.
3. For each small bucket: for every two substrings α, β in the bucket, output the pair (α, β).
4. Sort by (α, β).
5. For every pair (α, β): count its occurrences; if count > threshold, output α → β.
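A minimal Python sketch of this detection step (illustrative; MAX_BUCKET, THRESHOLD, and the substring-length cap are assumed values, not the paper's):

```python
# Illustrative detection: substrings sharing an envelope (p, s) land in
# the same bucket; pairs seen in many small buckets become likely rules.
from collections import defaultdict
from itertools import combinations

MAX_BUCKET = 6   # small buckets principle: skip bigger buckets
THRESHOLD = 2    # large support principle: minimum evidence

def likely_rules(urls, max_sub_len=12):
    buckets = defaultdict(set)
    for u in urls:
        for i in range(len(u)):
            for j in range(i + 1, min(i + max_sub_len, len(u)) + 1):
                buckets[(u[:i], u[j:])].add(u[i:j])   # envelope -> substrings
    counts = defaultdict(int)
    for subs in buckets.values():
        if 2 <= len(subs) <= MAX_BUCKET:              # ignore big buckets
            for a, b in combinations(sorted(subs), 2):
                counts[(a, b)] += 1
    return [pair for pair, c in counts.items() if c >= THRESHOLD]

print(likely_rules([
    "http://x.com/story_1", "http://x.com/story?id=1",
    "http://x.com/story_2", "http://x.com/story?id=2",
]))
```

Note that the output also contains refined variants of the same rule (e.g., (“y_”, “y?id=”) alongside (“story_”, “story?id=”)); eliminating such redundancies is exactly the next step.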

Size and Comments
- Consider only instances of rules whose sizes “match”: use ranges of sizes.
- Running time: O(L log L).
- Process only short substrings.
- Tokenize URLs.

Our Basic Framework (recap)
Input: URL list
1. Detect likely DUST rules
2. Eliminate redundant rules
3. Validate DUST rules using samples: eliminate DUST rules that are “wrong”, and further eliminate duplicate DUST rules
Steps 1 and 2 require no fetch.

Eliminating Redundant Rules (no fetch here!)
- Example: “/vlsi/” → “/labs/vlsi/” and “/vlsi” → “/labs/vlsi”
- Rule φ refines rule ψ if Support(φ) ⊆ Support(ψ).
- Lemma: a substitution rule α′ → β′ refines rule α → β if and only if there exists an envelope (γ, δ) such that α′ = γ◦α◦δ and β′ = γ◦β◦δ.
- The lemma helps us identify refinements easily: if φ refines ψ and their supports match, remove ψ.
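The lemma turns refinement into a cheap syntactic test; a minimal sketch (illustrative):

```python
# Illustrative refinement test from the lemma: alpha' -> beta' refines
# alpha -> beta iff alpha' = g + alpha + d and beta' = g + beta + d
# for some envelope (g, d).
def refines(alpha_p: str, beta_p: str, alpha: str, beta: str) -> bool:
    for i in range(len(alpha_p) - len(alpha) + 1):
        if alpha_p[i:i + len(alpha)] == alpha:
            g, d = alpha_p[:i], alpha_p[i + len(alpha):]
            if beta_p == g + beta + d:
                return True
    return False

print(refines("/vlsi/", "/labs/vlsi/", "/vlsi", "/labs/vlsi"))  # True
```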

Validating Likely Rules
For each likely rule r, in both directions:
1. Find sample URLs from the list to which r is applicable.
2. For each URL u in the sample: v = r(u); fetch u and v; check whether content(u) is similar to content(v).
3. If the fraction of similar pairs > threshold, declare rule r valid.
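A minimal validation sketch (illustrative): applies, apply_rule, fetch, and similar are hypothetical stand-ins for rule applicability, rule application, an HTTP fetch, and a shingle comparison, and N and THRESHOLD are assumed values:

```python
# Illustrative per-rule validation: sample applicable URLs, fetch each
# pair, and accept the rule if enough pairs have similar content.
import random

THRESHOLD = 0.95   # required fraction of similar pairs
N = 100            # sample size

def validate(applies, apply_rule, urls, fetch, similar):
    candidates = [u for u in urls if applies(u)]
    sample = random.sample(candidates, min(N, len(candidates)))
    ok = 0
    for u in sample:
        v = apply_rule(u)              # v = r(u)
        pu, pv = fetch(u), fetch(v)    # fetch both pages (None = failure)
        if pu is not None and pv is not None and similar(pu, pv):
            ok += 1                    # contents judged "similar"
    return bool(sample) and ok / len(sample) > THRESHOLD
```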

Comments About Validation
- Assumption: if a rule passes the validation threshold on a sample of 100 URLs, it will also pass on any larger sample.
- Why isn't the threshold 100%? A 95%-valid rule may still be worth it, and dynamic pages change often.

Experimental Setup
- We experiment on logs of two web sites: a dynamic forum and an academic site.
- Rules were detected from a log of about 20,000 unique URLs.
- On each site we used four logs from different time periods.

Precision at k

Precision vs. Validation

Recall
How much of the DUST do we find? What other duplicates are there?
- Soft errors
- True copies: last semester's course pages, the same paper listed under each of its authors, frames, image galleries

DUST Distribution
In the crawl we examined, 18% of the crawl was reduced. Breakdown of the duplicates:
- DUST: 47.1%
- Images: 25.7%
- Soft errors: 7.6%
- Exact copies: 17.9%
- Misc: 1.8%

Conclusions
- DustBuster is an efficient algorithm that finds DUST rules.
- It can reduce a crawl and can benefit ranking algorithms.

THE END

Things to fix
- = => --> all rules with “”
- Fix drawing: URLs crossing alpha, not all p and all s

So far, rules have been non-directional. Prefer shrinking rules and lexicographically lowering rules, and check those directions first.

Parametric DUST
- A rule concerns a parameter name and its possible values.
- What rules: remove the parameter; substitute one value with another; substitute all values with a single value.
- These rules are validated the same way the alias rules are; we will not discuss them further.
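A minimal sketch of the parameter-rule flavors (illustrative; the parameter name Print and the example URL are assumptions):

```python
# Illustrative parameter rules: remove a parameter entirely, or force
# its value (one substitution covers both "substitute one value" and
# "substitute all values with a single value").
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def rewrite_param(url: str, name: str, value=None, drop=False) -> str:
    parts = urlsplit(url)
    q = []
    for k, v in parse_qsl(parts.query, keep_blank_values=True):
        if k == name:
            if drop:
                continue          # remove the parameter
            v = value             # substitute / set a single default value
        q.append((k, v))
    return urlunsplit(parts._replace(query=urlencode(q)))

u = "http://site.com/page?id=7&Print=1"
print(rewrite_param(u, "Print", drop=True))    # -> http://site.com/page?id=7
print(rewrite_param(u, "Print", value="0"))    # -> http://site.com/page?id=7&Print=0
```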

False Rules
Unfortunately we see a lot of “wrong” rules, e.g., substituting 1 with 2, or just wrong: one domain name with another that runs similar software. Examples of false rules:
- /YoninaEldar/ != /DavidMalah/
- /labs/vlsi/oldsite != /labs/vlsi
- -2. != -3.

Filtering out False Rules
- Get rid of the big buckets.
- Use the size field: a false DUST rule may still produce valid URLs, but the content is not similar and the size is probably different. Size ranges are used.
- Tokenization helps.

DustBuster – Cleaning up the Rules
Go over the list with a window: if rule a refines rule b and their support sizes are close, leave only rule a.

DustBuster – Validation
Validation is done per rule:
1. Get sample URLs to which the rule can be applied.
2. Apply the rule: URL => applied URL.
3. Get the content and compare using shingles.

DustBuster – Validation
- Stop fetching when #failures > 100 × (1 − threshold); for example, with a 95% threshold and 100 samples, stop after more than 5 failures. A page that doesn't exist is not similar to anything else.
- Why use a threshold < 100%? Shingles are not perfect, and dynamic pages may change quickly.

Detect Alias DUST – Take 2
- Tokenize, of course.
- Form buckets; ignore big buckets.
- Count support only if sizes match.
- Don't count long substrings.
- The results are cleaner.

Eliminate Redundancies (no fetch here!)

EliminateRedundancies(pairs_list R):
  for i = 1 to |R|:
    if R[i] was already eliminated: continue
    to_eliminate_current := false
    for j = 1 to min(MW, |R| - i):          /* go over a window */
      /* support not close? stop checking */
      if R[i].size - R[i+j].size > max(MRD * R[i].size, MAD): break
      if R[i] refines R[i+j]:               /* a refines b? remove b */
        eliminate R[i+j]
      else if R[i+j] refines R[i]:
        to_eliminate_current := true
        break
    if to_eliminate_current:
      eliminate R[i]
  return R

Validate a Single Rule

ValidateRule(R, L):
  positive := 0
  negative := 0
  /* stop when you are sure you have either succeeded or failed */
  while (positive < (1 − ε)N) and (negative < εN):
    u := a random URL from L to which R is applicable
    v := outcome of applying R to u
    fetch u and v
    if fetch of u failed: continue
    /* something went wrong: a negative sample */
    if fetch of v failed or shingling(u) ≠ shingling(v):
      negative := negative + 1
    /* another positive sample */
    else:
      positive := positive + 1
  if negative ≥ εN:
    return FALSE
  return TRUE

Validate Rules

Validate(rules_list R, test_log L):
  create an empty list of valid rules LR
  for i = 1 to |R|:
    /* skip rules refined by rules that survived (= valid rules) */
    for j = 1 to i − 1:
      if R[j] was not eliminated and R[i] refines R[j]:
        eliminate R[i] from the list
        break
    if R[i] was eliminated: continue
    /* test one direction */
    if ValidateRule(R[i].alpha → R[i].beta, L):
      add R[i].alpha → R[i].beta to LR
    /* test the other direction only if the first direction failed */
    else if ValidateRule(R[i].beta → R[i].alpha, L):
      add R[i].beta → R[i].alpha to LR
    else:
      eliminate R[i] from the list
  return LR