
1 Deduplication CSCI 572: Information Retrieval and Search Engines Summer 2010

2 Outline
– What is Deduplication?
– Importance
– Challenges
– Approaches

3 What are web duplicates?
The same page, referenced by different URLs
– http://espn.go.com vs. http://www.espn.com
– What are the differences? URL host (virtual hosts), sometimes protocol, sometimes page name, etc.
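A crawler typically collapses such variants by canonicalizing URLs before fetching. Below is a minimal Python sketch of the idea; the `canonicalize` helper and its `HOST_ALIASES` table are hypothetical illustrations (a real crawler would discover host equivalences from DNS data or by comparing fetched content), not part of the lecture material:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical alias table; real systems learn host equivalences
# from DNS data or by comparing fetched content.
HOST_ALIASES = {"espn.go.com": "www.espn.com"}

def canonicalize(url):
    """Reduce a URL to a canonical form so trivial variants
    (scheme, host alias, trailing slash) map to the same key."""
    scheme, netloc, path, query, _fragment = urlsplit(url.lower())
    netloc = HOST_ALIASES.get(netloc, netloc)
    path = path.rstrip("/") or "/"
    return urlunsplit(("http", netloc, path, query, ""))

# Both variants collapse to http://www.espn.com/
assert canonicalize("http://espn.go.com") == canonicalize("https://www.espn.com/")
```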

4 What are web duplicates?
A near-identical page, referenced by the same URL
– e.g., a Google search for “search engines”, run at different times
– What are the differences? The page is within some delta % similar to the other (where delta is a large number), but may differ in, e.g., ads, counters, timestamps, etc.

5 Why is it important to consider duplicates?
In search engines, URLs tell the crawlers where to go and how to navigate the information space. Ideally, given the web’s scale and complexity, we’ll prioritize crawling content that we haven’t already stored or seen before (see the sketch after this list):
– Saves resources (on the crawler end, as well as on the remote host)
– Increases crawler politeness
– Reduces the analysis that we’ll have to do later
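As a minimal illustration of “don’t store what we’ve already seen,” a crawler can keep a set of content digests and consult it before storing a fetched page. This sketch handles exact duplicates only (near-duplicates need the simhash/shingling techniques covered later in these slides); the names `seen_digests` and `should_store` are made up for illustration:

```python
import hashlib

seen_digests = set()  # digests of page content stored so far

def should_store(page_bytes):
    """Return True only for content we have not stored before.
    Exact-match deduplication: any byte-level change defeats it."""
    digest = hashlib.sha1(page_bytes).hexdigest()
    if digest in seen_digests:
        return False   # duplicate: save storage and later analysis
    seen_digests.add(digest)
    return True
```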

6 Why is it important to consider duplicates?
Identification of website mirrors (or copies of content) used to spread load and bandwidth consumption
– SourceForge.net, CPAN, Apache, etc.
If you identify a mirror, you can omit crawling many web pages and save crawler resources.

7 “More Like This”
Finding content similar to what you were looking for
– As we discussed during the lecture on search engine architecture, much of the time spent using a search engine goes to filtering through the results. Presenting similar documents can cut down on that filtering time.

8 XML
XML documents often appear structurally very similar
– What’s the difference between RSS and RDF and OWL and XSL and XSLT and any number of other XML documents out there?
With the ability to identify similarity and reduce duplication of XML, we could identify XML documents with similar structure (see the sketch after this list)
– RSS feeds that contain the same links
– Differentiate RSS (crawl more often) from other, less frequently updated XML
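One simple way to compare XML documents by structure alone is to collect each document’s set of root-to-node tag paths and take their Jaccard overlap. The helpers below (`tag_paths`, `structural_similarity`) are an illustrative sketch under that assumption, not a method prescribed in the slides:

```python
import xml.etree.ElementTree as ET

def tag_paths(xml_text):
    """Set of root-to-node tag paths; text and attributes are
    ignored so only document structure is compared."""
    paths = set()
    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)
    walk(ET.fromstring(xml_text), "")
    return paths

def structural_similarity(a_xml, b_xml):
    """Jaccard overlap of the two documents' tag-path sets."""
    pa, pb = tag_paths(a_xml), tag_paths(b_xml)
    return len(pa & pb) / len(pa | pb) if pa | pb else 0.0
```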

9 Detect Plagiarism
Determine which web sites and reports plagiarize one another
– Important for copyright law and its enforcement
Determine similarity between source code (see the sketch below)
– Licensing issues: open source and otherwise
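As a rough first pass at flagging copied text or source code, Python’s standard difflib can score how much of one document reappears in another; a ratio near 1.0 marks a pair for closer inspection. A minimal sketch (word-token comparison is an assumption here; real plagiarism detectors use more robust fingerprinting):

```python
import difflib

def copy_score(doc_a, doc_b):
    """Similarity ratio in [0, 1] over word tokens; values near 1
    suggest one document may be a copy or light rewrite of the other."""
    return difflib.SequenceMatcher(None, doc_a.split(), doc_b.split()).ratio()
```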

10 Detection of SPAM
Identifying malicious SPAM content
– Adult sites
– Pharmaceutical and prescription drug SPAM
– Malware and phishing scams
Need to ignore this content from a crawling perspective
– Or to “flag” it and exclude it from (general) search results

11 Challenges
Scalability
– Most approaches to detecting duplicates rely on training and analytical techniques that may be computationally expensive
– The challenge is to perform the evaluation at low cost
What to do with the duplicates?
– The answer isn’t always to throw them out – they may be useful for study
– The content may require indexing for later comparison in legal matters, or for “snapshotting” the web at a point in time, e.g., the Internet Archive

12 Challenges
Structure versus semantics
– Documents that are structurally dissimilar may contain exactly the same content
– Think of the different HTML tags used to emphasize text (e.g., <em> versus <b>)
– Need to take this into account
Online versus offline
– Depends on crawling strategy, but offline detection can typically provide more precision, at the cost of being unable to react dynamically

13 Approaches for Deduplication
SIMHASH and Hamming distance
– Treat web documents as a set of features, constituting an n-dimensional vector; transform this vector into an f-bit fingerprint of small size, e.g., 64 bits
– Compare fingerprints and look for a difference of at most k bits
– Manku et al., WWW 2007
Syntactic similarity
– Shingling: treat web documents as contiguous subsequences of words and compute the w-shingling
– Broder et al., WWW 1997
Sketches of both techniques follow below.
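First, a minimal sketch of the simhash idea: hash each feature, sum +1/−1 votes per bit position, and keep the sign bits. Using words as features and MD5 as the per-feature hash are simplifying assumptions here (Manku et al. weight features and use a dedicated hash function):

```python
import hashlib
import re

def simhash(text, f=64):
    """f-bit simhash: each word-feature votes +1/-1 on every bit
    position; the fingerprint keeps the sign of each position."""
    v = [0] * f
    for word in re.findall(r"\w+", text.lower()):
        h = int(hashlib.md5(word.encode()).hexdigest(), 16) & ((1 << f) - 1)
        for i in range(f):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(f) if v[i] > 0)

def near_duplicate(fp_a, fp_b, k=3):
    """Fingerprints within Hamming distance k are near-duplicates."""
    return bin(fp_a ^ fp_b).count("1") <= k
```

And a sketch of w-shingling with Broder-style resemblance. Materializing full shingle sets, as done here, is only practical for illustration; real systems hash the shingles and estimate the Jaccard value with techniques like min-hashing:

```python
import re

def shingles(text, w=4):
    """The w-shingling of a document: every contiguous w-word run."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Broder's resemblance: Jaccard similarity of the shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```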

14 Approaches for Deduplication
Link structure similarity
– Identify similarity in the linkage between web collections (see the sketch below)
– Cho et al.
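As one hedged illustration of the intuition (not Cho et al.’s actual algorithm, which identifies whole replicated collections), pages or collections whose out-link sets overlap heavily are candidates for duplicated content:

```python
def link_similarity(outlinks_a, outlinks_b):
    """Jaccard overlap of two pages' out-link URL sets; mirrored or
    copied collections tend to point at near-identical link sets."""
    a, b = set(outlinks_a), set(outlinks_b)
    return len(a & b) / len(a | b) if a | b else 0.0
```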

15 Approaches for Deduplication
Exploiting the structure of, and links between, physical network hosts
– Look at: language, geographical connection, continuations and proxies
– Zipfian function
– Bharat et al., ICDM 2001

16 Wrapup
Deduplication is needed to conserve resources and to ensure the quality and accuracy of the resultant search indices
– It can also assist in other areas like plagiarism detection, SPAM detection, fraud detection, etc.
Deduplication at web scale is difficult; we need efficient means to perform this computation online or offline
Techniques look at page structure/content, page link structure, or physical web host structure

