Finding Replicated Web Collections. Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina.

Presentation transcript:

1 Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

2 Replication is common!

3 Statistics (Preview)
- More than 48% of pages have copies!

4 Reasons for replication
Actual replication
- Simple copying or mirroring
Apparent replication
- Aliases (multiple site names)
- Symbolic links
- Multiple mount points

5 Challenges
- Subgraph isomorphism: NP-complete
- Hundreds of millions of pages
- Slight differences between copies

6 Outline
- Definitions
  - Web graph, collection
  - Identical collection
- Similar collection
- Algorithm
- Applications
- Results

7 Web graph
- Node: web page
- Edge: link between pages
- Node label: page content (excluding links)

8 Identical web collection
- Collection: induced subgraph
- Identical collection: one-to-one (equi-size)
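
The definitions on slides 7 and 8 can be sketched in Python as follows. This is only an illustration: the class name `WebGraph`, its method names, and the use of URLs as node identifiers are assumptions, not something the slides specify.

```python
from dataclasses import dataclass, field


@dataclass
class WebGraph:
    """Directed graph: nodes are pages, edges are links, labels are page content."""
    content: dict = field(default_factory=dict)  # url -> page text (links excluded)
    links: dict = field(default_factory=dict)    # url -> set of linked urls

    def add_page(self, url: str, text: str) -> None:
        self.content[url] = text
        self.links.setdefault(url, set())

    def add_link(self, src: str, dst: str) -> None:
        # Both endpoints must already be known pages.
        if src not in self.content or dst not in self.content:
            raise KeyError("both pages must be added before linking them")
        self.links[src].add(dst)

    def induced_subgraph(self, pages: set) -> "WebGraph":
        """A collection: the chosen pages plus every link among them."""
        sub = WebGraph()
        for url in pages:
            sub.add_page(url, self.content[url])
        for url in pages:
            for dst in self.links[url] & pages:
                sub.add_link(url, dst)
        return sub


def identical_under(a: WebGraph, b: WebGraph, mapping: dict) -> bool:
    """Check that a given one-to-one page mapping preserves content and links."""
    if len(a.content) != len(b.content):           # equi-size
        return False
    if set(mapping) != set(a.content) or set(mapping.values()) != set(b.content):
        return False
    for url, twin in mapping.items():
        if a.content[url] != b.content[twin]:
            return False
        if {mapping[d] for d in a.links[url]} != b.links[twin]:
            return False
    return True
```

Note that `identical_under` only verifies a mapping that is already known; finding such a mapping from scratch is the hard part the Challenges slide refers to.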

9 Collection similarity
- Coincides with intuitively similar collections
- Computable similarity measure

10 Collection similarity
- Page content

11 Page content similarity
- Fingerprint-based approach (chunking)
  - Shingles [Broder et al., 1997]
  - Sentence [Brin et al., 1995]
  - Word [Shivakumar et al., 1995]
- Many interesting issues
  - Threshold value
  - Iceberg query
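
A minimal sketch of the chunk-and-fingerprint idea, using word shingles as the chunking unit. The shingle width `k = 4`, the MD5-based fingerprint, and the Jaccard overlap used as the similarity score are all assumptions made for illustration; the slides name the approaches but give no parameters.

```python
import hashlib


def shingle_fingerprints(text: str, k: int = 4) -> set:
    """Split text into overlapping k-word chunks and fingerprint each chunk."""
    words = text.lower().split()
    if len(words) < k:
        chunks = [" ".join(words)] if words else []
    else:
        chunks = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    # Keep a short hash prefix per chunk as its fingerprint.
    return {hashlib.md5(c.encode("utf-8")).hexdigest()[:16] for c in chunks}


def resemblance(a: str, b: str, k: int = 4) -> float:
    """Fraction of fingerprints shared by two pages (Jaccard overlap)."""
    fa, fb = shingle_fingerprints(a, k), shingle_fingerprints(b, k)
    if not fa and not fb:
        return 1.0
    return len(fa & fb) / len(fa | fb)
```

Two pages would then be treated as copies when `resemblance` exceeds a chosen threshold, which is the "threshold value" issue listed on the slide.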

12 Collection similarity
- Link structure

13 Collection similarity
- Size

14 Collection similarity
- Size vs. Cardinality

15 Growth strategy

16 Essential property
[Figure: page clusters Ra (pages a) and Rb (pages b)]
|Ra| = Ls = Ld = |Rb|
Ls: # of pages linked from
Ld: # of pages linked to

17 Essential property
[Figure: page clusters Ra (pages a) and Rb (pages b)]
|Ra| ≥ Ls = Ld ≤ |Rb|
Ls: # of pages linked from
Ld: # of pages linked to
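
The exact form of the property from slide 16, |Ra| = Ls = Ld = |Rb|, can be checked for one pair of page clusters as sketched below. The function names and the representation of links as `(source, destination)` pairs are assumptions. Ls is read here as the number of distinct pages in Ra that link into Rb, and Ld as the number of distinct pages in Rb that receive such a link.

```python
def link_counts(ra: set, rb: set, links: set) -> tuple:
    """Return (|Ra|, Ls, Ld, |Rb|) for links running from cluster Ra to cluster Rb."""
    crossing = [(s, d) for s, d in links if s in ra and d in rb]
    sources = {s for s, _ in crossing}   # pages in Ra that are linked from
    targets = {d for _, d in crossing}   # pages in Rb that are linked to
    return len(ra), len(sources), len(targets), len(rb)


def has_exact_property(ra: set, rb: set, links: set) -> bool:
    size_a, ls, ld, size_b = link_counts(ra, rb, links)
    return size_a == ls == ld == size_b
```

With three copies of page a each linking to its own copy of page b, the counts come out as (3, 3, 3, 3) and the check passes; dropping one of those links gives (3, 2, 2, 3) and the exact check fails.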

18 Algorithm
- Based on the property we identified
- Input: set of pages collected from web
- Output: set of similar collections
- Complexity: O(n log n)

19 Algorithm
- Step 1: Similar page identification (iceberg query)
  - 25 million pages
  - Fingerprint computation: 44 hours
  - Replicated page computation: 10 hours
[Figure labels: web pages, Step 1, table with columns Rid and Pid]
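
A rough sketch of Step 1 as an iceberg-style grouping: pages are grouped by fingerprint, and only groups that reach a minimum count are kept. Several things here are assumptions: one fingerprint per whole page (a simplification of chunk-based fingerprinting), the `min_copies` threshold, and the reading of `Rid` as a cluster id and `Pid` as a page id.

```python
from collections import defaultdict


def replicated_page_table(fingerprints: dict, min_copies: int = 2) -> list:
    """Map {pid: fingerprint} to (rid, pid) rows for pages that have copies."""
    groups = defaultdict(list)
    for pid, fp in fingerprints.items():
        groups[fp].append(pid)

    rows = []
    rid = 0
    for fp in sorted(groups):
        pids = groups[fp]
        if len(pids) >= min_copies:      # iceberg condition: keep only large groups
            rows.extend((rid, pid) for pid in sorted(pids))
            rid += 1
    return rows
```

Pages whose fingerprint occurs only once never reach the output table, which is what keeps the later steps small relative to the full crawl.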

20 Algorithm
- Step 2: link structure check
[Figure labels: tables Link, R1, and R2 (copy of R1), with columns Rid and Pid]
Group by (R1.Rid, R2.Rid)
Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
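
The join and group-by on this slide can be imitated in plain Python as sketched below. The input shapes are assumptions: `rows` is a list of `(rid, pid)` pairs with each page in exactly one cluster, and `links` is a list of `(source pid, destination pid)` pairs. Ls and Ld are computed as distinct page counts, which is one reading of the slide's `Count(...)` expressions.

```python
from collections import defaultdict


def link_structure_stats(rows: list, links: list) -> dict:
    """Return {(rid_a, rid_b): (|Ra|, Ls, Ld, |Rb|)} for linked cluster pairs."""
    cluster_of = {pid: rid for rid, pid in rows}
    size = defaultdict(int)
    for rid, _ in rows:
        size[rid] += 1

    sources = defaultdict(set)   # (rid_a, rid_b) -> pages in Ra that link into Rb
    targets = defaultdict(set)   # (rid_a, rid_b) -> pages in Rb linked from Ra
    for src, dst in links:
        # Join: keep only links whose two endpoints are both replicated pages.
        if src in cluster_of and dst in cluster_of:
            key = (cluster_of[src], cluster_of[dst])
            sources[key].add(src)
            targets[key].add(dst)

    return {
        key: (size[key[0]], len(sources[key]), len(targets[key]), size[key[1]])
        for key in sources
    }
```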

21 Algorithm
- Step 3:
    S = {}
    For every (|Ra|, Ls, Ld, |Rb|) in step 2
        If (|Ra| = Ls = Ld = |Rb|)
            S = S U { }
    Union-Find(S)
- Step 2-3: 10 hours
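
A small sketch of Step 3 under stated assumptions. The input `stats` is assumed to map a cluster pair `(a, b)` to its `(|Ra|, Ls, Ld, |Rb|)` tuple, and the element added to S is assumed to be that cluster pair, since the slide text leaves the braces empty. The union-find below uses simple path halving; the slides do not say which variant was used.

```python
def merge_clusters(stats: dict) -> list:
    """Group cluster ids connected by pairs that satisfy |Ra| = Ls = Ld = |Rb|."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for (a, b), (size_a, ls, ld, size_b) in stats.items():
        if size_a == ls == ld == size_b:    # the test from the slide
            union(a, b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())
```

The sketch stops at grouping clusters that grow together; how each group is turned into concrete collections is not shown on the slide and is left out here.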

22 Experiment
- 25 widely replicated collections (cardinality: 5-10 copies, size: pages) => Total number of pages : 35, ,000 random pages
- Result: 180 collections
  - 149 “good” collections
  - 31 “problem” collections

23 Results

24 Applications
- Web crawling & archiving
  - Save network bandwidth
  - Save disk storage

25 Application (web crawling)
- Before experiment: 48%
- With our technique: 13%
[Figure labels: initial crawl, offline copy detection, second crawl, replication info, crawled pages]

26 Applications (web search)

27 Related work
- Collection similarity
  - Altavista [Bharat et al., 1999]
- Page similarity
  - COPS [Brin et al., 1995]: sentence
  - SCAM [Shivakumar et al., 1995]: word
  - Altavista [Broder et al., 1997]: shingle

28 Summary
- Computable similarity measure
- Efficient replication-detection algorithm
- Application to real-world problems