Finding Replicated Web Collections
Authors: Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Paper presentation by Radhika Malladi and Vijay Reddy Mara
Outline
Introduction
Similarity Measures: Similarity in Web Pages; Similarity in Link Structure
Similar Clusters: Computing; Implementing; Quality of Clusters
Exploiting Clusters: Improving Crawling; Improving Search Engine Results
Introduction
Replication across the web? Replicated collections can span from several hundred to thousands of pages. Mirrored web pages?
Mirrored pages.
What will we learn from this paper?
Similarity measures: how to compute similarity measures for collections of web pages.
Improved crawling: how to use replication information to improve crawling.
Less clutter in search engine results: how to use replication information to suppress duplicate hits.
By automatically identifying mirrored collections, we can improve the following:
Crawling: fetching web pages.
Ranking: pages can be ranked by their replication factor.
Archiving: storing a subset of the web.
Caching: holding frequently accessed web pages.
Difficulties in detecting replicated collections:
Update frequency: mirror copies may not be updated regularly.
Partial mirror coverage: mirrors may differ from the primary.
Different formats.
Partial crawls: snapshots of the same collection may differ.
Similarity Measures
Definition (Web Graph): a graph G = (V, E) with a node v_i for each page p_i and a directed edge from v_i to v_j if there is a hyperlink from p_i to p_j.
Definition (Collection): a group of web pages, i.e., a subgraph of the web graph induced by those pages.
Collection size: the number of pages in the collection.
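A minimal sketch of how the web graph and a collection from these definitions might be represented; the Python names (WebGraph, Collection, collection_size) are ours, not the authors':

```python
from dataclasses import dataclass, field

@dataclass
class WebGraph:
    # pages maps a page id p_i to its content; links holds directed edges (p_i, p_j)
    pages: dict[str, str] = field(default_factory=dict)
    links: set[tuple[str, str]] = field(default_factory=set)

    def add_page(self, pid: str, content: str) -> None:
        self.pages[pid] = content

    def add_link(self, src: str, dst: str) -> None:
        # a directed edge v_i -> v_j exists when page p_i hyperlinks to p_j
        self.links.add((src, dst))

# A collection is simply a set of page ids; its size is the number of pages in it.
Collection = set[str]

def collection_size(c: Collection) -> int:
    return len(c)
```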
Definition (Identical Collections): equisized collections C1 and C2 are identical if there is a one-to-one mapping M that maps C1 pages to C2 pages such that:
Identical pages: for each page p in C1, p is identical to M(p).
Identical link structure: for each link in C1 from page a to page b, there is a link from M(a) to M(b) in C2.
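Based on the two conditions above, a hedged sketch of an identity check, reusing the toy WebGraph/Collection model from the previous sketch (the mapping M is passed in explicitly):

```python
def are_identical(g: WebGraph, c1: Collection, c2: Collection,
                  m: dict[str, str]) -> bool:
    if len(c1) != len(c2):          # equisized requirement
        return False
    # Identical pages: p and M(p) have identical content.
    if any(g.pages[p] != g.pages[m[p]] for p in c1):
        return False
    # Identical link structure: every link a -> b inside C1 maps to a link M(a) -> M(b).
    for a, b in g.links:
        if a in c1 and b in c1 and (m[a], m[b]) not in g.links:
            return False
    return True
```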
Relaxing the definition: similarity of link structure; collection sizes; one-to-one mapping.
Break points: Link Similarity:
Definition (Similar Collections): equisized collections C1 and C2 are similar if there is a one-to-one mapping M that maps all C1 pages to all C2 pages such that:
Similar pages: for each page p in C1, p is similar to M(p).
Similar links: two corresponding pages should each have at least one parent in their respective collections, and those parents should themselves be corresponding similar pages.
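A corresponding sketch of the relaxed test, again reusing the toy model; page_similar stands for any page-level similarity predicate (e.g., fingerprint overlap), and the link condition follows the slide's wording rather than the paper's exact formulation:

```python
from typing import Callable

def are_similar(g: WebGraph, c1: Collection, c2: Collection,
                m: dict[str, str],
                page_similar: Callable[[str, str], bool]) -> bool:
    if len(c1) != len(c2):
        return False
    # Similar pages: every p in C1 is similar to its image M(p).
    if not all(page_similar(p, m[p]) for p in c1):
        return False
    # Similar links: each page with a parent inside C1 needs at least one parent a
    # whose counterpart M(a) is also a parent of M(p) in C2.
    for p in c1:
        parents = [a for (a, b) in g.links if b == p and a in c1]
        if parents and not any((m[a], m[p]) in g.links for a in parents):
            return False
    return True
```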
Similar collections
Similar Clusters: Computing
Example 1 of similar clusters: cluster size (cardinality) 2, collection size 5.
Example 2 of similar clusters: cluster size (cardinality) 3, collection size 3.
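A tiny illustration of the two examples above, treating a cluster as a list of equisized collections; the page identifiers are made up:

```python
example_1 = [{"a1", "a2", "a3", "a4", "a5"},    # a collection of 5 pages
             {"b1", "b2", "b3", "b4", "b5"}]    # and one replica of it
example_2 = [{"x1", "x2", "x3"},
             {"y1", "y2", "y3"},
             {"z1", "z2", "z3"}]

def cluster_size(cluster):          # cardinality: number of replicated collections
    return len(cluster)

def pages_per_collection(cluster):  # collection size (collections are equisized)
    return len(cluster[0])

print(cluster_size(example_1), pages_per_collection(example_1))  # 2 5
print(cluster_size(example_2), pages_per_collection(example_2))  # 3 3
```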
Cluster Algorithm Example: Step 1: Find all trivial clusters
Step 2: Merge trivial clusters when merging them leads to similar clusters.
Step 3: Outcome
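A deliberately simplified sketch of the three steps, assuming the toy WebGraph from earlier. It treats a cluster as a plain set of pages and uses a looser merge test than the paper's actual condition (which requires that the merged clusters still form similar collections):

```python
from collections import defaultdict

def trivial_clusters(page_fingerprints: dict[str, int]) -> list[set[str]]:
    # Step 1: pages sharing the same fingerprint form a trivial cluster.
    groups: dict[int, set[str]] = defaultdict(set)
    for pid, fp in page_fingerprints.items():
        groups[fp].add(pid)
    return [pages for pages in groups.values() if len(pages) > 1]

def can_merge(g: WebGraph, a: set[str], b: set[str]) -> bool:
    # Simplified merge test: every page in cluster a links to some page in cluster b.
    return all(any((p, q) in g.links for q in b) for p in a)

def grow_clusters(g: WebGraph, clusters: list[set[str]]) -> list[set[str]]:
    # Steps 2-3: keep merging cluster pairs that pass the test until nothing changes.
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if can_merge(g, clusters[i], clusters[j]):
                    clusters[i] |= clusters[j]
                    del clusters[j]
                    merged = True
                    break
            if merged:
                break
    return clusters
```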
Another example of the cluster algorithm, with two possible clusters.
Cont…
Quality of Clusters
Concept of fingerprints: entire-document fingerprint; four-line fingerprint; two-line fingerprint.
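One possible reading of the fingerprint idea, sketched in Python: hash the whole document and fixed-size line chunks, then score two pages by chunk-fingerprint overlap. The hash choice and the scoring formula are our assumptions, not the paper's exact scheme:

```python
import hashlib

def _fp(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def chunk_fingerprints(doc: str, chunk_lines: int) -> set[str]:
    # Fingerprint consecutive chunks of chunk_lines lines (e.g., 4 or 2).
    lines = doc.splitlines()
    return {_fp("\n".join(lines[i:i + chunk_lines]))
            for i in range(0, len(lines), chunk_lines)}

def page_similarity(doc_a: str, doc_b: str, chunk_lines: int = 4) -> float:
    # 1.0 when the entire-document fingerprints match; otherwise chunk overlap.
    if _fp(doc_a) == _fp(doc_b):
        return 1.0
    fa = chunk_fingerprints(doc_a, chunk_lines)
    fb = chunk_fingerprints(doc_b, chunk_lines)
    return len(fa & fb) / max(len(fa | fb), 1)
```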
Exploiting Clusters
Improving crawling: use replication information (e.g., from a previous crawl) so that replicated collections need not be fetched repeatedly.
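A sketch of how that replication information could guide the next crawl: keep one primary prefix per cluster and skip URLs under known replica prefixes. The replica_prefixes input and the URLs are hypothetical, not part of the paper's system:

```python
def should_fetch(url: str, replica_prefixes: set[str]) -> bool:
    # Skip any URL that falls under a prefix already known to be a replica.
    return not any(url.startswith(prefix) for prefix in replica_prefixes)

# Hypothetical example: if http://mirror.example.org/docs/ replicates
# http://origin.example.com/docs/, only the origin copy is crawled.
replicas = {"http://mirror.example.org/docs/"}
print(should_fetch("http://mirror.example.org/docs/page1.html", replicas))  # False
print(should_fetch("http://origin.example.com/docs/page1.html", replicas))  # True
```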
Improving search engine results: reduces clutter via a prototype interface. The prototype provides links to 'Collections' and 'Replica'.
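A sketch of the de-cluttering idea: show one representative hit per replica cluster and fold the remaining copies behind the 'Replica' link. The cluster_of mapping (URL to cluster id, or None for unclustered pages) is an assumed input:

```python
from collections import OrderedDict
from typing import Optional

def group_results(hits: list[str],
                  cluster_of: dict[str, Optional[str]]) -> list[dict]:
    # hits are assumed to arrive in rank order; the first page seen from each
    # cluster becomes the shown result, later copies go behind the 'Replica' link.
    grouped: "OrderedDict[str, dict]" = OrderedDict()
    for url in hits:
        key = cluster_of.get(url) or url      # unclustered pages stand alone
        entry = grouped.setdefault(key, {"shown": url, "replicas": []})
        if entry["shown"] != url:
            entry["replicas"].append(url)
    return list(grouped.values())
```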
Conclusion
Discussion Question: Which should be larger, the cluster size or the collection size?
Discussion Question: Suppose p is similar to p', and p'' is similar to p', but p and p'' are not similar. Do you think all three pages should be considered similar?