1
Finding Replicated Web Collections
Authors: Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina. Paper presentation by Radhika Malladi and Vijay Reddy Mara.
2
Introduction
Similarity Measures: Similarity in Web Pages; Similarity in Link Structure
Similar Clusters: Computing; Implementing; Quality of Clusters
Exploiting Clusters: Improving Crawling; Improving Search Engine Results
3
Introduction: Replication across the web?
Replicated collections can span several hundred to several thousand pages. Mirrored web pages?
4
Mirrored pages.
6
What will we learn from this paper?
Similarity measures: compute similarity measures for collections of web pages. Improved crawling: use replication information to improve crawling. Reduced clutter: use replication information to remove duplicate results from search engines.
7
By automatically identifying mirrored collections,
we can improve the following: Crawling: fetches web pages. Ranking: pages can be ranked by their replication factor. Archiving: stores a subset of the web. Caching: holds frequently accessed web pages.
8
Difficulties detecting replicated collections:
Update frequency: mirror copies may not be updated regularly. Partial mirror coverage: mirrors may cover only part of the primary collection. Different formats: mirrored pages may be formatted differently from the originals. Partial crawls: crawled snapshots of the same collections may differ.
9
Similarity Measures. Definition (Web Graph): a graph G = (V, E) with a node v_i for each page p_i and a directed edge from v_i to v_j if there is a hyperlink from p_i to p_j. Definition (Collection): a group of web pages from the web graph. Collection size: the number of pages in the collection.
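To make these definitions concrete, here is a minimal Python sketch (the variable names and the tiny three-page graph are ours, not the paper's): the web graph is an adjacency map from each page to the pages it links to, and a collection is simply a group of pages from that graph.

```python
# Minimal illustration of the definitions above (names and pages are made up).
# The web graph maps each page to the set of pages it links to.
web_graph = {
    "p1": {"p2", "p3"},   # p1 has hyperlinks to p2 and p3
    "p2": {"p3"},
    "p3": set(),
}

# A collection is a group of pages drawn from the graph.
collection = {"p1", "p2"}

# Collection size: number of pages in the collection.
print(len(collection))  # -> 2
```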
10
Definition (Identical Collections): Equisized collections C1 and C2 are identical if there is a one-to-one mapping M that maps C1 pages to C2 pages such that: Identical pages: for each page p ∈ C1, p is identical to M(p). Identical link structure: for each link in C1 from page a to page b, there is a link from M(a) to M(b) in C2.
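A small sketch of this test, assuming the mapping M is given as a dictionary and page-level equality is supplied as a predicate (the function and parameter names are ours):

```python
def identical_collections(c1_links, c2_links, mapping, pages_identical):
    """Check the two conditions above for equisized collections C1 and C2.

    c1_links / c2_links: dict mapping each page to the set of pages it links to
    within its collection; mapping: one-to-one dict from C1 pages to C2 pages;
    pages_identical: predicate deciding whether two pages have identical content.
    """
    # Identical pages: every page p in C1 is identical to M(p).
    if not all(pages_identical(p, mapping[p]) for p in c1_links):
        return False
    # Identical link structure: every link a -> b in C1 appears as M(a) -> M(b) in C2.
    return all(mapping[b] in c2_links[mapping[a]]
               for a in c1_links for b in c1_links[a])
```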
12
Similarity of link structure:
Collection sizes; one-to-one mapping.
13
Break points; link similarity.
14
Definition (Similar Collections): Equisized collections C1 and C2 are similar if there is a one-to-one mapping M that maps all C1 pages to all C2 pages such that: Similar pages: for each page p ∈ C1, p is similar to M(p). Similar links: two corresponding pages should have at least one pair of parents in their respective collections that are also similar (corresponding) pages.
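A hedged sketch of this definition as worded on the slide: pages must be pairwise similar under M, and every page that has parents must have at least one parent whose image under M is a parent of the corresponding page. The helper names and the parent computation below are ours:

```python
def similar_collections(c1_links, c2_links, mapping, pages_similar):
    """Sketch of the 'similar collections' test as worded on this slide."""
    # Similar pages: every page p in C1 is similar to M(p).
    if not all(pages_similar(p, mapping[p]) for p in c1_links):
        return False
    # Similar links: for each page with parents in C1, at least one of those
    # parents must map to a parent of the corresponding page in C2.
    parents1 = {p: {a for a in c1_links if p in c1_links[a]} for p in c1_links}
    parents2 = {q: {a for a in c2_links if q in c2_links[a]} for q in c2_links}
    return all(not parents1[p] or
               any(mapping[a] in parents2[mapping[p]] for a in parents1[p])
               for p in c1_links)
```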
15
Similar collections
16
Similar Clusters: Computing. Example 1 of similar clusters.
Cluster size (cardinality): 2. Collection size: 5.
17
Computing: Example 2 of similar clusters. Cluster size (cardinality): 3. Collection size: 3.
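A tiny illustration of the terminology in these examples (the collection and page names are made up): a cluster is a set of similar collections, its cardinality is how many collections it holds, and collection size is the number of pages in each collection.

```python
# Terminology check for Example 2 (made-up names):
cluster = [
    {"a1", "a2", "a3"},   # collection A
    {"b1", "b2", "b3"},   # collection B, a replica of A
    {"c1", "c2", "c3"},   # collection C, another replica
]
print(len(cluster))      # cluster size (cardinality) -> 3
print(len(cluster[0]))   # collection size            -> 3
```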
18
Cluster Algorithm Example:
Step 1: Find all trivial clusters
19
Step 2: Merge trivial clusters when the merge leads to similar clusters
20
Step 3: Outcome
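A simplified Python sketch of the three steps above, not the paper's exact algorithm: trivial clusters are formed by grouping mutually similar pages, and clusters are then merged greedily whenever a stand-in predicate says the merge still yields a similar cluster. Both predicates are assumptions supplied by the caller.

```python
def cluster_algorithm(pages, pages_similar, can_merge):
    """Sketch of the slide's steps (a simplification, not the paper's exact algorithm).

    pages_similar(p, q): page-level similarity predicate (e.g. fingerprint overlap).
    can_merge(c1, c2): stand-in test that merging two clusters still gives
    a cluster of similar collections.
    """
    # Step 1: find all trivial clusters -- groups of mutually similar single pages.
    trivial = []
    for p in pages:
        for group in trivial:
            if all(pages_similar(p, q) for q in group):
                group.add(p)
                break
        else:
            trivial.append({p})

    # Step 2: merge trivial clusters whenever the merge still yields a similar cluster.
    clusters = [frozenset(g) for g in trivial]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if can_merge(clusters[i], clusters[j]):
                    union = clusters[i] | clusters[j]
                    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                    clusters.append(union)
                    merged = True
                    break
            if merged:
                break

    # Step 3: outcome -- the final set of clusters.
    return clusters
```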
21
Another example of the cluster algorithm, with two possible clusters.
22
Cont…
23
Quality of Clusters
24
Concept of fingerprints:
Entire-document fingerprint; four-line fingerprint; two-line fingerprint.
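A hedged sketch of these three granularities: fingerprint the whole document, or every four-line or two-line chunk, and compare chunk fingerprints across pages as a rough similarity signal. The MD5 hash and the overlap measure are our choices for illustration, not necessarily the paper's fingerprinting scheme.

```python
import hashlib

def fingerprints(text, lines_per_chunk=None):
    """Fingerprints at one of the three granularities on this slide.

    lines_per_chunk=None -> one fingerprint for the entire document;
    lines_per_chunk=4 or 2 -> one fingerprint per 4-line or 2-line chunk.
    """
    if lines_per_chunk is None:
        chunks = [text]
    else:
        lines = text.splitlines()
        chunks = ["\n".join(lines[i:i + lines_per_chunk])
                  for i in range(0, len(lines), lines_per_chunk)]
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def page_overlap(a, b, lines_per_chunk=4):
    """Fraction of shared chunk fingerprints -- a rough page-similarity signal."""
    fa, fb = fingerprints(a, lines_per_chunk), fingerprints(b, lines_per_chunk)
    return len(fa & fb) / max(len(fa | fb), 1)
```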
28
Exploiting Clusters: Improving Crawling
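One way a crawler can exploit the clusters, sketched under our own simplifying assumptions (prefix-based replica detection and made-up URLs): keep one representative collection per cluster and skip URLs that fall inside the other replicas.

```python
# Hedged sketch: once clusters of replicated collections are known, a crawler can
# pick one representative per cluster and skip URLs belonging to the other replicas.
clusters = [
    ["http://primary.example.org/docs/", "http://mirror-a.example.net/docs/"],
]

skip_prefixes = {replica
                 for cluster in clusters
                 for replica in cluster[1:]}   # keep cluster[0] as the representative

def should_crawl(url):
    """Skip pages that live inside a known replica of an already-kept collection."""
    return not any(url.startswith(prefix) for prefix in skip_prefixes)

print(should_crawl("http://mirror-a.example.net/docs/page1.html"))  # -> False
print(should_crawl("http://primary.example.org/docs/page1.html"))   # -> True
```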
31
Improving Search engine results:
A prototype interface reduces clutter in the result list. The prototype has links to 'Collections' and 'Replica'.
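A rough sketch of that clutter-reduction idea, with hypothetical helper names: group hits that belong to the same replica cluster, show one representative per cluster, and keep the remaining URLs to back a 'Replica'-style link.

```python
from collections import OrderedDict

def collapse_replicas(results, cluster_of):
    """Show one hit per replica cluster, with its replicas attached.

    results: ranked list of result URLs.
    cluster_of: maps a URL to its cluster id (None if the page is not replicated).
    """
    collapsed = OrderedDict()
    for url in results:
        key = cluster_of(url) or url          # unreplicated pages stay on their own
        if key in collapsed:
            collapsed[key]["replicas"].append(url)   # would back a 'Replica' link
        else:
            collapsed[key] = {"shown": url, "replicas": []}
    return list(collapsed.values())
```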
32
Conclusion
33
Discussion question: Should the cluster size be larger, or the collection size?
34
Discussion question: Suppose p is similar to pI, and pII is similar to pI, but p and pII are not similar. Do you think all three pages are similar?