1
Finding Replicated Web Collections
Authors: Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina. Paper presentation by Radhika Malladi and Vijay Reddy Mara.
2
Introduction
Similarity Measures: Similarity in Web Pages; Similarity in Link Structure
Similar Clusters: Computing; Implementing; Quality of Clusters
Exploiting Clusters: Improving Crawling; Improving Search Engine Results
3
Introduction: Replication across the web?
Replicated collections can span several hundred to several thousand pages. Mirrored web pages?
4
Mirrored pages.
6
What will we learn from this paper?
Similarity measures: compute similarity measures for collections of web pages. Improved crawling: use replication information to improve crawling. Reduced clutter: use replication information to remove duplicate results from search engines.
7
By automatically identifying mirrored collections,
we can improve the following: Crawling: fetches web pages. Ranking: pages can be ranked by their replication factor. Archiving: stores a subset of the web. Caching: holds frequently accessed web pages.
8
Difficulties detecting replicated collections:
Update frequency: mirror copies may not be updated regularly. Partial mirror coverage: mirrors may cover only part of the primary collection. Different formats: mirrored pages may be formatted differently from the originals. Partial crawls: crawled snapshots of the same collections may differ.
9
Similarity Measures. Definition (Web Graph): a graph G = (V, E) with a node v_i for each page p_i and a directed edge from v_i to v_j if there is a hyperlink from p_i to p_j. Definition (Collection): a group of web pages from the web graph. Collection size: the number of pages in the collection.
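To make these definitions concrete, here is a minimal Python sketch (the variable names and the tiny three-page graph are ours, not the paper's): the web graph is an adjacency map from each page to the pages it links to, and a collection is simply a group of pages from that graph.

```python
# Minimal illustration of the definitions above (names and pages are made up).
# The web graph maps each page to the set of pages it links to.
web_graph = {
    "p1": {"p2", "p3"},   # p1 has hyperlinks to p2 and p3
    "p2": {"p3"},
    "p3": set(),
}

# A collection is a group of pages drawn from the graph.
collection = {"p1", "p2"}

# Collection size: number of pages in the collection.
print(len(collection))  # -> 2
```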
10
Definition (Identical Collections): Equisized collections C1 and C2 are identical if there is a one-to-one mapping M that maps C1 pages to C2 pages such that: Identical pages: for each page p ∈ C1, p is identical to M(p). Identical link structure: for each link in C1 from page a to page b, there is a link from M(a) to M(b) in C2.
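A small sketch of this test, assuming the mapping M is given as a dictionary and page-level equality is supplied as a predicate (the function and parameter names are ours):

```python
def identical_collections(c1_links, c2_links, mapping, pages_identical):
    """Check the two conditions above for equisized collections C1 and C2.

    c1_links / c2_links: dict mapping each page to the set of pages it links to
    within its collection; mapping: one-to-one dict from C1 pages to C2 pages;
    pages_identical: predicate deciding whether two pages have identical content.
    """
    # Identical pages: every page p in C1 is identical to M(p).
    if not all(pages_identical(p, mapping[p]) for p in c1_links):
        return False
    # Identical link structure: every link a -> b in C1 appears as M(a) -> M(b) in C2.
    return all(mapping[b] in c2_links[mapping[a]]
               for a in c1_links for b in c1_links[a])
```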
12
Similarity of link structure:
Collection sizes; one-to-one mapping.
13
Break points; link similarity.
14
Definition (Similar Collections): Equisized collections C1 and C2 are similar if there is a one-to-one mapping M that maps all C1 pages to all C2 pages such that: Similar pages: for each page p ∈ C1, p is similar to M(p). Similar links: two corresponding pages should have at least one pair of parents in their respective collections that are also similar (corresponding) pages.
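A hedged sketch of this definition as worded on the slide: pages must be pairwise similar under M, and every page that has parents must have at least one parent whose image under M is a parent of the corresponding page. The helper names and the parent computation below are ours:

```python
def similar_collections(c1_links, c2_links, mapping, pages_similar):
    """Sketch of the 'similar collections' test as worded on this slide."""
    # Similar pages: every page p in C1 is similar to M(p).
    if not all(pages_similar(p, mapping[p]) for p in c1_links):
        return False
    # Similar links: for each page with parents in C1, at least one of those
    # parents must map to a parent of the corresponding page in C2.
    parents1 = {p: {a for a in c1_links if p in c1_links[a]} for p in c1_links}
    parents2 = {q: {a for a in c2_links if q in c2_links[a]} for q in c2_links}
    return all(not parents1[p] or
               any(mapping[a] in parents2[mapping[p]] for a in parents1[p])
               for p in c1_links)
```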
15
Similar collections
16
Similar Clusters: Computing. Example 1 of similar clusters.
Cluster size (cardinality): 2. Collection size: 5.
17
Computing: Example 2 of similar clusters. Cluster size (cardinality): 3. Collection size: 3.
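A tiny illustration of the terminology in these examples (the collection and page names are made up): a cluster is a set of similar collections, its cardinality is how many collections it holds, and collection size is the number of pages in each collection.

```python
# Terminology check for Example 2 (made-up names):
cluster = [
    {"a1", "a2", "a3"},   # collection A
    {"b1", "b2", "b3"},   # collection B, a replica of A
    {"c1", "c2", "c3"},   # collection C, another replica
]
print(len(cluster))      # cluster size (cardinality) -> 3
print(len(cluster[0]))   # collection size            -> 3
```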
18
Cluster Algorithm Example:
Step 1: Find all trivial clusters
19
Step 2: Merge trivial clusters when the merge leads to similar clusters
20
Step 3: Outcome
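A simplified Python sketch of the three steps above, not the paper's exact algorithm: trivial clusters are formed by grouping mutually similar pages, and clusters are then merged greedily whenever a stand-in predicate says the merge still yields a similar cluster. Both predicates are assumptions supplied by the caller.

```python
def cluster_algorithm(pages, pages_similar, can_merge):
    """Sketch of the slide's steps (a simplification, not the paper's exact algorithm).

    pages_similar(p, q): page-level similarity predicate (e.g. fingerprint overlap).
    can_merge(c1, c2): stand-in test that merging two clusters still gives
    a cluster of similar collections.
    """
    # Step 1: find all trivial clusters -- groups of mutually similar single pages.
    trivial = []
    for p in pages:
        for group in trivial:
            if all(pages_similar(p, q) for q in group):
                group.add(p)
                break
        else:
            trivial.append({p})

    # Step 2: merge trivial clusters whenever the merge still yields a similar cluster.
    clusters = [frozenset(g) for g in trivial]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if can_merge(clusters[i], clusters[j]):
                    union = clusters[i] | clusters[j]
                    clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
                    clusters.append(union)
                    merged = True
                    break
            if merged:
                break

    # Step 3: outcome -- the final set of clusters.
    return clusters
```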
21
Another example of the cluster algorithm, with two possible clusters.
22
Cont…
23
Quality of Clusters
24
Concept of fingerprints:
Entire-document fingerprint; four-line fingerprint; two-line fingerprint.
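A hedged sketch of these three granularities: fingerprint the whole document, or every four-line or two-line chunk, and compare chunk fingerprints across pages as a rough similarity signal. The MD5 hash and the overlap measure are our choices for illustration, not necessarily the paper's fingerprinting scheme.

```python
import hashlib

def fingerprints(text, lines_per_chunk=None):
    """Fingerprints at one of the three granularities on this slide.

    lines_per_chunk=None -> one fingerprint for the entire document;
    lines_per_chunk=4 or 2 -> one fingerprint per 4-line or 2-line chunk.
    """
    if lines_per_chunk is None:
        chunks = [text]
    else:
        lines = text.splitlines()
        chunks = ["\n".join(lines[i:i + lines_per_chunk])
                  for i in range(0, len(lines), lines_per_chunk)]
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def page_overlap(a, b, lines_per_chunk=4):
    """Fraction of shared chunk fingerprints -- a rough page-similarity signal."""
    fa, fb = fingerprints(a, lines_per_chunk), fingerprints(b, lines_per_chunk)
    return len(fa & fb) / max(len(fa | fb), 1)
```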
28
Exploiting Clusters: Improving Crawling
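One way a crawler can exploit the clusters, sketched under our own simplifying assumptions (prefix-based replica detection and made-up URLs): keep one representative collection per cluster and skip URLs that fall inside the other replicas.

```python
# Hedged sketch: once clusters of replicated collections are known, a crawler can
# pick one representative per cluster and skip URLs belonging to the other replicas.
clusters = [
    ["http://primary.example.org/docs/", "http://mirror-a.example.net/docs/"],
]

skip_prefixes = {replica
                 for cluster in clusters
                 for replica in cluster[1:]}   # keep cluster[0] as the representative

def should_crawl(url):
    """Skip pages that live inside a known replica of an already-kept collection."""
    return not any(url.startswith(prefix) for prefix in skip_prefixes)

print(should_crawl("http://mirror-a.example.net/docs/page1.html"))  # -> False
print(should_crawl("http://primary.example.org/docs/page1.html"))   # -> True
```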
31
Improving Search engine results:
A prototype interface reduces clutter in the result list. The prototype has links to 'Collections' and 'Replica'.
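A rough sketch of that clutter-reduction idea, with hypothetical helper names: group hits that belong to the same replica cluster, show one representative per cluster, and keep the remaining URLs to back a 'Replica'-style link.

```python
from collections import OrderedDict

def collapse_replicas(results, cluster_of):
    """Show one hit per replica cluster, with its replicas attached.

    results: ranked list of result URLs.
    cluster_of: maps a URL to its cluster id (None if the page is not replicated).
    """
    collapsed = OrderedDict()
    for url in results:
        key = cluster_of(url) or url          # unreplicated pages stay on their own
        if key in collapsed:
            collapsed[key]["replicas"].append(url)   # would back a 'Replica' link
        else:
            collapsed[key] = {"shown": url, "replicas": []}
    return list(collapsed.values())
```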
32
Conclusion
33
Discussion question: Should the cluster size be larger, or the collection size?
34
Discussion question: Suppose p is similar to pI, and pII is similar to pI, but p and pII are not similar. Do you think all three pages are similar?