Near-duplicates detection
Comparison of the two algorithms seen in class
Romain Colle
Description of algorithms
● 1st pass through the data: both algorithms compute a signature for each document and perform LSH on these signatures.
● 2nd pass through the data: verification that the duplicate pairs found are genuine, using exact Jaccard similarity.
● Algorithm SH uses shingles + MinHashing to compute the signatures.
● Algorithm SK uses sketches of projections onto random hyperplanes to compute the signatures (a Python sketch of both schemes follows this list).
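For concreteness, here is a minimal Python sketch of the two signature schemes, the LSH banding step, and the second-pass Jaccard check. All parameter values (SHINGLE_K, NUM_HASHES, NUM_PLANES, DIM, the banding shape) and the hashed term-frequency representation for the hyperplane projections are illustrative assumptions; the slides do not specify them.

```python
import hashlib
import random

SHINGLE_K = 3      # words per shingle (assumed; the slides give no value)
NUM_HASHES = 100   # MinHash signature length (assumed)
NUM_PLANES = 64    # bits in the hyperplane sketch (assumed)
DIM = 512          # hashed term-frequency dimension (assumed)

def stable_hash(s):
    # Deterministic string hash; Python's built-in hash() is salted per run.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def shingles(text, k=SHINGLE_K):
    # Set of k-word shingles; very short documents fall back to one shingle.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(doc, seeds):
    # Algorithm SH: one min-hash per seed over the document's shingle set.
    shs = shingles(doc)
    return [min(stable_hash(f"{seed}:{sh}") for sh in shs) for seed in seeds]

def hyperplane_sketch(doc, planes):
    # Algorithm SK: one bit per hyperplane, the sign of the projection of a
    # hashed term-frequency vector of the document.
    vec = [0.0] * DIM
    for w in doc.split():
        vec[stable_hash(w) % DIM] += 1.0
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]

def lsh_buckets(signature, bands=20, rows=5):
    # LSH banding: documents sharing any band bucket become candidate pairs.
    # (bands * rows must equal the signature length; values are assumed.)
    return [hash(tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)]

def jaccard(a, b):
    # Second pass: verify a candidate pair by exact Jaccard similarity.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Illustrative usage on two near-duplicate documents.
rng = random.Random(0)
planes = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_PLANES)]
seeds = range(NUM_HASHES)

a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy cat near the river bank today"
print(lsh_buckets(minhash_signature(a, seeds))[:3])  # SH candidate buckets
print(hyperplane_sketch(a, planes)[:8])              # first SK sketch bits
print(round(jaccard(a, b), 3))                       # exact verification
```

In both schemes, documents are compared only if LSH places them in a common bucket; the exact Jaccard computation is reserved for that much smaller candidate set.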
Experimentation method
● Run both algorithms on the data set (WebBase) and compute precision.
● Remove the duplicate pairs found from the data set.
● Generate and insert a large number of (near-)duplicate documents (~10% of the data set).
● Run both algorithms on the new data set and compute precision and recall; recall is measurable only here, because the planted pairs provide ground truth (an evaluation sketch follows this list).
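The evaluation step is straightforward once the planted pairs are recorded. Here is a minimal sketch, assuming hypothetical `reported_pairs`/`planted_pairs` bookkeeping with string document IDs; the slides do not show the evaluation code. On the original data set, where no planted ground truth exists, precision would presumably be judged by the second-pass Jaccard check instead.

```python
def precision_recall(reported_pairs, planted_pairs):
    # reported_pairs: pairs an algorithm returned after verification.
    # planted_pairs: ground-truth pairs created by the insertion step.
    # frozenset makes each pair order-independent before comparison.
    reported = {frozenset(p) for p in reported_pairs}
    planted = {frozenset(p) for p in planted_pairs}
    true_positives = reported & planted
    precision = len(true_positives) / len(reported) if reported else 0.0
    recall = len(true_positives) / len(planted) if planted else 0.0
    return precision, recall

# Example: two of three reported pairs were actually planted.
reported = [("d1", "d1_copy"), ("d2", "d2_copy"), ("d3", "d9")]
planted = [("d1", "d1_copy"), ("d2", "d2_copy"), ("d4", "d4_copy")]
print(precision_recall(reported, planted))  # (0.666..., 0.666...)
```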
Results (original data set)
Results (modified data set)
Conclusion
● Algorithm SK rocks!
● However, it is computationally more expensive.
● There is a tradeoff between speed and recall/precision (given that algorithm SH already performs quite well).