Near-duplicates detection
Comparison of the two algorithms seen in class
Romain Colle
Description of algorithms
● 1st pass through the data: both algorithms compute a signature for each document and perform LSH on these signatures.
● 2nd pass through the data: verification that the duplicate pairs found are genuine, using exact Jaccard similarity.
● Algorithm SH uses shingles + MinHashing to compute the signatures.
● Algorithm SK uses sketches of projections onto random hyperplanes to compute the signatures (a Python sketch of both schemes follows this list).
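For concreteness, here is a minimal Python sketch of the two signature schemes, the LSH banding step, and the second-pass Jaccard check. All parameter values (SHINGLE_K, NUM_HASHES, NUM_PLANES, DIM, the banding shape) and the hashed term-frequency representation for the hyperplane projections are illustrative assumptions; the slides do not specify them.

```python
import hashlib
import random

SHINGLE_K = 3      # words per shingle (assumed; the slides give no value)
NUM_HASHES = 100   # MinHash signature length (assumed)
NUM_PLANES = 64    # bits in the hyperplane sketch (assumed)
DIM = 512          # hashed term-frequency dimension (assumed)

def stable_hash(s):
    # Deterministic string hash; Python's built-in hash() is salted per run.
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def shingles(text, k=SHINGLE_K):
    # Set of k-word shingles; very short documents fall back to one shingle.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def minhash_signature(doc, seeds):
    # Algorithm SH: one min-hash per seed over the document's shingle set.
    shs = shingles(doc)
    return [min(stable_hash(f"{seed}:{sh}") for sh in shs) for seed in seeds]

def hyperplane_sketch(doc, planes):
    # Algorithm SK: one bit per hyperplane, the sign of the projection of a
    # hashed term-frequency vector of the document.
    vec = [0.0] * DIM
    for w in doc.split():
        vec[stable_hash(w) % DIM] += 1.0
    return [1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
            for plane in planes]

def lsh_buckets(signature, bands=20, rows=5):
    # LSH banding: documents sharing any band bucket become candidate pairs.
    # (bands * rows must equal the signature length; values are assumed.)
    return [hash(tuple(signature[i * rows:(i + 1) * rows])) for i in range(bands)]

def jaccard(a, b):
    # Second pass: verify a candidate pair by exact Jaccard similarity.
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

# Illustrative usage on two near-duplicate documents.
rng = random.Random(0)
planes = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_PLANES)]
seeds = range(NUM_HASHES)

a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy cat near the river bank today"
print(lsh_buckets(minhash_signature(a, seeds))[:3])  # SH candidate buckets
print(hyperplane_sketch(a, planes)[:8])              # first SK sketch bits
print(round(jaccard(a, b), 3))                       # exact verification
```

In both schemes, documents are compared only if LSH places them in a common bucket; the exact Jaccard computation is reserved for that much smaller candidate set.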
Experimentation method
● Run both algorithms on the data set (WebBase) and compute precision.
● Remove the duplicate pairs found from the data set.
● Generate and insert a large number of (near-)duplicate documents (~10% of the data set).
● Run both algorithms on the new data set and compute precision and recall; recall is measurable only here, because the planted pairs provide ground truth (an evaluation sketch follows this list).
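The evaluation step is straightforward once the planted pairs are recorded. Here is a minimal sketch, assuming hypothetical `reported_pairs`/`planted_pairs` bookkeeping with string document IDs; the slides do not show the evaluation code. On the original data set, where no planted ground truth exists, precision would presumably be judged by the second-pass Jaccard check instead.

```python
def precision_recall(reported_pairs, planted_pairs):
    # reported_pairs: pairs an algorithm returned after verification.
    # planted_pairs: ground-truth pairs created by the insertion step.
    # frozenset makes each pair order-independent before comparison.
    reported = {frozenset(p) for p in reported_pairs}
    planted = {frozenset(p) for p in planted_pairs}
    true_positives = reported & planted
    precision = len(true_positives) / len(reported) if reported else 0.0
    recall = len(true_positives) / len(planted) if planted else 0.0
    return precision, recall

# Example: two of three reported pairs were actually planted.
reported = [("d1", "d1_copy"), ("d2", "d2_copy"), ("d3", "d9")]
planted = [("d1", "d1_copy"), ("d2", "d2_copy"), ("d4", "d4_copy")]
print(precision_recall(reported, planted))  # (0.666..., 0.666...)
```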
Results (original data set)
Results (modified data set)
Conclusion
● Algorithm SK rocks!
● However, it is computationally more expensive.
● There is a tradeoff between speed and recall/precision (given that algorithm SH already performs quite well).