
Applying Syntactic Similarity Algorithms for Enterprise Information Management
Lucy Cherkasova, Kave Eshghi, Brad Morrey, Joseph Tucek, Alistair Veitch
Hewlett-Packard Labs
© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

2. New Applications in the Enterprise
Document deletion and compliance rules
−How do you identify all the users who might have a copy of these files?
E-Discovery
−Identify and retrieve a complete set of related documents (all earlier or later versions of the same document).
−Simplify the review process: within the set of semantically similar documents (returned to the expert), identify clusters of syntactically similar documents.
Keep document repositories up to date
−Identify and filter out documents that are largely duplicates of newer versions, in order to improve the quality of the collection.

3. Syntactic Similarity
Syntactic similarity is useful for identifying documents with a large textual intersection.
Syntactic similarity algorithms are entirely defined by the syntactic (text) properties of the document.
Shingling technique (Broder et al.)
−Goal: identify near-duplicates on the web.
−Document A is represented by its set of shingles (sequences of adjacent words).

4. Shingling Technique
S(A) = {w_1, w_2, …, w_j, …, w_N} is the set of all shingles in document A.
Parameter: the shingle size (a moving window over the document).
Traditionally, the shingle size is defined as a number of words. In our work, we define the shingle size (moving window) in bytes.
[Figure: a window w_j of |w_j| = 6 bytes sliding over document A.]
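The byte-window shingling described on this slide can be sketched as follows (the function name is ours and the implementation is illustrative; the 20-byte default follows the window size recommended later in the talk):

```python
def shingles(text: bytes, window: int = 20) -> set[bytes]:
    """Return the set S(A) of all byte-level shingles of a document.

    The shingle size is defined in bytes, as in the slides; a window of
    20 bytes corresponds to roughly 4 English words.
    """
    if len(text) <= window:
        return {text}
    # slide the window one byte at a time over the document
    return {text[i:i + window] for i in range(len(text) - window + 1)}
```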

5. Basic Metrics
Similarity metric (documents A and B are ~similar):
 sim(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|
Containment metric (document A is ~contained in B):
 cont(A, B) = |S(A) ∩ S(B)| / |S(A)|
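Over shingle sets, both metrics reduce to simple set arithmetic (these are the standard resemblance and containment measures; the function names are ours):

```python
def similarity(sa: set, sb: set) -> float:
    # resemblance: fraction of the combined shingles shared by A and B
    return len(sa & sb) / len(sa | sb)

def containment(sa: set, sb: set) -> float:
    # fraction of A's shingles that also appear in B
    return len(sa & sb) / len(sa)
```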

6. Shingling-Based Approach
Instead of comparing shingles (sequences of words), it is more convenient to deal with fingerprints (hashes) of shingles.
64-bit Rabin fingerprints are used due to their fast software implementation.
To further simplify the computation of the similarity metric, one can sample the document shingles to build a more compact document signature
−e.g., instead of 1000 shingles, take a sample of 100 shingles.
Different ways of sampling the shingles lead to different syntactic similarity algorithms.

7. Four Algorithms
We compare the performance and properties of four syntactic similarity algorithms:
−three shingling-based algorithms (Min_n, Mod_n, Sketch_n)
−a chunking-based algorithm (BSW_n)
The three shingling-based algorithms (Min_n, Mod_n, Sketch_n) differ in how they sample the set of document shingles and build the document signature.

8. Min_n Algorithm
Let S(A) = {f(w_1), f(w_2), …, f(w_N)} be all fingerprinted shingles of document A.
Min_n selects the n numerically smallest fingerprinted shingles.
Documents are represented by fixed-size signatures.
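A minimal sketch of Min_n over a set of fingerprinted shingles, together with the usual way of estimating similarity from two such signatures (the estimator and the names are our illustration, not code from the paper):

```python
def min_n(fingerprints: set[int], n: int) -> set[int]:
    # fixed-size signature: the n numerically smallest fingerprints
    return set(sorted(fingerprints)[:n])

def min_n_similarity(sig_a: set[int], sig_b: set[int], n: int) -> float:
    # estimate |S(A) ∩ S(B)| / |S(A) ∪ S(B)| from the two signatures:
    # the n smallest elements of the union are a uniform sample of it
    x = set(sorted(sig_a | sig_b)[:n])
    return len(x & sig_a & sig_b) / len(x)
```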

9. Mod_n Algorithm
Let S(A) = {f(w_1), f(w_2), …, f(w_N)} be all fingerprinted shingles of A.
Mod_n selects all fingerprints whose value modulo n is zero.
−Example: if n = 100 and A = 1000 bytes, then Mod_100(A) is represented by approximately 10 fingerprints.
Documents are represented by variable-size signatures (proportional to the document size).
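The Mod_n sampling rule is a one-liner over the fingerprint set (an illustrative sketch; the function name is ours):

```python
def mod_n(fingerprints: set[int], n: int) -> set[int]:
    # variable-size signature: keep every fingerprint divisible by n,
    # an ~1/n sample whose size grows with the document
    return {f for f in fingerprints if f % n == 0}
```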

10. Sketch_n Algorithm
Each shingle is fingerprinted with a family of n independent hash functions f_1, …, f_n.
For each f_i, the fingerprint with the smallest value is retained in the sketch.
Documents are represented by fixed-size signatures: {min f_1(A), min f_2(A), …, min f_n(A)}.
This algorithm has an elegant theoretical justification: the percentage of common entries in the sketches of A and B accurately approximates the percentage of common shingles in A and B.
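A sketch of Sketch_n, using a seeded hash as a stand-in for the family of independent hash functions (the paper fingerprints shingles but does not prescribe blake2b; all names here are ours):

```python
import hashlib

def _f(shingle: bytes, i: int) -> int:
    # member i of a family of independent hash functions (illustrative)
    h = hashlib.blake2b(shingle, digest_size=8, salt=i.to_bytes(16, "big"))
    return int.from_bytes(h.digest(), "big")

def sketch(shingle_set: set[bytes], n: int = 64) -> list[int]:
    # fixed-size signature: entry i is min over all shingles of f_i(shingle)
    return [min(_f(s, i) for s in shingle_set) for i in range(n)]

def sketch_similarity(sk_a: list[int], sk_b: list[int]) -> float:
    # the fraction of matching entries estimates the shingle-set similarity
    return sum(a == b for a, b in zip(sk_a, sk_b)) / len(sk_a)
```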

11. BSW_n (Basic Sliding Window) Algorithm
The document is partitioned into chunks: a chunk boundary is declared at shingle w_k when f(w_k) mod n = 0.
Each chunk is represented by the smallest fingerprint within the chunk: min {f(w_1), f(w_2), …, f(w_k)}.
Documents are represented by variable-size signatures (the signature is proportional to the document size).
−Example: if n = 100 and A = 1000 bytes, then BSW_100(A) is represented by approximately 10 fingerprints.
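The chunk-boundary rule and per-chunk minimum can be sketched as below (a hash stands in for the 64-bit Rabin fingerprints used in the paper; the function names are ours):

```python
import hashlib

def fp(window: bytes) -> int:
    # stand-in for the 64-bit Rabin fingerprint used in the paper
    return int.from_bytes(hashlib.blake2b(window, digest_size=8).digest(), "big")

def bsw(text: bytes, window: int = 20, n: int = 100) -> list[int]:
    """Basic Sliding Window chunking: declare a chunk boundary whenever
    f(w_k) mod n == 0, and represent each chunk by its smallest fingerprint."""
    signature, chunk_min = [], None
    for i in range(max(len(text) - window + 1, 1)):
        f = fp(text[i:i + window])
        chunk_min = f if chunk_min is None else min(chunk_min, f)
        if f % n == 0:                 # chunk boundary condition
            signature.append(chunk_min)
            chunk_min = None
    if chunk_min is not None:          # final chunk without a boundary
        signature.append(chunk_min)
    return signature
```

Because boundaries depend only on local content, an edit in one part of a document leaves the chunks (and signature entries) of unedited regions unchanged, which is what makes chunking attractive for deduplication.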

12 Algorithm’s Properties and Parameters Algorithm’s properties: Algorithm’s parameters: −Sliding window size −Sampling frequency −Published papers use very different values Questions: −Sensitivity of the similarity metric to different values of algorithm’s parameters −Comparison of the four algorithms

13. Objective and Fair Comparison
How can we objectively compare the algorithms?
−While one document collection might favor a particular algorithm, another collection might show better results for a different algorithm.
−Can we design a framework for a fair comparison?
−Can the same framework be used for sensitivity analysis of the parameters?

14. Methodology
Apply a controlled set of modifications to a given document set:
−add/remove words in the documents a predefined number of times.

15. Methodology
Research corpus RC_orig: 100 different HP Labs technical reports from 2007, converted to a text format.
Introduce modifications to the documents in a controlled way:
−add/remove words to/from the document a predefined number of times
−modifications can be done randomly or spread uniformly through the document
RC_i^a = {RC_orig, where the word "a" is inserted into each document i times}
New average similarity metric: the similarity between each original document and its modified version, averaged over the 100 documents in the corpus.
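The controlled-modification step can be sketched as follows (the helper name and the exact uniform-spacing rule are our illustration of what the slides describe):

```python
def insert_word_uniformly(text: str, word: str, times: int) -> str:
    # spread `times` insertions of `word` evenly through the document
    tokens = text.split()
    step = max(1, (len(tokens) + 1) // (times + 1))
    for k in range(times):
        tokens.insert(min((k + 1) * step + k, len(tokens)), word)
    return " ".join(tokens)
```

Comparing each document against its modified copy, and averaging the similarity metric over the corpus, yields one point of the sensitivity curves shown on the following slides.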

16. Sensitivity to Sliding Window Size
Window = 20 bytes is a good choice (~4 words).
A larger window size significantly decreases the similarity metric.

17. Frequency Sampling
There is a large variance in similarity metric values across documents under lower sampling frequencies.
The frequency sampling parameter depends on the document length distribution and should be tuned accordingly.
There is a trade-off between accuracy and storage requirements.
(Results shown for RC_50^a.)

18. Comparison of Similarity Algorithms
Sketch_n and BSW_n are more sensitive to the number of changes in the documents (especially short ones) than Mod_n and Min_n.

19. Case Study Using Enterprise Collections
Two enterprise collections:
−Collection_1 with 5040 documents
−Collection_2 with 2500 documents

20. Results
The algorithms Mod_n and Min_n identified a higher number of similar documents (with Mod_n being the leader). However, Mod_n has a higher number of false positives.
For longer documents, the difference between the algorithms is smaller. Moreover, for long documents (>100 KB), BSW_n and related chunking-based algorithms might be a better choice (accuracy- and storage-wise).

21. Runtime Comparison
Executing Sketch_n is more expensive, especially for larger window sizes.

22. Conclusion
Syntactic similarity is useful for identifying documents with a large textual intersection.
We designed a useful framework for a fair algorithm comparison:
−compared the performance of four syntactic similarity algorithms, and
−identified a useful range of their parameters.
Future work: modify, refine, and optimize the BSW algorithm:
−chunking-based algorithms are actively used for deduplication in enterprise backup and storage solutions.

23. Sensitivity to Sliding Window Size
Potentially, the Mod_n algorithm might have a higher rate of false positives.