L. Padmasree, Vamshi Ambati, J. Anand Chandulal, M. Sreenivasa Rao. Signature Based Duplicate Detection in Digital Libraries.

L. Padmasree, Vamshi Ambati, J. Anand Chandulal, M. Sreenivasa Rao
Signature Based Duplicate Detection in Digital Libraries
School of Information Technology, JNT University, Hyderabad, India

Motivation
- Books scanned for digital libraries are procured from varied sources.
- Scanning centers are distributed across the country.
- Duplicates can arise between scanning points.
- Duplicate detection is required before scanning.

Challenges
- Duplicate detection relies on metadata (title, author, publishing year, edition, etc.).
- Metadata is entered by different operators, so it can be incorrect or incomplete.
- Errors could be:
  - Typographical mistakes
  - Word disorder
  - Inconsistent abbreviations
  - Missing words
- These errors make duplicate detection more difficult.
- Duplicate detection must be accurate and have a quick turnaround time.

RELATED WORK
- Most traditional methods based on string similarity are either character-based or vector-space-based techniques.
- Character-based techniques rely on character edit operations, such as deletions, insertions, and substitutions, and on subsequence comparison.
- Vector-space-based techniques transform strings into vector representations on which similarity computations are conducted.
- The present work uses an efficient and fast duplicate detection technique based on similarity search.

Our Approach
- Uses the signature file method
- Uses similarity search techniques to find duplicates by close-proximity match
- Language independent
- Fast and accurate
- Uses an online tool for customization

The Process
- Metadata is created at the scanning centers.
- A signature is computed for the metadata, using the superimposed coding technique and the hashing method.
- The signature is stored in the central repository.
- Before a book is scanned, its metadata is submitted as a query, and the query signature is computed with the same technique.
- Similarity search returns the closest-matching duplicate.

Duplicate Detection in Digital Library System
[Diagram: Duplicate Detection Technique. Labels: Scanning Centre-I, Scanning Centre-II, Central Database, Metadata, Signature, Query Metadata, Signature, Y/N]

Example of the process

Central Repository (metadata of books, each stored with its signature):
- The Meaning And Teaching Of Music - Will Earhart
- Some Famous Singers Of The 19th Century - Francis Rogers
- A Dictionary of Musical Terms - Dr.th.baker
- The Arts of Japan - Edward Dillon

Example query: The Arts of Japan - Edward Dillon
- Query with spelling mistakes: The Ars of Japa Edward Dilon
- Query with missing words: The of Japan - Edward Dillon
- Query with jumbled words: Dillon Edward -The Japan of Arts

Result: The Arts of Japan - Edward Dillon

Superimposed Coding Technique
- In the superimposed coding technique, each record is mapped to an individual binary signature.
- A record is either the title of the book, the author name, or the combination of both.
- The signatures of the records in the training data and the testing data are encoded binary representations.
- The signature of the title or author name of a book is obtained by superimposing the signatures of its words with the OR operation.
[Example table: word signatures for "Computer" and "Programming", superimposed into the signature of the book]
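The OR step can be illustrated with a minimal Python sketch. The two 12-bit word signatures below are hypothetical values chosen by hand for illustration; they are not the values from the original slide, and the function name `superimpose` is likewise an assumption.

```python
from functools import reduce
from operator import or_


def superimpose(word_signatures):
    """OR together the binary signatures of the words in one record."""
    return reduce(or_, word_signatures, 0)


# Hypothetical 12-bit word signatures, chosen by hand for illustration only.
computer = 0b001000100100
programming = 0b100000100001

book_signature = superimpose([computer, programming])
print(format(book_signature, "012b"))  # prints 101000100101
```

Because OR is commutative, the record signature does not depend on word order, which is consistent with the jumbled-word queries in the example slide still matching.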

The Hashing Method
- The signature of each word is obtained by a hashing method.
- The hashing function H(w) maps the word w to one of the patterns, selected by computing a hash value of the word.
- The hash function uses a shift-and-add strategy: the ASCII values of the characters in the word are added and shifted to compute the hash value.
- The final hash value is obtained by a mod operation with nCr.
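One possible reading of this slide is sketched below in Python. Several details are assumptions, since the slide does not give them: the shift amount (5 bits), the 32-bit mask, the interpretation of "one of the patterns" as an n-bit string with exactly r bits set (there are nCr of those), and the way a hash value is turned into such a pattern. All function names are hypothetical.

```python
from math import comb


def shift_add_hash(word, shift=5):
    """Shift-and-add hash over character codes (shift amount is an assumption)."""
    h = 0
    for ch in word:
        h = ((h << shift) + ord(ch)) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def pattern_from_index(index, n, r):
    """Return the n-bit pattern with exactly r bits set that has the given rank.

    Uses the combinatorial number system, so every index in
    range(comb(n, r)) yields a different pattern.
    """
    signature = 0
    remaining = r
    for pos in range(n - 1, -1, -1):
        if remaining == 0:
            break
        count = comb(pos, remaining)
        if index >= count:
            signature |= 1 << pos
            index -= count
            remaining -= 1
    return signature


def word_signature(word, n=16, r=3):
    """Hash the word, reduce it mod nCr, and pick the matching bit pattern."""
    index = shift_add_hash(word) % comb(n, r)
    return pattern_from_index(index, n, r)


print(format(word_signature("Japan"), "016b"))
```

With this reading, every word signature has the same weight r, so n and r together control how densely record signatures fill up after superimposing.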

Duplicate Detection in Digital Library System
The Similarity Match Algorithm for Library Database
Input: library database L consisting of documents D1, D2, ..., Dm; query Q.
Output: book B corresponding to query Q.
Procedure Library (D1, D2, ..., Dm, Q : in; B : out)
1. for i = 1 to m do
2.     Si = superimposed-coding (Di)
3. end do
4. X = superimposed-coding (Q)
5. O = Jaccard (S1, S2, ..., Sm, X)
6. Look up in library database L the book B (document) whose signature has the minimum Jaccard distance to X.
7. End
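The procedure can be sketched end to end in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `word_signature` is a simple stand-in that sets one bit per word using CRC32, the signature length of 64 bits is arbitrary, and `jaccard_distance` uses the standard set-based form (shared 1 bits over the union of 1 bits). All names are hypothetical.

```python
import re
from zlib import crc32

N_BITS = 64  # assumed signature length


def word_signature(word):
    """Stand-in word hash (an assumption): set one bit chosen by CRC32."""
    return 1 << (crc32(word.encode("utf-8")) % N_BITS)


def superimposed_coding(text):
    """OR the word signatures of all words in the record."""
    signature = 0
    for word in re.findall(r"\w+", text.lower()):
        signature |= word_signature(word)
    return signature


def jaccard_distance(a, b):
    """Standard Jaccard distance between two bit signatures."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return 1.0 - bin(a & b).count("1") / union


def find_closest(library, query):
    signatures = [superimposed_coding(doc) for doc in library]  # steps 1-3
    x = superimposed_coding(query)                              # step 4
    distances = [jaccard_distance(s, x) for s in signatures]    # step 5
    best = min(range(len(library)), key=distances.__getitem__)  # step 6
    return library[best], distances[best]


library = [
    "The Meaning And Teaching Of Music - Will Earhart",
    "Some Famous Singers Of The 19th Century - Francis Rogers",
    "A Dictionary of Musical Terms - Dr.th.baker",
    "The Arts of Japan - Edward Dillon",
]
book, distance = find_closest(library, "Dillon Edward -The Japan of Arts")
print(book, distance)
```

In practice the library signatures would be computed once and stored in the central repository, as the process slide describes, instead of being recomputed for every query.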

Jaccard Distance
The Jaccard distance between the query signature and the target signature is obtained from the expression

d = (r + s) / (q + r + s + t)

where
- q is the number of bits that equal 1 in both the target and the query signature.
- r is the number of bits that equal 1 in the target signature but 0 in the query signature.
- s is the number of bits that equal 0 in the target signature but 1 in the query signature.
- t is the number of bits that equal 0 in both the target and the query signature.
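A small worked example in Python may help make the four counts concrete. The signatures are stored as integers of an assumed length n; the function names are hypothetical. Note that the expression on the slide includes t in the denominator, whereas the standard Jaccard distance omits t; the sketch computes both so the difference is visible.

```python
def bit_counts(target, query, n):
    """Return (q, r, s, t) for two n-bit signatures stored as integers."""
    mask = (1 << n) - 1
    q = bin(target & query).count("1")          # 1 in both
    r = bin(target & ~query & mask).count("1")  # 1 in target only
    s = bin(~target & query & mask).count("1")  # 1 in query only
    t = n - q - r - s                           # 0 in both
    return q, r, s, t


def slide_distance(target, query, n):
    """Distance as written on the slide: (r + s) / (q + r + s + t)."""
    q, r, s, t = bit_counts(target, query, n)
    return (r + s) / (q + r + s + t)


def standard_jaccard_distance(target, query, n):
    """Standard Jaccard distance: (r + s) / (q + r + s)."""
    q, r, s, _ = bit_counts(target, query, n)
    return (r + s) / (q + r + s) if (q + r + s) else 0.0


print(bit_counts(0b1100, 0b1010, 4))      # prints (1, 1, 1, 1)
print(slide_distance(0b1100, 0b1010, 4))  # prints 0.5
```

Since q + r + s + t always equals n, the slide's expression is the fraction of bit positions in which the two signatures differ.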

False Drops
- False drops are minimized by an appropriate choice of the two parameters n and r (the parameters of the nCr term used in the hashing method).
- Online tool

EXPERIMENTAL RESULTS
[Table: for each metadata field, false drop (%) and DR (%) under three query types: spelling mistakes, missing words, jumbled words]
DR: Detection Rate

Scalability and accuracy of duplicate detection system

CONCLUSION
- An effective and efficient duplicate detection technique is proposed.
- Duplicates are detected by similarity search using the signature file method, even when the metadata contains typographical mistakes, word disorder, inconsistent abbreviations, or missing words.
- The technique is language independent and achieves high performance, with 95% accuracy.

Questions?