Published by Vernon Burns. Modified over 9 years ago.
1
Signature Based Duplicate Detection in Digital Libraries
L. Padmasree, Vamshi Ambati, J. Anand Chandulal, M. Sreenivasa Rao
School of Information Technology, JNT University, Hyderabad, 500 072, India. srmeda@gmail.com
2
Motivation
Books scanned in Digital Libraries are procured from varied sources.
Scanning centers are distributed across the country.
Duplicates could arise between scanning points.
Pre-scanning duplicate detection is required.
3
Challenges
Duplicate detection relies on metadata (title, author, publishing year, edition, etc.).
Metadata is entered by varied operators, so there is scope for:
– Incorrectness
– Incompleteness
Errors could be:
– Typographical mistakes
– Word disorder
– Inconsistent abbreviations
– Missing words
These errors make duplicate detection more difficult.
Duplicate detection must have a quick turnaround time and high accuracy.
4
RELATED WORK
Most traditional methods based on string similarity are either character-based or vector space based.
Character-based techniques
– rely on character edit operations, such as deletions, insertions, and substitutions, and on subsequence comparison.
Vector space based techniques
– transform strings into vector representations on which similarity computations are conducted.
In the present work we use an efficient and fast duplicate detection technique based on similarity search.
5
Our Approach
Uses the signature file method
Uses similarity search techniques to find duplicates by close proximity match
Language independent
Fast and accurate
Uses an online tool for customization
6
The Process
Metadata is created at scanning centers.
A signature is computed for the metadata.
– Uses the superimposed coding technique and a hashing method
The signature is stored in a central repository.
Before a book is scanned, its metadata is submitted as a query.
– The same technique is used to compute the query signature
Similarity search returns the duplicate with the closest proximity match.
7
Duplicate Detection in Digital Library system
[Figure: Scanning Centre-I and Scanning Centre-II; Central Database holding Metadata and Signature (e.g. 10001011); Query Metadata and Signature passed to the Duplicate Detection Technique; output Y/N]
8
Example of the process
Central Repository (Books Data): Metadata of Books | Signatures
The Meaning And Teaching Of Music - Will Earhart | 011111110000101111100011111011
Some Famous Singers Of The 19th Century - Francis Rogers | 111001010000001001111110110110
A Dictionary of Musical Terms - Dr. Th. Baker | 111100101000110100000111111111
The Arts of Japan - Edward Dillon | 111101100000000000000011001111
Example Query: The Arts of Japan - Edward Dillon
Query - Spell Mistakes: The Ars of Japa Edward Dilon | 111101100001110000000011001111
Query - Missing Words: The of Japan - Edward Dillon | 111101100000011000000011001111
Query - Jumbled Words: Dillon Edward -The Japan of Arts | 111101100000100000000011001111
Result: The Arts of Japan - Edward Dillon
9
Superimposed Coding Technique
In the superimposed coding technique, each record is mapped into an individual binary signature.
A record is the title of the book, the author name, or the combination of the two.
Signatures of the records in the training data and testing data are encoded binary representations.
The signature of the 'title or author name' of the book is obtained by superimposing the signatures of its words with the OR operation.
Computer: 110000011000
Programming: 010101000100
Signature of the book (bitwise OR of the two): 110101011100
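The superimposing step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `superimpose` is hypothetical, and the two 12-bit word signatures are the ones shown on the slide for "Computer" and "Programming".

```python
def superimpose(word_signatures):
    """OR together equal-length bit strings and return the combined bit string."""
    width = len(word_signatures[0])
    combined = 0
    for sig in word_signatures:
        if len(sig) != width:
            raise ValueError("all word signatures must have the same length")
        combined |= int(sig, 2)  # superimpose this word onto the record signature
    # Pad with leading zeros so the result keeps the original signature width.
    return format(combined, f"0{width}b")


# Word signatures taken from the slide.
computer = "110000011000"
programming = "010101000100"
book_signature = superimpose([computer, programming])
print(book_signature)
```

Because OR is commutative, the record signature does not depend on word order, which is why jumbled-word queries can still match.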
10
The Hashing Method
The signature of each word is obtained by a hashing method.
The hashing function H(w) maps the word w into one of the patterns generated by computing a hash value of the word.
The hash function uses a shift-and-add strategy: the ASCII values of the characters in the word are added and shifted by H(w) in order to compute the hash value.
The final hash value is obtained by a mod operation with nCr.
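One way to read this slide is sketched below in Python. Several details are assumptions, since the slide does not give them: the shift width of 5, the 32-bit mask, the reading of a "pattern" as an n-bit string with exactly r ones (so that there are nCr patterns), and the default values n=12 and r=4. The function names are hypothetical.

```python
from itertools import combinations, islice
from math import comb


def shift_add_hash(word, shift=5):
    """Shift-and-add hash over character codes (shift width is an assumption)."""
    h = 0
    for ch in word:
        # Shift the running value left, add the character code, keep 32 bits.
        h = ((h << shift) + ord(ch)) & 0xFFFFFFFF
    return h


def word_signature(word, n=12, r=4):
    """Map a word to one of the nCr bit patterns of length n with r ones."""
    index = shift_add_hash(word) % comb(n, r)  # final value: mod nCr
    # Pick the index-th combination of r bit positions out of n.
    positions = next(islice(combinations(range(n), r), index, None))
    bits = ["0"] * n
    for p in positions:
        bits[p] = "1"
    return "".join(bits)


print(word_signature("Japan"))
```

Enumerating combinations with `islice` is only practical for small n and r; a production version would compute the index-th combination directly.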
11
Duplicate Detection in Digital Library System
The Similarity Match Algorithm for Library Database
Input: Library database L consisting of documents D1, D2, ..., Dm; query Q.
Output: Book B corresponding to query Q.
Procedure Library (D1, D2, ..., Dm, Q : in; B : out)
1. for i = 1 to m do
2.   Si = superimposed-coding (Di)
3. end do
4. X = superimposed-coding (Q)
5. O = Jaccard (S1, S2, ..., Sm, X)
6. Look up in library database L the book B (document) whose signature has the minimum Jaccard distance to X.
7. End
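The procedure above could look roughly like this in Python. It is a sketch under stated assumptions, not the authors' code: the signature width `N_BITS = 30`, the `ONES = 4` bits per word, the hash details, the lowercase word tokenization, and all function names are hypothetical. The distance follows the expression given on the Jaccard Distance slide, (r + s) / (q + r + s + t), which for fixed-width signatures equals the number of differing bits divided by the width.

```python
import re
from itertools import combinations, islice
from math import comb

N_BITS, ONES = 30, 4  # assumed signature width and ones per word


def word_mask(word):
    """Hash a word to an integer bitmask with ONES bits set (assumed scheme)."""
    h = 0
    for ch in word:
        h = ((h << 5) + ord(ch)) & 0xFFFFFFFF  # shift-and-add
    index = h % comb(N_BITS, ONES)
    positions = next(islice(combinations(range(N_BITS), ONES), index, None))
    return sum(1 << p for p in positions)


def superimposed_coding(text):
    """OR the word masks of all words in the metadata string."""
    signature = 0
    for word in re.findall(r"\w+", text.lower()):
        signature |= word_mask(word)
    return signature


def distance(a, b):
    """(r + s) / (q + r + s + t): differing bits over all bits."""
    return bin(a ^ b).count("1") / N_BITS


def library(documents, query):
    signatures = [superimposed_coding(d) for d in documents]      # steps 1-3
    x = superimposed_coding(query)                                # step 4
    distances = [distance(s, x) for s in signatures]              # step 5
    best = min(range(len(documents)), key=distances.__getitem__)  # step 6
    return documents[best], distances[best]


books = [
    "The Meaning And Teaching Of Music - Will Earhart",
    "Some Famous Singers Of The 19th Century - Francis Rogers",
    "A Dictionary of Musical Terms - Dr. Th. Baker",
    "The Arts of Japan - Edward Dillon",
]
print(library(books, "The Ars of Japa Edward Dilon"))
```

In a deployed system the stored signatures would be computed once and kept in the central repository instead of being recomputed for every query.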
12
Jaccard Distance
The Jaccard distance between the query signature and the target signature is obtained using the expression
d = (r + s) / (q + r + s + t)
q - The number of bits that equal 1 in both the target and query signatures.
r - The number of bits that equal 1 in the target signature but 0 in the query signature.
s - The number of bits that equal 0 in the target signature but 1 in the query signature.
t - The number of bits that equal 0 in both the target and query signatures.
Note that with t in the denominator this expression is the simple matching distance; the conventional Jaccard distance omits t, giving d = (r + s) / (q + r + s).
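The four counts and the distance can be computed directly from two bit strings, as in the following Python sketch. The function names and the `include_t` flag are illustrative assumptions; with `include_t=True` the code follows the slide's expression, and with `include_t=False` it drops t from the denominator, which is the conventional Jaccard distance.

```python
def bit_counts(target, query):
    """Return (q, r, s, t) for two equal-length bit strings."""
    if len(target) != len(query):
        raise ValueError("signatures must have the same length")
    q = r = s = t = 0
    for tb, qb in zip(target, query):
        if tb == "1" and qb == "1":
            q += 1  # 1 in both signatures
        elif tb == "1":
            r += 1  # 1 in target only
        elif qb == "1":
            s += 1  # 1 in query only
        else:
            t += 1  # 0 in both signatures
    return q, r, s, t


def signature_distance(target, query, include_t=True):
    """(r + s) / (q + r + s + t), or without t when include_t is False."""
    q, r, s, t = bit_counts(target, query)
    denominator = q + r + s + (t if include_t else 0)
    return (r + s) / denominator if denominator else 0.0


# Small hand-checkable example: one bit in each of the four categories.
print(bit_counts("1100", "1010"))
print(signature_distance("1100", "1010"))
```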
13
False drops
False drops are minimized by an appropriate choice of the two parameters n and r.
Online Tool
14
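EXPERIMENTAL RESULTS
Metadata | Spell mistakes: False drop (%) | Spell mistakes: DR (%) | Missing words: False drop (%) | Missing words: DR (%) | Jumbled words: False drop (%) | Jumbled words: DR (%)
1000 | 7 | 93 | 9 | 91 | 3 | 97
5000 | 8 | 92 | 10 | 90 | 5 | 95
23000 | 10 | 90 | 12 | 88 | 5 | 95
DR: Detection Rate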
15
Scalability and accuracy of duplicate detection system
18
CONCLUSION
An effective and efficient duplicate detection technique is proposed.
– Duplicate detection is done by similarity search using the signature file method, which can detect duplicates despite typographical mistakes, word disorder, inconsistent abbreviations, and even missing words.
Language independent, with high performance and 95% accuracy.
19
Questions?