An Adaptive Algorithm for Detection of Duplicate Records
Technical Seminar 2004
Presented by: Ramakanta Behera (IT200127207)

Presentation transcript:

1 An Adaptive Algorithm for Detection of Duplicate Records
Presented by: Ramakanta Behera
Under the guidance of: Miss Ipsita Mishra

2 INTRODUCTION
• A “records set” is a list of prior distinct records. A new record is to be checked against the records set for a duplicate.
• A database is a collection of related data.
• Various algorithms exist, such as machine learning algorithms, learnable string similarity measures, and adaptive algorithms.

3 OBJECTIVES
• Reduce the cost of duplicate record detection.
• Achieve perfect scalability of the detection procedure.
• Cache prior information about distinct records, so that the prior records themselves need not be retained for further searches.
• Keep the algorithm adaptive.

4 PREVALENT METHODS
• The Brute Force Method: the number of comparisons is of the order of the number of records in the records set, and all prior records must be stored.
• Method by Rail et al.: the comparison of a new record against the records set is reduced from a full-text match to a comparison of two integers.
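
The two prevalent methods can be contrasted with a minimal Python sketch. The slides do not give the details of the method by Rail et al., so `record_to_int` below is a hypothetical stand-in (a truncated SHA-256 digest) chosen only to illustrate replacing full-text matches with integer comparisons.

```python
import hashlib


def is_duplicate_brute_force(new_record, records_set):
    """Full-text comparison against every stored record: O(n) comparisons."""
    return any(new_record == prior for prior in records_set)


def record_to_int(record):
    """Hypothetical record-to-integer mapping (an assumption, not the cited method)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def is_duplicate_by_integer(new_record, prior_ints):
    """Compare one integer per prior record instead of the full text."""
    new_int = record_to_int(new_record)
    return any(new_int == prior for prior in prior_ints)


records_set = ["alice,1980", "bob,1975"]
prior_ints = [record_to_int(r) for r in records_set]
```

Both variants still scan one entry per prior record; only the cost of each individual comparison changes.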

5 OUTLINE OF THE PROPOSED SOLUTION
The central idea behind the present algorithm is based on the fundamental property of primality of numbers.
Fig: Hashing (f(x) maps the record set into the integer number space I)
Fig: Extended hashing into prime space (f(x) maps the record set to integers I; g(x) maps those integers to prime numbers P)
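
A possible shape for the two mappings is sketched below in Python. The slides do not specify f(x) or g(x), so both definitions are assumptions: `f` hashes a record into a bounded integer space (the `space_size` parameter is hypothetical, and collisions are possible), and `g` sends integer i to the i-th prime, which is one simple way to obtain a distinct prime for each integer.

```python
import hashlib


def f(record, space_size=1000):
    """Assumed hash: record -> integer in [0, space_size). Collisions can occur."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % space_size


def nth_prime(n):
    """Return the n-th prime by trial division, counting from nth_prime(0) == 2."""
    primes = []
    candidate = 2
    while len(primes) <= n:
        # candidate is prime if no smaller prime up to its square root divides it
        if all(candidate % p for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes[-1]


def g(i):
    """Assumed mapping: integer -> prime. Distinct integers get distinct primes."""
    return nth_prime(i)


p_new = g(f("alice,1980"))  # the prime that stands for this record
```

Because `g` is one-to-one, two records share a prime only when `f` already maps them to the same integer.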

6 Fig: The complete algorithm (records r1, r2, …, rn are mapped by f(x) to integers I1, I2, …, In, which g(x) maps to primes P1, P2, …, Pn; the product P1 * P2 * … * Pn = Pprior is stored)
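
A small worked example of the product, in Python. The primes 2, 3 and 5 are illustrative values, not taken from the slides. By unique prime factorization, a prime divides the product exactly when it is one of the factors, and this is what allows the single number Pprior to stand in for the whole records set.

```python
from math import prod

# Illustrative primes P1, P2, P3 for three prior records (assumed values)
primes_of_prior_records = [2, 3, 5]
p_prior = prod(primes_of_prior_records)  # 2 * 3 * 5 = 30


def already_seen(p_new, p_prior):
    """A prime divides the product exactly when it is one of its factors."""
    return p_prior % p_new == 0
```

Here 3 divides 30, so a record mapped to 3 has been seen before, while 7 does not divide 30, so a record mapped to 7 is new.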

7 REALIZATION OF THE ALGORITHM
Two functions, f(x) and g(x), are to be realized for the implementation of the algorithm.
• Realizing f(x)
• Realizing g(x)

8 STEPS OF THE ALGORITHM
Step 1: Each new record is hashed, yielding a hash value (Hnew) that is unique for each distinct record.
Step 2: Hnew is mapped to its corresponding unique prime (Pnew).
Step 3: Pprior is divided by Pnew. If Pnew divides Pprior exactly, the record corresponding to Pnew is a duplicate, since its prime is already a factor of Pprior. Otherwise, the record is distinct.
Step 4: If the record is distinct, Pprior is multiplied by Pnew and the result is stored back in Pprior. Updating Pprior in this way renders the algorithm adaptive.
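
The four steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: the class name `DuplicateDetector`, the default hash `sha_hash`, and the sieve-built prime table are all hypothetical choices. The hash value is reduced modulo the table size, so two different records can collide and be reported as duplicates; the slides assume a hash that is unique per distinct record.

```python
import hashlib


def sieve(limit):
    """Return all primes below limit (Sieve of Eratosthenes)."""
    flags = [True] * limit
    flags[0:2] = [False, False]
    for n in range(2, int(limit ** 0.5) + 1):
        if flags[n]:
            for m in range(n * n, limit, n):
                flags[m] = False
    return [n for n, is_prime in enumerate(flags) if is_prime]


def sha_hash(record, space_size):
    """Assumed hash function: record -> index into the prime table."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % space_size


class DuplicateDetector:
    def __init__(self, prime_table, hash_fn=sha_hash):
        self.prime_table = prime_table
        self.hash_fn = hash_fn
        self.p_prior = 1  # empty product: no records seen yet

    def check_and_add(self, record):
        """Return True if the record is a duplicate; otherwise remember it."""
        h_new = self.hash_fn(record, len(self.prime_table))  # Step 1
        p_new = self.prime_table[h_new]                      # Step 2
        if self.p_prior % p_new == 0:                        # Step 3
            return True
        self.p_prior *= p_new                                # Step 4
        return False


detector = DuplicateDetector(sieve(10_000))
for record in ["alice,1980", "bob,1975", "alice,1980"]:
    status = "duplicate" if detector.check_and_add(record) else "distinct"
    print(record, "->", status)
```

Only the integer `p_prior` is kept between calls; the records themselves are never stored, which matches the stated objective of making retention of prior records unnecessary.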

9 Fig: Flowchart

10 IMPLEMENTATIONS
There are three important implementation details that need to be discussed:
• Size of the records set
• Use of logarithms
• Subsets of the records set
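
The slides only name these three topics, so the Python sketch below is one plausible reading rather than the presented method. It relies on a standard fact: the logarithm of a product is the sum of the logarithms of its factors, so the size of Pprior can be estimated without forming the product. The `PartitionedProducts` class and its slot rule (`p_new % num_subsets`) are hypothetical, showing how splitting the records set into subsets keeps each stored product smaller.

```python
import math


def product_bits_estimate(primes):
    """log2 of the product equals the sum of log2 of its factors."""
    return sum(math.log2(p) for p in primes)


primes = [2, 3, 5, 7, 11, 13]
exact_bits = math.prod(primes).bit_length()  # bits needed to store the product


class PartitionedProducts:
    """Hypothetical scheme: keep one smaller product per subset of records."""

    def __init__(self, num_subsets):
        self.products = [1] * num_subsets

    def check_and_add(self, p_new):
        slot = p_new % len(self.products)  # assumed rule for choosing a subset
        if self.products[slot] % p_new == 0:
            return True  # this prime is already a factor in its subset
        self.products[slot] *= p_new
        return False
```

Since each new distinct record multiplies in another prime, the stored number grows with the size of the records set, which is presumably why that size is listed as an implementation concern.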

11 CONCLUSION
• A new approach to handling duplicate records is presented.
• This approach combines concepts from number theory and algorithmics to solve the frequently encountered problem of “duplicate record detection”.

12 THANK YOU !!!
