Www.fakengineer.com An Adaptive Algorithm for Detection of Duplicate Records.

Technical Seminar 2004. An Adaptive Algorithm for Detection of Duplicate Records. Presented by: Ramakanta Behera, IT.
Presentation transcript:

An Adaptive Algorithm for Detection of Duplicate Records

INTRODUCTION A database is a collection of related data. A records set is a list of prior distinct records. A new record is to be checked against the records set to determine whether it is a duplicate. Various algorithms exist for this task, such as matching learning algorithms, learnable string similarity measures, and adaptive algorithms.

OBJECTIVES Reduce the cost of duplicate record detection. Make the detection procedure scalable. Cache prior information about distinct records, so that the prior records themselves need not be retained for later searches. Keep the algorithm adaptive.

PREVALENT METHODS The brute force method: its complexity is of the order of the number of records in the records set, and it requires all prior records to be stored. The method by Rail et al.: the comparison of a new record against the records set is reduced from a full text match to a comparison of two integers.
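As a point of reference, the brute force method can be sketched as below. This is a minimal illustration only; the function names and the use of plain string equality as the "full text match" are assumptions, not taken from the slides.

```python
def is_duplicate_brute_force(new_record: str, records: list[str]) -> bool:
    """Compare the new record against every prior record (full text match)."""
    for prior in records:          # one comparison per stored record: O(n)
        if prior == new_record:
            return True
    return False


def insert_if_distinct(new_record: str, records: list[str]) -> bool:
    """Append the record only if it is distinct. Returns True if it was added."""
    if is_duplicate_brute_force(new_record, records):
        return False
    records.append(new_record)     # every prior record must be kept in memory
    return True


records: list[str] = []
insert_if_distinct("alice,30,pune", records)
insert_if_distinct("bob,25,delhi", records)
insert_if_distinct("alice,30,pune", records)   # rejected as a duplicate
```

Both costs named on the slide are visible here: the loop grows with the records set, and the list holds every prior record.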

OUTLINE OF THE PROPOSED SOLUTION The central idea behind the present algorithm is based on the fundamental property of primality of numbers. Fig: Hashing. The function f(x) maps the record set into the integer number space I. Fig: Extended hashing into prime space. The function f(x) maps the record set to an integer number I, and g(x) maps that integer number to a prime number P.

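Fig: The complete algorithm. Records r1, r2, …, rn are mapped by f(x) to integers I1, I2, …, In, which are mapped by g(x) to primes P1, P2, …, Pn. The product of these primes is Pprior: P1 * P2 * … * Pn = Pprior.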

REALIZATION OF THE ALGORITHM Two functions, f(x) and g(x), must be realized to implement the algorithm: f(x) maps a record to an integer, and g(x) maps that integer to a prime.
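The slides do not say how f(x) and g(x) are realized. One possible realization, purely as an assumption for illustration, is to take f(x) as a truncated SHA-256 hash and g(x) as a lookup of the h-th prime in a precomputed table. The constant HASH_BITS and the helper primes_below are hypothetical names. Because the hash is truncated to 13 bits, two distinct records can collide in this toy version.

```python
import hashlib

HASH_BITS = 13  # hypothetical choice: 2**13 = 8192 possible hash values


def primes_below(limit: int) -> list[int]:
    """Return all primes below limit using the sieve of Eratosthenes."""
    is_prime = bytearray([1]) * limit
    is_prime[0:2] = b"\x00\x00"
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n starting at n*n.
            is_prime[n * n::n] = bytearray(len(range(n * n, limit, n)))
    return [n for n in range(limit) if is_prime[n]]


# There are 9592 primes below 100000, enough for 8192 hash values.
PRIMES = primes_below(100_000)[: 1 << HASH_BITS]


def f(record: str) -> int:
    """Assumed f(x): map a record to an integer in [0, 2**HASH_BITS)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") >> (16 - HASH_BITS)


def g(h: int) -> int:
    """Assumed g(x): map integer h to the h-th prime (0-based), so g is one-to-one."""
    return PRIMES[h]
```

The table lookup keeps g(x) one-to-one, which the divisibility test relies on: two different integers must never share a prime.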

STEPS OF THE ALGORITHM
Step 1 : Each new record is hashed, giving a hash value (Hnew) that is unique for each distinct record.
Step 2 : Hnew is mapped to its corresponding unique prime (Pnew).
Step 3 : Pprior is divided by Pnew. If Pnew divides Pprior exactly, the record corresponding to Pnew is a duplicate, because its prime is already a factor of Pprior. Otherwise, the record is distinct.
Step 4 : If the record is distinct, Pprior is multiplied by Pnew and the result is stored back in Pprior. Updating Pprior in this way renders the algorithm adaptive.
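The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the presenter's implementation: f(x) is assumed to be a SHA-256 hash truncated to 13 bits, g(x) is assumed to be a lookup in a precomputed prime table, and the class name DuplicateDetector is hypothetical. With a truncated hash, distinct records can collide and be reported as duplicates.

```python
import hashlib


def _primes_below(limit: int) -> list[int]:
    """Sieve of Eratosthenes: all primes below limit."""
    sieve = bytearray([1]) * limit
    sieve[0:2] = b"\x00\x00"
    for n in range(2, int(limit ** 0.5) + 1):
        if sieve[n]:
            sieve[n * n::n] = bytearray(len(range(n * n, limit, n)))
    return [n for n in range(limit) if sieve[n]]


PRIMES = _primes_below(100_000)[:8192]  # one prime per possible hash value


def f(record: str) -> int:
    """Assumed f(x): 13-bit integer derived from SHA-256 (collisions possible)."""
    digest = hashlib.sha256(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big") >> 3


def g(h: int) -> int:
    """Assumed g(x): the h-th prime (0-based)."""
    return PRIMES[h]


class DuplicateDetector:
    def __init__(self) -> None:
        self.p_prior = 1  # empty product: no records seen yet

    def check_and_add(self, record: str) -> bool:
        """Return True if the record is a duplicate; otherwise absorb it."""
        h_new = f(record)                  # Step 1: hash the record
        p_new = g(h_new)                   # Step 2: map the hash to a prime
        if self.p_prior % p_new == 0:      # Step 3: exact division means duplicate
            return True
        self.p_prior *= p_new              # Step 4: update Pprior
        return False


detector = DuplicateDetector()
first = detector.check_and_add("alice,30,pune")    # distinct on first sight
second = detector.check_and_add("alice,30,pune")   # same record again
```

Only the single integer p_prior is stored, never the records themselves. Python integers have arbitrary precision, so the product does not overflow, although it grows with every distinct record.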

Fig: Flowchart

IMPLEMENTATIONS Three important implementation details need to be discussed: the size of the records set, the use of logarithms, and subsets of the records set.
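The slides do not elaborate on these three points. One fact that bears on the first two is that the logarithm of a product is the sum of the logarithms of its factors, so the size of Pprior can be estimated without forming the product. The sketch below illustrates only that identity; the helper first_primes and the choice of 100 records are assumptions, and this is not a claim about how the presenter used logarithms.

```python
import math


def first_primes(count: int) -> list[int]:
    """Return the first `count` primes by trial division (fine for small counts)."""
    primes: list[int] = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p != 0 for p in primes if p * p <= candidate):
            primes.append(candidate)
        candidate += 1
    return primes


primes = first_primes(100)        # stand-in for the primes of 100 distinct records
p_prior = math.prod(primes)       # the exact product kept by the algorithm

bits_exact = p_prior.bit_length()                   # storage needed for Pprior
bits_from_logs = sum(math.log2(p) for p in primes)  # log2(Pprior) as a sum of logs

print(bits_exact, round(bits_from_logs, 2))
```

The two numbers agree to within one bit, which shows how quickly Pprior grows as distinct records are added.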

CONCLUSION A new approach to handling duplicate records is presented. This approach combines concepts from number theory and algorithmics to solve the frequently encountered problem of duplicate record detection.

THANK YOU !!!