A new matching algorithm based on prime numbers
N. D. Atreas and C. Karanikas
Department of Informatics, Aristotle University of Thessaloniki
Exact Matching: find all the occurrences of a pattern within a text.
1. The Brute Force algorithm: performs character-by-character comparison in O(NM) time, where M is the length of the pattern and N is the length of the text.
2. The Knuth-Morris-Pratt algorithm: runs in O(N+M) time, avoiding unnecessary re-examination of previously matched characters.
3. The Boyer-Moore algorithm: performs character-by-character comparison, checking the pattern backwards. Best case: O(N/M); worst case: O(N).
4. The Karp-Rabin algorithm: a randomised algorithm that seeks a pattern within a text by using hashing. Expected running time: O(N+M).
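To make the contrast between the first two algorithms in the list concrete, here is a minimal Python sketch of brute-force search and of Knuth-Morris-Pratt search. The function names (`brute_force`, `failure_table`, `kmp_search`) are illustrative and do not come from the talk; positions are 0-based.

```python
def brute_force(text, pattern):
    """Compare the pattern against every window of the text: O(N*M)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return []
    return [i for i in range(n - m + 1) if text[i:i + m] == pattern]


def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1]
    that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Scan the text once, never re-reading matched characters: O(N+M)."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    matches, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]          # fall back using the failure table
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):        # full match ending at position i
            matches.append(i - k + 1)
            k = fail[k - 1]          # keep going to allow overlapping matches
    return matches
```

Both functions return the same list of starting positions; only the amount of re-examination differs.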
A hash function must be:
– efficiently computable;
– highly discriminating for strings;
– incrementally updatable: hash(x(j+1 … j+M)) must be easily computable from hash(x(j … j+M-1)) and x(j+M);
– not injective, i.e. the equality of two hash values suggests, but does not guarantee, equality of the inputs.
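The requirements above can be illustrated with a Karp-Rabin style rolling hash. The sketch below is a generic textbook variant, not the hash proposed in this talk. The polynomial hash, the parameters `base` and `mod`, and the function name `karp_rabin` are assumptions chosen for illustration. A randomised version would draw the modulus at random; here it is fixed for simplicity.

```python
def karp_rabin(text, pattern, base=256, mod=1_000_000_007):
    """Return all 0-based positions where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)       # weight of the leading character
    hp = ht = 0
    for k in range(m):                 # hash the pattern and the first window
        hp = (hp * base + ord(pattern[k])) % mod
        ht = (ht * base + ord(text[k])) % mod
    matches = []
    for j in range(n - m + 1):
        # Equal hashes only suggest a match, so verify character by character.
        if hp == ht and text[j:j + m] == pattern:
            matches.append(j)
        if j + m < n:
            # Rolling update: drop text[j], append text[j + m], in O(1).
            ht = ((ht - ord(text[j]) * high) * base + ord(text[j + m])) % mod
    return matches
```

With a very small modulus such as `mod=7`, collisions become frequent, yet the result stays correct because every candidate is verified.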
Let x = {x(1),…,x(N)} be a set of positive integers and let p(1) < … < p(N) be primes with p(1) > Max{x(i): i=1,…,N}. We define the transform:
Properties of T(x(1),…,x(N))
T(x(1),…,x(N)) is one to one.
x(1),…,x(N) can be recovered from T(x) as the unique solution of a system of N linear Diophantine equations defined recursively:
(p(i+1)…p(N))x(i) + p(i)c(i+1) = c(i),
where c(1) = T(x)p(1)…p(N).
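The recovery step can be sketched in Python under one explicit assumption: that the transform has the form T(x) = x(1)/p(1) + … + x(N)/p(N), with distinct primes p(i) larger than every x(i). This is the form consistent with c(1) = T(x)p(1)…p(N) and with the recursion above when c(N+1) = 0, but it is an assumption of this sketch. The names `transform` and `recover` are illustrative.

```python
from fractions import Fraction
from math import prod


def transform(x, primes):
    """Assumed form of the transform: T(x) = sum of x(i) / p(i), kept exact."""
    return sum(Fraction(xi, pi) for xi, pi in zip(x, primes))


def recover(t, primes):
    """Solve (p(i+1)...p(N)) x(i) + p(i) c(i+1) = c(i) for i = 1..N."""
    c = t * prod(primes)               # c(1) = T(x) p(1)...p(N)
    assert c.denominator == 1, "c(1) must be an integer"
    c = c.numerator
    xs = []
    for i, p in enumerate(primes):
        tail = prod(primes[i + 1:])    # p(i+1)...p(N); empty product is 1
        # Reduce the equation mod p(i): tail * x(i) = c(i) (mod p(i)).
        # tail is invertible because the primes are distinct, and
        # 0 < x(i) < p(i) makes the solution unique.
        xi = (c * pow(tail, -1, p)) % p
        xs.append(xi)
        c = (c - tail * xi) // p       # this is c(i+1)
    return xs
```

For example, with primes 7, 11, 13, 17, 19 (all larger than the entries), `recover(transform([3, 1, 4, 1, 5], primes), primes)` returns the original sequence, which is the one-to-one property in action.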
Properties of T(x(1),…,x(N))
T(x) can be used as a measure of similarity between two strings, since it can be used for counting the elements in which they differ.
It provides a necessary and sufficient condition for detecting whether a binding operation on strings can be implemented.
It is not a hash function.
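One possible way to count differing elements is sketched below. It assumes the transform has the form T(x) = x(1)/p(1) + … + x(N)/p(N) with distinct primes p(i) larger than every entry; the talk's own counting procedure is not reproduced here. Under that assumption, T(x) - T(y) in lowest terms has a denominator equal to the product of exactly those p(i) for which x(i) differs from y(i). The function names are illustrative.

```python
from fractions import Fraction


def transform(x, primes):
    """Assumed form of the transform: T(x) = sum of x(i) / p(i), kept exact."""
    return sum(Fraction(xi, pi) for xi, pi in zip(x, primes))


def count_differences(x, y, primes):
    """Count positions i with x(i) != y(i) using only T(x) and T(y).

    Assumes len(x) == len(y) == len(primes) and every entry is a positive
    integer smaller than every prime.
    """
    diff = transform(x, primes) - transform(y, primes)
    # Fraction is always stored in lowest terms, so the primes dividing the
    # denominator are exactly those at positions where x and y differ.
    return sum(1 for p in primes if diff.denominator % p == 0)
```

For x = [3, 1, 4, 1, 5] and y = [3, 2, 4, 1, 1] with primes 7, 11, 13, 17, 19, the difference is -1/11 + 4/19 = 25/209, and 209 = 11 · 19, so two positions differ.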
Modelling a hash function approximating T.
Definition of the hash function We prove:
Final form of hash function Theorem
Software implementation
Let X={x(1),…,x(N)} be the text and Y={y(1),…,y(M)} be the pattern.
Compute T(y(1),…,y(M)) and T(x(1),…,x(M)) in O(M) time.
Compute the hash values in O(N-M) time:
Software implementation
for some i, then x(i+1),…,x(i+M-1) is a candidate for string matching.
For all candidates, perform at most p character comparisons (p is the size of the alphabet) to throw out false matches.
The algorithm runs in O(N) time.
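The candidate-then-verify structure described above can be sketched as follows. This is not the hash function of the talk: a floating-point evaluation of an assumed transform T(x) = x(1)/p(1) + … + x(M)/p(M) stands in for the approximating hash. It is also recomputed for every window, so this sketch runs in O(NM) time, not O(N). The names `approx_T`, `find_matches` and the tolerance `tol` are hypothetical.

```python
def approx_T(window, primes):
    """Floating-point stand-in for a hash approximating T (an assumption)."""
    return sum(v / p for v, p in zip(window, primes))


def find_matches(text, pattern, primes, tol=1e-9):
    """Return 0-based positions where pattern occurs in text.

    text and pattern are lists of positive integers; primes holds
    len(pattern) distinct primes larger than every entry.
    """
    m = len(pattern)
    if m == 0:
        return []
    target = approx_T(pattern, primes)
    matches = []
    for i in range(len(text) - m + 1):
        window = text[i:i + m]
        # Filter step: close approximate values mark a candidate.
        if abs(approx_T(window, primes) - target) <= tol:
            # Verification step: discard false matches by direct comparison.
            if window == pattern:
                matches.append(i)
    return matches
```

Identical windows always produce identical floating-point values, so the filter never discards a true match; the verification step only has to remove false candidates.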
Conclusions
We introduce the idea of a hash function approximation in order to reduce the computational complexity of an algorithm.
Although the time bounds are the same as, or in some cases inferior to, those of the Boyer-Moore algorithm, our algorithm is superior for multiple matching problems.