Fast Fingerprint Calculations Thomas Schwarz, S.J.

Fingerprints Definition: A fingerprint (a.k.a. signature) of an object Ob is a small string f(Ob) with the following properties: 1. f is a function of Ob. In particular, if two objects are equal, then so are their fingerprints. 2. $\mathrm{Prob}(f(Ob_1) = f(Ob_2)) \ll 1$ for “random” objects $Ob_1 \neq Ob_2$.

Usage Fingerprints are used to: identify objects; compare objects remotely; test an object for changes. Since fingerprints are smaller than the objects themselves, they are very useful as stand-ins for remote objects.

Usage Object identification example: Software Cloning: During maintenance, the need arises for a module very similar in character to one that already exists. Because of time pressure, the existing module is simply copied, all names are systematically changed, and the copy is then modified to serve the new needs. Maintenance Problem: If a bug is detected in a clone or in the original, it probably also resides in the original and/or the other clones. Besides, clones arise because of time pressure, but the short-cut ends up costing more in the long run. Thus, it is better to identify clones during maintenance. Clone Identification: Systematically suppress names, then test whether the function code is identical. Johnson, J.H. Substring Matching for Software Clone Detection and Change Tracking. International Conference on Software Maintenance. Victoria, BC, 1994.

Usage Similarity Testing for Files n-gram: Contiguous substring of n characters in a file. File Similarity: Count the number of occurrences of a particular n-gram. Use the fingerprint of an n-gram as a hash value. Count the fingerprints instead of the n-grams. Cohen, J.D. Recursive Hashing Functions for n-grams. ACM Trans. Information Systems.
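The counting idea can be sketched as follows. This is only an illustration: the polynomial fingerprint modulo a prime, the constants `base` and `mod`, and the overlap-based `similarity` score are assumptions made for the example, not the scheme from Cohen's paper.

```python
from collections import Counter

def ngram_fingerprints(data, n, base=257, mod=(1 << 31) - 1):
    """Count the fingerprints of all contiguous n-grams of a byte string."""
    counts = Counter()
    if n <= 0 or len(data) < n:
        return counts
    top = pow(base, n - 1, mod)          # weight of the symbol leaving the window
    fp = 0
    for b in data[:n]:                   # fingerprint of the first n-gram
        fp = (fp * base + b) % mod
    counts[fp] += 1
    for i in range(n, len(data)):        # slide the window one symbol at a time
        fp = ((fp - data[i - n] * top) * base + data[i]) % mod
        counts[fp] += 1
    return counts

def similarity(a, b, n=4):
    """Hypothetical score: shared n-gram fingerprints over the larger count."""
    ca, cb = ngram_fingerprints(a, n), ngram_fingerprints(b, n)
    shared = sum((ca & cb).values())     # multiset intersection
    total = max(sum(ca.values()), sum(cb.values()), 1)
    return shared / total
```

Only the fingerprints are stored and counted, so the memory needed per n-gram is independent of n.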

Usage Remote String Searches: Find all occurrences of a given string in files on remote servers. Instead of sending the string to all servers, only a fingerprint and the length l of the string are sent. The servers generate running fingerprints of l-grams and compare them with the string’s fingerprint.
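A minimal sketch of the server side of such a search, assuming a polynomial fingerprint modulo a prime. `BASE`, `MOD`, `fingerprint`, and `server_search` are hypothetical names chosen for the example. The offsets returned are candidates only, since two different l-grams can share a fingerprint.

```python
BASE = 257                 # assumed multiplier
MOD = (1 << 61) - 1        # assumed prime modulus

def fingerprint(s):
    """Fingerprint of a byte string, evaluated with Horner's rule."""
    fp = 0
    for b in s:
        fp = (fp * BASE + b) % MOD
    return fp

def server_search(text, fp_pattern, length):
    """Return the start offsets whose length-gram has fingerprint fp_pattern."""
    if length <= 0 or len(text) < length:
        return []
    top = pow(BASE, length - 1, MOD)     # weight of the symbol leaving the window
    fp = fingerprint(text[:length])
    hits = [0] if fp == fp_pattern else []
    for i in range(length, len(text)):
        # Running fingerprint: drop text[i - length], append text[i].
        fp = ((fp - text[i - length] * top) * BASE + text[i]) % MOD
        if fp == fp_pattern:
            hits.append(i - length + 1)
    return hits
```

The client would send only `fingerprint(pattern)` and `len(pattern)`; the server never sees the search string itself.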

Usage Remote File Comparison Original Problem: How to compare pages of remote replicas of a database. Solution: Calculate fingerprints (“signatures”) of each page. Calculate a super-signature from the pages. If the super-signatures coincide, conclude that the replicas are in sync. If not, run a “smart” protocol to find the non-matching signatures. Abdel-Ghaffar, K. A. S., El-Abbadi, A. Efficient Detection of Corrupted Pages in a Replicated File. ACM Symp. Distributed Computing, 1993. Barbara, D., Garcia-Molina, H., Feijoo, B. Exploiting Symmetries for Low-Cost Comparison of File Copies. Proc. Int. Conf. Distributed Computing Systems, 1988. Barbara, D., Lipton, R. J. A Class of Randomized Strategies for Low-Cost Comparison of File Copies. IEEE Trans. Parallel and Distributed Systems, vol. 2(2), 1991. Fuchs, W., Wu, K. L., Abraham, J. A. Low-Cost Comparison and Diagnosis of Large, Remotely Located Files. Proc. Symp. Reliability Distributed Software and Database Systems. Schwarz, Th., Bowdidge, B., Burkhard, W. Low Cost Comparison of Files. Int. Conf. on Distr. Comp. Syst. (ICDCS 90).
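The two-level comparison can be sketched as below. Several things here are assumptions for illustration: the 8-byte BLAKE2b digest stands in for whatever page signature is actually used, the super-signature is taken over the concatenated page signatures, and the fallback simply compares all page signatures, which is the naive protocol and not one of the “smart” protocols from the cited papers.

```python
import hashlib

def sig(data):
    """Stand-in 8-byte signature (an assumption, not the slides' scheme)."""
    return hashlib.blake2b(data, digest_size=8).digest()

def page_signatures(replica, page_size):
    """One signature per page of the replica."""
    return [sig(replica[i:i + page_size])
            for i in range(0, len(replica), page_size)]

def super_signature(page_sigs):
    """Signature computed over all page signatures."""
    return sig(b"".join(page_sigs))

def differing_pages(local, remote_sigs, remote_super, page_size):
    """Return indices of pages that differ; empty if super-signatures agree."""
    local_sigs = page_signatures(local, page_size)
    if super_signature(local_sigs) == remote_super:
        return []                          # replicas presumed in sync
    n = max(len(local_sigs), len(remote_sigs))
    return [i for i in range(n)
            if i >= len(local_sigs) or i >= len(remote_sigs)
            or local_sigs[i] != remote_sigs[i]]
```

In the common case where the replicas agree, only the single super-signature has to cross the network.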

Usage Secure Signatures: To identify an object, maintain its signature. If the object is altered by an adversary, the adversary cannot do so in a computationally feasible way without changing the signature. “Cryptographically secure signature” SHA-1, MD5 Used for authentication, e.g. in computer forensics, digital signatures, etc.

SHA-1 20 bytes (160 bits) long. Designed for fast calculation. Considered unbreakable. Used increasingly in applications where cryptographic security is not needed. Radia Perlman’s Law of Cryptography: “If a lot of smart people spent lots of time trying to break a scheme, and did not succeed, then it cannot be done.”

Useful Properties of Fingerprints Fast Calculation. Low collision rate. If the fingerprints have length l bits, then the probability of a collision should be $2^{-l}$. If there are small changes, then fingerprints should change. Cryptographically unbreakable. Given a signature, one cannot construct an object with this signature.

Useful Properties of Fingerprints Updatable If the object changes, then we can update the signature from the old signature and the changes. Concatenation of Objects If an object is made up of several objects, then we can calculate the signature of the super-object from its constituents. Possibly in a way that allows us to quickly pinpoint different component objects.
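As a sketch of updatability, assume a polynomial fingerprint over the integers modulo a prime, in which the symbol at position i of a string of length m carries the weight $\alpha^{m-1-i}$. The constants `P` and `ALPHA` and the function names are hypothetical. Replacing one symbol then changes the signature by the symbol difference times that weight, so the old signature and the change suffice.

```python
P = 2_147_483_647      # assumed prime modulus (2^31 - 1)
ALPHA = 31             # assumed ring element

def sig(symbols):
    """Polynomial fingerprint via Horner's rule."""
    s = 0
    for a in symbols:
        s = (s * ALPHA + a) % P
    return s

def updated_sig(old_sig, length, index, old_symbol, new_symbol):
    """Signature after replacing one symbol, without rereading the object."""
    weight = pow(ALPHA, length - 1 - index, P)
    return (old_sig + (new_symbol - old_symbol) * weight) % P
```

For an object of a million symbols, a one-symbol update costs one modular exponentiation instead of a full recalculation.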

Karp Rabin Style Fingerprints Here, the calculation takes place in a ring R with multiplication and addition. Karp, R. M., Rabin, M. O. Efficient randomized pattern-matching algorithms. IBM Journal of Research and Development, Vol. 31, No. 2, March 1987.

Karp Rabin Style Fingerprints Calculation time is linear in the object length N (1 multiplication and 1 addition per symbol).

Karp Rabin Style Fingerprints Easy to calculate consecutive n-grams. Easy to calculate signatures of concatenations: $\mathrm{sig}_\alpha(a_0, a_1, \ldots, a_{l-1}, a_l, a_{l+1}, \ldots, a_{l+n-1}) = \alpha^n \, \mathrm{sig}_\alpha(a_0, a_1, \ldots, a_{l-1}) + \mathrm{sig}_\alpha(a_l, a_{l+1}, \ldots, a_{l+n-1})$. Possibly use a table of values of $\alpha^n$.
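The concatenation rule and the n-gram update can be checked with a small sketch. It assumes the Horner-style definition $\mathrm{sig}_\alpha(a_0, \ldots, a_{m-1}) = \sum_i a_i \alpha^{m-1-i}$, which is the form consistent with the concatenation rule above, and it uses the integers modulo a prime as the ring. `P`, `ALPHA`, and the function names are assumptions for the example.

```python
P = 2_147_483_647      # assumed prime modulus (2^31 - 1)
ALPHA = 31             # assumed ring element alpha

def sig(symbols, alpha=ALPHA, p=P):
    """One multiplication and one addition per symbol (Horner's rule)."""
    s = 0
    for a in symbols:
        s = (s * alpha + a) % p
    return s

def concat_sig(sig_left, sig_right, n_right, alpha=ALPHA, p=P):
    """Signature of left + right from the two part signatures."""
    return (pow(alpha, n_right, p) * sig_left + sig_right) % p

def next_ngram_sig(old_sig, leaving, entering, n, alpha=ALPHA, p=P):
    """Signature of the next n-gram: drop one symbol, append one symbol."""
    return ((old_sig - leaving * pow(alpha, n - 1, p)) * alpha + entering) % p
```

In a real implementation the powers of alpha would come from a precomputed table, as the slide suggests, rather than from a call to `pow` each time.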

Karp Rabin Style Fingerprints Cryptographically not secure. A cryptographically secure signature is a one-way function. In order to be efficiently calculable, it needs to process large portions of a string at a time. Thus, cryptographic security is not a desirable property in general.

Choice of Ring R 1. Integers modulo a prime p. The ring is then a well-understood field. But reducing modulo p is a costly operation. 2. Integers modulo $2^f$. Powers of two are zero divisors, e.g. $2 \cdot 2^{f-1} = 0$. This excludes powers of 2 as $\alpha$.
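A short sketch of why a power of two fails as $\alpha$ in the integers modulo $2^f$, assuming a Horner-style fingerprint. The value `F = 8` and the sample strings are chosen only for illustration. With $\alpha = 2$, every symbol that is f or more positions from the end is multiplied by a multiple of $2^f$ and disappears from the fingerprint.

```python
F = 8
M = 1 << F             # the ring is the integers modulo 2^F

def sig(symbols, alpha):
    """Horner-style fingerprint in the integers modulo 2^F."""
    s = 0
    for a in symbols:
        s = (s * alpha + a) % M
    return s

x = [1] + [0] * F      # differs from y only in the first symbol
y = [0] + [0] * F

same_with_two = sig(x, 2) == sig(y, 2)        # the change goes undetected
differ_with_three = sig(x, 3) != sig(y, 3)    # an odd alpha still sees it
```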

Choice of Ring 3. Reduction by a polynomial. The space of unsigned integers $0, \ldots, 2^{32}-1$ can be naturally identified with the space of all polynomials in k[t] over k = {0,1} with degree up to 31. Select a polynomial $\varphi(t)$ with degree f up to 31 and consider the ring $R = k[t]/(\varphi(t))$. Elements of this ring are naturally identified with the unsigned integers $0, \ldots, 2^f-1$. Addition of these polynomials corresponds to the fast XOR. Multiplication is more difficult, but multiplication by t is a left shift followed by conditionally XORing with $\varphi$. This is the most promising construction.

$R = k[t]/(\varphi(t))$ Example Set $\varphi = t^4 + t + 1$, that is, $\varphi$ = 1,0011. Elements of R are all bit strings of length 4. To add 0101 and 1100, just XOR: 0101 + 1100 = 1001. To multiply with t = 0010, left shift and XOR conditionally with $\varphi$. To multiply 0010 with 0010, left-shift the first and obtain 0,0100. The leading coefficient is zero, which is dropped. Result is 0100. To multiply 1100 with 0010, left shift to obtain 1,1000. The leading coefficient is one, so XOR with $\varphi$ = 1,0011 to obtain the result 1011.
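The example arithmetic can be reproduced with a few lines. The sketch assumes the reduction polynomial $t^4 + t + 1$ (bit pattern 1,0011), which is the value the worked multiplications in the example are consistent with. `PHI`, `times_t`, and `multiply` are names chosen for the illustration.

```python
PHI = 0b10011          # t^4 + t + 1
F = 4                  # degree of PHI; elements are 4-bit strings

def times_t(x):
    """Multiply by t: left shift, then XOR with PHI if bit F is set."""
    x <<= 1
    if x >> F:         # leading coefficient is one
        x ^= PHI
    return x

def multiply(x, y):
    """General product by shift-and-add, one bit of y at a time."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x = times_t(x)
        y >>= 1
    return result
```

Addition needs no helper at all: it is the `^` operator on the bit strings.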

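Galois Fields If $\varphi$ is irreducible, then R is a Galois field. If we use a Galois field, we can concatenate fingerprints to obtain a signature: $\mathrm{sig}_{\boldsymbol{\alpha}} = (\mathrm{sig}_{\alpha_1}, \ldots, \mathrm{sig}_{\alpha_n})$ for a vector $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_n)$ of field elements.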

Galois Fields If we use $\boldsymbol{\alpha} = (\alpha, \alpha^2, \alpha^3, \ldots, \alpha^n)$, then the signature components are the parity symbols of a generalized, non-systematic Reed-Solomon code. Since these codes are MDS, the signature is guaranteed to change whenever up to n symbols of the object change.
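A sketch of an n-fold signature in the 16-element field built from $t^4 + t + 1$, with the primitive element $\alpha = t$ = 0010. The Horner-style component fingerprint and all names are assumptions for the example. For strings shorter than 15 symbols, every change of one or two symbols alters the 2-fold signature, which can be confirmed by exhaustive search on a short string.

```python
PHI = 0b10011          # t^4 + t + 1, irreducible over {0,1}
F = 4

def mul(x, y):
    """Product of two field elements by shift-and-add."""
    result = 0
    while y:
        if y & 1:
            result ^= x
        x <<= 1
        if x >> F:
            x ^= PHI
        y >>= 1
    return result

def power(x, e):
    """x raised to the e-th power in the field."""
    result = 1
    for _ in range(e):
        result = mul(result, x)
    return result

def sig(symbols, beta):
    """Component fingerprint: Horner's rule with multiplier beta."""
    s = 0
    for a in symbols:
        s = mul(s, beta) ^ a
    return s

def signature(symbols, n, alpha=0b0010):
    """n-fold signature: fingerprints for alpha, alpha^2, ..., alpha^n."""
    return tuple(sig(symbols, power(alpha, i)) for i in range(1, n + 1))
```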

Galois Field Signatures To calculate a Galois field footprint, we only need per symbol: one XOR; one left-shift; one test whether the leading coefficient is now one; conditionally one XOR.

Speeding up Galois field footprints However, we do not have to execute the reduction step each time. Instead, left shift and XOR b times (Broder’s idea). Then use a table look-up to reduce the “overhang”. Broder, A. Some applications of Rabin's fingerprinting method. In Capocelli, De Santis, and Vaccaro (eds.), Sequences II: Methods in Communications, Security, and Computer Science. Springer-Verlag, 1993.

Speeding up Galois field footprints String is (1000, 1100, 1010, 0111, 0011, ...). Choose $\varphi$ = 1,0011. This is an irreducible polynomial. Step 0: 1000. Step 1: 1,0000 + 1100 = 1,1100. Step 2: 11,1000 + 1010 = 11,0010. Step 3: 110,0100 + 0111 = 110,0011. Step 4: 1100,0110 + 0011 = 1100,0101. Now use table look-up to calculate table[1100] = 0111, so the result is 0111 + 0101 = 0010. Cost: 8 elementary ops (4 shifts, 4 XORs) + 1 shift right + 1 table look-up + 1 XOR.

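Speeding up Galois field footprints For comparison, reducing at every step: String is (1000, 1100, 1010, 0111, 0011, ...). Step 0: 1000. Step 1: 1,0000 + 1100 = 1,1100, reduced as 1,1100 + $\varphi$ = 1,1100 + 1,0011 = 1111. Step 2: 1,1110 + 1010 = 1,0100, reduced to 0111. Step 3: 0,1110 + 0111 = 1001. Step 4: 1,0010 + 0011 = 1,0001, reduced to 0010. Cost: 8 elementary ops + 4 condition evaluations + 2 elementary ops on average.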
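Both ways of computing the footprint can be compared in a short sketch, assuming the reduction polynomial 1,0011, multiplication by t as the per-symbol step, and b = 4 shifts between table look-ups. The function names are hypothetical. On the example string both variants produce 0010.

```python
PHI = 0b10011              # t^4 + t + 1
F = 4
MASK = (1 << F) - 1

def reduce(x):
    """Reduce a polynomial of any degree modulo PHI."""
    for shift in range(x.bit_length() - F - 1, -1, -1):
        if (x >> (shift + F)) & 1:
            x ^= PHI << shift
    return x

def footprint_stepwise(symbols):
    """Shift, XOR, and conditionally reduce after every symbol."""
    fp = 0
    for s in symbols:
        fp = (fp << 1) ^ s
        if fp >> F:
            fp ^= PHI
    return fp

def footprint_broder(symbols, b=4):
    """Shift and XOR b times, then reduce the overhang by table look-up."""
    table = [reduce(h << F) for h in range(1 << b)]
    it = iter(symbols)
    fp = next(it, 0)       # step 0: the first symbol
    shifts = 0
    for s in it:
        fp = (fp << 1) ^ s
        shifts += 1
        if shifts == b:
            fp = table[fp >> F] ^ (fp & MASK)
            shifts = 0
    return reduce(fp)      # clear any overhang left at the end
```

The table-based variant replaces b conditional branches with one look-up, which is where the saving comes from.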

Speeding up Galois field footprints How do we calculate the table entries? Systematically reduce by t-multiples of $\varphi$ = 1,0011. To calculate the table entry for 12 = 1100, reduce 1100,0000 in four steps. 1100,0000 + $t^3 \cdot \varphi$ = 1100,0000 + 1001,1000 = 0101,1000. Now reduce with $t^2 \cdot \varphi$: 0101,1000 + $t^2 \cdot \varphi$ = 0101,1000 + 0100,1100 = 0001,0100. No step with $t \cdot \varphi$, since the corresponding coefficient is zero. Reduce with $\varphi$: 0001,0100 + 0001,0011 = 0000,0111 = 0111.
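The table construction can be sketched directly from this recipe, assuming the reduction polynomial 1,0011 and a 4-bit overhang. `table_entry` and `build_table` are hypothetical names.

```python
PHI = 0b10011              # t^4 + t + 1
F = 4

def table_entry(h, b=4):
    """Reduce the pattern h,0000 using the multiples t^(b-1)*PHI, ..., PHI."""
    x = h << F
    for k in range(b - 1, -1, -1):
        if (x >> (F + k)) & 1:     # coefficient of t^(F+k) is one
            x ^= PHI << k          # subtract t^k * PHI
    return x

def build_table(b=4):
    """One entry for every possible b-bit overhang."""
    return [table_entry(h, b) for h in range(1 << b)]
```

Because reduction is linear over {0,1}, the entry for an overhang is the XOR of the entries for its individual bits, so a table could also be built from just b basis entries.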

Speeding up Galois field footprints The optimal table size needs to be determined experimentally. The table needs to fit in cache, which caps its size. If the table is too small, then the look-up cost does not amortize well.

Galois field signatures Galois field signatures are concatenations of Galois field fingerprints. Broder tables work for multiplication with $t^2$, $t^3$, $t^4$ as well, but less efficiently, since now we shift two, three, or four times per symbol, so that we need to use table look-up more often.

Performance Results Timings in msec per MB for: 16 bit parity; 16 bit $\alpha^1$ power; 16 bit $\alpha^2$ power. Results for a 1.99GHz Pentium 4 w. 512MB memory.

How Long should Signatures be? Key fact: There are 31,557,600 seconds in a year. At x calculations per second, there will be 31,557,600x incidents per year, which will lead to fewer than 0.015x collisions per year for 32 bit signatures. So, the minimum length should be 64 bits. We can achieve this easily. With larger signatures, collisions become rarer than hard drive failures (writing on the wrong track), software failures, etc.
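The arithmetic can be replayed in a few lines, assuming each signature comparison of different objects collides with probability $2^{-l}$ for l-bit signatures (the ideal collision rate) and a year of 365.25 days. The function name and the sample rates are illustrative.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600      # 31,557,600

def collisions_per_year(rate_per_second, bits):
    """Expected collisions per year at an ideal 2^-bits collision rate."""
    return rate_per_second * SECONDS_PER_YEAR * 2.0 ** -bits

per_unit_rate_32 = collisions_per_year(1, 32)        # stays below 0.015
million_per_second_32 = collisions_per_year(1e6, 32)
million_per_second_64 = collisions_per_year(1e6, 64)
```

At a million comparisons per second, 32 bit signatures would collide thousands of times per year under this model, while 64 bit signatures bring the expectation far below one.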

Research Questions * Property of a signature: Changing up to n symbols changes the n-fold signature for sure. It is known that if we change to a different vector $\boldsymbol{\alpha}$, e.g. one where the components are all primitive elements, we lose the * property. Are there other $\boldsymbol{\alpha}$ with this property? We can use Broder tabling with different irreducible $\varphi$ and then concatenate the Galois field footprints. Can we find a condition under which the * property holds? What properties hold when $\varphi$ is not irreducible? It seems statistically fine as long as $\varphi$ has a constant coefficient.