Watermarking Relational Databases
Rakesh Agrawal and Jerry Kiernan
Why Watermark Databases
- Watermark: an intentionally introduced pattern in the data
  - hard to occur by chance
  - hard to find => hard to destroy (robust against malicious attack)
- Increasing use of databases in applications beyond "behind-the-firewall data processing" involving data publication
- Data providers require technical solutions to deter data theft and assert ownership of pirated copies
Assumption
- The value of the database is significantly reduced if all of the k least significant bits of an attribute are dropped or perturbed, but it is acceptable to perturb a small number of attribute values
- Datasets from many data publishers satisfy this assumption (it is acceptable to trade off a small decrease in quality to assert ownership)
  - Tables of parametric specifications (mechanical, electrical, electronic, chemical, etc.), surveys (geological, climatic, etc.), life sciences (e.g. gene expressions)
  - Historical precedent: logarithm tables, astronomical ephemerides, H.P.
- Inappropriate dataset: online bank balances
Desiderata
- Detectability
  - Using a subset of the tuples and attributes
- Robustness
  - Updates and malicious attacks
- Incremental Updatability
  - On tuple insert/update/delete
- Imperceptibility
  - Hard to infer the presence of a watermark
- Blind System
  - Detection requires neither the original data nor the watermark
- Key-Based System
  - Algorithm is public
  - Security resides in the choice of secret key
Related Work
- Images [BGM95, HG98, M98, DR00]
- Audio [BTH96]
- Text [M94]
- Software [CT00]
Relational data is different from multimedia data
- Multimedia Object
  - Consists of a large number of bits, with considerable redundancy => watermark has a large cover to hide in
  - Relative spatial/temporal positioning of the various pieces of an object does not change
  - Portions of an object cannot be dropped or replaced arbitrarily without causing perceptual changes in the object
- Database Relation
  - Consists of tuples, each of which represents a separate object => watermark needs to be spread over these separate objects
  - Tuples of a relation constitute a set, and there is no implied ordering between them
  - A pirate can easily drop some tuples/attributes or substitute them with tuples/attributes from other relations
- Need watermarking techniques designed to take into account the special characteristics of relational data
Techniques
- Introduce watermarks across a fraction of the tuples in a database relation
- Detect the watermark by retrieving a subset of the tuples
- Use statistical hypothesis testing to locate the watermark even in the presence of updates to the data
Message Authentication Code
- h = H(M), where H is a one-way hash function and M is a message
  - Given M, easy to compute h
  - Given h, hard to compute M
  - Given M, hard to find M' such that H(M) = H(M')
- MD5 and SHA are good choices for H
- A MAC is a one-way hash function that depends on a key K
- We use: F(r.P) = H(K o H(K o r.P)), where r.P is the primary key of tuple r, K is the secret key, and o denotes concatenation
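The construction F(r.P) = H(K o H(K o r.P)) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: SHA-256 is an assumed choice of H (the slide only names MD5 and SHA), and the key and primary-key values are hypothetical.

```python
import hashlib


def H(data: bytes) -> bytes:
    """One-way hash. SHA-256 is an assumption; the slide only says MD5 or SHA."""
    return hashlib.sha256(data).digest()


def F(key: bytes, primary_key: bytes) -> int:
    """F(r.P) = H(K o H(K o r.P)), returned as a non-negative integer."""
    return int.from_bytes(H(key + H(key + primary_key)), "big")


K = b"owner-secret-key"            # hypothetical secret key
value = F(K, b"1001")              # MAC of a hypothetical primary key
print(value % 1000)                # the MAC can then be reduced modulo a parameter
```

Because the result depends on K, someone without the key cannot predict which value a given primary key maps to.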
Watermarking Algorithm
- Determine the attribute(s) to be watermarked, the gap, and the number of LSBs
- For each tuple r, compute the MAC and use it to:
  - Establish whether r falls into a gap (tuples in a gap are not marked)
  - Select the attribute to be marked
  - Determine the bit position to contain the mark
  - Compute the mark's value
  - Update the attribute's value to reflect the watermark, if necessary
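The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the slides do not say how each decision is derived from the MAC, so this sketch takes four separate 64-bit words of a SHA-256 based MAC (one per decision), treats attributes as Python integers, and uses hypothetical names (`decisions`, `watermark`, `gap`, `n_lsb`).

```python
import hashlib
import struct


def decisions(key: bytes, pk: str, gap: int, n_attrs: int, n_lsb: int):
    """Return (attr_index, bit_index, mark_bit), or None if the tuple is in a gap."""
    inner = hashlib.sha256(key + pk.encode()).digest()
    digest = hashlib.sha256(key + inner).digest()      # H(K o H(K o r.P))
    # Assumption: one 64-bit word of the digest per decision.
    sel, attr, bit, mark = struct.unpack(">4Q", digest)
    if sel % gap != 0:
        return None                                    # tuple falls into a gap
    return attr % n_attrs, bit % n_lsb, mark & 1


def watermark(rows, pk_name, attrs, key, gap, n_lsb):
    """Mark integer attributes in place; return the number of tuples selected."""
    selected = 0
    for row in rows:
        d = decisions(key, str(row[pk_name]), gap, len(attrs), n_lsb)
        if d is None:
            continue
        attr_idx, bit, mark = d
        name = attrs[attr_idx]
        # Force the chosen bit to the mark value; a no-op if it already matches.
        row[name] = (row[name] & ~(1 << bit)) | (mark << bit)
        selected += 1
    return selected


rows = [{"id": i, "a1": 1000 + i, "a2": 2000 + i} for i in range(10_000)]
n = watermark(rows, "id", ["a1", "a2"], b"owner-secret-key", gap=10, n_lsb=2)
print(n, "of", len(rows), "tuples were selected for marking")
```

With `gap=10`, roughly one tuple in ten is selected, and each selected tuple has at most one low-order bit of one attribute changed.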
Technique
[Figure: a relation with primary key PK and attributes A1 to A4, shown before and after watermarking. Annotations: PK5 not selected because it is in a gap; B2 of A1 selected for PK1; value not changed because Mark = 1; value changed, Mark = 1.]
Without the Private Key, the Watermark is Hard to Destroy
- Which tuple contains a mark
- Which attribute got marked
- Which bit position got marked
- The expected value of a mark
Detection Algorithm
- Locate suspicious data and extract a sample that might contain the watermark
- For each tuple r, compute the MAC:
  - If r doesn't fall into a gap, extract the mark bit value
- Count the number of successes (matching mark bits) among the Bernoulli trials (tuples tested)
- Apply statistical analysis to establish the presence of the watermark
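A detection pass along these lines might look like the sketch below. It is illustrative only: it assumes the same hypothetical MAC-to-decision mapping as an embedding step would use (one 64-bit word of a SHA-256 based MAC per decision), integer attributes, and a chance-match probability of 1/2 for an unmarked bit. The names `decisions`, `detect`, and `chance_probability` are not from the source.

```python
import hashlib
import math
import struct


def decisions(key: bytes, pk: str, gap: int, n_attrs: int, n_lsb: int):
    """Return (attr_index, bit_index, mark_bit), or None if the tuple is in a gap."""
    inner = hashlib.sha256(key + pk.encode()).digest()
    digest = hashlib.sha256(key + inner).digest()
    sel, attr, bit, mark = struct.unpack(">4Q", digest)  # assumed mapping
    if sel % gap != 0:
        return None
    return attr % n_attrs, bit % n_lsb, mark & 1


def detect(rows, pk_name, attrs, key, gap, n_lsb):
    """Return (matches, total) over the tuples that should carry a mark."""
    matches = total = 0
    for row in rows:
        d = decisions(key, str(row[pk_name]), gap, len(attrs), n_lsb)
        if d is None:
            continue
        attr_idx, bit, mark = d
        name = attrs[attr_idx]
        if name not in row:          # attribute missing from the suspicious sample
            continue
        total += 1                   # one Bernoulli trial
        if (row[name] >> bit) & 1 == mark:
            matches += 1             # one success
    return matches, total


def chance_probability(matches: int, total: int) -> float:
    """P(at least `matches` successes in `total` fair coin flips), exact."""
    if total == 0:
        return 1.0
    return sum(math.comb(total, i) for i in range(matches, total + 1)) / 2 ** total


sample = [{"id": i, "a1": 7 * i, "a2": 11 * i} for i in range(2_000)]  # unmarked data
m, t = detect(sample, "id", ["a1", "a2"], b"owner-secret-key", gap=10, n_lsb=2)
print(m, t, chance_probability(m, t))
```

The watermark would be declared present when `chance_probability` falls below the chosen significance level. Only the secret key and the suspicious data are needed, which matches the blind, key-based design in the desiderata.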
Extensions to the Algorithm
- Relations with no primary keys
- Null values
Evaluation
- Analysis
- Experiments
  - Forest Cover Type dataset from UCI repository
Attacks
- Bit attacks
  - Randomize, zero-out, bit flipping, rounding, translation
- Subset attack
  - Select subset of tuples and attributes
- Mix-and-match attack
  - Combine data from multiple sources
- Additive attack
  - Insert new watermark over existing watermark
- Invertibility attack
  - Counterfeit watermark
- Benign updates
Cumulative Binomial Probability Distribution
\[ b(k; n, p) = \binom{n}{k} p^{k} (1-p)^{n-k} \]
\[ B(k; n, p) = \sum_{i=k}^{n} b(i; n, p) \]
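As a small worked example of these two definitions, the sketch below evaluates them exactly with rational arithmetic. The function names mirror the slide's b and B; the sample values n = 10, k = 8, p = 1/2 are chosen only for illustration.

```python
from fractions import Fraction
from math import comb


def b(k: int, n: int, p: Fraction) -> Fraction:
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def B(k: int, n: int, p: Fraction) -> Fraction:
    """Probability of at least k successes in n Bernoulli trials."""
    return sum(b(i, n, p) for i in range(k, n + 1))


half = Fraction(1, 2)
print(B(8, 10, half))   # prints 7/128, i.e. (45 + 10 + 1) / 1024
```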
Parameters and Defaults
- Number of tuples: 1 million
- Number of marked attributes: 1
- Number of least significant bits: 1
- Fraction of tuples marked: 1/1000
- Significance level for hypothesis test: 0.01
Proportion of correctly marked tuples required for detectability
- The proportion of correctly marked tuples needed for detectability decreases as the number of marks increases
- For 1M tuples and 10% of tuples marked, that proportion is below 51%
- Illustrates the tolerance of the watermark to updates
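The trend on this slide can be reproduced with a short calculation. The sketch below assumes that, absent a watermark, each tested bit matches with probability 1/2, and it looks for the smallest number of matches whose tail probability under the binomial distribution falls below the significance level. The helper names `log_b` and `min_matches` are hypothetical, and the tail is summed in floating point rather than exactly.

```python
import math


def log_b(k: int, n: int, p: float = 0.5) -> float:
    """Natural log of the binomial probability of exactly k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def min_matches(n: int, alpha: float, p: float = 0.5) -> int:
    """Smallest k with P(at least k successes) < alpha; n + 1 if no such k <= n."""
    tail = 0.0
    for k in range(n, -1, -1):           # grow the tail from the far end
        tail_with_k = tail + math.exp(log_b(k, n, p))
        if tail_with_k >= alpha:
            return k + 1                 # including k would exceed alpha
        tail = tail_with_k
    return 0


for marks in (100, 1_000, 100_000):
    k = min_matches(marks, alpha=0.01)
    print(marks, k, round(k / marks, 4))
```

With 100,000 marks (10% of 1 million tuples) the required proportion comes out only slightly above one half, in line with the "below 51%" figure on the slide, while for 100 marks it is noticeably higher.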
Proportion of correctly marked tuples needed for decreasing alpha
- The data can tolerate a large number of updates while maintaining detectability with high confidence
Excess Error in an Attack
- The attacker can be forced to make orders of magnitude more errors than the owner, making the attacker's data economically much less attractive than the owner's
Samples in Which the Watermark Could be Detected When the Attacker has Dropped Tuples
- Watermark detected in a subset of the tuples of a watermarked relation
- Selectivity gives the sample size
- Each experiment repeated 100 times
- Results show the percentage of trials in which the watermark could be detected
Samples in Which the Watermark Could be Detected When the Attacker has Dropped Some Attributes
- Watermark detected in a subset of the attributes and tuples of a watermarked relation
- Watermark spread across 10 attributes
- Selectivity gives the sample size
- Each experiment repeated 100 times
- Results show the percentage of trials in which the watermark could be detected
Mix-and-Match Attack
- Minimum fraction f of tuples from the watermarked relation needed for detectability
- N is the relation size
- N x f = tuples from the marked relation
- N x (1 - f) = tuples from other relations
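One rough way to reason about the minimum f, under simplifying assumptions that are not stated on the slide: suppose the MAC selects \omega tuples of the mixed relation for testing, every selected tuple that came from the watermarked relation still carries its correct mark, and a selected tuple from another relation matches by chance with probability 1/2. If \tau is the number of matches the hypothesis test requires (both \omega and \tau are symbols introduced here for illustration), then in expectation:

```latex
\mathbb{E}[\text{matches}]
  = \omega f + \tfrac{1}{2}\,\omega\,(1 - f)
  = \tfrac{1}{2}\,\omega\,(1 + f),
\qquad
\tfrac{1}{2}\,\omega\,(1 + f) \ge \tau
\;\iff\;
f \ge \frac{2\tau}{\omega} - 1 .
```

This ignores sampling variance, so it is only a back-of-the-envelope estimate and not the analysis behind the reported result.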
Summary
- Provided desiderata for a system for watermarking database relations
- First watermarking algorithm for database relations
  - No dependence on tuple ordering
  - Robust against attacks
  - Watermark can be incrementally updated
  - Requires neither the original relation nor the watermark for detection
Future Work
- Watermarking extensions to handle non-numeric attributes
- New algorithms for fingerprinting to track multiple sources of piracy