Data Perturbation An Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009.

Data Perturbation An Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009

I. Introduction Most security concerns can be handled with the grant command. Others require a view approach But what happens if we wish to disclose partial information in a table field but not the individual records?

I. Introduction – The Problem The problem is to allow for statistical analysis of data but still protecting individual records. Example: Given a database of cancer patients. Allow for a researcher to know what the cancer rate is, but not that patient X has cancer.

I. Introduction – The Problem NameDeptSexSalary BobCSM30,000 FredCSM100,000 MaryCSF50,000 TimITM50,000 TomITM60,000 MarthaITF70,000 KenITM50,000

II. Related Work Suppression Anonymization Partitioning Data Logging Conceptual Hybrid Perturbation

II. Related Work- Suppression Must access n records Only n queries per day There are known methods to get around these protections.

II. Related Work- Anonymization Replace the identifying fields with special characters. This method can still be compromised.

II. Related Work- Anonymization NameDeptSexSalary *CSM30,000 *CSM100,000 *CSF50,000 *ITM50,000 *ITM60,000 *ITF70,000 *ITM50,000

II. Related Work- Partitioning All queries must access more than one band of records.

II. Related Work- Partitioning NameDeptSexSalary BobCSM30,000 FredCSM100,000 MaryCSF50,000 TimITM50,000 TomITM60,000 MarthaITF70,000 KenITM50,000

II. Related Work – Logging A log of every query ran is kept. Before a query is allowed all possible inferences are checked. If it releases one record, then that query is not permitted. Soon there are no queries allowed.

II. Related work – Conceptual Design the database so that no confidential information is stored.

II. Related Work – Hybrid Try using a combination of several of these methods.

II. Related Work - Perturbation Output Perturbation Data Perturbation Liew Perturbation Nielson Perturbation Note: Perturbation means data changing

II. Related Work – Output Perturbation Output perturbation works by changing the output of the query not the physical data.

II. Related Work – Output Perturbation

II. Related Work – Data Perturbation Data perturbation works by changing the physical data. Two common methods: 1.To add a random value to each value 2.To multiple each value by a random value

II. Related Work – Data Perturbation

II. Related Work – Liew Perturbation Liew perturbation steps: 1.Calculate the average, standard deviation, and count of the data 2.Generate a new data set with the same average, standard deviation and count 3.Sort both data sets in ascending order 4.Swap the perturbed values with each other.

II. Related Work–Liew Perturbation

III Hypothesis and Proof Prove: H1: Nielson perturbation is better than No Perturbation H2: Nielson perturbation is better than data perturbation (20%) H3: Nielson perturbation is better than Liew perturbation (20%)

III Hypothesis and Proof Disprove: H1: Nielson perturbation is not better than No Perturbation H2: Nielson perturbation is not better than data perturbation (20%) H3: Nielson perturbation is not better than Liew perturbation (20%)

IV. Methodology What is Nielson Perturbation? Calculating the absolute error... Finding optimal values for Nielson perturbation... Experimental design... Conducting the experiment...

IV. Methodology- Nielson Perturbation Nielson Perturbation is a form of data perturbation. Each value is multiplied by a random value between alpha and beta for the first gamma records in the data set. This value is randomly negated.

IV Methodology- Nielson Perturbation

IV. Methodology - Nielson

IV. Methodology- Alpha/Beta/Gamma What are the best values? An evolutionary algorithm was deployed. The results after several days of computation were: 1.Alpha = 2.09 2.Beta = 1.18 3.Gamma = 66.87

IV. Methodology- Evolutionary Results

IV. Methodology- Nielson Perturbation

IV. Methodology- The Method Calculate the average error of each method. Use the law of large numbers: An average of averages approaches a normal distribution as the sample size grows.

IV. Methodology- The Method Use a t-test to calculate whether two sample means are statistically different from each other with a significance of 95%

IV. Methodology- Monte Carlo Simulation Randomly generate 100,000 databases and execute 100’s of queries. I will use arrays to test the accuracy. Speed is of major importance here. Arrays vs. databases do not matter for calculating the accuracy of query outputs

IV. Methodology- Calculating the average error The error should be bigger with smaller query sizes. The error should be smaller with larger query sizes.

IV.Methodology- The Fitness Function e=|x-x’| If q < n/2 fitness=100-e Else fitness=e Smaller fitness scores are better

V. Results and Conclusions

V. Results and Conclusions Significance There is a real need for partial disclosure of a field in a table. My method insures a higher degree of security. My method still allows for release of averages and totals.

VI. Further Studies Transformation Times On the fly perturbing

Data Perturbation An Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009.

Similar presentations

Presentation on theme: "Data Perturbation An Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data Perturbation An Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009.

Similar presentations

Presentation on theme: "Data Perturbation An Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009."— Presentation transcript:

Similar presentations

About project

Feedback