
1 Privacy Preserving Data Mining Seminar By Nita Dimble

2 Contents
Difference between security and privacy
Data mining
Privacy preserving data mining
Techniques of privacy preserving data mining
Applications of privacy preserving data mining
Limitations of privacy
Conclusion
References

3 Difference between security and privacy
Data security, according to the common definition, is the “confidentiality, integrity and availability” of data. Privacy, on the other hand, is the appropriate use of information.

4 Data Mining Data mining is a recently emerged field connecting the three worlds of Databases, Artificial Intelligence and Statistics. The information age has enabled many organizations to gather large volumes of data. However, the usefulness of this data is negligible if “meaningful information” or “knowledge” cannot be extracted from it. Data mining, otherwise known as knowledge discovery, attempts to answer this need.

5 Privacy Preserving Data Mining
Concern over the misuse of personal data has made people increasingly unwilling to share their data, frequently resulting in individuals either refusing to share their data or providing incorrect data. Privacy preserving data mining has become increasingly popular because it allows privacy-sensitive data to be shared for analysis purposes. In recent years it has been studied extensively, owing to the wide proliferation of sensitive information on the internet. The problem has also become more important because of the increasing ability to store personal data about users and the increasing sophistication of data mining algorithms that leverage this information.

6 Techniques of Privacy Preserving Data Mining

7 Method of anonymization
When releasing microdata for research purposes, one needs to limit disclosure risk to an acceptable level while maximizing data utility. To limit disclosure risk, Sweeney [1] introduced the k-anonymity privacy requirement, which requires each record in an anonymized table to be indistinguishable from at least k-1 other records within the dataset with respect to a set of quasi-identifier attributes. To achieve the k-anonymity requirement, both generalization and suppression are used for data anonymization.
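The following is a minimal, hypothetical sketch of generalization and suppression in Python (assuming pandas); the toy table, the choice of quasi-identifiers (ZipCode, Age), and the two fixed generalization steps are illustrative, not the exact scheme of [1]:

```python
import pandas as pd

# Toy table: ZipCode and Age are quasi-identifiers, Disease is sensitive.
df = pd.DataFrame({
    "ZipCode": ["13053", "13068", "13068", "14853", "14850", "14853"],
    "Age":     [28, 29, 21, 23, 23, 27],
    "Disease": ["Heart", "Heart", "Flu", "Flu", "Heart", "Cancer"],
})

def generalize_and_suppress(df, k=3):
    """Apply two fixed generalization steps (mask ZipCode digits, bucket
    Age into decades), then suppress any record whose equivalence class
    over the quasi-identifiers is still smaller than k."""
    g = df.copy()
    g["ZipCode"] = g["ZipCode"].str[:3] + "**"          # 13053 -> 130**
    g["Age"] = (g["Age"] // 10 * 10).astype(str) + "s"  # 28 -> "20s"
    sizes = g.groupby(["ZipCode", "Age"])["Disease"].transform("size")
    return g[sizes >= k]    # suppress records in classes smaller than k

print(generalize_and_suppress(df, k=3))
```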

8 Anonymization Technique
Merits: This method protects respondents' identities while releasing truthful information. Demerits: While k-anonymity protects against identity disclosure, it does not provide sufficient protection against attribute disclosure, and it is vulnerable to two attacks: the homogeneity attack and the background knowledge attack. These limitations stem from two assumptions. First, it may be very hard for the owner of a database to determine which of the attributes are or are not available in external tables. Second, the k-anonymity model assumes a certain method of attack, while in real scenarios there is no reason why the attacker should not try other methods.

9 Randomized Response Technique
[Figure: the model of randomization]

10 Randomized Response Technique
The method of randomization can be described as follows. Consider a set of data records denoted by X = {x1, ..., xN}. To each record xi in X we add a noise component drawn from the probability distribution fY(y). These noise components are drawn independently and are denoted y1, ..., yN. Thus, the new set of distorted records is x1 + y1, ..., xN + yN. We denote this new set of records by z1, ..., zN.
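As a small illustration, the distortion step can be written in a few lines of Python (assuming NumPy); the normal data distribution and the uniform noise range are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(50, 10, size=1000)        # original records x1, ..., xN (hypothetical)
y = rng.uniform(-20, 20, size=x.shape)   # independent noise y1, ..., yN drawn from fY
z = x + y                                # distorted records z1, ..., zN given to the miner
```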

11 Merits: The randomization method is a simple technique that can be easily implemented at data collection time, and it has been shown to be useful for hiding individual data in privacy preserving data mining. The method is also relatively efficient. However, it results in high information loss. Demerits: The randomized response technique is not suited to databases with multiple attributes.

12 Perturbation approach
The perturbation approach works under the requirement that the data service is not allowed to learn or recover precise records. This restriction naturally leads to some challenges. Since the method does not reconstruct the original data values but only distributions, new algorithms need to be developed that use these reconstructed distributions to perform mining of the underlying data. This means that for each individual data mining problem, such as classification, clustering, or association rule mining, a new distribution-based algorithm needs to be developed.
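As a sketch of what such distribution reconstruction can look like, the following Python function (assuming NumPy) performs an iterative Bayesian update in the spirit of Agrawal and Srikant [2]; the binning, iteration count, and commented usage are illustrative assumptions:

```python
import numpy as np

def reconstruct_distribution(z, noise_pdf, bin_edges, iters=100):
    """Estimate the distribution of the original values X from the
    observed z_i = x_i + y_i, given the known noise density fY."""
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2
    f_x = np.full(len(centers), 1.0 / len(centers))   # uniform initial guess
    for _ in range(iters):
        # Posterior probability that each z_i originated in each bin.
        post = noise_pdf(z[:, None] - centers[None, :]) * f_x
        post /= post.sum(axis=1, keepdims=True)
        f_x = post.mean(axis=0)                       # updated estimate of fX
    return centers, f_x

# Example with the uniform noise from the earlier sketch:
# uniform_pdf = lambda y: ((y >= -20) & (y <= 20)) / 40.0
# centers, f_x = reconstruct_distribution(z, uniform_pdf, np.linspace(0, 100, 21))
```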

13 Merits: The perturbation approach treats the different attributes independently. Demerits: The method does not reconstruct the original data values but only distributions, so new algorithms have to be developed that use these reconstructed distributions to carry out mining of the available data.

14 Condensation approach
The condensation approach constructs constrained clusters in the data set and then generates pseudo-data from the statistics of these clusters. The technique is called condensation because it uses condensed statistics of the clusters in order to generate the pseudo-data.
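A minimal sketch of the condensation idea in Python (assuming NumPy) follows; for simplicity, the groups of size k are formed naively by sorting on the first attribute rather than by the constrained clustering of the actual approach:

```python
import numpy as np

rng = np.random.default_rng(1)

def condense(data, k):
    """Toy condensation sketch: split the records into groups of size k,
    keep only each group's mean and covariance, and sample pseudo-data
    from those condensed statistics."""
    order = np.argsort(data[:, 0])          # naive grouping by first attribute
    pseudo = []
    for start in range(0, len(data) - k + 1, k):
        group = data[order[start:start + k]]
        mean, cov = group.mean(axis=0), np.cov(group, rowvar=False)
        # Pseudo-records share the group's statistics, not its members.
        pseudo.append(rng.multivariate_normal(mean, cov, size=k))
    return np.vstack(pseudo)

original = rng.normal(size=(100, 3))
pseudo = condense(original, k=10)   # same format and aggregate statistics
```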

15 Merits: This approach works with pseudo-data rather than with modifications of the original data, which helps preserve privacy better than techniques that simply modify the original data. In addition, the use of pseudo-data no longer necessitates the redesign of data mining algorithms, since the pseudo-data have the same format as the original data. Demerits: Because the pseudo-data only approximate the condensed statistics of each cluster, some information about the original records is inevitably lost.

16 Cryptographic technique
Another branch of privacy preserving data mining, which uses cryptographic techniques, was also developed. This branch became hugely popular [6] for two main reasons. Firstly, cryptography offers a well-defined model for privacy, which includes methodologies for proving and quantifying it. Secondly, there exists a vast toolset of cryptographic algorithms and constructs for implementing privacy-preserving data mining algorithms. However, recent work has pointed out that cryptography does not protect the output of a computation; instead, it prevents privacy leaks in the process of computation. Thus, it falls short of providing a complete answer to the problem of privacy preserving data mining.

17 Merits: Cryptography offers a well-defined model for privacy, which includes methodologies for proving and quantifying it. There exists a vast toolset of cryptographic algorithms and constructs to implement privacy preserving data mining algorithms. Demerits: This approach is especially difficult to scale when more than a few parties are involved.

18 Distributed Privacy Preserving Data Mining
The key goal in most distributed methods for privacy-preserving data mining (PPDM) is to allow computation of useful aggregate statistics over the entire data set without compromising the privacy of the individual data sets within the different participants. Thus, the participants may wish to collaborate in obtaining aggregate results, but may not fully trust each other in terms of the distribution of their own data sets. For this purpose, the data sets may either be horizontally partitioned or be vertically partitioned. In horizontally partitioned data sets, the individual records are spread out across multiple entities, each of which has the same set of attributes. In vertical partitioning, the individual entities may have different attributes (or views) of the same set of records.
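As one concrete building block often used in distributed PPDM over horizontally partitioned data, the classic secure-sum protocol lets parties compute a joint total without revealing their individual inputs. The sketch below simulates the ring of parties in plain Python; the hospital counts are made up:

```python
import random

def secure_sum(local_values, modulus=2**32):
    """Classic secure-sum sketch: party 1 masks its value with a random
    number R and passes the running total around the ring; each party
    adds its own value; party 1 finally subtracts R. Every intermediate
    total looks uniformly random, so no party learns another's input."""
    R = random.randrange(modulus)
    running = (R + local_values[0]) % modulus   # party 1 masks its input
    for v in local_values[1:]:                  # each party adds its share
        running = (running + v) % modulus
    return (running - R) % modulus              # party 1 removes the mask

# Three hospitals computing a joint patient count without revealing their own:
print(secure_sum([120, 75, 310]))   # -> 505
```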

19 Applications Of Privacy Preserving Data Mining

20 Medical Databases The Scrub and Datafly Systems
The Scrub system uses numerous detection algorithms that compete in parallel to determine when a block of text corresponds to a name, an address, or a phone number. These local knowledge sources compete with one another based on the certainty of their findings. It has been shown that such a system is able to remove more than 99% of the identifying information from the data. The Datafly system was one of the earliest practical applications of privacy-preserving transformations. It was designed to prevent identification of the subjects of medical records, which may be stored in multidimensional format.
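The real Scrub system relies on many competing knowledge sources; the toy Python sketch below only hints at the idea with two regular-expression detectors, whose patterns are purely illustrative:

```python
import re

# Two toy "detection algorithms" in the spirit of Scrub: each pattern
# claims a span of text as identifying information, which is replaced.
DETECTORS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(text):
    for label, pattern in DETECTORS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(scrub("Call John at 607-555-0123, SSN 123-45-6789."))
# -> "Call John at [PHONE], SSN [SSN]."
```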

21 Bioterrorism Applications
Often a biological agent such as anthrax produces symptoms similar to those of common respiratory diseases such as cough, cold, and flu. The key is to quickly distinguish a true anthrax attack from a normal outbreak of a common respiratory disease. Therefore, in order to identify such attacks, it is necessary to track incidences of these common diseases as well, and the corresponding data would need to be reported to public health agencies. However, the common respiratory diseases are not reportable diseases by law. The proposed solution is that of “selective revelation”, which initially allows only limited access to the data. In the event of suspicious activity, however, it allows a “drill-down” into the underlying data, providing more identifiable information in accordance with public health law.

22 Genomic Privacy A key threat here is re-identification, in which the uniqueness of patient visit patterns is exploited in order to make identifications. The premise of this work is that patients often visit, and leave behind genomic data at, various distributed locations and hospitals. The hospitals usually separate the clinical data from the genomic data and make the genomic data available for research purposes. While the data is seemingly anonymous, the visit-location pattern of the patients is encoded in the set of sites from which the data is released. It has been shown that this information may be combined with publicly available data in order to perform unique re-identifications.

23 Homeland Security Applications
A number of applications for homeland security are inherently intrusive because of the very nature of surveillance. A broad overview has been provided of how privacy-preserving techniques may be used in order to deploy these applications effectively without violating user privacy. Some examples of such applications are as follows:

24 Credential Validation Problem
In this problem, we are trying to match the subject of the credential to the person presenting the credential. For example, the theft of social security numbers presents a serious threat to homeland security. In the credential validation approach an attempt is made to exploit the semantics associated with the social security number to determine whether the person presenting the SSN credential truly owns it.

25 Web Camera Surveillance
One possible method for surveillance is the use of publicly available webcams, which can be used to detect unusual activity. The approach can be made more privacy-sensitive by extracting only facial-count information from the images and using it to detect unusual activity. It has been hypothesized that unusual activity can be detected in terms of facial counts alone, rather than using more specific information about particular individuals.
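A toy sketch of the facial-count idea in Python: only a per-frame count is retained, and a frame is flagged when its count deviates strongly from the historical mean. The three-sigma threshold and the sample counts are arbitrary assumptions:

```python
import statistics

def unusual_activity(history, current_count, threshold=3.0):
    """Flag a frame whose facial count deviates from the historical mean
    by more than `threshold` standard deviations. Only counts are kept,
    never images of identifiable individuals."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0   # avoid division by zero
    return abs(current_count - mean) / stdev > threshold

counts = [4, 5, 6, 5, 4, 5, 6, 5]     # typical per-frame facial counts
print(unusual_activity(counts, 42))    # -> True: a sudden crowd
```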

26 Video-Surveillance In the context of sharing video-surveillance data, a major threat is the use of facial recognition software, which can match the facial images in videos to the facial images in a driver's license database. A more balanced approach is to use selective downgrading of the facial information, so that it limits the ability of facial recognition software to reliably identify faces while maintaining facial details in images. The algorithm is referred to as k-Same, and the key is to identify faces that are somewhat similar and then construct new faces that combine features from these similar faces.
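A minimal NumPy sketch of the k-Same idea on face feature vectors follows; real implementations operate on aligned face images or eigenface coefficients, and the simplified handling of leftover faces here is an assumption of the sketch:

```python
import numpy as np

def k_same(faces, k):
    """k-Same sketch: each published face is the average of a cluster of
    k similar faces, so a matcher cannot tell cluster members apart."""
    faces = np.asarray(faces, dtype=float)
    out = np.empty_like(faces)
    remaining = list(range(len(faces)))
    while remaining:
        i = remaining.pop(0)
        # k-1 nearest unprocessed neighbours of face i (fewer, if few remain)
        dists = np.linalg.norm(faces[remaining] - faces[i], axis=1)
        nearest = [remaining[j] for j in np.argsort(dists)[:k - 1]]
        cluster = [i] + nearest
        out[cluster] = faces[cluster].mean(axis=0)   # same surrogate for all k
        remaining = [r for r in remaining if r not in nearest]
    return out

published = k_same(np.random.default_rng(2).normal(size=(9, 16)), k=3)
```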

27 Limitations of Privacy: The Curse of Dimensionality
Many privacy-preserving data-mining methods are inherently limited by the curse of dimensionality in the presence of public information. For example, the k-anonymity method has been analyzed in the presence of increasing dimensionality. The curse of dimensionality becomes especially important when adversaries may have considerable background information, as a result of which the boundary between pseudo-identifiers and sensitive attributes may become blurred.

28 Conclusion With the development of data analysis and processing techniques, the privacy of individuals and companies is inevitably put at risk when data are released or shared in order to mine useful decision information and knowledge; this gave birth to the research field of privacy preserving data mining. Since all of the proposed methods only approximate the goal of privacy preservation, these approaches need to be further perfected, or more efficient methods developed. To address these issues, the following problems should be widely studied:
1. In distributed privacy preserving data mining, efficiency is an essential issue. More efficient algorithms need to be developed, achieving a balance between disclosure cost, computation cost, and communication cost.
2. Side effects are unavoidable in the data sanitization process. How to reduce their negative impact on privacy preservation needs to be considered carefully.

29 References
1. L. Sweeney (2002). "k-anonymity: a model for protecting privacy." International Journal on Uncertainty, Fuzziness and Knowledge-based Systems.
2. R. Agrawal and R. Srikant (2000). "Privacy-preserving data mining." In Proc. SIGMOD '00.
3. G. Nayak and S. Devi. "A survey on privacy preserving data mining: approaches and techniques." Department of Computer Science and Engineering, ITER, Siksha 'O' Anusandhan University, Khandagiri Square, Bhubaneswar, Orissa, India.

30 Thank You

