Data Mining Technologies for Digital Libraries & Web Information Systems Ramakrishnan Srikant.

Talk Outline
Taxonomy Integration (WWW 2001, with R. Agrawal)
Searching with Numbers
Privacy-Preserving Data Mining

Taxonomy Integration
B2B electronics portal: 2000 categories, 200K datasheets
[Figure: Master Catalog with root ICs, categories DSP, Mem., Logic, and documents a, b, c, d, e, f; New Catalog with root ICs, categories Cat1, Cat2, and documents x, y, z, w.]

Taxonomy Integration (2)
After integration:
[Figure: ICs with categories DSP, Mem., Logic, now holding documents a, b, c, d, e, f, x, y, z, w.]

Goal
Use affinity information in new catalog.
–Products in same category are similar.
Accuracy boost depends on match between two categorizations.

Problem Statement
Given
–master categorization M: categories C_1, C_2, …, C_n; set of documents in each category
–new categorization N: categories S_1, S_2, …, S_n; set of documents in each category
Find the category in M for each document in N
–Standard Alg: Estimate Pr(C_i | d)
–Enhanced Alg: Estimate Pr(C_i | d, S)

Naïve Bayes Classifier
Estimate probability of document d belonging to class C_i: [equation not preserved]
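The slide's equations are missing from the transcript. For orientation, the textbook Naïve Bayes formulation is given below; it is an assumption that the slide used this form and this notation.

```latex
\Pr(C_i \mid d) = \frac{\Pr(C_i)\,\Pr(d \mid C_i)}{\Pr(d)},
\qquad
\Pr(d \mid C_i) = \prod_{t \in d} \Pr(t \mid C_i)
```

Here Pr(C_i) is usually estimated as the fraction of training documents in C_i, and t ranges over the word occurrences in d.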

Enhanced Naïve Bayes
Standard: [equation not preserved]
Enhanced: [equation not preserved]
How do we estimate Pr(C_i | S)?
–Apply standard Naïve Bayes to get number of documents in S that are classified into C_i.
–Incorporate weight w reflecting match between two taxonomies.
Only affects classification of borderline documents.
–For w = 0, default to standard classifier.

Enhanced Naïve Bayes (2)
Use tuning set to determine w.
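A minimal sketch of how the weight w could enter the prior, assuming Pr(C_i | S) is proportional to |C_i| × (number of documents in S that the standard classifier puts in C_i)^w. That formula, the function names, and the toy numbers are assumptions for illustration; the transcript does not preserve the slide's equation.

```python
def adjusted_priors(master_sizes, predicted_counts, w):
    """Estimate Pr(C_i | S) for every master category C_i.

    master_sizes: {category: number of documents in the master catalog}
    predicted_counts: {category: number of documents of source category S
                       that standard Naive Bayes assigned to that category}
    w: weight reflecting how well the two taxonomies match (w = 0 ignores S)
    """
    raw = {c: size * (predicted_counts.get(c, 0) ** w)
           for c, size in master_sizes.items()}
    total = sum(raw.values())
    return {c: v / total for c, v in raw.items()}


def classify(likelihoods, priors):
    """Pick the category maximizing prior * Pr(d | C_i)."""
    return max(priors, key=lambda c: priors[c] * likelihoods[c])


# Hypothetical numbers: most documents of source category S went to DSP.
master_sizes = {"DSP": 100, "Mem": 100, "Logic": 100}
predicted_counts = {"DSP": 8, "Mem": 2}
# A borderline document: the text alone slightly favours Mem.
likelihoods = {"DSP": 0.30, "Mem": 0.32, "Logic": 0.01}

standard = classify(likelihoods, adjusted_priors(master_sizes, predicted_counts, 0))
enhanced = classify(likelihoods, adjusted_priors(master_sizes, predicted_counts, 1))
```

With w = 0 the priors reduce to the master category sizes, so the borderline document stays in Mem; with w = 1 the affinity to DSP tips it over.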

Intuition behind Algorithm
[Figure: Standard Algorithm vs. Enhanced Algorithm.]

Electronic Parts Dataset 1150 categories; 37,000 documents

Yahoo & OpenDirectory
5 slices of the hierarchy: Autos, Movies, Outdoors, Photography, Software
–Typical match: 69%, 15%, 3%, 3%, 1%, ….
Merging Yahoo into OpenDirectory
–30% fewer errors (14.1% absolute difference in accuracy)
Merging OpenDirectory into Yahoo
–26% fewer errors (14.3% absolute difference)

Summary
New algorithm for taxonomy integration.
–Exploits affinity information in the new (source) taxonomy categorizations.
–Can do substantially better, and never does significantly worse than standard Naïve Bayes.
Open Problems: SVM, Decision Tree, ...

Talk Outline
Taxonomy Integration
Searching with Numbers (WWW 2002, with R. Agrawal)
Privacy-Preserving Data Mining

Motivation
A large fraction of the useful web consists of specification documents.
–Attribute-value pairs embedded in text.
Examples:
–Data sheets for electronic parts.
–Classified ads.
–Product catalogs.

Search Engines treat Numbers as Strings
Search for 6798.32 (lunar nutation cycle)
–Returns 2 pages on Google
–However, search for 6798.320 yielded no page on Google (and all other search engines)
Current search technology is inadequate for retrieving specification documents.

Data Extraction is hard
Synonyms for attribute names and units.
–"lb" and "pounds", but no "lbs" or "pound".
Attribute names are often missing.
–No "Speed", just "MHz Pentium III"
–No "Memory", just "MB SDRAM"
Example: 850 MHz Intel Pentium III; 192 MB RAM; 15 GB Hard Disk; DVD Recorder: Included; Windows Me; 14.1 inch display; 8.0 pounds

Searching with Numbers
[Figure: example queries "800 200" and "3 lb" run against raw documents (IBM ThinkPad 750 MHz Pentium 3, 196 MB DRAM, …; Dell Computer 700 MHz Celeron, 256 MB SDRAM, …) and against a Database (IBM ThinkPad (750 MHz, 196 MB), …; Dell (700 MHz, 256 MB), …).]

Reflectivity
If we get a close match on numbers, how likely is it that we have correctly matched attribute names?
–Likelihood ∝ non-reflectivity (of data)
Non-overlapping attributes ⇒ non-reflective.
–Memory: 64 - 512 Mb, Disk: 10 - 40 Gb
Correlations or clustering ⇒ low reflectivity.
–Memory: 64 - 512 Mb, Disk: 10 - 100 Gb

Reflectivity: Examples

Reflectivity: Definition
Let
–D: dataset
–n_i: co-ordinates of point x_i
–reflections(x_i): permutations of n_i
–η(n_i): # of points within distance r of n_i
–ρ(n_i): # of reflections within distance r of n_i
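The defining formula itself is not in the transcript. The sketch below assumes reflectivity is 1 minus the average, over all points, of (points within distance r) divided by (points having at least one reflection within distance r), with the identity permutation counted as a reflection. The function name and that formula are assumptions; only the two counts come from the slide.

```python
import math
from itertools import permutations


def reflectivity(points, r):
    """Assumed definition: 1 - mean(neighbours / reflected neighbours).

    points: list of equal-length numeric tuples (the dataset D)
    r: distance threshold
    """
    total = 0.0
    for p in points:
        # Points of D within distance r of p (p itself is always counted).
        near = sum(1 for q in points if math.dist(p, q) <= r)
        # Points of D with some coordinate permutation within distance r of p.
        near_reflected = sum(
            1 for q in points
            if any(math.dist(p, perm) <= r for perm in permutations(q))
        )
        total += near / near_reflected
    return 1.0 - total / len(points)


# Non-overlapping attribute ranges: swapping coordinates never lands near a point.
non_overlapping = [(64, 10000), (128, 20000)]
# Mirror-image points: each point's reflection coincides with the other point.
mirrored = [(1, 2), (2, 1)]
```

Under this assumed definition the first dataset has reflectivity 0 and the second 0.5. The brute-force permutation loop is only practical for a handful of attributes.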

Algorithm
How to compute match score (rank) of a document for a given query?
How to limit the number of documents for which the match score is computed?

Match Score of a Document
Select k numbers from D yielding minimum distance between Q and D.
Relative distance for each term: [equation not preserved]
Euclidean distance (L_p norm) to combine term distances: [equation not preserved]

Bipartite Graph Matching
Map problem to Bipartite Graph Matching
–k source nodes: corr. to query numbers
–m target nodes: corr. to document numbers
–An edge from each source to k nearest targets.
Assign weight f(q_i, n_j)^p to the edge (q_i, n_j).
[Figure: Query: 20, 60. Doc: 10, 25, 75. Edge weights: .5, .25, .58, .25.]
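A small sketch of the match score, assuming the relative distance is f(q, n) = |q − n| / |q|, which reproduces the edge weights on this slide (query 20, 60 against document numbers 10, 25, 75). It tries every assignment by brute force rather than running a bipartite matching algorithm, so it only illustrates what is being minimized. The function names and the small `eps` guard are assumptions.

```python
from itertools import permutations


def rel_dist(q, n, eps=1e-9):
    """Relative distance of document number n from query number q (assumed form)."""
    return abs(q - n) / (abs(q) + eps)


def match_score(query, doc_numbers, p=1.0):
    """Smallest L_p distance over all ways to pick one distinct document
    number per query number. Returns infinity if the document has fewer
    numbers than the query."""
    k = len(query)
    if len(doc_numbers) < k:
        return float("inf")
    best = min(
        sum(rel_dist(q, n) ** p for q, n in zip(query, chosen))
        for chosen in permutations(doc_numbers, k)
    )
    return best ** (1.0 / p)


# Slide example: 20 pairs with 25 and 60 pairs with 75, each at distance .25.
score = match_score([20, 60], [10, 25, 75], p=1.0)
```

For realistic document sizes the brute-force loop would be replaced by a minimum-weight bipartite matching, as the slide describes.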

Limiting the Set of Documents
Similar to the score aggregation problem [Fagin, PODS 96]
Proposed algorithm is an adaptation of the TA algorithm in [Fagin-Lotem-Naor, PODS 01]

Limiting the Set of Documents (2)
k conceptual sorted lists, one for each query term.
Do round-robin access to the lists. For each document found, compute its distance F(D,Q).
Let n_i := number last looked at for query term q_i
Let τ := [equation not preserved]
Halt when t documents found whose distance <= τ
τ is lower bound on distance of unseen documents
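The round-robin procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the threshold is τ = (Σ_i f(q_i, n_i)^p)^(1/p) with f(q, n) = |q − n| / |q|, materializes the k sorted lists in memory, and scores each document by brute force. All function names are hypothetical.

```python
from itertools import permutations


def rel_dist(q, n, eps=1e-9):
    """Assumed relative distance |q - n| / |q|."""
    return abs(q - n) / (abs(q) + eps)


def match_score(query, numbers, p=1.0):
    """Brute-force minimum L_p distance between query and document numbers."""
    if len(numbers) < len(query):
        return float("inf")
    best = min(
        sum(rel_dist(q, n) ** p for q, n in zip(query, chosen))
        for chosen in permutations(numbers, len(query))
    )
    return best ** (1.0 / p)


def top_t(query, docs, t, p=1.0):
    """Return the t closest (doc_id, distance) pairs, stopping early when possible.

    docs: {doc_id: list of numbers appearing in that document}
    """
    # One list per query term, sorted by distance of each number to that term.
    lists = [
        sorted((rel_dist(q, n), doc_id) for doc_id, nums in docs.items() for n in nums)
        for q in query
    ]
    scores = {}
    for depth in range(len(lists[0])):
        # Round-robin: look at the next entry of every list.
        for lst in lists:
            _, doc_id = lst[depth]
            if doc_id not in scores:
                scores[doc_id] = match_score(query, docs[doc_id], p)
        # No unseen document can be closer than tau.
        tau = sum(lst[depth][0] ** p for lst in lists) ** (1.0 / p)
        best = sorted(scores.items(), key=lambda kv: kv[1])[:t]
        if len(best) == t and best[-1][1] <= tau:
            return best
    return sorted(scores.items(), key=lambda kv: kv[1])[:t]


# Hypothetical documents based on the laptop example, plus an unrelated one.
docs = {"ibm": [750, 196], "dell": [700, 256], "other": [10, 3]}
result = top_t([800, 200], docs, t=2)
```

The early exit is safe because every number of an unseen document lies further down each list than the entry last examined.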

Empirical Results

Empirical Results (2)
[Screen shot.]

Incorporating Hints
Use simple data extraction techniques to get hints.
Names/Units in query matched against Hints.
Example: 256 MB SDRAM memory
–Unit Hint: MB
–Attribute Hint: SDRAM, memory

Summary
Allows querying using only numbers or numbers + hints.
Data can come from raw text (e.g. product descriptions) or databases.
End run around data extraction.
–Use simple extractor to generate hints.
Open Problems: integration with keyword search.

Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
–Motivation
–Classification
–Associations

Growing Privacy Concerns
Popular Press:
–Economist: The End of Privacy (May 99)
–Time: The Death of Privacy (Aug 97)
Govt. legislation:
–European directive on privacy protection (Oct 98)
–Canadian Personal Information Protection Act (Jan 2001)
Special issue on internet privacy, CACM, Feb 99
S. Garfinkel, "Database Nation: The Death of Privacy in the 21st Century", O'Reilly, Jan 2000

Privacy Concerns (2)
Surveys of web users
–17% privacy fundamentalists, 56% pragmatic majority, 27% marginally concerned (Understanding net users' attitude about online privacy, April 99)
–82% said having privacy policy would matter (Freebies & Privacy: What net users think, July 99)

Technical Question
Fear:
–"Join" (record overlay) was the original sin.
–Data mining: new, powerful adversary?
The primary task in data mining: development of models about aggregated data.
Can we develop accurate models without access to precise information in individual data records?

Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
–Motivation
–Private Information Retrieval
–Classification (SIGMOD 2000, with R. Agrawal)
–Associations

Web Demographics
Volvo S40 website targets people in 20s
–Are visitors in their 20s or 40s?
–Which demographic groups like/dislike the website?

Solution Overview
[Figure: records (50 | 40K | ...), (30 | 70K | ...) → Randomizer → records (65 | 20K | ...), (25 | 60K | ...) → Reconstruct distribution of Age, Reconstruct distribution of Salary → Data Mining Algorithms → Model.]

Reconstruction Problem
Original values x_1, x_2, ..., x_n
–from probability distribution X (unknown)
To hide these values, we use y_1, y_2, ..., y_n
–from probability distribution Y
Given
–x_1+y_1, x_2+y_2, ..., x_n+y_n
–the probability distribution of Y
Estimate the probability distribution of X.

Intuition (Reconstruct single point)
Use Bayes' rule for density functions

Reconstructing the Distribution
Combine estimates of where point came from for all the points:
–Gives estimate of original distribution.

Reconstruction: Bootstrapping
f_X^0 := Uniform distribution
j := 0 // Iteration number
repeat
–[update of f_X^(j+1) from f_X^j not preserved] (Bayes' rule)
–j := j+1
until (stopping criterion met)
Converges to maximum likelihood estimate.
–D. Agrawal & C.C. Aggarwal, PODS 2001.
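A discretized sketch of the iteration, under the assumption that each step averages, over all randomized values w_i = x_i + y_i, the Bayes posterior of where the original value lay given the current estimate of f_X. The binning, the fixed iteration count used as the stopping criterion, and all names are illustrative choices, not taken from the slides.

```python
import random


def reconstruct(observed, noise_pdf, bins, iterations=50):
    """Estimate the probability mass of X on each bin midpoint.

    observed: randomized values w_i = x_i + y_i
    noise_pdf: density function of the noise Y
    bins: candidate values (bin midpoints) for X
    """
    m = len(bins)
    fx = [1.0 / m] * m  # start from the uniform distribution
    for _ in range(iterations):
        new = [0.0] * m
        for w in observed:
            # Bayes' rule: posterior over bins for the point behind w.
            post = [noise_pdf(w - a) * fx[k] for k, a in enumerate(bins)]
            z = sum(post)
            if z == 0:
                continue  # w is impossible under every bin; skip it
            for k in range(m):
                new[k] += post[k] / z
        s = sum(new)
        fx = [v / s for v in new]
    return fx


def uniform_noise_pdf(y):
    """Density of noise drawn uniformly from [-10, 10]."""
    return 0.05 if abs(y) <= 10 else 0.0


# Hypothetical data: half the people are 20, half are 60, hidden by +/- 10 noise.
rng = random.Random(0)
true_ages = [20] * 500 + [60] * 500
observed = [x + rng.uniform(-10, 10) for x in true_ages]
bins = [10, 20, 30, 40, 50, 60, 70]
estimate = reconstruct(observed, uniform_noise_pdf, bins)
```

The estimate concentrates on the bins at 20 and 60 even though no individual age can be recovered from its randomized value.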

Seems to work well!

Recap: Why is privacy preserved?
Cannot reconstruct individual values accurately.
Can only reconstruct distributions.

Talk Outline
Taxonomy Integration
Searching with Numbers
Privacy-Preserving Data Mining
–Motivation
–Private Information Retrieval
–Classification
–Associations (KDD 2002, with A. Evfimievski, R. Agrawal & J. Gehrke)

Association Rules
Given:
–a set of transactions
–each transaction is a set of items
Association Rule: 30% of transactions that contain Book1 and Book5 also contain Book20; 5% of transactions contain these items.
–30% : confidence of the rule.
–5% : support of the rule.
Find all association rules that satisfy user-specified minimum support and minimum confidence constraints.
Can be used to generate recommendations.
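Support and confidence as defined on this slide can be computed directly. The sketch below uses a made-up set of transactions; the function names are illustrative.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t)) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Hypothetical purchase histories.
transactions = [
    {"Book1", "Book5", "Book20"},
    {"Book1", "Book5"},
    {"Book1", "Book7"},
    {"Book3"},
]
rule_support = support(transactions, {"Book1", "Book5", "Book20"})
rule_confidence = confidence(transactions, {"Book1", "Book5"}, {"Book20"})
```

Here the rule {Book1, Book5} → {Book20} has support 25% and confidence 50%.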

Recommendation Service: Overview
[Figure: Alice and Bob; a Recommendation Service with Support Recovery, Associations, and Recommendations. Transactions shown: {Book 5, Book 25}, {Book 1, Book 11, Book 21}, {Book 3, Book 25}, {Book 1, Book 7, Book 21}.]

Private Information Retrieval
Retrieve 1 of n documents from a digital library without the library knowing which document was retrieved.
Trivial solution: Download entire library.
Can you do better?
–Yes, with multiple servers.
–Yes, with single server & computational privacy.
Problem introduced in [Chor et al, FOCS 95]

Uniform Randomization
Given a transaction,
–keep item with 20% probability,
–replace with a new random item with 80% probability.
Appears to give around 80% privacy…
–80% chance that an item in the randomized transaction was not in the original transaction.

Privacy Breach Example
10 M transactions of size 3 with 1000 items:
–100,000 (1%) have {x, y, z}. Chance that the randomized transaction is still {x, y, z}: 0.2^3 = .008, i.e. 800 transactions (99.99%).
–9,900,000 (99%) have zero items from {x, y, z}. Chance that the randomized transaction is {x, y, z}: 6 * (0.8/1000)^3 = 3 * 10^-9, i.e. .03 transactions (<< 1) (0.01%).
80% privacy “on average,” but not for all items!
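The arithmetic behind this example can be checked in a few lines. The variable names are illustrative; the numbers are the slide's.

```python
keep = 0.2               # probability an item is kept
n_items = 1000           # size of the item universe
with_xyz = 100_000       # transactions that really are {x, y, z}
without_xyz = 9_900_000  # transactions with none of x, y, z

# All three original items survive randomization.
p_survive = keep ** 3
# All three items are replaced and the replacements happen to be x, y, z
# (factor 6 as on the slide: 3! orderings of the three replacements).
p_appear = 6 * ((1 - keep) / n_items) ** 3

expected_true = with_xyz * p_survive
expected_false = without_xyz * p_appear
share_true = expected_true / (expected_true + expected_false)
```

Almost every randomized transaction that shows {x, y, z} therefore came from an original {x, y, z}, which is the breach.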

Solution
“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?” “He grows a forest to hide it in.” (G.K. Chesterton)
Insert many false items into each transaction.
Hide true itemsets among false ones.
No free lunch: Need more transactions to discover associations.

Related Work
S. Rizvi, J. Haritsa, “Privacy-Preserving Association Rule Mining”, VLDB 2002.
Protecting privacy across databases:
–Y. Lindell and B. Pinkas, “Privacy Preserving Data Mining”, Crypto 2000.
–J. Vaidya and C.W. Clifton, “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”, KDD 2002.

Summary
Have your cake and mine it too!
–Preserve privacy at the individual level, but still build accurate models.
–Can do both classification & association rules.
Open Problems: Clustering, Lower bounds on discoverability versus privacy, Faster algorithms, …

Slides available from... www.almaden.ibm.com/cs/people/srikant/talks.html

Backup

Lowest Discoverable Support
LDS is the support s.t., when predicted, it is 4σ away from zero.
Roughly, LDS is proportional to [expression not preserved]
[Figure parameters: |t| = 5, ρ = 50%]

LDS vs. Breach Level |t| = 5, |T| = 5 M

Basic 2-server Scheme
Each server returns XOR of green bits.
Client XORs bits returned by the servers.
Communication complexity: O(n)
[Figure: database bits 1 to 8, with the selected bits shown in green.]
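A minimal sketch of a two-server XOR scheme of this kind, assuming the usual construction: the client sends a random subset of indices to one server and the same subset with the wanted index toggled to the other. The slide's figure is not preserved, so the details here are an assumption; function names are illustrative.

```python
import secrets


def make_queries(n, i):
    """Return two index sets that differ only in index i."""
    s1 = {j for j in range(n) if secrets.randbits(1)}
    s2 = s1 ^ {i}  # symmetric difference toggles membership of i
    return s1, s2


def server_answer(db, subset):
    """A server XORs together the database bits it was asked for."""
    answer = 0
    for j in subset:
        answer ^= db[j]
    return answer


def retrieve(db, i):
    """Client side: every bit except db[i] appears in both answers and cancels."""
    s1, s2 = make_queries(len(db), i)
    return server_answer(db, s1) ^ server_answer(db, s2)


db = [1, 0, 1, 1, 0, 0, 1, 0]
recovered = [retrieve(db, i) for i in range(len(db))]
```

Each server sees a uniformly random subset on its own, so neither learns i unless the two collude. Each query is an n-bit set description, which gives the O(n) communication cost.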

Sqrt(n) Algorithm
Each server returns bitwise XOR of specified blocks.
Client XORs the 2 blocks & selects desired bits.
Each block has sqrt(n) elements => 4*sqrt(n) communication complexity.
Server computation time still O(n)
[Figure: database bits 1 to 8.]
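One way to realize the block idea, sketched under the assumption that the n bits are laid out as a sqrt(n) × sqrt(n) matrix whose rows are the blocks. The layout and names are assumptions; the slide only states the block size and the total cost.

```python
import math
import secrets


def xor_rows(matrix, rows):
    """Bitwise XOR of the selected rows (blocks) of the matrix."""
    out = [0] * len(matrix[0])
    for r in rows:
        out = [a ^ b for a, b in zip(out, matrix[r])]
    return out


def retrieve_bit(db, i):
    """Fetch db[i] by asking each server for the XOR of a set of rows."""
    side = math.isqrt(len(db))
    if side * side != len(db):
        raise ValueError("this sketch assumes len(db) is a perfect square")
    matrix = [db[r * side:(r + 1) * side] for r in range(side)]
    row, col = divmod(i, side)
    # Two row sets that differ only in the wanted row: sqrt(n) bits per query.
    rows1 = {r for r in range(side) if secrets.randbits(1)}
    rows2 = rows1 ^ {row}
    # Each server returns one block of sqrt(n) bits.
    block1 = xor_rows(matrix, rows1)
    block2 = xor_rows(matrix, rows2)
    # XOR of the two blocks is exactly the wanted row; pick the wanted column.
    return block1[col] ^ block2[col]


db16 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0]
recovered16 = [retrieve_bit(db16, i) for i in range(len(db16))]
```

Two queries of sqrt(n) bits plus two answers of sqrt(n) bits account for the 4*sqrt(n) total, while each server still touches O(n) bits.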

Computationally Private IR
Use pseudo-random function + mask to generate sets.
Quadratic residuosity.
Difficulty of deciding whether a small prime divides φ(m)
–m: composite integer of unknown factorization
–φ(m): Euler totient fn, i.e., # of positive integers <= m that are relatively prime to m.

Extensions
Retrieve documents (blocks), not bits.
–If n <= l, comm. complexity 4l.
–If n <= l^2/4, comm. complexity 8l.
Lower communication complexity.
Select documents using keywords.
Protect data privacy.
Preprocessing to reduce computation time.
Computationally-private information retrieval with single server.

Potential Privacy Breaches
Distribution is a spike.
–Example: Everyone is of age 40.
Some randomized values are only possible from a given range.
–Example: Add U[-50,+50] to age and get 125 ⇒ true age is ≥ 75.
–Not an issue with Gaussian.

Potential Privacy Breaches (2)
Most randomized values in a given interval come from a given interval.
–Example: 60% of the people whose randomized value is in [120,130] have their true age in [70,80].
–Implication: Higher levels of randomization will be required.
Correlations can make previous effect worse.
–Example: 80% of the people whose randomized value of age is in [120,130] and whose randomized value of income is [...] have their true age in [70,80].

Work in Statistical Databases
Provide statistical information without compromising sensitive information about individuals (surveys: AW89, Sho82)
Techniques
–Query Restriction
–Data Perturbation
Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89]

Statistical Databases: Techniques
Query Restriction
–restrict the size of query result (e.g. FEL72, DDS79)
–control overlap among successive queries (e.g. DJL79)
–suppress small data cells (e.g. CO82)
Output Perturbation
–sample result of query (e.g. Den80)
–add noise to query result (e.g. Bec80)
Data Perturbation
–replace db with sample (e.g. LST83, LCL85, Rei84)
–swap values between records (e.g. Den82)
–add noise to values (e.g. TYW84, War65)

Statistical Databases: Comparison
We do not assume original data is aggregated into a single database.
Concept of reconstructing original distribution.
–Adding noise to data values problematic without such reconstruction.
