1 Low-Support, High-Correlation Finding rare, but very similar items.

Similar presentations
Hash Functions A hash function takes data of arbitrary size and returns a value in a fixed range. If you compute the hash of the same data at different.

Applications Shingling Minhashing Locality-Sensitive Hashing
Digital Cash Mehdi Bazargan Fall 2004.
Data Mining of Very Large Data
1 Chapter 7-2 Signature Schemes. 2 Outline [1] Introduction [2] Security Requirements for Signature Schemes [3] The ElGamal Signature Scheme [4] Variants.
Near-Duplicates Detection
ITIS 6200/ Secure multiparty computation – Alice has x, Bob has y, we want to calculate f(x, y) without disclosing the values – We can only do.
Foundations of Cryptography Lecture 5 Lecturer: Moni Naor.
Recoverable and Untraceable E-Cash Dr. Joseph K. Liu The Chinese University of HongKong.
MMDS Secs Slides adapted from: J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, October.
Authentication and Digital Signatures CSCI 5857: Encoding and Encryption.
Digital Cash Present By Kevin, Hiren, Amit, Kai. What is Digital Cash?  A payment message bearing a digital signature which functions as a medium of.
ELECTRONIC PAYMENT SYSTEMS FALL 2002COPYRIGHT © 2002 MICHAEL I. SHAMOS Electronic Payment Systems Lecture 11 Electronic Cash.
Slide 1 Vitaly Shmatikov CS 378 Digital Cash. slide 2 Digital Cash: Properties uDigital “payment message” with properties of cash uUnforgeable Users cannot.
Payment Systems 1. Electronic Payment Schemes Schemes for electronic payment are multi-party protocols Payment instrument modeled by electronic coin that.
Introduction to Modern Cryptography, Lecture 13 Money Related Issues ($$$) and Odds and Ends.
Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these.
Association Rule Mining
My Favorite Algorithms for Large-Scale Data Mining
CNS2010handout 10 :: digital signatures1 computer and network security matt barrie.
Data Mining: Associations
ELECTRONIC PAYMENT SYSTEMS SPRING 2004 COPYRIGHT © 2004 MICHAEL I. SHAMOS Electronic Payment Systems Lecture 11 Electronic Cash.
Announcements: 1. Presentations start Friday 2. Cem Kaner presenting O th block today. Questions? This week: DSA, Digital Cash DSA, Digital Cash.
Introduction to Modern Cryptography Homework assignments.
CS 361A1 CS 361A (Advanced Data Structures and Algorithms) Lecture 20 (Dec 7, 2005) Data Mining: Association Rules Rajeev Motwani (partially based on notes.
1 A practical off-line digital money system with partially blind signatures based on the discrete logarithm problem From: IEICE TRANS. FUNDAMENTALS, VOL.E83-A,No.1.
Improvements to A-Priori
Near Duplicate Detection
Electronic Voting Schemes and Other stuff. Requirements Only eligible voters can vote (once only) No one can tell how voter voted Publish who voted (?)
Finding Similar Items. Set Similarity Problem: Find similar sets. Motivation: Many things can be modeled/represented as sets Applications: –Face Recognition.
WS Algorithmentheorie 03 – Randomized Algorithms (Public Key Cryptosystems) Prof. Dr. Th. Ottmann.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Finding Similar Items.
1 Near-Neighbor Search Applications Matrix Formulation Minhashing.
Chapter 9 Cryptographic Protocol Cryptography-Principles and Practice Harbin Institute of Technology School of Computer Science and Technology Zhijun Li.
ELECTRONIC PAYMENT SYSTEMSFALL 2001COPYRIGHT © 2001 MICHAEL I. SHAMOS Electronic Payment Systems Lecture 6 Epayment Security II.
Module 8 – Anonymous Digital Cash Blind Signatures DigiCash coins.
Finding Near Duplicates (Adapted from slides and material from Rajeev Motwani and Jeff Ullman)
E-Money / Digital Cash Lin Huang. Money / Digital Cash What is Money –Coins, Bill – can’t exist on two places at one time –Bearer bonds: immediate cashable.
J. Wang. Computer Network Security Theory and Practice. Springer 2008 Chapter 4 Data Authentication Part II.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing Locality-Sensitive Hashing.
Public Key Model 8. Cryptography part 2.
Csci5233 Computer Security1 Bishop: Chapter 10 Key Management: Digital Signature.
Tonga Institute of Higher Education Design and Analysis of Algorithms IT 254 Lecture 9: Cryptography.
Digital Cash By Gaurav Shetty. Agenda Introduction. Introduction. Working. Working. Desired Properties. Desired Properties. Protocols for Digital Cash.
Digital Signatures Good properties of hand-written signatures: 1. Signature is authentic. 2. Signature is unforgeable. 3. Signature is not reusable (it.
Bitcoin (what, why and how?)
Chapter 4: Intermediate Protocols
Topic 22: Digital Schemes (2)
Digital Signatures A primer 1. Why public key cryptography? With secret key algorithms Number of key pairs to be generated is extremely large If there.
Privacy Enhancing Technologies Spring What is Privacy? “The right to be let alone” Confidentiality Anonymity Access Control Most privacy technologies.
Based on Schneier Chapter 5: Advanced Protocols Dulal C. Kar.
Finding Similar Items 1 Wu-Jun Li Department of Computer Science and Engineering Shanghai Jiao Tong University Lecture 10: Finding Similar Items Mining.
Software Security Seminar - 1 Chapter 5. Advanced Protocols 조미성 Applied Cryptography.
Chapter 6:Esoteric Protocols Dulal C Kar. Secure Elections Ideal voting protocol has at least following six properties 1.Only authorized voters can vote.
Advanced Database Course (ESED5204) Eng. Hanan Alyazji University of Palestine Software Engineering Department.
1 Low-Support, High-Correlation Finding Rare but Similar Items Minhashing.
Linkability of Some Blind Signature Schemes Swee-Huay Heng 1, Wun-She Yap 1 Khoongming Khoo 2 1 Multimedia University, 2 DSO National Laboratories.
A A E E D D C C B B # Symmetric Keys = n*(n-1)/2 F F
DATA MINING LECTURE 6 Sketching, Min-Hashing, Locality Sensitive Hashing.
Electronic Cash R. Newman. Topics Defining anonymity Need for anonymity Defining privacy Threats to anonymity and privacy Mechanisms to provide anonymity.
Software Security Seminar - 1 Chapter 4. Intermediate Protocols 발표자 : 이장원 Applied Cryptography.
David Evans CS588: Security and Privacy University of Virginia Computer Science Lecture 15: From Here to Oblivion.
Multi-Party Computation r n parties: P 1,…,P n  P i has input s i  Parties want to compute f(s 1,…,s n ) together  P i doesn’t want any information.
CS276A Text Information Retrieval, Mining, and Exploitation
Near Duplicate Detection
Finding Similar Items: Locality Sensitive Hashing
eCommerce Technology Lecture 13 Electronic Cash
Three Essential Techniques for Similar Documents
Presentation transcript:

1 Low-Support, High-Correlation Finding rare, but very similar items

2 Assumptions
1. The number of items allows a small amount of main memory per item.
2. There are too many items to store anything in main memory for each pair of items.
3. There are too many baskets to store anything in main memory for each basket.
4. Data is very sparse: it is rare for an item to be in a basket.

3 Applications
- While marketing may require high support, or there's no money to be made, mining customer behavior is often based on correlation rather than support.
  - Example: Few customers buy Handel's Water Music, but of those who do, 20% buy Bach's Brandenburg Concertos.

4 Matrix Representation
- Columns = items.
- Rows = baskets.
- Entry (r, c) = 1 if item c is in basket r; = 0 if not.
- Assume the matrix is almost all 0's.

5 In Matrix Form
             m  c  p  b  j
{m,c,b}      1  1  0  1  0
{m,p,b}      1  0  1  1  0
{m,b}        1  0  0  1  0
{c,j}        0  1  0  0  1
{m,p,j}      1  0  1  0  1
{m,c,b,j}    1  1  0  1  1
{c,b,j}      0  1  0  1  1
{c,b}        0  1  0  1  0

6 Similarity of Columns
- Think of a column as the set of rows in which it has 1.
- The similarity of columns C1 and C2, Sim(C1, C2), is the ratio of the sizes of the intersection and union of C1 and C2 (the Jaccard measure).
- The goal of finding correlated columns becomes finding similar columns.
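A minimal sketch of the Jaccard measure, using the baskets from the matrix on slide 5. The helper names `column` and `jaccard`, and the choice to number rows from 0, are illustrative assumptions and not part of the slides.

```python
# Baskets from slide 5; row r of the matrix is baskets[r] (0-indexed here).
baskets = [
    {"m", "c", "b"}, {"m", "p", "b"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "j"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"c", "b"},
]

def column(item):
    """Return the column for an item as the set of rows in which it has a 1."""
    return {r for r, basket in enumerate(baskets) if item in basket}

def jaccard(c1, c2):
    """Size of the intersection divided by size of the union."""
    union = c1 | c2
    if not union:
        return 0.0  # assumed convention for two all-zero columns
    return len(c1 & c2) / len(union)

# m and b share 4 rows out of the 7 rows in which either appears.
print(jaccard(column("m"), column("b")))
```

Storing each column as a set of row numbers matches the sparse-data assumption: only the 1's are stored.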

7 Example
[Table of C1 and C2 bit values not recoverable]
Sim(C1, C2) = 2/5 = 0.4

8 Signatures
- Key idea: "hash" each column C to a small signature Sig(C), such that:
  1. Sig(C) is small enough that we can fit a signature in main memory for each column.
  2. Sim(C1, C2) is the same as the "similarity" of Sig(C1) and Sig(C2).

9 An Idea That Doesn't Work
- Pick 100 rows at random, and let the signature of column C be the 100 bits of C in those rows.
- Because the matrix is sparse, many columns would have the all-zero string 00...0 as a signature, yet be very dissimilar because their 1's are in different rows.

10 Four Types of Rows
- Given columns C1 and C2, rows may be classified as:
        C1  C2
    a    1   1
    b    1   0
    c    0   1
    d    0   0
- Also, a = # rows of type a, etc.
- Note Sim(C1, C2) = a/(a + b + c).

11 Min Hashing
- Imagine the rows permuted randomly.
- Define the "hash" function h(C) = the number of the first (in the permuted order) row in which column C has 1.

12 Surprising Property
- The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2).
- Both are a/(a + b + c)!
- Why?
  - Look down columns C1 and C2, in the permuted order, until we see a 1.
  - If it's a type a row, then h(C1) = h(C2). If it's a type b or c row, then not.
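The property can be checked exactly on a tiny example by enumerating every permutation of the rows. This is only a sketch: the two columns, the six-row matrix, and the function names are made up for illustration. Exhaustive enumeration is feasible only for very small matrices.

```python
from fractions import Fraction
from itertools import permutations

def minhash(column, order):
    """First row, in the permuted order, in which the column has a 1."""
    for row in order:
        if row in column:
            return row
    return None  # all-zero column

def collision_probability(c1, c2, n_rows):
    """Exact fraction of row permutations for which h(C1) = h(C2)."""
    agree = total = 0
    for order in permutations(range(n_rows)):
        total += 1
        if minhash(c1, order) == minhash(c2, order):
            agree += 1
    return Fraction(agree, total)

# Hypothetical columns over 6 rows: a = 2 (rows 2, 3), b = 1, c = 1, d = 2.
c1 = {0, 2, 3}
c2 = {2, 3, 4}
print(collision_probability(c1, c2, 6))  # compare with a/(a+b+c) = 2/4
```

Type d rows (1 and 5 here) never affect the outcome, which is why only a, b, and c appear in the formula.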

13 Min-Hash Signatures
- Pick (say) 100 random permutations of the rows.
- Let Sig(C) = the list of 100 row numbers that are the first rows with 1 in column C, for each permutation.
- Similarity of signatures = fraction of permutations for which minhash values agree = (expected) similarity of columns.
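A rough sketch of signatures built from explicit permutations. The matrix size, the two columns, the fixed seed, and the function names are assumptions chosen for illustration. With 100 permutations the signature similarity is only an estimate of the column similarity, so some deviation is expected.

```python
import random

def minhash_signature(column, perms):
    """For each permutation, record the first row in permuted order that is in the column."""
    return [next(row for row in perm if row in column) for perm in perms]

def signature_similarity(s1, s2):
    """Fraction of permutations on which the two minhash values agree."""
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 1000
perms = []
for _ in range(100):
    perm = list(range(n_rows))
    rng.shuffle(perm)
    perms.append(perm)

# Hypothetical sparse columns with Jaccard similarity 30/90 = 1/3.
c1 = set(range(0, 60))
c2 = set(range(30, 90))
estimate = signature_similarity(minhash_signature(c1, perms),
                                minhash_signature(c2, perms))
print(estimate)  # an estimate of 1/3, not an exact value
```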

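14 Example
Columns: C1, C2, C3. Signatures: S1, S2, S3.
Perm 1 = (12345)
Perm 2 = (54321)
Perm 3 = (34512)
Similarities: Col.-Col. versus Sig.-Sig.
[Matrix entries, signature values, and similarity values not recoverable]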

15 Important Trick
- Don't actually permute the rows.
  - The number of passes would be prohibitive.
- Rather, in one pass through the data:
  1. Pick (say) 100 hash functions.
  2. For each column and each hash function, keep a "slot" for that min-hash value.
  3. For each row r, for each column c with 1 in row r, and for each hash function h: if h(r) is smaller than slot(h, c), replace that slot by h(r).

16 Example
[Input matrix with columns Row, C1, C2 not recoverable]
h(x) = x mod 5
g(x) = (2x + 1) mod 5

              C1 slot  C2 slot
h(1) = 1         1        -
g(1) = 3         3        -
h(2) = 2         1        2
g(2) = 0         3        0
h(3) = 3         1        2
g(3) = 2         2        0
h(4) = 4         1        2
g(4) = 4         2        0
h(5) = 0         1        0
g(5) = 1         2        0
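The one-pass procedure can be sketched as follows, using the hash functions h and g from this example. The input matrix is an assumption: rows 1, 2, 3, and 5 are chosen to be consistent with the slot values shown on the slide, while row 4 is a guess (its hash values are too large to change any slot, so the choice does not affect the result). The function name `one_pass_minhash` is illustrative.

```python
import math

def h(x):
    return x % 5

def g(x):
    return (2 * x + 1) % 5

def one_pass_minhash(rows, n_cols, hash_funcs):
    """rows: iterable of (row number, list of column indices that have a 1)."""
    # slots[i][c] holds the smallest value of hash function i seen for column c.
    slots = [[math.inf] * n_cols for _ in hash_funcs]
    for r, cols_with_one in rows:
        values = [f(r) for f in hash_funcs]  # hash each row number once
        for c in cols_with_one:
            for i, v in enumerate(values):
                if v < slots[i][c]:
                    slots[i][c] = v
    return slots

# Assumed input: column 0 is C1, column 1 is C2; row 4 is a guess.
rows = [(1, [0]), (2, [1]), (3, [0, 1]), (4, [0]), (5, [1])]
slots = one_pass_minhash(rows, 2, [h, g])
print(slots)  # [[1, 0], [2, 0]]: h-slots for (C1, C2), then g-slots
```

The final slots agree with the last two lines of the trace: C1 ends with (h, g) = (1, 2) and C2 with (0, 0).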

17 Summary
- Finding frequent pairs:
  - A-priori --> PCY (hashing) --> multistage.
- Finding all frequent itemsets.
- Finding similar pairs:
  - Minhash.

18 Public-Key Cryptosystems

19 Digital Cash

20 Blind signatures
- Bob has a public key e, a private key d, and a public modulus n.
- Alice wants Bob to sign message m blindly.
- Alice chooses a value k between 1 and n that is invertible mod n. Then she blinds m by computing t = m·k^e (mod n).
- Bob signs t: t^d = (m·k^e)^d (mod n).
- Alice unblinds t^d by computing s = t^d / k (mod n).
- The result is s = m^d (mod n).
- This holds since t^d = (m·k^e)^d = m^d·k (mod n), so t^d / k = m^d·k / k = m^d (mod n).
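A toy sketch of the blinding arithmetic, assuming textbook RSA with a tiny key (n = 61 × 53 = 3233, e = 17, d = 2753). The key, the message, the blinding factor, and the function names are illustrative only. Real systems use large keys and padded or hashed messages. Division by k is done by multiplying with the modular inverse, which `pow(k, -1, n)` computes in Python 3.8 and later.

```python
# Toy RSA parameters (assumed for illustration; far too small for real use).
n, e, d = 3233, 17, 2753

def blind(m, k):
    """Alice: t = m * k^e mod n."""
    return (m * pow(k, e, n)) % n

def sign(t):
    """Bob: t^d mod n, computed without learning m."""
    return pow(t, d, n)

def unblind(signed_t, k):
    """Alice: s = t^d / k mod n, using the inverse of k mod n."""
    return (signed_t * pow(k, -1, n)) % n

m, k = 65, 7              # k must be invertible mod n, i.e. gcd(k, n) = 1
t = blind(m, k)
s = unblind(sign(t), k)
print(s == pow(m, d, n))  # True: s is Bob's ordinary signature on m
print(pow(s, e, n) == m)  # True: anyone can verify it with the public key
```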

21 More advanced scheme
- Cash should be anonymous: in addition to preventing double spending, we also want A's identity to be kept secret as long as A follows the protocol.
- We use blind signatures.

22 Withdrawal
Let I be A's identity description.
1. Write I = I^(l1) ⊕ I^(r1) = I^(l2) ⊕ I^(r2) = ... = I^(l100) ⊕ I^(r100), where each I^(lk) is a random string.
2. Set M = ["$1", random seq#, h(I^(l1)), h(I^(r1)), ..., h(I^(l100)), h(I^(r100))], where the random seq# is generated by A.
3. Generate a blinding of M, called B.
4. Repeat steps 1-3 100 times to get B_1, ..., B_100 for M_1, ..., M_100.
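Steps 1 and 2 can be sketched as below. Several details are assumptions not fixed by the slides: SHA-256 stands in for the hash h, the sequence number is 16 random bytes, the message is a plain dictionary, and the function names are illustrative. Blinding (step 3) is left out.

```python
import hashlib
import secrets

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_identity(identity, n_pairs=100):
    """Step 1: write I as left XOR right, n_pairs times, with random left halves."""
    pairs = []
    for _ in range(n_pairs):
        left = secrets.token_bytes(len(identity))
        right = xor_bytes(identity, left)
        pairs.append((left, right))
    return pairs

def build_message(pairs, amount=b"$1"):
    """Step 2: amount, random sequence number, and a hash of every half."""
    commitments = [hashlib.sha256(half).digest() for pair in pairs for half in pair]
    return {"amount": amount, "seq": secrets.token_bytes(16), "commitments": commitments}

identity = b"alice-account-0001"  # hypothetical identity string
pairs = split_identity(identity)
message = build_message(pairs)
print(len(message["commitments"]))  # 200 hashes, two per pair
```

Either half of a pair alone is a uniformly random string, so it reveals nothing about I. Both halves of the same pair together reveal I.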

23 Withdrawal (cont.)
5. A sends B_1, ..., B_100 to the bank (B).
6. B picks 99 of the 100 (cut and choose) and asks A to unblind these 99 messages and prove they were built correctly.
7. B signs the only unopened blinded message, B_k (if all the rest were fine).
8. A gets the signed B_k and unblinds it to get the signature on M_k.
Alice now has a coin: [M, sig_B(M)]

24 Spending
1. A → V: buy for $1, [M, sig_B(M)].
2. V → A: V verifies B's signature, picks r ∈ {0,1}^100 at random, and sends it to Alice.
3. A → V: A sets R = [I_1, ..., I_100], where I_k = I^(lk) if r_k = 0 and I_k = I^(rk) if r_k = 1, and sends R to V.
4. V → A: V verifies that R matches the hashes and, if so, releases the goods (notice that V knows nothing about who Alice is!).

25 Deposit protocol
1. V → B: deposit $1, [M, sig_B(M), r, R].
2. B verifies its signature and that R = [I_1, ..., I_100] matches the hashes in the coin.
3. B checks whether seq# from the coin has already been spent. If not, the deposit is accepted. If the coin has been spent, let coin' = [M, sig_B(M), r', R'] be the coin in the database.
Who is guilty, V or A? If r = r' then V is guilty; otherwise A is guilty. But if A is guilty, then who is he/she?

26 Deposit protocol (cont.)
If r = (r_1, ..., r_100) ≠ r' = (r'_1, ..., r'_100), then r_i ≠ r'_i for some coordinate i; but then I_i ⊕ I'_i = I can be computed, exposing Alice.
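The identification step can be sketched as follows. This is a simplified illustration: signatures and hash checks are omitted, and the identity string and the function names `respond` and `expose` are hypothetical. It shows only how two different challenges on the same coin let the bank recover I.

```python
import secrets

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_identity(identity, n_pairs=100):
    """Random (left, right) pairs with left XOR right = identity."""
    pairs = []
    for _ in range(n_pairs):
        left = secrets.token_bytes(len(identity))
        pairs.append((left, xor_bytes(identity, left)))
    return pairs

def respond(pairs, challenge):
    """Spending step 3: reveal the left half if r_k = 0, the right half if r_k = 1."""
    return [right if bit else left for (left, right), bit in zip(pairs, challenge)]

def expose(r, R, r_prime, R_prime):
    """Bank: at a coordinate where the challenges differ, XOR the two revealed halves."""
    for bit, half, bit_prime, half_prime in zip(r, R, r_prime, R_prime):
        if bit != bit_prime:
            return xor_bytes(half, half_prime)
    return None  # r = r': the same transcript was deposited twice, so V is at fault

identity = b"alice-account-0001"  # hypothetical identity string
pairs = split_identity(identity)
r = [secrets.randbelow(2) for _ in range(100)]
r_prime = list(r)
r_prime[0] ^= 1  # a second vendor's challenge differing in one coordinate
print(expose(r, respond(pairs, r), r_prime, respond(pairs, r_prime)) == identity)  # True
```

Two independent random 100-bit challenges coincide with probability 2^-100, so a double spender is almost certainly exposed.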