1 Low-Support, High-Correlation: Finding Rare but Very Similar Items
2 Assumptions
1. The number of items is small enough to allow a small amount of main memory per item.
2. There are too many items to store anything in main memory for each pair of items.
3. There are too many baskets to store anything in main memory for each basket.
4. The data is very sparse: it is rare for an item to be in a basket.
3 Applications
• Marketing may require high support, or there is no money to be made, but mining customer behavior is often based on correlation rather than support.
  - Example: few customers buy Handel’s Watermusick, but of those who do, 20% buy Bach’s Brandenburg Concertos.
4 Matrix Representation
• Columns = items.
• Rows = baskets.
• Entry (r, c) = 1 if item c is in basket r, and 0 if not.
• Assume the matrix is almost all 0’s.
5 In Matrix Form
Basket       m  c  p  b  j
{m,c,b}      1  1  0  1  0
{m,p,b}      1  0  1  1  0
{m,b}        1  0  0  1  0
{c,j}        0  1  0  0  1
{m,p,j}      1  0  1  0  1
{m,c,b,j}    1  1  0  1  1
{c,b,j}      0  1  0  1  1
{c,b}        0  1  0  1  0
6 Similarity of Columns
• Think of a column as the set of rows in which it has 1.
• The similarity of columns C1 and C2, Sim(C1, C2), is the ratio of the sizes of the intersection and union of C1 and C2 (the Jaccard measure).
• The goal of finding correlated columns becomes finding similar columns.
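The Jaccard measure can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function name `jaccard` and the choice to represent a column as a Python set of row numbers are assumptions. The two columns are items m and c from the basket matrix on slide 5, with baskets numbered 1 to 8 from top to bottom.

```python
def jaccard(c1, c2):
    """Jaccard similarity of two columns, each given as a set of row numbers."""
    union = c1 | c2
    if not union:
        return 0.0  # assumed convention for two all-zero columns
    return len(c1 & c2) / len(union)

# Rows (baskets) in which items m and c appear, read off the slide-5 matrix.
col_m = {1, 2, 3, 5, 6}
col_c = {1, 4, 6, 7, 8}

# Intersection is {1, 6}; union covers all 8 rows.
print(jaccard(col_m, col_c))
```

Storing only the rows that contain a 1 fits the sparsity assumption: the cost depends on the number of 1’s, not on the number of baskets.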
7 Example [table of columns C1 and C2 not recoverable] Sim(C1, C2) = 2/5 = 0.4
8 Signatures
• Key idea: “hash” each column C to a small signature Sig(C), such that:
  1. Sig(C) is small enough that we can fit a signature in main memory for each column.
  2. Sim(C1, C2) is the same as the “similarity” of Sig(C1) and Sig(C2).
9 An Idea That Doesn’t Work
• Pick 100 rows at random, and let the signature of column C be the 100 bits of C in those rows.
• Because the matrix is sparse, many columns would have an all-zero signature, yet be very dissimilar because their 1’s are in different rows.
10 Four Types of Rows
• Given columns C1 and C2, rows may be classified as:
  Type  C1  C2
  a     1   1
  b     1   0
  c     0   1
  d     0   0
• Also, a = # rows of type a, etc.
• Note Sim(C1, C2) = a/(a + b + c).
11 Min Hashing
• Imagine the rows permuted randomly.
• Define the “hash” function h(C) = the number of the first row (in the permuted order) in which column C has 1.
12 Surprising Property
• The probability (over all permutations of the rows) that h(C1) = h(C2) is the same as Sim(C1, C2).
• Both are a/(a + b + c)!
• Why?
  - Look down columns C1 and C2, in the permuted order, until we see a 1 in either column.
  - If it’s a type a row, then h(C1) = h(C2). If it’s a type b or c row, then not.
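For a small matrix the property can be checked exactly by enumerating every row permutation. The sketch below is only an illustration: the two columns are hypothetical, and the helper names `minhash` and `agreement_probability` are not from the slides.

```python
from fractions import Fraction
from itertools import permutations

def minhash(column, order):
    """First row, in the permuted order, in which the column has a 1."""
    for row in order:
        if row in column:
            return row
    return None  # all-zero column

def agreement_probability(c1, c2, n_rows):
    """Exact P[h(C1) = h(C2)], taken over all permutations of the rows."""
    agree = total = 0
    for order in permutations(range(n_rows)):
        total += 1
        if minhash(c1, order) == minhash(c2, order):
            agree += 1
    return Fraction(agree, total)

# Hypothetical columns over 6 rows: a = 2 rows of type a, b + c = 3.
c1 = {1, 2, 4}
c2 = {0, 2, 4, 5}
jaccard = Fraction(len(c1 & c2), len(c1 | c2))
probability = agreement_probability(c1, c2, 6)
print(probability, jaccard)
```

Type d rows never affect the outcome, which is why only a, b, and c appear in the formula.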
13 Min-Hash Signatures
• Pick (say) 100 random permutations of the rows.
• Let Sig(C) = the list of 100 row numbers that are the first rows with 1 in column C, one for each permutation.
• Similarity of signatures = fraction of permutations for which the minhash values agree = (expected) similarity of columns.
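A signature built from explicit permutations might look like the following sketch. The columns, the fixed random seed, and the function names are assumptions made for illustration, and columns are assumed to be non-empty. With only 100 permutations the estimate is noisy, so it should land near the true similarity rather than exactly on it.

```python
import random

def minhash_signature(column, perms):
    """For each permutation, the first row (in permuted order) with a 1.

    Assumes the column is non-empty.
    """
    return [next(row for row in perm if row in column) for perm in perms]

def signature_similarity(sig1, sig2):
    """Fraction of permutations on which the two minhash values agree."""
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

rng = random.Random(0)       # fixed seed so the sketch is repeatable
n_rows = 6
perms = []
for _ in range(100):         # 100 permutations, as suggested on the slide
    perm = list(range(n_rows))
    rng.shuffle(perm)
    perms.append(perm)

c1 = {1, 2, 4}               # hypothetical columns with Jaccard similarity 2/5
c2 = {0, 2, 4, 5}
sig1 = minhash_signature(c1, perms)
sig2 = minhash_signature(c2, perms)
estimate = signature_similarity(sig1, sig2)
print(estimate)
```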
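14 Example [input matrix with columns C1, C2, C3 and signature matrix with columns S1, S2, S3 not recoverable] Perm 1 = (12345), Perm 2 = (54321), Perm 3 = (34512). Similarities compared: column-column vs. signature-signature [values not recoverable].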
15 Important Trick
• Don’t actually permute the rows.
  - The number of passes would be prohibitive.
• Rather, in one pass through the data:
  1. Pick (say) 100 hash functions.
  2. For each column and each hash function, keep a “slot” for that min-hash value, initially empty (treated as infinity).
  3. For each row r, for each column c with 1 in row r, and for each hash function h: if h(r) is smaller than slot(h, c), replace that slot by h(r).
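16 Example
[Input table with columns Row, C1, C2 not recoverable]
h(x) = x mod 5
g(x) = (2x + 1) mod 5
Step         Slot for C1   Slot for C2
h(1) = 1     1             -
g(1) = 3     3             -
h(2) = 2     1             2
g(2) = 0     3             0
h(3) = 3     1             2
g(3) = 2     2             0
h(4) = 4     1             2
g(4) = 4     2             0
h(5) = 0     1             0
g(5) = 1     2             0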
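The one-pass procedure can be sketched as follows, using the two hash functions from the example. The input matrix did not survive in the slide text, so the `rows` list below is an assumption, chosen only because it is consistent with the slot values in the trace. The function name `minhash_one_pass` and the row representation are likewise illustrative.

```python
import math

def minhash_one_pass(rows, hash_funcs, n_cols):
    """One pass over the rows, keeping one slot per (hash function, column).

    rows: iterable of (row_number, list of columns that have a 1 in that row).
    Returns slots[i][c] = minimum of hash_funcs[i](r) over rows r with a 1 in c.
    """
    slots = [[math.inf] * n_cols for _ in hash_funcs]
    for r, cols in rows:
        hashed = [f(r) for f in hash_funcs]   # hash the row number once per function
        for c in cols:
            for i, value in enumerate(hashed):
                if value < slots[i][c]:
                    slots[i][c] = value
    return slots

def h(x):
    return x % 5

def g(x):
    return (2 * x + 1) % 5

# Assumed input (column 0 = C1, column 1 = C2); not taken from the slide.
rows = [(1, [0]), (2, [1]), (3, [0, 1]), (4, [0]), (5, [1])]
slots = minhash_one_pass(rows, [h, g], n_cols=2)
print(slots)  # slots[0] holds the h-slots, slots[1] the g-slots
```

Hashing the row number stands in for a random permutation: h(r) plays the role of the position of row r in the permuted order, so the rows never have to be physically reordered.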
17 Summary
• Finding frequent pairs:
  - A-priori → PCY (hashing) → multistage.
• Finding all frequent itemsets.
• Finding similar pairs:
  - Minhash.
18 Public-Key Cryptosystems
19 Digital Cash
20 Blind signatures
• Bob has a public key e, a private key d, and a public modulus n.
• Alice wants Bob to sign message m blindly.
• Alice chooses a value k between 1 and n that is invertible mod n, then blinds m by computing \(t = m k^e \pmod{n}\).
• Bob signs t: \(t^d = (m k^e)^d \pmod{n}\).
• Alice unblinds \(t^d\) by computing \(s = t^d / k \pmod{n}\).
• The result is \(s = m^d \pmod{n}\).
• This holds since \(t^d = (m k^e)^d = m^d k^{ed} = m^d k \pmod{n}\), so \(t^d / k = m^d k / k = m^d \pmod{n}\).
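The blinding arithmetic can be traced with toy numbers. Everything concrete below is an assumption for illustration: the primes 61 and 53, the exponent 17, the message 1234, and the blinding factor 7 are not from the slides, and the modulus is far too small for real use. The sketch needs Python 3.8 or later for `pow(x, -1, n)`, which computes a modular inverse.

```python
from math import gcd

# Toy RSA parameters (assumed; insecure, for illustration only).
p, q = 61, 53
n = p * q                            # public modulus
e = 17                               # Bob's public key
d = pow(e, -1, (p - 1) * (q - 1))    # Bob's private key: e*d = 1 mod (p-1)(q-1)

def blind(m, k):
    """Alice: t = m * k^e mod n."""
    return (m * pow(k, e, n)) % n

def sign(t):
    """Bob: t^d mod n, computed without ever seeing m."""
    return pow(t, d, n)

def unblind(signed_t, k):
    """Alice: s = t^d / k mod n, i.e. multiply by the inverse of k."""
    return (signed_t * pow(k, -1, n)) % n

m = 1234                             # message, must be smaller than n
k = 7                                # blinding factor
assert gcd(k, n) == 1                # k must be invertible mod n

t = blind(m, k)
s = unblind(sign(t), k)
print(s == pow(m, d, n))             # s equals an ordinary signature on m
print(pow(s, e, n) == m)             # and it verifies under Bob's public key
```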
21 More advanced scheme
• Cash should be anonymous: in addition to preventing double spending, we also want A’s identity to be kept secret as long as A follows the protocol.
• We use blind signatures.
22 Withdrawal
Let I be A’s identity description.
1. Write \(I = I^{(l_1)} \oplus I^{(r_1)} = I^{(l_2)} \oplus I^{(r_2)} = \dots = I^{(l_{100})} \oplus I^{(r_{100})}\), where each \(I^{(l_k)}\) is a random string and \(I^{(r_k)} = I \oplus I^{(l_k)}\).
2. Set \(M = [\text{“\$1”}, \text{seq\#}, h(I^{(l_1)}), h(I^{(r_1)}), \dots, h(I^{(l_{100})}), h(I^{(r_{100})})]\), where the random seq# is generated by A.
3. Generate a blinding B of M.
4. Repeat steps 1 to 3 100 times to get \(B_1, \dots, B_{100}\) for \(M_1, \dots, M_{100}\).
23 Withdrawal (cont.)
5. A sends \(B_1, \dots, B_{100}\) to the bank (B).
6. B picks 99 of the 100 (cut and choose) and asks A to unblind these 99 messages and prove they were built correctly.
7. If all 99 were fine, B signs the only unopened blinded message, \(B_k\).
8. A receives the signed \(B_k\) and unblinds it to get the signature on \(M_k\).
Alice now has a coin: \([M, \mathrm{sig}_B(M)]\).
24 Spending
1. A → V: buy for $1, \([M, \mathrm{sig}_B(M)]\).
2. V → A: V verifies B’s signature, picks \(r \in \{0,1\}^{100}\) at random, and sends it to Alice.
3. A → V: A sets \(R = [I_1, \dots, I_{100}]\), where \(I_k = I^{(l_k)}\) if \(r_k = 0\) and \(I_k = I^{(r_k)}\) if \(r_k = 1\), and sends R to V.
4. V → A: V verifies that R matches the hashes and, if so, releases the goods. (Notice that V knows nothing about who Alice is!)
25 Deposit protocol
1. V → B: deposit $1, \([M, \mathrm{sig}_B(M), r, R]\).
2. B verifies its signature and that \(R = [I_1, \dots, I_{100}]\) matches the hashes in the coin.
3. B checks whether seq# from the coin has already been spent. If not, the deposit is OK. If the coin has been spent, let coin’ = \([M, \mathrm{sig}_B(M), r', R']\) be the coin in the database.
Who is guilty, V or A? If r = r’, then V is guilty; otherwise A is guilty. But if A is guilty, who is he or she?
26 Deposit protocol (cont.)
If \(r = (r_1, \dots, r_{100}) \neq r' = (r'_1, \dots, r'_{100})\), then \(r_i \neq r'_i\) for some coordinate i; but then \(I = I_i \oplus I'_i\) can be computed, which exposes Alice.
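The identity-splitting idea behind withdrawal, spending, and deposit can be sketched end to end. This is a simplified illustration under several assumptions: the two halves of each pair are combined with XOR, SHA-256 stands in for the hash h, the blind signature and the cut-and-choose step are omitted, and all function names and the sample identity string are hypothetical.

```python
import hashlib
import secrets

def xor(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def split_identity(identity, n_pairs=100):
    """Pairs (left, right) with left XOR right == identity; left is random."""
    pairs = []
    for _ in range(n_pairs):
        left = secrets.token_bytes(len(identity))
        pairs.append((left, xor(identity, left)))
    return pairs

def commit(pairs):
    """Hashes of every half, as they would be embedded in the coin M."""
    return [(hashlib.sha256(l).digest(), hashlib.sha256(r).digest())
            for l, r in pairs]

def respond(pairs, r):
    """Alice's reply R: the left half where r_k = 0, the right half where r_k = 1."""
    return [pair[bit] for pair, bit in zip(pairs, r)]

def verify(commitments, r, R):
    """Vendor or bank check that each revealed half matches its hash."""
    return all(hashlib.sha256(half).digest() == com[bit]
               for com, bit, half in zip(commitments, r, R))

def expose(r1, R1, r2, R2):
    """Bank: recover the identity from two deposits of the same coin."""
    for b1, half1, b2, half2 in zip(r1, R1, r2, R2):
        if b1 != b2:
            return xor(half1, half2)   # one left half and one right half
    return None                        # same challenge twice: the vendor is guilty

identity = b"alice-account-42"         # hypothetical identity description
pairs = split_identity(identity)
coin_hashes = commit(pairs)

r1 = [secrets.randbelow(2) for _ in pairs]   # challenge from the first vendor
r2 = [1 - r1[0]] + r1[1:]                    # second challenge, differs in one bit
R1, R2 = respond(pairs, r1), respond(pairs, r2)
print(verify(coin_hashes, r1, R1), verify(coin_hashes, r2, R2))
print(expose(r1, R1, r2, R2) == identity)
```

A single response reveals only one half of each pair, which on its own is a random string. Two independent 100-bit challenges differ in at least one position except with probability 2^-100, so a double spender is exposed almost surely.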