Association Rule Mining (ARM)

1 Association Rule Mining (ARM)
We will look for common models for ARM/classification/clustering, e.g., R(K1..Kk, A1..An), where the Ks are structure attributes and the As are feature attributes. What is the difference between structure and feature attributes? Sometimes there is none (i.e., no Ks). Other times there are strictly structural attributes (e.g., the X,Y coordinates of an image); we may want to treat these structure attributes differently from feature attributes such as R, G, or B. Structural attributes are similar to keys (they identify tuples, typically by position in space). Association Rule Mining on R is a matter of finding all (qualifying) rules of the form A → C, where A is a subset of tuples (a tupleset) called the antecedent and C is a subset called the consequent. Tuplesets for quantitative attributes are usually product sets, ×i=1..n Si (itemsets), or rectangles ×i=1..n [li, ui], with li ≤ ui in Ai (some intervals may be full-range, [lowval, hival], and the full-range intervals are often left out of the product notation, i.e., not listed). In Boolean ARM (each Ai is Boolean), there may be only one meaningful subinterval, [1,1], and antecedent/consequent sets are then thought of as sets of feature attributes (those with interval = [1,1]).
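
To make the rectangle-of-gates picture concrete, here is a minimal Python sketch (the attribute names and interval bounds are invented for illustration); attributes not listed are treated as full-range:

    # A rectangle itemset: a dict mapping attribute name -> (lo, hi) gate.
    # Attributes not listed are full-range and need not be checked.
    itemset = {"G": (192, 255), "R": (0, 63)}

    def passes(tuple_row, gates):
        """True iff the tuple 'skis through' every gate of the itemset."""
        return all(lo <= tuple_row[attr] <= hi for attr, (lo, hi) in gates.items())

    pixel = {"R": 40, "G": 210, "B": 90, "Y": 150}
    print(passes(pixel, itemset))  # True: 192<=210<=255 and 0<=40<=63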

2 Slalom Metaphor for an Itemset
The rectangles ×i=1..n [li, ui], where li ≤ ui in Ai, can be visualized as a set of "gates" (e.g., on a ski slope), one gate for each non-full-range attribute. Suppose A2, A5, A7, A8 are full-range (l = LowValue or LV and u = HighValue or HV): the itemset is then the set of tuples that "ski through" the gates on A1, A3, A4, A6. The metaphor is related to Parallel Coordinates in the literature, and also to some multi-dimensional tuple visualization diagrams: the Himalayan or Mountain diagram, the Parallel diagram, the Jewel diagram, and the Barrel diagram (similar to the Mountain diagram but wrapped around a barrel, i.e., a 3-D helix). [Figure: gates [l1,u1], [l3,u3], [l4,u4] (with LV = l4), [l6,u6] (with HV = u6) drawn on parallel axes A1..A8, plus sketches of the Mountain, Parallel, and Jewel diagrams.]

3 Slalom Metaphor cont.

A simple example to try to get some intuition on how these diagrams might be used effectively. First, the previous configurations: the Mountain diagram (chain orientation), the Parallel diagram, and the Jewel diagram; then the Mountain diagram (upward orientation). Can the polygon formed from the centroids of the rectangles give bounds for itemset support? The upward-oriented Mountain appears to better reflect "closeness in shape" than the others (the red and blue itemsets should be closer to each other than either is to the green?). Therefore it might be better for cluster analysis (taken up later). [Figure: the diagram variants drawn over axes A1..A5.]

4 Slalom Metaphor for an Association Rule
For a rule A → C, say [l1,u1]1 × [l3,u3]3 → [l6,u6]6 × [l4,u4]4: the support of A = the count of tuples through A; the support of A∪C = the count of tuples through both A and C; the confidence of A → C = the fraction of the tuples going through A that also go through C = Supp(A∪C) / Supp(A) = the blue tuples over the sum of blues + greens. [Figure: antecedent gates on A1 and A3, consequent gates on A6 (with HV = u6) and A4 (with l4 = 0); blue tuples pass all gates, green tuples pass only the antecedent gates.]

5 Precision Ag ARM example
Identifying high and low crop yields. E.g., R(X, Y, R, G, B, Yield), where R/G/B are the red/green/blue reflectances from the pixel (square area) at (x,y) and Yield is the yield at (x,y) (the original overloads Y for both the coordinate and the yield). Assume all are 8-bit values. High-support, high-confidence rules are expected, like: [192,255]G ∧ [0,63]R → [128,255]Yield. How to apply the rules? Obtain rules from the previous year's data, then apply them in the current year after each aerial photo is taken at different stages of plant growth. By irrigating/adding nitrate, the Green/Red values can be increased, and therefore yield may be increased.

6 Market Basket ARM example
Identifying purchasing patterns: if a customer buys beer, s/he will buy chips (so shelve the chips near the beer?). E.g., a Boolean relation R(Tid, Aspirin, Beer, Chips, Dates, .., Zippo), Tid = transaction id (for a customer going through checkout). In any field of a tuple there is a 1 if the customer has that product in his/her basket, else 0. In Boolean ARM we are only interested in Buy/noBuy (not in quantity). Therefore, itemsets are hyperboxes ×i [1,1]ji, where the Iji are the items purchased. Support and confidence: given itemsets A and C, Supp(A) = the fraction of all transactions that contain A; Supp(A→C) = the fraction of all transactions that contain A∪C; Conf(A→C) = the fraction of the transactions containing A that also contain C = Supp(A∪C) / Supp(A). Both can be given as percentages also. Thresholds: Frequent itemsets = those whose support exceeds a minimum support threshold (minsupp); Lk denotes the set of frequent k-itemsets (sets with k items in them). High-confidence rules = those whose confidence exceeds a minimum confidence threshold (minconf).
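
A minimal Python sketch of these definitions (the transactions are invented):

    transactions = [
        {"Beer", "Chips"},
        {"Beer", "Chips", "Dates"},
        {"Aspirin", "Beer"},
        {"Chips"},
    ]

    def supp(itemset):
        """Fraction of transactions containing every item of the itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def conf(antecedent, consequent):
        """Supp(A u C) / Supp(A)."""
        return supp(antecedent | consequent) / supp(antecedent)

    print(supp({"Beer"}))             # 0.75
    print(conf({"Beer"}, {"Chips"}))  # 0.5 / 0.75 = 0.667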

7 Lists versus Vectors in MBR
In most MBR treatments, we have: Items, i (purchasable things; I is the universe of all items), and Transactions, t (a customer through checkout with an itemset, the t-itemset). The t-itemset is usually expressed as a list of items, {i1, i2, …, in}. The t-itemset can also be expressed as a bit vector, [ …1000], where each item is assigned a bit position and that bit is 1 if the t-itemset contains that item and 0 otherwise. The vector version corresponds to the table model we have been using, R(K1, A1, …, An), with K1 = Trans-id and the Ai's the items in the assigned order (the datatype of each is Boolean).
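
A sketch of the two encodings and the correspondence between them (the item universe and its order are assumed for illustration):

    universe = ["Aspirin", "Beer", "Chips", "Dates"]  # assumed item order

    def to_bitvector(t_itemset):
        """Encode an item list as a bit vector over the fixed item order."""
        return [1 if item in t_itemset else 0 for item in universe]

    def to_itemlist(bits):
        """Decode a bit vector back to the item set."""
        return {item for item, b in zip(universe, bits) if b}

    print(to_bitvector({"Beer", "Chips"}))   # [0, 1, 1, 0]
    print(to_itemlist([0, 1, 1, 0]))         # {'Beer', 'Chips'}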

8 Association Rule Example
Given a database of transactions, where each transaction is a list (or bit vector) of items purchased by a customer in a visit:

    Tid   Items
    2000  A, B, C
    1000  A, C
    4000  A, D
    5000  B, E, F

Let minsupp = 50%, minconf = 50%. We have A → C (50%, 66.6%) and C → A (50%, 100%). Boolean vs. quantitative associations (based on the types of values handled): buys(x, "SQLServer") ∧ buys(x, "DMBook") → buys(x, "DBMiner") [0.2%, 60%]; age(x, "30..39") ∧ income(x, "42..48K") → buys(x, "PC") [1%, 75%]. Single-dimension vs. multi-dimensional associations. Single-level vs. multiple-level analysis (e.g., what brands of beer are associated with what brands of diapers?). In large databases, ARM is done in two steps: (1) find all frequent itemsets; (2) generate strong rules (high support and high confidence) from the frequent itemsets.

9 Mining Association Rules
Minsupp 50%, minconf 50%, same transaction table as above (Tid 2000: A,B,C; 1000: A,C; 4000: A,D; 5000: B,E,F). For the rule A → C: supp({A,C}) = 50%; confidence = supp({A,C}) / supp({A}) = 66.6%. Apriori principle: any subset of a frequent itemset is frequent. Find the frequent itemsets: the sets of items that have minimum support. A subset of a frequent itemset must also be a frequent itemset, i.e., if {A,B} is a frequent itemset, both {A} and {B} must be frequent. Iteratively find frequent itemsets with size from 1 to k (k-itemsets). Use the frequent itemsets to generate association rules. Ck will denote the candidate frequent k-itemsets; Lk will denote the frequent k-itemsets.

10 The Apriori Algorithm

Join Step: Ck is generated by joining Lk-1 with itself. Assume the elements are lexicographically ordered; two elements of Lk-1 are joinable if their first k-2 items are common. Prune Step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t;
        Lk+1 = candidates in Ck+1 with min_support;
    end
    return ∪k Lk;
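
For concreteness, a minimal runnable Python sketch of this loop (itemsets as frozensets; the join/prune candidate generation is the subject of the next slide; all names are ours, not the paper's):

    from itertools import combinations

    def apriori(transactions, minsup_count):
        """Classic Apriori loop: returns {frozenset: count} of all frequent itemsets."""
        frequent = {}
        items = {i for t in transactions for i in t}
        Lk = set()
        for i in items:
            n = sum(i in t for t in transactions)
            if n >= minsup_count:
                s = frozenset([i])
                Lk.add(s)
                frequent[s] = n
        k = 1
        while Lk:
            # join step: union pairs of frequent k-itemsets into (k+1)-candidates
            Ck1 = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
            # prune step: every k-subset of a candidate must itself be frequent
            Ck1 = {c for c in Ck1
                   if all(frozenset(s) in Lk for s in combinations(c, k))}
            # count step: one pass over the transactions for all candidates
            Lk = set()
            for c in Ck1:
                n = sum(c <= t for t in transactions)
                if n >= minsup_count:
                    Lk.add(c)
                    frequent[c] = n
            k += 1
        return frequent

    D = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(apriori(D, 2))   # {A}:3, {B}:2, {C}:2, {A,C}:2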

11 How to Generate Candidates and Count Support
Suppose the items in Lk-1 are listed in an order. Step 1: self-joining Lk-1:

    insert into Ck
    select p.item1, p.item2, …, p.item(k-1), q.item(k-1)
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, .., p.item(k-2) = q.item(k-2), p.item(k-1) < q.item(k-1)

Step 2: pruning:

    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

Example: L3 = {abc, abd, acd, ace, bcd}. Self-joining L3*L3: abcd from abc & abd; acde from acd & ace. Pruning: acde is removed because ade is not in L3. So C4 = {abcd}. Why is counting the supports of candidates a problem? The total number of candidates can be huge, and one transaction may contain many candidates. Method: candidate itemsets are stored in a hash-tree; a leaf node of the hash-tree contains a list of itemsets and counts; an interior node contains a hash table; a subset function finds all candidates contained in a transaction.
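
The same join and prune steps expressed directly over lexicographically ordered tuples, mirroring the SQL above (a sketch; the function name apriori_gen is ours):

    from itertools import combinations

    def apriori_gen(Lk_1):
        """Self-join Lk-1 on the first k-2 items, then prune by the Apriori property."""
        Lk_1 = sorted(Lk_1)                      # itemsets as sorted tuples
        k_1 = len(Lk_1[0])
        Ck = [p[:-1] + (p[-1], q[-1]) for p in Lk_1 for q in Lk_1
              if p[:-1] == q[:-1] and p[-1] < q[-1]]
        Lset = set(Lk_1)
        return [c for c in Ck
                if all(s in Lset for s in combinations(c, k_1))]

    L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
    print(apriori_gen(L3))  # [('a', 'b', 'c', 'd')]; acde pruned since ade not in L3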

12 Spatial Data

Pixel – a point in a space. Band – a feature attribute of the pixels. Value – usually one byte (0~255). Images have different numbers of bands: TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2); TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC); TIFF: 3 bands (B, G, R); ground data: individual bands (Yield, Moisture, Nitrate, Temp, elevation…). RSI (remotely sensed image) data can be viewed as a collection of pixels, each with a value for each feature attribute. E.g., the RSI dataset above has 320 rows and 320 columns of pixels (102,400 pixels) and 4 feature attributes (B, G, R, Y). The (B,G,R) feature bands are in the TIFF image and the Y feature is color-coded in the yield map. [Figure: TIFF image and Yield Map.] Existing formats: BSQ (Band Sequential), BIL (Band Interleaved by Line), BIP (Band Interleaved by Pixel). New format: bSQ (bit Sequential).

13 Spatial Data Formats (Cont.)
[Figure: an example two-band image given as four rows of pixel values per band; the values were lost in transcription.] BSQ format (2 files): Band 1 in one file and Band 2 in another, each holding its band's values in raster order.

14 Spatial Data Formats (Cont.)
[Figure: the same two-band example.] BSQ format (2 files): one file per band, values in raster order. BIL format (1 file): the bands interleaved by line — row 1 of Band 1, then row 1 of Band 2, then row 2 of Band 1, and so on.

15 Spatial Data Formats (Cont.)
[Figure: the same two-band example.] BSQ format (2 files); BIL format (1 file); BIP format (1 file): the bands interleaved by pixel — for each pixel in raster order, its Band 1 value then its Band 2 value.

16 Spatial Data Formats (Cont.)
[Figure: the same two-band example.] BSQ format (2 files); BIL format (1 file); BIP format (1 file); bSQ format (16 files): B11 B12 B13 B14 B15 B16 B17 B18, B21 B22 B23 B24 B25 B26 B27 B28 — one file per bit position of each band.

17 bSQ and other tabular Formats
Split each band into eight separate files, one for each bit position. Reasons for using the bSQ format: different bits contribute to the value differently; bSQ facilitates the representation of a precision hierarchy (from 1 bit up to 8 bits); bSQ facilitates an efficient data structure, the P-tree, with its P-tree algebra and the T-cube. BSQ and bSQ are "tabular" formats: BSQ consists of a separate table for each feature band; bSQ consists of a separate table for each bit of each band. One can view it this way: the data set is initially one "relation" or table, R(K1,..,Kk, A1, A2, …, An), where K1,..,Kk are the structure attributes and each Ai is a feature attribute. The structure attributes of a 2-D image are the X and Y coordinates of the pixels (rows); the feature attributes are the bands B, G, R, NIR, …. In BSQ we separate the features into separate files and suppress the structure attributes altogether (under the assumption that the pixels are always arranged in raster order). In bSQ we also separate each bit of each feature into its own file (same raster-order assumption). A P-tree represents spatial bSQ data bit-by-bit in a recursive quadrant-by-quadrant arrangement. A P-tree is a lossless representation of the original data, it is a compressed structure, and it is "count pre-computed".
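
A minimal sketch of the bSQ split (NumPy assumed; the toy band values are invented):

    import numpy as np

    band = np.array([[200, 13], [97, 255]], dtype=np.uint8)  # toy 2x2 band

    # bSQ: one bit-plane per bit position, most significant first (B11..B18).
    bit_planes = [(band >> (7 - b)) & 1 for b in range(8)]
    print(bit_planes[0])  # B11: the most significant bit of every pixel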

18 An example of a P-tree

[Figure: an 8×8 bit band traversed in Peano (Z) order and its P-tree; root count 55 with level-1 quadrant counts 16, 8, 15, 16, the mixed quadrants expanded down to leaf counts such as 3, 4, 1.] Key ideas labeled in the figure: Peano or Z-ordering; pure (Pure-1/Pure-0) quadrants; root count; level; fan-out; QID (quadrant ID).

19 An example of a P-tree (cont.)

[Figure: the same P-tree with levels labeled Level-3 (root) down to Level-0 (leaves) and quadrant IDs attached.] A QID addresses a quadrant by the path from the root, e.g., the Level-0 quadrant at position (7, 1) has QID (111, 001). Key ideas: Peano or Z-ordering; pure (Pure-1/Pure-0) quadrants; root count; level; fan-out; QID (quadrant ID).

20 P-tree Algebra

Operations: AND, OR, Complement, and others (XOR, etc.). [Figure: the example P-tree with root count 55 and, beside it, its complement. Complementing flips every leaf bit, so a quadrant of count c becomes one of count (quadrant size − c); e.g., the level-1 count 15 becomes 1, and pure-1 quadrants become pure-0.]

21 Ptree ANDing Operation
[Figure: two PM-trees (m marks a mixed node, leaves shown as 4-bit patterns such as 0100) and the PM-tree of their AND.] The AND is computed using depth-first pure-1 path codes: each operand streams out the QIDs of its pure-1 paths, and the result's pure-1 paths are the matches, e.g., paths 0, 20, 21, 231 surviving into the RESULT.
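
A sketch of AND over the (count, children) nodes of the construction sketch above (the quadrant size in bits is passed down so pure-1 can be recognized; this is our illustrative layout, not the paper's path-code implementation):

    def ptree_and(a, b, size):
        """AND two P-tree nodes over a quadrant holding `size` bits."""
        (ca, ka), (cb, kb) = a, b
        if ca == 0 or cb == 0:           # AND with pure-0 is pure-0
            return (0, None)
        if ca == size:                   # AND with pure-1 is the other operand
            return b
        if cb == size:
            return a
        # both mixed: AND the four sub-quadrants and re-total the count
        kids = [ptree_and(x, y, size // 4) for x, y in zip(ka, kb)]
        return (sum(c for c, _ in kids), kids)

    # usage, for two 8x8 bit planes p, q:  ptree_and(build_ptree(p), build_ptree(q), 64)

Root counts of such ANDs are what the P-ARM slides below use to obtain itemset supports without scanning the data.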

22 Alternative forms for Ptrees (all lossless)
Three bit-based variants: P1 – 1 means the quadrant is pure-1, 0 otherwise (pure-0 if there are no sub-tree pointers, otherwise mixed); P0 – 1 means the quadrant is pure-0, 0 otherwise; PNZ – 1 means the quadrant is Not pure-Zero, 0 otherwise (note: PM = PNZ XOR P1, and PNZ = P0′). [Figure: the P1, P0, and PNZ trees for the running example, child positions labeled 00 01 10 11.] Vector forms (a table entry for each mixed internal node containing its QID and its children bit-vector; this eliminates the need for subtree pointers):

    P1V (as a table):       P0V:                    PNZV:
    qid      vector         qid      vector         qid   vector
    [ ]      1001           [ ]      0000           [ ]   1111
    [01]     0010           [01]     0100           [01]  1011
    [10]     1101           [10]     0000           [10]  1111
    [01.00]  1110           [01.00]  0001
    [01.11]  0010           [01.11]  1101
    [10.10]  1101           [10.10]  0010

Since there is no qid = [01.01] in the table, we know that quadrant is pure-0, not mixed.

23 Basic, Value and Tuple Ptrees
Basic P-trees (i.e., P11, P12, …, P18, P21, …, P28, …, P71, …, P78). ANDing basic P-trees gives value P-trees (e.g., P1,001 = P11′ AND P12′ AND P13, the P-tree of pixels whose Band-1 value is 001). ANDing value P-trees gives tuple P-trees (e.g., P001,010,111 = P1,001 AND P2,010 AND P3,111).
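
A sketch of the composition, using bit planes directly as stand-ins for the basic P-trees (complementing a bit plane is just 1 − plane; a real implementation would AND the trees themselves):

    import numpy as np

    band1 = np.random.randint(0, 8, (8, 8))               # toy 3-bit band
    planes = [(band1 >> (2 - b)) & 1 for b in range(3)]   # B11, B12, B13

    # Value P-tree for Band-1 value 001: P11' AND P12' AND P13
    p_value_001 = (1 - planes[0]) & (1 - planes[1]) & planes[2]
    print(int(p_value_001.sum()))   # root count = # pixels with value 001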

24 Example: Blue Band B1, with 3-bit values

[Tables: the PNZV/P1V entries for the three bit P-trees. P1V11 has entries at qids [ ], [01], [10], [01.00], [01.11], [10.10]; P1V12 at [ ], [10], [10.11] = 0111; P1V13 at [ ], [01], [10], [01.11] = 0110, [10.00] = 1000 — this last entry is redundant, since there is no mixed node at that level.] RC(P1,6) = RC(P1,110) = RC(P11 ^ P12 ^ P′13). At [ ]: P1 of the value tree is P1(1,6) = P1_11 ^ P1_12 ^ PNZ′13 (since P′13 is P0V13 = PNZ′13). Calculate COUNT = P1-count * 4^level. PNZ(1,6) = PNZ11 ^ PNZ12 ^ P1′13 (since PNZ of P′13 is P1′13). Recursively get P1-count * 4^level from each mixed child (PMV(1,6) = P0′(1,6) ^ P1′(1,6)).
Level-2, P1[ ]: 1001 ^ 1000 ^ 1000 = 1000, contributing 1*4^2 = 16; PNZ[ ] = 1111 ^ 1010 ^ 1110 = 1010, P1[ ] = 1000, so PM[ ] = 0010.
Level-1, P1[10]: 1101 ^ 1110 ^ 0001 = 0000, contributing 0*4^1 = 0; PNZ[10] = 1111 ^ 1111 ^ 1001 = 1001, P1[10] = 0000, so PM[10] = 1001.
Level-0, P1[10.00]: 1111 ^ 1111 ^ 0111 = 0111, contributing 3*4^0 = 3; P1[10.11]: 1111 ^ 0111 ^ 1111 = 0111, contributing 3*4^0 = 3.
TOTAL = 16 + 0 + 3 + 3 = 22.

25 All nodes return counts to the original root (in this case, NodeC)
Distributed P-trees? Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11. Send each table entry to Nodeij if its qid ends in ij (root-level entries stay at NodeC):

    NodeC:  11[ ], 12[ ], 13[ ]
    Node00: 11[01.00] = 1110, 13[10.00] = 1000
    Node01: 11[01], 13[01]
    Node10: 11[10], 11[10.10] = 1101, 12[10], 13[10]
    Node11: 11[01.11] = 0010, 12[10.11] = 0111, 13[01.11] = 0110

Distributed request: send the request RC(P1,6) to NodeC (if only the root count of a subquadrant is wanted, send it to the node holding that qid, e.g., [10]).
NodeC does: P1[ ] = 1001^1000^1000 = 1000, Count = 1*4^2 = 16; PNZ[ ] = 1111^1010^1110 = 1010; PM[ ] = 0010. Sends the request RC(P1,6) on to Node10.
Node10 does: P1[10] = 1101^1110^0001 = 0000, Count = 0; PNZ[10] = 1111^1111^1001 = 1001; PM[10] = 1001. Sends the request RC(P1,6) on to Node00 and Node11.
Node00 does: P1[10.00] = 1111^1111^0111 = 0111, contribution 3*4^0 = 3; sends (3) to NodeC (all nodes return counts to NodeC).
Node11 does: P1[10.11] = 1111^0111^1111 = 0111, contribution 3*4^0 = 3; sends (3) to NodeC.
NodeC adds up the counts: 16 + 0 + 3 + 3 = 22. All nodes return counts to the original root (in this case, NodeC).

26 All nodes return counts to the original root (in this case, NodeC)
Distributed P-trees, shown with 3-valued logic (for human readability only): pure1 = 1, pure0 = 0, mixed = x.

    P11: [ ] 1xx1, [01] x01x, [10] 11x1, [01.00] 1110, [01.11] 0010, [10.10] 1101
    P12: [ ] 10x0, [10] 111x, [10.11] 0111
    P13: [ ] 0xx1, [01] 111x, [10] x110, [01.11] 0110, [10.00] 1000

Send each entry to Nodeij if its qid ends in ij (assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11):

    NodeC:  11[ ] 1xx1, 12[ ] 10x0, 13[ ] 0xx1
    Node00: 11[01.00] 1110, 13[10.00] 1000
    Node01: 11[01] x01x, 13[01] 111x
    Node10: 11[10] 11x1, 11[10.10] 1101, 12[10] 111x, 13[10] x110
    Node11: 11[01.11] 0010, 12[10.11] 0111, 13[01.11] 0110

The distributed request RC(P1,6) proceeds exactly as on the previous slide: NodeC contributes 1*4^2 = 16 and forwards to Node10; Node10 contributes 0 and forwards to Node00 and Node11; each of those contributes 3*4^0 = 3; NodeC totals 16 + 0 + 3 + 3 = 22. All nodes return counts to the original root (in this case, NodeC).

27 Blue Band, B1, with 3-bit values
Example 1 (bottom-up construction of B11), with the leaf quadrants arriving in Peano order. Quadrants [00.00], [00.01], [00.10], [00.11] arrive; this ends the possibility of a larger pure-1 quad, so [00] can be installed in its parent as a pure-1. Then [01.00] and [01.01] arrive: a mixed leaf quad is sent to its home node (Node-C, Node-00, Node-01, Node-10, Node-11, by the trailing bits of its qid), which also ends the possibility that the parent is pure, so it and all its siblings are installed as bits in the parent. [01.11] is a mixed leaf quad and is sent; it ends its parent, so bits are installed in the grandparent as well. [Tables: the running Bp/qid/NZ/P1 tables at Node-C and Nodes 00–11; their values were lost in transcription.]

28 Blue Band, B1, with 3-bit values
Example 1 (bottom-up) continued for quadrants [10] and [11] of B11. [10.00] and [10.01] arrive, then [10.10] and [10.11]; this ends the possibility of a larger pure-1 quad, so all can be installed in the parent/grandparent as 1-bits, and the mixed quad [10.10] can be installed (at Node-10). Quadrants [11.00]–[11.11] all arrive pure; this ends quad [11], so all can be installed in the parent as a 1-bit. [Tables: the running tables at Node-C and Nodes 00–11; values lost in transcription.]

29 Example 2: R(X, Y, B1, B2)

[Figure: the two 3-bit bands B1 and B2 of Example 2 and their bit planes B11, B12, B13 and B21, B22, B23.]

30 Example 2: Striping

Each pixel arrives as a tuple (x1y1x2y2x3y3, B11B12B13B21B22B23): the interleaved coordinate bits give the Peano address, and the band bits give one bit per bit plane. Raster order is converted to Peano order, and per-quadrant PNZV/P1V vectors are built by ORing the bits (for PNZ) and ANDing them (for P1). Entries whose qid ends in 00, 01, 10, 11 are sent to Node00, Node01, Node10, Node11, respectively; in this pass, B21B22B23 entries go to Node00, B11, B13, B22, B23 entries to Node01, B12, B13, B21, B22, B23 entries to Node10, and nothing to Node11. The root-level ([ ]) entries for all six bit P-trees stay at NodeC as the purity template for [ ]. [Tables: the PNZV/P1V columns per band bit; values lost in transcription.]

31 Example2: striping at Node 00
At Node00, the stripes whose qids end in 00 are processed the same way: PNZV/P1V vectors are built for the [..00]-level quadrants against the purity template for [00], pages are written to disk, and sub-entries whose qids end elsewhere are forwarded (here, [ ]B21B22 entries go to Node01; nothing goes to Node00 itself, Node10, or Node11). Entries received from the [01] and [10] stripes are merged in. [Tables: the per-band-bit PNZV/P1V pages at Node00; values lost in transcription.]

32 Example2: striping at Node 01
At Node01, likewise: vectors for qids ending in 01 are built against the purity template for [01] and paged to disk; [01]B11B23 entries are forwarded to Node00; nothing is forwarded to Node01 itself, Node10, or Node11; entries received from the [00] and [10] stripes are merged in. [Tables: values lost in transcription.]

33 Example2: striping at Node 10
At Node10: vectors for qids ending in 10 are built against the purity template for [10] and paged to disk; [10]B13B21B23 entries are forwarded to Node00, [10]B23 to Node01, [10]B12B21B22 to Node11, and nothing to Node10 itself. [Tables: values lost in transcription.]

34 Example2: striping at Node11
At Node11: the entries whose qids end in 11, received from the [10] stripe (band bits 1_2, 2_2, 2_3 at [10.11]), are built into PNZV/P1V pages on disk. [Tables: values lost in transcription.]

35 Example2.1 AND at NodeC or [ ]
Request: RC(P101,010) = P11 ^ P′12 ^ P13 ^ P′21 ^ P22 ^ P′23. NodeC ANDs the root-level NZ patterns (using the complement vector of each primed operand) and the root-level P1 patterns; the pure-1 positions of the P1 result contribute whole quadrants (Sum = 8 so far), and the mixed positions determine where the invocation [ ] 101,010 is forwarded: Nodes 01 and 10. [Tables: the root-level NZ/P1 AND patterns and the per-node disk pages (PT[ ], PT[00], PT[01], PT[10], PT[11]); values lost in transcription.]

36 Example2.1 AND at Node01

The invocation [ ] 101,010 is received. Node01 ANDs the NZ patterns of its [01]-entries (12[01], 21[01], …), getting NZ = 1000, and the P1 patterns, getting P1 = 0000; the mixed position sends the invocation [01] 101,010 on to Node00. [Tables: the per-node disk pages; values lost in transcription.]

37 Example2.1 AND at Node10

The invocation [ ] 101,010 is received. The NZ AND over the [10]-entries gives 0000 and the P1 AND gives 0000, so nothing is forwarded (no mixed positions) and Node10 contributes 0. [Tables: values lost in transcription.]

38 Example2.1 AND at Node00

The invocation [01] 101,010 is received. The P1 AND over the [01.00]-entries (12, 13, 21, 22, …) gives 0100, so Sum = 1 is sent to NodeC, giving a sum total of 8 + 1 = 9. [Tables: values lost in transcription.]

39 Example2.2 AND at NodeC or [ ]
Request: RC(P100,101) = P11 ^ P′12 ^ P′13 ^ P21 ^ P′22 ^ P23. The root-level NZ AND gives 0010 and the P1 AND gives 0000, so Sum = 0 so far; the invocation [ ] 100,101 is forwarded to Node10 for the mixed position. [Tables: values lost in transcription.]

40 Example2.2 AND at Node10

The invocation [ ] 100,101 is received. The NZ AND over the [10]-entries (11, 12, 13, 21, 22, 23) gives 0001 and the P1 AND gives 0000, so the invocation [10] 100,101 is forwarded to Node11. [Tables: values lost in transcription.]

41 Example2.2 AND at Node11

The invocation [10] 100,101 is received. The P1 AND over the [10.11]-entries (12, 13, 21, …) yields one pure-1 position, so Sum = 1 is sent to NodeC, giving a sum total of 1. [Tables: values lost in transcription.]

42 Example 2, bottom-up, Peano order

The first leaf quadrant [00.00] of every bit plane (11, 12, 13, 21, 22, 23) arrives as a tuple (x1y1x2y2x3y3, B11B12B13B21B22B23) and is entered into the tables (e.g., 11[00.00] = 1111). [Tables: values lost in transcription.]

43 Mixed quads (can be sent to node01)
As [00.01] arrives for each bit plane, the quads that turn out mixed (here 21[00.01] and 22[00.01]) can be sent to Node01 (the node matching their trailing qid bits). [Tables: values lost in transcription.]

44 Mixed quads (sent to node00)
As [00.10] arrives, the mixed quads are distributed to their home nodes: 23[00] sits at Node00, while 21[00.01] and 22[00.01] are already at Node01. [Tables: values lost in transcription.]

45 00 quads that are pure are:
After [00.11], the [00] quads that are pure are installed as bits: 11[00], 12[00], and 13[00] are pure; at Node00 sits the mixed 23[00]; at Node01 sit 21[00.01] and 22[00.01]. [Tables: values lost in transcription.]

46 P-ARM Algorithm

    Procedure P-ARM {
        Data_Discretization;
        F1 = {frequent 1-itemsets};
        for (k = 2; F(k-1) ≠ ∅; k++) do begin
            Ck = p-gen(F(k-1));
            forall candidate itemsets c ∈ Ck do
                c.count = AND_rootcount(c);
            Fk = {c ∈ Ck | c.count >= minsup};
        end
        Answer = ∪k Fk
    }

The P-ARM algorithm assumes a fixed value precision in all bands. The p-gen function for numeric spatial data differs from apriori-gen by using additional pruning techniques: in p-gen, even if both B3[0,64) and B3[64,128) are frequent 1-itemsets, they are not joined into a candidate 2-itemset (intervals of the same band are mutually exclusive). The AND_rootcount function calculates itemset counts directly by ANDing the appropriate basic P-trees instead of scanning the transaction database; for instance, the support count for the itemset {B1[0,64), B2[64,128)} is the root count of P1,00 AND P2,01.

47 The Apriori Algorithm — Example
Database D (TID: items): 100: {1,3,4}; 200: {2,3,5}; 300: {1,2,3,5}; 400: {2,5}. The classical flow is: scan D for C1, prune to L1 = {1, 2, 3, 5}; join and scan D for C2, prune to L2 = {13, 23, 25, 35}; join and scan D for C3, prune to L3 = {235}. With P-trees, each scan is replaced by root counts: build one P-tree (a bit vector over the four transactions) per item and AND them:

    P1 = 1010 (2)   P2 = 0111 (3)   P3 = 1110 (3)   P4 = 1000 (1)   P5 = 0111 (3)
    P1^P2 = 0010 (1)   P1^P3 = 1010 (2)   P1^P5 = 0010 (1)
    P2^P3 = 0110 (2)   P2^P5 = 0111 (3)   P3^P5 = 0110 (2)
    P1^P2^P3 = 0010 (1)   P1^P3^P5 = 0010 (1)   P2^P3^P5 = 0110 (2)

L1 = {1,2,3,5}, L2 = {13,23,25,35}, L3 = {235}.

48 P-ARM versus Apriori

Compare with Apriori (the classical method) and FP-growth (recently proposed). Find all frequent itemsets, not just those containing Yield, for fairness. The images are actual aerial TIFF images with synchronized yield maps: a 1320 × 1320 pixel TIFF-Yield dataset (total number of transactions ~1,700,000), 2-bit precision, equi-length partitioning. [Charts: scalability with the number of transactions; scalability with the support threshold.] The results are identical to Apriori's. P-ARM is more scalable for lower support thresholds, and more scalable to large spatial datasets.

49 P-ARM versus FP-growth
17,424,000 pixels (transactions). FP-growth = an efficient, tree-based frequent-pattern mining method (details later). [Charts: scalability with the support threshold; scalability with the number of transactions.] The results are identical. For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size P-ARM achieves better performance; P-ARM also achieves better performance in the case of low support thresholds.

50 High Confidence Rules

Application areas on spatial data: yield identification; identification of agricultural pest infestations. Traditional algorithms are not suitable, because there are too many frequent itemsets in the case of a low support threshold. Approach: P-trees and the T-cube; keep a (low) support threshold to eliminate rules that result from noise and outliers, and demand high confidence. Eliminate redundant rules, ranked by confidence and rule size, using the generalization relation between rules: r generalizes r′ if they have the same consequent and the antecedent of r is properly contained in the antecedent of r′.

51 Confident Rule Mining Algorithm
Build the set of confident rules, C (initially empty), as follows. Start with 1-bit values and 2 bands; then 1-bit values and 3 bands; …; then 2-bit values and 2 bands; then 2-bit values and 3 bands; and so on. At each stage, do the following: find all confident rules by rolling up the T-cube along each potential consequent set using summation, comparing these sums with the support threshold to isolate rule support sets with minimum support; compare the normalized T-cube values (divide by the rolled-up sum) with the minimum confidence level to isolate the confident rules; place any new confident rule in C, but only if its rank is higher than that of any of its generalizations already in C.

52 Example

Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Start with 1-bit values and 2 bands, B1 and B2. The 2×2 T-cube of counts over the (B1-bit, B2-bit) pairs, with the roll-up sums along B2 and the 80% confidence thresholds (layout reconstructed from the counts 5, 19, 25, 15 and the sums 30, 34 / thresholds 24, 27.2 in the transcript):

            B2=0   B2=1   sum    0.8*sum
    B1=0     25      5     30      24
    B1=1     15     19     34      27.2

Since 25 ≥ 24, C gains: B1={0} ⇒ B2={0}, c = 25/30 = 83.3%.

53 Methods to Improve Apriori’s Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hashing-bucket count is below the threshold cannot be frequent. Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans. Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB. Sampling: mine on a subset of the given data with a lowered support threshold, plus a method to determine the completeness. Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent. The core of the Apriori algorithm: use frequent (k−1)-itemsets to generate candidate frequent k-itemsets; use database scans and pattern matching to collect counts for the candidates. The bottleneck of Apriori is candidate generation: (1) huge candidate sets — 10^4 frequent 1-itemsets will generate about 10^7 candidate 2-itemsets, and to discover a frequent pattern of size 100, e.g., {a1…a100}, one needs to generate 2^100 ≈ 10^30 candidates; (2) multiple scans of the database — it needs (n+1) scans, where n is the length of the longest pattern.

54 Mining Frequent Patterns Without Candidate Generation
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining, and it avoids costly database scans. Develop an efficient, FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones, and avoids candidate generation (sub-database tests only!).

55 Construct FP-tree from a Transaction DB
min_support = 0.5 (here, 3 of the 5 transactions)

    TID  Items bought                  (ordered) frequent items
    100  {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
    200  {a, b, c, f, l, m, o}         {f, c, a, b, m}
    300  {b, f, h, j, o}               {f, b}
    400  {b, c, k, s, p}               {c, b, p}
    500  {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

Steps: (1) scan the DB once to find the frequent 1-itemsets (single-item patterns); (2) order the frequent items in frequency-descending order (header table: f:4, c:4, a:3, b:3, m:3, p:3); (3) scan the DB again, inserting each transaction's ordered frequent items into the FP-tree. [Figure: the resulting FP-tree, rooted at {}, with paths f:4–c:3–a:3–m:2–p:2, a b:1–m:1 branch under a:3, a b:1 branch under f:4, and c:1–b:1–p:1; node-links from the header table chain together equal items.]
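
A minimal FP-tree construction sketch matching the three steps (the class and function names are ours; ties in frequency are broken arbitrarily, so the tree shape may differ in tie order from the figure):

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent, self.count, self.kids = item, parent, 0, {}

    def build_fptree(transactions, min_count):
        """Two scans: count items, then insert ordered frequent items."""
        freq = defaultdict(int)
        for t in transactions:
            for i in t:
                freq[i] += 1
        freq = {i: n for i, n in freq.items() if n >= min_count}
        root, header = Node(None, None), defaultdict(list)  # header: item -> node-links
        for t in transactions:
            # keep frequent items, ordered by descending frequency
            items = sorted((i for i in t if i in freq), key=lambda i: -freq[i])
            node = root
            for i in items:
                if i not in node.kids:
                    node.kids[i] = Node(i, node)
                    header[i].append(node.kids[i])
                node = node.kids[i]
                node.count += 1
        return root, header

    D = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
    root, header = build_fptree(D, 3)
    print([(n.item, n.count) for n in header["m"]])  # the m node-link chain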

56 FP-tree Structure Completeness: Compactness
Completeness: it never breaks a long pattern of any transaction, and it preserves complete information for frequent pattern mining. Compactness: it reduces irrelevant information (infrequent items are gone); frequency-descending ordering means more frequent items are more likely to be shared; it is never larger than the original database (if node-links and counts are not counted). Example: for the Connect-4 DB, the compression ratio can be over 100. General idea (divide-and-conquer): recursively grow frequent patterns using the FP-tree. Method: for each item, construct its conditional pattern base, then its conditional FP-tree; repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty or contains only one path (a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern).

57 Major Steps to Mine FP-tree
Construct the conditional pattern base for each node in the FP-tree; construct the conditional FP-tree from each conditional pattern base; recursively mine the conditional FP-trees and grow the frequent patterns obtained so far; if a conditional FP-tree contains a single path, simply enumerate all the patterns. Step 1: starting at the frequent-item header table of the FP-tree, traverse the FP-tree by following the node-links of each frequent item, and accumulate all of that item's transformed prefix paths to form its conditional pattern base. For the example FP-tree:

    item   conditional pattern base
    c      f:3
    a      fc:3
    b      fca:1, f:1, c:1
    m      fca:2, fcab:1
    p      fcam:2, cb:1
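
A sketch of Step 1 over the Node/header structures from the construction sketch above:

    def conditional_pattern_base(item, header):
        """Collect (prefix-path, count) pairs by walking each node-link upward."""
        base = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((list(reversed(path)), node.count))
        return base

    print(conditional_pattern_base("m", header))
    # e.g. [(['f', 'c', 'a'], 2), (['f', 'c', 'a', 'b'], 1)]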

58 FP-tree Construction continued
Node-link property (for conditional pattern base construction): for any frequent item ai, all the possible frequent patterns that contain ai can be obtained by following ai's node-links, starting from ai's entry in the FP-tree header. Prefix-path property: to calculate the frequent patterns for a node ai on a path P, only the prefix sub-path of ai in P needs to be accumulated, and its frequency count carries the same count as node ai. Step 2: for each pattern base, accumulate the count for each item in the base and construct the FP-tree over the frequent items of the base. E.g., the m-conditional pattern base is fca:2, fcab:1; b is infrequent within it, so the m-conditional FP-tree is the single path {}–f:3–c:3–a:3, and all the frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam.

59 Mining Frequent Patterns by Creating Conditional Pattern-Bases
    item   conditional pattern base      conditional FP-tree
    f      Empty                         Empty
    c      {(f:3)}                       {(f:3)} | c
    a      {(fc:3)}                      {(f:3, c:3)} | a
    b      {(fca:1), (f:1), (c:1)}       Empty
    m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)} | m
    p      {(fcam:2), (cb:1)}            {(c:3)} | p

60 Step 3: Recursively mine the conditional FP-tree
Starting from the m-conditional FP-tree {}–f:3–c:3–a:3: the conditional pattern base of "am" is (fc:3), giving the am-conditional FP-tree {}–f:3–c:3; the conditional pattern base of "cm" is (f:3), giving the cm-conditional FP-tree {}–f:3; the conditional pattern base of "cam" is (f:3), giving the cam-conditional FP-tree {}–f:3.

61 Single FP-tree Path Generation
Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P. E.g., from the m-conditional FP-tree {}–f:3–c:3–a:3, all the frequent patterns concerning m are m, fm, cm, am, fcm, fam, cam, fcam. Principle of FP-growth (pattern-growth property): let α be a frequent itemset in DB, B be α's conditional pattern base, and β be an itemset in B; then α ∪ β is a frequent itemset in DB iff β is frequent in B. "abcdef" is a frequent pattern if and only if "abcde" is a frequent pattern and "f" is frequent in the set of transactions containing "abcde".
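
The single-path case is just the power set of the path's items, each combination extended by the suffix (a sketch):

    from itertools import combinations

    def single_path_patterns(path_items, suffix):
        """All combinations of the sub-path, each unioned with the suffix."""
        pats = []
        for r in range(len(path_items) + 1):
            for combo in combinations(path_items, r):
                pats.append(frozenset(combo) | suffix)
        return pats

    print(single_path_patterns(["f", "c", "a"], frozenset("m")))
    # m, fm, cm, am, fcm, fam, cam, fcam  (8 patterns)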

62 FP-growth vs. Apriori: Scalability With the Support Threshold
[Charts: run time vs. support threshold on data sets T25I20D100K and T25I20D10K.] Our performance study shows that FP-growth is an order of magnitude faster than Apriori, and is also faster than tree-projection. Reasoning: no candidate generation and no candidate tests; a compact data structure; repeated database scans are eliminated; the basic operations are counting and FP-tree building.

63 Presentation of Association Rules (Table Form)

[Figures: association rules presented in table form; visualization of association rules using a plane graph; visualization using a rule graph.] Iceberg query: compute aggregates over one attribute or a set of attributes only for those whose aggregate values are above a certain threshold. E.g.:

    select P.custID, P.itemID, sum(P.qty)
    from purchase P
    group by P.custID, P.itemID
    having sum(P.qty) >= 10

Compute iceberg queries efficiently by Apriori: first compute the lower dimensions, then compute higher dimensions only when all the lower ones are above the threshold.

64 Multiple-Level Association Rules
Items often form a hierarchy, e.g., Food → (bread, milk); milk → (skim, 2%); bread → (white, wheat); with brands such as Sunset and Fraser at the lowest level. Items at the lower levels are expected to have lower support. Rules regarding itemsets at the appropriate levels could be useful. The transaction database can be encoded based on dimensions and levels, and we can explore shared multi-level mining.

65 Mining Multi-Level Associations
A top-down, progressive-deepening approach: first find high-level strong rules, e.g., milk → bread [20%, 60%]; then find their lower-level "weaker" rules, e.g., 2% milk → wheat bread [6%, 50%]. Variations on mining multiple-level association rules: level-crossed association rules, e.g., 2% milk → Wonder wheat bread; association rules with multiple, alternative hierarchies, e.g., 2% milk → Wonder bread. Uniform support vs. reduced support. Uniform support: the same minimum support for all levels. + Only one minimum support threshold, and no need to examine itemsets containing any item whose ancestors do not have minimum support. – Lower-level items do not occur as frequently, so if the support threshold is too high we miss low-level associations, and if it is too low we generate too many high-level associations. Reduced support: a reduced minimum support at lower levels, with 4 search strategies: level-by-level independent; level-cross filtering by k-itemset; level-cross filtering by single item; controlled level-cross filtering by single item.

66 Uniform vs. Reduced Support

Multi-level mining with uniform support — Level 1 (min_sup = 5%): Milk [support = 10%]; Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%] — skim milk is pruned. Multi-level mining with reduced support — Level 1 (min_sup = 5%): Milk [support = 10%]; Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%] — both survive.

67 Multi-level Association: Redundancy Filtering
Some rules may be redundant due to "ancestor" relationships between items. E.g., milk → wheat bread [support = 8%, confidence = 70%] and 2% milk → wheat bread [support = 2%, confidence = 72%]: we say the first rule is an ancestor of the second. A rule is redundant if its support is close to the "expected" value based on the rule's ancestor. Multi-level mining, progressive deepening: a top-down approach that first mines the high-level frequent items, e.g., milk (15%), bread (10%), then mines their lower-level "weaker" frequent itemsets, e.g., 2% milk (5%), wheat bread (4%). Different min_support thresholds across levels lead to different algorithms: if adopting the same min_support across levels, toss itemset t if any of t's ancestors is infrequent; if adopting a reduced min_support at lower levels, examine only those descendants whose ancestor's support is frequent/non-negligible.

68 Progressive Refinement
Why progressive refinement? A mining operator can be expensive or cheap, fine or rough; trade speed for quality with step-by-step refinement. Superset coverage property: preserve all positive answers — allow a false positive test but not a false negative test. Two- or multi-step mining: first apply a rough/cheap operator (superset coverage), then apply the expensive algorithm on the substantially reduced candidate set (Koperski & Han, SSD'95). Hierarchy of spatial relationships: "g_close_to": near_by, touch, intersect, contain, etc.; first search for the rough relationship and then refine it. Two-step mining of spatial associations: Step 1: rough spatial computation (as a filter), using MBRs or an R-tree for rough estimation; Step 2: detailed spatial algorithm (as refinement), applied only to the objects that passed the rough spatial association test (those with no less than min_support).

69 Multi-Dimensional Association: Concepts
Single-dimensional rules: buys(X, "milk") → buys(X, "bread"). Multi-dimensional rules: ≥ 2 dimensions or predicates. Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") → buys(X, "coke"). Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") → buys(X, "coke"). Categorical attributes: a finite number of possible values, no ordering among the values. Quantitative attributes: numeric, with an implicit ordering among values.

70 Techniques for Mining MD Associations
Search for frequent k-predicate sets. Example: {age, occupation, buys} is a 3-predicate set. Techniques can be categorized by how quantitative attributes such as age are treated. (1) Static discretization of quantitative attributes: quantitative attributes are statically discretized using predefined concept hierarchies. (2) Quantitative association rules: quantitative attributes are dynamically discretized into "bins" based on the data distribution. (3) Distance-based association rules: a dynamic discretization process that considers the distance between data points. Static discretization: discretize prior to mining using a concept hierarchy; numeric values are replaced by ranges. In a relational DB, finding all frequent k-predicate sets requires k or k+1 table scans. A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets — the lattice (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys) — and mining from data cubes can be much faster.

71 Quantitative Association Rules
Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized. 2-D quantitative association rules: Aquan1 ∧ Aquan2 → Acat. Cluster "adjacent" association rules to form general rules using a 2-D grid. Example: age(X, "30-34") ∧ income(X, "24K-48K") → buys(X, "high resolution TV").

72 ARCS (Association Rule Clustering System)
How does ARCS work? (1) Binning; (2) find frequent predicate sets; (3) clustering; (4) optimize. Limitations of ARCS: only quantitative attributes on the LHS of rules, and only 2 attributes on the LHS (the 2-D limitation). An alternative to ARCS: non-grid-based, equi-depth binning, clustering based on a measure of partial completeness ("Mining Quantitative Association Rules in Large Relational Tables" by R. Srikant and R. Agrawal).

73 Mining Distance-based Association Rules
Binning methods do not capture the semantics of interval data. Distance-based partitioning gives a more meaningful discretization, considering the density/number of points in an interval and the "closeness" of points in an interval. S[X] is the set of N tuples t1, t2, …, tN projected on the attribute set X. The diameter of S[X] is the average pairwise distance (formula reconstructed from the standard definition): d(S[X]) = (Σ_{i=1..N} Σ_{j=1..N} dist_X(t_i[X], t_j[X])) / (N(N−1)), where dist_X is a distance metric, e.g., Euclidean or Manhattan.

74 Minkowski Metrics

The p-metrics (aka Minkowski metrics): dp(X,Y) = (Σ_{i=1..n} |xi − yi|^p)^{1/p}. [Figure: unit disks and dividing lines for p = 1 (Manhattan), p = 2 (Euclidean), p = 3, 4, …, p = max, and p = ½, ⅓, ¼, ….] dmax ≡ max_i |xi − yi| = d∞ ≡ lim_{p→∞} dp(X,Y). Proof (sketch): lim_{p→∞} (Σ_{i=1..n} ai^p)^{1/p} = max(ai) ≡ b, since for p large enough the other ai^p << b^p (for ai < b, (ai/b)^p → 0), so Σ_{i=1..n} ai^p ≈ k·b^p (k = the multiplicity of b in the sum); hence (Σ ai^p)^{1/p} ≈ k^{1/p}·b, and k^{1/p} → 1.
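
A quick sketch of d_p and its convergence to d_max as p grows:

    def minkowski(x, y, p):
        """d_p(X,Y) = (sum |x_i - y_i|^p)^(1/p)."""
        return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

    x, y = (0.0, 0.0), (3.0, 4.0)
    for p in (1, 2, 3, 10, 100):
        print(p, minkowski(x, y, p))   # approaches max|x_i - y_i| = 4 as p grows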

75 P>1 Minkowski Metrics
[Table: spreadsheet computations of dp for various p > 1 on sample point pairs (x1,y1), (x2,y2), using e^(ln(xi − yi)·p) for the powers and e^(ln(Σ)/p) for the root; as p grows, the dp column converges to the MAX column, i.e., to dmax.]

76 P<1 Minkowski Metrics
d 1/p(X,Y) = (i=1 to n |xi - yi|1/p)p P<1 p=0 (lim as p0) doesn’t exist (Does not converge.) q x1 y1 x2 y2 e^(LN(B2-C2)*A2) e^(LN(D2-E2)*A2) e^(LN(G2+F2)/A2) E+29 q x1 y1 x2 y2 e^(LN(B2-C2)*A2) e^(LN(D2-E2)*A2) e^(LN(G2+F2)/A2) E+14 E+29 E+29

77 Min dissimilarity function
The dmin function, dmin(X,Y) = min_{i=1..n} |xi − yi|, is strange: it is not even a pseudo-metric. [Figure: its "unit disk", and the neighborhood of the blue point relative to the red point (the dividing boundary between the points closer to the blue than to the red) — major bifurcations!]

78 Criticism to Support and Confidence
Example 1 (Aggarwal & Yu, PODS98): among 5000 students, 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal. "play basketball → eat cereal [40%, 66.7%]" is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%. "play basketball → not eat cereal [20%, 33.3%]" is far more accurate, although it has lower support and confidence. Example 2: X and Y are positively correlated, X and Z negatively related, yet the support and confidence of X ⇒ Z dominate. We need a measure of dependent or correlated events: P(B|A)/P(B), which is also called the lift of the rule A ⇒ B.
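
Checking the example's numbers — confidence and lift in a few lines:

    n, n_basketball, n_cereal, n_both = 5000, 3000, 3750, 2000

    conf = n_both / n_basketball     # P(cereal | basketball) = 0.667
    lift = conf / (n_cereal / n)     # 0.667 / 0.75 = 0.889
    print(conf, lift)                # lift < 1: negatively correlated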

79 Constraint-Based Mining
Interactive, exploratory mining of gigabytes of data: could it be real? Yes — by making good use of constraints! What kinds of constraints can be used in mining? Knowledge-type constraints: classification, association, etc. Data constraints: SQL-like queries, e.g., find product pairs sold together in Vancouver in Dec. '98. Dimension/level constraints: in relevance to region, price, brand, customer category. Rule constraints: e.g., small sales (price < $10) triggers big sales (sum > $200). Interestingness constraints: e.g., strong rules (minsupp ≥ 3%, minconf ≥ 60%). Two kinds of rule constraints: rule-form constraints — meta-rule guided mining, e.g., P(x, y) ∧ Q(x, w) → takes(x, "database systems"); rule (content) constraints — constraint-based query optimization (Ng et al., SIGMOD'98), e.g., sum(LHS) < 100 ∧ min(LHS) > 20 ∧ count(LHS) > 3 ∧ sum(RHS) > 1000. 1-variable vs. 2-variable constraints (Lakshmanan et al., SIGMOD'99): 1-var: a constraint confining only one side (L/R) of the rule, e.g., as shown above; 2-var: a constraint confining both sides (L and R), e.g., sum(LHS) < min(RHS) ∧ max(RHS) < 5·sum(LHS).

80 Constraint-Based Association Query
Database: (1) trans(TID, Itemset), (2) itemInfo(Item, Type, Price). A constrained association query (CAQ) is of the form {(S1, S2) | C}, where C is a set of constraints on S1, S2, including a frequency constraint. A classification of (single-variable) constraints (the comparison-operator sets were lost in transcription and are reconstructed from the standard classification): Class constraint: S ⊆ A, e.g., S ⊆ Item. Domain constraints: S θ v, θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., S.Price < 100; v θ S, θ being ∈ or ∉, e.g., snacks ∉ S.Type; V θ S or S θ V, θ ∈ {⊆, ⊂, ⊇, ⊃, =}, e.g., {snacks, sodas} ⊆ S.Type. Aggregation constraints: agg(S) θ v, where agg ∈ {min, max, sum, count, avg} and θ ∈ {=, ≠, <, ≤, >, ≥}, e.g., count(S1.Type) = 1, avg(S2.Price) < 100. Given a CAQ = {(S1, S2) | C}, the algorithm should be sound (it only finds frequent sets that satisfy the given constraints C) and complete (all frequent sets satisfying C are found). A naïve solution: apply Apriori to find all frequent sets, then test them for constraint satisfaction one by one. Our approach: comprehensively analyze the properties of the constraints and push them as deeply as possible inside the frequent-set computation.

81 Convertible constraints
Constraints: Ca is anti-monotone iff for any pattern S not satisfying Ca, no super-pattern of S can satisfy Ca. Cm is monotone iff for any pattern S satisfying Cm, every super-pattern of S also satisfies it. A subset Is of items is a succinct set if it can be expressed as σp(I) for some selection predicate p, where σ is the selection operator. SP ⊆ 2^I is a succinct power set if there is a fixed number of succinct sets I1, …, Ik ⊆ I such that SP can be expressed in terms of the strict power sets of I1, …, Ik using union and minus. A constraint Cs is succinct provided SATCs(I) is a succinct power set. Suppose all items in patterns are listed in a total order R. A constraint C is convertible anti-monotone iff a pattern S satisfying the constraint implies that each suffix of S w.r.t. R also satisfies C; C is convertible monotone iff a pattern S satisfying the constraint implies that each pattern of which S is a suffix w.r.t. R also satisfies C. [Figure: the relationships among the constraint classes — succinctness, anti-monotonicity, monotonicity, convertible constraints, inconvertible constraints.]

82 Property of Constraints
Anti-monotonicity: if a set S violates the constraint, any superset of S violates the constraint. Examples: sum(S.Price) ≤ v is anti-monotone; sum(S.Price) ≥ v is not anti-monotone; sum(S.Price) = v is partly anti-monotone. Application: push "sum(S.Price) ≤ 1000" deeply into the iterative frequent-set computation. [Table, reconstructed — the transcript's layout was lost; this is the standard classification:]

    Constraint                     Anti-monotone   Succinct
    S θ v, θ ∈ {=, ≤, ≥}           yes             yes
    v ∈ S                          no              yes
    S ⊆ V                          yes             yes
    S ⊇ V                          no              yes
    S = V                          partly          yes
    min(S) ≤ v                     no              yes
    min(S) ≥ v                     yes             yes
    min(S) = v                     partly          yes
    max(S) ≤ v                     yes             yes
    max(S) ≥ v                     no              yes
    max(S) = v                     partly          yes
    count(S) ≤ v                   yes             weakly
    count(S) ≥ v                   no              weakly
    count(S) = v                   partly          weakly
    sum(S) ≤ v                     yes             no
    sum(S) ≥ v                     no              no
    sum(S) = v                     partly          no
    avg(S) θ v, θ ∈ {=, ≤, ≥}      convertible     no
    (frequency constraint)         yes             no

Example of a convertible constraint, avg(S) ≥ v: let R be the value-descending order over the set of items, e.g., I = {9, 8, 6, 4, 3, 1}. avg(S) ≥ v is convertible monotone w.r.t. R: if S is a suffix of S1, then avg(S1) ≥ avg(S) — {8, 4, 3} is a suffix of {9, 8, 4, 3}, and avg({9, 8, 4, 3}) = 6 ≥ avg({8, 4, 3}) = 5 — so if S satisfies avg(S) ≥ v, so does S1: {8, 4, 3} satisfies avg(S) ≥ 4, and so does {9, 8, 4, 3}. Succinctness: for any sets S1 and S2 satisfying C, S1 ∪ S2 satisfies C; given A1, the set of size-1 itemsets satisfying C, any set S satisfying C is based on A1 (it has a subset belonging to A1). Example: sum(S.Price) ≤ v is not succinct; min(S.Price) ≤ v is succinct. Optimization: if C is succinct, then C is pre-counting prunable — the satisfaction of the constraint alone is not affected by the iterative support counting.

