Review of Vertical Data and 1-D Ptrees

Given a table R(A1, A2, A3, A4) structured into horizontal records, the traditional way to process it is to scan vertically: Vertical Processing of Horizontal Data (VPHD). For example, using VPHD to find the number of occurrences of (7,0,1,4) gives 2. The alternative is Horizontal Processing of Vertical Data (HPVD).

Predicate trees (Ptrees): vertically project each attribute of R, giving R[A1], R[A2], R[A3], R[A4]; then vertically project each bit position of each attribute (converting base 10 to base 2), giving the bit slices R11, R12, R13, R21, ..., R43; then compress each bit slice into a basic 1-D Ptree, P11, P12, ..., P43.

Top-down construction of the 1-dimensional Ptree of R11, denoted P11: record the truth of the universal predicate pure1 (all bits are 1) in a tree, recursively on halves (1/2^1 subsets), until purity is achieved. The compression of R11 into P11 goes as follows:
1. Whole is pure1? false = 0
2. Left half pure1? false = 0 (but it is pure, namely pure0, so this branch ends)
3. Right half pure1? false = 0
4. Left half of right half? false = 0
5. Right half of right half? true = 1
To find the number of occurrences of (7,0,1,4), AND these basic Ptrees (next slide).
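A minimal Python sketch of this top-down construction. The function name and the dict-based node representation are my own, and the 8-bit slice in the example is a hypothetical one chosen to reproduce the slide's sequence of answers (its only 1s sit in the right half of the right half):

```python
def build_ptree(bits):
    """Compress a bit slice into a 1-D Ptree, top-down.

    Each node records the truth of the universal predicate pure1
    ("all bits in this half are 1"); a pure half (all 0s or all 1s)
    becomes a leaf -- the branch ends -- otherwise recurse on halves.
    """
    if all(bits) or not any(bits):       # pure1 or pure0: branch ends
        return {"pure1": int(bits[0])}
    mid = len(bits) // 2
    return {"pure1": 0,                  # mixed half, so not pure1
            "left": build_ptree(bits[:mid]),
            "right": build_ptree(bits[mid:])}

# Hypothetical 8-bit slice: whole not pure1; left half pure0 (leaf);
# right half mixed; its left half pure0; its right half pure1.
p11 = build_ptree([0, 0, 0, 0, 0, 0, 1, 1])
```

Note how compression falls out of the purity test: the entire pure0 left half collapses to a single leaf, so no node below it is ever stored.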
Vertical Data: Shortcuts in the processing of 1-Dimensional Ptrees

To count occurrences of (7,0,1,4), use:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
(7 = 111, 0 = 000, 1 = 001, 4 = 100; P' denotes the complement Ptree, used wherever the value's bit is 0.)

Shortcuts during the AND: a 0 (pure0) node in any operand makes the entire corresponding branch of the result 0, with no need to descend further; a result node is 1 (pure1) only where every operand is 1 there (0s in complemented operands count as 1s). In this example the 2^1 level (2nd level) holds the only 1-bit, so the 1-count of the result Ptree is 1 * 2^1 = 2.
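The same complement-and-AND logic can be sketched over uncompressed bit slices (the pure0/pure1 shortcuts above apply only to the compressed tree form, so this sketch shows just the counting scheme; the function name and sample rows are hypothetical, not from the slide):

```python
def count_occurrences(tuple_vals, slices, bits_per_attr=3):
    """Count occurrences of a value tuple by ANDing bit slices.

    For each attribute, AND the slice where the value's bit is 1 and
    the complemented slice where it is 0 -- e.g. for (7,0,1,4):
    P11^P12^P13 ^ P'21^P'22^P'23 ^ P'31^P'32^P33 ^ P41^P'42^P'43.
    slices[i][j] is the j-th (most-significant-first) bit slice of
    attribute i+1; the 1-count of the final AND is the answer.
    """
    result = [1] * len(slices[0][0])
    for i, val in enumerate(tuple_vals):
        for j in range(bits_per_attr):
            bit = (val >> (bits_per_attr - 1 - j)) & 1
            s = slices[i][j]
            result = [r & (b if bit else 1 - b) for r, b in zip(result, s)]
    return sum(result)

# Hypothetical 4-row table in which (7,0,1,4) occurs twice:
rows = [(7, 0, 1, 4), (2, 5, 1, 6), (7, 0, 1, 4), (0, 0, 0, 0)]
slices = [[[(row[i] >> (2 - j)) & 1 for row in rows]
           for j in range(3)] for i in range(4)]
n_matches = count_occurrences((7, 0, 1, 4), slices)  # 2
```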
Example: ARM using uncompressed Ptrees (note: the 1-count has been placed at the root of each Ptree) Data_Lecture_4.1_ARM

Classical Apriori rescans D at every level: C1 (scan D) -> L1, C2 (scan D) -> L2, C3 (scan D) -> L3. With Ptrees, D is scanned once to build the basic Ptrees, and all later supports come from Ptree ANDs.

Build Ptrees (one scan of D):
  P1: 1010 (1-count 2)
  P2: 0111 (1-count 3)
  P3: 1110 (1-count 3)
  P4: 1000 (1-count 1)
  P5: 0111 (1-count 3)
L1 = {1}{2}{3}{5}; F1 = L1

2-itemset supports by ANDing (no scan of D):
  P1^P2: 0010 (1)
  P1^P3: 1010 (2)
  P1^P5: 0010 (1)
  P2^P3: 0110 (2)
  P2^P5: 0111 (3)
  P3^P5: 0110 (2)
L2 = {13}{23}{25}{35}; F2 = L2

C3 itemsets: {2,3,5}, {1,2,3}, {1,3,5}
  {1,2,3} pruned since {1,2} not frequent
  {1,3,5} pruned since {1,5} not frequent
  P1^P2^P3: 0010 (1)
  P1^P3^P5: 0010 (1)
  P2^P3^P5: 0110 (2)
L3 = {235}; F3 = L3
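The whole level-by-level procedure can be sketched as follows, with support counting done by bitwise AND of the vertical bit vectors rather than a scan of D (function name is mine; the item bit vectors are exactly the slide's P1..P5):

```python
from itertools import combinations

def ptree_apriori(item_bits, minsup):
    """Apriori over uncompressed 1-D Ptrees: supports come from
    ANDing vertical bit vectors, never from rescanning D."""
    def support(itemset):
        bits = item_bits[itemset[0]]
        for it in itemset[1:]:
            bits = [a & b for a, b in zip(bits, item_bits[it])]
        return sum(bits)

    frequent = {}
    level = [(i,) for i in sorted(item_bits) if support((i,)) >= minsup]
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        prev = set(level)
        # join step: unions of two frequent k-itemsets of size k+1
        cands = sorted({tuple(sorted(set(a) | set(b)))
                        for a in level for b in level
                        if len(set(a) | set(b)) == k + 1})
        # prune step: drop candidates with an infrequent k-subset
        # (e.g. {1,2,3} is pruned because {1,2} is not frequent)
        level = [c for c in cands
                 if all(s in prev for s in combinations(c, k))
                 and support(c) >= minsup]
        k += 1
    return frequent

# The slide's data: P1=1010, P2=0111, P3=1110, P4=1000, P5=0111
bits = {1: [1, 0, 1, 0], 2: [0, 1, 1, 1], 3: [1, 1, 1, 0],
        4: [1, 0, 0, 0], 5: [0, 1, 1, 1]}
freq = ptree_apriori(bits, minsup=2)
# L1 = {1}{2}{3}{5}, L2 = {13}{23}{25}{35}, L3 = {235}
```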
1-ItemSets don't support Association Rules (they would have no antecedent or no consequent).

Are there any strong rules supported by Frequent (= Large) 2-ItemSets (at minconf = .75)?
{1,3}: conf({1}→{3}) = supp{1,3}/supp{1} = 2/2 = 1 ≥ .75 STRONG!
  conf({3}→{1}) = supp{1,3}/supp{3} = 2/3 = .67 < .75
{2,3}: conf({2}→{3}) = supp{2,3}/supp{2} = 2/3 = .67 < .75
  conf({3}→{2}) = supp{2,3}/supp{3} = 2/3 = .67 < .75
{2,5}: conf({2}→{5}) = supp{2,5}/supp{2} = 3/3 = 1 ≥ .75 STRONG!
  conf({5}→{2}) = supp{2,5}/supp{5} = 3/3 = 1 ≥ .75 STRONG!
{3,5}: conf({3}→{5}) = supp{3,5}/supp{3} = 2/3 = .67 < .75
  conf({5}→{3}) = supp{3,5}/supp{5} = 2/3 = .67 < .75
So 2-ItemSets do support ARs.

Are there any strong rules supported by Frequent (= Large) 3-ItemSets?
{2,3,5}: conf({2,3}→{5}) = supp{2,3,5}/supp{2,3} = 2/2 = 1 ≥ .75 STRONG!
  conf({2,5}→{3}) = supp{2,3,5}/supp{2,5} = 2/3 = .67 < .75
  conf({3,5}→{2}) = supp{2,3,5}/supp{3,5} = 2/3 = .67 < .75
No subset antecedent can yield a strong rule either: there is no need to check conf({2}→{3,5}), conf({3}→{2,5}), or conf({5}→{2,3}), since each of these denominators is at least as large as that of a rule already found not strong, so each of these confidences is at least as low. DONE!
Ptree-ARM versus Apriori on aerial photo (RGB) data together with yield data

Dataset: aerial TIFF images (R, G, B) with synchronized yield (Y); a 1320×1320-pixel TIFF-Yield dataset (~1,700,000 transactions in total). P-ARM is compared to horizontal Apriori (classical) and FP-growth (an improvement of it). For fairness, P-ARM finds all frequent itemsets, not just those containing Yield. The results are identical.

Scalability with support threshold: P-ARM is more scalable for lower support thresholds.
Scalability with number of transactions: the P-ARM algorithm is more scalable to large spatial datasets.
P-ARM versus FP-growth (see the literature for its definition; FP-growth is an efficient, tree-based frequent pattern mining method, details later)

Dataset: 17,424,000 pixels (transactions). Experiments measure scalability with support threshold and with number of transactions. For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size P-ARM achieves better performance, and P-ARM achieves better performance in the case of a low support threshold.