Association Rule Mining on Remotely Sensed Imagery Using Peano-trees (P-trees) Qin Ding, Qiang Ding, and William Perrizo Computer Science Department North Dakota State University, USA May 2002 (P-tree technology is patent pending by NDSU)
Outline Concepts – Association Rule Mining – Market Basket Data – Remotely Sensed Imagery (RSI) data – Peano Count Trees (P-trees) Association rule mining on RSI data using P-trees Performance analysis Conclusion
Association Rule Mining
Originally proposed for market basket data.
Given
– A set of items I = {i_1, i_2, …, i_m} (e.g., items purchasable in a market)
– A set of transactions D (e.g., customers checking out; each transaction = id + itemset)
An association rule is X=>Y, where X and Y are disjoint itemsets.
– X and Y are considered as events. E.g., X is the event that a transaction contains X.
– X=>Y is the event: "if t contains X, then it contains Y".
– X is called the antecedent, Y is called the consequent.
Two measures: support (% of transactions containing X ∪ Y) and confidence (% of those transactions containing X which also contain Y).
Given minimum thresholds, minsup and minconf,
– Find the frequent itemsets, which have support above minsup.
– Derive all rules supported by frequent itemsets, with confidence above minconf.
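The two measures can be made concrete with a minimal sketch. The basket contents and the function names `support` and `confidence` are invented for illustration; they are not from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """support(X u Y) / support(X): share of X-transactions that also contain Y."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


# Hypothetical market baskets, invented for illustration.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support(baskets, {"bread", "milk"})          # 2 of 4 baskets
rule_confidence = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets
```

With a minsup of 40% and a minconf of 60%, the rule bread=>milk would pass both thresholds on this toy data.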
Association rule mining on RSI data RSI data can be viewed as a relational table – Each band (column) is an attribute (for simplicity we assume all values are bytes) – Each pixel (row) is a transaction. – Each interval in each band is an item. – Row/column or longitude/latitude is the primary key ARM task on RSI data – To mine implicit relations among different bands, for example, relations among spectral bands and yield. Example Rule (NDVI): NIR[192,255] ^ RED[0,63] => Yield[128,255]
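As a rough illustration of the pixel-as-transaction view, the sketch below evaluates the example rule on a handful of made-up pixels. The pixel values, the fixed interval width of 64, and the helper names are assumptions for illustration only.

```python
def to_interval(value, width=64):
    """Map a byte value 0..255 to its equal-width interval (lo, hi), i.e. its item."""
    lo = (value // width) * width
    return (lo, lo + width - 1)


def in_range(value, lo, hi):
    """True when value lies in the closed interval [lo, hi]."""
    return lo <= value <= hi


# Hypothetical pixels as (NIR, RED, Yield) byte values.
pixels = [(200, 30, 180), (210, 40, 90), (100, 150, 60), (250, 10, 255)]

# Rule: NIR[192,255] ^ RED[0,63] => Yield[128,255]
antecedent = [p for p in pixels
              if in_range(p[0], 192, 255) and in_range(p[1], 0, 63)]
both = [p for p in antecedent if in_range(p[2], 128, 255)]

rule_support = len(both) / len(pixels)         # pixels matching the whole rule
rule_confidence = len(both) / len(antecedent)  # among pixels matching the antecedent
```

Here three pixels satisfy the antecedent and two of them also satisfy the consequent.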
Important ARM Algorithms
– Apriori: stepwise (level-wise) algorithm
– DHP (Direct Hashing and Pruning): hash itemset counts and prune transactions
– Partition: divide the database into small partitions such that each can be processed independently and efficiently in memory
– DIC (Dynamic Itemset Counting): overlap the counting of candidate itemsets at different points during a scan
– FP-growth: uses a Frequent Pattern tree (FP-tree) to avoid explicit candidate generation
– Others…
Remotely Sensed Imagery (RSI) Data
– Satellite image: TM (Thematic Mapper) imagery (6, 7 or 8 bands)
  – TM is Landsat satellite imagery covering the earth every 18 days since …
  – ETM+ (Landsat-7) contains 8 bands: 7 VIR bands (Blue, Green, Red, NIR, MIR, TIR, MIR2) and 1 panchromatic band (PC)
– Aerial photography: TIFF (3 bands: Blue, Green, Red)
– Ground data: Yield, Moisture, Nitrate, Temperature, Elevation, etc.
Precision Agriculture Dataset: TIFF Image and related Bands (1320×1320) RGB Moisture Yield Nitrate
As a relation – Columns: x (row), y (column), R (red), G (green), B (blue), Y (yield), M (moisture), N (nitrate)
Spatial Data Formats [Figure: pixel values of a two-band example image] – BSQ format (2 files): Band 1, Band 2 – BIL format (1 file) – BIP format (1 file) – bSQ format (16 files): B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B
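The four layouts can be sketched on a tiny example. The pixel values below are made up, and the reading of Bij as "bit j of band i, with bit 1 the most significant" is an assumption of this sketch rather than something stated on the slide.

```python
# Hypothetical 2-band, 2x2 image; band[r][c] is a byte value.
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]
rows, cols = 2, 2

# BSQ (band sequential): one sequence (file) per band.
bsq = [[v for row in band for v in row] for band in bands]

# BIL (band interleaved by line): for each row, band 1's line, then band 2's line.
bil = [v for r in range(rows) for band in bands for v in band[r]]

# BIP (band interleaved by pixel): for each pixel, its value in every band.
bip = [band[r][c] for r in range(rows) for c in range(cols) for band in bands]


def bit_plane(values, j, nbits=8):
    """Bit j (1 = most significant, assumed) of every value in a band sequence."""
    return [(v >> (nbits - j)) & 1 for v in values]


# bSQ (bit sequential): one bit sequence per (band, bit position), 2 x 8 = 16 files.
bsq_files = {f"B{i + 1}{j}": bit_plane(bsq[i], j)
             for i in range(len(bands)) for j in range(1, 9)}
```

Each bSQ sequence is a plain bit vector, which is the form a basic P-tree is built from.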
Peano Count Tree (P-tree) P-tree represents RSI data bit-by-bit in a recursive quadrant-by-quadrant arrangement. P-trees are a lossless compressed representation of the original data.
An example of a P-tree [Figure: a bSQ file, the same file arranged as a spatial dataset in 2-D raster order, and the resulting P-tree] – Quadrant-based – Pure (Pure-1/Pure-0) quadrant – Peano or Z-ordering – Root count
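A minimal sketch of how a count tree of this kind could be built from one bit plane is given below. The tuple representation, the quadrant order (upper-left, upper-right, lower-left, lower-right), and the 4x4 bit matrix are assumptions for illustration, not the authors' implementation.

```python
def build_ptree(bits, r0=0, c0=0, size=None):
    """Return (count, children) for the size x size quadrant whose corner is (r0, c0).

    count is the number of 1-bits in the quadrant. children is None for a pure
    quadrant (all 0s or all 1s); otherwise it holds the four sub-quadrant trees.
    The matrix side is assumed to be a power of two.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(r0, r0 + size)
                for c in range(c0, c0 + size))
    if count == 0 or count == size * size:
        return (count, None)  # pure quadrant: no need to descend further
    half = size // 2
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    children = [build_ptree(bits, r0 + dr, c0 + dc, half) for dr, dc in offsets]
    return (count, children)


# Hypothetical 4x4 bit plane in raster order.
bits = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
tree = build_ptree(bits)
root_count = tree[0]
```

The upper-left and lower-right quadrants are pure-1, so they are stored as a single count of 4 each, which is where the compression comes from.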
Peano Mask Tree (PM-tree) – Truth-trees (1 if the condition is true of the quadrant, else 0), e.g., Pure-1 and Pure-0 trees – All are lossless compressed representations of the dataset
P-tree terminology – Peano or Z-ordering – Pure-1/Pure-0 quadrant – Root count – Level – Fan-out – QID (Quadrant ID) – Example pixel: (7, 1) = (111, 001) in binary
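One common way to obtain a quadrant path in Peano (Z) order is to interleave the bits of the two coordinates, one pair per tree level. The sketch below assumes the first coordinate is the row, that the row bit is the high bit of each pair, and that quadrants are numbered 0 to 3; the function name `quadrant_path` is hypothetical.

```python
def quadrant_path(row, col, levels):
    """Quadrant number (0..3) chosen at each tree level, from the root down."""
    path = []
    for level in range(levels - 1, -1, -1):
        r_bit = (row >> level) & 1  # row bit for this level
        c_bit = (col >> level) & 1  # column bit for this level
        path.append(2 * r_bit + c_bit)
    return path


# Pixel (7, 1) = (111, 001) in an 8x8 image (3 levels).
path = quadrant_path(7, 1, 3)  # bit pairs 10, 10, 11

# The same bits read as one number give the pixel's position in Z-order.
z_index = 0
for q in path:
    z_index = z_index * 4 + q
```

Under these assumptions the example pixel (7, 1) follows quadrants 2, 2, 3 from the root.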
P-tree Operations
[Figure: tree diagrams for the operations below. Node labels: 1 = pure-1 quadrant, 0 = pure-0 quadrant, m = mixed quadrant. Leaf-level values are mostly not legible.]
– P-tree: root count 55; quadrant counts 16, 8, 15, 16
– PM-tree: root m; children 1, m, m, 1; children of the two mixed quadrants: m, 0, 1, m and 1, 1, m, 1
– P-tree-1: root m; children 1, m, m, 1; children of the two mixed quadrants: m, 0, 1, m and 1, 1, m, 1
– P-tree-2: root m; children 1, 0, m, 0; a mixed node at the lowest level has leaf bits 0100
– AND-Result: root m; children 1, 0, m, 0; children of the mixed quadrant: 1, 1, m, m
– OR-Result: root m; children 1, m, 1, 1; children of the mixed quadrant: m, 0, 1, m
– Complement: root count 9; quadrant counts 0, 8, 1, 0; as a PM-tree: root m; children 0, m, m, 0; children of the two mixed quadrants: m, 1, 0, m and 0, 0, m, 0
P-tree ANDing Operation [Figure: PM-tree1, PM-tree2 and their AND result (the same trees as on the P-tree Operations slide), shown with the depth-first Pure-1 path codes of each tree; legible path-code entries: 0, 0, 231]
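The quadrant-wise logic of AND, OR and COMPLEMENT can be sketched on a simple nested-list stand-in for PM-trees. In this sketch a node is 0 (pure-0), 1 (pure-1), or a list of four child nodes (mixed); this representation, the function names, and the two 4x4 example trees are assumptions for illustration and do not reproduce the path-code method on the slide.

```python
def normalize(children):
    """Collapse four identical pure children into a single pure node."""
    if all(c == 0 for c in children):
        return 0
    if all(c == 1 for c in children):
        return 1
    return children


def pm_and(a, b):
    if a == 0 or b == 0:
        return 0          # a pure-0 operand forces a pure-0 result
    if a == 1:
        return b          # pure-1 is the identity for AND
    if b == 1:
        return a
    return normalize([pm_and(x, y) for x, y in zip(a, b)])


def pm_or(a, b):
    if a == 1 or b == 1:
        return 1          # a pure-1 operand forces a pure-1 result
    if a == 0:
        return b          # pure-0 is the identity for OR
    if b == 0:
        return a
    return normalize([pm_or(x, y) for x, y in zip(a, b)])


def pm_not(a):
    if a == 0 or a == 1:
        return 1 - a
    return [pm_not(c) for c in a]


def root_count(node, size):
    """Number of 1-pixels under a node covering a size x size quadrant."""
    if node == 0 or node == 1:
        return node * size * size
    return sum(root_count(c, size // 2) for c in node)


# Hypothetical PM-trees over a 4x4 image.
t1 = [1, [0, 0, 0, 1], [1, 0, 0, 0], 1]
t2 = [1, 0, [1, 1, 0, 0], 0]

t_and = pm_and(t1, t2)
t_or = pm_or(t1, t2)
t_not = pm_not(t1)
```

Pure operands short-circuit whole subtrees, so the cost of an operation depends on the number of mixed nodes rather than on the number of pixels.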
Various P-trees
– Basic P-trees: P_{i,j}
– Value P-trees: P_i(v), derived from basic P-trees with AND and COMPLEMENT
– Tuple P-trees: P(v_1, v_2, …, v_n), derived from value P-trees with AND
– Interval P-trees: P_i(v_1, v_2), derived from value P-trees with OR
– Cube P-trees: P([v_11, v_12], …, [v_N1, v_N2]), derived from interval P-trees with AND
– Predicate P-trees: P(p), derived with AND, OR, and COMPLEMENT
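How the derived P-trees relate to the basic ones can be sketched with uncompressed bit vectors standing in for the trees. The function names, the 2-bit band values, and the convention that bit 1 is the most significant bit are assumptions for illustration.

```python
def basic_bits(values, j, nbits=2):
    """Bit j (1 = most significant, assumed) of each pixel: stand-in for P_{i,j}."""
    return [(v >> (nbits - j)) & 1 for v in values]


def value_mask(values, v, nbits=2):
    """Stand-in for P_i(v): AND of the basic bit vectors, each one
    complemented wherever the corresponding bit of v is 0."""
    mask = [1] * len(values)
    for j in range(1, nbits + 1):
        plane = basic_bits(values, j, nbits)
        want = (v >> (nbits - j)) & 1
        mask = [m & (p if want else 1 - p) for m, p in zip(mask, plane)]
    return mask


def interval_mask(values, lo, hi, nbits=2):
    """Stand-in for P_i(lo, hi): OR of the value masks for lo..hi."""
    mask = [0] * len(values)
    for v in range(lo, hi + 1):
        mask = [m | x for m, x in zip(mask, value_mask(values, v, nbits))]
    return mask


# Hypothetical 2-bit values for six pixels in two bands.
band1 = [0, 1, 3, 2, 1, 1]
band2 = [3, 3, 0, 1, 3, 2]

p1_01 = value_mask(band1, 0b01)
p2_11 = value_mask(band2, 0b11)
tuple_mask = [a & b for a, b in zip(p1_01, p2_11)]  # stand-in for P(01, 11)
tuple_count = sum(tuple_mask)                       # plays the role of the root count
p1_interval = interval_mask(band1, 0b01, 0b10)
```

A real P-tree would hold the same bits in compressed quadrant form; the logical structure of the derivations is unchanged.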
Association Rule Mining on RSI Data using P-trees – Admissible Itemsets (Asets): itemsets of the form Int_1 × Int_2 × ... × Int_n = Π_{i=1...n} Int_i, where Int_i is an interval of values in Band i (some of which may be the full value range). – Example: Aset {[01,01]_1, [11,11]_2} – P-ARM algorithm – Pruning techniques
P-ARM algorithm
Procedure P-ARM {
    Data_Discretization;
    F_1 = {frequent 1-Asets};
    For (k = 2; F_{k-1} ≠ ∅; k++) do begin
        C_k = p-gen(F_{k-1});
        Forall candidate Asets c ∈ C_k do
            c.count = AND_rootcount(c);
        F_k = {c ∈ C_k | c.count >= minsup}
    end
    Answer = ∪_k F_k
}
– F_1 is determined directly from P-tree root counts and pruning techniques rather than from a scan of the transaction database.
– The p-gen function differs from the apriori-gen function in Apriori by using additional pruning techniques.
– The AND_rootcount function calculates Aset counts directly by ANDing the appropriate basic P-trees instead of scanning the transaction database.
– The support count for Aset {B1[0,64), B2[64,128)} (or {[00,00]_1, [01,01]_2}) is the root count of P_1(00) AND P_2(01).
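The control flow of the procedure can be sketched in runnable form under simplifying assumptions: integer bitmasks stand in for value P-trees, each item is a (band, value) pair with bands already discretized, minsup is given as a pixel count, and only band-based and Apriori-style pruning are applied in `p_gen`. All names and data below are illustrative, not the authors' code.

```python
from itertools import combinations


def value_masks(band_values):
    """One integer bitmask per value: bit p is set when pixel p has that value.
    Stand-in for the value P-trees of one band."""
    masks = {}
    for p, v in enumerate(band_values):
        masks[v] = masks.get(v, 0) | (1 << p)
    return masks


def and_rootcount(aset, masks, n_pixels):
    """Count pixels matching every (band, value) item by ANDing the masks."""
    acc = (1 << n_pixels) - 1
    for band, value in aset:
        acc &= masks[band][value]
    return bin(acc).count("1")


def p_gen(prev_frequent, k):
    """Join frequent (k-1)-Asets into candidate k-Asets."""
    candidates = set()
    for a, b in combinations(prev_frequent, 2):
        c = tuple(sorted(set(a) | set(b)))
        if len(c) != k:
            continue
        if len({band for band, _ in c}) != k:
            continue  # band-based pruning: two items from the same band
        if all(s in prev_frequent for s in combinations(c, k - 1)):
            candidates.add(c)  # every (k-1)-subset is frequent
    return candidates


def p_arm(bands, minsup_count):
    n_pixels = len(bands[0])
    masks = [value_masks(b) for b in bands]
    # F_1 comes straight from the per-value counts, with no scan per candidate.
    frequent = {((i, v),) for i, m in enumerate(masks) for v in m
                if bin(m[v]).count("1") >= minsup_count}
    answer = set(frequent)
    k = 2
    while frequent:
        candidates = p_gen(frequent, k)
        frequent = {c for c in candidates
                    if and_rootcount(c, masks, n_pixels) >= minsup_count}
        answer |= frequent
        k += 1
    return answer


# Hypothetical discretized values (2-bit codes) for six pixels in two bands.
band1 = [0, 1, 3, 2, 1, 1]
band2 = [3, 3, 0, 1, 3, 2]
result = p_arm([band1, band2], minsup_count=2)
```

On this toy data the frequent Asets are value 1 in the first band, value 3 in the second band, and their combination, which two pixels satisfy.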
Pruning Techniques
– Band-based pruning: an itemset with two items from the same band will have support zero.
– Constraint-based pruning: e.g., specify yield as the only consequent band of interest.
  – Note: in the performance comparisons we did not use this pruning technique (to maintain fairness, since it is hard to implement in other algorithms).
– Bit-based pruning for multi-level rules: if Aset [128,255] (or [1,1]_2) is not frequent, then the Asets [128,191] (or [10,10]_2) and [192,255] (or [11,11]_2) cannot be frequent either.
– Others
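Bit-based pruning rests on the fact that a finer interval can never contain more pixels than the coarser interval it refines. A small sketch, assuming byte values, intervals identified by a bit prefix, and a minsup given as a pixel count (all names and data are hypothetical):

```python
def interval_count(values, prefix, prefix_bits, nbits=8):
    """Number of values whose top `prefix_bits` bits equal `prefix`."""
    shift = nbits - prefix_bits
    return sum((v >> shift) == prefix for v in values)


def refine(values, prefix, prefix_bits, minsup_count):
    """Return the frequent one-bit-finer intervals, testing them only
    when the parent interval is itself frequent."""
    if interval_count(values, prefix, prefix_bits) < minsup_count:
        return []  # bit-based pruning: neither half can be frequent
    halves = [(prefix << 1) | b for b in (0, 1)]
    return [h for h in halves
            if interval_count(values, h, prefix_bits + 1) >= minsup_count]


# Hypothetical byte values of one band.
values = [130, 140, 200, 60, 20, 10, 5, 190]

parent = interval_count(values, 0b1, 1)                 # [128,255]
kept = refine(values, 0b1, 1, minsup_count=3)           # halves of [128,255]
pruned = refine(values, 0b1, 1, minsup_count=5)         # parent not frequent
```

With a threshold of 3 pixels, only the half [128,191] (prefix 10) remains frequent; with a threshold of 5 the parent fails and both halves are skipped without being counted.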
P-ARM versus Apriori Scalability with support threshold 1,742,400 pixels (transactions)
P-ARM versus Apriori (cont.) Scalability with number of transactions Support threshold =10%
P-ARM versus FP-growth – Scalability with support threshold [Figure: run time (sec.) versus support threshold (%) for P-ARM and FP-growth; dataset-size labels on the slide: 17,424,000 pixels (transactions) and 1,742,400 pixels (transactions)]
P-ARM versus FP-growth (cont.) Scalability with the number of transactions Support threshold =10%
Conclusion – A model for association rule mining on RSI data – P-trees facilitate fast calculation of support – P-trees facilitate significant pruning techniques – Applications other than precision agriculture: flood prediction and monitoring; community and regional planning; virtual archeology; mineral exploration; bioinformatics/genomics; VLSI design