P-Trees: A Universal Data Structure from Query Optimization to Data Mining

Most people have data from which they want information, so most people need DBMSs whether they know it or not. The main component of any DBMS is the query processor, but so far query processors deal only with the standard workload (left side of the diagram). On the standard-query end, much work remains to be done to deliver standard-workload answers with low response times (D. DeWitt, ACM SIGMOD'02). On the data mining end, we have barely scratched the surface. (But those scratches have made the difference between becoming the biggest corporation in the world and filing for bankruptcy: Walmart vs. Kmart.)

These notes contain NDSU confidential and proprietary material. Patents pending on bSQ and P-tree technology.

[Diagram: the workload spectrum. Standard querying: simple searching and aggregating (SELECT .. FROM .. WHERE), complex queries (nested, EXISTS, ..), fuzzy queries (e.g., BLAST searches, ..), OLAP (rollup, drilldown, slice/dice, ..). Machine learning / data mining: supervised (classification, regression); unsupervised (clustering, association rule mining).]
Data Mining vs. Querying

Querying: ask specific questions and expect specific answers. We will get back to querying later. Data mining: "Go into the data mountain; come out with gems" (but also, very likely, some fool's gold). Relevance and interestingness analysis assays the gems (helps pick out the valuable information).

A universal model for association rule mining, classification, and clustering of a data table, R(A1..An), where the Ai's are feature attributes assumed numeric (categorical attributes can be coded numeric). First, order the rows:
–Rids or RRNs provide an ordering
–arrival ordinal provides an ordering
–Peano order of the pixels in an image provides an ordering
–raster order does also, but
»for images, raster order should first be converted to Peano order, since a raster line is not a geometric or geographic object (more later)
»in raster order, pixel-ids (x, y) are sorted by bit position in the order x1 x2 .. xn y1 y2 .. yn, while Peano order interleaves them: x1 y1 x2 y2 … xn yn
Peano Tree (P-tree) Data Structure for Data Mining

Given a data table, R(A1,…,An):
1. Order it (e.g., arrival order or RRN).
2. Decompose it into attribute projections, maintaining the ordering on each: R → R[Ai], i=1,…,n (Band SeQuential or BSQ projections).
3. Decompose each attribute projection by bit position into bit projections: R[Ai] → Rij, j=1,…,mi (bit SeQuential or bSQ projections); e.g., if each Ai value is a byte, then mi=8 for all i.
4. Build a d-dimensional basic P-tree from each bit projection: {Pij | i=1,…,n and j=1,…,mi}.

R → R[Ai] → Rij → basic P-trees, Pij. How is this last step done?
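As a concrete sketch of step 3 (our own illustration, not the course code; the function name is an assumption), the bSQ decomposition of one attribute projection into bit vectors can be written as:

```python
# Hedged sketch of bSQ decomposition: split one attribute projection
# R[A_i] into m_i bit vectors, one per bit position (MSB first).
# Assumes unsigned 8-bit values; row order is preserved.

def bsq_decompose(column, bits=8):
    """Return bit vectors; element j holds bit j (MSB first) of every value."""
    return [[(v >> (bits - 1 - j)) & 1 for v in column]
            for j in range(bits)]

col = [200, 3, 129]            # one attribute projection (200 = 11001000)
planes = bsq_decompose(col)
print(planes[0])               # MSB plane across the rows: [1, 0, 1]
```

Each `planes[j]` is one bSQ file; a basic P-tree is then built from it as shown on the following slides.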
1-D CP-tree: construct the 1-D count P-tree using recursive left/right-half counts (inodes hold "bitant", i.e., half-segment, counts) from a bit projection (bSQ file) of a table with 64 rows.
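A minimal sketch of the recursive construction (our own illustration, with our own node representation): a node stores its segment's 1-bit count, and pure segments are not expanded:

```python
def cp_tree(bits):
    """Build a 1-D count P-tree node as (count, children).
    Pure segments (all 0s or all 1s) become leaves; others split in half."""
    c = sum(bits)
    if c == 0 or c == len(bits) or len(bits) == 1:
        return (c, [])                       # pure (or single-bit) leaf
    mid = len(bits) // 2
    return (c, [cp_tree(bits[:mid]), cp_tree(bits[mid:])])

t = cp_tree([1, 1, 1, 1, 0, 0, 1, 0])
print(t[0])        # root count = 5
print(t[1][0])     # left half is pure-1, so it is an unexpanded leaf: (4, [])
```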
2-D CP-tree (same bSQ file): construct the 2-D tree by removing every other row (inodes hold quadrant counts). In the figure, pure quadrants are eliminated (not expanded further).
4-D CP-tree from the same bSQ file? Removing every other row again leaves an insufficient number of rows, but we can construct a 4-D tree with 2-D leaves.
What about 3-D? (same bSQ file) Construct the 3-D tree from the 2-D one by removing every other row and splitting the remaining rows (inodes hold octant counts).
Summary

Given a feature relation, R(A1,…,An):
–1. Order the rows (RRN, Rid, arrival ordinal, a raster spatial order, …).
–2. Choose a dimension, d (or a combination: d1, d2, ...).
–3. Choose fanout(s) (e.g., d1^n1 d2^n2 … dr^nr).

Basic P-trees can be implemented in many formats:
–Count P-tree (CP): inodes contain quadrant 1-bit counts and child pointers.
–Predicate P-trees (inode: 1 iff the predicate is true throughout the quadrant; child pointers): Pure1-tree (P1), Pure0-tree (P0), PureNot1-tree (NP1), PureNot0-tree (NP0), Value P-trees (VP), Tuple P-trees (TP), HalfPure trees (HP).
The above are lossless, compressed, and data-mining-ready. Also: Interval P-tree, Box P-tree.

How do we data-mine heterogeneous datasets, i.e., R, S, T, .. describing the same entity with different keys/attributes?
–Universal relation: transform them into one relation (union the keys?).
–Key fusion: R(K..), S(K'..): mine them as separate relations but map the keys using a tautology.
The two methods are related. The universal-relation approach usually includes defining a universal key to which all local keys are mapped (using a possibly fragmented tautological lookup table).
Spatial Data Formats (e.g., images with natural 2-D structure and coordinates (x, y); raster ordering)

[Figure: two image bands, BAND 1 and BAND 2, stored in BSQ format as 2 files, Band 1 and Band 2.]
Spatial Data Formats (Cont.)

[Figure: the same two bands in BSQ format (2 files) and in BIL format (1 file, Band 1 and Band 2 interleaved line by line).]
Spatial Data Formats (Cont.)

[Figure: the same two bands in BSQ format (2 files), BIL format (1 file), and BIP format (1 file, bands interleaved pixel by pixel).]
Spatial Data Formats (Cont.)

[Figure: the same two bands in BSQ (2 files), BIL (1 file), and BIP (1 file) formats, and in bSQ format (16 files, related to bit planes in graphics): B11 B12 B13 B14 B15 B16 B17 B18 and B21 B22 B23 B24 B25 B26 B27 B28.]
Suppose we start with a bit projection of a raster-ordered spatial file (image). First, re-order it into Peano order.

[Figure: a raster-ordered bSQ file and its spatial arrangement, showing the Peano (Z) ordering. Terms illustrated: pure (pure-1/pure-0) quadrant, root count, level, fan-out, QID (quadrant id).]
Same example.

[Figure: the Peano (Z) ordering with pure (pure-1/pure-0) quadrants, root count, levels 0 through 3, fan-out, and QIDs; e.g., quadrant (7, 1) has QID (111, 001).]
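The raster-to-Peano conversion above amounts to bit interleaving of the coordinates. A sketch (our own illustration, assuming unsigned coordinates of `nbits` bits each):

```python
def peano_offset(x, y, nbits):
    """Position of pixel (x, y) in Peano (Z) order: interleave the
    coordinate bits as x1 y1 x2 y2 ... (MSB first)."""
    z = 0
    for j in range(nbits - 1, -1, -1):          # from MSB down to LSB
        z = (z << 2) | (((x >> j) & 1) << 1) | ((y >> j) & 1)
    return z

# the four pixels of a 2x2 image, listed in Peano order
print([peano_offset(x, y, 1) for (x, y) in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 1, 2, 3]
```

Sorting pixels by `peano_offset` turns a raster-ordered bSQ file into the Peano-ordered file the quadrant decomposition needs.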
Some common types of P-trees, given R(A1..An):
–Each bSQ file, Rij, generates a basic P-tree, Pij.
–Each value v in Ai generates a value P-tree, VPi(v) (1 iff purely v throughout the quadrant).
–Each tuple (v1..vn) in R generates a tuple P-tree, TP(v1..vn) (1 iff purely (v1..vn) throughout the quadrant).
–Any row predicate on R generates a predicate P-tree, P (1 iff true throughout the quadrant).
–Any interval [l,u] in Ai generates an interval P-tree, [l,u]Pi (1 iff v in [l,u] throughout the quadrant).
–Any box of intervals [li,ui] in R generates a rectangle P-tree, [li,ui]P (1 iff (v1..vn) is in the box throughout the quadrant).
(Each P-tree can be expressed as a count tree, with inode value = count, or a Boolean tree, with inode value = bit.)

Examples (3-bit precision):
–Basic P-trees: P1_11, …, P1_18, P1_21, …, P1_28, …, P1_71, …, P1_78.
–Value P-tree for value 001 in A1: P1(001) = P1'11 ^ P1'12 ^ P1_13.
–Tuple P-tree (1 iff the quadrant contains only that tuple) for attribute tuple (1,2,7): P(001, 010, 111) = P1(001) ^ P2(010) ^ P3(111) = P1'11 ^ P1'12 ^ P1_13 ^ P1'21 ^ P1_22 ^ P1'23 ^ P1_31 ^ P1_32 ^ P1_33.

Review: given a feature relation, R(A1,…,An): 1. order the rows (RRN, Rid, arrival ordinal, a raster spatial order, …); 2. choose a dimension, d (or a combination: d1, d2, ...), and fanout(s) (e.g., d1^n1 d2^n2 … dr^nr).
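A value P-tree is thus an AND of the basic P-trees, complemented wherever the corresponding bit of v is 0. A bit-vector sketch of the same computation (uncompressed lists standing in for the trees; our own illustration):

```python
def value_mask(bit_planes, v, bits):
    """AND the basic bit vectors, complemented where the bit of v is 0;
    the result is 1 exactly on rows holding value v."""
    mask = [1] * len(bit_planes[0])
    for j in range(bits):
        want = (v >> (bits - 1 - j)) & 1
        mask = [m & (b if want else 1 - b)       # complement plane if bit is 0
                for m, b in zip(mask, bit_planes[j])]
    return mask

# bit planes (MSB first) of the 3-bit column [1, 5, 1]
planes = [[0, 1, 0],
          [0, 0, 0],
          [1, 1, 1]]
print(value_mask(planes, 1, 3))   # rows equal to 001: [1, 0, 1]
```

A tuple P-tree is the same construction repeated over every attribute and ANDed across attributes.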
Predicate P-trees (inode: 1 iff the condition is true throughout the quadrant).

[Figure: the P1 (Pure1), CP, P0 (Pure0), NP0 (NotPure0), NP1 (NotPure1), MP (Mixed), and HP (HalfPure) trees of the same bSQ file.]

Predicate P-trees can be stored in QidVector (QV) format: one (qid, ChildTruthVector) entry per mixed quadrant. Leaf vectors are always 0000, so they are omitted.
–P1-QV: [] 1001; [1] 0010; [1.0] 1110; [1.3] 0010; [2] 1101; [2.2] 1101
–P0-QV: [] 0000; [1] 0100; [1.0] 0001; [1.3] 1101; [2] 0000; [2.2] 0010
–NP0-QV: [] 1111; [1] 1011; [1.0] 1110; [1.3] 0010; [2] 1111; [2.2] 1101
–NP1-QV: [] 0110; [1] 1101; [1.0] 0001; [1.3] 1101; [2] 0010; [2.2] 0010
–MP-QV: [] 0110; [1] 1001; [2] 0010
–HP-QV: [] 1111; [1] 1010; [1.0] 1110; [1.3] 0010; [2] 1111; [2.2] 1101

HP-trees result from the HalfPure1 predicate. They are lossless: a 1 means pure-1 iff the node has no child pointers, and a 0 means pure-0 iff it has no child pointers. Deleting any number of bottom levels yields an HP-tree of coarser granularity. ANDing HP-trees: if any operand is 0, the result is 0; if all operands are 1 and the node has children, the result is 1; otherwise the result could be 0 or 1 (it depends on the children, but is likely 0). The HP-tree of the complement of a bSQ file is the flip of the HP-tree. HP-trees have the same leaves as Pure1-trees, and the HP-tree is the "high-order bit" tree of the CP-tree.
The P-tree Algebra (Complement, AND, OR, …)

Complementing a P-tree (the P-tree of the flip of the bSQ file; we use "prime" notation):
–Count P-tree: formed by purity-complementing each count.
–Purity P-trees (P1, P0, NP0, NP1): formed by bit-flipping the leaves only.
–HP-tree: formed by bit-flipping all nodes (complement = flip; we use "underscore" for the flip of a tree).
Note the dualities: P1 = P0', P0 = P1', NP0 = NP1', NP1 = NP0'.

[Figure: the trees and their complements.] QV tables of the original trees:
–P1V: [] 1001; [1] 0010; [1.0] 1110; [1.3] 0010; [2] 1101; [2.2] 1101
–P0V: [] 0000; [1] 0100; [1.0] 0001; [1.3] 1101; [2] 0000; [2.2] 0010
–NP0V: [] 1111; [1] 1011; [1.0] 1110; [1.3] 0010; [2] 1111; [2.2] 1101
–NP1V: [] 0110; [1] 1101; [1.0] 0001; [1.3] 1101; [2] 0010; [2.2] 0010
QV tables of the complements:
–P1V: [] 0110; [1] 1101; [1.0] 0001; [1.3] 1101; [2] 0010; [2.2] 0010
–P0V: [] 1111; [1] 1011; [1.0] 1110; [1.3] 1101; [2] 1111; [2.2] 1101
–NP0V: [] 0000; [1] 0100; [1.0] 0001; [1.3] 1101; [2] 0000; [2.2] 1101
–NP1V: [] 1001; [1] 0010; [1.0] 0001; [1.3] 0010; [2] 1101; [2.2] 1101
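A sketch of the purity-tree complement rule ("bit-flip the leaves only"), with nodes represented as (bit, children) tuples; the representation is our own, not the course's storage format:

```python
def p1_complement(node):
    """Complement a purity P-tree: the tree shape is unchanged and only
    leaf bits flip (mixed inodes keep their value and their children)."""
    bit, kids = node
    if not kids:                       # leaf: pure-1 <-> pure-0 under the flip
        return (1 - bit, [])
    return (bit, [p1_complement(k) for k in kids])

# a P1 tree: mixed root (0, with children), one pure-1 leaf, one pure-0 leaf
t = (0, [(1, []), (0, [])])
print(p1_complement(t))                # (0, [(0, []), (1, [])])
```

Mixed quadrants stay mixed under complement, which is why only the leaves need flipping.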
ANDing (for all truth trees, just AND bitwise)

AND via the Pure1-quad-list method: for each operand, list the qids of its pure-1 quadrants in depth-first order; then do one multi-cursor scan across the operand lists and, for every pure-1 quadrant common to all operands, install it in the result.

[Figure: P1, P0, NP0, and NP1 operands and the results of ANDing them.]

AND = depth-first traversal using 1^1=1, 1^0=0, 0^0=0, bitwise.
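A sketch of the pure1-quad-list AND for two operands (a nested-loop scan rather than the multi-cursor scan, for clarity; qids as tuples, our own representation). A result quadrant is pure-1 iff it is a pure-1 quad of one operand lying inside a pure-1 quad of the other, i.e., one qid is a prefix of the other:

```python
def and_pure1_lists(a, b):
    """AND two P1-trees given as lists of pure-1 quadrant ids."""
    out = set()
    for qa in a:
        for qb in b:
            if qa[:len(qb)] == qb:        # qb covers qa -> keep qa
                out.add(qa)
            elif qb[:len(qa)] == qa:      # qa covers qb -> keep qb
                out.add(qb)
    return sorted(out)

a = [(0,), (2, 1)]          # operand 1: quadrant 0 and sub-quadrant 2.1 pure-1
b = [(0, 3), (2,)]          # operand 2: sub-quadrant 0.3 and quadrant 2 pure-1
print(and_pure1_lists(a, b))   # [(0, 3), (2, 1)]
```

Because both lists are in depth-first qid order, the production version can do this in one synchronized pass instead of the quadratic loop shown here.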
Association Rule Mining (ARM)

Association rule mining on R is a matter of finding all (qualifying) rules of the form A ⇒ C, where the antecedent A and the consequent C are disjoint itemsets (sets of attribute values).
Precision Ag ARM example: identifying high/low crop yields (usually a classification problem). E.g., R(x, y, R, G, B, Yield), where R/G/B are the red/green/blue reflectances of the pixel or grid cell at (x, y), Yield is the yield at (x, y), and all are assumed to be 8-bit values. High-support, high-confidence rules are searched for in which the consequent is entirely in the Yield attribute, such as: G in [192,255] and R in [0,63] ⇒ Yield in [128,255]. How are the rules applied? Obtain rules from the previous year's data, then apply them in the current year after each aerial photo is taken at the different stages of plant growth. By irrigating or adding nitrate where lower yield is indicated, overall yield may be increased. We note that this problem is really more of a classification problem (classify yield levels) – that is, the rules are classification rules, not association rules.
Image Data Terminology

–Pixel: a point in a space.
–Band: a feature attribute of the pixels.
–Value: usually one byte (0–255).
Images have different numbers of bands:
–TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
–TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
–TIFF: 3 bands (B, G, R)
–Ground data: individual bands (Yield, Moisture, Nitrate, Temp, elevation, …)

RSI data can be viewed as a collection of pixels, each having a value for each feature attribute. [Figure: TIFF image and Yield map.] E.g., the RSI dataset above has 320 rows and 320 columns of pixels (102,400 pixels) and 4 feature attributes (B, G, R, Y). The (B, G, R) feature bands are in the TIFF image and the Y feature is color-coded in the Yield map.

Existing formats: BSQ (Band Sequential), BIL (Band Interleaved by Line), BIP (Band Interleaved by Pixel). New format: bSQ (bit Sequential).
Data Mining in Genomics

There is (or will be?) an explosion of gene expression data. The current emphasis is on extracting meaningful information from huge raw data sets. A consistent data store, with P-trees to facilitate association rule mining as well as clustering and classification, will facilitate the extraction of information and answers from raw data on demand. Microarray data is most often represented as a Gene Table, G(Gid, E1, E2, …, En), where Gid is the gene identifier, E1 … En are the various treatments (or conditions or experiments), and the data values are gene expression levels (an Excel spreadsheet). A gene regulatory pathway component can be represented as an association rule, {G1..Gn} ⇒ Gm, where {G1…Gn} is the antecedent and Gm is the consequent. Currently, data-mining techniques concentrate on the Gene Table – specifically, on finding clusters of genes that exhibit similar expression patterns under selected treatments, i.e., clustering the gene table.
ARM for Microarray Data (Contd.)

An alternate data format exists, called the Experiment Table: T(Eid, G1, G2, …, Gn), where Eid is the experiment (or treatment or condition) identifier and G1…Gn are the gene identifiers. Experiment tables are a convenient form for ARM of gene expression levels; the goal is to mine for rules among genes by associating the columns of the table. [Figure: the table of gene expression values, ExpmtID × GeneID.] The form of the Experiment Table with binary values (coding only whether an expression level exceeds a threshold or not) is identical to market basket data, for which a wealth of rule-mining techniques have been developed over the last 8–10 years.
Experiment Table

[Figure: the Experiment Table, T(Eid, G1..Gn), and the Gene Table, G(Gid, E1..En).]

The Gene Table is usually given as a standard (MS Excel) spreadsheet of gene expression levels coming from microarray experiments. It is a 2-D data cube which can be rotated (to the Experiment Table), rolled up, sliced, diced, drilled down, association-rule mined, etc.
A Universal Format? E.g., one large universal table with 5 dimensions based on the MIAME standard:
–E = Experimental design / hybridisation procedures
–A = Array design
–S = Samples
–M = Measurements
–N = Normalization controls
for data mining across all experiments and genes?
"MIAME HYPERCUBE" (5-D Universal Gene Expression Cube)

[Figure: the gene representation table, Gene-Rep(Eid, G1, G2, …, Gn); the rows are the tuples (E,A,S,M,N)1 … (E,A,S,M,N)m in 5-D Peano order and the entries are gene expression values.]

Cardinality is high, but compression will be substantial (next slide).
MIAME HYPERCUBE rolled up onto (E,S)

[Figure: the Gene × Experiment matrix, G1..Gn versus E1A1S1M1N1 .. EnAnSnMnNn, is mostly zeros; the non-zero blocks may occur off the diagonal.] The point: a massive but very sparse dataset!

The AD (All Digital) implementation format for distributed P-tree processing is one in which the bit-filter (mask) approach is universally adopted. If the dimension is 5 (as in the MIAME HYPERCUBE), the only operation is a 32-bit AND operation – which fits current commodity processors perfectly (32-bit registers!). A hardware drop-in 5-D P-tree AND card for standard PCs, please!!! "Anyone? Anyone?.."
Market Basket ARM example: identifying purchasing patterns. If a customer buys beer, s/he will buy chips (so shelve the chips near the beer?). E.g., a Boolean relation R(Tid, Aspirin, Beer, Chips, Dates, .., Zippo), where Tid is the transaction id (for a customer going through checkout). A field of a tuple holds 1 if the customer has that product in his/her basket, else 0 (existence, not count).

Support and confidence: given itemsets A and C,
–Supp(A) = (number of transactions supporting A) / (total number of transactions).
–Supp(A∪C) = (number of customers buying A∪C) / (total number of customers).
–Conf(A⇒C) = (number of customers buying A∪C) / (number buying A) = Supp(A∪C) / Supp(A).

Thresholds: frequent itemsets are those whose support exceeds a minimum support threshold (minsupp); Lk denotes the set of frequent k-itemsets (sets with k items in them). High-confidence rules are those whose confidence exceeds a minimum threshold (minconf).
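A minimal sketch of these definitions on the bit-vector model (the toy data and the item-to-column assignment are our own assumptions):

```python
def supp(rows, itemset):
    """Supp: fraction of transactions with a 1 in every column of itemset."""
    return sum(all(r[i] for i in itemset) for r in rows) / len(rows)

def conf(rows, A, C):
    """Conf(A => C) = Supp(A u C) / Supp(A)."""
    return supp(rows, A | C) / supp(rows, A)

# columns: 0 = Beer, 1 = Chips, 2 = Dates (hypothetical basket data)
rows = [(1, 1, 0), (1, 1, 1), (0, 1, 0), (1, 0, 0)]
print(supp(rows, {0, 1}))      # Supp({Beer, Chips}) = 0.5
print(conf(rows, {0}, {1}))    # Conf(Beer => Chips) = 2/3
```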
Lists versus Vectors in MBR

In most MBR, the table (relation) is T(Tid, Itemset), where Tid is the customer transaction id (assigned during checkout) and Itemset is the set of items the customer purchases (the market basket). Itemset is a set, so T is not in first normal form. Therefore the bit-vector approach is usually taken in MBR: BT(Tid, item1 item2 … itemn), where the itemset is expressed as a bit vector, […1000]: each item is assigned a bit position, and that bit is 1 if the transaction's itemset contains the item and 0 otherwise. The vector version corresponds to the table model we have been using, R(A1,…,An): the ordering is by Tid and the Ai's are the items in an assigned order (the datatype of each is Boolean).
Many-to-Many Relationships (M-M) (List vs. Vector model?)

A M-M relationship between entities E1 and E2 is modeled as a table, R(E1, E2List), where E2List is the list of all E2 occurrences related to the corresponding E1 occurrence; or it is modeled as the "rotation", R'(E2, E1List). Note that both tables are non-1NF! Non-1NF tables are difficult, so the list model is typically transformed to the vector model: R(E1, E2,1, E2,2, …, E2,n), where each E2,j value is Boolean (1 iff that E2 occurrence is related to the E1 occurrence). This transformation, and the APRIORI work done from 1992 to the present, has made MBR a sea-change event. Walmart adopted it early to analyze and manage supply, and Kmart did not. This year Walmart became the world's largest company and Kmart filed for bankruptcy protection. Is it effective technology? Gene-to-Experiment and CustomerTrans-to-Item are M-M relationships – quite similar!
Association Rule Example

Each transaction is a list (or bit vector) of the items purchased by a customer in one visit. [Table: Tid versus items A–F; values lost in extraction.] minsupp = 50%, minconf = 50%.

Find the frequent itemsets: the sets of items with at least minsupp support. Any subset of a frequent itemset must also be frequent: if {A, B} is a frequent itemset, {A} and {B} must be frequent. APRIORI: iteratively find the frequent itemsets of sizes 1 up to k, then use the frequent itemsets to generate association rules. Ck denotes the candidate frequent k-itemsets; Lk denotes the frequent k-itemsets, with their support counts.

Candidate generation (suppose the items in Lk-1 are listed in an order):
Step 1: self-join Lk-1:
  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1=q.item1, .., p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1
Step 2: pruning:
  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) delete c from Ck
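The two candidate-generation steps can be sketched directly (a plain-Python rendering of the SQL above, our own illustration; itemsets are sorted tuples):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Self-join L_{k-1} on the first k-2 items, then prune any candidate
    having an infrequent (k-1)-subset."""
    prev = set(L_prev)
    k = len(L_prev[0]) + 1
    cands = []
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:      # self-join step
                c = p + (q[-1],)
                # pruning step: every (k-1)-subset must be frequent
                if all(s in prev for s in combinations(c, k - 1)):
                    cands.append(c)
    return cands

# L2 = {13, 23, 25, 35} from the worked example on the next slide
print(apriori_gen([(1, 3), (2, 3), (2, 5), (3, 5)]))   # [(2, 3, 5)]
```

Note how {1,2,3} and {1,3,5} never survive: {1,2} and {1,5} are missing from L2, exactly the prunings the worked example lists.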
P-ARM worked example (minsup = 2). Scan the database D once to build the P-trees:
–P1: 2 //\\ 1010; P2: 3 //\\ 0111; P3: 3 //\\ 1110; P4: 1 //\\ 1000; P5: 3 //\\ 0111 → L1 = {1, 2, 3, 5}.
–P1^P2: 1 //\\ 0010; P1^P3: 2 //\\ 1010; P1^P5: 1 //\\ 0010; P2^P3: 2 //\\ 0110; P2^P5: 3 //\\ 0111; P3^P5: 2 //\\ 0110 → L2 = {13, 23, 25, 35}.
–P1^P2^P3: 1 //\\ 0010; P1^P3^P5: 1 //\\ 0010; P2^P3^P5: 2 //\\ 0110 → L3 = {235}.
{123} is pruned because {12} is not frequent; {135} is pruned because {15} is not frequent.

The P-ARM algorithm assumes a fixed value precision in all bands. The p-gen function for numeric spatial data differs from apriori-gen by using additional pruning. The AND_rootcount function calculates itemset counts directly by ANDing the appropriate basic P-trees instead of scanning the transaction database.
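AND_rootcount can be sketched with the transaction bit masks packed into integers (Python ints standing in for compressed P-trees; our own illustration): a multiway AND plus a popcount replaces a database scan.

```python
def and_rootcount(*itemvecs):
    """Root count of the AND of the operand bit masks:
    the support count of the corresponding itemset."""
    m = itemvecs[0]
    for v in itemvecs[1:]:
        m &= v                        # multiway AND
    return bin(m).count("1")          # popcount = root count

# the slide's toy database: one bit per transaction, one mask per item
P1, P2, P3, P5 = 0b1010, 0b0111, 0b1110, 0b0111
print(and_rootcount(P2, P3, P5))      # support count of {2,3,5} -> 2
```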
P-ARM versus Apriori

[Graphs: scalability with support threshold; scalability with number of transactions.] 1320×1320-pixel TIFF-Yield dataset (total number of transactions ~1,700,000); 2-bit precision; equi-length partitioning. Compared with Apriori (the classical method) and FP-growth (recently proposed). For fairness, we find all frequent itemsets, not just those containing Yield. The images are actual aerial TIFF images with synchronized yield maps. The results are identical; P-ARM is more scalable for lower support thresholds, and more scalable to large spatial datasets.
P-ARM versus FP-growth

[Graphs: scalability with support threshold; scalability with number of transactions.] 17,424,000 pixels (transactions). FP-growth is an efficient, tree-based frequent-pattern mining method (details later). The results are identical. For a dataset of 100K bytes, FP-growth runs very fast, but for images of large size, P-ARM achieves better performance, as it does in the case of low support thresholds.
P-cube and P-table of R(A1,…,An)

Given R(A1, A2, A3), form the P-trees for R, then form the data cube of all tuple P-trees, PcubeR. Applying Peano ordering to the P-cube cells defines the P-table, PtableR([], [0], [0.0], …, [1], [1.0], ...), whose column names are the quadrants, listed in depth-first (pre-order) order. [Figure: the cube of root counts, rcP(v1,v2,v3), over all value combinations (v1,v2,v3).] We can form P-trees on PtableR. What are they? What is their relationship to the Haar wavelet low-pass tree?
High Confidence Rules

Application areas on spatial data: forest fires, big-ticket-item buyer identification, gene function determination, and identification of agricultural pest infestations. Traditional algorithms are not suitable: there are too many frequent itemsets in the case of a low support threshold. Our tools are the P-tree and the P-cube. Establish a very low minsupp, though, to eliminate rules that result from noise and outliers, and eliminate redundant rules.
Confident Rule Mining Algorithm

Build the set of confident rules, C (initially empty), as follows:
–Start with 1-bit values and 2 bands;
–then 1-bit values and 3 bands; …
–then 2-bit values and 2 bands;
–then 2-bit values and 3 bands; …
–...
At each stage defined above, do the following:
–Find all confident rules by rolling up the T-cube along each potential consequent set using summation.
–Compare these sums with the support threshold to isolate rule support sets with the minimum support.
–Compare the normalized T-cube values (divided by the rolled-up sum) with the minimum confidence level to isolate the confident rules.
–Place any new confident rule in C, but only if it is non-redundant.
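A sketch of one stage of this loop on a tiny rolled-up T-cube. The counts and band values here are hypothetical; only the roll-up / threshold / normalize pattern follows the algorithm above:

```python
def confident_rules(counts, minsupp, minconf, total):
    """counts[a][c] = # rows with antecedent value a and consequent value c.
    Roll up along the consequent, then threshold support and confidence."""
    rules = []
    for a, row in counts.items():
        s = sum(row.values())                  # roll-up sum for antecedent a
        for c, n in row.items():
            if n >= minsupp * total and n / s >= minconf:
                rules.append((a, c, n / s))    # confident rule: a => c
    return rules

# hypothetical 1-bit, 2-band T-cube counts over 40 pixels
counts = {0: {0: 25, 1: 5}, 1: {0: 2, 1: 8}}
print(confident_rules(counts, minsupp=0.10, minconf=0.80, total=40))
```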
Example

[Figure: T-cube over two 1-bit bands, B1 and B2, with cells (B1,B2) in {(0,0), (0,1), (1,0), (1,1)}; roll-up sums and thresholds shown.] Assume a minimum confidence threshold of 80% and a minimum support threshold of 10%. Starting with 1-bit values and the 2 bands B1 and B2, we obtain C: B1={0} ⇒ B2={0}, with confidence c = 83.3%.
Methods to Improve Apriori's Efficiency

–Hash-based itemset counting: a k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent.
–Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans.
–Partitioning: any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
–Sampling: mine a subset of the given data with a lowered support threshold, plus a method to determine completeness.
–Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent.

The core of the Apriori algorithm: use frequent (k–1)-itemsets to generate candidate frequent k-itemsets; use database scans and pattern matching to collect counts for the candidate itemsets. The bottleneck of Apriori is candidate generation:
1. Huge candidate sets: 10^4 frequent 1-itemsets will generate ~10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1…a100}, one needs to generate 2^100 ≈ 10^30 candidates.
2. Multiple scans of the database: Apriori needs (n+1) scans, where n is the length of the longest pattern.
SMILEY Distributed P-Tree Architecture

A synchronized dataset to be data mined (at some URL): an RSI, genomic, or MBR dataset, R(A1,…,An). The basic P-trees are striped (distributed) across a Beowulf cluster (MidAS); parameters include the qid striping level(s), fanout, and implementation model. Interfaces and components: DCI (Data Capture Interface), DMI (Data Mining Interface), DADI (Drag And Drop Interface), DVI (Data Visualization Interface); SMILEY, BisonBlast, and BisonArray (ARM, classification, and clustering algorithm implementations); and the use of the data mining results.
BSM – A Bit-Level Decomposition Storage Model

A model for query optimization of all types. Vertical partitioning has been studied within the context of both centralized and distributed database systems. It is a good strategy when most queries retrieve small numbers of columns. The decomposition of a relation also permits a number of transactions to execute concurrently. Copeland et al. presented an attribute-level decomposition storage model (DSM) [CK85], storing each column of a relational table in a separate binary table; the DSM showed great comparability in performance. Beyond attribute-level decomposition, Wong et al. further took advantage of encoding attribute values in a small number of bits to reduce storage space [WLO+85]. In this paper, we decompose the attributes of relational tables to the bit-position level, use an SPJ query optimization strategy on them, store the query results in one relational table, and finally data mine it using our P-tree methods. Our method offers these advantages:
–(1) With vertical partitioning, we read only the data we need. This makes hardware caching work really well and greatly increases the effectiveness of the I/O device.
–(2) We encode attribute values in bit-vector format, which makes compression easy.
–(3) SPJ queries can be formulated as Boolean expressions, which facilitates fast implementation in hardware.
–(4) Our model fits not only query processing but data mining as well.

[CK85] G. Copeland, S. Khoshafian. A Decomposition Storage Model. Proc. ACM Int. Conf. on Management of Data (SIGMOD'85), Austin, TX, May 1985.
[WLO+85] H. K. T. Wong, H.-F. Liu, F. Olken, D. Rotem, and L. Wong. Bit Transposed Files. Proc. Int. Conf. on Very Large Data Bases (VLDB'85), Stockholm, Sweden, 1985.
SPJ Query Optimization Strategies – One-Table Selections

There are two categories of queries in one-table selections: equality queries and range queries. Most techniques used to optimize them [WLO+85, OQ97, CI98] employ encoding schemes: equality encoding and range encoding. Chan and Ioannidis [CI99] defined a more general query format called the interval query: an interval query on attribute A is a query of the form "x≤A≤y" or "NOT (x≤A≤y)". It reduces to an equality query or a range query when x or y satisfies certain conditions. We defined interval P-trees in previous work [DKR+02]; an interval P-tree is equivalent to the bit vector of the corresponding interval. So for each restriction of the form above, we have one corresponding interval P-tree, and the AND of all the corresponding interval P-trees represents all the rows satisfying the conjunction of the restrictions in the WHERE clause.

[CI98] C.Y. Chan and Y. Ioannidis. Bitmap Index Design and Evaluation. Proc. ACM Intl. Conf. on Management of Data (SIGMOD'98), Seattle, WA, June 1998.
[CI99] C.Y. Chan and Y.E. Ioannidis. An Efficient Bitmap Encoding Scheme for Selection Queries. Proc. ACM Intl. Conf. on Management of Data (SIGMOD'99), Philadelphia, PA, 1999.
[DKR+02] Q. Ding, M. Khan, A. Roy, and W. Perrizo. The P-tree Algebra. Proc. ACM Symposium on Applied Computing (SAC 2002), Madrid, Spain, 2002.
[OQ97] P. O'Neil and D. Quass. Improved Query Performance with Variant Indexes. Proc. ACM Int. Conf. on Management of Data (SIGMOD'97), Tucson, AZ, May 1997.
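A bit-vector sketch of the interval predicate (the uncompressed equivalent of an interval P-tree; the function name is our own):

```python
def interval_mask(column, x, y, negate=False):
    """1 on rows satisfying x <= A <= y (or its negation, NOT(x <= A <= y))."""
    m = [1 if x <= v <= y else 0 for v in column]
    return [1 - b for b in m] if negate else m

col = [3, 9, 5, 7]
m1 = interval_mask(col, 4, 8)            # restriction 4 <= A <= 8
m2 = interval_mask(col, 6, 9)            # restriction 6 <= A <= 9
print([a & b for a, b in zip(m1, m2)])   # AND of the two: [0, 0, 0, 1]
```

ANDing one such mask per restriction yields the rows satisfying the whole WHERE clause, which is exactly the role of the interval P-tree AND in the text.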
Select-Project-StarJoin (SPSJ) Queries

A Select-Project-StarJoin query is an SPJ query with one multiway join along with selections and projections; typically there is a central fact relation to which several dimension relations are joined. The dimension relations can be viewed as points on a star centered on the fact relation. For example, given the Student (S), Course (C), and Enrollment (E) database shown below (a bit encoding is shown in reduced italic font for certain attributes), take the SPSJ query:

SELECT S.s, S.name, C.name
FROM S, C, E
WHERE S.s=E.s AND C.c=E.c AND S.gen=M AND E.grade=A AND C.term=S

S|s____|name_|gen|   C|c____|name|st|term|   E|s____|c____|grade|
 |0 000|CLAY |M 0|    |0 000|BI  |ND|F 0 |    |0 000|1 001|B 10 |
 |1 001|THAIS|M 0|    |1 001|DB  |ND|S 1 |    |0 000|0 000|A 11 |
 |2 010|GOOD |F 1|    |2 010|DM  |NJ|S 1 |    |3 011|1 001|A 11 |
 |3 011|BAID |F 1|    |3 011|DS  |ND|F 0 |    |3 011|3 011|D 00 |
 |4 100|PERRY|M 0|    |4 100|SE  |NJ|S 1 |    |1 001|3 011|D 00 |
 |5 101|JOAN |F 1|    |5 101|AI  |ND|F 0 |    |1 001|0 000|B 10 |
                                              |2 010|2 010|B 10 |
                                              |2 010|3 011|A 11 |
                                              |4 100|4 100|B 10 |
                                              |5 101|5 101|B 10 |

The bSQ attributes are stored as separate bit files (e.g., bit 1 of S.s is labeled Ss1, etc.): Ss1, Ss2, Ss3, Sg; Cc1, Cc2, Cc3, Ct; Es1, Es2, Es3, Ec1, Ec2, Ec3, Eg1, Eg2 [bit-file contents not reproduced here]. The BSQ attributes are stored as single-attribute files:

S.name    C.name   C.st
|CLAY |   |BI |    |ND|
|THAIS|   |DB |    |ND|
|GOOD |   |DM |    |NJ|
|BAID |   |DS |    |ND|
|PERRY|   |SE |    |NJ|
|JOAN |   |AI |    |ND|
For character-string attributes, LZW or some other run-length compression could be used to further reduce storage requirements. The compression scheme should be chosen so that any range of offset entries can be uncompressed independently of the rest. Each of these BSQ files would require only a few pages of storage, allowing the entire BSQ file to be brought into memory whenever any portion of it is needed, thus eliminating the need for indexes and paging.

A bit mask is formed for each selection as follows. The bit mask for S.gen=M is just the complement of Sg (since M has been coded as 0), so mS=Sg'. Similarly, mC=Ct and mE=Eg1 AND Eg2. Logically ANDing mE into the E.s and E.c attributes reduces E.s and E.c. We note that the S.s and C.c attributes would need to be reduced only when S.s and C.c are not already surrogate attributes. In E, each tuple is compared with the participation masks, mS and mC, to eliminate non-participating (E.s, E.c) pairs. The (E.s, E.c) pairs in binary are (000, 000), (011, 001), and (010, 011), or in decimal, (0,0), (3,1), and (2,3). The masks mS and mC reveal that S.s=0,1,4 and C.c=1,2,4 are the participating values. Therefore (3,1) and (2,3) are non-participating pairs and can be eliminated, leaving exactly one participating (E.s, E.c) pair, namely (0,0). To answer the query, only the S.name value at offset 0 and the C.name value at offset 0 need to be retrieved. The output is (0, CLAY, BI).

To review: once the basic P-trees for the join and selection attributes have been processed to remove all non-participants, only the participating BSQ values need to be accessed. The basic P-tree files for the join and selection attributes would typically be striped across a cluster of nodes so that the AND operations can be done very quickly, in parallel.
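The mask step can be sketched over the bSQ bit vectors of this example (the bit values are transcribed from the S and E tables above; the plain-list representation is our own, standing in for P-trees):

```python
AND = lambda a, b: [x & y for x, y in zip(a, b)]
NOT = lambda a: [1 - x for x in a]

# S.gen bits for CLAY, THAIS, GOOD, BAID, PERRY, JOAN (M coded as 0)
Sg = [0, 0, 1, 1, 0, 1]
mS = NOT(Sg)                 # S.gen = M  -> rows 0, 1, 4 participate

# E.grade bits for the ten E rows (A coded as 11)
Eg1 = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
Eg2 = [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
mE = AND(Eg1, Eg2)           # E.grade = A -> rows 1, 2, 7 participate

print(mS)   # [1, 1, 0, 0, 1, 0]
print(mE)   # [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
```

The surviving E rows are exactly the (E.s, E.c) pairs (0,0), (3,1), and (2,3) named in the text, which mS and mC then whittle down to (0,0).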
Our implementation on a 16 node cluster of 266 MHz Pentium computers shows that any multiway AND operation can be done in a few milliseconds.
Select-Project-Join (SPJ) Queries

We now deal with an example in which more than one join is required and there is more than one join attribute (a bushy query tree). We organize query trees using the "constellation" model, in which one of the fact files is considered central and the others are points in a star around it; each secondary star-point fact file can be the center of a "sub-star". We apply the selection masks first, then perform semi-joins from the boundary toward the central fact file, and finally perform semi-joins back out again. The result is the full elimination of all non-participants. The following is an example of such a bushy query (details identical to the above are not repeated):

SELECT S.n, C.n, R.capacity
FROM S, C, E, O, R
WHERE S.s=E.s & C.c=O.c & O.o=E.o & O.r=R.r
  & S.gen=M & C.cred=2 & E.grade=A & R.capacity=10;

In this query, O is taken as the central fact relation of the following database (bit encodings in reduced font; the bSQ bit files Ss1, Ss2, Ss3, Sgen; Cc1, Cc2, Ccred1, Ccred2; Es1, Es2, Es3, Eo1, Eo2, Eo3, Egrade1, Egrade2; Oo1, Oo2, Oo3, Oc1, Oc2, Or1, Or2; Rr1, Rr2, Rcap1, Rcap2 and the BSQ name files Sn = A,T,S,B,C,J and Cn = B,D,M,S are not reproduced here):

S___________   C___________   E_________________   O_______________   R_____________
|s    |n|gen|  |c   |n|cred|  |s    |o    |grade|  |o    |c   |r   |  |r   |capacity|
|0 000|A|M 0|  |0 00|B|1 01|  |0 000|1 001|2 10 |  |0 000|0 00|0 01|  |0 00|30 11   |
|1 001|T|M 0|  |1 01|D|3 11|  |0 000|0 000|3 11 |  |1 001|0 00|1 01|  |1 01|20 10   |
|2 010|S|F 1|  |2 10|M|3 11|  |3 011|1 001|3 11 |  |2 010|1 01|0 00|  |2 10|30 11   |
|3 011|B|F 1|  |3 11|S|2 10|  |3 011|3 011|0 00 |  |3 011|1 01|1 01|  |3 11|10 01   |
|4 100|C|M 0|                 |1 001|3 011|0 00 |  |4 100|2 10|0 00|
|5 101|J|F 1|                 |1 001|0 000|2 10 |  |5 101|2 10|2 10|
                              |2 010|2 010|2 10 |  |6 110|2 10|3 11|
                              |2 010|3 011|3 11 |  |7 111|3 11|2 10|
                              |4 100|4 100|2 10 |
                              |5 101|5 101|2 10 |
1. Apply the selection masks mE, mR and mC to the P-trees Es1..Es3, Eo1..Eo3, Rc1, Rc2, Cr1, Cr2 and Oo1..Oo3, Oc1, Oc2, Or1, Or2. [Slide shows the resulting bit vectors.] 2. Apply the join-attribute selection masks externally to further reduce the P-trees: from mS, s=0,1,4 are the participants; from mC, c=3 is the only participant; from mR, r=2 is the only participant. [Slide shows the reduced bit vectors.] 3. Completing the elimination of newly discovered non-participants internally in each file [slide shows the final bit vectors], s has to be 0 (000), o has to be 0 (000), c has to be 3 (11) and r has to be 2 (10). But o also has to be 7 (111). Since o cannot be both 0 and 7, there are no participants and the query answer is empty.
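A sketch (again with ints standing in for bit vectors, not the authors' code) of how a set of participating join values reduces a column: a value P-tree is the AND of bit-slices or their complements, and the participation mask is the OR over all participating values. The E.o column below restates the example's data; the helper names are ours.

```python
def value_mask(slices, nbits, value, nrows):
    """Rows whose nbits-wide attribute equals value (a value P-tree)."""
    full = (1 << nrows) - 1
    m = full
    for j, s in enumerate(slices):          # slices[0] is the most significant
        bit = (value >> (nbits - 1 - j)) & 1
        m &= s if bit else (~s & full)
    return m

def participation_mask(slices, nbits, values, nrows):
    """OR of the value P-trees over all participating values."""
    m = 0
    for v in values:
        m |= value_mask(slices, nbits, v, nrows)
    return m

# Build the three E.o bit-slices from the example column (bit i = row i).
o_vals = [1, 0, 1, 3, 3, 0, 2, 3, 4, 5]
slices = [sum(((v >> (3 - 1 - j)) & 1) << i for i, v in enumerate(o_vals))
          for j in range(3)]

print(participation_mask(slices, 3, {0}, 10))   # rows 1 and 5 have o = 0
```

When the participation masks from two sides of a join have an empty intersection (as with o = 0 versus o = 7 above), the query is answered without retrieving any BSQ data at all.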
DISTINCT Keyword, GROUP BY Clause, ORDER BY Clause, HAVING Clause and Aggregate Operations Duplicate elimination after a projection (the SQL DISTINCT keyword) is one of the most expensive operations in query optimisation; in general, it is as expensive as the join operation. However, in our approach it can be done automatically while forming the output tuples, since that is done in order: while forming all output records for a particular value of the ORDER BY attribute, duplicates can easily be eliminated without the need for an expensive algorithm. The ORDER BY and GROUP BY clauses are very commonly used in queries and can require a sort of the output relation. However, in our approach, if the central relation is chosen to be the one with the sort attribute, and the surrogation follows the attribute order (typically the case, and always the case for numeric attributes), then the final output records can be put together and aggregated in the requested order without a separate sort step and at no additional cost. Aggregation operators such as COUNT, SUM, AVG, MAX and MIN can be implemented without additional cost during the output formation step, and any HAVING decision can be made as output records are being composed as well. If the COUNT aggregate is requested by itself, we note that P-trees automatically provide the full count for any predicate with just one multiway AND operation.
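The COUNT-with-one-multiway-AND observation can be illustrated on uncompressed bit vectors, where the count for a conjunctive predicate is just the popcount of the AND (a sketch, not the compressed P-tree algorithm):

```python
from functools import reduce

def root_count(masks):
    """Number of rows satisfying the conjunction of the given bit vectors."""
    return bin(reduce(lambda a, b: a & b, masks)).count("1")

# e.g. two predicate masks over 4 rows
print(root_count([0b1101, 0b1011]))   # 0b1001 -> 2 rows
```

No output tuples are formed at all: the count falls out of the AND itself, which is why a bare COUNT is essentially free in this model.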
The following example illustrates these points. SELECT DISTINCT C.c, R.capacity FROM S,C,E,O,R WHERE S.s=E.s AND C.c=O.c AND O.o=E.o AND O.r=R.r AND C.cred>1 AND (E.grade='B' OR E.grade='A') AND R.capacity>10 ORDER BY C.c; (The database is the same as in the previous example.) Apply the selection masks: mE = Egrade1, mR = Rcap1, mC = Ccred1.
This results in the reduced P-trees Es1..Es3, Eo1..Eo3, Rr1, Rr2, Cc1, Cc2. Semijoining toward the center (E with O on o=0,1,2,3,4,5; R with O on r=0,1,2; C with O on c=1,2,3) reduces the O P-trees Oo1..Oo3, Oc1, Oc2, Or1, Or2. [Slide shows the reduced vectors.] Thus, the participants are c=1,2; r=0,1,2; o=2,3,4,5. Semijoining back again reduces Cc1, Cc2, Rr1, Rr2 and Es1..Es3, Eo1..Eo3; the s participants are s=2,4,5 (reducing Ss1..Ss3). Output tuples are determined from the participating O.c P-trees. RC(P_O.c(2)) = RC(Oc1 ^ Oc2') = 2. Since the 1-bits of Oc1 ^ Oc2' are in positions 4 and 5, the two O-tuples have O.o surrogate values 4 and 5. The r-values at positions 4 and 5 of O.r are 0 and 2. Thus, we retrieve the R.capacity values at offsets 0 and 2. However, both of these R.capacity values are 30, so this duplication is discovered without sorting or additional processing. The only output for c=2 is (2,30). Similarly, RC(P_O.c(1)) = RC(Oc1' ^ Oc2) = 2. Finally, note that if the ORDER BY clause is over an attribute which is not in the relation O (e.g., over student number, s), then we center the query tree (or wheel) on a fact file that contains the ORDER BY attribute (e.g., on E in this case). If the ORDER BY attribute is not in any fact file (it is in a dimension file only), then the final query tree can be rearranged to center on the dimension file containing that attribute. Since output ordering and duplicate elimination are traditionally very expensive sub-operations of SPJ query processing, the fact that our BDM model and the P-tree data structure provide a fast and efficient way to accomplish these operations is a very favorable aspect of the approach.
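The duplicate discovery above can be sketched in a few lines. The bit-slices are the reduced Oc1 and Oc2 from the example (ints, bit i = row i); O.r entries other than positions 4 and 5 are illustrative, and the helper names are ours.

```python
Oc1, Oc2 = 0b00110000, 0b00001100    # reduced O.c bit-slices from the example

def positions(mask):
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

O_r = [0, 1, 0, 1, 0, 2, 3, 2]       # O.r column (positions 4, 5 as in the text)
R_capacity = [30, 20, 30, 10]        # R.capacity indexed by surrogate r

mask_c2 = Oc1 & ~Oc2 & 0xFF          # value P-tree for c = 2 (binary 10)
seen, out = set(), []
for o in positions(mask_c2):         # O.o surrogates 4 and 5
    tup = (2, R_capacity[O_r[o]])
    if tup not in seen:              # duplicate caught during output formation
        seen.add(tup)
        out.append(tup)
print(out)                           # [(2, 30)]
```

Because the 1-bit positions are scanned in surrogate order, the output comes out already sorted by the ORDER BY attribute and deduplicated, with no separate sort or distinct step.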
Combining Data Mining and Query Processing Many data mining requests involve pre-selection, pre-join and pre-projection of a database to isolate the specific data subset to which the data mining algorithm is to be applied. For example, in the above database, one might be interested in all association rules of a given support threshold and confidence threshold across all the relations of the database. The brute-force way to do this is to first join all relations into one universal relation and then mine that gigantic relation. This is not a feasible solution in most cases, due to the size of the resulting universal relation. Furthermore, some selection on that universal relation is often desirable prior to the mining step. Our approach accommodates combinations of querying and data mining without necessitating the creation of a massive universal relation as an intermediate step. Essentially, the full vertical partitioning and P-trees provide a selection and join path which can be combined with the data mining algorithm to produce the desired solution without extensive processing and massive space requirements. The collection of P-trees and BSQ files constitutes a lossless, compressed version of the universal relation. Therefore the above techniques, combined with the required data mining algorithm, can produce the combined result very efficiently and directly.
Appendix on examples and implementations Example 1: one band, B1, with 3-bit precision. Each bit-slice P-tree is stored as a table of (qid, NP0, P1) entries (the PNP0V and P1V vectors combined into one table). [Slides show the tables for P11, P12 and P13: P12 has entries at qids [ ], [10] and the leaf [10.11] = 0111; P13 has entries at [ ], [01], [10] and the leaves [01.11] = 0110 and [10.00] = 1000. The band B1 image and its bit-slices B11, B12, B13 are also shown.] Note that the pair is redundant at a leaf, since at a leaf NP0 = P1.
Example 1: ANDing to get rc(P1(6)). P1(6) = P1(110) = P1_11 ^ P1_12 ^ NP0'_13 (the 0-bit of the pattern 110 swaps and complements the pair for slice 13). PM(110) = P1(110) xor NP0(110), where NP0(110) = NP0_11 ^ NP0_12 ^ P1'_13.
At [ ]: CNT[ ] = 1-cnt x 4^level = 1 x 4^2 = 16, since P1(110)[ ] = 1001 ^ 1000 ^ 1000 = 1000. PM(110)[ ] = 1000 xor (1111 ^ 1010 ^ 1110) = 1000 xor 1010 = 0010, so [10] is the only mixed child.
At [10]: CNT[10] = 0 x 4^1 = 0, since P1(110)[10] = 1101 ^ 1110 ^ 0001 = 0000. PM(110)[10] = 0000 xor (1111 ^ 1111 ^ 1001) = 1001, so [10.00] and [10.11] are the mixed children.
At [10.00]: CNT[10.00] = 3 x 4^0 = 3, since P1(110)[10.00] = 1111 ^ 1111 ^ 0111 = 0111.
At [10.11]: CNT[10.11] = 3 x 4^0 = 3, since P1(110)[10.11] = 1111 ^ 0111 ^ 1111 = 0111.
Thus, rc(P1(6)) = 16 + 0 + 3 + 3 = 22.
[Slide shows the combined (Bp, qid, NP0, P1) table.] For P(p) = P(p1, ..., pn), at each qid [..]: 1. swap and take the bit-complement of each [..]NP0V / [..]P1V pair corresponding to a 0-bit of p; 2. AND the resulting vector-pairs, giving [..]NP0V(p) and [..]P1V(p); 3. xor the two vectors to get the PMV(p) for the next level.
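The three-step recipe can be checked directly. In this sketch (ours, not the authors' implementation) the vectors are 4-bit ints with the most significant bit as the first child position, and the (NP0, P1) pairs restate the example's [ ]-level vectors for slices 11, 12 and 13.

```python
def and_at_qid(pairs, pattern_bits, width):
    """AND NP0V/P1V pairs at one qid for a value pattern; return NP0, P1, PMV."""
    full = (1 << width) - 1
    np0 = p1 = full
    for (np0v, p1v), bit in zip(pairs, pattern_bits):
        if bit == 0:                      # swap and complement the pair
            np0v, p1v = ~p1v & full, ~np0v & full
        np0 &= np0v
        p1 &= p1v
    return np0, p1, np0 ^ p1              # PMV flags the mixed children

# [ ]-level (NP0, P1) pairs for slices 11, 12, 13, and the pattern 110
pairs = [(0b1111, 0b1001), (0b1010, 0b1000), (0b0111, 0b0001)]
np0, p1, pm = and_at_qid(pairs, (1, 1, 0), 4)
print(bin(p1), bin(pm))   # P1 = 1000 (child [00] pure-1), PMV = 0010 ([10] mixed)
```

This reproduces the [ ]-level step of the example: the pure-1 child contributes 1 x 4^2 = 16 to the count, and only the [10] subtree needs to be descended.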
ANDing in the NP0V-P1V Vector-Pair Format For P(p) = P(p1, ..., pn) (previous example: P1(6) at qid [ ]), at each qid [..]: 1. swap and complement each [..]NP0V / [..]P1V pair corresponding to a 0-bit of p (the result is denoted with *); 2. AND the resulting vector-pairs, giving [..]NP0V(p) and [..]P1V(p); 3. xor the two vectors to get [..]PMV(p) for the next level. [Slide shows the worked bit-position table: the NP0V* and P1V* columns, the ANDed NP0V(p) and P1V(p), and the resulting PMV(p).]
Distributed P-trees? Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11. Send each entry to Node_ij if its qid ends in ij (root entries stay at NodeC). [Slides show the resulting (Bp, qid, NP0, P1) tables at each node.] P1(110) = P1_11 ^ P1_12 ^ NP0'_13 and PM(110) = P1(110) xor (NP0_11 ^ NP0_12 ^ P1'_13), computed as before but now in parallel:
At NodeC: CNT[ ] = 1 x 4^2 = 16, since P1(110)[ ] = 1001 ^ 1000 ^ 1000 = 1000; PM(110)[ ] = 1000 xor (1111 ^ 1010 ^ 1110) = 0010.
At Node10: CNT[10] = 0 x 4^1 = 0, since P1(110)[10] = 1101 ^ 1110 ^ 0001 = 0000; PM(110)[10] = 0000 xor (1111 ^ 1111 ^ 1001) = 1001.
At Node00: CNT[10.00] = 3 x 4^0 = 3, since P1(110)[10.00] = 1111 ^ 1111 ^ 0111 = 0111.
At Node11: CNT[10.11] = 3 x 4^0 = 3, since P1(110)[10.11] = 1111 ^ 0111 ^ 1111 = 0111.
Every node sends its accumulated CNT to NodeC, where rc(P1(6)) = 16 + 0 + 3 + 3 = 22 is calculated.
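The suffix-based striping rule can be stated in one function (a sketch with illustrative names; qids represented as tuples of 2-bit segment strings):

```python
def home_node(qid):
    """Suffix rule: an entry lives on the node named by its last qid segment;
    root entries (empty qid) stay on the central node."""
    return "NodeC" if not qid else "Node" + qid[-1]

entries = [(), ("10",), ("10", "11"), ("01", "00")]
print([home_node(q) for q in entries])   # ['NodeC', 'Node10', 'Node11', 'Node00']
```

Because a qid and all of its siblings share the same suffix rule only at their own level, each level of a multiway AND lands on a single node, which is what lets the per-level CNT and PM computations above run in parallel.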
Distributed P-trees? [Slides again show the tables for P11, P12 and P13.] Alternatively, send an entry to Node_ij if its qid starts with the qid segment ij. Is this better? How would the AND code be revised? What about AND performance? Or: send to Node_ij if the largest qid segment divisible by p is ij; e.g., if p=4: [0]->0; [0.3]->0; [0.3.2]->0; [ ]->2; [ ]->2; [ ]->2; [ ]->2; [ ]->2; [ ]->1, etc. This is similar to a fanout of 4; implement it by multicasting externally only at every 4th segment. More generally, choose any increasing sequence p=(p1..pL), define x_p = max{p_i <= x}, then multicast [s1.s2...sk] --> Node k_p. [Slide shows the resulting node tables for NodeC, Node00, Node01, Node10 and Node11.]
Distributed P-trees? [Slides again show the tables for P11, P12 and P13.] Alternatively, the sequence can be a tree in the most general setting (i.e., a different sequence can be used on different branches, tuned to the very best tree of "multicast delays"): define a function F: {set of qids} --> {0,1,...} where if F([q1.q2...qn]) = p > 0 then F([q1.q2...qn-1]) = p-1, and if F([q1.q2...qn]) = 0 then there is a multicast at this level. Said another way, there is a "multicast tree" that tells you when to multicast (to the node corresponding to the last segment of the qid). [Slide shows an example multicast tree rooted at [ ].] Each node knows whether it is supposed to make a distributed call for the next level or to compute that level itself (multicast to itself) by consulting the tree (or we could attach that information when we stripe). In this way we have full flexibility to tune the multicast-compute balance to minimize execution time, on a per-P-tree basis. The AD (All Digital) implementation vector format replaces the qid column with a depth-first ordered vector indicating the mixed internal nodes.
Example 1 (bottom-up) Band B1, with 3-bit values. Leaf quads arrive in Peano order: [00.00], [00.01], [00.10] and [00.11] are all 1111, so they collapse into the pure quad [00]. The next leaf, [01.00] = 1110, ends the possibility of a larger pure1 quad, so [00] can be installed in the parent as a pure1 bit. [01.00] is a mixed leaf quad and is sent (to Node-00); [01.01] = 0000 also ends the possibility that the parent is pure, so it and all its siblings are installed as bits in the parent. [01.11] = 0001 is a mixed leaf quad and is sent (to Node-11); it ends the parent, so bits are installed in the grandparent as well. [Slides show the accumulating (Bp, qid, NP0, P1) tables and the resulting entries at Node-00 ([01.00] = 1110), Node-01 ([01]), Node-11 ([01.11] = 0001) and Node-C (the partially filled root [ ]).]
Example 1 (bottom-up), continued. Leaf quads [10.00] and [10.01] arrive as 1111; [10.10] = 1101 is mixed and is sent to Node-10, ending the possibility of a larger pure1 quad, so all can be installed in the parent/grandparent as 1-bits. Quads [11.00] through [11.11] are all 1111, ending quad [11], so all can be installed in the parent as a single 1-bit. [Slides show the final tables at Node-00 ([01.00] = 1110), Node-01 ([01]), Node-10 ([10.10] and [10]), Node-11 ([01.11] = 0001) and Node-C ([ ]).] Bottom-up bottom line: since it is better to use 2-D than 3-D (higher compression), should it be better to use 1-D than 2-D? This should be investigated.
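The bottom-up classification step can be sketched for a single level: scanning leaf quads in Peano order, a quad is emitted only if it is mixed, while pure quads become single bits in the parent. The 4x4 band below is illustrative, not the example's data.

```python
bits = [1, 1, 1, 1,   1, 1, 1, 0,   0, 0, 0, 1,   1, 1, 1, 1]  # Peano order

def classify(quad):
    """Label a quad as pure1, pure0 or mixed by its bit sum."""
    s = sum(quad)
    return "pure1" if s == len(quad) else "pure0" if s == 0 else "mixed"

quads = [bits[i:i + 4] for i in range(0, 16, 4)]
labels = [classify(q) for q in quads]
print(labels)                         # ['pure1', 'mixed', 'mixed', 'pure1']

# Only mixed quads are sent on; pure quads install a bit in the parent's P1.
parent_P1_bits = [1 if lab == "pure1" else 0 for lab in labels]
```

Running the same classification one level up on the parent bits reproduces the "install in grandparent" step: a run of pure children collapses again before anything is emitted.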
Example 2. [Slide shows two bands, B1 and B2, their bit-slices B11, B12, B13 and B21, B22, B23, and the relation (X, Y, B1, B2).]
Example 2: Striping. The relation (X, Y, B1, B2) is split into the bit-slices B11, B12, B13, B21, B22, B23, keyed by the interleaved coordinate bits x1 y1 x2 y2 x3 y3 (Peano order; a raster-ordered file must first be converted). PNP0 entries are formed by OR and P1 entries by AND over children. [Slide shows the PNP0V and P1V vectors per band and bit position at qid [ ], the purity template [ ], and the striping: B21, B22, B23 entries are sent to Node00; B11, B13, B22, B23 to Node01; B12, B13, B21, B22, B23 to Node10; nothing to Node11; the [ ]-level entries remain at NodeC.]
Example 2: striping at Node00. [Slide shows the PNP0V/P1V vectors for bit position [00], the purity template [00], the entries received from [01] (B11) and [10] (B12, B13, B23), the send of the B21 and B22 entries to Node01 (nothing to Node00, Node10 or Node11), and the resulting (Bp, qid, NP0, P1) pages on disk.]
Example 2: striping at Node01. [Slide shows the PNP0V/P1V vectors for bit position [01], the purity template [01], the entries received from [00] (B21, B22) and [10] (B23), the send of [01] B11 B23 to Node00 (nothing to Node01, Node10 or Node11), and the resulting pages on disk.]
Example 2: striping at Node10. [Slide shows the PNP0V/P1V vectors for bit position [10], the purity template [10], the sends ([10] B13 B21 B23 to Node00; [10] B23 to Node01; [10] B12 B21 B22 to Node11; nothing to Node10), and the resulting pages on disk.]
Example 2: striping at Node11. [Slide shows the entries received from [10] (B12, B22, B23 at [10.11]) and the resulting (Bp, qid, NP0, P1) pages on disk.]
Example 2.1: AND at NodeC (qid [ ]). RC(P_101,010) = RC(P11 ^ P'12 ^ P13 ^ P'21 ^ P22 ^ P'23). [Slide shows the pages on Disk00, Disk01, Disk10, Disk11 and DiskC, and the P1-pattern and NP0-pattern tables: for the 0-bits of the value pattern (slices 12, 21, 23) the NP0V/P1V pair is swapped and complemented (primed), while the 1-bit slices (11, 13, 22) are used as-is.] At [ ], the NP0 AND is 0111 and the P1 AND is 0001; Sum = 8 so far. Since 0111 xor 0001 = 0110, children [01] and [10] are mixed, so the invocation [ ] 101,010 is sent to Nodes 01 and 10.
Example 2.1: AND at Node01. The invocation [ ] 101,010 is received; the [01] NP0 AND and P1 AND are computed, and the invocation [01] 101,010 is sent to Node00 (the mixed child). [Slide shows the disk pages and the P1-pattern and NP0-pattern tables.]
Example 2.1: AND at Node10. The invocation [ ] 101,010 is received, and the [10] NP0 AND and P1 AND are computed. No child is mixed, so nothing is sent on. [Slide shows the disk pages and the P1-pattern and NP0-pattern tables.]
Example 2.1: AND at Node00. The invocation [01] 101,010 is received and the [01.00] P1 AND is computed. Sum = 1, sent to NodeC, giving a sum total of 9. [Slide shows the disk pages and the P1-pattern table.]
Example 2.2: AND at NodeC (qid [ ]). RC(P_100,101) = RC(P11 ^ P'12 ^ P'13 ^ P21 ^ P'22 ^ P23). At [ ], the NP0 AND is 0010 and the P1 AND is 0000; Sum = 0 so far. Since 0010 xor 0000 = 0010, child [10] is mixed, so the invocation [ ] 100,101 is sent to Node10. [Slide shows the disk pages and the P1-pattern and NP0-pattern tables.]
Example 2.2: AND at Node10. The invocation [ ] 100,101 is received and the [10] NP0 AND and P1 AND are computed; the mixed child leads to the invocation [10] 100,101 being sent to Node11. [Slide shows the disk pages and the P1-pattern and NP0-pattern tables.]
Example 2.2: AND at Node11. The invocation [10] 100,101 is received and the [10] P1 AND is computed. Sum = 1, sent to NodeC, giving a sum total of 1. [Slide shows the disk pages.]
Example 2, bottom-up. [Slides show the Peano-ordered bit-slice table (x1 y1 x2 y2 x3 y3, B11 B12 B13 B21 B22 B23) and the (Bp, qid, NP0, P1) entries accumulating as the leaf quads [00.00], [00.01], [00.10] and [00.11] arrive in turn. Mixed quads are sent to the appropriate node as soon as they complete (e.g., B21's [00.01] quad, 0001, to Node01 and B23's [00] quad to Node00), and the [00] quads that turn out pure are installed as single bits in their parents.]
Appendix: Firm Math Foundation (the RRN order is assumed to be raster order). Given any relation or table R(A1..An), assign RRNs {0, 1, ..., (2^d)^L - 1} (d = dimension, L = level). Write RRNs as bit strings x11..x1L.x21..x2L..xd1..xdL (for d=2: x1..xL y1..yL). For k = 0..L, define the concept of a level-k polytant Q[x11 x21..xd1 . x12..xd2 .. x1k..xdk] by Q = {t in R | t.K_ij = x_ij}, where K_ij is the ij-th bit of the RRN.
- Q = (SR_dk([x11..x1L.x21..x2L..xd1..xdL])).R = {t | t.R in SR_dk([x11..x1L.x21..x2L..xd1..xdL])} (tuple variable notation).
- For d=2, Q[x1 y1..xk yk] is a quadrant.
- Q[ ] = R; Q[x11 x21..xd1 . x12..xd2 .. x1L..xdL] = a single tuple (a 1x..x1-polytant).
- This imposes a "d-space" structure on R (for RSI data, which already has such a structure, this step can be skipped).
Quadrant-conditions: on each quadrant Q in R, define conditions (Q -> {T,F}) at level k:
pure1: true if C is true of all Q-tuples
pure0: true if C is false of all Q-tuples
mixed: true if C is true of some Q-tuples and false of some Q-tuples
p-count: true if C is true of exactly p Q-tuples (0 <= p <= card Q = 2^dk)
Every P-tree is a quadrant-condition P-tree on R; e.g., the basic P-tree P_ij is P_cond where cond = (SR_8-j(SL_j-1(t.A_i))). P1_i(v), for a value v of A_i, is P_cond where cond = (t.A_i = v, t in Q). NP0(a1..an) is P_cond where cond = (for each i, there is a t in Q with t.A_i = a_i). Notation: bSQ files, P_ij(cond); BSQ files, P_i(cond); relations, P.
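The bit-interleaving that distinguishes Peano order from raster order (raster concatenates x1..xL y1..yL, while Peano interleaves x1 y1 x2 y2 ...) can be written directly; a sketch for d = 2 with L-bit coordinates, with function names of our own choosing:

```python
def peano_key(x, y, L):
    """Interleave the L coordinate bits of (x, y), msb first: x1 y1 x2 y2 ..."""
    key = 0
    for i in range(L - 1, -1, -1):        # from most significant bit down
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key

def qid(x, y, L):
    """Level segments of the Peano key, e.g. [x1y1, x2y2, ...]."""
    k = peano_key(x, y, L)
    return [(k >> (2 * (L - 1 - j))) & 0b11 for j in range(L)]

print(peano_key(2, 1, 2))   # x = 10, y = 01 -> interleaved 1001 = 9
print(qid(2, 1, 2))         # segments [10, 01] -> [2, 1]
```

Each 2-bit segment of the key names one level of quadrant nesting, which is exactly the qid used throughout the P-tree tables above.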