Presentation on theme: "Data Warehouses and DBMSs" - Presentation transcript:

1 Data Warehouses and DBMSs  C.J. Date, circa 1980: Do transactions on a DBMS rather than file processing on file systems. "Using a DBMS instead of file systems unifies data resources, centralizes control, promotes standards and consistency, eliminates redundancy, increases data value and usage, yadda, yadda."  Inmon, et al., circa 1990: "Buy a separate Data Warehouse for long-running queries and data mining" (separate from the DBMS used for transaction processing). "Double your hardware! Double your software! Double your fun!"

2 Data Querying, Analysis and Mining on a Data Warehouse vs. Transaction Processing on a Database  What happened? It was a great marketing success!  And a great Concurrency Control R&D failure! CC R&D people failed to integrate transactions and queries (OLTP and OLAP, i.e., updates and reads) in one system with acceptable performance.  Marketing of Data Warehouses was so successful that nobody noticed the failure!  Most enterprises now have a separate DW and DBMS.

3 Some still hope that DWs and DBs will be unified again, and the industry may demand it eventually (e.g., there is already work on updating DWs). For now let's just focus on DATA. You run up against two curses immediately in data processing, querying and mining.  Curse of cardinality: solutions don't scale with respect to data volume.  Curse of dimensionality: solutions don't scale with respect to the number of attribute dimensions.  The curse of cardinality was a problem in the horizontal world too! It was disguised as "the curse of the slow join". In the horizontal data world we decompose relations to get good design (e.g., 3rd normal form); we pay for it by requiring many slow joins to get the answers we need.

4 Let's talk about techniques we use to address these curses.  Horizontal processing of vertically structured data, instead of the ubiquitous vertical processing of horizontal (record-oriented) data.  Parallelize the engine: parallelize the software engine on clusters of computers; parallelize the greyware engine on clusters of people (i.e., browser-enable all software for visualization).  Why do we need better techniques for data analysis, querying and mining? Data volume expands by Parkinson's Law: data volume expands to fill the available data storage. Disk storage expands by Moore's Law: available storage doubles every 9 months!

5 We're awash with data!  Network data: hi-speed, DWDM, all-optical (mgmt, flow classification, QoS, security): 10 terabytes by 2004 ~ 10^13 B.  US EROS Data Center (EDC) archives of Earth Observing System (EOS) Remotely Sensed Imagery (RSI), satellite and aerial photo data: 15 petabytes by 2007 ~ 10^16 B.  National Virtual Observatory (aggregated astronomical data): 10 exabytes by 2010 ~ 10^19 B.  Sensor data (including micro- and nano-sensor networks): 10 zettabytes by 2015 ~ 10^22 B.  WWW (and other text collections): 10 yottabytes by 2020 ~ 10^25 B.  Genomic/Proteomic/Metabolomic data (microarrays, genechips, genome sequences): 10 gazillabytes by 2030 ~ 10^28 B?  Stock market prediction data (prices + all the above, especially astronomy data?): 10 supragazillabytes by 2040 ~ 10^31 B?  Useful information must be teased out of these large volumes of data. (I had to make up the last of these names! Projected data sizes are outrunning our ability to name those sizes.)

6 More's Less  The more volume you have, the less information you have. (AKA: Shannon's Canon.) A simple illustration: Which phone book has more info? (Both involve the same 4 data granules.)

BOOK-1 (Name, Number): Smith 234-9816; Jones 231-7237
BOOK-2 (Name, Number): Smith 234-9816; Smith 231-7237; Jones 234-9816; Jones 231-7237

More's Law: Data analysis, querying and mining reduce volume and raise the information level.

7 A Precision Agriculture example  Yield prediction: the dataset consists of an aerial photograph (an RGB TIFF image taken during the growing season) and a synchronized yield map (crop yield taken at harvest); thus 4 feature attributes (B,G,R,Y) and ~100,000 pixels.  Producers want the color-intensity patterns to yield associations. One is "hi_green & low_red → hi_yield". It is very intuitive. A stronger association was found strictly by analyzing (mining) the data: "hi_NIR & low_red → hi_yield". Once found in historical data (through data mining), producers just query TIFF images mid-season for low_NIR & high_red grid cells, where they then apply additional nitrogen.  This concept can detect ag insurance fraud (http://www.npr.org/templates/story/story.php?storyId=5013871), forest fires, wetlands drainage, etc.  Grasshopper Infestation Prediction (again involving RSI data): Grasshoppers cause significant economic loss each year. Early infestation prediction is key to damage control. Pixel classification on remotely sensed imagery holds significant promise for early detection. Pixel classification (signaturing) has many apps: pest detection, fire detection, wetlands monitoring... (For signaturing we developed the SMILEY software/greyware system: http://midas.cs.ndsu.nodak.edu/~smiley)

8 Sensor Network Data  Micro- and nano-scale sensor blocks are being developed for sensing:  Biological agents  Chemical agents  Motion detection  Coatings deterioration  RF-tagging of inventory (RFID tags for Supply Chain Mgmt)  Structural materials fatigue  There will be trillions++ of individual sensors creating mountains of data.

9 Sensor Network Application: CubE for Active Situation Replication (CEASR)  Nano-sensors are dropped into a situation space. Wherever a threshold level is sensed (chem, bio, thermal...), a ping is registered in a compressed structure (a P-tree; detailed definition coming up) for that location. Each energized nano-sensor transmits a ping, and its location is triangulated from the ping. These locations are then translated to 3-dimensional coordinates at the display, and the corresponding voxel on the display lights up. The single compressed structure (P-tree) containing all the information is transmitted to the cube, where the pattern is reconstructed (uncompressed, displayed). A soldier sees a replica of the sensed situation prior to entering the space.  Using Alien Technology's Fluidic Self-Assembly (FSA) technology, clear plexiglass laminates are joined into a cube, with an embedded nano-LED at each voxel.  This is the expendable, one-time, cheap sensor version. A more sophisticated CEASR device could sense and transmit intensity levels, lighting up the display voxel with the same intensity.

10 Anthropology Application  Digital Archive Network for Anthropology (DANA): analyze, query and mine anthropological artifacts (shape, color, discovery location, ...).

11 What is Data Mining?  Querying is asking specific questions and expecting specific answers. Data Mining goes into MOUNTAINS of raw data for info gems. The pipeline (with loop-backs at each stage):  Data Cleaning/Integration: missing data, outliers, noise, errors  Data Warehouse: a cleaned, integrated, read-only, periodic, historical raw database ("smart files")  Task-relevant Data Selection: feature extraction, tuple selection  Data Mining: OLAP, classification, clustering, rule mining  Pattern Evaluation and Assay: visualization

12 Data Mining versus Querying  There is a whole spectrum of techniques to get information from data.  On the query end: SQL SELECT-FROM-WHERE, complex queries (nested, EXISTS...), standard querying.  Searching and aggregating: FUZZY queries, search engines, BLAST searches, OLAP (rollup, drilldown, slice/dice...).  Machine learning / data mining: supervised learning (classification, regression), unsupervised learning (clustering), association rule mining, data prospecting, fractals, ...  On the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record '02). On the data mining end, the surface has barely been scratched. But even those scratches had a great impact: one of the early scratchers (Walmart) became the biggest corporation in the world last year, while a non-scratcher (KMart) filed for bankruptcy.  Our approach: vertical, compressed data structures, Predicate-trees or Peano-trees (Ptrees in either case)¹, processed horizontally (DBMSs process horizontal data vertically). Ptrees are data-mining-ready, compressed data structures which attempt to address the curses of cardinality and dimensionality.  ¹ Ptree technology is patent pending by North Dakota State University.

13 Vertical Data Structures History  In the 1980's vertical data structures were proposed for record-based workloads:  Decomposition Storage Model (DSM, Copeland et al)  Attribute Transposed File (ATF)  Bit Transposed File (BTF, Wang et al); Viper  Band Sequential Format (BSQ) for Remotely Sensed Imagery.  The DSM and BTF initiatives have disappeared. Why? (next slide)  Vertical auxiliary and system structures:  Domain & Request Vectors (DVA/ROLL/ROCC, Perrizo, Shi, et al): vertical system structures (query optimization & synchronization)  Bit Mapped Indexes (BMIs, very popular in Data Warehouses); all indexes are really vertical auxiliary structures:  BMIs use bit maps (a positional approach to identifying records)  other indexes use RID lists (a keyword or value approach).

14 Current practice: structure data into horizontal records and process vertically (scans). Predicate-tree technology instead vertically projects each attribute, then vertically projects each bit position of each attribute, then compresses each bit slice into a basic Ptree.

R(A1 A2 A3 A4) in base 10 and base 2 (each attribute 3 bits wide; the tuple (7,0,1,4) occurs twice):
2 7 6 1 = 010 111 110 001
3 7 6 0 = 011 111 110 000
2 7 5 1 = 010 110 101 001
2 7 5 7 = 010 111 101 111
5 2 1 4 = 101 010 001 100
2 2 1 5 = 010 010 001 101
7 0 1 4 = 111 000 001 100
7 0 1 4 = 111 000 001 100

Vertical projection gives 12 bit slices R11, R12, R13, R21, ..., R43, one per attribute bit position; compressing each slice yields the basic Ptrees P11, P12, ..., P43.

Top-down construction of the 1-dimensional Ptree representation of R11 (= 00001011), denoted P11, records the truth of the universal predicate "pure 1" in a tree, recursing on halves (1/2^1 subsets) until purity is achieved:
1. Whole slice pure1? false → 0
2. Left half pure1? false → 0 (but it is pure, namely pure0, so this branch ends)
3. Right half pure1? false → 0
4. Left half of the right half pure1? false → 0
5. Right half of the right half pure1? true → 1
6. Left half of the left-of-right pure1? true → 1
7. Right half of the left-of-right pure1? false → 0
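A minimal Python sketch of this top-down construction (not the NDSU implementation; it assumes R11 = 00001011, i.e., the 8-row table above with (7,0,1,4) occurring twice, consistent with the root count of 2 on the next slide):

    def p1_tree(bits):
        """Pure1-tree of a bit list: 1 = pure-1 segment, 0 = pure-0 segment,
        otherwise a (0, left, right) node, recursing on halves as above."""
        if all(bits):
            return 1
        if not any(bits):
            return 0                     # pure0: this branch ends
        half = len(bits) // 2
        return (0, p1_tree(bits[:half]), p1_tree(bits[half:]))

    print(p1_tree([0, 0, 0, 0, 1, 0, 1, 1]))   # -> (0, 0, (0, (0, 1, 0), 1))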

15 To count occurrences of the tuple (7,0,1,4) = (111, 000, 001, 100), AND the basic Ptrees, using the complement P' wherever the target bit is 0:

P11 ∧ P12 ∧ P13 ∧ P'21 ∧ P'22 ∧ P'23 ∧ P'31 ∧ P'32 ∧ P33 ∧ P41 ∧ P'42 ∧ P'43

The AND can short-circuit: a 0 in one operand makes the entire corresponding branch 0, and pure nodes combine without descending further. In the resulting Ptree, the 2^1 level has the only 1-bit, so the 1-count (root count) = 1 * 2^1 = 2: the tuple (7,0,1,4) occurs twice in R.
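The counting itself can be sketched with Python ints as bit columns (bit r of each int holds row r); this is an illustrative flat-bitmap version of the Ptree AND, with the 8th row assumed to duplicate (7,0,1,4) as the root count of 2 implies:

    rows = ["010111110001", "011111110000", "010110101001", "010111101111",
            "101010001100", "010010001101", "111000001100", "111000001100"]
    n = len(rows)
    full = (1 << n) - 1
    # one int per bit position, across all 12 attribute bits
    slices = [sum((int(r[b]) << i) for i, r in enumerate(rows)) for b in range(12)]

    def tuple_count(target):
        """AND the slices, complementing wherever the target bit is 0."""
        acc = full
        for col, want in zip(slices, target):
            acc &= col if want == "1" else (full & ~col)
        return bin(acc).count("1")       # the root count

    print(tuple_count("111000001100"))   # -> 2, i.e., (7,0,1,4) occurs twice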

16 Top-down construction of basic P-trees is best for understanding, but bottom-up is much more efficient. Bottom-up construction of the 1-dimensional P11 uses an in-order tree traversal and the collapsing of pure siblings: as each pair of sibling segments is completed, two pure siblings of the same value collapse into a single node at the parent level, so the tree is built in one pass over the bit slice R11.

17 2-Dimensional P-trees: the natural choice for, e.g., image files. For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling "Peano" ordering has advantages for fast processing, yet compresses well in the presence of spatial continuity.  For an image bit-file (e.g., the high-order bit of the red band of an image file) given in spatial raster order as 1111110011111000111111001111111011110000111100001111000001110000, top-down construction of its 2-dimensional P-tree records the truth of the universal predicate "pure 1" in a fanout=4 tree, recursing on quarters until purity is achieved.
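The Peano ordering the slide favors is the Z-order (Morton) curve; a small sketch of computing it by bit interleaving (illustrative, not from the slides):

    def peano_index(row, col, bits):
        """Interleave the bits of (row, col) into a single traversal index."""
        z = 0
        for b in range(bits):
            z |= ((row >> b) & 1) << (2 * b + 1)   # row bit -> odd position
            z |= ((col >> b) & 1) << (2 * b)       # col bit -> even position
        return z

    # Sorting pixels by this index visits each quadrant completely before the
    # next: exactly the recursive quartering a 2-D P-tree compresses.
    pixels = sorted(((r, c) for r in range(4) for c in range(4)),
                    key=lambda p: peano_index(p[0], p[1], bits=2))
    print(pixels[:4])   # -> [(0, 0), (0, 1), (1, 0), (1, 1)]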

18 Bottom-up construction of the 2-dimensional P-tree uses an in-order traversal of a fanout=4 tree with log4(64) = 3 levels below the root (4 levels in all) and the collapsing of pure siblings. From here on we take 4 bit positions at a time, for efficiency: each completed group of 4 siblings that is pure collapses into a single node one level up.

19 Some aspects of 2-D P-trees:  Fan-out = 2^dimension = 2^2 = 4.  Tree levels (going down): 3, 2, 1, 0, with purity factors 4^3, 4^2, 4^1, 4^0 respectively.  Node ID (NID): e.g., node 2.2.3; a pixel such as (7,1) = (111, 001) gets its NID by bit-interleaving: 10.10.11.  ROOT COUNT = sum over levels of (level 1-count * level purity factor). In the example: root count = 7*4^0 + 4*4^1 + 2*4^2 = 55.
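The root-count formula is easy to check in a couple of lines (values taken from the slide's example tree):

    level_ones = {0: 7, 1: 4, 2: 2}            # pure-1 nodes found per level
    print(sum(n * 4**lvl for lvl, n in level_ones.items()))   # -> 55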

20 3-Dimensional Ptrees

21 How would a CEASR bio-agent detector work?  Suppose a biological agent is sensed by nano-sensors at one position in the situation space. All other positions contain a 0-bit, i.e., the level of bio-agent detected by the nano-sensors in each of the other 63 cells (voxels) is below the danger threshold.  ONE tiny, compressed P-tree can completely represent this "bio-situation". It is constructed (bottom up) as a fanout=8, 3-level P-tree. We capture the data in the 1st octant (forward-upper-left), then move to the next octant (forward-upper-right), and so on. We can save time by noting that all the remaining 56 cells (in the 7 other octants) contain all 0s: each of those octants produces eight 0s at the leaf level (8 pure-0 siblings), each of which collapses to a 0 at level-1. So we proceed an octant at a time rather than a cell at a time.  This entire situation can be transmitted to a personal display unit as merely two bytes of data plus their two NIDs. For the NID, use [level, global_level_offset] rather than [local_segment_offset, ..., local_segment_offset]. Assume every node not sent is all 0s, and that in any 13-bit node segment sent (only "mixed" segments need be sent), the 1st field is the level (here 2 bits suffice), the next 3 bits give the global_level_offset within that level (i.e., 0..7), and the final 8 bits are the node's data. Then the complete situation can be transmitted as these 13 bits: 01 000 0000 0001.  If there are (2^n)^3 cells (n=2 above), such a "situation" takes only log2(n) level bits, 3n-3 offset bits, and 8 data bits; e.g., even with (2^8)^3 = 2^24 ~ 16,000,000 voxels, transmit merely 3+21+8 = 32 bits.
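A sketch of the 13-bit node message described above (assuming the packing order level | offset | data):

    def pack_node(level, offset, data):
        """2 level bits, 3 global_level_offset bits, 8 node-data bits."""
        return (level << 11) | (offset << 8) | data

    # the slide's example: level 01, offset 000, data 0000 0001
    print(format(pack_node(0b01, 0b000, 0b00000001), "013b"))  # 0100000000001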

22 Ptree dimension is a user parameter and can be chosen to fit the data. Default = 1-D Ptrees (recursive halving); images → 2-D Ptrees (recursive quartering); 3-D solids → 3-D Ptrees (recursive eighth-ing). Or the dimension can be chosen based on other considerations (to optimize compression, increase processing speed, ...).  Logical operations on Ptrees are used to get counts of any pattern. Ptree AND is faster than bit-by-bit AND since any pure0 operand node means the result node is pure0; e.g., in the slide's example only quadrant 2 need be loaded to AND Ptree1 and Ptree2. The more operands there are in the AND, the greater the benefit of this shortcut (more pure0 nodes).  Using logical operators on the basic P-trees (predicate = the universal predicate "purely 1-bits"), one can construct, for any domain: constant-P-trees (predicate: "value = const"), range-P-trees (predicate: "value ∈ range"), and interval-P-trees (predicate: "value ∈ interval"). In fact, there is a domain P-tree for every predicate defined on the domain. ANDing domain-predicate P-trees gives tuple-predicate P-trees, e.g., a rectangle-P-tree (predicate: tuple ∈ rectangle). The next slide shows some of these constructions.
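A hedged sketch of the pure0 shortcut on node-structured trees (nodes here are 0, 1, or a tuple of children; a simplification of whatever node layout the real system uses):

    def ptree_and(a, b):
        if a == 0 or b == 0:
            return 0          # shortcut: never load or descend the other subtree
        if a == 1:
            return b          # pure1 is the identity for AND
        if b == 1:
            return a
        kids = tuple(ptree_and(x, y) for x, y in zip(a, b))
        if all(k == 0 for k in kids):
            return 0          # collapse to a pure0 node
        if all(k == 1 for k in kids):
            return 1          # collapse to a pure1 node
        return kids

    p1 = (1, 0, (1, 0, 0, 1), 1)        # fanout-4 example operands
    p2 = (1, 1, 0, (0, 1, 1, 1))
    print(ptree_and(p1, p2))            # -> (1, 0, 0, (0, 1, 1, 1))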

23 Basic, Value and Tuple Ptrees  Basic Ptrees, for a 7-column, 8-bit table: P11, P12, ..., P18, P21, ..., P28, ..., P71, ..., P78 (one per target attribute and target bit position).  Value Ptrees (predicate: quad is purely the target value in the target attribute), built by ANDing basic Ptrees, complemented where the value bit is 0: e.g., P1,5 = P1,101 = P11 AND P12' AND P13.  Tuple Ptrees (predicate: quad is purely the target tuple), built by ANDing value Ptrees: e.g., P(1,2,3) = P(001,010,111) = P1,001 AND P2,010 AND P3,111.  Rectangle Ptrees (predicate: quad is purely in the target rectangle, a product of intervals), built with AND/OR: e.g., P([1,3], ·, [0,2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2).

24 Horizontal Processing of Vertical Structures for Record-based Workloads  For record-based workloads (where the result is a set of records), changing the horizontal record structure into vertical slices and then having to reconstruct the records may introduce too much post-processing.  For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, where there is no reconstructive post-processing.

25 But even for some standard SQL queries, vertical data may be faster (evaluating when this is true would be an excellent research project).  For example, consider the SQL query:  SELECT COUNT(*) FROM purchases WHERE price >= $4,000.00 AND 500 <= sales <= 1000.  The answer is the root count of the P-tree resulting from ANDing the price-interval-P-tree, P_price∈[4000,∞), and the sales-interval-P-tree, P_sales∈[500,1000].
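A flat-bitmap sketch of that evaluation (the column data is invented for illustration; a real system would build these interval masks by OR/ANDing basic bit-slice Ptrees rather than by scanning values):

    def range_mask(values, lo, hi):
        """Bitmap of rows with lo <= v <= hi: the interval-Ptree analogue."""
        m = 0
        for i, v in enumerate(values):
            if lo <= v <= hi:
                m |= 1 << i
        return m

    price = [5200, 3999, 4000, 7500, 4100]
    sales = [600, 800, 499, 1000, 200]
    mask = range_mask(price, 4000, float("inf")) & range_mask(sales, 500, 1000)
    print(bin(mask).count("1"))   # root count = 2 (rows 0 and 3 qualify)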

26 Architecture for the DataMIME™ System (DataMIME™ = data mining, NO NOISE; PDMS = P-tree Data Mining System)  Over the Internet, YOUR DATA passes through the DII (Data Integration Interface, with its Data Integration Language, DIL) into the Data Repository: a lossless, compressed, distributed, vertically-structured database. YOUR DATA MINING requests go through the DMI (Data Mining Interface) using the Ptree (Predicates) Query Language, PQL.

27 Generalized Raster and Peano Sorting: generalizes to any table with numeric attributes (not just images), starting from the unsorted relation (decimal, then binary).  Raster sorting: attributes 1st, bit position 2nd.  Peano sorting: bit position 1st, attributes 2nd.
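A sketch of the two sort keys (illustrative; attribute widths assumed equal):

    def peano_key(tup, bits=8):
        """Bit position 1st, attributes 2nd: interleave across attributes."""
        key = 0
        for b in range(bits - 1, -1, -1):        # high-order bits first
            for v in tup:
                key = (key << 1) | ((v >> b) & 1)
        return key

    # Raster sorting (attributes 1st, bit position 2nd) is just lexicographic
    # order on the tuples; Peano sorting instead groups rows that agree on the
    # high-order bits of *all* attributes, which helps the Ptrees compress.
    rows = [(3, 12), (2, 200), (3, 10), (130, 1)]
    print(sorted(rows, key=peano_key))  # -> [(3, 10), (3, 12), (2, 200), (130, 1)]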

28 Generalized Peano Sorting: KNN speed improvement (using 5 UCI Machine Learning Repository data sets: adult, spam, mushroom, function, crop). [Chart: time in seconds, 0 to 120, for Unsorted vs. Generalized Raster vs. Generalized Peano on each data set.]

29 Astronomy Application (National Virtual Observatory data)  What Ptree dimension and what ordering should be used for astronomical data, where all bodies are assumed to lie on the surface of the celestial sphere (which shares its equatorial plane with earth and has no specified radius)?  Candidates: the Hierarchical Triangle Mesh tree (HTM-tree, which seems to be the accepted standard), the Peano Triangle Mesh tree (PTM-tree), and the Peano Celestial Coordinate tree (RA = Right Ascension, the longitudinal angle; dec = declination, the latitude angle).  PTM is similar to the HTM used in the Sloan Digital Sky Survey project. In both:  the sphere is divided into triangles;  triangle sides are always great-circle segments.  PTM differs from HTM in the way the triangles are ordered.

30 The difference between HTM-trees and PTM-trees is in the ordering of the sub-triangles. [Figure: the same triangle subdivided two levels deep, labeled 1; 1,0..1,3; 1,1,0..1,1,3 and 1,3,0..1,3,3, shown once in HTM order and once in PTM-tree order.] Why use a different ordering? (next slide)

31 PTM Triangulation of the Celestial Sphere  The following ordering produces a sphere-surface-filling curve with good continuity characteristics. At each level, start with an equilateral triangle (a 90° sector) bounded by longitudinal and equatorial line segments and alternate: left turn, right turn, left turn, right turn, ... Traverse the next level of triangulation the same way, again alternating left-turn, right-turn, left-turn, right-turn, ... Traverse the southern hemisphere in the reverse direction (the identical pattern pushed down instead of pulled up), arriving at the southern neighbor of the start point.

32 PTM-triangulation at the next level: within each triangle of the previous level, the traversal again alternates left-right (LRLR) and right-left (RLRL), so the pattern repeats at every level of the recursion.

33 Peano Celestial Coordinates  Unlike PTM-trees, which initially partition the sphere into the 8 faces of an octahedron, the PCC-tree scheme transforms the sphere to a cylinder, then to a rectangle (a plane running 0° to 360° horizontally and -90° to 90° vertically), and then uses standard Peano ordering on the Celestial Coordinates: Right Ascension (RA) runs from 0° to 360° and declination (dec) runs from -90° to 90°.

34 PUBLIC (Ptree Unified BioLogical InformatiCs Data Cube and Dimension Tables)  A Gene-Experiment-Organism cube (a cell holds 1 iff that gene from that organism expresses at a threshold level in that experiment): a many-to-many-to-many relationship among genes g0..g3, experiments e0..e3 and organisms o0..o3.  Surrounding dimension tables:

Organism dimension table (Organism, Species, Vert, Genome size in million bp): human, Homo sapiens, 1, 3000; mouse, Mus musculus, 1, 3000; fly, Drosophila melanogaster, 0, 185; yeast, Saccharomyces cerevisiae, 0, 12.1.
Gene dimension table: SubCell-Location (Nucl, Ribo, Myta, ...), Function (apop, mito, meio, ...), StopCodonDensity (e.g., .1, .9), PolyA-Tail (0/1).
Experiment dimension table (MIAME fields): LAB, PI, UNV, STR, CTY, STZ, ED, NM, ...
Gene-Organism dimension table (chromosome, length): e.g., 17,78; 12,60; Mi,40; 1,48; ...

35 Protein-Protein Interaction Pyramid and the Boolean Gene Dimension Table  The original gene dimension table (SubCell-Location, Function, StopCodonDensity, PolyA-Tail for genes g0..g3) is recoded as a Boolean (binary) gene dimension table with one bit column per category value (Poly-A, SCD1..SCD4, Mito, Meio, apop, Nucl, Ribo, Myta), so each gene row becomes a bit vector ready for vertical (Ptree) processing; a protein-protein interaction pyramid over g0..g3 is kept in the same bit form.

36 Association for Computing Machinery KDD-Cup-02: NDSU Team

37 Network Security Application (network security through vertically structured data)  Network layers do their own partitioning: packets, frames, etc. (usually independent of any intrinsic data structuring, e.g., record structure); fragmentation/reassembly, segmentation/reassembly.  Data privacy is compromised when the horizontal (stream) message content is eavesdropped upon at the reassembled level in the network.  A standard solution is to host-encrypt the horizontal structure so that any network-reassembled message is meaningless.  Alternative: vertically structure (decompose, partition) the data (e.g., into basic Ptrees). Send one Ptree per packet; send intra-message packets separately. This tricks flow classifiers into thinking the multiple packets associated with a particular message are unrelated. The message is only meaningful after destination demux-ing.  Note: the only basic Ptree that holds actual information by itself is the high-order-bit Ptree. Therefore encrypt it!  There is a whole passel of killer ideas associated with using vertically structured data within network transmission units, e.g., active networking (AND basic Ptrees, or just certain levels of them, at active network nodes?).

38 Network Security Application, cont.  Vertically structure (decompose, partition) the data (e.g., into basic Ptrees), sending one P-tree (vertical bit slice) per packet. Send the basic P-tree slices (for a given attribute) one at a time, starting with the low-order bit slice.  Encrypt each slice using some agreed-upon algorithm and key (requires key distribution).  Then steganographically embed the identity of the crypto algorithm and the key structure for the next-higher-order bit into that Ptree (as the carrier message).  Continue for each higher-order bit until you reach the highest-order bit. Until the message arrives, and unless each crypto layer has been broken (in time to apply it to the next level), the message is undecipherable.

39 Nearest Neighbor Classification (AKA regression, case-based reasoning) is the most common method of data mining.  Given a table R(A1,...,An,C), where C is chosen as the class attribute of interest (e.g., in homeland security data mining, C={terrorist, non-terrorist}; in precision agriculture, C={low_yield, medium_yield, hi_yield}; in network flow classification, C={flow-1, flow-2, ...}; in network virus classification, C={DoS attack, SYN-flooding attack, ...}; in cancer research, C={cancerous, non-cancerous}) and the Ai's are the feature attributes on which the classification decision is based, NN classification amounts to using a training dataset (such as R above) and a distance or similarity function on the attributes A1,...,An to decide the best prediction of class label for a new tuple a=(a1,...,an) which does not have a known class, by letting those training rows closest to a vote (this assumes the class labels are continuous in A1,...,An).  That is, for the homeland security application, the NN classifier will classify an unknown individual, I, as a potential terrorist iff the known individuals who have characteristics (history, nationality, ...) close to I's are predominantly terrorists. In precision agriculture, using last year's data as training data, we take an aerial photo of a field mid-growing-season; for a given point, p, in the field, we find the points in last year's data that are the nearest match on colors (e.g., R,G,B, NIR, ...) and let them vote as to the probable yield to be expected at p this year. In network flow classification, we examine header fields to find near matches with packets that have been assigned to a given flow in the past. In the virus classification application, if a message or flow has nearly the same characteristics as previously identified attacks of a particular class, we reject the message or flow. In the cancer research application, we judge a gene to be cancer-causing if nearly the same expression patterns have been exhibited predominantly by known cancer-causing genes in past studies.
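A minimal horizontal-scan sketch of this vote (Hamming distance over binary feature attributes; the data layout is assumed, matching the worked example on the next slides):

    from collections import Counter

    def hamming(u, v):
        return sum(x != y for x, y in zip(u, v))

    def nn_classify(train, sample, k=3):
        """train: list of (feature_tuple, class_label); returns the majority
        class among the k nearest rows (ties broken arbitrarily)."""
        ranked = sorted(train, key=lambda rc: hamming(rc[0], sample))
        votes = Counter(label for _, label in ranked[:k])
        return votes.most_common(1)[0][0]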

40 Suppose we consider a10 as the class attribute, C, and suppose we know that only a5, a6, a11, a12, a13, a14 are relevant to this classification. An unclassified sample with (a5 a6 a11 a12 a13 a14) = (0 0 0 0 0 0) has to be classified (we need a prediction of its most likely a10-value). In 3-Nearest-Neighbor (3NN) classification, we look for the 3 nearest rows in the training table below, count the occurrences of each a10-value among them, and let the predominant class be the prediction.

Key a1 a2 a3 a4 a5 a6 a7 a8 a9 a10=C a11 a12 a13 a14 a15 a16 a17 a18 a19 a20
t12 1 0 1 0 0 0 1 1 0 1 0 1 1 0 1 1 0 0 0 1
t13 1 0 1 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 1 1
t15 1 0 1 0 0 0 1 1 0 1 0 1 0 1 0 0 1 1 0 0
t16 1 0 1 0 0 0 1 1 0 1 1 0 1 0 1 0 0 0 1 0
t21 0 1 1 0 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1
t27 0 1 1 0 1 1 0 0 0 1 0 0 1 1 0 0 1 1 0 0
t31 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 1 1 0 1
t32 0 1 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 0 1
t33 0 1 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 1 1
t35 0 1 0 0 1 0 0 0 1 1 0 1 0 1 0 0 1 1 0 0
t51 0 1 0 1 0 0 1 1 0 0 1 0 1 0 0 0 1 1 0 1
t53 0 1 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1
t55 0 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0
t57 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 1 1 0 0
t61 1 0 1 0 1 0 0 0 1 0 1 0 1 0 0 0 1 1 0 1
t72 0 0 1 1 0 0 1 1 0 0 0 1 1 0 1 1 0 0 0 1
t75 0 0 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 0 0

41 For the unclassified sample, a, with (a5 a6 a11 a12 a13 a14) = (0 0 0 0 0 0), we scan the table once for the 3 nearest neighbors, keeping a running list of the 3 closest so far (distance = number of differing bits over the six relevant attributes):

t12 d=2, t13 d=1, t15 d=2 (the initial list); t16 d=2, don't replace; t21 d=4, don't replace; t27 d=4, don't replace; t31 d=3, don't replace; t32 d=3, don't replace; t33 d=2, don't replace; t35 d=3, don't replace; t51 d=2, don't replace; t53 d=1, replaces t15; t55 d=2, don't replace; t57 d=2, don't replace; t61 d=3, don't replace; t72 d=2, don't replace; t75 d=2, don't replace.

The 3 nearest neighbors after the scan: t12 (d=2, C=1), t13 (d=1, C=1), t53 (d=1, C=0). C=1 wins!  Note, however, that only 1 of the many training tuples at distance 2 from the sample got to vote. We didn't know that distance=2 was going to be the vote cutoff until the end of the 1st scan. Finding the other distance=2 voters (the Closed 3NN set, or C3NN) requires another scan.

42 2nd scan, to find the Closed 3NN set (C3NN set) for the unclassified sample (0 0 0 0 0 0): include every tuple at distance ≤ 2. Does it change the vote?

3NN set after the 1st scan: t12 (d=2, C=1), t13 (d=1, C=1), t53 (d=1, C=0); vote after the 1st scan: C=1.

2nd scan: t12 d=2, already have; t13 d=1, already have; t15 d=2, include it also; t16 d=2, include it also; t21 d=4, don't include; t27 d=4, don't include; t31 d=3, don't include; t32 d=3, don't include; t33 d=2, include it also; t35 d=3, don't include; t51 d=2, include it also; t53 d=1, already have; t55 d=2, include it also; t57 d=2, include it also; t61 d=3, don't include; t72 d=2, include it also; t75 d=2, include it also.

Does the closed vote change the outcome? YES! (C=1 gets 5 votes, C=0 gets 6.)

43 Find the Closed 3NN training set (C3NN set) using Ptrees. (On the slide, black denotes an attribute complement, rather than ', and red means uncomplemented; the vertical bit columns for a1..a20, C and C' are shown.)  Construct the tuple Ptree Ps for the sample s = (0 0 0 0 0 0) over (a5, a6, a11, a12, a13, a14) by ANDing the columns for those six attributes (all complemented, since every sample bit is 0), then AND with PC and PC' to split any matches by class. Let all training points in D(s,0) (the disk about sample s of radius 0) vote 1st; if there are ≥ 3 of them, we are done, else go on to D(s,1), etc.  Here Ps is all zeros, so D(s,0) is empty: proceed to S(s,1), the sphere of radius 1 about s.

44 S(s,1): construct the Ptree P_S(s,1) = OR over i ∈ {5,6,11,12,13,14} of P_i, where P_i = P_{|s_i - t_i| = 1; |s_j - t_j| = 0, j ≠ i} = P_{S(s_i,1) ∩ S(s_j,0), j ≠ i}. Since every sample bit is 0, P_i ANDs the uncomplemented column for a_i (the row must differ there) with the complemented columns for the other five attributes; ORing the six results gives P_D(s,1), which marks exactly the rows at distance 1 from s (t13 and t53).

45 D(s,2): construct the Ptree P_D(s,2) = OR of all double-dimension interval Ptrees: P_D(s,2) = OR over pairs i,j ∈ {5,6,11,12,13,14} of P_i,j, where P_i,j = P_{S(s_i,1) ∩ S(s_j,1) ∩ S(s_k,0), k ∉ {i,j}}. That is, AND the uncomplemented columns for a_i and a_j with the complemented columns for the remaining four attributes, then OR the 15 results (P_5,6, P_5,11, ..., P_13,14).  Partway through we already have 3 nearest neighbors and could quit and declare C=1 the winner; but once all pairs are in we have the closed 3-neighborhood, and we declare C=0 the winner!
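A flat-bitmap sketch of this ring construction (bit r of each mask = row r; a simplification of the Ptree AND/OR, and only for binary attributes):

    def ring_masks(cols, sample, full):
        """cols[i]: bit column of relevant attribute i; sample[i]: sample bit;
        full: all-rows mask. Returns (distance-1 mask, distance-2 mask)."""
        def agree(i):              # rows agreeing with the sample in attr i
            return cols[i] if sample[i] else (full & ~cols[i])
        k = len(cols)
        d1 = d2 = 0
        for i in range(k):         # differ in attr i, agree everywhere else
            m = full & ~agree(i)
            for j in range(k):
                if j != i:
                    m &= agree(j)
            d1 |= m
        for i in range(k):         # differ in attrs i and j, agree elsewhere
            for j in range(i + 1, k):
                m = (full & ~agree(i)) & (full & ~agree(j))
                for h in range(k):
                    if h != i and h != j:
                        m &= agree(h)
                d2 |= m
        return d1, d2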

46 Justification for using vertical structures (once again):  For record-based workloads (where the result is a set of records), changing the horizontal record structure into vertical slices and then having to reconstruct the records may introduce too much post-processing.  For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result (e.g., a histogram), where there is no reconstructive post-processing and the actual data records need never be involved.

47 Appendix: Run Lists: another way to handle vertical data. These generalize Ptrees using standard run-length compression of the vertical bit files (alternatively, using Lempel-Ziv, Golomb, or other codes). Run Lists record the purity type and start offset of each pure run. E.g., for R11 = 00001011:

1. 1st run is pure0 → 0:000
2. 2nd run is pure1 → 1:100
3. 3rd run is pure0 → 0:101
4. 4th run is pure1 → 1:110

So RL11 = 0:000 1:100 0:101 1:110 (to complement a run list, flip the purity bits). E.g., to count 111 000 001 100 tuples, use "pure111000001100": RL11 ∧ RL12 ∧ RL13 ∧ RL'21 ∧ RL'22 ∧ RL'23 ∧ RL'31 ∧ RL'32 ∧ RL33 ∧ RL41 ∧ RL'42 ∧ RL'43.
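A sketch of the run-list encoder (offsets printed in decimal; the slide writes them in binary):

    def run_list(bits):
        """[(purity_bit, start_offset), ...] for each maximal run."""
        runs, start = [], 0
        for i in range(1, len(bits) + 1):
            if i == len(bits) or bits[i] != bits[start]:
                runs.append((bits[start], start))
                start = i
        return runs

    print(run_list([0, 0, 0, 0, 1, 0, 1, 1]))
    # -> [(0, 0), (1, 4), (0, 5), (1, 6)], i.e., RL11 = 0:000 1:100 0:101 1:110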

48 RunListTrees (RLtrees)  To facilitate subsetting (isolating a subset) and processing, a Ptree structure can be constructed over a run list such as RL11 = 0:000 1:100 0:101 1:110 (R11 = 00001011), asking "pure1?" recursively on halves:

1. Whole file pure1? false → 0
2. 1st half pure1? false → 0
3. 2nd half pure1? false → 0
4. 1st half of the 2nd half pure1? false → 0
5. 2nd half of the 2nd half pure1? true → 1
6. 1st half of the 1st-of-2nd pure1? true → 1
7. 2nd half of the 1st-of-2nd pure1? false → 0

Or build a separate NotPure0 index tree (either tree could be terminated at any level), asking "not pure0?" instead:

1. Whole file not-pure0? true → 1
2. 1st half? false → 0
3. 2nd half? true → 1
4. 1st half of the 2nd half? true → 1
5. 2nd half of the 2nd half? true → 1
6. 1st half of the 1st-of-2nd? true → 1
7. 2nd half of the 1st-of-2nd? false → 0

To AND, first AND the NP0 trees: only the 1-branches of the result need ANDing via list scans. The more operands there are, the fewer 1-branches.

49 Other Indexes on RunLists  We could put Pure0-run, Pure1-run and even Mixed-run (or LZV-run) RunList indexes on RL. E.g., for R11 = 0000101101010101:  Pure1-run index P1RI11 (start:length): 100:1, 110:2.  Pure0-run index P0RI11 (start:length): 000:4, 101:1.  Pattern index PLZVRI11 (start offset, then the pattern and the number of its consecutive replicas), covering the trailing alternating section 01010101 starting at 1000.

50 Best Pure1-tree Implementation? My guess, circa 04 Jan.

For n-D Pure1 trees, at any node:
1. If the number of 1-bits in the tuple set represented by the node is below a lower threshold, LT, then that node simply holds the 1List, the list of 1-bit positions (use a 0-bit if the count is 0), and has no children.
2. Else, if the tuple set represented by that node is below an upper threshold, UT = 2^n * m, leave the bit slice uncompressed (a P-sequence).

Building such Ptrees bottom up, using in-order ordering:
1. If the 1-count of the next UT-segment ≥ LT, install a P-sequence, else install a 1List.
2. If the current UT-segment node is numbered k*(2^n - 1), it and all 2^n - 1 predecessors are 1Lists, and the cardinality of the union of those 1Lists < LT, install the union in the parent node. Recurse this collapsing process upward to the root.

Building such Ptrees top down:
1. For datasets larger than UT, recursively build down the pure1 tree.
2. If a node ever has fewer than LT 1-bits, install the 1List and terminate that branch.
3. At the level where the represented tuple set = UT, install a 1List if the number of 1-bits < LT, else install a P-sequence.

Notes:
1. This method should extend well to data streams. When the data stream exceeds the current allocation (which, for n-D Ptrees, will be a power of 2^n), just grow the root of each Ptree to a new level, using the current Ptree as node 0. Nodes 1, 2, 3, ..., 2^n - 1 of the new Ptree start as 0-nodes without children and grow as 1Lists until LT is reached, when they are converted to P-sequences.
2. Some additional MS-Word notes on building P-trees can be found at http://www.cs.ndsu.nodak.edu/~perrizo/classes/765/buildingptree.doc
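A sketch of the hybrid leaf rule above (LT is a tuning parameter, not a value the slide fixes):

    LT = 4

    def leaf_node(bits):
        """Sparse segments become 1Lists; dense segments stay P-sequences."""
        ones = [i for i, b in enumerate(bits) if b]
        if len(ones) < LT:
            return ("1list", ones)      # list the 1-bit positions
        return ("pseq", bits)           # keep the raw bit segment

    print(leaf_node([0, 0, 1, 0, 0, 0, 0, 1]))   # -> ('1list', [2, 7])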

51 Ptrees  Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing. The jury is still out on parallelization: vertical (by relation), horizontal (by tree node), or some combination? Horizontal parallelization is pretty, but the network multicast overhead is huge. Use active networking? Clusters of Playstations? ...

Formally, P-trees can be defined as any of the following:
Partition-tree: a tree of nested partitions: a partition P(R) = {C1..Cn}; each component is partitioned by P(Ci) = {Ci,1..Ci,ni}, i=1..n; each of those components is partitioned by P(Ci,j) = {Ci,j,1..Ci,j,nij}; and so on.
Predicate-tree: for a predicate on the leaf nodes of a partition-tree (which also induces predicates on interior nodes via quantifiers). Predicate-tree nodes can be truth values (Boolean P-tree); the predicate can be quantified existentially (1 or a threshold %) or universally. Predicate-tree nodes can instead count the number of true leaf children of that component (Count P-tree).
Purity-tree: a universally quantified Boolean predicate-tree (e.g., if the predicate is "pure1", a Pure1-tree or P1tree): a 1-bit at a node iff the corresponding component is pure1 (universally quantified). There are many other useful predicates, e.g., NonPure0-trees, but we focus on P1trees.

All Ptrees shown so far were 1-dimensional (recursively partitioning by halving bit files), but they can be 2-D (recursive quartering, e.g., for 2-D images), 3-D (recursive eighth-ing), ..., or based on purity runs or LZW runs or ...

Further observations about P-trees:
Partition-trees have set nodes.
Predicate-trees have either Boolean nodes (Boolean P-tree) or count nodes (Count P-tree).
Purity-trees, being universally quantified Boolean predicate-trees, have Boolean nodes (since the count is always the "full" count of leaves, expressing purity-trees as count-trees is redundant).
A partition-tree can be sliced at a given level if each partition at that level is labeled with the very same label set (e.g., the month partition of years). A partition-tree can be generalized to a set-graph when the siblings of a node do not form a partition.

