

Our Approach

Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees (Ptrees in either case) [1], processed horizontally. (DBMSs process horizontal data vertically.)

Ptrees are data-mining-ready, compressed data structures that attempt to address the curse of scalability and the curse of dimensionality.

[1] Ptree technology is patent pending by North Dakota State University.

Current practice: structure data into horizontal records and process them vertically (vertical scans).

Predicate tree technology: vertically project each attribute of R(A1 A2 A3 A4) into R[A1], R[A2], R[A3], R[A4]; then vertically project each bit position of each attribute into the bit slices R11 R12 R13, R21 R22 R23, R31 R32 R33, R41 R42 R43; then compress each bit slice into a basic Ptree. The basic Ptrees are then ANDed horizontally.

The compressed 1-dimensional Ptree version of R11, denoted P11, is built by recording the truth of the predicate "pure1" in a tree, recursively on halves, until purity is achieved. E.g., compression of R11 into P11 goes as follows (pure1? false = 0, true = 1):

1. Whole is pure1? false → 0
2. Left half pure1? false → 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false → 0
4. Left half of right half pure1? false → 0
5. Right half of right half pure1? true → 1. And it is pure, so this branch ends.
6. Left half of left half of right half pure1? true → 1
7. Right half of left half of right half pure1? false → 0

To count occurrences of the tuple (7, 0, 1, 4), AND the twelve basic Ptrees, complementing (') each one whose bit in 111 000 001 100 is 0:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

[Figure: the R11 bit slice, the operand Ptrees P11 ... P43, and the level-by-level result of the AND; the bit values did not survive extraction.]
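The halving and ANDing described on this slide can be sketched in a few lines of Python. This is only an illustration, not the patented implementation: the tuple representation `(bit, children)`, the function names, the sample bit slice 00001011 (chosen because its tree matches the seven answers traced above), and the sample column of 3-bit values are all assumptions made for the example.

```python
def build_p1(bits):
    """Pure1 tree as (bit, children); children is None when the segment is pure."""
    if all(b == 1 for b in bits):
        return (1, None)                      # pure1: branch ends
    if all(b == 0 for b in bits):
        return (0, None)                      # pure0: branch ends
    mid = len(bits) // 2                      # mixed: record 0 and recurse on halves
    return (0, (build_p1(bits[:mid]), build_p1(bits[mid:])))

def p_not(tree):
    """Complement (the P' of the slide): flip pure leaves, keep mixed nodes mixed."""
    bit, kids = tree
    if kids is None:
        return (1 - bit, None)
    return (0, (p_not(kids[0]), p_not(kids[1])))

def p_and(a, b):
    """AND two trees built over bit slices of the same length."""
    if a[1] is None:                          # a is pure
        return b if a[0] == 1 else (0, None)
    if b[1] is None:                          # b is pure
        return a if b[0] == 1 else (0, None)
    left = p_and(a[1][0], b[1][0])
    right = p_and(a[1][1], b[1][1])
    if left == right and left[1] is None:     # both halves pure and equal: collapse
        return left
    return (0, (left, right))

def count_ones(tree, length):
    """Number of 1-bits represented by a tree over `length` positions."""
    bit, kids = tree
    if kids is None:
        return bit * length
    mid = length // 2
    return count_ones(kids[0], mid) + count_ones(kids[1], length - mid)

def count_value(values, target, nbits=3):
    """Count rows equal to `target` by ANDing one (possibly complemented) tree per bit."""
    result = (1, None)
    for shift in range(nbits - 1, -1, -1):    # high-order bit first
        tree = build_p1([(v >> shift) & 1 for v in values])
        if not (target >> shift) & 1:
            tree = p_not(tree)                # target bit is 0: use the complement
        result = p_and(result, tree)
    return count_ones(result, len(values))

r11 = [0, 0, 0, 0, 1, 0, 1, 1]                # assumed example bit slice
p11 = build_p1(r11)
column = [2, 7, 7, 5, 7, 0, 3, 7]             # assumed 3-bit attribute values
print(p11)
print(count_value(column, 7))                 # 4 rows hold the value 7
```

The count never touches the original rows: only the per-bit trees are combined, and pure0 branches cut the recursion short.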

RunListTrees? (RLtrees)

To facilitate subsetting (isolating a subset) and processing, a Ptree structure can be constructed over the run list as follows:

R11 → RL11 = 0:000 1:100 0:101 1:… (each entry is bit value : start position of the run)

Pure1 tree:
- Whole file is not pure1 → 0
- 1st half is not pure1 → 0
- 2nd half is not pure1 → 0
- 1st half of 2nd half is not pure1 → 0
- 2nd half of 2nd half is pure1 → 1
- 1st half of 1st half of 2nd half is pure1 → 1
- 2nd half of 1st half of 2nd half is not pure1 → 0

Or, a separate NotPure0 index tree (trees could be terminated at any level):
- Whole file: true
- 1st half: false
- 2nd half: true
- 1st half of 2nd half: true
- 2nd half of 2nd half: true
- 1st half of 1st half of 2nd half: true
- 2nd half of 1st half of 2nd half: false

First, AND the NP0trees. Only the 1-branches of the result need ANDing through list scans. The more operands, the fewer 1-branches.
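As a rough sketch of the two structures on this slide, the code below derives a run list and a NotPure0 tree from one bit slice. The slice 00001011 is an assumption (it reproduces the run-list entries and the seven truth values shown above), and the function names and the `(bit, children)` tuple layout are hypothetical.

```python
def run_list(bits):
    """(bit value, start position) for each maximal run of equal bits."""
    return [(b, i) for i, b in enumerate(bits) if i == 0 or b != bits[i - 1]]

def format_runs(bits):
    """Render runs as 'value:start', with start as a fixed-width binary address."""
    width = max(1, (len(bits) - 1).bit_length())
    return [f"{b}:{start:0{width}b}" for b, start in run_list(bits)]

def build_np0(bits):
    """NotPure0 tree: a node is 1 if its segment contains at least one 1-bit."""
    if not any(bits):
        return (0, None)                      # pure0: false, branch ends
    if all(bits):
        return (1, None)                      # pure1: true, branch ends
    mid = len(bits) // 2                      # mixed: true, recurse on halves
    return (1, (build_np0(bits[:mid]), build_np0(bits[mid:])))

r11 = [0, 0, 0, 0, 1, 0, 1, 1]                # assumed example bit slice
print(format_runs(r11))                       # ['0:000', '1:100', '0:101', '1:110']
print(build_np0(r11))
```

In the NotPure0 tree a 0 node is always pure0, so any 0 met while ANDing such trees prunes the whole branch before the run lists are scanned.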

Other Indexes on RunLists

We could put Pure0-Run, Pure1-Run and even Mixed-Run (or LZV-Run) RunListIndexes on RL:

R11 → RL11 = 0:0 1:100 0:101 1:110 01:1000
P1RI (start:length) = …:1 110:2
P0RI (start:length) = …:4 101:1
PLZVRI (pattern, with length = number of consecutive replicas of the pattern) = …:1
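A pure-run index of the start:length kind can be computed in one pass over the bits. The sketch below is illustrative only: `run_index` is a hypothetical helper, and the 8-bit slice 00001011 is an assumed input, not necessarily the R11 of the slide.

```python
from itertools import groupby

def run_index(bits, bit):
    """(start, length) of every maximal run of `bit` in the sequence."""
    index, pos = [], 0
    for value, group in groupby(bits):
        length = len(list(group))
        if value == bit:
            index.append((pos, length))
        pos += length
    return index

def show(index, width=3):
    """Render entries as 'start:length' with a binary start address."""
    return [f"{start:0{width}b}:{length}" for start, length in index]

bits = [0, 0, 0, 0, 1, 0, 1, 1]               # assumed example bit slice
print(show(run_index(bits, 1)))               # ['100:1', '110:2']
print(show(run_index(bits, 0)))               # ['000:4', '101:1']
```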

Best Pure1 tree Implementation? My guess circa 04jan

For n-D Pure1 trees:
1. At any node, if |1-bits| in the tuple set represented by the node < a lower threshold, LT, then that node simply shows the 1List, the list of 1-bit positions (use a 0-bit if the count is 0), and has no children.
2. Else, if the tuple set represented by that node < UT = 2^(nm), an upper threshold, leave the bit slice uncompressed (a P-sequence).

Building such Ptrees bottom up, using in-order ordering:
1. If the 1-count of the next UT-segment ≥ LT, install a P-sequence; else install a 1List.
2. If the current UT-segment node is numbered k*(2^n - 1) and it and all 2^n - 1 predecessors are 1Lists, and the cardinality of the union of said 1Lists < LT, install the union in the parent node. Recurse this collapsing process upward to the root.

Building such Ptrees top down:
1. For datasets larger than UT, recursively build the pure1 tree downward.
2. If ever a node has < LT 1-bits, install the 1List and terminate that branch.
3. At the level where the represented tuple set = UT, install a 1List if |1-bits| < LT, else install a P-sequence.

Notes:
1. This method should extend well to data streams. When the data stream exceeds the current allocation (which, for n-D Ptrees, will be a power of 2^n), just grow the root of each Ptree to a new level, using the current Ptree as node 0. Nodes 1, 2, 3, …, 2^n - 1 of the new Ptree will start as 0 nodes without children and will grow as 1Lists until LT is reached; then they will be converted to P-sequences.
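One possible reading of the top-down procedure, restricted to the 1-D case (halving), is sketched below. The node tags ("1list", "pure1", "pseq", "node"), the function name, and the explicit pure1 leaf are assumptions made for illustration; the slide itself only names 1Lists and P-sequences.

```python
def build_top_down(bits, lt, ut, offset=0):
    """1-D sketch: 1List below LT one-bits, raw P-sequence at or below UT bits."""
    assert lt >= 1 and ut >= 1, "thresholds must be positive"
    ones = [offset + i for i, b in enumerate(bits) if b]
    if len(ones) < lt:
        return ("1list", ones)                # few 1-bits: store their positions
    if len(ones) == len(bits):
        return ("pure1", offset, len(bits))   # assumed: a pure1 segment ends the branch
    if len(bits) <= ut:
        return ("pseq", list(bits))           # small segment: leave uncompressed
    mid = len(bits) // 2                      # larger than UT: keep building down
    return ("node",
            build_top_down(bits[:mid], lt, ut, offset),
            build_top_down(bits[mid:], lt, ut, offset + mid))

bits = [0] * 8 + [1, 0, 1, 1] + [1, 1, 1, 1]  # assumed 16-bit example slice
tree = build_top_down(bits, lt=2, ut=4)
print(tree)
```

With LT = 2 and UT = 4, the empty left half becomes an empty 1List, the segment 1011 stays as a P-sequence, and the final 1111 is recorded as pure1.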

Ptrees

Vertical, compressed, lossless structures that facilitate fast horizontal AND-processing. P-trees are all of the following:

Partition-tree: a tree of nested partitions.
- The root is the entire bit sequence.
- Level 1 is a partition of the root, P(R) = {C_1 .. C_n}.
- In level 2, each level-1 component is partitioned by P(C_i) = {C_i,1 .. C_i,n_i}, i = 1..n.
- In level 3, each level-2 component is partitioned by P(C_i,j) = {C_i,j,1 .. C_i,j,n_ij}.
- ...

Partition tree:

                 R
             /   …   \
          C_1    …    C_n
         / … \       / … \
  C_1,1 … C_1,n_1  C_n,1 … C_n,n_n
  ...

Predicate-tree: defined for a predicate on the leaf nodes of a partition-tree (which also induces predicates on interior nodes using quantifiers).
- Predicate-tree nodes can be truth values (Boolean P-tree).
- The predicate can be quantified existentially (1 or a threshold %) or universally.

Purity-tree:
- Universally quantified tree (predicate is "1 at ∀ positions"): the Pure1-tree.
- Existentially quantified tree (predicate is "∃ a 1 position"): the NotPure0-tree.

We will focus on P1trees. All Ptrees shown so far were 1-dimensional (recursively partitioned by halving bit files), but they can be 2-D (recursively quartering, e.g., used for 2-D images), 3-D (recursively eighth-ing), …, or based on purity runs or LZW-runs or …
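For the 2-D case mentioned above, the same pure1 predicate can be applied by recursive quartering. The sketch assumes a square 2^k x 2^k bit image and a quadrant order of upper-left, upper-right, lower-left, lower-right; both the order and the function name are assumptions, as the slide does not fix them.

```python
def build_p1_2d(grid):
    """2-D pure1 tree by recursive quartering of a square 2^k x 2^k bit grid."""
    flat = [b for row in grid for b in row]
    if all(flat):
        return (1, None)                      # pure1 quadrant: branch ends
    if not any(flat):
        return (0, None)                      # pure0 quadrant: branch ends
    h = len(grid) // 2
    quadrants = (
        [row[:h] for row in grid[:h]],        # upper-left
        [row[h:] for row in grid[:h]],        # upper-right
        [row[:h] for row in grid[h:]],        # lower-left
        [row[h:] for row in grid[h:]],        # lower-right
    )
    return (0, tuple(build_p1_2d(q) for q in quadrants))

image = [                                     # assumed 4x4 example image
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
print(build_p1_2d(image))
```

Only the mixed lower-left quadrant is expanded to single pixels; the other three quadrants are pure and end at level 1.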

Best Ptree Implementation? My guess circa 04feb

Ptrees can be viewed "breadth first, bottom up". That is, one can view the leaf level (level-0) horizontally and then view level-1 as another bit vector in which each bit represents the truth of the predicate (e.g., pure1) applied to pairs of bits in the leaf-level vector. This can be continued upwards for each successive level until the root is reached. The result is a bottom-up construction of the 1-D ptree.

This construction has the advantages that the leaf-level vector can be any length and can grow (e.g., for data streams). Each level up, e.g., level-k, can be grown as new 2^k-granules for that level fill up.

Additionally, the breadth-first bottom-up view of the 2-D ptree for that bit vector is just the leaf vector with every other level above it left out. The breadth-first bottom-up view of the 3-D ptree is just the leaf level, then leaving out 2 consecutive levels at a time, going up the levels. 4-D is the same, leaving out triples of levels at a time, etc.

Another advantage of this view is that stripping an additional k levels just above the leaf level amounts to using 2^k-length p-sequences at the leaf, as discussed on a previous slide.

The breadth-first bottom-up (B2U) view doubles the storage requirement over that of p-sequences. However, storage is free, so it isn't a big concern. The capture time (on the DII side) is probably much higher, but that is a one-time cost.

On the face of it, compression is gone. However, we are talking about disk space again, which is free. In terms of ANDing speed, the compression is there (just don't descend on purity). However, one should have the NZ trees (for the existential case) as well in order to distinguish purity. Note that we are replacing all pointers with offsets now, so distinguishing mixed from pure zero cannot be done by "no child pointers".
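The breadth-first bottom-up view can be sketched as a list of level vectors, each built from complete pairs of the level below. The function name `b2u_levels`, the choice to leave an unpaired trailing bit out until its granule fills, and the sample vector are assumptions for illustration.

```python
from operator import and_, or_

def b2u_levels(bits, combine):
    """Level vectors, leaf first; each upper bit combines one complete pair below."""
    levels = [list(bits)]
    while len(levels[-1]) >= 2:
        prev = levels[-1]
        # An unpaired trailing bit is skipped: its granule has not filled up yet.
        levels.append([combine(prev[i], prev[i + 1])
                       for i in range(0, len(prev) - 1, 2)])
    return levels

leaf = [0, 0, 0, 0, 1, 0, 1, 1]               # assumed example bit vector
pure1 = b2u_levels(leaf, and_)                # universal predicate (pure1)
nonzero = b2u_levels(leaf, or_)               # existential predicate (NZ)

print(pure1)     # [[0,0,0,0,1,0,1,1], [0,0,0,1], [0,0], [0]]
print(nonzero)   # [[0,0,0,0,1,0,1,1], [0,0,1,1], [0,1], [1]]

# 2-D view: keep the leaf level and every other level above it.
print(pure1[::2])

# With offsets instead of child pointers, a node is mixed exactly when
# its pure1 bit is 0 and its NZ bit is 1.
mixed_level1 = [nz & (1 - p1) for p1, nz in zip(pure1[1], nonzero[1])]
print(mixed_level1)                           # [0, 0, 1, 0]
```

Because every level is a flat vector, appending bits to the leaf only ever appends bits to the levels above it, which is what makes this layout convenient for streams.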