3. Vertical Data LECTURE 2 Section 3.

Slides:



Advertisements
Similar presentations
BackTracking Algorithms
Advertisements

Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
The Binary Numbering Systems
Multidimensional Data Rtrees Bitmap indexes. R-Trees For “regions” (typically rectangles) but can represent points. Supports NN, “where­am­I” queries.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
With PGP-D, to get pTree info, you need: the ordering (the mapping of bit position to table row) the predicate (e.g., table column id and bit slice or.
BTrees & Bitmap Indexes
Multiple-key indexes Index on one attribute provides pointer to an index on the other. If V is a value of the first attribute, then the index we reach.
Datamining_3 Clustering Methods Clustering a set is partitioning that set. Partitioning is subdividing into subsets which mutually exclusive (don't overlap)
SPRING 2004CENG 3521 The Relational Model Chapter 3.
Is ASCII the only way? For computers to do anything (besides sit on a desk and collect dust) they need two things: 1. PROGRAMS 2. DATA A program is a.
COMP 451/651 Multiple-key indexes
5. 1 JPEG “ JPEG ” is Joint Photographic Experts Group. compresses pictures which don't have sharp changes e.g. landscape pictures. May lose some of the.
Artificial Neural Network Applications on Remotely Sensed Imagery Kaushik Das, Qin Ding, William Perrizo North Dakota State University
Data Mining on Streams  We should use runlists for stream data mining (unless there is some spatial structure to the data, of course, then we need to.
Bit Sequential (bSQ) Data Model and Peano Count Trees (P-trees) Department of Computer Science North Dakota State University, USA (the bSQ and P-tree technology.
Partitioning – A Uniform Model for Data Mining Anne Denton, Qin Ding, William Jockheck, Qiang Ding and William Perrizo.
Analysis of Algorithms These slides are a modified version of the slides used by Prof. Eltabakh in his offering of CS2223 in D term 2013.
Efficient OLAP Operations for Spatial Data Using P-Trees Baoying Wang, Fei Pan, Dongmei Ren, Yue Cui, Qiang Ding William Perrizo North Dakota State University.
Our Approach  Vertical, horizontally horizontal data vertically)  Vertical, compressed data structures, variously called either Predicate-trees or Peano-trees.
Knowledge Discovery in Protected Vertical Information Dr. William Perrizo University Distinguished Professor of Computer Science North Dakota State University,
Bootstrapped Optimistic Algorithm for Tree Construction
Entity Relationship Diagram (ERD). Objectives Define terms related to entity relationship modeling, including entity, entity instance, attribute, relationship.
Lecture 12 Huffman Algorithm. In computer science and information theory, a Huffman code is a particular type of optimal prefix code that is commonly.
Vertical Data 2 In this example database (which is used throughout these notes), there are two entities, Students (a student has a number, S#, a name,
Efficient Quantitative Frequent Pattern Mining Using Predicate Trees Baoying Wang, Fei Pan, Yue Cui William Perrizo North Dakota State University.
P Left half of rt half ? false  Left half pure1? false  Whole is pure1? false  0 5. Rt half of right half? true  1.
ER Diagrams and Relational Model CS 174a (Winter 2015)
School of Computing Clemson University Fall, 2012

Item-Based P-Tree Collaborative Filtering applied to the Netflix Data
Everything is a number Everything in a computer memory and on storages is a number. Number  Number Characters  Number by ASCII code Sounds  Number.
Decision Tree Classification of Spatial Data Streams Using Peano Count Trees Qiang Ding Qin Ding * William Perrizo Department of Computer Science.
Fast Kernel-Density-Based Classification and Clustering Using P-Trees
Decision Tree Induction for High-Dimensional Data Using P-Trees
Efficient Ranking of Keyword Queries Using P-trees
Mean Shift Segmentation
Yue (Jenny) Cui and William Perrizo North Dakota State University
Translation of ER-diagram into Relational Schema
Spatial Data Models Raster uses individual cells in a matrix, or grid, format to represent real world entities Vector uses coordinates to store the shape.
Grasshopper caused significant economic loss each year.
Monday, April 16, 2018 Announcements… For Today…
PTrees (predicate Trees) fast, accurate , DM-ready horizontal processing of compressed, vertical data structures Project onto each attribute (4 files)
Fitting Curve Models to Edges
Spanning Trees Longin Jan Latecki Temple University based on slides by
3. Vertical Data LECTURE 2 Section 3.
Pre-Processing What is the best amount of amortized preprocessing?
09_Queries_LECTURE2 CODE GENERATION implements the operator, JOIN (equi-join, we will use  for it.) J1. NESTED LOOP JOIN S on S.S#=E.S# For each record.
Sequential circuit design
Clustering Methods Clustering a set is partitioning that set.
The Building Blocks: Binary Numbers, Boolean Logic, and Gates
Lecture 15: Bitmap Indexes
Sequential circuit design
Decision Trees.
Fundamentals of Data Representation
A Spatial Data and Sensor Network Application:
North Dakota State University Fargo, ND USA
Copyright © Cengage Learning. All rights reserved.
Introduction to Data Structures
The Multi-hop closure theorem for the Rolodex Model using pTrees
Pruning 24-Feb-19.
North Dakota State University Fargo, ND USA
Spanning Trees Longin Jan Latecki Temple University based on slides by
Quantizing Compression
The P-tree Structure and its Algebra Qin Ding Maleq Khan Amalendu Roy
Error Correction Coding
Scalable light field coding using weighted binary images
Quantizing Compression
Digital Electronics and Logic Design
Data Mining CSCI 307, Spring 2019 Lecture 23
Presentation transcript:

3. Vertical Data LECTURE 2 Section 3

A Education Database Example In this example database (which is used throughout these notes), there are two entities, Students (a student has a number, S#, a name, SNAME, and gender, GEN Courses (course has a number, C#, name, CNAME, State where the course is offered, ST, TERM and ONE relationship, Enrollments (a student, S#, enrolls in a class, C#, and gets a grade in that class, GR). The horizontal Education Database consists of 3 files, each of which consists of a number of instances of identically structured horizontal records: Enrollments Student Courses S#|C#|GR 0 |1 |B 0 |0 |A 3 |1 |A 3 |3 |B 1 |3 |B 1 |0 |D 2 |2 |D 2 |3 |A 4 |4 |B 5 |5 |B S#|SNAME|GEN 0 |CLAY | M 1 |THAD | M 2 |QING | F 3 |AMAL | M 4 |BARB | F 5 |JOAN | F C#|CNAME|ST|TERM 0 |BI |ND| F 1 |DB |ND| S 2 |DM |NJ| S 3 |DS |ND| F 4 |SE |NJ| S 5 |AI |ND| F We have already talked about the process of structuring data in a horizontal database (e.g., develop an Entity-Relationship diagram or ER diagram, etc. - in this case: Courses Student Enrollments S# SNAME GEN C# GR CNAME ST TERM What is the process of structuring this data into a vertical database? This is an open question. Much research is needed on that issue! (great term paper topics exist here!!!) We will discuss this a little more on the next slide. Section 3 # 12 11

One way to begin to vertically structure this data is: 1. Code some attributes in binary For numeric fields, we have used standard binary encoding (red indicates the highorder bit, green the middle bit and blue the loworder bit) to the right of each field value encoded). . For gender, F=1 and M=0. For term, Fall=0, Spring=1. For grade, A=11, B=10, C=01, D=00 (which could be called GPA encoding?). We have abreviated STUDENT to S, COURSE to C and ENROLLMENT to E. S:S#___|SNAME|GEN 0 000|CLAY |M 0 1 001|THAD |M 0 2 010|QING |F 1 3 011|BARB |F 1 4 100|AMAL |M 0 5 101|JOAN |F 1 C:C#___|CNAME|ST|TERM 0 000|BI |ND|F 0 1 001|DB |ND|S 1 2 010|DM |NJ|S 1 3 011|DS |ND|F 0 4 100|SE |NJ|S 1 5 101|AI |ND|F 0 E:S#___|C#___|GR . 0 000|1 001|B 10 0 000|0 000|A 11 3 011|1 001|A 11 3 011|3 011|D 00 1 001|3 011|D 00 1 001|0 000|B 10 2 010|2 010|B 10 2 010|3 011|A 11 4 100|4 100|B 10 5 101|5 101|B 10 The above encoding seem natural. But how did we decide which attributes are to be encoded and which are not? As a term paper topic, that would be one of the main issues to research Note, we have decided not to encode names (our rough reasoning (not researched) is that there would be little advantage and it would be difficult (e.g. if name is a CHAR(25) datatype, then in binary that's 25*8 = 200 bits!). Note that we have decided not to encode State. That may be a mistake! Especially in this case, since it would be so easy (only 2 States ever? so 1 bit), but more generally there could be 50 and that would mean at least 6 bits. 2. Another binary encoding scheme (which can be used for numeric and non-numeric fields) is value map or bitmap encoding. The concept is simple. For each possible value, a, in the domain of the attribute, A, we encode 1=true and 0=false for the predicate A=a. The resulting single bit column becomes a map where a 1 means that row has A-value = a and a 0 means that row or tuple has A-value which is not a. There is a wealth of existing research on bit encoding. There is also quite a bit of research on vertical databases. There is even the first commercial vertical database announced called Vertica (check it out by Googling that name). Vertica was created by the same guy, Mike Stonebraker, who created one of the first Relational Databases, Ingres. Section 3 # 13

Method-1 for vertically structuring the Educational Database S:S#___|SNAME|GEN 0 000|CLAY |M 0 1 001|THAD |M 0 2 010|QING |F 1 3 011|BARB |F 1 4 100|AMAL |M 0 5 101|JOAN |F 1 C:C#___|CNAME|ST|TERM 0 000|BI |ND|F 0 1 001|DB |ND|S 1 2 010|DM |NJ|S 1 3 011|DS |ND|F 0 4 100|SE |NJ|S 1 5 101|AI |ND|F 0 E:S#___|C#___|GR . 0 000|1 001|B 10 0 000|0 000|A 11 3 011|1 001|A 11 3 011|3 011|D 00 1 001|3 011|D 00 1 001|0 000|B 10 2 010|2 010|B 10 2 010|3 011|A 11 4 100|4 100|B 10 5 101|5 101|B 10 1 1 1 CLAY THAD QING BARB AMAL JOAN 1 1 1 1 BI DB DM DS SE AI ND NJ 1 1 1 1 1 1 1 1 1 The M1 VDBMS would then be stored as: Rather than have to label these (0-Dimensional uncompressed) Ptrees, we will remember the color code, namely: purple border = S#; brown border = C#; light blue border = GR; Yellow border = GEN; Black border = TERM. red bits means highorder bit (on the left); green bits means middle bit; blue bit means loworder bit (on the right). We will be able to distinguish whether a S# Ptrees is from STUDENT or ENROLLMENT by its length. Similarly, we will be able to distingusih the C# Ptrees of COURSE versus ENROLLMENT. Section 3 # 14

EM SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D For the selection, ENROLL.gr = D ( or E.g=D) we we create a ptree mask: EM = E'.g1 AND E'.g2 (because we want both bits to be zero for D). S: S#___ | SNAME | GEN E: S#___ | C#___ | GR 1 1 1 CLAY THAD QING BARB AMAL JOAN 1 1 1 1 1 1 1 1 1 1 1 EM 1 Section 3 # 15

For S#= (0 1 1), mask S-tuples with P'S#,2^PS#,1^PS#, 0 SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D For the join, S.s = E.s, we sequence through the masked E tuples and for each, we create a mask for the matching S tuples, concatenate them and output the concatenation. S: S#___ | SNAME | GEN E: S#___ | C#___ | GR 1 1 1 1 1 1 CLAY THAD QING BARB AMAL JOAN 1 1 1 1 1 1 1 1 1 1 For S#= (0 1 1), mask S-tuples with P'S#,2^PS#,1^PS#, 0 Concatenate and output (BARB, D) 1 Section 3 # 16

For S#= (0 1 1), mask S-tuples with P'S#,2^P'S#,1^PS#, 0 SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D For the join, S.s = E.s, we sequence through the masked E tuples and for each, we create a mask for the matching S tuples, concatenate them and output the concatenation. S: S#___ | SNAME | GEN E: S#___ | C#___ | GR 1 1 1 1 1 1 CLAY THAD QING BARB AMAL JOAN 1 1 1 1 1 1 1 1 1 1 For S#= (0 1 1), mask S-tuples with P'S#,2^P'S#,1^PS#, 0 Concatenate and output (BARB, D) 1 For S#= (0 0 1), mask S-tuples with P'S#,2^P'S#,1^PS#, 0 Concatenate and output (THAD, D) Section 3 # 17

For S#= (0 0 0), mask E-tuples with P'S#,2^P'S#,1^P'S#, 0 SELECT S.n, E.g FROM S, E WHERE S.s=E.s Can the join, S.s = E.s, be speeded up? Since there is no selection involved this time, pontentially, we would have to visit every E-tuple, mask the matching S-tuples, concatenate and output. We can speed that up by masking all common E-tuples and retaining partial masks as we go. S: S#___ | SNAME | GEN E: S#___ | C#___ | GR 1 1 1 1 1 1 CLAY THAD QING BARB AMAL JOAN 1 1 1 1 1 1 1 1 1 1 1 1 For S#= (0 0 0), mask E-tuples with P'S#,2^P'S#,1^P'S#, 0 1 1 For S#= (0 0 1), mask S-tuples with P'S#,2^P'S#,1^P'S#, 0 Concatenate and output (CLAY, B) and (CLAY, A) Continue down to the next E-tuple and do the same.... Section 3 # 18

2-Dimensional P-trees:natural choice for, e. g. , 2-D image files 2-Dimensional P-trees:natural choice for, e.g., 2-D image files. For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling “Peano” ordering has advantages for fast processing, yet compresses well in the presence of spatial continuity. For an image bit-file (e.g., hi-order bit of the red color band of an image): 1111110011111000111111001111111011110000111100001111000001110000 Which, in spatial raster order is: Top-down construction of its 2-dimensional Peano ordered P-tree is built by recording the truth of universal predicate “pure 1” in a fanout=4 tree recursively on quarters (1/22 subsets), until purity achieved Pure-1? False=0 Pure! Pure! 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 1 pure! pure! pure! pure! pure! 1 1 1 1 1 1 1 1 1 1 1 Section 3 # 19

From here on we will take 4 bit positions at a time, for efficiency. Start here 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 Bottom-up construction of the 2-Dimensional P-tree is done using Peano (in order) traversal of a fanout=4, log4(64)= 4 level tree, collapsing pure siblings, as we go: From here on we will take 4 bit positions at a time, for efficiency. 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 Section 3 # 20

Some aspects of 2-D P-trees: ROOT-COUNT = level-sum * level-purity-factor. Root Count = 7 * 40 + 4 * 41 + 2 * 42 = 55 1=001 0 level-3 (pure=43) 1 1 level-2 (pure=42) 1 level-1 (pure=41) 1 level-0 (pure=40) 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 2 3 2 3 2 . 2 . 3 7=111 Tree levels (going down): 3, 2, 1, 0, with purity-factors of 43 42 41 40 respectively Fan-out = 2dimension = 22 = 4 Node ID (NID) = 2.2.3 ( 7, 1 ) ( 111, 001 ) 10.10.11 Section 3 # 21

3-Dimensional Ptrees: Top-down construction of its 3-dimensional Peano ordered P-tree: record the truth of universal predicate pure1 in a fanout=8 tree recursively on eighths (1/23 subsets), until purity achieved. Bottom-up construction of the 3-Dimensional P-tree is done using Peano (in order) traversal of a fanout=8, log8(64)= 2 level tree, collapsing pure siblings, as we go: Section 3 # 22

Situation space 1 P CEASR bio-agent detector (uses 3-D Ptrees)  at a position in the situation space. Suppose a biological agent is sensed by nano-sensors Situation space And that position corresponds to this 1-bit position in this cutaway view We have now captured the data in the 1st octant (forward-upper-left). Moving to the next octant (forward-upper-right): All other positions contain a 0-bit, i.e., the level of bio-agent detected by the nano-sensors in each of the other 63 cells is below a danger threshold. Start 1 ONE tiny, 3-D P-tree can represent this “bio-situation” completely. It is constructed (bottom up) as a fan-out=8, 3-D P-tree, as follows. P We can save time by noting that all the remaining 56 cells (in 7 other octants) contain all 0s. Each of the next 7 octants will produce eight 0s at the leaf level (8 pure-0 siblings), each of which will collapse to a 0 at level-1. So, proceeding an octant at a time (rather than a cell at a time): 1 This entire situation can be transmitted to a personal display unit, as merely two bytes of data plus their two NIDs. For NID, use [level, global_level_offset] rather than [local_segment_offset,…local_segment_offset]. So assume every node not sent is all 0s, that in any 13-bit node segment sent (only need send “mixed” segments), the 1st 2 bits are the level, the next 3 bits are the global_level_offset within that level (i.e., 0..7), the final 8 bits are the node’s data, then the complete situation can be transmitted as these 13 bits: 01 000 0000 0001 If 2n3 cells (n=2 above) situation it will take only log2(n) blue, 23n-3 green, 8 red bits So even if there are 283=224 ~16,000,000 cells, transmit merely 3+21+8=32 bits. Section 3 # 23

Basic, Value and Tuple Ptrees Basic Ptrees for a 7 column, 8 bit table e.g., P11, P12, … , P18, P21, …, P28, …, P71, …, P78 Target Attribute Target Bit Position Value Ptrees (predicate: quad is purely target value in target attribute) e.g., P1, 5 = P1, 101 = P11 AND P12’ AND P13 AND Target Attribute Target Value Tuple Ptrees (predicate: quad is purely target tuple) e.g., P(1, 2, 3) = P(001, 010, 111) = P1, 001 AND P2, 010 AND P3, 111 AND Rectangle Ptrees (predicate: quad is purely in target rectangle (product of intervals) e.g., P([13],, [0.2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2) AND/OR Section 3 # 24

Horizontal Processing of Vertical Structures for Record-based Workloads For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it, may introduce too much post processing? 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 010 111 110 001 011 111 110 000 010 110 101 001 010 111 101 111 101 010 001 100 010 010 001 101 111 000 001 100 R( A1 A2 A3 A4) For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, where there is no reconstructive post processing? 0 1 0 1 1 1 1 1 0 0 0 1 0 1 1 1 1 1 1 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 1 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 1 1 1 1 0 0 0 0 0 1 1 0 0 R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 1 Section 3 # 25

Please proceed on to 03_Vertical_data_LECTURE3 Section 3