Vertical Data 2

An Education Database Example

In this example database (which is used throughout these notes), there are two entities, Students (a student has a number, S#, a name, SNAME, and a gender, GEN) and Courses (a course has a number, C#, a name, CNAME, the State where the course is offered, ST, and a TERM), and ONE relationship, Enrollments (a student, S#, enrolls in a class, C#, and gets a grade in that class, GR).

The horizontal Education Database consists of 3 files, each of which consists of a number of instances of identically structured horizontal records:

Courses
C#|CNAME|ST|TERM
0 |BI   |ND|F
1 |DB   |ND|S
2 |DM   |NJ|S
3 |DS   |ND|F
4 |SE   |NJ|S
5 |AI   |ND|F

Student
S#|SNAME|GEN
0 |CLAY |M
1 |THAD |M
2 |QING |F
3 |AMAL |M
4 |BARB |F
5 |JOAN |F

Enrollments
S#|C#|GR
0 |1 |B
0 |0 |A
3 |1 |A
3 |3 |B
1 |3 |B
1 |0 |D
2 |2 |D
2 |3 |A
4 |4 |B
5 |5 |B

We have already talked about the process of structuring data in a horizontal database (e.g., develop an Entity-Relationship diagram, or ER diagram, etc.). In this case:

[ER diagram: Student (S#, SNAME, GEN), Courses (C#, CNAME, ST, TERM), Enrollments (S#, C#, GR)]

What is the process of structuring this data into a vertical database? This is an open question. Much research is needed on that issue (great term paper topics exist here). We will discuss this a little more on the next slide.

One way to begin to vertically structure this data is:

1. Code some attributes in binary. For numeric fields, we have used standard binary encoding (red indicates the high-order bit, green the middle bit and blue the low-order bit), shown to the right of each field value encoded. For gender, F=1 and M=0. For term, Fall=0 and Spring=1. For grade, A=11, B=10, C=01, D=00 (which could be called GPA encoding). We have abbreviated STUDENT to S, COURSE to C and ENROLLMENT to E.

S: S#___|SNAME|GEN
0 000|CLAY |M 0
1 001|THAD |M 0
2 010|QING |F 1
3 011|BARB |F 1
4 100|AMAL |M 0
5 101|JOAN |F 1

C: C#___|CNAME|ST|TERM
0 000|BI |ND|F 0
1 001|DB |ND|S 1
2 010|DM |NJ|S 1
3 011|DS |ND|F 0
4 100|SE |NJ|S 1
5 101|AI |ND|F 0

E: S#___|C#___|GR
0 000|1 001|B 10
0 000|0 000|A 11
3 011|1 001|A 11
3 011|3 011|D 00
1 001|3 011|D 00
1 001|0 000|B 10
2 010|2 010|B 10
2 010|3 011|A 11
4 100|4 100|B 10
5 101|5 101|B 10

The above encoding seems natural. But how did we decide which attributes are to be encoded and which are not? As a term paper topic, that would be one of the main issues to research.

Note that we have decided not to encode names. Our rough reasoning (not researched) is that there would be little advantage and it would be difficult (e.g., if name is a CHAR(25) datatype, then in binary that is 25*8 = 200 bits).

Note that we have also decided not to encode State. That may be a mistake, especially in this case, since it would be so easy (only 2 States ever, so 1 bit). More generally there could be 50 States, and that would mean at least 6 bits.

2. Another binary encoding scheme (which can be used for numeric and non-numeric fields) is value map or bitmap encoding. The concept is simple. For each possible value, a, in the domain of the attribute, A, we encode 1=true and 0=false for the predicate A=a. The resulting single bit column becomes a map where a 1 means that row has A-value = a and a 0 means that row or tuple has an A-value which is not a.

There is a wealth of existing research on bit encoding. There is also quite a bit of research on vertical databases. There is even a first commercial vertical database, called Vertica (check it out by Googling that name). Vertica was created by Mike Stonebraker, the same person who created one of the first relational databases, Ingres.
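The two encodings described above can be sketched in a few lines of Python. This is only an illustration: the helper names `bit_slices` and `value_map` are made up for this sketch, and the data is the C# and ST columns of the Courses file from the first slide.

```python
def bit_slices(values, width):
    """Standard binary encoding: return `width` bit columns, high-order bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(width - 1, -1, -1)]


def value_map(column, value):
    """Value map (bitmap) encoding: 1 where the predicate column == value holds."""
    return [1 if v == value else 0 for v in column]


# C# and ST columns of the Courses file (first slide).
c_num = [0, 1, 2, 3, 4, 5]
state = ["ND", "ND", "NJ", "ND", "NJ", "ND"]

c_slices = bit_slices(c_num, 3)   # three vertical bit columns for C#
nd_map = value_map(state, "ND")   # one bitmap per State value
nj_map = value_map(state, "NJ")

for name, col in zip(("high", "middle", "low"), c_slices):
    print(name, col)
print("ST=ND", nd_map)
print("ST=NJ", nj_map)
```

With only two States, the two bitmaps are complements of each other, which is the 1-bit case mentioned in the note on State.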

Method-1 for vertically structuring the Educational Database

The M1 VDBMS would then be stored as:

[Figure: the encoded S, C and E files stored column by column, one vertical bit file per encoded bit position, alongside the unencoded SNAME, CNAME and ST columns]

Rather than have to label these (0-Dimensional uncompressed) Ptrees, we will remember the color code, namely: purple border = S#; brown border = C#; light blue border = GR; yellow border = GEN; black border = TERM. Red bits are high-order bits (on the left); green bits are middle bits; blue bits are low-order bits (on the right).

We will be able to distinguish whether an S# Ptree is from STUDENT or ENROLLMENT by its length. Similarly, we will be able to distinguish the C# Ptrees of COURSE from those of ENROLLMENT.

SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D

For the selection, ENROLL.gr = D (or E.g=D), we create a ptree mask: EM = E'.g_1 AND E'.g_2 (because we want both bits to be zero for D).

[Figure: the bit columns of S (S#, SNAME, GEN) and E (S#, C#, GR), with the resulting mask EM]
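The selection mask can be sketched on uncompressed bit columns. The grade column below is the GR column of E as listed on the encoding slide; the names `g1`, `g2` and `em` are illustrative stand-ins for E.g_1 (high-order bit), E.g_2 (low-order bit) and EM.

```python
GRADE_CODE = {"A": 0b11, "B": 0b10, "C": 0b01, "D": 0b00}

# GR column of E, in row order, as listed on the encoding slide.
grades = list("BAADDBBABB")

g1 = [(GRADE_CODE[g] >> 1) & 1 for g in grades]   # high-order grade bit
g2 = [GRADE_CODE[g] & 1 for g in grades]          # low-order grade bit

# EM = E'.g_1 AND E'.g_2: complement both bit columns, then AND them.
em = [(1 - a) & (1 - b) for a, b in zip(g1, g2)]
print(em)
```

Only the rows whose grade bits are both 0 (grade D) survive in the mask.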

SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D

For the join, S.s = E.s, we sequence through the masked E tuples and, for each, we create a mask for the matching S tuples, concatenate them and output the concatenation.

For S# = (0 1 1), mask S-tuples with P'_{S#,2} ^ P_{S#,1} ^ P_{S#,0}. Concatenate and output (BARB, D).

Continuing the join S.s = E.s with the next masked E tuple:

For S# = (0 0 1), mask S-tuples with P'_{S#,2} ^ P'_{S#,1} ^ P_{S#,0}. Concatenate and output (THAD, D).
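A minimal sketch of this masked join on uncompressed bit columns follows. The helper names `slices` and `key_mask` are hypothetical, and the S# values of the E tuples are an assumption: they are taken from the Enrollments file on the first slide, since the transcript does not preserve them on the later slides.

```python
def slices(values, width=3):
    """Vertical bit columns of `values`, high-order bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(width - 1, -1, -1)]


def key_mask(key_slices, bits):
    """AND each slice (or its complement, for a 0 bit) to mark rows with these key bits."""
    mask = [1] * len(key_slices[0])
    for col, want in zip(key_slices, bits):
        mask = [m & (c if want else 1 - c) for m, c in zip(mask, col)]
    return mask


s_names = ["CLAY", "THAD", "QING", "BARB", "AMAL", "JOAN"]
s_slices = slices([0, 1, 2, 3, 4, 5])
e_slices = slices([0, 0, 3, 3, 1, 1, 2, 2, 4, 5])   # assumed E.S# values
e_grade = list("BAADDBBABB")
em = [1 if g == "D" else 0 for g in e_grade]        # selection mask for E.g = D

joined = []
for row, selected in enumerate(em):
    if selected:
        bits = [col[row] for col in e_slices]        # S# bits of this E tuple
        s_mask = key_mask(s_slices, bits)            # matching S tuples
        joined += [(s_names[i], e_grade[row]) for i, hit in enumerate(s_mask) if hit]

print(joined)
```

For S# = (0 1 1) the call to `key_mask` complements only the high-order slice, which corresponds to P'_{S#,2} ^ P_{S#,1} ^ P_{S#,0}.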

SELECT S.n, E.g FROM S, E WHERE S.s=E.s

Can the join, S.s = E.s, be sped up? Since there is no selection involved this time, potentially we would have to visit every E-tuple, mask the matching S-tuples, concatenate and output. We can speed that up by masking all E-tuples with a common S# at once and retaining partial masks as we go.

For S# = (0 0 0), mask E-tuples with P'_{S#,2} ^ P'_{S#,1} ^ P'_{S#,0}.
For S# = (0 0 0), mask S-tuples with P'_{S#,2} ^ P'_{S#,1} ^ P'_{S#,0}.
Concatenate and output (CLAY, B) and (CLAY, A).

Continue down to the next E-tuple and do the same.
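One way to read this speed-up is to handle each distinct S# once: mask every E-tuple that shares it, mask the matching S-tuples, output all pairs, and then skip the E-tuples already covered. The sketch below does that on uncompressed bit columns. The function names are hypothetical, the E.S# values are assumed from the first slide's Enrollments file, and the sketch does not model the partial-mask reuse mentioned on the slide.

```python
def slices(values, width=3):
    """Vertical bit columns of `values`, high-order bit first."""
    return [[(v >> b) & 1 for v in values] for b in range(width - 1, -1, -1)]


def key_mask(key_slices, bits):
    """AND each slice (or its complement, for a 0 bit) to mark rows with these key bits."""
    mask = [1] * len(key_slices[0])
    for col, want in zip(key_slices, bits):
        mask = [m & (c if want else 1 - c) for m, c in zip(mask, col)]
    return mask


def join_all(s_names, s_slices, e_slices, e_grade):
    """Join S and E on S#, processing each distinct S# value in E only once."""
    remaining = [1] * len(e_grade)           # E tuples not yet output
    out = []
    for row in range(len(e_grade)):
        if not remaining[row]:
            continue
        bits = [col[row] for col in e_slices]
        e_mask = key_mask(e_slices, bits)    # all E tuples with this S#
        s_mask = key_mask(s_slices, bits)    # the matching S tuples
        for s_row, s_hit in enumerate(s_mask):
            if s_hit:
                out += [(s_names[s_row], e_grade[i]) for i, hit in enumerate(e_mask) if hit]
        remaining = [r & (1 - m) for r, m in zip(remaining, e_mask)]
    return out


s_names = ["CLAY", "THAD", "QING", "BARB", "AMAL", "JOAN"]
s_slices = slices([0, 1, 2, 3, 4, 5])
e_slices = slices([0, 0, 3, 3, 1, 1, 2, 2, 4, 5])   # assumed E.S# values
pairs = join_all(s_names, s_slices, e_slices, list("BAADDBBABB"))
print(pairs)
```

The first distinct S# is (0 0 0), so the first two pairs produced are (CLAY, B) and (CLAY, A).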

2-Dimensional P-trees: a natural choice for, e.g., 2-D image files. For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling "Peano" ordering has advantages for fast processing, and it compresses well in the presence of spatial continuity.

For an image bit-file (e.g., the high-order bit of the red color band of an image), the top-down construction of its 2-dimensional Peano-ordered P-tree records the truth of the universal predicate "pure 1" in a fanout=4 tree, recursively on quarters (1/2^2 subsets), until purity is achieved.

[Figure: the bit-file in spatial raster order and its P-tree; each node answers "Pure-1?" with False=0, and recursion stops at pure nodes]
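The top-down construction can be sketched as a short recursive function. The 8x8 bit image below is an assumption made for illustration (the slide's own image is not preserved in this transcript), and the node representation, a pair of the pure-1 bit and a list of four children in Peano order, is likewise a choice made for this sketch.

```python
def build_ptree(img, r0=0, c0=0, size=None):
    """Return (pure1_bit, children); children is None once the quadrant is pure."""
    if size is None:
        size = len(img)
    cells = [img[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    if all(cells):
        return (1, None)                      # pure 1: predicate true, stop
    if not any(cells):
        return (0, None)                      # pure 0: predicate false, stop
    h = size // 2
    # Mixed quadrant: predicate false, recurse on quarters in Peano order
    # (upper-left, upper-right, lower-left, lower-right).
    kids = [build_ptree(img, r0 + dr, c0 + dc, h)
            for dr, dc in ((0, 0), (0, h), (h, 0), (h, h))]
    return (0, kids)


# Illustrative 8x8 bit image (hypothetical data, in raster order).
rows = ["11111100", "11111000", "11111100", "11111110",
        "11111111", "11111111", "11111111", "01111111"]
image = [[int(ch) for ch in row] for row in rows]

root = build_ptree(image)
print(root[0], [kid[0] for kid in root[1]])
```

For this image the root is mixed, its first and last quadrants are pure 1, and the two middle quadrants are mixed and are expanded further.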

From here on we will take 4 bit positions at a time, for efficiency. Bottom-up construction of the 2-dimensional P-tree is done using a Peano (in-order) traversal of a fanout=4, log_4(64) = 3 level tree, collapsing pure siblings as we go.
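A minimal sketch of the bottom-up construction: flatten the image into Peano order, then repeatedly take 4 siblings at a time and collapse them when they are pure. The 8x8 image is hypothetical illustration data, and representing a mixed node as a Python list of its four children is an assumption of this sketch.

```python
def peano_order(img):
    """Flatten a 2^k x 2^k bit image into Peano (Z-order) sequence."""
    def walk(r0, c0, size):
        if size == 1:
            return [img[r0][c0]]
        h = size // 2
        out = []
        for dr, dc in ((0, 0), (0, h), (h, 0), (h, h)):
            out += walk(r0 + dr, c0 + dc, h)
        return out
    return walk(0, 0, len(img))


def build_bottom_up(bits):
    """Group 4 siblings at a time; pure groups collapse to a single 0 or 1."""
    nodes = list(bits)
    while len(nodes) > 1:
        parents = []
        for i in range(0, len(nodes), 4):
            group = nodes[i:i + 4]
            if group == [1, 1, 1, 1]:
                parents.append(1)             # four pure-1 siblings collapse
            elif group == [0, 0, 0, 0]:
                parents.append(0)             # four pure-0 siblings collapse
            else:
                parents.append(group)         # mixed node keeps its children
        nodes = parents
    return nodes[0]


# Illustrative 8x8 bit image (hypothetical data, in raster order).
rows = ["11111100", "11111000", "11111100", "11111110",
        "11111111", "11111111", "11111111", "01111111"]
image = [[int(ch) for ch in row] for row in rows]

tree = build_bottom_up(peano_order(image))
print(tree)
```

Because pure siblings collapse as soon as they are seen, a whole pure quadrant ends up as a single bit in the root's child list.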

Some aspects of 2-D P-trees:

- Tree levels (going down): 3, 2, 1, 0, with purity factors of 4^3, 4^2, 4^1, 4^0 respectively. In the example, the level-3 (root) node is 1001.
- Fan-out = 2^dimension = 2^2 = 4.
- Node ID (NID): e.g., for the pixel at (7, 1) = (111, 001).
- ROOT-COUNT = sum over levels of level-sum * level-purity-factor. Root Count = 7 * 4^0 + 4 * 4^1 + 2 * 4^2 = 55.
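The root-count rule can be checked with a small sketch that counts the pure-1 nodes at each level and weights each count by the purity factor 4^level. The 8x8 image is hypothetical illustration data chosen so that its total matches the root count of 55 quoted on the slide; the function name `level_sums` is also made up for this sketch.

```python
def level_sums(img, r0=0, c0=0, size=None, sums=None):
    """Count pure-1 nodes per level; a level-k node covers 4**k cells."""
    if size is None:
        size, sums = len(img), {}
    cells = [img[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    level = size.bit_length() - 1             # side 1 -> level 0, side 8 -> level 3
    if all(cells):
        sums[level] = sums.get(level, 0) + 1  # pure-1 node: count it, stop
    elif any(cells):
        h = size // 2                         # mixed node: recurse on quarters
        for dr, dc in ((0, 0), (0, h), (h, 0), (h, h)):
            level_sums(img, r0 + dr, c0 + dc, h, sums)
    return sums


# Illustrative 8x8 bit image (hypothetical data, in raster order).
rows = ["11111100", "11111000", "11111100", "11111110",
        "11111111", "11111111", "11111111", "01111111"]
image = [[int(ch) for ch in row] for row in rows]

sums = level_sums(image)
root_count = sum(count * 4 ** level for level, count in sums.items())
print(sums, root_count)
```

The weighted sum agrees with a direct count of the 1-bits in the image, which is the point of the formula: the count is available from the compressed tree without visiting every cell.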

3-Dimensional Ptrees: Top-down construction of the 3-dimensional Peano-ordered P-tree: record the truth of the universal predicate pure1 in a fanout=8 tree, recursively on eighths (1/2^3 subsets), until purity is achieved. Bottom-up construction of the 3-dimensional P-tree is done using a Peano (in-order) traversal of a fanout=8, log_8(64) = 2 level tree, collapsing pure siblings as we go.

Situation space: CEASR bio-agent detector (uses 3-D Ptrees)

Suppose a biological agent is sensed by nano-sensors at a position in the situation space, and that position corresponds to a single 1-bit position in the cutaway view. All other positions contain a 0-bit, i.e., the level of bio-agent detected by the nano-sensors in each of the other 63 cells is below a danger threshold.

[Figure: cutaway view of the 64-cell situation space containing one 1-bit]

ONE tiny 3-D P-tree can represent this "bio-situation" completely. It is constructed (bottom up) as a fan-out=8, 3-D P-tree, as follows. We first capture the data in the 1st octant (forward-upper-left), then move to the next octant (forward-upper-right). We can save time by noting that all the remaining 56 cells (in the 7 other octants) contain all 0s. Each of the next 7 octants will produce eight 0s at the leaf level (8 pure-0 siblings), which will collapse to a 0 at level-1. So we proceed an octant at a time (rather than a cell at a time).

This entire situation can be transmitted to a personal display unit as merely two bytes of data plus their two NIDs. For the NID, use [level, global_level_offset] rather than [local_segment_offset,…local_segment_offset]. So assume that every node not sent is all 0s and that, in any 13-bit node segment sent (only "mixed" segments need be sent), the 1st 2 bits are the level, the next 3 bits are the global_level_offset within that level (i.e., 0..7), and the final 8 bits are the node's data. Then the complete situation can be transmitted as 13 bits.

If the situation has (2^n)^3 cells (n=2 above), it will take only log_2(n) blue, 3n-3 green, and 8 red bits. So even if there are (2^8)^3 = 2^24 ~ 16,000,000 cells, we transmit merely 3+21+8 = 32 bits.
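The 13-bit segment layout described above (2 bits of level, 3 bits of global_level_offset, 8 bits of node data) can be sketched as simple bit packing. The function names and the sample data byte are hypothetical: the transcript does not preserve which of the eight cells in the first octant holds the 1-bit.

```python
def pack_segment(level, offset, data):
    """Pack [level (2 bits) | global_level_offset (3 bits) | node data (8 bits)]."""
    if not (0 <= level < 4 and 0 <= offset < 8 and 0 <= data < 256):
        raise ValueError("field out of range for the 13-bit layout")
    return (level << 11) | (offset << 8) | data


def unpack_segment(segment):
    """Split a 13-bit segment back into (level, offset, data)."""
    return (segment >> 11) & 0b11, (segment >> 8) & 0b111, segment & 0xFF


# Hypothetical example: the mixed node at level 1, offset 0 (first octant),
# whose 8 children contain a single 1-bit (position chosen arbitrarily).
segment = pack_segment(level=1, offset=0, data=0b01000000)
print(format(segment, "013b"), unpack_segment(segment))
```

A receiver that assumes every unsent node is all 0s can rebuild the whole situation space from this one segment.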

Basic, Value and Tuple Ptrees

- Basic Ptrees for a 7-column, 8-bit table, e.g., P_{11}, P_{12}, …, P_{18}, P_{21}, …, P_{28}, …, P_{71}, …, P_{78} (subscripts: target attribute, target bit position).
- Value Ptrees (predicate: quad is purely target value in target attribute), e.g., P_{1,5} = P_{1,101} = P_{11} AND P_{12}' AND P_{13} (subscripts: target attribute, target value).
- Tuple Ptrees (predicate: quad is purely target tuple), e.g., P_{(1,2,3)} = P_{(001,010,111)} = P_{1,001} AND P_{2,010} AND P_{3,111}.
- Rectangle Ptrees (predicate: quad is purely in target rectangle, i.e., a product of intervals), e.g., P_{([1,3], , [0,2])} = (P_{1,1} OR P_{1,2} OR P_{1,3}) AND (P_{3,0} OR P_{3,1} OR P_{3,2}).
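The AND/OR combinations above can be sketched on uncompressed bit vectors (not on compressed trees, which this sketch does not model). The three 3-bit attribute columns are hypothetical data, and the function names are made up for illustration. Bit position 1 is taken to be the high-order bit, matching P_{1,101} = P_{11} AND P_{12}' AND P_{13}.

```python
WIDTH = 3  # bits per attribute in this small example


def basic(column, pos):
    """Basic bit vector P_{attr,pos}; pos 1 is the high-order bit."""
    return [(v >> (WIDTH - pos)) & 1 for v in column]


def value(column, target):
    """Value vector P_{attr,target}: AND of basic vectors or their complements."""
    mask = [1] * len(column)
    for pos in range(1, WIDTH + 1):
        want = (target >> (WIDTH - pos)) & 1
        mask = [m & (b if want else 1 - b) for m, b in zip(mask, basic(column, pos))]
    return mask


def tuple_mask(columns, targets):
    """Tuple vector: AND of one value vector per attribute."""
    masks = [value(c, t) for c, t in zip(columns, targets)]
    return [int(all(bits)) for bits in zip(*masks)]


def interval(column, lo, hi):
    """OR of the value vectors for every value in [lo, hi]."""
    masks = [value(column, v) for v in range(lo, hi + 1)]
    return [int(any(bits)) for bits in zip(*masks)]


# Hypothetical table with three 3-bit attributes, one list per column.
a1 = [1, 5, 3, 1, 5, 2]
a2 = [2, 2, 6, 2, 0, 2]
a3 = [7, 0, 2, 7, 1, 4]

p_1_5 = value(a1, 5)
p_tuple = tuple_mask([a1, a2, a3], (1, 2, 7))
p_rect = [x & y for x, y in zip(interval(a1, 1, 3), interval(a3, 0, 2))]
print(p_1_5, p_tuple, p_rect)
```

The rectangle mask leaves the second attribute unconstrained, mirroring the empty middle slot in P_{([1,3], , [0,2])}.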

Horizontal Processing of Vertical Structures for Record-based Workloads

- For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing.
- For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, where there is no reconstructive post-processing.

[Figure: relation R(A1 A2 A3 A4) stored as vertical bit slices R11, R12, R13, R21, R22, R23, R31, R32, R33, R41, R42, R43]

Thank you.