Vertical Data 2

In this example database (which is used throughout these notes), there are two entities: Students (a student has a number, S#, a name, SNAME, and a gender, GEN) and Courses (a course has a number, C#, a name, CNAME, the State where the course is offered, ST, and a TERM). There is ONE relationship, Enrollments (a student, S#, enrolls in a class, C#, and gets a grade in that class, GR).

The horizontal Education Database consists of 3 files, each of which consists of a number of instances of identically structured horizontal records:

Courses
C#|CNAME|ST|TERM
0 |BI   |ND|F
1 |DB   |ND|S
2 |DM   |NJ|S
3 |DS   |ND|F
4 |SE   |NJ|S
5 |AI   |ND|F

Student
S#|SNAME|GEN
0 |CLAY |M
1 |THAD |M
2 |QING |F
3 |AMAL |M
4 |BARB |F
5 |JOAN |F

Enrollments
S#|C#|GR
0 |1 |B
0 |0 |A
3 |1 |A
3 |3 |B
1 |3 |B
1 |0 |D
2 |2 |D
2 |3 |A
4 |4 |B
5 |5 |B

An Education Database Example

We have already talked about the process of structuring data in a horizontal database (e.g., develop an Entity-Relationship diagram or ER diagram, etc.). In this case:

[ER diagram: Student (S#, SNAME, GEN), Enrollments (S#, C#, GR), Courses (C#, CNAME, ST, TERM)]

What is the process of structuring this data into a vertical database? This is an open question. Much research is needed on that issue (great term paper topics exist here!). We will discuss this a little more on the next slide.

One way to begin to vertically structure this data is:

1. Code some attributes in binary.

For numeric fields, we have used standard binary encoding, shown to the right of each field value encoded (on the slides, red indicates the high-order bit, green the middle bit and blue the low-order bit). For gender, F=1 and M=0. For term, Fall=0, Spring=1. For grade, A=11, B=10, C=01, D=00 (which could be called GPA encoding?). We have abbreviated STUDENT to S, COURSE to C and ENROLLMENT to E.

S:S#___|SNAME|GEN
  0 000|CLAY |M 0
  1 001|THAD |M 0
  2 010|QING |F 1
  3 011|BARB |F 1
  4 100|AMAL |M 0
  5 101|JOAN |F 1

C:C#___|CNAME|ST|TERM
  0 000|BI   |ND|F 0
  1 001|DB   |ND|S 1
  2 010|DM   |NJ|S 1
  3 011|DS   |ND|F 0
  4 100|SE   |NJ|S 1
  5 101|AI   |ND|F 0

E:S#___|C#___|GR
  0 000|1 001|B 10
  0 000|0 000|A 11
  3 011|1 001|A 11
  3 011|3 011|D 00
  1 001|3 011|D 00
  1 001|0 000|B 10
  2 010|2 010|B 10
  2 010|3 011|A 11
  4 100|4 100|B 10
  5 101|5 101|B 10

The above encodings seem natural. But how did we decide which attributes are to be encoded and which are not? As a term paper topic, that would be one of the main issues to research.

Note that we have decided not to encode names. Our rough reasoning (not researched) is that there would be little advantage and it would be difficult (e.g., if name is a CHAR(25) datatype, then in binary that's 25*8 = 200 bits!).

Note that we have decided not to encode State. That may be a mistake! Especially in this case, since it would be so easy (only 2 States ever? so 1 bit), but more generally there could be 50 States, and that would mean at least 6 bits.

2. Another binary encoding scheme (which can be used for numeric and non-numeric fields) is value map or bitmap encoding.

The concept is simple. For each possible value, a, in the domain of the attribute, A, we encode 1=true and 0=false for the predicate A=a. The resulting single bit column becomes a map where a 1 means that row has A-value = a and a 0 means that row (or tuple) has an A-value which is not a.

There is a wealth of existing research on bit encoding. There is also quite a bit of research on vertical databases. The first commercial vertical database, Vertica, has even been announced (check it out by Googling that name). Vertica was created by the same person, Mike Stonebraker, who created one of the first relational databases, Ingres.
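As a small illustration of the value map (bitmap) encoding described above, here is a minimal Python sketch applied to the State column of COURSE. The function name `value_maps` is made up for this example, and the bit columns are written as strings only for readability.

```python
def value_maps(column):
    """For each distinct value a, build the bit column for the predicate A = a."""
    return {
        value: "".join("1" if cell == value else "0" for cell in column)
        for value in sorted(set(column))
    }

# State (ST) column of COURSE, in row order C# = 0..5
ST = ["ND", "ND", "NJ", "ND", "NJ", "ND"]

maps = value_maps(ST)
for value, bits in maps.items():
    print(value, bits)
```

With only two States the two maps are complements of each other, which is why a single bit would also be enough here.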

Method-1 for vertically structuring the Educational Database

Starting from the encoded tables S, C and E, the M1 VDBMS would then be stored as one bit column (Ptree) per encoded bit position, with the unencoded attributes kept as value columns:

S.S#    high 000011   middle 001100   low 010101
S.SNAME CLAY THAD QING BARB AMAL JOAN
S.GEN   001101

C.C#    high 000011   middle 001100   low 010101
C.CNAME BI DB DM DS SE AI
C.ST    ND ND NJ ND NJ ND
C.TERM  011010

E.S#    high 0000000011   middle 0011001100   low 0011110001
E.C#    high 0000000011   middle 0001101100   low 1011100101
E.GR    high 1110011111   low 0110000100

Rather than having to label these (0-dimensional, uncompressed) Ptrees on the slides, we will remember the color code, namely: purple border = S#; brown border = C#; light blue border = GR; yellow border = GEN; black border = TERM. Red bits are the high-order bit (on the left); green bits are the middle bit; blue bits are the low-order bit (on the right). We will be able to distinguish whether an S# Ptree is from STUDENT or ENROLLMENT by its length. Similarly, we will be able to distinguish the C# Ptrees of COURSE from those of ENROLLMENT.
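To make the Method-1 decomposition concrete, the following Python sketch slices the encoded ENROLLMENT columns into uncompressed bit columns. The helper name `bit_slices` and the use of strings for the bit columns are assumptions made for illustration, not part of the notes.

```python
def bit_slices(values, width):
    """Return {bit position: bit column}; position width-1 is the high-order bit."""
    return {
        k: "".join(str((v >> k) & 1) for v in values)
        for k in range(width - 1, -1, -1)
    }

# ENROLLMENT columns in row order, taken from the encoded E table
E_S = [0, 0, 3, 3, 1, 1, 2, 2, 4, 5]
E_C = [1, 0, 1, 3, 3, 0, 2, 3, 4, 5]
GRADE_CODE = {"A": 0b11, "B": 0b10, "C": 0b01, "D": 0b00}
E_GR = [GRADE_CODE[g] for g in "BAADDBBABB"]

s_slices = bit_slices(E_S, 3)
c_slices = bit_slices(E_C, 3)
gr_slices = bit_slices(E_GR, 2)

for name, slices in (("E.S#", s_slices), ("E.C#", c_slices), ("E.GR", gr_slices)):
    for position, column in slices.items():
        print(name, "bit", position, column)
```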

SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D

For the selection, ENROLL.gr = D (or E.g=D), we create a Ptree mask: EM = E'.g1 AND E'.g2 (because we want both bits to be zero for D).

Complement of the high-order GR Ptree:  0001100000
Complement of the low-order GR Ptree:   1001111011
EM (AND of the two):                    0001100000

[Figure: the S (S#, SNAME, GEN) and E (S#, C#, GR) Ptrees from the Method-1 slide, with the mask EM alongside]
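The selection mask can be reproduced in a few lines of Python. This is only a sketch: bit columns are held as plain lists, and `complement` and `ptree_and` are hypothetical helper names.

```python
def complement(bits):
    """Bitwise complement of an uncompressed bit column."""
    return [1 - b for b in bits]

def ptree_and(*columns):
    """Bitwise AND of any number of equal-length bit columns."""
    return [int(all(row)) for row in zip(*columns)]

# GR bit columns of ENROLLMENT (A=11, B=10, C=01, D=00)
gr_high = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
gr_low = [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]

# D = 00, so both bits must be zero
EM = ptree_and(complement(gr_high), complement(gr_low))
print("".join(map(str, EM)))
```

The two 1-bits in EM mark the only E-tuples that the join has to visit.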

SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D

For the join, S.s = E.s, we sequence through the masked E tuples (EM = 0001100000) and, for each, we create a mask for the matching S tuples, concatenate them and output the concatenation.

For S# = (0 1 1), mask S-tuples with P'S#,2 ^ P S#,1 ^ P S#,0:

P'S#,2   111100
P S#,1   001100
P S#,0   010101
mask     000100

Concatenate and output (BARB, D).

SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D

Continuing the join, S.s = E.s, over the masked E tuples (EM = 0001100000): the first masked E-tuple, S# = (0 1 1), is matched with P'S#,2 ^ P S#,1 ^ P S#,0 and outputs (BARB, D).

For S# = (0 0 1), mask S-tuples with P'S#,2 ^ P'S#,1 ^ P S#,0:

P'S#,2   111100
P'S#,1   110011
P S#,0   010101
mask     010000

Concatenate and output (THAD, D).
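The join-after-selection steps above can be sketched in Python as follows. The helper `value_mask` and the list-based bit columns are assumptions made for this example; the data is the encoded S and E tables.

```python
S_NAME = ["CLAY", "THAD", "QING", "BARB", "AMAL", "JOAN"]
# S# bit columns of STUDENT, keyed by bit position (2 = high-order)
S_SLICES = {
    2: [0, 0, 0, 0, 1, 1],
    1: [0, 0, 1, 1, 0, 0],
    0: [0, 1, 0, 1, 0, 1],
}
E_S = [0, 0, 3, 3, 1, 1, 2, 2, 4, 5]   # S# of each E-tuple
E_GR = list("BAADDBBABB")              # grade of each E-tuple
EM = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]    # selection mask for E.g = D

def value_mask(slices, value):
    """AND each bit column (or its complement) according to the bits of value."""
    mask = [1] * len(next(iter(slices.values())))
    for position, column in slices.items():
        want = (value >> position) & 1
        mask = [m & (b if want else 1 - b) for m, b in zip(mask, column)]
    return mask

result = []
for row, selected in enumerate(EM):
    if not selected:
        continue  # skip E-tuples that fail the selection
    s_mask = value_mask(S_SLICES, E_S[row])
    for s_row, hit in enumerate(s_mask):
        if hit:
            result.append((S_NAME[s_row], E_GR[row]))

print(result)
```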

SELECT S.n, E.g FROM S, E WHERE S.s=E.s

Can the join, S.s = E.s, be sped up? Since there is no selection involved this time, potentially we would have to visit every E-tuple, mask the matching S-tuples, concatenate and output. We can speed that up by masking all common E-tuples at once and retaining partial masks as we go.

For S# = (0 0 0), mask S-tuples with P'S#,2 ^ P'S#,1 ^ P'S#,0:

P'S#,2   111100
P'S#,1   110011
P'S#,0   101010
mask     100000

For S# = (0 0 0), mask E-tuples with P'S#,2 ^ P'S#,1 ^ P'S#,0:

P'S#,2   1111111100
P'S#,1   1100110011
P'S#,0   1100001110
mask     1100000000

Concatenate and output (CLAY, B) and (CLAY, A).

Continue down to the next E-tuple not yet covered and do the same.
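One possible reading of "masking all common E-tuples and retaining partial masks" is sketched below in Python: each S# value gets a single E-mask, and the AND of the higher-order bit columns is cached so that it is shared between S# values with a common prefix. The caching scheme and the name `prefix_mask` are assumptions for illustration, not a method stated in the notes.

```python
from functools import lru_cache

S_NUM = [0, 1, 2, 3, 4, 5]
S_NAME = ["CLAY", "THAD", "QING", "BARB", "AMAL", "JOAN"]
# E.S# bit columns: high-order, middle, low-order
E_SLICES = ("0000000011", "0011001100", "0011110001")
E_GR = "BAADDBBABB"

@lru_cache(maxsize=None)
def prefix_mask(prefix):
    """E-mask for tuples whose leading S# bits equal prefix (high-order first)."""
    if not prefix:
        return (1,) * len(E_GR)
    parent = prefix_mask(prefix[:-1])        # retained partial mask
    column = E_SLICES[len(prefix) - 1]
    want = str(prefix[-1])
    return tuple(m if bit == want else 0 for m, bit in zip(parent, column))

def join():
    out = []
    for s_num, name in zip(S_NUM, S_NAME):
        bits = tuple((s_num >> k) & 1 for k in (2, 1, 0))
        e_mask = prefix_mask(bits)           # all E-tuples with this S# at once
        out.extend((name, E_GR[row]) for row, hit in enumerate(e_mask) if hit)
    return out

result = join()
print(result)
```

Here the partial mask for the prefix (0, 0) is computed once and reused for both S# = 000 and S# = 001.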

2-Dimensional P-trees: a natural choice for, e.g., 2-D image files.

For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling "Peano" ordering has advantages for fast processing, yet compresses well in the presence of spatial continuity.

For an image bit-file (e.g., the hi-order bit of the red color band of an image):

1111110011111000111111001111111011110000111100001111000001110000

Which, in spatial raster order, is:

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0

Top-down construction of its 2-dimensional Peano ordered P-tree is done by recording the truth of the universal predicate "pure 1" in a fanout=4 tree, recursively on quarters (1/2^2 subsets), until purity is achieved (Pure-1? False=0):

0
1000
0010 1101
1110 0010 1101
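The top-down construction can be sketched in Python as below. The nested-list tree representation and the function names `build` and `levels` are assumptions made for this example; quadrants are visited in Peano (Z) order: upper-left, upper-right, lower-left, lower-right.

```python
BITS = "1111110011111000111111001111111011110000111100001111000001110000"
N = 8
image = [[int(BITS[r * N + c]) for c in range(N)] for r in range(N)]

def build(r, c, size):
    """Return 1 (pure-1), 0 (pure-0) or a list of four child quadrants (mixed)."""
    cells = [image[i][j] for i in range(r, r + size) for j in range(c, c + size)]
    if all(cells):
        return 1
    if not any(cells):
        return 0
    half = size // 2
    return [
        build(r, c, half),                # upper-left
        build(r, c + half, half),         # upper-right
        build(r + half, c, half),         # lower-left
        build(r + half, c + half, half),  # lower-right
    ]

def levels(tree):
    """Pure-1 bit of every node, level by level (mixed and pure-0 nodes give 0)."""
    out, frontier = [], [tree]
    while frontier:
        out.append("".join("1" if node == 1 else "0" for node in frontier))
        frontier = [child for node in frontier if isinstance(node, list) for child in node]
    return out

tree = build(0, 0, N)
for row in levels(tree):
    print(row)
```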

Bottom-up construction of the 2-Dimensional P-tree is done using a Peano (in-order) traversal of a fanout=4 tree with log_4(64) = 3 levels below the root (4 levels in all), collapsing pure siblings as we go.

From here on we will take 4 bit positions at a time, for efficiency.

[Figure: animation of the bottom-up construction over the 8x8 image, starting from the upper-left quadrant]

Some aspects of 2-D P-trees:

Fan-out = 2^dimension = 2^2 = 4.

Tree levels (going down): 3, 2, 1, 0, with purity-factors of 4^3, 4^2, 4^1, 4^0 respectively (shown here for an image whose lower-right quadrant is also pure-1):

0                level-3 (pure = 4^3)
1001             level-2 (pure = 4^2)
0010 1101        level-1 (pure = 4^1)
1110 0010 1101   level-0 (pure = 4^0)

Node ID (NID) = 2.2.3: the quadrant numbers (0, 1, 2, 3) followed from the root down to the node. This leaf is the pixel at (7, 1) = (111, 001); interleaving the row and column bits gives 10.10.11 = 2.2.3.

ROOT-COUNT = sum over the levels of (level-sum * level-purity-factor).

Root Count = 7 * 4^0 + 4 * 4^1 + 2 * 4^2 = 55
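Both calculations on this slide are easy to check with a short Python sketch. The function names `root_count` and `nid_to_pixel` are hypothetical; the level strings are the ones listed for this tree.

```python
FANOUT = 4
# Pure-1 bits per level, root first (level-3 down to level-0)
LEVELS = ["0", "1001", "00101101", "111000101101"]

def root_count(levels):
    """Sum of (number of pure-1 nodes) * (purity factor) over all levels."""
    top = len(levels) - 1
    return sum(
        row.count("1") * FANOUT ** (top - depth)
        for depth, row in enumerate(levels)
    )

def nid_to_pixel(nid):
    """De-interleave a dotted NID: each digit carries one row bit and one column bit."""
    row = col = 0
    for digit in map(int, nid.split(".")):
        row = (row << 1) | (digit >> 1)
        col = (col << 1) | (digit & 1)
    return row, col

print(root_count(LEVELS))
print(nid_to_pixel("2.2.3"))
```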

3-Dimensional Ptrees:

Top-down construction of a 3-dimensional Peano ordered P-tree: record the truth of the universal predicate pure1 in a fanout=8 tree, recursively on eighths (1/2^3 subsets), until purity is achieved.

Bottom-up construction of the 3-Dimensional P-tree is done using a Peano (in-order) traversal of a fanout=8, log_8(64) = 2 level tree, collapsing pure siblings as we go.

Situation space: CEASR bio-agent detector (uses 3-D Ptrees)

Suppose a biological agent is sensed by nano-sensors at a position in the situation space, and that position corresponds to this 1-bit position in this cutaway view. All other positions contain a 0-bit, i.e., the level of bio-agent detected by the nano-sensors in each of the other 63 cells is below a danger threshold.

[Figure: cutaway view of the 4x4x4 situation space with a single 1-bit cell]

ONE tiny 3-D P-tree can represent this "bio-situation" completely. It is constructed (bottom up) as a fan-out=8, 3-D P-tree, as follows.

Start with the 1st octant (forward-upper-left), whose eight leaf bits are 0000 0001. We have now captured the data in the 1st octant. Moving to the next octant (forward-upper-right):

We can save time by noting that all the remaining 56 cells (in the 7 other octants) contain all 0s. Each of the next 7 octants will produce eight 0s at the leaf level (8 pure-0 siblings), each of which will collapse to a 0 at level-1. So we proceed an octant at a time (rather than a cell at a time).

This entire situation can be transmitted to a personal display unit as merely two bytes of data plus their two NIDs.

For the NID, use [level, global_level_offset] rather than [local_segment_offset, ..., local_segment_offset]. Assume every node not sent is all 0s (we only need to send "mixed" segments), and that in any 13-bit node segment sent, the 1st 2 bits are the level, the next 3 bits are the global_level_offset within that level (i.e., 0..7), and the final 8 bits are the node's data. Then the complete situation can be transmitted as these 13 bits:

01 000 0000 0001

If the situation has (2^n)^3 cells (n=2 above), it will take only log_2(n) blue (level) bits, 3n-3 green (offset) bits to address one of the 2^(3n-3) nodes, and 8 red (data) bits. So even if there are (2^8)^3 = 2^24 ~ 16,000,000 cells, we transmit merely 3+21+8 = 32 bits.
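The 13-bit node segment described above (2 level bits, 3 offset bits, 8 data bits) can be packed and unpacked as in this Python sketch. The function names and the range checks are assumptions made for illustration.

```python
def pack_node(level, offset, data):
    """Pack [level (2 bits) | global_level_offset (3 bits) | node data (8 bits)]."""
    if not (0 <= level < 4 and 0 <= offset < 8 and 0 <= data < 256):
        raise ValueError("field out of range for the 13-bit layout")
    return (level << 11) | (offset << 8) | data

def unpack_node(word):
    """Inverse of pack_node: return (level, offset, data)."""
    return word >> 11, (word >> 8) & 0b111, word & 0xFF

# The single mixed node of the bio-situation: level 1, offset 0, data 0000 0001
word = pack_node(1, 0, 0b00000001)
segment = format(word, "013b")
print(segment[:2], segment[2:5], segment[5:9], segment[9:])
```

A receiver would treat every node it was not sent as all 0s, as the slide assumes.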

Basic, Value and Tuple Ptrees

Basic Ptrees (indexed by target attribute and target bit position), for a 7 column, 8 bit table:
e.g., P11, P12, ..., P18, P21, ..., P28, ..., P71, ..., P78

Value Ptrees (predicate: quad is purely target value in target attribute), built from basic Ptrees with AND:
e.g., P1,5 = P1,101 = P11 AND P12' AND P13

Tuple Ptrees (predicate: quad is purely target tuple), built from value Ptrees with AND:
e.g., P(1, 2, 3) = P(001, 010, 011) = P1,001 AND P2,010 AND P3,011

Rectangle Ptrees (predicate: quad is purely in target rectangle, i.e., a product of intervals), built from value Ptrees with AND/OR:
e.g., P([1,3], , [0,2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2)
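The way value and rectangle Ptrees are derived from basic Ptrees can be sketched in Python on uncompressed bit columns. The sample columns A1 and A3 below are made-up 3-bit values, and all function names are hypothetical.

```python
def basic_ptrees(values, width=3):
    """Basic bit columns of one attribute, high-order bit first."""
    return [[(v >> k) & 1 for v in values] for k in range(width - 1, -1, -1)]

def value_ptree(basic, value):
    """AND of each basic column, complemented where the value bit is 0."""
    width = len(basic)
    mask = [1] * len(basic[0])
    for i, column in enumerate(basic):
        want = (value >> (width - 1 - i)) & 1
        mask = [m & (b if want else 1 - b) for m, b in zip(mask, column)]
    return mask

def interval_ptree(basic, low, high):
    """OR of the value Ptrees for every value in [low, high]."""
    mask = [0] * len(basic[0])
    for value in range(low, high + 1):
        mask = [m | b for m, b in zip(mask, value_ptree(basic, value))]
    return mask

# Hypothetical sample columns (not from the notes)
A1 = [1, 5, 3, 2, 5, 0, 7]
A3 = [2, 0, 6, 1, 5, 0, 2]
P1, P3 = basic_ptrees(A1), basic_ptrees(A3)

p1_5 = value_ptree(P1, 5)  # P1,5 = P11 AND P12' AND P13
rectangle = [a & b for a, b in zip(interval_ptree(P1, 1, 3), interval_ptree(P3, 0, 2))]
print(p1_5, rectangle)
```

The unconstrained middle attribute of the rectangle simply contributes no factor to the final AND.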

Horizontal Processing of Vertical Structures for Record-based Workloads

For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post processing?

R( A1  A2  A3  A4 )
   010 111 110 001
   011 111 110 000
   010 110 101 001
   010 111 101 111
   101 010 001 100
   010 010 001 101
   111 000 001 100

The same relation as vertical bit columns (one column per bit position):

R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
 0   1   0   1   1   1   1   1   0   0   0   1
 0   1   1   1   1   1   1   1   0   0   0   0
 0   1   0   1   1   0   1   0   1   0   0   1
 0   1   0   1   1   1   1   0   1   1   1   1
 1   0   1   0   1   0   0   0   1   1   0   0
 0   1   0   0   1   0   0   0   1   1   0   1
 1   1   1   0   0   0   0   0   1   1   0   0

For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, where there is no reconstructive post processing?

Thank you.
