3. Vertical Data LECTURE 2 Section 3.

1 3. Vertical Data LECTURE 2 Section 3

2 An Education Database Example
In this example database (which is used throughout these notes), there are two entities, Students (a student has a number, S#, a name, SNAME, and a gender, GEN) and Courses (a course has a number, C#, a name, CNAME, the State where the course is offered, ST, and a TERM), and ONE relationship, Enrollments (a student, S#, enrolls in a class, C#, and gets a grade in that class, GR).

The horizontal Education Database consists of 3 files, each of which consists of a number of instances of identically structured horizontal records:

Enrollments
S#|C#|GR
0 |1 |B
0 |0 |A
3 |1 |A
3 |3 |B
1 |3 |B
1 |0 |D
2 |2 |D
2 |3 |A
4 |4 |B
5 |5 |B

Student
S#|SNAME|GEN
0 |CLAY | M
1 |THAD | M
2 |QING | F
3 |AMAL | M
4 |BARB | F
5 |JOAN | F

Courses
C#|CNAME|ST|TERM
0 |BI |ND| F
1 |DB |ND| S
2 |DM |NJ| S
3 |DS |ND| F
4 |SE |NJ| S
5 |AI |ND| F

We have already talked about the process of structuring data in a horizontal database (e.g., develop an Entity-Relationship diagram or ER diagram, etc.). In this case:
[ER diagram: entity Student (S#, SNAME, GEN) and entity Courses (C#, CNAME, ST, TERM), linked by the relationship Enrollments (GR).]

What is the process of structuring this data into a vertical database? This is an open question. Much research is needed on that issue! (Great term paper topics exist here!) We will discuss this a little more on the next slide.

3 One way to begin to vertically structure this data is:
1. Code some attributes in binary. For numeric fields, we have used standard binary encoding, shown to the right of each encoded field value (on the slide, red indicates the high-order bit, green the middle bit and blue the low-order bit). For gender, F=1 and M=0. For term, Fall=0, Spring=1. For grade, A=11, B=10, C=01, D=00 (which could be called GPA encoding?). We have abbreviated STUDENT to S, COURSE to C and ENROLLMENT to E.

S:S#___|SNAME|GEN
0 000|CLAY |M 0
1 001|THAD |M 0
2 010|QING |F 1
3 011|BARB |F 1
4 100|AMAL |M 0
5 101|JOAN |F 1

C:C#___|CNAME|ST|TERM
0 000|BI |ND|F 0
1 001|DB |ND|S 1
2 010|DM |NJ|S 1
3 011|DS |ND|F 0
4 100|SE |NJ|S 1
5 101|AI |ND|F 0

E:S#___|C#___|GR
0 000|1 001|B 10
0 000|0 000|A 11
3 011|1 001|A 11
3 011|3 011|D 00
1 001|3 011|D 00
1 001|0 000|B 10
2 010|2 010|B 10
2 010|3 011|A 11
4 100|4 100|B 10
5 101|5 101|B 10

The above encoding seems natural. But how did we decide which attributes are to be encoded and which are not? As a term paper topic, that would be one of the main issues to research.

Note that we have decided not to encode names. Our rough reasoning (not researched) is that there would be little advantage and it would be difficult (e.g., if name is a CHAR(25) datatype, then in binary that's 25*8 = 200 bits!).

Note also that we have decided not to encode State. That may be a mistake! Especially in this case, since it would be so easy (only 2 States ever? so 1 bit), but more generally there could be 50 States, and that would mean at least 6 bits.

2. Another binary encoding scheme (which can be used for numeric and non-numeric fields) is value map or bitmap encoding. The concept is simple. For each possible value, a, in the domain of the attribute, A, we encode 1=true and 0=false for the predicate A=a. The resulting single bit column becomes a map where a 1 means that the row has A-value = a and a 0 means that the row or tuple has an A-value which is not a.

There is a wealth of existing research on bit encoding. There is also quite a bit of research on vertical databases. There is even a commercial vertical database, announced under the name Vertica (check it out by Googling that name). Vertica was created by the same guy, Mike Stonebraker, who created one of the first Relational Databases, Ingres.
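The two encoding schemes above can be sketched in a few lines of Python. This is only an illustration: the helper names to_bits and value_map are hypothetical, and only the code tables (for example A=11, B=10, C=01, D=00) and the ST column come from the slide.

```python
# Illustrative sketch; helper names are assumptions, not part of the notes.
GRADE_CODE = {"A": 0b11, "B": 0b10, "C": 0b01, "D": 0b00}  # "GPA encoding" from the slide


def to_bits(value, width):
    """Standard binary encoding, high-order bit first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def value_map(column):
    """Bitmap (value map) encoding: one bit column per value a, holding the truth of A = a."""
    return {a: [1 if v == a else 0 for v in column] for a in sorted(set(column))}


st_column = ["ND", "ND", "NJ", "ND", "NJ", "ND"]  # ST column of COURSE

print(to_bits(5, 3))                # S# = 5 -> [1, 0, 1]
print(to_bits(GRADE_CODE["B"], 2))  # grade B -> [1, 0]
print(value_map(st_column))         # one bitmap for ND, one for NJ
```

With only two States, the ND bitmap is simply the complement of the NJ bitmap, which is why a single bit would be enough for ST in this example.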

4 Method-1 for vertically structuring the Educational Database
Starting from the encoded S, C and E tables of the previous slide, the M1 VDBMS would then be stored as:
[Figure: one vertical bit column (0-dimensional uncompressed Ptree) per encoded bit position of S#, GEN, C#, TERM and GR, with SNAME (CLAY, THAD, QING, BARB, AMAL, JOAN), CNAME (BI, DB, DM, DS, SE, AI) and ST (ND, NJ) kept as unencoded columns.]

Rather than have to label these (0-dimensional uncompressed) Ptrees, we will remember the color code, namely: purple border = S#; brown border = C#; light blue border = GR; yellow border = GEN; black border = TERM. Red bits are high-order bits (on the left); green bits are middle bits; blue bits are low-order bits (on the right).

We will be able to distinguish whether an S# Ptree is from STUDENT or ENROLLMENT by its length. Similarly, we will be able to distinguish the C# Ptrees of COURSE from those of ENROLLMENT.
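As a rough sketch of this vertical layout, the following Python splits encoded columns into one bit column per bit position. Plain lists stand in for the uncompressed Ptrees, and the function name bit_slices is an assumption made for the example; the data is the encoded E table and the S# column of S from the previous slide.

```python
# ENROLLMENT rows as (S#, C#, GR code), taken from the encoded E table.
E = [(0, 1, 0b10), (0, 0, 0b11), (3, 1, 0b11), (3, 3, 0b00), (1, 3, 0b00),
     (1, 0, 0b10), (2, 2, 0b10), (2, 3, 0b11), (4, 4, 0b10), (5, 5, 0b10)]
STUDENT_NUMBERS = [0, 1, 2, 3, 4, 5]  # S# column of STUDENT


def bit_slices(values, width):
    """Return one bit column per bit position, high-order position first."""
    return [[(v >> shift) & 1 for v in values] for shift in range(width - 1, -1, -1)]


e_s = bit_slices([row[0] for row in E], 3)   # S# Ptrees of ENROLLMENT
s_s = bit_slices(STUDENT_NUMBERS, 3)         # S# Ptrees of STUDENT
e_gr = bit_slices([row[2] for row in E], 2)  # GR Ptrees of ENROLLMENT

print(len(e_s[0]), len(s_s[0]))  # 10 versus 6: the length tells the two S# sets apart
```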

5 SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D
For the selection, ENROLL.gr = D (or E.g=D), we create a Ptree mask: EM = E'.g1 AND E'.g2 (because we want both grade bits to be zero for D; the prime denotes the complement).
[Figure: the bit columns of S: S#___ | SNAME | GEN and E: S#___ | C#___ | GR, with the resulting mask EM alongside E.]
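A minimal sketch of this selection mask, using plain Python lists as uncompressed bit columns. The two grade columns are read off the encoded E table from slide 3; the names g_hi, g_lo, complement and p_and are assumptions for the example (the slide calls the two bit columns g1 and g2).

```python
g_hi = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]  # high-order grade bit of each E row
g_lo = [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]  # low-order grade bit of each E row


def complement(p):
    """Bitwise complement of a bit column (written with a prime on the slide)."""
    return [1 - b for b in p]


def p_and(p, q):
    """Bitwise AND of two bit columns of equal length."""
    return [a & b for a, b in zip(p, q)]


# D is encoded as 00, so both grade bits must be zero.
EM = p_and(complement(g_hi), complement(g_lo))
print(EM)  # 1s mark the E rows whose grade is D
```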

6 For S# = (0 1 1), mask S-tuples with P'S#,2 ^ PS#,1 ^ PS#,0
SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D
For the join, S.s = E.s, we sequence through the masked E tuples and, for each, we create a mask for the matching S tuples, concatenate them and output the concatenation.
[Figure: the bit columns of S: S#___ | SNAME | GEN and E: S#___ | C#___ | GR, with the mask EM.]
For S# = (0 1 1), mask S-tuples with P'S#,2 ^ PS#,1 ^ PS#,0. Concatenate and output (BARB, D).

7 For S# = (0 0 1), mask S-tuples with P'S#,2 ^ P'S#,1 ^ PS#,0
SELECT S.n, E.g FROM S, E WHERE S.s=E.s & E.g=D
For the join, S.s = E.s, we sequence through the masked E tuples and, for each, we create a mask for the matching S tuples, concatenate them and output the concatenation.
[Figure: the bit columns of S: S#___ | SNAME | GEN and E: S#___ | C#___ | GR, with the mask EM.]
For S# = (0 1 1), mask S-tuples with P'S#,2 ^ PS#,1 ^ PS#,0. Concatenate and output (BARB, D).
For S# = (0 0 1), mask S-tuples with P'S#,2 ^ P'S#,1 ^ PS#,0. Concatenate and output (THAD, D).
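The join-after-selection procedure walked through on these two slides can be sketched as follows. This is an illustration under assumptions, not the notes' implementation: Python lists stand in for uncompressed Ptrees, the helper names slices and mask_equal are hypothetical, and EM is the grade = D mask over the encoded E table from slide 3.

```python
SNAME = ["CLAY", "THAD", "QING", "BARB", "AMAL", "JOAN"]  # S rows, S# = 0..5
S_NUM = [0, 1, 2, 3, 4, 5]
E_S = [0, 0, 3, 3, 1, 1, 2, 2, 4, 5]                      # S# column of E
EM = [0, 0, 0, 1, 1, 0, 0, 0, 0, 0]                       # selection mask for E.g = D


def slices(values, width=3):
    """Bit columns of a numeric column, high-order position first."""
    return [[(v >> k) & 1 for v in values] for k in range(width - 1, -1, -1)]


def mask_equal(ptrees, bits):
    """AND together each bit column (or its complement where the wanted bit is 0)."""
    mask = [1] * len(ptrees[0])
    for p, bit in zip(ptrees, bits):
        mask = [m & (b if bit else 1 - b) for m, b in zip(mask, p)]
    return mask


PS, PE = slices(S_NUM), slices(E_S)
result = []
for row, selected in enumerate(EM):
    if not selected:
        continue
    bits = [p[row] for p in PE]   # S# of this E tuple, read from E's bit columns
    s_mask = mask_equal(PS, bits)  # e.g. (0 1 1) -> P' ^ P ^ P
    result += [(SNAME[i], "D") for i, hit in enumerate(s_mask) if hit]

print(result)  # the two (name, grade) pairs output on the slides
```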

8 For S# = (0 0 0), mask E-tuples with P'S#,2 ^ P'S#,1 ^ P'S#,0
SELECT S.n, E.g FROM S, E WHERE S.s=E.s
Can the join, S.s = E.s, be sped up? Since there is no selection involved this time, potentially we would have to visit every E-tuple, mask the matching S-tuples, concatenate and output. We can speed that up by masking all E-tuples that share the same S# at once and retaining partial masks as we go.
[Figure: the bit columns of S: S#___ | SNAME | GEN and E: S#___ | C#___ | GR.]
For S# = (0 0 0), mask E-tuples with P'S#,2 ^ P'S#,1 ^ P'S#,0.
For S# = (0 0 0), mask S-tuples with P'S#,2 ^ P'S#,1 ^ P'S#,0.
Concatenate and output (CLAY, B) and (CLAY, A).
Continue down to the next E-tuple that has not yet been masked and do the same.
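A hedged sketch of this speed-up: for each not-yet-processed E tuple, mask all E tuples with the same S# in one pass, join them with the matching S tuples, then clear them from a "to do" mask. The lists stand in for uncompressed Ptrees; the names todo, passes, slices and mask_equal are assumptions for the example, and the data is the encoded S and E tables from slide 3.

```python
SNAME = ["CLAY", "THAD", "QING", "BARB", "AMAL", "JOAN"]
S_NUM = [0, 1, 2, 3, 4, 5]
E_S = [0, 0, 3, 3, 1, 1, 2, 2, 4, 5]
E_GR = ["B", "A", "A", "D", "D", "B", "B", "A", "B", "B"]


def slices(values, width=3):
    """Bit columns of a numeric column, high-order position first."""
    return [[(v >> k) & 1 for v in values] for k in range(width - 1, -1, -1)]


def mask_equal(ptrees, bits):
    """AND together each bit column (or its complement where the wanted bit is 0)."""
    mask = [1] * len(ptrees[0])
    for p, bit in zip(ptrees, bits):
        mask = [m & (b if bit else 1 - b) for m, b in zip(mask, p)]
    return mask


PS, PE = slices(S_NUM), slices(E_S)
todo = [1] * len(E_S)  # E tuples not yet output
result, passes = [], 0
for row in range(len(E_S)):
    if not todo[row]:
        continue                   # already handled with an earlier tuple
    bits = [p[row] for p in PE]    # S# of this E tuple
    e_mask = mask_equal(PE, bits)  # all E tuples sharing that S#
    s_mask = mask_equal(PS, bits)  # the matching S tuples
    passes += 1
    for s, s_hit in enumerate(s_mask):
        for e, e_hit in enumerate(e_mask):
            if s_hit and e_hit:
                result.append((SNAME[s], E_GR[e]))
    todo = [t & (1 - m) for t, m in zip(todo, e_mask)]

print(passes, result[:2])
```

On this data the loop builds masks once per distinct S# value in E rather than once per E tuple.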

9 2-Dimensional P-trees: natural choice for, e.g., 2-D image files
For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling “Peano” ordering has advantages for fast processing and also compresses well in the presence of spatial continuity.

For an image bit-file (e.g., the hi-order bit of the red color band of an image), arranged in spatial raster order [bit array shown on the slide], top-down construction of its 2-dimensional Peano-ordered P-tree is done by recording the truth of the universal predicate “pure 1” in a fanout=4 tree, recursively on quarters (1/2^2 subsets), until purity is achieved.
[Figure: the resulting tree. At the root, Pure-1? False = 0; each branch is expanded until it reaches a pure quadrant.]
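The top-down construction can be sketched recursively. Everything specific here is an assumption made for illustration: the 4x4 image is hypothetical (the slide's own bit array is not reproduced), quadrants are visited in the order upper-left, upper-right, lower-left, lower-right, and a node is represented as 1 (pure 1), 0 (pure 0) or a list of four children (mixed).

```python
def build_ptree(img, r=0, c=0, size=None):
    """Top-down 2-D P-tree: stop at pure quadrants, split mixed ones into quarters."""
    if size is None:
        size = len(img)  # assumes a square image whose side is a power of 2
    cells = [img[i][j] for i in range(r, r + size) for j in range(c, c + size)]
    if all(cells):
        return 1         # pure-1 quadrant
    if not any(cells):
        return 0         # pure-0 quadrant
    h = size // 2
    return [build_ptree(img, r, c, h),          # upper-left
            build_ptree(img, r, c + h, h),      # upper-right
            build_ptree(img, r + h, c, h),      # lower-left
            build_ptree(img, r + h, c + h, h)]  # lower-right


image = [[1, 1, 0, 0],   # hypothetical 4x4 bit image in raster order
         [1, 1, 0, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]

print(build_ptree(image))  # pure quadrants collapse to a single bit
```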

10 Bottom-up construction of the 2-Dimensional P-tree
Bottom-up construction of the 2-Dimensional P-tree is done using a Peano (in-order) traversal of a fanout=4 tree, collapsing pure siblings as we go. Since log4(64) = 3, the tree for the 64-bit image has three levels below the root (four levels in all).
[Figure: the traversal begins at the position marked "Start here".]
From here on we will take 4 bit positions at a time, for efficiency.
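A small sketch of the bottom-up construction, taking four sibling positions at a time and collapsing pure groups. The 16-bit input is a hypothetical bit sequence assumed to be already in Peano order, and the node representation (1 = pure 1, 0 = pure 0, list of four = mixed) is an assumption for the example.

```python
def bottom_up(peano_bits):
    """Build a fanout=4 P-tree from bits in Peano order, four siblings at a time."""
    nodes = list(peano_bits)  # length is assumed to be a power of 4
    while len(nodes) > 1:
        merged = []
        for i in range(0, len(nodes), 4):
            group = nodes[i:i + 4]
            if group == [1, 1, 1, 1]:
                merged.append(1)      # four pure-1 siblings collapse to one pure-1 node
            elif group == [0, 0, 0, 0]:
                merged.append(0)      # four pure-0 siblings collapse to one pure-0 node
            else:
                merged.append(group)  # mixed: keep the four children
        nodes = merged
    return nodes[0]


bits = [1, 1, 1, 1,  0, 0, 0, 1,  0, 0, 0, 0,  1, 1, 1, 1]  # hypothetical, Peano order
print(bottom_up(bits))
```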

11 Some aspects of 2-D P-trees:
ROOT-COUNT = sum over the levels of (level-sum * level-purity-factor).
Root Count = 7 * 4^0 + 4 * 4^1 + 2 * 4^2 = 55
Tree levels (going down): 3, 2, 1, 0, with purity-factors of 4^3, 4^2, 4^1, 4^0 respectively.
Fan-out = 2^dimension = 2^2 = 4
Node ID (NID) = (7, 1) = (111, 001)
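The root-count formula can be checked with a short computation. The per-level sums (7 at level 0, 4 at level 1, 2 at level 2) are the slide's numbers; the recursive helper and the small nested-list tree are hypothetical additions showing the same idea on a concrete tree.

```python
FANOUT = 4  # 2^dimension for a 2-D P-tree

# Slide's example: number of pure-1 nodes found at each level.
level_sums = {0: 7, 1: 4, 2: 2}
root_count = sum(count * FANOUT ** level for level, count in level_sums.items())
print(root_count)  # 7*1 + 4*4 + 2*16


def count_ones(node, level):
    """Root count of a nested tree: 1/0 are pure nodes, a list holds four children."""
    if isinstance(node, list):
        return sum(count_ones(child, level - 1) for child in node)
    return node * FANOUT ** level  # a pure-1 node at this level covers 4^level bits


tree = [1, [0, 0, 0, 1], 0, 1]  # hypothetical 4x4 example, root at level 2
print(count_ones(tree, 2))
```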

12 3-Dimensional Ptrees
Top-down construction of a 3-dimensional Peano-ordered P-tree: record the truth of the universal predicate "pure 1" in a fanout=8 tree, recursively on eighths (1/2^3 subsets), until purity is achieved.
Bottom-up construction of the 3-Dimensional P-tree is done using a Peano (in-order) traversal of a fanout=8 tree, collapsing pure siblings as we go. Since log8(64) = 2, the tree for 64 cells has two levels below the root.

13 CEASR bio-agent detector (uses 3-D Ptrees)
Suppose a biological agent is sensed by nano-sensors at a position in the situation space, and that position corresponds to a single 1-bit position in a cutaway view of the space. All other positions contain a 0-bit, i.e., the level of bio-agent detected by the nano-sensors in each of the other 63 cells is below a danger threshold.
[Figure: the situation space and a cutaway view with the single 1-bit marked.]

ONE tiny 3-D P-tree, P, can represent this “bio-situation” completely. It is constructed (bottom up) as a fan-out=8, 3-D P-tree, as follows. Start with the 1st octant (forward-upper-left) and capture its data. Moving to the next octant (forward-upper-right), we can save time by noting that all the remaining 56 cells (in 7 other octants) contain all 0s. Each of the next 7 octants will produce eight 0s at the leaf level (8 pure-0 siblings), each of which will collapse to a 0 at level-1. So we proceed an octant at a time (rather than a cell at a time).

This entire situation can be transmitted to a personal display unit as merely two bytes of data plus their two NIDs. For the NID, use [level, global_level_offset] rather than [local_segment_offset, …, local_segment_offset]. Assume that every node not sent is all 0s (we only need to send “mixed” segments) and that, in any 13-bit node segment sent, the first 2 bits are the level, the next 3 bits are the global_level_offset within that level (i.e., 0..7) and the final 8 bits are the node’s data. Then the complete situation can be transmitted as just 13 bits [bit string shown on the slide].

For a situation with 2^(3n) cells (n=2 above), it will take only log2(n) blue (level) bits, 3n-3 green (offset) bits and 8 red (data) bits. So even if there are 2^(8*3) = 2^24 ~ 16,000,000 cells, we transmit merely 3 + 21 + 8 = 32 bits.
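The 13-bit node segment described above (2 level bits, 3 offset bits, 8 data bits) can be sketched as simple bit packing. The function names are hypothetical, and the data byte below is a made-up example: the actual position of the 1-bit inside the first octant is not recoverable from this transcript.

```python
def pack_segment(level, offset, data):
    """Pack [level (2 bits) | global_level_offset (3 bits) | node data (8 bits)]."""
    if not (0 <= level < 4 and 0 <= offset < 8 and 0 <= data < 256):
        raise ValueError("field out of range for a 13-bit segment")
    return (level << 11) | (offset << 8) | data


def unpack_segment(segment):
    """Inverse of pack_segment: return (level, offset, data)."""
    return (segment >> 11) & 0b11, (segment >> 8) & 0b111, segment & 0xFF


# Mixed node at level 1, offset 0 (the first octant); the data byte is hypothetical.
segment = pack_segment(level=1, offset=0, data=0b00000100)
print(format(segment, "013b"))  # level, offset and data fields concatenated
print(unpack_segment(segment))
```

A receiver that assumes every unsent node is all 0s can rebuild the whole 64-cell situation from this one segment.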

14 Basic, Value and Tuple Ptrees
Basic Ptrees for a 7-column, 8-bit table, e.g., P11, P12, …, P18, P21, …, P28, …, P71, …, P78. In Pij, i is the target attribute and j is the target bit position.

Value Ptrees (predicate: quad is purely the target value in the target attribute), built from basic Ptrees with AND. In Pi,v, i is the target attribute and v is the target value.
e.g., P1,5 = P1,101 = P11 AND P12’ AND P13

Tuple Ptrees (predicate: quad is purely the target tuple), built from value Ptrees with AND.
e.g., P(1, 2, 7) = P(001, 010, 111) = P1,001 AND P2,010 AND P3,111

Rectangle Ptrees (predicate: quad is purely in the target rectangle, a product of intervals), built from value Ptrees with AND/OR.
e.g., P([1,3], , [0,2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2)
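The way value, tuple and rectangle Ptrees are derived from basic Ptrees can be sketched on uncompressed bit columns. The table below is hypothetical (3 attributes of 3 bits each), bit position 1 is assumed to be the high-order bit, and all function names are assumptions for the example.

```python
ROWS = [(1, 2, 7), (5, 0, 1), (3, 2, 2), (1, 2, 7), (2, 6, 0)]  # hypothetical table
WIDTH = 3  # bits per attribute; bit position 1 is taken to be the high-order bit


def basic(attr, bitpos):
    """Basic Ptree P[attr][bitpos] as an uncompressed bit column."""
    return [(row[attr - 1] >> (WIDTH - bitpos)) & 1 for row in ROWS]


def p_not(p):
    return [1 - b for b in p]


def p_and(p, q):
    return [a & b for a, b in zip(p, q)]


def p_or(p, q):
    return [a | b for a, b in zip(p, q)]


def value_ptree(attr, value):
    """AND of basic Ptrees, complemented where the value's bit is 0."""
    mask = [1] * len(ROWS)
    for bitpos in range(1, WIDTH + 1):
        wanted = (value >> (WIDTH - bitpos)) & 1
        p = basic(attr, bitpos)
        mask = p_and(mask, p if wanted else p_not(p))
    return mask


def tuple_ptree(values):
    """AND of one value Ptree per attribute."""
    mask = [1] * len(ROWS)
    for attr, value in enumerate(values, start=1):
        mask = p_and(mask, value_ptree(attr, value))
    return mask


def rectangle_ptree(intervals):
    """AND over attributes of the OR of value Ptrees in each interval (None = no limit)."""
    mask = [1] * len(ROWS)
    for attr, interval in enumerate(intervals, start=1):
        if interval is None:
            continue
        lo, hi = interval
        in_range = [0] * len(ROWS)
        for value in range(lo, hi + 1):
            in_range = p_or(in_range, value_ptree(attr, value))
        mask = p_and(mask, in_range)
    return mask


print(value_ptree(1, 5))                        # P1,5 = P11 AND P12' AND P13
print(tuple_ptree((1, 2, 7)))                   # P(001, 010, 111)
print(rectangle_ptree([(1, 3), None, (0, 2)]))  # P([1,3], , [0,2])
```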

15 Horizontal Processing of Vertical Structures for Record-based Workloads
For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing.
[Figure: relation R(A1 A2 A3 A4) with its vertical structures R11 R12 R13, R21 R22 R23, R31 R32 R33, R41 R42 R43.]

For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, so there may be no reconstructive post-processing at all.
[Figure: the same vertical structures R11 … R43 producing a single result bit.]

16 Please proceed on to 03_Vertical_data_LECTURE3