Published by Allie Cookson. Modified over 10 years ago.
1
1 The Stream Star Schema. Stephen A. Broeker
2
2 Conclusion: The Stream Star Schema processes data streams in real time, at up to gigabits per second. Stream Star insert performance is O(1).
3
3 Large Fast Dynamic Data Streams: phone calls, road traffic, network traffic, website traffic, power supplies, credit card transactions, sensor arrays, and financial markets are data rich but poor in real-time analysis.
4
4 Large Fast Dynamic Data Streams: phone calls, road traffic, network traffic, website traffic, power supplies, credit card transactions, sensor arrays, financial markets. Data rich, but poor in real-time analysis.
5
5 Large Fast Dynamic Data Streams: What are the consequences?
6
6 Large Fast Dynamic Data Streams: Hard to see patterns, and therefore difficult to detect problems.
7
7 Challenge of Network Monitoring: Network monitoring at high speed is difficult. On a 1 Gbps NIC a bit arrives every nanosecond, and minimum-size Ethernet frames can arrive every 672 ns. Must use SRAM for per-packet processing. The traditional solution of sampling is inherently inaccurate due to the loss of data.
8
8 Vision: Achieve real-time OLAP for massive data streams. Achieve cybernetic control for systems that depend on rapid data analysis.
9
9 Detection
10
10 Forensics
11
11 Data Rates versus Data Storage: Data RATES are measured in bits per second (lowercase b). So, Gigabits (Gb) ≠ Gigabytes (GB).
12
12 Data Rates versus Data Storage: Data RATES are measured in bits per second (lowercase b). Data STORAGE is measured in Bytes (uppercase B). So, Gigabits (Gb) ≠ Gigabytes (GB).
13
13 Data Storage based on Data Rate: Ethernet Network Interface Card transferring data at 1 Gbps. Data accumulates at 450 GB per hour. That's 10.5 TB per day, 73.8 TB per week (counting 1024 GB per TB)!
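The storage arithmetic can be checked with a short sketch. It assumes a fully saturated link and that 1 Gbps means 10^9 bits per second, which gives 125 MB per second, or 450 GB per hour. The daily and weekly figures of 10.5 TB and 73.8 TB appear to divide by 1024 GB per TB; that convention is an inference from the numbers, not something the slide states. The function and variable names are illustrative.

```python
# Back-of-the-envelope storage accumulation for a saturated 1 Gbps link.
# Assumptions: the link is fully utilized; 1 Gbps = 10**9 bits per second.

BITS_PER_BYTE = 8
HOUR = 3600
DAY = 24 * HOUR
WEEK = 7 * DAY


def bytes_accumulated(rate_bps: float, seconds: float) -> float:
    """Bytes stored after capturing a stream at rate_bps for `seconds`."""
    return rate_bps / BITS_PER_BYTE * seconds


rate = 1e9  # 1 Gbps

per_hour_gb = bytes_accumulated(rate, HOUR) / 1e9

# Mixed convention (inferred): decimal GB, then 1024 GB per TB.
per_day_tb = bytes_accumulated(rate, DAY) / 1e9 / 1024
per_week_tb = bytes_accumulated(rate, WEEK) / 1e9 / 1024

# Purely decimal convention, for comparison.
per_day_tb_decimal = bytes_accumulated(rate, DAY) / 1e12

print(f"{per_hour_gb:.0f} GB per hour")
print(f"{per_day_tb:.1f} TB per day, {per_week_tb:.1f} TB per week")
print(f"{per_day_tb_decimal:.1f} TB per day in decimal units")
```

In purely decimal units the daily figure is 10.8 TB, so the choice of convention changes the result by a few percent, not by an order of magnitude.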
14
14 Picturing Orders of Magnitude: What if BYTES were pennies? $10^{6} \approx 2^{20}$, $10^{9} \approx 2^{30}$, $10^{12} \approx 2^{40}$, $10^{15} \approx 2^{50}$. Used with permission: © Copyright 2001 Alan Taylor – The Mega Penney Project - KOKOGIAK MEDIA
17
17 Picturing Orders of Magnitude: What if BYTES were pennies? At 1 Gbps, roughly 0.3 PB (about 320 TB) accumulate per month. $10^{6} \approx 2^{20}$, $10^{9} \approx 2^{30}$, $10^{12} \approx 2^{40}$, $10^{15} \approx 2^{50}$. Used with permission: © Copyright 2001 Alan Taylor – The Mega Penney Project - KOKOGIAK MEDIA
18
18 Picturing Orders of Magnitude: What if BYTES were pennies? $10^{18} \approx 2^{60}$. Used with permission: © Copyright 2001 Alan Taylor – The Mega Penney Project - KOKOGIAK MEDIA
19
19 From Streaming Data to Database: The network stream is segmented into flows, which are inserted into a database. Observed database input rate for a 1 Gb Ethernet NIC: 700,000 flows per hour. Existing databases can't keep up!
20
20 Consider 2 Database Schemas: Disk Star Schema and STREAM Star Schema.
21
21 Disk Star Schema: So where's the star? From Fact Table to Dimension Tables. The Fact Table (Content, Destination IP, Sender, Recipient, Subject) points to the Content Table, Sender Table, Subject Table, Recipient Table, and Destination IP Table. Here's the star. That's all there is to the star concept.
22
22 Value of the Disk Star Schema: Conserve Disk Space.
23
23 Dimensions: Each Dimension gets a key.
24
24 Resulting in a Dimension Table. 1NF: No Repeating Groups.
25
25 Substitute Keys for Facts, thus deriving a Fact Table.
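The three steps above (key each dimension value, build dimension tables without repeats, substitute keys into the fact table) can be sketched in a few lines. This is a minimal in-memory illustration, not the author's implementation: the class `DimensionTable`, the function `insert_record`, and the dimension names are hypothetical, and a Python dict stands in for the disk-based index a real database would use.

```python
# Hypothetical sketch of a star-schema load: each distinct string in a
# dimension gets one key, and fact rows store keys instead of strings.
from typing import Dict, List, Tuple


class DimensionTable:
    """Maps each distinct value to a surrogate key (no repeated values)."""

    def __init__(self) -> None:
        self.key_of: Dict[str, int] = {}
        self.values: List[str] = []

    def key_for(self, value: str) -> int:
        # Look the value up first; insert it only if it is new.
        key = self.key_of.get(value)
        if key is None:
            key = len(self.values)
            self.key_of[value] = key
            self.values.append(value)
        return key


# Illustrative dimension names, loosely following the e-mail example.
dimensions = {name: DimensionTable() for name in ("sender", "recipient", "subject")}
fact_table: List[Tuple[int, ...]] = []


def insert_record(record: Dict[str, str]) -> None:
    """Substitute a key for every string, then append the fact row."""
    row = tuple(dimensions[name].key_for(record[name]) for name in dimensions)
    fact_table.append(row)


insert_record({"sender": "alice", "recipient": "bob", "subject": "hello"})
insert_record({"sender": "alice", "recipient": "carol", "subject": "hello"})
print(fact_table)
print(dimensions["sender"].values)
```

The repeated sender and subject are stored once, which is the space saving the schema is designed for. The price is one lookup per dimension on every insert.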
26
26 Disk Star Schema = Slow data insertion time (the bottleneck). Relational databases are normalized to conserve space. Speed is sacrificed, so real-time analysis is compromised.
27
27 Disk Star Schema
28
28 Disk Star Schema
29
29 Disk Star Schema
30
30 Disk Star Schema
31
31 Bottleneck: Dimension table insertion time depends on the table size: it is $O(\log n)$, where $n$ is the number of records in the table. Disk Star Schema insertion time is the sum of all dimension table insert times, $O\left(\sum_{i=1}^{l} \log n_i\right)$, where $l$ is the number of attributes in the database and $n_i$ is the number of values for attribute $i$. Can't fill dimension tables fast enough!
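To get a feel for how this sum grows, here is a rough cost model. It assumes each dimension lookup costs about $\log_2 n_i$ comparisons, as with a balanced-tree index. The table sizes are made up for illustration and are not measurements from the talk.

```python
# Rough cost model for one record insert; counts index comparisons only.
# Assumption: a lookup in dimension i costs about log2(n_i) comparisons.
import math
from typing import Iterable


def disk_star_insert_cost(dimension_sizes: Iterable[int]) -> float:
    """Sum of per-dimension lookup costs: sum over i of log2(n_i)."""
    return sum(math.log2(n) for n in dimension_sizes if n > 1)


# Hypothetical sizes: five dimension tables, small versus large.
small = [1_000] * 5
large = [1_000_000] * 5

print(f"small tables: {disk_star_insert_cost(small):.1f} comparisons per insert")
print(f"large tables: {disk_star_insert_cost(large):.1f} comparisons per insert")
```

The per-insert cost rises with both the number of dimensions and the size of each table, so it keeps climbing as the stream fills the tables.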
32
32 Short Pause to Review Numbers: 1,000,000,000 bit per second Ethernet NIC (1 Gbps). 700,000 observed flows per hour. 450 GB per hour, 10.5 TB a day. All we can get is a snapshot analysis!
33
33 Consider 2 Database Schemas: Disk Star Schema and STREAM Star Schema.
34
34 Stream Star Schema
35
35 Stream Star Schema
36
36 Stream Star Schema
37
37 Disk Star Schema: Nearly 1:1 correspondence between string attributes and dimension tables.
38
38 Disk Star Schema: Two kinds of tables: fact and dimension. All string dimensions have dimension tables. Minimizes disk space. Dimension tables can be large. Long insert time = $O\left(\sum_{i=1}^{l} \log n_i\right)$. No string duplication.
39
39 Stream Star Schema: Many:1 correspondence between string attributes and dimension tables.
40
40 Stream Star Schema: Three kinds of tables: fact, dimension, and string. Few dimension tables. Dimension tables are small. Minimizes insertion time. Insert time is constant. Allows string duplication.
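The slide does not show how constant insert time is achieved, so the following is only one possible reading, not the author's implementation. It assumes strings go to an append-only string table with no duplicate check, so an insert never searches a table that grows with the stream. The names `string_table`, `append_string`, and `insert_flow` are hypothetical, and the few small dimension tables are left out for brevity.

```python
# One possible reading of the slide (an assumption, not the actual design):
# appending strings without a lookup keeps per-insert work independent of
# how many rows have already been stored.
from typing import List, Tuple

string_table: List[str] = []           # duplicates are allowed
fact_table: List[Tuple[int, ...]] = []


def append_string(value: str) -> int:
    """Append without searching; the row position serves as the key."""
    string_table.append(value)
    return len(string_table) - 1


def insert_flow(strings: List[str]) -> None:
    """Work is proportional to the number of attributes, not table size."""
    fact_table.append(tuple(append_string(s) for s in strings))


insert_flow(["alice", "hello"])
insert_flow(["alice", "status report"])
print(string_table)
print(fact_table)
```

Here "alice" is stored twice. That is the trade the slide describes: disk space is spent on duplicate strings in exchange for an insert path with no size-dependent lookups.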
41
41 Side x Side Comparison: Slow vs. Fast, Old vs. New.
42
42 Test Results
43
43 Test Results: The magnified area is different because I measured the insert time for (1, 10, 100) streams as opposed to (1000, 2000, 3000) streams.
44
44 Test Results: The magnified area is different because of how MySQL works. I can only present a hypothesis since I don't have the MySQL source code. But I suspect that MySQL is optimized for fewer than 100 streams for this problem.
45
45 Conclusion
46
46 Conclusion: The Stream Star Schema processes data streams in real time, at up to gigabits per second. Stream Star insert performance is O(1).
47
47 Hope: Detection, Forensics, RFID
48
48 There's data flow
49
49 And then there's DATA FLOW!
50
50 Disk Star Schema handles 3 million flows per hour, about this much.
51
51 The Stream Star Schema handles 113 million flows per hour! Disk Star Schema handles 3 million flows per hour, about this much.
52
52 Nearly 40x Faster!
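The speedup follows directly from the two throughput figures quoted on the preceding slides. The sketch below only restates that arithmetic and converts the hourly rates to flows per second. The variable names are illustrative.

```python
# Throughput figures as quoted on the slides, in flows per hour.
disk_flows_per_hour = 3_000_000
stream_flows_per_hour = 113_000_000

speedup = stream_flows_per_hour / disk_flows_per_hour
disk_flows_per_second = disk_flows_per_hour / 3600
stream_flows_per_second = stream_flows_per_hour / 3600

print(f"speedup: {speedup:.1f}x")
print(f"disk:   {disk_flows_per_second:,.0f} flows per second")
print(f"stream: {stream_flows_per_second:,.0f} flows per second")
```

The ratio is about 37.7, which the slide rounds up to "nearly 40x".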
53
53 For The Future: Implement the Stream Star Schema in the Cloud. Use multiple Stream Star Schema computer nodes to handle an infinite stream. Storage could be handled similarly to S3.
54
54 For The Future: The Stream Star Schema fully supports the analysis of high-speed data streams, thus enabling security applications and forensic processing.
55
55 END