Published by Madeline Blankenship. Modified over 9 years ago.
2
CSCI 765 Big Data and Infinite Storage
3
One new idea introduced in this course is the emerging idea of structuring data into vertical structures and processing across those vertical structures. This is in contrast to the traditional method of structuring data into horizontal structures and processing down those horizontal structures. (Horizontal structures are often called records, e.g., an employee file containing horizontal employee records made up of fields such as Name, Address, Salary, Phone, etc.) Thus, horizontal processing of vertical data (HPVD) will be introduced as an alternative to the traditional vertical processing of horizontal data (VPHD).
Why do we need to structure and process data differently than we have in the past? What has changed? Data (digital data) has gotten really BIG!! How big is BIG DATA these days and how big will it get?
4
An Example: The US Library of Congress is storing EVERY tweet sent since Twitter launched in 2006. Each tweet record contains fifty fields. Let's assume each of those horizontal tweet records is about 1000 bits wide. Let's estimate approximately 1 trillion tweets from 1 billion tweeters to 1 billion tweetees over 10 years of tweeting. As a full data file, that's 10^30 data items (10^12 * 10^9 * 10^9).
5
That's BIG! Is it going to get even bigger? Yes. Let’s look at how the definition of “big data” has evolved just over my work lifetime.
My first job in this industry was as THE technician at the St. John’s University IBM 1620 Computer Center. I did the following:
1. I turned the 1620 switch on.
2. I waited for the ready light bulb to come on (~15 minutes).
3. I put the Op/Sys punch card stack on the card reader (~4 inches high).
4. I put the FORTRAN compiler card stack on the reader (~3 inches).
5. I put the FORTRAN program card stack on the reader (~2 inches).
6. The 1620 produced an object code stack which I read in (~1 inch).
7. I read in the object stack and a 1964 BIG DATA stack (~40 inches).
The 1st FORTRAN upgrade allowed for a “continue” card so that the data stack could be read in segments (and I could sit down).
6
How high would a 2013 BIG DATA STACK reach today if it were put on punch cards? Let's be conservative and assume an exabyte (10^18 bytes) of data on cards. How high is an exabyte punch card stack? Take a guess... Keep in mind that we're being conservative, because the US LoC tweet database may be ~10^30 bytes or more soon (if it's fully losslessly stored).
7
That exabyte stack of punch cards would reach to JUPITER! So, in my work lifetime, BIG DATA has gone from 40 inches high all the way to Jupiter! What will happen to BIG DATA over your work lifetime?
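A rough check of that claim can be sketched in Python. The constants are assumptions that do not appear on the slide: 80 bytes per card (one character per column of an 80-column card), a card thickness of 0.007 inch, and about 968 million km as the greatest Earth-to-Jupiter distance.

```python
# Assumed figures (not from the slides):
BYTES_PER_CARD = 80          # one character per column, 80-column card
CARD_THICKNESS_IN = 0.007    # thickness of one punch card in inches
METERS_PER_INCH = 0.0254
JUPITER_MAX_KM = 968e6       # approximate farthest Earth-to-Jupiter distance


def stack_height_km(num_bytes):
    """Height in km of a punch card stack holding num_bytes of data."""
    cards = num_bytes / BYTES_PER_CARD
    height_m = cards * CARD_THICKNESS_IN * METERS_PER_INCH
    return height_m / 1000


exabyte_km = stack_height_km(10**18)
print(f"Exabyte stack: {exabyte_km:.2e} km")
print("Reaches Jupiter:", exabyte_km > JUPITER_MAX_KM)
```

Under these assumptions the stack comes out on the order of 2 billion km, so "to Jupiter" is itself a conservative statement.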
8
I must deal with a data file that would reach Jupiter as a punch card stack, but I can replace it losslessly by 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally.
You may have to deal with a data file that would reach the end of space (if on cards), but you can replace it losslessly by 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally.
The next generation may have to deal with a data file that creates new space, but can replace it losslessly by 1000 extendable vertical pTrees and write programs to process across those 1000 vertical structures horizontally.
You will be able to use my code! The next generation will be able to use my code too!
It seems clear that DATA WILL HAVE TO BE COMPRESSED and that data will have to be VERTICALLY structured. Let's take a quick look at how one might organize and compress vertical data (more on that later too).
9
Vertical Data Structuring using vertical pTrees

Traditional Vertical Processing of Horizontal Data (VPHD): for horizontally structured, record-oriented data, one scans vertically, e.g., to find the number of occurrences of (7, 0, 1, 4).

R(A1 A2 A3 A4)
Base 10      Base 2
2 7 6 1      010 111 110 001
3 7 6 0      011 111 110 000
2 6 5 1      010 110 101 001
2 7 5 7      010 111 101 111
3 2 1 4      011 010 001 100
2 2 1 5      010 010 001 101
7 0 1 4      111 000 001 100
7 0 1 4      111 000 001 100

Predicate trees = pTrees. Slice by column (4 vertical structures, R[A1] R[A2] R[A3] R[A4]), then vertically slice off each bit position (12 vertical structures, R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43), then compress each bit slice into a tree using a predicate. The tree records the truth of the predicate "purely 1-bits", recursively on halves, until the half is pure.

We will walk through the compression of R11 = 0 0 0 0 0 0 1 1 into the pTree P11:
1. Whole thing pure1? false = 0
2. Left half pure1? false = 0. But it's pure0, so this branch ends.
3. Right half pure1? false = 0
4. Left half of right half pure1? false = 0
5. Right half of right half pure1? true = 1

To count the (7, 0, 1, 4)s, use the bit pattern 111 000 001 100 and AND the pTrees, taking the complement (') wherever the pattern bit is 0:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
The count is read off the resulting tree level by level: 0*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 2.

Imagine an excillion records, not just 8 (we need speed!). More typically, we compress strings of bits, not single bits (e.g., 64-bit strings or strides).
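The pTree construction and the AND-based count on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions, not the course's implementation: the node encoding (1 for a pure1 half, 0 for a pure0 half, a two-element tuple for a mixed half) and all function names are hypothetical, and the rows are taken from the base-2 values of the eight-record example.

```python
# Rows of R(A1, A2, A3, A4); every attribute is a 3-bit value.
R = [
    (2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
    (3, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4),
]
BITS = 3


def bit_slices(rows, bits=BITS):
    """Return the vertical bit slices R11, R12, ..., R43 as lists of 0/1."""
    slices = []
    for col in range(len(rows[0])):
        for b in range(bits - 1, -1, -1):      # high-order bit first
            slices.append([(row[col] >> b) & 1 for row in rows])
    return slices


def build_ptree(bits):
    """Hypothetical encoding: 1 = pure1, 0 = pure0, (left, right) = mixed."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    mid = len(bits) // 2
    return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))


def complement(tree):
    """Flip every bit represented by the tree (P' on the slide)."""
    if isinstance(tree, tuple):
        return (complement(tree[0]), complement(tree[1]))
    return 1 - tree


def and_ptrees(a, b):
    """AND two pTrees without decompressing them."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    left = and_ptrees(a[0], b[0])
    right = and_ptrees(a[1], b[1])
    return 0 if left == 0 and right == 0 else (left, right)


def root_count(tree, n):
    """Number of 1-bits represented by a pTree that covers n rows."""
    if tree == 1:
        return n
    if tree == 0:
        return 0
    return root_count(tree[0], n // 2) + root_count(tree[1], n - n // 2)


def count_pattern(rows, pattern, bits=BITS):
    """Count the rows equal to pattern by ANDing pTrees or their complements."""
    trees = [build_ptree(s) for s in bit_slices(rows, bits)]
    mask = [(v >> b) & 1 for v in pattern for b in range(bits - 1, -1, -1)]
    result = 1                                  # start from an all-1 tree
    for tree, bit in zip(trees, mask):
        result = and_ptrees(result, tree if bit else complement(tree))
    return root_count(result, len(rows))


p11 = build_ptree(bit_slices(R)[0])
print(p11)
print(count_pattern(R, (7, 0, 1, 4)))
```

The AND never touches a half that is already pure0 in either operand, which is where the speed on compressed data comes from.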
10
The age of Big Data is upon us and so is the age of Infinite Storage. Many of us have enough money in our pockets right now to buy all the storage we will be able to fill for the next 5 years. So having adequate storage capacity is no longer much of a problem. Managing our storage is a problem (especially managing BIG DATA storage). How much data is there?
11
TeraBytes (TBs) are certainly here already.
1 TB may cost << 1k$ to buy.
1 TB may cost >> 1k$ to own.
Management and curation are the expensive part.
Searching 1 TB takes a long time.

Googolplex 10^Googol
Googol 10^100
...
(tredecillion) 10^42
(duodecillion) 10^39
(undecillion) 10^36
(decillion) 10^33
(nonillion) 10^30
(octillion) 10^27
Yotta (septillion) 10^24
Zetta (sextillion) 10^21
Exa (quintillion) 10^18
Peta (quadrillion) 10^15
Tera (trillion) 10^12
Giga (billion) 10^9
Mega (million) 10^6
Kilo (thousand) 10^3

I’m Terrified by TeraBytes.
I’m Petrified by PetaBytes.
I’m Exafied by ExaBytes.
I’m Zettafied by ZettaBytes.
You could be Yottafied by YottaBytes.
You may not be Googified by GoogolBytes, but the next generation may be?
12
How much information is there?
Soon everything may be recorded.
Most of it will never be seen by humans.
Data summarization, vertical structuring, compression, trend detection, anomaly detection, and data mining are key technologies.
[Figure: scale running Yotta, Zetta, Exa, Peta, Tera, Giga, Mega, Kilo, with markers labeled A Book, A Photo, Movie, All books (words), All Books MultiMedia, and Everything Recorded]
Small prefixes: 10^-24 yocto, 10^-21 zepto, 10^-18 atto, 10^-15 femto, 10^-12 pico, 10^-9 nano, 10^-6 micro, 10^-3 milli
13
First Disk, in 1956: IBM 305 RAMAC
4 MB
50 24” disks
1200 rpm (revolutions per minute)
100 milliseconds (ms) access time
35k$/year to rent
Included computer & accounting software (tubes, not transistors)
7th grade C.S. lab tech.
14
10 years later 1.6 meters 30 MB
15
Disk Evolution Kilo Mega Giga Tera Peta Exa Zetta Yotta
16
Memex
As We May Think, Vannevar Bush, 1945
“A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility”
“yet if the user inserted 5000 pages of material a day it would take him hundreds of years to fill the repository, so that he can enter material freely”
17
Can you fill a terabyte in a year?

Item                                  Items/TB   Items/day
a 300 KB JPEG image                   3 M        10,000
a 1 MB document                       1 M        3,000
a 1 hour, 256 kb/s MP3 audio file     10 K       26
a 1 hour video                        300        0.8
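The arithmetic behind that table can be sketched in Python. It assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes) and a 365-day year, neither of which the slide states; the slide's own figures are rounded, so the computed values land near them rather than exactly on them.

```python
TB = 10**12            # assumed decimal terabyte
DAYS_PER_YEAR = 365    # assumed time allowed to fill the terabyte

# Item sizes in bytes. The MP3 size follows from 256 kb/s over 3600 s.
ITEM_SIZES_BYTES = {
    "300 KB JPEG image": 300 * 10**3,
    "1 MB document": 10**6,
    "1 hour, 256 kb/s MP3": 256_000 // 8 * 3600,
}


def items_per_tb(size_bytes):
    """How many items of this size fit in one terabyte."""
    return TB / size_bytes


def items_per_day(size_bytes):
    """How many items per day are needed to fill a terabyte in one year."""
    return items_per_tb(size_bytes) / DAYS_PER_YEAR


for name, size in ITEM_SIZES_BYTES.items():
    print(f"{name}: {items_per_tb(size):,.0f} per TB, "
          f"{items_per_day(size):,.1f} per day")
```

The video row is left out because the slide gives no bit rate for it; its 300 items per TB would correspond to roughly 3.3 GB per hour of video.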
18
On a Personal Terabyte, how Will We Find Anything?
Need Queries, Indexing, Vertical Structuring?, Compression, Data Mining, Scalability, Replication…
If you don’t use a DBMS, you will implement one of your own!
The need for Data Mining and Machine Learning is more important than ever!
Of the digital data in existence today, 80% is personal/individual and 20% is corporate/governmental.
19
Parkinson’s Law (for data): Data expands to fill available storage.
Disk-storage version of Moore’s Law: Available storage doubles every 9 months!
How do we get the information we need from the massive volumes of data we will have?
Vertical Structuring and Compression
Querying (for the information we know is there)
Data mining (for answers to questions we don't know to ask precisely)
Moore’s Law with respect to processor performance (processor performance doubles every x months …) seems to be over. Note that the processors we find in our computers today are the same as the ones we found a few years ago. That’s because that technology (miniaturization) seems to have reached a limit. Now the direction is to put multiple processors on the same chip or die and to use other types of processors to increase performance.
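To get a feel for what doubling every 9 months implies, the compounding can be sketched in Python. The function name and the sample horizons are illustrative assumptions; only the 9-month doubling period comes from the slide.

```python
def growth_factor(months, doubling_months=9):
    """Factor by which capacity grows after `months`, given a doubling period."""
    return 2 ** (months / doubling_months)


for years in (1, 5, 10):
    print(f"{years:>2} years: x{growth_factor(12 * years):,.0f}")
```

At that rate, storage grows about a hundredfold in five years, which is the scale of growth the querying and data mining techniques listed above have to keep up with.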
20
Thank you.