Big Data I/Hadoop explained
Presented to ITS at the UoA on December 6th 2012
The gospel according to Dilbert
What is the scale of Big Data?
There is no single agreed definition of Big Data, but examples of its scale include:
  12 terabytes of daily tweets
  2.8 petabytes of untapped power utilities data
  350 billion annual meter readings
  5 million daily stock market trades
  500 million daily call-centre records
  100s of live video feeds
  millions of daily click stream records from web logs
65,313,993 rows of data doth not Big Data make
FACT_EFTS_SNAPSHOT, the largest table in the DSS data warehouse – 11.6 GB, 65,313,993 rows
Lecture theatre recording data – 1 TB
So how do we define Big Data?
Volume; velocity; variety; veracity
1 TB of data can be handled by traditional enterprise relational databases
A working definition of Big Data: data that make the use of tools like Hadoop necessary
Therefore the UoA does not deal in Big Data, nor do most organisations in New Zealand, so why do you think consultants are pushing it?
The Big Data problem
Enterprise-scale relational databases adequately handle large amounts of data
Businesses need to analyse huge amounts of data in the search for competitive advantage
SQL joins in row-based relational databases cannot handle Big Data
Big Data changes everything
Google's solution to the Big Data problem is a disruptive technology for Big Data but not for merely large amounts of data
Google File System (GFS)
GFS was created to address the storage scalability problem
GFS is a distributed file system housed on clusters of cheaper commodity servers and disks
Commodity servers and disks fail often, so huge data files are chunked and replicated across the file system to minimise the impact of failures
how-the-giants-of-the-web-store-big-data/
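The chunk-and-replicate idea can be illustrated with a minimal Python sketch. It is not GFS code: the function names, the tiny chunk size, the server names and the round-robin placement rule are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def split_into_chunks(data: bytes, chunk_size: int) -> List[bytes]:
    """Cut a file into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_replicas(num_chunks: int, servers: List[str], replicas: int) -> Dict[int, List[str]]:
    """Assign each chunk to `replicas` distinct servers.

    Round-robin placement is an assumption of this sketch, not the real policy.
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        chunk_id: [servers[(chunk_id + r) % len(servers)] for r in range(replicas)]
        for chunk_id in range(num_chunks)
    }


def readable_chunks(placement: Dict[int, List[str]], failed: Set[str]) -> List[int]:
    """Return the chunks that still have at least one copy on a live server."""
    return [
        chunk_id
        for chunk_id, hosts in placement.items()
        if any(host not in failed for host in hosts)
    ]


data = b"x" * 10                       # stand-in for a huge file
chunks = split_into_chunks(data, 4)    # toy chunk size
servers = ["s1", "s2", "s3", "s4"]     # hypothetical commodity servers
placement = place_replicas(len(chunks), servers, replicas=3)
survivors = readable_chunks(placement, failed={"s1", "s2"})
```

With three copies of every chunk, the file is still fully readable after two of the four servers fail, which is the point of replicating on unreliable hardware.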
Google Bigtable
Bigtable is Google's distributed storage system for managing data and sits on top of the Google File System
It is designed to scale to a very large size: petabytes of data across thousands of commodity servers/disks
Near-linear scalability is achieved by performing computations on the distributed servers/disks that manage and contain the data, rather than moving data to separate processing nodes
Many projects at Google store data in Bigtable, including web indexing and Google Earth
Bigtable is column-based rather than row-based
Bigtable maps two arbitrary string values (a row key and a column key) and a timestamp (hence a three-dimensional mapping) to an associated arbitrary byte array (the value, which Bigtable does not interpret)
Bigtable can be better defined as a sparse (gaps between keys), distributed (across many machines/disks), multi-dimensional (maps within maps), sorted (by key rather than value) map (key with an associated value)
Bigtable is therefore a columnar data store rather than a row-based relational database
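As a rough illustration of the "sparse, sorted, multi-dimensional map" description, the sketch below models a table as a Python dictionary keyed by (row key, column key, timestamp). The helper names `put` and `get_latest` and the sample keys are hypothetical; this is not the Bigtable API, and it leaves out distribution across machines entirely.

```python
from typing import Dict, List, Optional, Tuple

# (row key, column key, timestamp) -> uninterpreted bytes
CellKey = Tuple[str, str, int]
table: Dict[CellKey, bytes] = {}


def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    """Store one version of one cell; other cells simply do not exist (sparse)."""
    table[(row, column, timestamp)] = value


def get_latest(row: str, column: str) -> Optional[bytes]:
    """Return the value with the highest timestamp, or None if the cell is absent."""
    versions: List[Tuple[int, bytes]] = [
        (ts, value) for (r, c, ts), value in table.items() if r == row and c == column
    ]
    if not versions:
        return None
    return max(versions, key=lambda version: version[0])[1]


# Hypothetical sample data, inserted out of key order on purpose.
put("page2", "contents:html", 1, b"<p>two</p>")
put("page1", "contents:html", 1, b"<p>old</p>")
put("page1", "contents:html", 2, b"<p>new</p>")

latest = get_latest("page1", "contents:html")   # newest version of the cell
missing = get_latest("page2", "anchor:text")    # never written, so no storage used
ordered_keys = sorted(table)                    # sorted by key, not by value
```

The third key component is what lets several versions of the same cell coexist, and sorting on the key keeps all cells of a row next to each other.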
Row-based and columnar examples
Row-based example, e.g., an RDBMS table called PERSONAL_DETAILS

ID  NAME   AGE  INTERESTS
1   Ricky       Soccer, Movies, Baseball
2   Ankur  20
3   Sam    25   Music

Columnar breakdown

ID  NAME
1   Ricky
2   Ankur
3   Sam

ID  AGE
2   20
3   25

ID  INTERESTS
1   Soccer
1   Movies
1   Baseball
3   Music

and-hbase.html
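The breakdown from rows to columns can be sketched in a few lines of Python. The data is the PERSONAL_DETAILS example from this slide; the function name `to_columns` and the use of `None` and empty lists for missing values are assumptions of the sketch.

```python
from typing import Any, Dict, List, Tuple

# Row-based form: one record per person, with gaps for missing values.
rows: List[Dict[str, Any]] = [
    {"ID": 1, "NAME": "Ricky", "AGE": None, "INTERESTS": ["Soccer", "Movies", "Baseball"]},
    {"ID": 2, "NAME": "Ankur", "AGE": 20, "INTERESTS": []},
    {"ID": 3, "NAME": "Sam", "AGE": 25, "INTERESTS": ["Music"]},
]


def to_columns(records: List[Dict[str, Any]], key: str = "ID") -> Dict[str, List[Tuple[Any, Any]]]:
    """Break rows into one (key, value) list per column, skipping missing values."""
    columns: Dict[str, List[Tuple[Any, Any]]] = {}
    for record in records:
        for name, value in record.items():
            if name == key:
                continue
            # A multi-valued field contributes one (key, value) pair per value.
            values = value if isinstance(value, list) else [value]
            for item in values:
                if item is not None:
                    columns.setdefault(name, []).append((record[key], item))
    return columns


columns = to_columns(rows)
```

Missing values take no space in the columnar form: `columns["AGE"]` has only two entries, while the row-based table has to carry an empty AGE cell for Ricky.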
Conceptual columnar Bigtable equivalent

Primary Index (ROWKEY:COLUMNKEY:TIMESTAMP)   Column Family PERSONAL_DETAILS
1:PERSONAL_DETAILS:01/01/2011                NAME:Ricky  INTERESTS:Soccer  INTERESTS:Movies  INTERESTS:Baseball
2:PERSONAL_DETAILS:31/03/2012                NAME:Ankur  AGE:20
3:PERSONAL_DETAILS:20/10/2012                NAME:Sam  AGE:25  INTERESTS:Music
Google MapReduce
MapReduce processes massive distributed datasets by mapping data into key/value pairs, then reducing over all pairs with the same key
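A single-machine word count shows the map, group-by-key and reduce steps in miniature. This is a sketch of the programming model only: the function names are illustrative, the input lines are made up, and the distribution of work across servers is left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def map_phase(records: Iterable[str], mapper: Callable[[str], List[Pair]]) -> List[Pair]:
    """Apply the mapper to every record and collect the emitted key/value pairs."""
    pairs: List[Pair] = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs: Iterable[Pair]) -> Dict[str, List[int]]:
    """Group values by key so each key can be reduced independently."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]], reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Apply the reducer once per key, over all values that share that key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line: str) -> List[Pair]:
    """Emit (word, 1) for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key: str, values: List[int]) -> int:
    """Add up the counts emitted for one word."""
    return sum(values)


lines = ["big data big tools", "data tools data"]   # made-up input
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Because each mapper call looks at one record and each reducer call looks at one key, both phases can be spread over many machines without the calls needing to talk to each other.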
Hadoop
Hadoop is Apache's free, open-source implementation of the ideas behind Google File System, Google Bigtable, Google MapReduce and other software
Hadoop (written in Java) is buggy, needing strong (expensive) Java expertise to fix the code
Wrappers for underlying Hadoop function calls can be written in almost any language
Tools like HBase (an example of a NoSQL columnar data store) sit on top of HDFS (the Hadoop Distributed File System) and offer tables and a query language supporting MapReduce as well as DML like Get/Put/Scan
Hadoop expertise is relatively scarce (expensive), especially when configuring 100s/1,000s of servers/disks, when writing MapReduce jobs on a huge distributed infrastructure, and when managing data in a new way
Other utilities
Apache Pig and Apache Hive are platforms providing data summarisation, analysis and queries
Pig Latin is a procedural data flow language for exploring large datasets
HiveQL is an SQL-like (but not SQL) language for exploring large datasets
Pig Latin and HiveQL commands compile into MapReduce jobs
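To give a feel for what "compiling to MapReduce" means, the Python sketch below hand-translates a GROUP BY style query into a map step and a reduce step. The click-stream records, the query text in the comment and the function names are all hypothetical, and the jobs that Pig or Hive actually generate will differ.

```python
from itertools import groupby
from operator import itemgetter
from typing import Iterable, List, Tuple

# Hypothetical click-stream records: (page, bytes_sent)
clicks: List[Tuple[str, int]] = [
    ("/home", 120), ("/about", 80), ("/home", 200), ("/about", 20), ("/home", 10),
]

# Query being imitated (illustrative SQL-like text only):
#   SELECT page, COUNT(*), SUM(bytes_sent) FROM clicks GROUP BY page


def map_step(record: Tuple[str, int]) -> Tuple[str, Tuple[int, int]]:
    """Emit the GROUP BY column as the key and (1, bytes_sent) as the value."""
    page, bytes_sent = record
    return (page, (1, bytes_sent))


def reduce_step(page: str, values: Iterable[Tuple[int, int]]) -> Tuple[str, int, int]:
    """Combine all values for one page into (page, row count, total bytes)."""
    collected = list(values)
    return (page, sum(c for c, _ in collected), sum(b for _, b in collected))


# Sorting by key stands in for the shuffle that brings equal keys together.
mapped = sorted(map(map_step, clicks), key=itemgetter(0))
result = [
    reduce_step(page, (value for _, value in group))
    for page, group in groupby(mapped, key=itemgetter(0))
]
```

The GROUP BY column becomes the key and the aggregates become the reduction, which is why a declarative query of this shape maps naturally onto a MapReduce job.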