Hadoop: An Open Source Project Commonly Used for Processing Big Data Sets

Two sources:
1) ACM webinar on Big Data with Hadoop, July 23, 2014.
2) "Big Data and its Technical Challenges," H. V. Jagadish et al., CACM, July 2014, Vol. 57, No. 7.

Copyright © 2014-2017 Curt Hill
Introduction
- An Apache open source project for distributed storage and distributed processing of large amounts of data
- Automatically replicates and distributes data over multiple nodes
- Executes jobs that process that data, using MapReduce
- Tracks the progress and results of the multiple jobs
- Presumes that node failure is possible and must be handled
- A spreadsheet is an approximate limit on what can be analyzed by hand
- JSON = JavaScript Object Notation
- Metadata is data describing the format and purpose of data
Ecosystem
- Original definition: a biological community of interacting organisms and their physical environment
- More recently: a complex network or interconnected system
- Any popular OS is an ecosystem
- The Eclipse IDE is one also
- Hadoop is as well
Hadoop Pieces
- The Common: utilities
- YARN: job scheduling and cluster management framework
- MapReduce: mechanism for parallel processing of large data sets
- HDFS: the Hadoop Distributed File System
MapReduce
- A software technique for processing large quantities of data over several processors
- Developed by Google
- Several steps:
  - Map the data into key-value pairs
  - Shuffle the data onto various nodes
  - Reduce all the values that share the same key
- Both the key and the value may be of any size and type
Example
- The classic example is counting the words in a very large collection of text
- Consider Shakespeare's collected works
- The key would be the word itself
- The value could be as simple as the location of the word
- Or as complicated as the play, act, scene, speaker, and line number
- With the latter, we may move from simple counts to much more complicated analysis
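The word-count example above can be sketched as the three MapReduce phases in plain Python. This is a minimal illustration only, not Hadoop itself; the document names and function names are invented here, and the value emitted for each word is its location, as the slide suggests:

```python
from collections import defaultdict

def map_phase(documents):
    """Emit (key, value) pairs: the key is the word, the value its location."""
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            yield word, (doc_id, position)

def shuffle_phase(pairs):
    """Group all values that share the same key (Hadoop's framework does this)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce each key's list of values; here, to a simple count."""
    return {word: len(locations) for word, locations in groups.items()}

docs = {"sonnet18": "shall i compare thee to a summer day",
        "sonnet130": "my mistress eyes are nothing like the sun"}
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["shall"])  # 1
```

Because the shuffle keeps every location rather than a bare count, richer analyses (who speaks a word, and where) only require changing the reduce step.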
Workflow
- A typical system will chunk the input into pieces
- Each piece will be distributed to a machine
- A map script will be run on the pieces, on each node
- The shuffle (or sort) step will rearrange the pairs, directing each to the proper node
- A reduce script will combine the rearranged mappings
MapReduce Picture
[Figure: diagram of the map, shuffle, and reduce phases]
MapReduce vs. RDBMS

              RDBMS                     MapReduce
Data size     Gigabytes to terabytes    Petabytes to exabytes
Updates       Write many, read many     Write once, read many
Access type   Interactive and batch     Batch
Schema        Static                    Dynamic
Scaling       Worse than linear         Linear
Integrity     ACID (high)               BASE (low)
Hadoop Again
- Typically the map and reduce scripts are written in Java
- Other languages are possible
- Each script may be written as if it were to be executed on a single machine
- Hadoop handles the replication and task tracking
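One way other languages are used is Hadoop Streaming, which runs any executable as a mapper or reducer: the mapper reads raw text on stdin and writes tab-separated key/value pairs to stdout, and the reducer receives those pairs sorted by key. The sketch below (file layout and function names are my own; in real use the mapper and reducer would be two separate scripts) shows how each script is written as if it ran on a single machine:

```python
import sys

def run_mapper(lines, out=sys.stdout):
    """Mapper: emit 'word<TAB>1' for every word; knows nothing of other nodes."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

def run_reducer(lines, out=sys.stdout):
    """Reducer: input arrives sorted by key, so equal keys are adjacent."""
    current, count = None, 0
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                out.write(f"{current}\t{count}\n")
            current, count = key, 0
        count += int(value)
    if current is not None:
        out.write(f"{current}\t{count}\n")

if __name__ == "__main__":
    # A real deployment would invoke one of these per script via Hadoop Streaming.
    run_mapper(sys.stdin)
```

Hadoop takes care of distributing these scripts to the nodes holding the data, sorting the mapper output, and rerunning tasks on failed nodes.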
Scale Up Example
- Suppose that we have an RDBMS
- Three servers that communicate with a SAN (storage area network)
  - Server to server via Ethernet
  - Server to SAN via fiber
- Very fast for what it can do
- Any number of problems can disable the whole thing:
  - Communication between servers
  - Communication to the SAN
  - Disk failure in the SAN
Scale Out Example
- Multiple servers
- Hadoop replicates:
  - The data
  - The tasks accessing the data
- Any one or two failures may slow throughput, but the processing may still complete
- Due to the lack of specialized high-speed hardware, this will be slower than the scale-up configuration
- But perhaps more available
Apache Hadoop Projects
- Aside from the basic project, Apache has at least 11 projects in the Hadoop ecosystem
- Several scalable NoSQL databases: Cassandra, HBase, and Hive
- Several data-flow utilities:
  - Pig is a high-level dataflow language
  - Tez is a data-flow programming framework
Summary
- Hadoop is an open source system
- It replicates and distributes the data
- It uses map and reduce scripts to process the data
- It manages the clusters