1 Hadoop: An Open Source Project Commonly Used for Processing Big Data Sets
Two sources:
1) ACM webinar on Big Data with Hadoop, July 23, 2014.
2) Big Data and Its Technical Challenges, H. V. Jagadish et al., CACM, July 2014, Vol. 57, No. 7.
Copyright © Curt Hill

2 Introduction
- An Apache open source project for distributed storage and distributed processing of large amounts of data
- Automatically replicates and distributes data over multiple nodes
- Executes jobs that process that data, using MapReduce
- Tracks the progress and results of the multiple jobs
- Presumes that node failure is possible and needs to be handled
Notes:
- A spreadsheet is the approximate limit on what can be analyzed by hand
- JSON = JavaScript Object Notation
- Metadata is data describing the format and purpose of other data
Copyright © Curt Hill

3 Ecosystem
- Original definition: a biological community of interacting organisms and their physical environment
- More recently: a complex network or interconnected system
- Any popular OS is an ecosystem; the Eclipse IDE is one also; Hadoop is as well
Copyright © Curt Hill

4 Hadoop Pieces
- Hadoop Common: the shared utilities
- YARN: job scheduling and cluster management framework
- MapReduce: mechanism for parallel processing of large data sets
- HDFS: the Hadoop Distributed File System
Copyright © Curt Hill
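To make the HDFS piece concrete, here is a minimal Java sketch of copying a file into HDFS through the FileSystem API; the paths are hypothetical, and a reachable cluster configuration on the classpath is assumed:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath,
        // so fs.defaultFS points at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; the framework replicates the
        // file's blocks across DataNodes automatically.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));
        System.out.println("Replication factor: " +
            fs.getFileStatus(new Path("/user/demo/input.txt")).getReplication());
        fs.close();
    }
}
```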

5 MapReduce
- A software technique for processing large quantities of data over several processors
- Developed by Google
- Several steps:
  - Map the data into key-value pairs
  - Shuffle the data onto the various nodes
  - Reduce: combine all the values that share the same key
- Both the key and the data may be of any size and type
Copyright © Curt Hill
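A single-machine sketch of the three steps in plain Java may help before looking at Hadoop itself; nothing here is Hadoop API, it just mirrors the map/shuffle/reduce shape:

```java
import java.util.*;

public class MiniMapReduce {
    public static void main(String[] args) {
        List<String> docs = List.of("to be or not to be", "be happy");

        // Map: emit a (word, 1) pair for every word.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String doc : docs)
            for (String word : doc.split("\\s+"))
                pairs.add(Map.entry(word, 1));

        // Shuffle: group the pairs by key (Hadoop does this across nodes).
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce: combine the values that share a key -- here, by summing.
        groups.forEach((word, ones) ->
            System.out.println(word + "\t" + ones.stream().mapToInt(Integer::intValue).sum()));
    }
}
```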

6 Example
- The classic example is to count the words in a very large collection of text
- Consider Shakespeare's collected works
- The key would be the word itself
- The data could be as simple as the location of the word, or as complicated as the play, act, scene, speaker, and line number
- If we track the latter, then we may move from simple counts to much more complicated analysis
Copyright © Curt Hill
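Word count is the canonical Hadoop demonstration; a sketch of the map and reduce classes against the real org.apache.hadoop.mapreduce API (simplified from the standard tutorial) looks like this:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map: for each word in a line of input, emit the pair (word, 1).
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(line.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // key = the word itself
            }
        }
    }

    // Reduce: the shuffle has grouped every (word, 1) pair by word;
    // summing the ones yields that word's total count.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            result.set(sum);
            context.write(word, result);
        }
    }
}
```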

7 Workflow
- A typical system will chunk the input into pieces
- Each piece is distributed to a machine
- A map script is run on the pieces, on each node
- The shuffle (or sort) step rearranges the output, directing each key to the proper node
- A reduce script combines the rearranged mappings
Copyright © Curt Hill
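A sketch of the driver that wires this workflow together, reusing the WordCount classes sketched above; the input and output paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // The framework chunks the input into splits, runs the mapper on
        // each split, shuffles by key, then runs the reducer per key group.
        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-reduce
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/user/demo/shakespeare"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/counts"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```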

8 MapReduce Picture
[Figure not reproduced in the transcript]
Copyright © Curt Hill

9 MapReduce vs. RDBMS

                RDBMS                    MapReduce
Data size       Gigabytes to terabytes   Petabytes to exabytes
Updates         Write many, read many    Write once, read many
Access type     Interactive and batch    Batch
Schema          Static                   Dynamic
Scaling         Worse than linear        Linear
Integrity       ACID (high)              BASE (low)

Copyright © Curt Hill

10 Hadoop Again
- Typically the map and reduce scripts are written in Java; other languages are possible
- Each script may be written as if it were only to be executed on a single machine
- Hadoop handles the replication and the task tracking
Copyright © Curt Hill

11 Scale Up Example
- Suppose that we have an RDBMS: three servers that communicate with a SAN (storage area network)
  - Server to server via Ethernet
  - Server to SAN via fiber
- Very fast for what it can do
- Any number of problems can disable the whole thing:
  - Communication between servers
  - Communication to the SAN
  - Disk failure in the SAN
Copyright © Curt Hill

12 Scale Out Example
- Multiple servers; Hadoop replicates both the data and the tasks accessing the data
- Any one or two failures may slow throughput, but the processing may still complete
- For lack of specialized, high-speed hardware this will be slower than the previous design, but perhaps more available
Copyright © Curt Hill

13 Apache Hadoop Projects
- Aside from the basic project, Apache has at least 11 projects in the Hadoop ecosystem
- Several scalable NoSQL databases: Cassandra, HBase, and Hive
- Several data-flow utilities:
  - Pig is a high-level dataflow language
  - Tez is a dataflow programming framework
Copyright © Curt Hill

14 Summary
- Hadoop is an open source system
- It replicates and distributes the data
- It uses map and reduce scripts to process the data
- It manages the clusters
Copyright © Curt Hill

