HADOOP Carson Gallimore, Chris Zingraf, Jonathan Light.


1 HADOOP Carson Gallimore, Chris Zingraf, Jonathan Light

2 Contents • Hadoop Overview • MapReduce • HDFS • History • Architecture • Applications

3 What is Hadoop? • An open-source Apache software project • Distributes the processing of large data sets over clusters of servers • Resilient by design: it detects and handles failures at the application layer http://tinyurl.com/m33wgcw

4 Overview • Hadoop is surrounded by a large ecosystem of Apache projects (e.g. Pig, Hive, ZooKeeper) • Its core relies on MapReduce and HDFS (the Hadoop Distributed File System) • MapReduce is a framework that assigns work to the nodes in a cluster • HDFS is a file system that spans all of the nodes in the cluster to store data http://www.ibmbigdatahub.com/sites/default/files/public_images/hadoop.jpg

5 MapReduce • "MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster." http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ http://people.apache.org/~rdonkin/hadoop-talk/diagrams/map-reduce.png

6 Example: http://www-01.ibm.com/software/ebusiness/jstart/graphics/hadoopDiagram.png
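The map-then-reduce flow in the diagram above can be sketched as a toy word count — the canonical MapReduce example. This is a plain-Python simulation of the paradigm (map, shuffle, reduce), not Hadoop's actual Java API; the function names here are illustrative only.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Simulate two input "splits" processed by two independent mapper tasks.
splits = ["the quick brown fox", "the lazy dog the end"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(shuffle(mapped))
```

In a real cluster each split's `map_phase` runs on a different node, which is where the "massive scalability" quoted above comes from.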

7 HDFS • HDFS breaks the data in the cluster into small blocks and distributes them across the nodes. • This aids scalability: because the data is broken down, the map and reduce functions can operate on smaller subsets of the larger data set. • The goal of Hadoop is to use commodity servers with inexpensive internal disk drives in large clusters.
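The block-splitting idea above can be sketched in a few lines. Note the tiny block size here is purely for illustration — real HDFS blocks default to tens of megabytes.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Split a file's bytes into fixed-size blocks, HDFS-style.
    (Real HDFS uses much larger blocks; 4 bytes is for illustration.)"""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

blocks = split_into_blocks(b"0123456789abcdef", block_size=4)
# Each block can now be handed to a different node, so map tasks
# run on small subsets of the full data set in parallel.
```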

8 HDFS, Cont. • More machines mean a potentially higher fault rate. • Hadoop was developed with high failure rates in mind. • Both Hadoop and HDFS have built-in fault-tolerance and compensation capabilities.

9 HDFS, Cont. • The data is divided into blocks, and copies of these blocks are made. • The copied blocks are stored on other servers throughout the cluster. • This way, if a server fails, the file can be recovered by combining the copied blocks.
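The replicate-then-recover scheme above can be sketched as follows. This is a simplified round-robin placement, not HDFS's real rack-aware placement policy; all names here are hypothetical.

```python
import itertools

def place_replicas(block_ids, servers, replication=3):
    """Assign each block to `replication` servers, round-robin.
    (Real HDFS placement is rack-aware; this is a simplification.)"""
    placement = {}
    ring = itertools.cycle(servers)
    for block in block_ids:
        placement[block] = [next(ring) for _ in range(replication)]
    return placement

def recoverable(placement, failed_server):
    """A file survives a single failure if every one of its blocks
    still has at least one replica on a live server."""
    return all(any(s != failed_server for s in replicas)
               for replicas in placement.values())

# Three blocks replicated across four servers.
placement = place_replicas(["b0", "b1", "b2"], ["s1", "s2", "s3", "s4"])
```

With three replicas per block, losing any one server still leaves two live copies of every block, which is why the file remains readable.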

10 History • The underlying technology was invented by Google in order to index rich textual and structural information. • Designed to solve large-data problems involving a mixture of structured and complex data.

11 History, Cont. • Uses a MapReduce engine and HDFS • Written in Java • Continuously built and used by a global community of contributors

12 Architecture • Designed to run on many machines that share neither memory nor disks. • The software splits the data into pieces and spreads them across all the machines. • To achieve this, Hadoop implements MapReduce.

13 Architecture, Cont. • Hadoop keeps track of where all the data resides and keeps copies in case of a server failure. • There are many different ways to customize Hadoop to fit specific needs.

14 Applications • Hadoop can be applied to multiple markets, including: - Risk analysis for financial corporations - Online retail and product recommendations

15 References • Turner, James. January 12, 2011. Hadoop: What It Is, How It Works, and What It Can Do. http://strata.oreilly.com/2011/01/what-is-hadoop.html • Wikipedia. September 18, 2013. Apache Hadoop. http://en.wikipedia.org/wiki/Hadoop

16 References, Cont. • What is Hadoop? http://www-01.ibm.com/software/data/infosphere/hadoop/ • What is MapReduce? http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ • What is HDFS? http://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/

17 Questions?

