HADOOP Carson Gallimore, Chris Zingraf, Jonathan Light
Contents Hadoop Overview, MapReduce, HDFS, History, Architecture, Applications
What is Hadoop? An open-source software project used to distribute the processing of large data sets across clusters of servers. The software is resilient because it detects and handles failures at the application layer.
Overview The Hadoop ecosystem includes many related Apache projects (e.g. Pig, Hive, ZooKeeper). Hadoop itself relies mainly on MapReduce and HDFS (the Hadoop Distributed File System). MapReduce is a framework that assigns work to the nodes in a cluster; HDFS is a file system that spans all of the nodes in the cluster to store data.
MapReduce “MapReduce is the heart of Hadoop. It is this programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster”
Example:
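A minimal word-count sketch using the Hadoop Java MapReduce API (class and field names here are illustrative): the map step emits a (word, 1) pair for each word in its input split, and the reduce step sums the counts for each word.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map step: for every line of input, emit (word, 1) for each word.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : values) {
        sum += count.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}

Hadoop runs many copies of the mapper in parallel, one per input split, and groups all values for the same key before handing them to the reducer.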
HDFS HDFS breaks the data in the cluster into blocks and distributes them throughout the cluster. This helps with scalability, because the map and reduce functions can then work on smaller subsets of the large data sets. The goal of Hadoop is to use commodity servers with inexpensive internal disk drives in large clusters.
HDFS, Cont. More machines mean a potentially higher failure rate, so Hadoop was developed with high failure rates in mind. Both Hadoop and HDFS have built-in fault tolerance and fault compensation capabilities.
HDFS, Cont. The data is divided into blocks, and copies of these blocks are stored on other servers throughout the cluster. This way, if a server fails, the file can still be reconstructed from the replicated blocks.
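As a rough sketch of how this looks from a client's point of view, the Hadoop FileSystem Java API lets a writer choose a replication factor and block size when creating a file (the path and values below are illustrative assumptions):

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
  public static void main(String[] args) throws Exception {
    // Reads fs.defaultFS etc. from core-site.xml / hdfs-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Illustrative path; any HDFS location the user can write to works.
    Path file = new Path("/tmp/replication-demo.txt");

    // Write the file with 3 replicas and a 128 MB block size.
    // create(path, overwrite, bufferSize, replication, blockSize)
    try (FSDataOutputStream out =
             fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
      out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Each block of the file is now stored on multiple DataNodes,
    // so the loss of one server does not lose the data.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("replication = " + status.getReplication());
    System.out.println("block size  = " + status.getBlockSize());
  }
}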
History The underlying technology was invented by Google in order to index rich textual and structural information. It was designed to solve large-data problems involving a mixture of structured and complex data.
History, Cont. Hadoop uses a MapReduce engine and HDFS, is written in Java, and is continually built and used by a global community of contributors.
Architecture Hadoop is designed to run on many machines that do not share memory or disks. The software breaks data into pieces and spreads them across all the machines. To achieve this, Hadoop implements MapReduce.
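A sketch of the driver that ties this together, assuming the WordCount mapper and reducer from the earlier sketch and illustrative input/output paths: Hadoop splits the input, assigns map tasks to nodes (preferably the ones holding the data blocks), shuffles the map output by key, and runs the reduce tasks.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    // args[0] = HDFS input directory, args[1] = HDFS output directory (illustrative).
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Mapper and reducer from the earlier WordCount sketch.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Hadoop splits the input into blocks, schedules map tasks close to the
    // data, and runs reduce tasks on the shuffled, grouped map output.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}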
Architecture, Cont. Hadoop keeps track of where all the data resides and keeps copies in case of a server failure. There are many different ways to customize Hadoop to fit specific needs.
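One common way to customize behavior is through configuration properties, set either in the XML configuration files or per job in code. A small sketch, using real property names but illustrative values:

import org.apache.hadoop.conf.Configuration;

public class CustomConfigDemo {
  public static void main(String[] args) {
    // Configuration values normally come from core-site.xml, hdfs-site.xml and
    // mapred-site.xml, but can also be set in code. The values below are
    // illustrative, not recommendations.
    Configuration conf = new Configuration();
    conf.set("dfs.replication", "2");           // keep 2 copies of each block
    conf.set("dfs.blocksize", "268435456");     // 256 MB blocks
    conf.set("mapreduce.job.reduces", "4");     // run 4 reduce tasks

    System.out.println("replication = " + conf.get("dfs.replication"));
  }
}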
Applications Hadoop can be applied to multiple markets, including: - Risk analysis for financial corporations - Online retail and product recommendations
References Turner, James (January 12). "Hadoop: What It Is, How It Works, and What It Can Do." hadoop.html Wikipedia (September 18). "Apache Hadoop."
References cont. IBM. "What is Hadoop?" 01.ibm.com/software/data/infosphere/hadoop/ IBM. "What is MapReduce?" 01.ibm.com/software/data/infosphere/hadoop/mapreduce/ IBM. "What is HDFS?" 01.ibm.com/software/data/infosphere/hadoop/hdfs/
Questions?