1 Beyond Hadoop The leading open-source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim Elhag
2 Topics The Origins of Hadoop Hadoop Internals – The Hadoop Distributed File System (HDFS) – The Hadoop Map/Reduce Runtime How Hadoop Is Evolving – Analyzing Networks – Running in Real Time
3 The Origins of Hadoop
5 Post Web 2.0 Era – An explosion of data on the Internet and in the enterprise – 1000 GB = 1 Terabyte; 1000 Terabytes = 1 Petabyte How do we handle unstructured data? How do we process the volume? A need to process 100 TB datasets – On 1 node: scanning @ 50 MB/s = 23 days (MTBF = 3 years) – On a 1000-node cluster: scanning @ 50 MB/s = 33 mins (MTBF = 1 day) Need a framework for distribution (efficient, reliable, easy to use)
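The arithmetic behind those figures is worth making explicit. A hypothetical back-of-envelope sketch (not part of the original deck) that reproduces the 23-day and 33-minute numbers:

```java
// Back-of-envelope scan times for a 100 TB dataset at 50 MB/s per node,
// using decimal units as on the slide. Illustrative sketch only.
public class ScanTime {
    public static void main(String[] args) {
        double datasetMb = 100e6;   // 100 TB = 100,000,000 MB
        double rateMbPerSec = 50.0; // per-node streaming scan rate
        double secondsOneNode = datasetMb / rateMbPerSec; // 2,000,000 s
        System.out.printf("1 node: %.1f days%n", secondsOneNode / 86_400);           // ~23.1 days
        System.out.printf("1000 nodes: %.1f minutes%n", secondsOneNode / 1000 / 60); // ~33.3 min
    }
}
```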
6 The Origins of Hadoop In 2004 Google publishes seminal whitepapers on a new programming paradigm for handling data at Internet scale (Google processes upwards of 20 PB per day using Map/Reduce) http://research.google.com/people/sanjay/index.html The Apache Foundation launches Hadoop – an open-source implementation of Google Map/Reduce and the distributed Google File System Google and IBM create the “Cloud Computing Academic Initiative” to teach Hadoop and Internet-scale computing skills to the next generation of engineers
8 So what exactly is Apache Hadoop? A framework for running applications (aka jobs) on large clusters built from commodity hardware, capable of processing petabytes of data. A framework that transparently provides applications with both reliability and data motion, and that ensures data locality. It implements a computational paradigm named Map/Reduce, in which the application is divided into self-contained units of work, each of which may be executed or re-executed on any node in the cluster. It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Node failures are automatically handled by the framework.
9 Hadoop Internals
10 Hadoop – The Hadoop Cluster - Distributed File System - Map/Reduce
11 Hadoop – HDFS - Goals - Large single file system for the entire cluster - Optimized for streaming reads of large files - Handles hardware failures - Block storage and reliability - Files are broken into large blocks - Replicated to several nodes for reliability - Distributed - Allows access to data from any node in the cluster - Optimized for local access to the data - Designed for commodity hardware
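As a concrete illustration of those goals, here is a minimal sketch against the HDFS client API; the NameNode URI and file path are illustrative assumptions, not details from the deck:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/data/weblogs/part-00000"); // illustrative path

        // Files are split into large blocks, each replicated to several nodes.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset " + block.getOffset()
                    + " replicas on " + String.join(",", block.getHosts()));
        }

        // Streaming read: the client fetches each block from a nearby replica.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) { /* process n bytes */ }
        }
    }
}
```

Those block locations are exactly what the Map/Reduce scheduler consults to place map tasks next to the data, which is how the framework achieves data locality.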
13 Hadoop - Map/Reduce - The application programming model - Works like a Unix pipeline - Designed to deal with large data sets - Jobs are broken into map and reduce tasks and spread across the nodes - Each map/reduce operation must be independent - Multi-language support - Map step - One map task for each file split (block in HDFS) - The map function is called for each record in the input dataset - Produces a list of (key, value) pairs - Reduce step - Results are grouped by key and passed to the reduce step - The reduce function is called once for each key, in sorted order - The reduce step is not required
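The canonical example of this model is word count: the map step tokenizes each record and emits (word, 1) pairs, and the reduce step receives all values grouped by key and sums them. A minimal sketch using the standard org.apache.hadoop.mapreduce API; the input and output paths are supplied as arguments:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: called once per input record; emits a (word, 1) pair per token.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (token.isEmpty()) continue;
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }

    // Reduce: called once per key with all its values grouped; sums the counts.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because each map task works on one HDFS block independently, any failed task can simply be re-executed on another node holding a replica, without affecting the rest of the job.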
14 Hadoop - Map/Reduce Logical Flow
15 Hadoop - Map/Reduce on the Cluster
16 Hadoop – Map/Reduce – JobTracker Details
17 Hadoop – Map/Reduce – Job Details
18 Areas of Evolution Analyzing Networks: MapReduce was not designed to analyze data sets threaded with connections. A social network, for example, is best represented as a graph, in which each person becomes a vertex and an edge drawn between two individuals signifies a connection. Running in Real Time: For interactive workloads Hadoop is too slow, and other tools have begun to emerge. HBase, a software stack that sits atop the basic Hadoop infrastructure, adds real-time response capabilities (for example, serving users' movie preferences as they change). Cloudant, another real-time engine, uses a MapReduce-based framework to query data, but the data itself is stored as documents; it can track new and incoming information and process only the changes, which makes it faster: “We don’t require the daily extraction of data from one system into another, analysis in Hadoop, and re-injection back into a running application layer.” “A lot of people are talking about big data, but most people are just creating it,” says Guestrin. “The real value is in the analysis.”
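To give the real-time claim some shape, here is a minimal sketch against the modern HBase client API; the table name "movie_prefs", column family "p", and row key are illustrative assumptions, not details from the deck:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class MoviePrefs {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("movie_prefs"))) {

            // Write one user's preference; visible to readers immediately,
            // with no batch job in between.
            Put put = new Put(Bytes.toBytes("user42"));
            put.addColumn(Bytes.toBytes("p"), Bytes.toBytes("genre"), Bytes.toBytes("sci-fi"));
            table.put(put);

            // Low-latency point read by row key.
            Result row = table.get(new Get(Bytes.toBytes("user42")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("p"), Bytes.toBytes("genre"))));
        }
    }
}
```

The contrast with plain Map/Reduce is latency: reads and writes here are served per row key in milliseconds, where a Hadoop job would rescan the dataset in bulk.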
19 The Hadoop ecosystem
20 Hadoop – External Related Projects
21 Hadoop – Key internal Hadoop-related projects (IBM) JAQL – Query language for JSON GumShoe – Enhanced enterprise search BigSheets – Ad-hoc analytics for business professionals at web scale SystemT – Information extraction and programmable search for unstructured content GPFS – General Parallel File System, provided as an alternative to HDFS for Hadoop Extreme Analytic Platform (XAP) – Hadoop-based analytics platform Hadoop Appliance – Low-operational-cost hardware + software appliance pre-loaded with IBM Hadoop
22 How is Industry using Hadoop? Trend analysis of existing unstructured data (such as mining log files for key metrics) - Visible Measures Targeted crawling (obtains the data) coupled with information extraction and classification (structures the data) - ZVents Text analytics – the ability to run extractors over unstructured data to cleanse, structure, and normalize it so that it can be queried (via Pig / Hive / BigSheets) A programming model for cloud computing: Hadoop jobs running natively in the cloud, over data stored in the cloud, and storing the output in the cloud - Amazon EC2
23 Hadoop - A programming model for Cloud Computing Amazon EC2 and S3 Overview Hadoop on the Amazon Cloud HiPODS Academic Cluster
24 How is Academia using Hadoop? Research and algorithms improve with the quantity of data one has to analyze, but researchers are thus left with the following problems: hardware acquisition and the cost of maintaining large clusters; spending an inordinate amount of time understanding, writing, and troubleshooting parallel computing tasks that are not intrinsic to their research. Academia is turning to running Hadoop jobs cheaply on Amazon EC2 instances and hiring a CS intern to write the jobs for them - $127 results!
25 Hadoop Future Directions Database Technologies – HadoopDB (Yale – hybrid parallel database system) – Map/Reduce Online (Berkeley – real-time M/R) – Sqoop (Cloudera – enabling JDBC import into HDFS) Higher-Order Interpreters – Hive, Pig, JAQL, BigSheets Systems Management / Resource Allocation