
1 Beyond Hadoop: The leading open-source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim Elhag

2 Topics
- The Origins of Hadoop
- Hadoop Internals
  - The Hadoop Distributed File System (HDFS)
  - The Hadoop Map/Reduce Runtime
- How Hadoop Is Evolving
  - Analyzing Networks
  - Running in Real Time

3 The Origins of Hadoop

4 (image-only slide)

5 Post Web 2.0 Era: An Explosion of Data on the Internet and in the Enterprise
- 1000 GB = 1 Terabyte; 1000 Terabytes = 1 Petabyte
- How do we handle unstructured data?
- How do we process the volume? Consider the need to process a 100 TB dataset (arithmetic sketched below):
  - On 1 node, scanning at 50 MB/s takes ~23 days (MTBF = 3 years)
  - On a 1000-node cluster, scanning at 50 MB/s takes ~33 minutes (MTBF = 1 day)
- We need a framework for distribution that is efficient, reliable, and easy to use
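The scan-time figures on this slide follow directly from the dataset size and per-node throughput. A minimal sketch of the arithmetic in Python, using only the slide's own numbers (100 TB, 50 MB/s, decimal units):

    # Back-of-the-envelope scan time for a 100 TB dataset at 50 MB/s per node.
    dataset_mb = 100 * 1000 * 1000          # 100 TB in MB (decimal units, as on the slide)
    seconds_on_one_node = dataset_mb / 50   # 50 MB/s throughput -> 2,000,000 s
    print(seconds_on_one_node / 86400)      # ~23.1 days on a single node
    print(seconds_on_one_node / 1000 / 60)  # ~33.3 minutes split across 1000 nodes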

6 The Origins of Hadoop
- In 2004, Google publishes seminal whitepapers on a new programming paradigm to handle data at Internet scale (Google processes upwards of 20 PB per day using Map/Reduce): http://research.google.com/people/sanjay/index.html
- The Apache Foundation launches Hadoop, an open-source implementation of Google Map/Reduce and the distributed Google File System
- Google and IBM create the "Cloud Computing Academic Initiative" to teach Hadoop and Internet-scale computing skills to the next generation of engineers

7 (image-only slide)

8 So what exactly is Apache Hadoop?
- A framework for running applications (a.k.a. jobs) on large clusters built from commodity hardware, capable of processing petabytes of data
- A framework that transparently provides applications with both reliability and data motion, and ensures data locality
- It implements a computational paradigm named Map/Reduce, in which the application is divided into self-contained units of work, each of which may be executed or re-executed on any node in the cluster
- It provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster
- Node failures are handled automatically by the framework

9 Hadoop Internals

10 Hadoop – The Hadoop Cluster
- Distributed File System
- Map/Reduce

11 Hadoop – HDFS
- Goals
  - A large, single file system for the entire cluster
  - Optimized for streaming reads of large files
  - Handles hardware failures
- Block storage and reliability (see the sketch below)
  - Files are broken into large blocks
  - Blocks are replicated to several nodes for reliability
- Distributed
  - Allows access to data from any node in the cluster
  - Optimized for local access to the data
- Designed for commodity hardware
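To make the block-and-replica idea concrete, here is a minimal sketch of splitting a file into blocks and placing replicas. This is illustrative only, not HDFS's actual placement logic: the 128 MB block size, replication factor of 3, and round-robin placement are assumptions (real HDFS placement is also rack- and load-aware):

    # Illustrative sketch of HDFS-style block splitting and replica placement.
    # Block size, replication factor, and placement policy are simplifying
    # assumptions, not HDFS internals.
    BLOCK_SIZE = 128 * 1024 * 1024   # assume 128 MB blocks
    REPLICATION = 3                  # assume each block is stored on 3 nodes

    def place_blocks(file_size, nodes):
        """Split a file into blocks; assign each block to REPLICATION distinct nodes."""
        num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
        return {
            b: [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
            for b in range(num_blocks)
        }

    nodes = ["node1", "node2", "node3", "node4"]
    for block, replicas in place_blocks(400 * 1024 * 1024, nodes).items():
        print(f"block {block} -> {replicas}")   # each block lives on 3 of the 4 nodes

Because every block exists on several nodes, the scheduler can usually run a map task on a node that already holds its input block, which is the data locality the earlier slide mentions.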

12 (image-only slide)

13 Hadoop - Map/Reduce
- The application programming model; works like a Unix pipeline
- Designed to deal with large data sets
- Jobs are broken into map and reduce tasks and spread across the nodes
- Each map/reduce operation must be independent
- Multi-language support
- Map step
  - One map task for each file split (a block in HDFS)
  - The map function is called for each record in the input dataset
  - Produces a list of (key, value) pairs
- Reduce step (optional; see the word-count sketch below)
  - Results are grouped by key and passed to the reduce step
  - The reduce function is called once for each key, in sorted order
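The canonical illustration is word count. Below is a sketch in the Hadoop Streaming style (one way to use the multi-language support noted above), where the mapper and reducer are ordinary scripts that read stdin and write stdout; in a real job these would be two separate files handed to the streaming jar:

    # Sketch of word count as a streaming-style map and reduce pair.
    # Each function would normally be the body of its own script.
    import sys

    def mapper():
        # Emit (word, 1) for every word in this task's input split.
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")

    def reducer():
        # The framework delivers lines sorted by key, so each word's counts
        # arrive contiguously and can be summed with one running total.
        current_word, count = None, 0
        for line in sys.stdin:
            word, value = line.rstrip("\n").split("\t", 1)
            if word != current_word and current_word is not None:
                print(f"{current_word}\t{count}")
                count = 0
            current_word = word
            count += int(value)
        if current_word is not None:
            print(f"{current_word}\t{count}")

Because each mapper sees only its own split and each reducer sees only its own keys, any task can be re-executed on any node after a failure, which is exactly the independence requirement in the list above.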

14 Hadoop - Map/Reduce Logical Flow

15 Hadoop - Map/Reduce on the Cluster

16 Hadoop – Map/Reduce – JobTracker Details

17 Hadoop – Map/Reduce – Job Details

18 Areas of Evolution
- Analyzing Networks: MapReduce was not designed to analyze data sets threaded with connections. A social network, for example, is best represented in graph form, in which each person becomes a vertex and an edge drawn between two individuals signifies a connection (see the sketch below).
- Running in Real Time: For interactive workloads Hadoop is too slow, and other tools have begun to emerge. One approach builds real-time response capabilities into HBase, a software stack that sits atop the basic Hadoop infrastructure (for example, serving users' movie preferences). Cloudant, another real-time engine, uses a MapReduce-based framework to query data, but the data itself is stored as documents. Cloudant is faster because it can track new and incoming information and process only the changes: "We don't require the daily extraction of data from one system into another, analysis in Hadoop, and re-injection back into a running application layer."
"A lot of people are talking about big data, but most people are just creating it," says Guestrin. "The real value is in the analysis."
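To see where plain MapReduce starts to strain, here is a small illustrative sketch (not from the talk; the edge data is made up) of a graph statistic that still fits the model: counting each person's connections from an edge list. One-pass aggregates like this work fine; iterative graph algorithms such as PageRank must re-run the whole map/shuffle/reduce pipeline once per iteration, which is the inefficiency graph-oriented systems target:

    # Sketch: degree counting over a social-network edge list, expressed as
    # map and reduce phases. Illustrative data only.
    from collections import defaultdict

    edges = [("alice", "bob"), ("bob", "carol"), ("alice", "carol")]

    # Map: each undirected edge emits (person, 1) for both endpoints.
    mapped = [(person, 1) for a, b in edges for person in (a, b)]

    # Shuffle + reduce: group by key and sum the values.
    degree = defaultdict(int)
    for person, one in mapped:
        degree[person] += one

    print(dict(degree))  # {'alice': 2, 'bob': 2, 'carol': 2}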

19 The Hadoop ecosystem

20 Hadoop – External Related Projects

21 Hadoop – Key Internal Hadoop-Related Projects
- JAQL – query language for JSON
- GumShoe – enterprise enhanced search
- BigSheets – ad-hoc analytics for business professionals at web scale
- SystemT – information extraction and programmable search for unstructured content
- GPFS – General Parallel File System, provided as an alternative to HDFS for Hadoop
- Extreme Analytic Platform (XAP) – Hadoop-based analytics platform
- Hadoop Appliance – low-operational-cost hardware + software appliance pre-loaded with IBM Hadoop

22 How is Industry Using Hadoop?
- Trend analysis of existing unstructured data, such as mining log files for key metrics (Visible Measures); see the sketch below
- Targeted crawling, which obtains the data, coupled with information extraction and classification, which structure it (ZVents)
- Text analytics: running extractors over unstructured data to cleanse, structure, and normalize it so that it can be queried (via Pig / Hive / BigSheets)
- A programming model for cloud computing: Hadoop jobs running natively in the cloud, over data stored in the cloud, and storing the output in the cloud (Amazon EC2)
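As a sketch of the log-mining use case, here is what a map step might look like for counting HTTP status codes in web-server logs. The Apache common log format and the status-code metric are illustrative assumptions, not details from the talk; the reduce step is the same summing pattern as in the word-count sketch earlier:

    # Sketch of a log-mining map step: emit (status_code, 1) per access-log line.
    # Assumes Apache common log format, e.g.:
    # 127.0.0.1 - - [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
    import re
    import sys

    STATUS_RE = re.compile(r'" (\d{3}) ')   # status code follows the quoted request

    for line in sys.stdin:
        match = STATUS_RE.search(line)
        if match:
            print(f"{match.group(1)}\t1")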

23 Hadoop – A Programming Model for Cloud Computing
- Amazon EC2 and S3 Overview
- Hadoop on the Amazon Cloud
- HiPODS Academic Cluster

24 How is Academia Using Hadoop?
Research and algorithms improve with the quantity of data one has to analyze, but researchers are then left with the following problems:
- Hardware acquisition and the cost of maintaining large clusters
- Spending an inordinate amount of time understanding, writing, and troubleshooting parallel computing tasks that are not intrinsic to their research
Academia is turning to running Hadoop jobs cheaply on Amazon EC2 instances and hiring a CS intern to write the jobs: $127 results!

25 Hadoop Future Directions
- Database technologies
  - HadoopDB (Yale: hybrid parallel database system)
  - Map/Reduce Online (Berkeley: real-time M/R)
  - Sqoop (Cloudera: enabling JDBC import into HDFS)
- Higher-order interpreters: Hive, Pig, JAQL, BigSheets
- Systems management / resource allocation

26 (image-only slide)

