Download presentation
Presentation is loading. Please wait.
1
Ministry of Higher Education
Sultanate of Oman Ministry of Higher Education University of Nizwa Big Data AND Hadoop Prepared by : Amal Al Muqrishi Course Title: Advanced Database Systems - INFS504
2
Outline Big data and its challenges Apache Hadoop
Requirements for Hadoop Apache Hadoop modules Hadoop Distributed File System Hadoop MapReduce Hadoop and RDBMS Comparison Advanced Database Systems - INFS504
3
Big data The data is growing rapidly because of
Users Systems Sensors Applications Big data: It is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Advanced Database Systems - INFS504
4
Challenges of Big data Challenges of Big data that called V attributes : Velocity : means the data comes at high speed Volume : focus on large and growing files Variety : means the files come in various formats (e.g. text, sound and video) How to handle the big data ? How to process big data with reasonable cost and time? Advanced Database Systems - INFS504
5
What is Apache Hadoop? Hadoop: It is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. Advanced Database Systems - INFS504
6
Who Uses Hadoop? Advanced Database Systems - INFS504
7
Requirements for Hadoop
Must support partial failure Must be scalable Advanced Database Systems - INFS504
8
Partial Failures Failure of a single component must not cause the failure of the entire system only a degradation of the application performance Failure should not result in the loss of any data Advanced Database Systems - INFS504
9
Component Recovery If a component fails, it should be able to recover without restarting the entire system Component failure or recovery during a job must not affect the final output Advanced Database Systems - INFS504
10
Scalability Increasing resources should increase load capacity
Increasing the load on the system should result in a graceful decline in performance for all jobs Not system failure Advanced Database Systems - INFS504
11
Hadoop modules Hadoop consists of numerous functional modules :
At a minimum, Hadoop uses Hadoop Common as a kernel to provide the framework's essential libraries. Hadoop Distributed File System (HDFS), which is capable of storing data across thousands of commodity servers to achieve high bandwidth between nodes. Hadoop MapReduce, which provides the programming model used to tackle large distributed data processing - mapping data and reducing it to a result. Hadoop Yet Another Resource Negotiator (YARN), which provides resource management and scheduling for user applications. Advanced Database Systems - INFS504
12
Hadoop modules Hadoop also supports a range of related projects that can complement and extend Hadoop's basic capabilities. Complementary software packages include: Apache Flume Apache HBase Apache Hive Apache Oozie Apache Phoenix Apache Pig Apache Sqoop Apache Spark Apache Storm Apache ZooKeeper Advanced Database Systems - INFS504
13
Hadoop Distributed File System
HDFS is a file system written in Java based on the Google’s FS Responsible for storing data on the cluster Data files are split into blocks and distributed across the nodes in the cluster Each block is replicated multiple times The NameNode keeps track of which blocks make up a file and where they are stored Advanced Database Systems - INFS504
14
Data Replication Advanced Database Systems - INFS504
15
Data Retrieval When a client wants to retrieve data
Communicates with the NameNode to determine which blocks make up a file and on which data nodes those blocks are stored Then communicated directly with the data nodes to read the data Advanced Database Systems - INFS504
16
MapReduce Overview A method for distributing computation across multiple nodes Each node processes the data that is stored at that node Consists of two main phases Map Reduce Advanced Database Systems - INFS504
17
MapReduce cont. The Mapper: Shuffle and Sort: The Reducer:
Reads data as key/value pairs Shuffle and Sort: Output from the mapper is sorted by key All values with the same key are guaranteed to go to the same machine The Reducer: Called once for each unique key Gets a list of all values associated with a key as input The reducer outputs zero or more final key/value pairs Usually just one output per input key Advanced Database Systems - INFS504
18
MapReduce cont. Advanced Database Systems - INFS504
19
Summary Hadoop MapReduce and HDFS
Hadoop implements Google’s MapReduce, using HDFS. MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located. Advanced Database Systems - INFS504
20
Hadoop and RDBMS Comparison
The traditional systems, such as Relational Database Management Systems (RDBMS), are not able to handle the big data, while hadoop handles the issues by using HDFS for storage and Hadoop MapReduce for analysis and processing Hadoop fits well to analyze whole dataset in a batch fashion RDMS is for real-time data retrieval with low-latency and small data sets Advanced Database Systems - INFS504
21
Hadoop and RDBMS Comparison cont.
Advanced Database Systems - INFS504
22
Summary Big data and its challenges Apache Hadoop
Velocity, Volume, and Variety Apache Hadoop Requirements for Hadoop Must support partial failure Must be scalable Apache Hadoop modules Hadoop Distributed File System Hadoop MapReduce Hadoop and RDBMS Comparison Advanced Database Systems - INFS504
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.