Presentation on theme: "Hadoop."— Presentation transcript:

1 Hadoop

2 Hadoop Hadoop is a distributed framework for running MapReduce programs.
There are multiple versions; we'll be talking about v1 here (v2, a.k.a. YARN, is covered later).
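The programming model Hadoop executes can be sketched in plain Python. This is a toy word count showing the map, shuffle/sort, and reduce stages, not Hadoop's actual API:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # Mapper: emit a (word, 1) pair for every word in every input line.
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: Hadoop sorts and groups intermediate pairs by key
    # between the map and reduce stages.
    pairs = sorted(pairs, key=itemgetter(0))
    # Reducer: sum the counts for each word.
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog"]
print(dict(reduce_phase(map_phase(lines))))
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

In real Hadoop the mapper and reducer run on many machines in parallel, and the framework handles the sort/shuffle between them.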

3 Hadoop Distributed File System (HDFS)
HDFS is designed to run on a cluster of commodity hardware. It implements a relaxed version of POSIX to enable streaming access to file system data. Its primary purpose is to provide high-throughput access to large datasets. Metadata (filenames, access times, permissions, checksums) is stored on a dedicated server called the NameNode. Application data (the actual file contents) is stored on other servers called DataNodes. All the servers are fully connected and communicate via TCP-based protocols. File contents are replicated across multiple DataNodes.

4 Why would HDFS replicate file contents?
1. Increase durability (reliability)
2. Increase bandwidth (multiple nodes can serve data)
3. Increase colocation (data with programs needing data)
4. Increase profitability (Hint: Hadoop is Open Source)
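The durability benefit is easy to quantify with a back-of-the-envelope model: if nodes fail independently with probability p, a block is lost only when every one of its r replicas fails at once (this ignores correlated failures and re-replication, so it is only illustrative):

```python
def loss_probability(p_node_failure, replicas):
    # A block is lost only if every node holding a replica fails.
    # Assumes independent failures -- an optimistic simplification.
    return p_node_failure ** replicas

# With a 1% chance that a given node is down, HDFS's default
# replication factor of 3 shrinks the chance of losing any
# particular block from 1 in 100 to about 1 in a million.
print(loss_probability(0.01, 1))
print(loss_probability(0.01, 3))
```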

5 HDFS Assumptions/Goals
Hardware Failures: Using commodity hardware means failure is the norm, not the exception. With thousands of nodes, the system needs automatic detection of and recovery from failures.
Batch Processing: HDFS is optimized for batch processing over interactive use, with an emphasis on high throughput rather than low latency.
Large Datasets: Handles huge files (into the terabyte range and beyond).
Simple Coherency Model: In general, there is one writer and many readers. Files can't be updated, only appended to.

6 Components of HDFS
NameNode: Maintains an image of the file system comprising the inodes (index nodes), block locations, and attributes (ownership, permissions, creation and access times, and disk space quotas). Keeps a write-ahead commit log called the Journal, which is used to keep checkpoints for the purpose of recovery.
Secondary NameNode: Periodically merges the Journal into the checkpoint image so the NameNode (a single point of failure) can recover quickly; despite the name, it is not a hot standby.
DataNodes: Blocks (data) are stored on these servers. Each DataNode periodically reports its state to the NameNode via a heartbeat mechanism.
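The heartbeat mechanism can be sketched as follows. The class and timeout here are hypothetical (real DataNodes heartbeat every few seconds by default, and the NameNode waits on the order of minutes before declaring a node dead):

```python
import time

# Illustrative value only; HDFS's real dead-node timeout is much longer.
HEARTBEAT_TIMEOUT = 30.0  # seconds of silence before a node is presumed dead

class NameNodeMonitor:
    # Tracks the last heartbeat time reported by each DataNode.
    def __init__(self):
        self.last_seen = {}

    def heartbeat(self, datanode_id, now=None):
        self.last_seen[datanode_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]

monitor = NameNodeMonitor()
monitor.heartbeat("dn1", now=0.0)
monitor.heartbeat("dn2", now=0.0)
monitor.heartbeat("dn1", now=60.0)   # dn1 keeps reporting; dn2 falls silent
print(monitor.dead_nodes(now=60.0))  # ['dn2']
```

When a node is declared dead, the real NameNode also schedules re-replication of the blocks that node held.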

7 Hadoop Ecosystem MapReduce is the most prevalent application within the Hadoop ecosystem. However, there are many programs that utilize the Hadoop framework.
Pig: An application with a dataflow language (called Pig Latin). A Pig script translates into a directed acyclic graph (DAG) of MapReduce jobs.
Hive: Provides a SQL interface on top of MapReduce.
Oozie: A service for scheduling and running workflows of jobs from the above.
Sqoop: Used to move data efficiently between Hadoop and relational databases.
HBase: A column-oriented database that we already talked about.

8 Apache Pig Problems with MapReduce:
Users must express programs as a one-input, two-stage (map and reduce) process. MapReduce provides no methods for describing a complex dataflow that applies a sequence of transformations to the input. Pig (and Pig Latin) was created to fill this hole.
Example: We want to find the average pagerank of the popular categories of URLs.
SQL query:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000

9 Apache Pig (Part 2) Example: We want to find the average pagerank of the popular categories of URLs.
SQL query:
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 1000000
Pig Script:
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 1000000;
output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
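The dataflow the Pig script describes can be mirrored step by step in plain Python. The toy data and the shrunken group-size threshold below are purely illustrative:

```python
from collections import defaultdict

urls = [
    # (url, category, pagerank)
    ("a.com", "news", 0.9),
    ("b.com", "news", 0.5),
    ("c.com", "sports", 0.1),
    ("d.com", "sports", 0.8),
]
MIN_GROUP_SIZE = 2  # the real query uses 1000000; shrunk for the toy data

# good_urls = FILTER urls BY pagerank > 0.2;
good_urls = [u for u in urls if u[2] > 0.2]

# groups = GROUP good_urls BY category;
groups = defaultdict(list)
for url, category, pagerank in good_urls:
    groups[category].append(pagerank)

# big_groups = FILTER groups BY COUNT(good_urls) > ...;
big_groups = {c: ranks for c, ranks in groups.items()
              if len(ranks) >= MIN_GROUP_SIZE}

# output = FOREACH big_groups GENERATE category, AVG(good_urls.pagerank);
output = {c: sum(ranks) / len(ranks) for c, ranks in big_groups.items()}
print(output)  # {'news': 0.7}
```

Each assignment corresponds to one Pig operator, which is exactly the kind of step-wise dataflow that is awkward to express as a single map/reduce pair.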

10 Apache Pig (Part 3) Pig Latin supports operations on nested data structures like XML and JSON. Pig also supports user-defined functions.
Four datatypes:
Atoms: Simple atomic values like numbers or strings
Tuples: A sequence of fields (whose types can be any of the datatypes)
Bag: A collection of tuples, with duplicates allowed
Map: A collection of data items where each item has a key for direct access
Pig is a tool for enhancing the productivity of programmers using Hadoop (highly recommended).
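Rough Python analogs of the four datatypes, for orientation only (Pig's bag is unordered, and these are not Pig's actual types):

```python
atom = "apache"                       # Atom: a simple value (number or string)
tup = ("apache.org", 0.8)             # Tuple: a sequence of fields of any type
bag = [("a", 1), ("a", 1), ("b", 2)]  # Bag: a collection of tuples, duplicates allowed
mp = {"pagerank": 0.8, "hits": 1200}  # Map: items accessed directly by key

print(mp["pagerank"])       # 0.8
print(bag.count(("a", 1)))  # 2 -- duplicates are legitimate in a bag
```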

11 Apache Hive Hive is for people who want SQL for big data jobs.
The Hive query language (HiveQL) includes a subset of SQL covering all types of joins and GROUP BY operations, as well as useful functions for primitive and complex data types. Tables in Hive are linked to directories in HDFS.
Users can define horizontal partitions within tables. For example, a Web log table can be partitioned by day and, within a day, by hour. Each partition introduces a level of directories in HDFS.
A table may also be stored as bucketed on a set of columns, meaning the specific file storing a row is determined by the values of those columns. For example, within an hour of logs, the data may be bucketed by user_id; all rows for a given user_id then land in a particular file within the directory. The user can define how many files should be generated.
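The partition-plus-bucket layout described above maps each row to a specific HDFS path. A sketch with hypothetical table and helper names (Hive's real bucketing uses its own hash function; a simple modulus stands in for it here):

```python
def bucket_path(table, day, hour, user_id, num_buckets=32):
    # Partition columns (day, hour) become directory levels;
    # the bucket column (user_id) selects one of a fixed number
    # of files inside the leaf partition directory.
    bucket = hash(user_id) % num_buckets
    return (f"/user/hive/warehouse/{table}"
            f"/dt={day}/hour={hour:02d}/bucket_{bucket:05d}")

print(bucket_path("web_logs", "2013-04-01", 9, user_id=42))
```

Because the path is a pure function of the partition and bucket columns, a query that filters on them can skip every directory and file that cannot contain matching rows.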

