1
Hadoop Workshop: Basics & Hands-On. Kapil Bhosale, M.Tech (CSE), Walchand College of Engineering, Sangli. (Worked on Hadoop at Tibco.)
2
Parallel programming. Parallel programming is used to improve performance and efficiency: the processing is broken into parts that run concurrently on different machines. Supercomputers can be replaced by large clusters of CPUs, which may sit in the same machine or be spread across a network of computers.
3
Parallel programming. [Figure: a supercomputer contrasted with a cluster of desktops. Web graphic, Janet E. Ward, 2000.]
4
Map/Reduce and Hadoop. What is Map/Reduce? Map/Reduce is a programming model from Google for processing large data sets, generally used for data-intensive tasks. It is typically used for distributed computing on clusters of computers. – The model is inspired by the map and reduce functions commonly used in functional programming, although their purpose in the Map/Reduce framework is not the same as in their original forms. How are Hadoop and Map/Reduce related? – Hadoop is an implementation of the computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster.
5
What is Hadoop? Apache Hadoop is a framework for running applications on large clusters built of commodity hardware. (Why the name Hadoop?) It includes: – Map/Reduce: the computational model – HDFS: the Hadoop Distributed File System. Yahoo! is the biggest contributor. Here's what makes it especially useful: – Scalable: it can reliably store and process petabytes. – Economical: it distributes the data and processing across clusters of commonly available computers (in the thousands). – Reliable: it automatically maintains multiple copies of data and automatically redeploys computing tasks after failures. – It works on commodity hardware.
6
Open source: an Apache top-level project, implemented in Java. http://hadoop.apache.org/core/ – Core (15 committers): HDFS and MapReduce. The community of contributors is growing, though it is mostly Yahoo! for HDFS and MapReduce. You can contribute too!
7
Big Data – Wide availability of data-gathering tools – Huge networks of sensors – Social networks – The data is unstructured or semi-structured – We need to analyze the data and find useful information in it – which calls for a distributed tool.
8
What does it do? Hadoop implements Google's MapReduce on top of HDFS. MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability and places them on compute nodes around the cluster, so MapReduce can process the data where it is located. Hadoop's target is to run on clusters on the order of 10,000 nodes.
9
Hadoop: Assumptions. It is written with large clusters of computers in mind and is built around the following assumptions: – Hardware will fail. – Processing will be run in batches; thus there is an emphasis on high throughput. – Applications that run on HDFS have large data sets; a typical file in HDFS is gigabytes to terabytes in size. – Applications need a write-once-read-many access model. – Moving computation is cheaper than moving data. – Commodity hardware.
10
Apache Hadoop Wins Terabyte Sort Benchmark (July 2008). One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds in the annual general-purpose (Daytona) terabyte sort benchmark. The sort benchmark specifies the input data (10 billion 100-byte records), which must be completely sorted and written to disk. The sort used 1800 maps and 1800 reduces, with enough buffer memory allocated to hold the intermediate data in memory. The cluster had 910 nodes; 2 quad-core Xeons @ 2.0 GHz per node; 4 SATA disks per node; 8 GB RAM per node; 1 gigabit Ethernet on each node; 40 nodes per rack; 8 gigabit Ethernet uplinks from each rack to the core; Red Hat Enterprise Linux Server release 5.1 (kernel 2.6.18); Sun Java JDK 1.6.0_05-b13. Later the same year, Google recorded 68 seconds; Yahoo then used Hadoop again to bring the record down to 62 seconds.
11
Example Applications and Organizations using Hadoop. Amazon: builds Amazon's product search indices and processes millions of sessions daily for analytics, using both the Java and streaming APIs; clusters vary from 1 to 100 nodes. Yahoo!: more than 100,000 CPUs in ~20,000 computers running Hadoop; biggest cluster: 2000 nodes (2×4-CPU boxes with 4 TB of disk each); used to support research for ad systems and web search. AOL: used for a variety of things, from statistics generation to running advanced algorithms for behavioral analysis and targeting; the cluster has 50 machines, each a dual-processor, dual-core Intel Xeon with 16 GB of RAM and 800 GB of disk, giving a total of 37 TB of HDFS capacity. Facebook: stores copies of internal log and dimension data sources and uses them as a source for reporting/analytics and machine learning; 320-machine cluster with 2,560 cores and about 1.3 PB of raw storage. FOX Interactive Media: 3 × 20-machine clusters (8 cores/machine, 2 TB of storage/machine) and a 10-machine cluster (8 cores/machine, 1 TB of storage/machine); used for log analysis, data mining, and machine learning. University of Nebraska-Lincoln: one medium-sized Hadoop cluster (200 TB) to store and serve physics data.
12
Main Components of Hadoop: HDFS and MapReduce.
13
What is HDFS? The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant: – it is highly fault-tolerant and designed to be deployed on low-cost hardware; – it provides high-throughput access to application data and is suitable for applications that have large data sets; – it relaxes a few POSIX requirements to enable streaming access to file system data; – it is part of the Apache Hadoop Core project. The project URL is http://hadoop.apache.org/core/.
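As a concrete illustration of the streaming-access model, here is a minimal sketch of reading a file through Hadoop's public FileSystem API. It assumes a reachable cluster configured via core-site.xml on the classpath; the path /user/demo/input.txt and the class name HdfsRead are invented for the example.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);     // connects to the file system named by fs.defaultFS
        Path path = new Path("/user/demo/input.txt"); // hypothetical path
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each line is streamed from whichever DataNodes hold the file's blocks.
                System.out.println(line);
            }
        }
    }
}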
15
MapReduce Paradigm. A programming model developed at Google, based on sort/merge distributed computing. Initially it was intended for Google's internal search/indexing application, but it is now used extensively by other organizations (e.g., Yahoo, Amazon.com, IBM). It is functional-style programming (as in LISP) that is naturally parallelizable across a large cluster of workstations or PCs.
16
NameNode Metadata. Metadata in memory: – the entire metadata is kept in main memory – there is no demand paging of metadata. Types of metadata: – list of files – list of blocks for each file – list of DataNodes for each block – file attributes, e.g., creation time, replication factor. A transaction log records file creations, file deletions, etc.
17
DataNode. A block server: – stores data in the local file system (e.g., ext3) – stores metadata of a block (e.g., CRC) – serves data and metadata to clients. Block report: periodically sends a report of all existing blocks to the NameNode. Facilitates pipelining of data: forwards data to other specified DataNodes.
18
Block Placement. Current strategy: – one replica on the local node – second replica on a remote rack – third replica on the same remote rack – additional replicas are placed randomly. Clients read from the nearest replica. The developers would like to make this policy pluggable.
19
NameNode Failure. The NameNode is a single point of failure. The transaction log is stored in multiple directories: – a directory on the local file system – a directory on a remote file system. A real fault-aware solution still needs to be developed.
20
Hadoop Map/Reduce. The Map/Reduce programming model: – a framework for distributed processing of large data sets – pluggable user code runs in a generic framework. It follows a common design pattern in data processing: cat * | grep | sort | uniq -c | cat > file corresponds stage-for-stage to input | map | shuffle | reduce | output. Natural for: – log processing – web search indexing – ad-hoc queries.
21
How does MapReduce work? The runtime partitions the input and provides it to different Map instances: Map(key, value) → (key', value'). The runtime then collects the (key', value') pairs and distributes them to several Reduce functions, so that each Reduce function gets all pairs with the same key'. Each Reduce produces a single (or zero) file of output. Map and Reduce are user-written functions.
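In the notation of Google's original MapReduce paper, the two user functions have the types map: (k1, v1) → list(k2, v2) and reduce: (k2, list(v2)) → list(v3); the framework supplies the partitioning, distribution, and grouping in between.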
23
First Hadoop Example. The de facto example: word count. (Google originally invented the paradigm for inverted-index calculation.) Input: the quick brown fox / the fox ate the mouse / how now brown cow. Output (sorted by word): ate 1, brown 2, cow 1, fox 2, how 1, mouse 1, now 1, quick 1, the 3 – you can predict it.
24
Word Count Dataflow
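In outline (the original slides showed this as diagrams), the word-count dataflow runs: input splits → map (emit (word, 1) for each word) → shuffle and sort (group the pairs by word) → reduce (sum the 1s for each word) → output (word, total). For the sample input above, the mappers emit (the, 1) three times in total; the shuffle routes all three to one reducer, which writes (the, 3).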
29
Mapper In action
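A minimal word-count Mapper using the org.apache.hadoop.mapreduce API, as a sketch only; the class name WordCountMapper is illustrative and may differ from what the original slide showed.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The framework calls map() once per input record; for text input the
        // key is the byte offset of the line and the value is the line itself.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1)
        }
    }
}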
30
Reducer In action
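The matching Reducer, again a sketch with an illustrative class name; it sums the 1s emitted for each word.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // The framework groups all values for one key and calls reduce() once per key.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total); // emit (word, total)
    }
}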
31
Main Class In action
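A sketch of the driver ("main class") that wires the two together into a Job. It assumes the Hadoop 2 Job.getInstance API and takes input and output paths from the command line; the class names are the hypothetical ones from the sketches above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);      // locate the jar containing this class
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);       // types of the final (key, value) output
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

It would be run with something like hadoop jar wordcount.jar WordCount /input /output (jar name and paths hypothetical).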
32
Partitioning function. By default, your reduce tasks are distributed evenly using a hash(intermediate key) mod N function. You can specify a custom partitioning function. This is useful for locality reasons, such as when the key is a URL and you want all URLs belonging to a single host to be processed on a single machine, as in the sketch below.
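A sketch of such a custom partitioner for the URL case, using the org.apache.hadoop.mapreduce.Partitioner class; the key/value types and the class name HostPartitioner are assumptions for illustration.

import java.net.URI;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HostPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text url, IntWritable value, int numPartitions) {
        // Assumes keys are URLs; fall back to the whole key if no host is found.
        String host;
        try {
            host = URI.create(url.toString()).getHost();
        } catch (IllegalArgumentException e) {
            host = null;
        }
        if (host == null) {
            host = url.toString();
        }
        // Mask off the sign bit so the partition index is non-negative, then mod N.
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered in the driver with job.setPartitionerClass(HostPartitioner.class).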
33
Combiner function. After the map phase, the mapper transmits the entire intermediate data file over the network to the reducer. Sometimes this file is highly compressible. The user can specify a combiner function: it is just like a reduce function, except it is run by the mapper before the data is passed to the reducer.
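For word count, the reducer itself can double as the combiner, because integer addition is associative and commutative, so partial sums computed on the map side remain correct. A sketch of the registration, one line added to the driver sketch above (class name from that earlier illustration):

// In the driver, after setMapperClass()/setReducerClass():
job.setCombinerClass(WordCountReducer.class); // run the reducer map-side as a combiner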
36
Hadoop is critical to Yahoo!'s business: ads optimization, content optimization, search index, content feed processing. When you visit Yahoo!, you are interacting with data processed with Hadoop!
37
More Ideas for Research: – Hadoop log analysis: failure prediction and root cause analysis – Hadoop data rebalancing: based on access patterns and load – Best use of flash memory?
38
Practical Installation: – Hadoop single-node cluster formation – Hadoop multi-node cluster formation.
39
Thank you all. Any queries?
41
Map in Lisp (Scheme): (map f list [list2 list3 …]). (map square '(1 2 3 4)) → (1 4 9 16). (reduce + '(1 4 9 16)) → (+ 16 (+ 9 (+ 4 1))) → 30. Combining them: (reduce + (map square (map - l1 l2))). Here square is a unary operator and + is a binary operator.
42
MapReduce à la Google: map(key, val) is run on each item in the input set and emits new-key/new-val pairs; reduce(key, vals) is run for each unique key emitted by map() and emits the final output.