MapReduce and Hadoop
Jiaheng Lu (陆嘉恒)
Renmin University of China

Outline
– Introduction to MapReduce, a software framework for distributed computing
– Introduction to Hadoop, an open-source framework for distributed computing
– Summary

MapReduce Online Evaluation
– Solve problems by programming against the MapReduce framework
– The online judge system lets you test your own programs

MapReduce: Insight
Consider the problem of counting the number of occurrences of each word in a large collection of documents. How would you do it in parallel?

MapReduce Programming Model
Inspired by the map and reduce operations commonly used in functional programming languages like Lisp. Users implement an interface of two primary methods:
– 1. Map: (key1, val1) → (key2, val2)
– 2. Reduce: (key2, [val2]) → [val3]
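For intuition, the same map-then-reduce-per-key shape can be written with single-machine Java streams. This is only an analogy to the functional roots the slide mentions, not MapReduce itself:

    import java.util.Arrays;
    import java.util.Map;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class FunctionalAnalogy {
      public static void main(String[] args) {
        // "Map" each word to a key, then "reduce" (count) the values per key.
        Map<String, Long> counts = Arrays.stream("to be or not to be".split(" "))
            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        System.out.println(counts); // e.g. {not=1, be=2, or=1, to=2} (map order unspecified)
      }
    }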

Map operation
Map, a pure function written by the user, takes an input key/value pair, e.g. (doc-id, doc-content), and produces a set of intermediate key/value pairs. Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query.

Reduce operation
On completion of the map phase, all the intermediate values for a given intermediate key are combined into a list and given to a reducer. Reduce can be visualized as the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.

Pseudo-code

    map(String input_key, String input_value):
      // input_key: document name
      // input_value: document contents
      for each word w in input_value:
        EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
      // output_key: a word
      // output_values: a list of counts
      int result = 0;
      for each v in intermediate_values:
        result += ParseInt(v);
      Emit(AsString(result));
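For reference, here is the same computation as a Hadoop job in Java. This is essentially the classic WordCount example from the Hadoop documentation, sketched under the assumption that the Hadoop client jars are on the classpath and that input and output paths are passed as arguments:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: (offset, line) -> (word, 1) for every word in the line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: (word, [1, 1, ...]) -> (word, total count).
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }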

MapReduce: Execution overview

MapReduce: Example

MapReduce in Parallel: Example

MapReduce: Fault Tolerance
Handled via re-execution of tasks; task completion is committed through the master.
What happens if a mapper fails?
– Re-execute completed and in-progress map tasks (completed map output lives on the failed worker's local disk, so it must be regenerated)
What happens if a reducer fails?
– Re-execute in-progress reduce tasks (completed reduce output is already in the distributed file system)
What happens if the master fails?
– Potential trouble!

MapReduce: Walkthrough of One More Application

MapReduce: PageRank
PageRank models the behavior of a "random surfer". C(t) is the out-degree of page t, and (1-d) is the probability of a random jump, where d is the damping factor. The random surfer keeps clicking on successive links at random, not taking content into consideration, and each page distributes its rank equally among all pages it links to. The damping factor models the surfer "getting bored" and typing an arbitrary URL.
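The recurrence the slide describes (the formula itself did not survive the transcript) is the standard un-normalized PageRank update; writing B(p) for the set of pages that link to p:

    PR(p) = (1 - d) + d \sum_{t \in B(p)} \frac{PR(t)}{C(t)}

Some formulations instead use (1-d)/N as the first term, where N is the total number of pages, so that the ranks form a probability distribution.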

PageRank: Key Insights
– The effect of each iteration is local: the (i+1)-th iteration depends only on the i-th
– At iteration i, the PageRank of each node can be computed independently

PageRank using MapReduce
Use a sparse matrix representation (M). Map each row of M to a list of PageRank "credit" to assign to out-link neighbours. These credits are then reduced to a single PageRank value per page by aggregating over them.

PageRank using MapReduce
– Map: distribute PageRank "credit" to link targets
– Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value
– Iterate until convergence
(Source of image: Lin 2008)

Phase 1: Process HTML
The map task takes (URL, page-content) pairs and maps them to (URL, (PR_init, list-of-urls))
– PR_init is the "seed" PageRank for the URL
– list-of-urls contains all pages pointed to by the URL
The reduce task is just the identity function.

Phase 2: PageRank Distribution
The map task takes (URL, (cur_rank, url_list)) and distributes cur_rank evenly: for each u in url_list it emits (u, cur_rank/|url_list|), and it also re-emits (URL, url_list) to carry the graph structure forward.
The reduce task gets (URL, url_list) and many (URL, val) values
– Sum the vals and fix up with d to get the new PR
– Emit (URL, (new_rank, url_list))
Check for convergence using a non-parallel component.
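As a sketch only: one Phase 2 iteration simulated on a single machine in plain Java, with an in-memory map standing in for the shuffle. The graph, names, and damping constant are illustrative, and dangling pages are ignored:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class PageRankIteration {
      static final double D = 0.85; // damping factor

      // One iteration: current ranks and adjacency in, new ranks out.
      static Map<String, Double> iterate(Map<String, Double> rank,
                                         Map<String, List<String>> links) {
        // "Map" phase: each page distributes its rank equally to its out-links.
        Map<String, Double> credit = new HashMap<>();
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
          double share = rank.get(e.getKey()) / e.getValue().size();
          for (String target : e.getValue()) {
            credit.merge(target, share, Double::sum); // simulated shuffle + combine
          }
        }
        // "Reduce" phase: sum the credits and apply the damping fix-up.
        Map<String, Double> next = new HashMap<>();
        for (String page : rank.keySet()) {
          next.put(page, (1 - D) + D * credit.getOrDefault(page, 0.0));
        }
        return next;
      }

      public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> rank = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));
        for (int i = 0; i < 20; i++) {
          rank = iterate(rank, links); // iterate until (approximate) convergence
        }
        System.out.println(rank);
      }
    }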

MapReduce: Some More Apps
– Distributed grep
– Count of URL access frequency
– Clustering (k-means)
– Graph algorithms
– Indexing systems
(Figure: MapReduce programs in the Google source tree)

MapReduce: Extensions and similar apps
– Pig (Yahoo!)
– Hadoop (Apache)
– DryadLINQ (Microsoft)

Large-Scale Systems Architecture using MapReduce
(Layered stack, top to bottom:)
User App
MapReduce
Distributed File System (GFS)

– Introduction to MapReduce, a software framework for distributed computing
– Introduction to Hadoop, an open-source framework for distributed computing
– Summary

Hadoop Book
Our new book about cloud computing and Hadoop. Download chapter: ting2010/index.html

Outline
– Architecture of the Hadoop Distributed File System
– Hadoop usage at Facebook

Hadoop, Why?
– Need to process multi-petabyte datasets
– Expensive to build reliability into each application
– Nodes fail every day: failure is expected rather than exceptional, and the number of nodes in a cluster is not constant
– Need a common infrastructure: efficient, reliable, open source (Apache License)

Hadoop History
– Dec 2004: Google GFS paper published
– July 2005: Nutch uses MapReduce
– Feb 2006: Becomes a Lucene subproject
– Apr 2007: Yahoo! runs it on a 1000-node cluster
– Jan 2008: Becomes an Apache top-level project
– Jul 2008: A 4000-node test cluster
– Sept 2008: Hive becomes a Hadoop subproject

Who uses Hadoop?
Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh, Yahoo!

Commodity Hardware
– Typically a two-level architecture
– Nodes are commodity PCs
– nodes/rack
– Uplink from the rack is 3-4 gigabit
– Rack-internal is 1 gigabit

Goals of HDFS
– Very large distributed file system: 10K nodes, 100 million files, 10 PB
– Assumes commodity hardware: files are replicated to handle hardware failure; failures are detected and recovered from
– Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth
– Runs in user space on heterogeneous operating systems

Distributed File System
– Single namespace for the entire cluster
– Data coherency: write-once-read-many access model; clients can only append to existing files
– Files are broken up into blocks: typically 128 MB per block, each block replicated on multiple DataNodes
– Intelligent client: the client can find the location of blocks and accesses data directly from the DataNode
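A minimal sketch of a client doing exactly this write-once / read-many dance through the HDFS Java API; the NameNode URI and paths are hypothetical, and the Hadoop client jars are assumed:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

        // Write once...
        Path p = new Path("/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(p)) {
          out.writeBytes("hello hdfs\n");
        }

        // ...read many times; the client asks the NameNode for block
        // locations, then streams the data directly from DataNodes.
        try (BufferedReader in =
                 new BufferedReader(new InputStreamReader(fs.open(p)))) {
          System.out.println(in.readLine());
        }
      }
    }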

NameNode Metadata
– Metadata kept in memory: the entire metadata is in main memory, with no demand paging of metadata
– Types of metadata: list of files, list of blocks for each file, list of DataNodes for each block, file attributes (e.g., creation time, replication factor)
– A transaction log records file creations, file deletions, etc.

DataNode
– A block server: stores data in the local file system (e.g., ext3); stores metadata of a block (e.g., CRC); serves data and metadata to clients
– Block report: periodically sends a report of all existing blocks to the NameNode
– Facilitates pipelining of data: forwards data to other specified DataNodes

Block Placement
Current strategy:
– One replica on the local node
– Second replica on a remote rack
– Third replica on the same remote rack (touching only two racks limits cross-rack write traffic while still surviving a rack failure)
– Additional replicas are placed randomly
Clients read from the nearest replica. Would like to make this policy pluggable.

Data Correctness
– Use checksums to validate data (CRC32)
– File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksums
– File access: the client retrieves the data and checksums from the DataNode; if validation fails, the client tries other replicas
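A toy illustration of the per-512-byte checksumming idea in plain Java (java.util.zip.CRC32); HDFS's real on-disk format and validation path differ, so treat this as a sketch of the concept only:

    import java.util.zip.CRC32;

    public class ChunkChecksums {
      static final int CHUNK = 512; // bytes per checksum, as in HDFS

      // Compute one CRC32 per 512-byte chunk of the data.
      static long[] checksums(byte[] data) {
        int n = (data.length + CHUNK - 1) / CHUNK;
        long[] sums = new long[n];
        CRC32 crc = new CRC32();
        for (int i = 0; i < n; i++) {
          crc.reset();
          int off = i * CHUNK;
          crc.update(data, off, Math.min(CHUNK, data.length - off));
          sums[i] = crc.getValue();
        }
        return sums;
      }

      public static void main(String[] args) {
        byte[] block = new byte[1300]; // stand-in for part of a block
        new java.util.Random(42).nextBytes(block);
        long[] expected = checksums(block);

        block[700] ^= 1; // simulate a corrupted byte on disk
        long[] actual = checksums(block);
        for (int i = 0; i < expected.length; i++) {
          if (expected[i] != actual[i]) {
            System.out.println("chunk " + i + " failed validation; try another replica");
          }
        }
      }
    }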

NameNode Failure
– A single point of failure
– The transaction log is stored in multiple directories: one on the local file system and one on a remote file system (NFS/CIFS)

Data Pipelining
– The client retrieves a list of DataNodes on which to place replicas of a block
– The client writes the block to the first DataNode
– The first DataNode forwards the data to the next DataNode in the pipeline
– When all replicas are written, the client moves on to the next block of the file

Rebalancer
Goal: percentage of disk used should be similar across DataNodes
– Usually run when new DataNodes are added
– The cluster stays online while the Rebalancer is active
– The Rebalancer is throttled to avoid network congestion

Hadoop at Facebook
Production cluster:
– 4800 cores, 600 machines, 16 GB per machine (April 2009)
– 8000 cores, 1000 machines, 32 GB per machine (July 2009)
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB, projected to be 12 PB in Q
Test cluster: 800 cores, 16 GB each

Useful Links
– HDFS Design:
– Hadoop API:
– Hive:

Summary
– MapReduce, a software framework for distributed computing
– Hadoop, an open-source framework for distributed computing

Thank you!