Big Data Programming: an Introduction


Big Data Programming: an Introduction
Spring 2015, X. Zhang, Fordham Univ.

Outline
• What this course is about: scope
• Introduction to big data programming
• Opportunities and challenges of big data
• Origins of Hadoop
• High-level overview: HDFS, MapReduce, YARN

Learning Goals
• Understand concepts in distributed computing for big data
• Be able to develop MapReduce programs to crunch big data
• Be able to perform basic management, administration, and troubleshooting of a Hadoop cluster
• Be able to understand and use tools in the Hadoop ecosystem through self-learning (final projects/presentations)

Prerequisites
• Proficiency in C++, Java, or Python, and the ability to pick up a new language quickly
• Familiarity with Unix/Linux systems: Unix file systems, users and permissions, …
• Basic Unix commands and shell scripting: to automate running your programs and collecting results

What is Big Data?
Data sets that grow so large that they become awkward to work with using on-hand database management tools. (Wikipedia)

Where does it come from?
• New York Stock Exchange: one terabyte of new trade data per day
• Facebook: 10 billion photos, one petabyte of storage
• Data generated by machines: logs, sensor networks, GPS traces, electronic transactions, …
• Have you collected data? Network trace projects, Internet measurements, …

Multiples of Bytes: decimal prefixes
1000^1  kB  kilobyte
1000^2  MB  megabyte
1000^3  GB  gigabyte
1000^4  TB  terabyte
1000^5  PB  petabyte
1000^6  EB  exabyte
1000^7  ZB  zettabyte
1000^8  YB  yottabyte

Cost of Storage
• 1991: consumer-grade 1 gigabyte (1/1000 TB) disk drive, US$2,699
• 1995: 1 GB drive, US$849
• 2007: 1 TB hard disk, US$375
• 2010: 2 TB hard disk, US$200
• 2012: 4 TB US$450; 1 TB US$100
• 2013: 4 TB US$179; 3 TB US$129; 2 TB US$100; 1 TB US$80
• 2014: 4 TB US$150; 3 TB US$129; 2 TB US$90; 1 TB US$60

Challenges
General problem of the big data era: how to process a very large volume of data in a reasonable amount of time?
It turns out that disk bandwidth has become the bottleneck, i.e., a single hard disk cannot read data fast enough.
Solution: parallel processing.
Google's problem: crawl, analyze, and rank web pages into a giant inverted index (to support its search engine).
Google engineers went ahead and built their own systems:
• Google File System: "exabyte-scale data management using commodity hardware"
• Google MapReduce (GMR): "implementation of a design pattern applied to massively parallel processing"

Background: Inverted Index
Goal: support search queries, where we need to locate documents containing some given words and then rank those documents by relevance.
Means: create an inverted index, which stores, for each word, the list of documents containing it.
Example:
Word : Documents where the word appears
the  : Document 1, Document 3, Document 4, Document 5
cow  : Document 2, Document 3, Document 4
says : Document 5
moo  : Document 7
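An inverted index like the one above can be built in a few lines of Python. A minimal sketch; the document ids and texts are made up here to reproduce the example table:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids containing it.

    docs: dict mapping document id -> document text.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sort each posting list for deterministic output
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical documents chosen to match the example table
docs = {
    1: "the",
    2: "cow",
    3: "the cow",
    4: "the cow",
    5: "the says",
    7: "moo",
}
index = build_inverted_index(docs)
print(index["the"])  # -> [1, 3, 4, 5]
```

A real search engine would also store positions and term frequencies in each posting to support ranking, but the word-to-document mapping is the core structure.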

Hadoop History
• Originated in the Nutch project (an open source web crawler, later backed by Yahoo!): crawl and index a large number of web pages
• Idea: a distributed program, where each node processes the part of the data stored with it
• Two Google papers (on GFS and MapReduce) => the Hadoop project, an open source implementation of a distributed file system and the MapReduce framework
• Hadoop: a scheduling and resource management framework for executing map and reduce jobs in a cluster environment
• Now an open source project: Apache Hadoop
• Hadoop ecosystem: various tools that make it easier to use, e.g., Hive and Pig, which translate more abstract descriptions of a workload into map-reduce pipelines

High-level View: HDFS, MapReduce

HDFS (Hadoop Distributed File System)
• A file system running on clusters of commodity hardware
• Capable of storing very large files
• Optimized for streaming data access (i.e., sequential reads), matching the initial intent of Hadoop: large, parallel, batch-processing jobs
• Resilient to node failures through replication

HDFS as a File System
Command line operations:
hadoop fs -ls (also -mkdir, -rm, …)
hadoop fs -copyFromLocal …
hadoop fs -copyToLocal …
Java programming API: open, close, read, and write files from programs

MapReduce
• End-user MapReduce API for programming MapReduce applications
• MapReduce framework: the runtime implementation of the various phases, such as the map phase, the sort/shuffle/merge aggregation, and the reduce phase
• MapReduce system: the backend infrastructure required to run the user's MapReduce application, manage cluster resources, schedule thousands of concurrent jobs, etc.

MapReduce Programming Model
Input: a set of [key, value] pairs
Split: the input is divided among map tasks, each emitting intermediate [key, value] pairs
Shuffle: intermediate values are grouped by key, e.g., [k1: v11, v12, …], [k2: v21, v22, …]
Output: a set of [key, value] pairs produced by the reduce tasks
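The split/shuffle pipeline above can be simulated on a single machine in plain Python. A minimal sketch (the `run_mapreduce` helper and the demo map/reduce functions are illustrative, not part of any Hadoop API):

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the map -> shuffle -> reduce pipeline on one machine."""
    # Map phase: each input record yields intermediate (key, value) pairs
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))
    # Shuffle phase: sort by key, then group values sharing a key
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]
        # Reduce phase: one call per distinct key
        output.extend(reduce_fn(key, values))
    return output

# Tiny demo: sum values per key
pairs = [("a", 1), ("b", 2), ("a", 3)]
result = run_mapreduce(
    pairs,
    map_fn=lambda kv: [kv],                  # identity map
    reduce_fn=lambda k, vs: [(k, sum(vs))],  # sum per key
)
print(result)  # -> [('a', 4), ('b', 2)]
```

In real Hadoop the map and reduce calls run on different machines and the shuffle moves data over the network, but the data flow is the same.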

Word Count Example
Example: count the number of occurrences of each word in a large collection of documents.
Pseudo-code:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
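The pseudo-code above can be turned into runnable Python by simulating the shuffle with an in-memory dictionary; a single-machine sketch (the sample documents are made up):

```python
from collections import defaultdict

def map_word_count(doc_name, contents):
    """Map: emit an intermediate (word, "1") pair per word occurrence."""
    return [(word, "1") for word in contents.split()]

def reduce_word_count(word, values):
    """Reduce: sum all counts emitted for one word."""
    return word, sum(int(v) for v in values)

docs = {"doc1": "the cow says moo", "doc2": "the cow"}

# Shuffle: group intermediate values by key (the word)
grouped = defaultdict(list)
for name, text in docs.items():
    for word, count in map_word_count(name, text):
        grouped[word].append(count)

counts = dict(reduce_word_count(w, vs) for w, vs in grouped.items())
print(counts)  # -> {'the': 2, 'cow': 2, 'says': 1, 'moo': 1}
```

Emitting the string "1" and parsing it back mirrors the original pseudo-code, where intermediate values travel as serialized text between map and reduce.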

Weather Data Example
Problem: find the highest temperature for each year.
Input: a single file containing multiple years of weather data.
Output: [year, highest_temp] pairs.
Data flow: input [k, v] pairs -> intermediate [k, v] pairs -> output [k, v] pairs.
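A single-machine Python sketch of this job follows; the simple "year,temperature" record format is an assumption for illustration (real weather datasets need more involved parsing), and the sample lines are made up:

```python
from collections import defaultdict

def map_max_temp(line):
    """Map: parse one record into a (year, temperature) pair.

    Assumes a hypothetical "year,temperature" record format.
    """
    year, temp = line.strip().split(",")
    return year, int(temp)

def reduce_max_temp(year, temps):
    """Reduce: keep the highest temperature seen for the year."""
    return year, max(temps)

lines = ["1950,0", "1950,22", "1950,-11", "1949,111", "1949,78"]

# Shuffle: group temperatures by year
grouped = defaultdict(list)
for line in lines:
    year, temp = map_max_temp(line)
    grouped[year].append(temp)

highest = dict(reduce_max_temp(y, ts) for y, ts in grouped.items())
print(highest)  # -> {'1950': 22, '1949': 111}
```

Because `max` is associative, the same reduce function could also run as a combiner on each map node to shrink the data sent over the network.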

Parallel Execution: Scaling Out
A MapReduce job is a unit of work that the client/user wants performed:
• input data
• a MapReduce program
• configuration information
The Hadoop system:
• divides the job into map and reduce tasks
• divides the input into fixed-size pieces called input splits, or simply splits
• creates one map task for each split, which runs the user-defined map function on each record in the split

MapReduce and HDFS
The parallelism of MapReduce, combined with the very high aggregate I/O bandwidth that HDFS provides across a large cluster, makes the economics of the system extremely compelling; this is a key factor in the popularity of Hadoop.
Key: lack of data motion, i.e., move computation to the data rather than moving data to compute nodes over the network.
• Specifically, MapReduce tasks can be scheduled on the same physical nodes on which the data is resident in HDFS, because HDFS exposes the underlying storage layout across the cluster.
• Benefits: reduces network I/O and keeps most I/O on the local disk or within the same rack.

Hadoop 1.x
Two types of nodes control the job execution process: a jobtracker and a number of tasktrackers.
• Jobtracker: coordinates all jobs run on the system by scheduling tasks to run on tasktrackers, and keeps a record of the overall progress of each job.
• Tasktrackers: run tasks and send progress reports to the jobtracker. If a task fails, the jobtracker can reschedule it on a different tasktracker.

YARN: Yet Another Resource Negotiator
• Resource management => a global ResourceManager
• Per-node resource monitoring => NodeManager
• Job scheduling/monitoring => a per-application ApplicationMaster (AM)
Hadoop daemons are Java processes running in the background, communicating with one another via RPC; SSH is used by the cluster scripts to start daemons on the nodes.

YARN: a Master-Slave System
• The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner.
• ResourceManager: the ultimate authority that arbitrates resources among all applications in the system. Its pluggable Scheduler allocates resources to the various running applications based on their resource requirements, using the abstract notion of a resource Container, which incorporates resource elements such as memory, CPU, disk, and network.
• Per-application ApplicationMaster: negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.