Distributed and Parallel Processing Technology
Chapter 1. Meet Hadoop
Sun Jo

Data!
- We live in the data age.
  - IDC estimated the size of the "digital universe" at 0.18 ZB in 2006 and forecast a tenfold growth by 2011, to 1.8 ZB.
  - 1 ZB = 10^21 bytes = 1,000 EB = 1,000,000 PB = 1,000,000,000 TB
- The flood of data is coming from many sources:
  - The New York Stock Exchange generates 1 TB of new trade data per day.
  - Facebook hosts about 10 billion photos, taking up 1 PB (= 1,000 TB) of storage.
  - The Internet Archive stores around 2 PB and is growing at a rate of 20 TB per month.
- 'Big Data' also affects smaller organizations and individuals.
  - Digital photos and an individual's interactions (phone calls, e-mails, documents) are captured and stored for later access.
- The amount of data generated by machines will be even greater than that generated by people.
  - Machine logs, RFID readers, sensor networks, vehicle GPS traces, retail transactions

Data!
- Data can be shared for anyone to download and analyze.
  - Public Data Sets on Amazon Web Services, Infochimps.org, theinfo.org
- The Astrometry.net project
  - Watches the Astrometry group on Flickr for new photos of the night sky
  - Analyzes each image and identifies which part of the sky it is from
- The project shows what is possible when data is made available and used for something that was not anticipated by its creator.
- Big Data is here, and we are struggling to store and analyze it.

Data Storage and Analysis
- Storage capacities have increased, but access speeds haven't kept up (and writing is even slower):

                                     1990 drive      Modern drive
      Drive stores                   1,370 MB        1 TB
      Drive transfers                4.4 MB/s        100 MB/s
      Reading the full drive takes   ~5 minutes      ~2 hours 30 minutes

- Solution: read and write data in parallel to/from multiple disks (a back-of-the-envelope sketch follows this slide).
- Problems with that approach:
  - Hardware failure: keep redundant copies of the data in case of failure (replication, as in RAID).
  - Combining data: most analyses need to combine the data on one disk with data from the other disks.
- What Hadoop provides:
  - Reliable shared storage (HDFS)
  - Efficient analysis (MapReduce)
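The speed-up from parallel reads is easy to check. The Java sketch below just works through the figures from the table; the class name DiskReadEstimate and the choice of 100 drives are illustrative assumptions, not anything from the chapter.

```java
public class DiskReadEstimate {
    public static void main(String[] args) {
        // Illustrative figures from the table above: a ~1 TB drive at ~100 MB/s.
        double driveSizeMb = 1_000_000.0;   // 1 TB expressed in MB
        double transferMbPerSec = 100.0;    // sustained sequential read speed

        double singleDriveMinutes = driveSizeMb / transferMbPerSec / 60.0;
        // ~167 minutes, in the ballpark of the "2 hours 30 minutes" in the table
        System.out.printf("One drive : %.0f minutes%n", singleDriveMinutes);

        // Hypothetical: the same 1 TB striped across 100 drives read in parallel.
        int drives = 100;
        System.out.printf("%d drives: %.1f minutes%n", drives, singleDriveMinutes / drives);
    }
}
```

Reading one percent of each of 100 drives in parallel turns a multi-hour scan into a couple of minutes, which is the trade Hadoop makes: share the disks and read them together.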

Comparison with Other Systems - RDBMS
- RDBMS
  - Uses B-Tree indexes
  - Optimized for accessing and updating a small proportion of the records
- MapReduce
  - More efficient when most of a large dataset must be updated; uses Sort/Merge to rebuild the database
  - Good when the whole dataset needs to be analyzed in a batch fashion
- Structured vs. semi-structured or unstructured data
  - Structured data (a particular, predefined schema): RDBMS
  - Semi-structured or unstructured data (a looser internal structure, or none at all): MapReduce
- Normalization
  - To retain integrity and remove redundancy, relational data is often normalized.
  - MapReduce performs high-speed streaming reads and writes, so records that are not normalized are well suited to analysis with MapReduce.

Comparison with Other Systems - RDBMS
- RDBMS vs. MapReduce
  - The two kinds of systems are co-evolving.
  - Relational databases have started incorporating some of the ideas from MapReduce.
  - Higher-level query languages built on MapReduce (such as Pig and Hive) are making MapReduce systems more approachable to traditional database programmers.

Comparison with Other Systems – Grid Computing
- Grid Computing
  - The High Performance Computing (HPC) and Grid Computing communities have been doing large-scale data processing for years, using APIs such as the Message Passing Interface (MPI).
- The HPC approach
  - Distributes the work across a cluster of machines that access a shared filesystem hosted by a SAN
  - Works well for compute-intensive jobs
  - Runs into trouble when nodes need to access larger data volumes (hundreds of GB), since network bandwidth becomes the bottleneck and compute nodes sit idle
- Data locality, the heart of MapReduce
  - MapReduce co-locates the data with the compute node, so data access is fast because it is local.
- MPI vs. MapReduce
  - MPI programmers have to handle the mechanics of the data flow explicitly.
  - MapReduce programmers think in terms of functions over key and value pairs, and the data flow is implicit (see the sketch after this slide).
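To make "functions over key and value pairs" concrete, here is a minimal word-count sketch written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class names WordCount, TokenMapper, and SumReducer are illustrative, and the job-submission boilerplate is omitted.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (byte offset, line of text) -> (word, 1) for every word in the line
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);   // emit an intermediate key-value pair
                }
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total count); the framework has already
    // grouped and sorted the intermediate pairs by key before reduce() is called
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}
```

The programmer supplies only these two functions; partitioning the input, shuffling and sorting intermediate pairs, and re-running failed tasks are all handled by the framework, which is exactly the contrast with MPI the slide is drawing.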

Comparison with Other Systems – Grid Computing
- Partial failure
  - MapReduce is a shared-nothing architecture: tasks have no dependence on one another.
  - The order in which the tasks run doesn't matter, so a failed task can simply be rescheduled and re-run.
  - MPI programs, by contrast, have to manage checkpointing and recovery themselves.

Comparison with Other Systems – Volunteer Computing
- Volunteer computing projects
  - Break the problem into chunks called work units
  - Send work units to computers around the world to be analyzed
  - Results are sent back to the server when the analysis is completed, and the client gets another work unit.
- SETI@home
  - Analyzes radio telescope data for signs of intelligent life outside Earth
- SETI@home vs. MapReduce
  - SETI@home is very CPU-intensive, which makes it suitable for running on hundreds of thousands of computers across the world; volunteers donate CPU cycles, not bandwidth.
  - It runs a perpetual computation on untrusted machines on the Internet, with highly variable connection speeds and no data locality.
  - MapReduce is designed to run jobs that last minutes or hours on dedicated hardware in a single data center with very high aggregate bandwidth interconnects.

A Brief History of Hadoop
- Hadoop
  - Created by Doug Cutting, the creator of Apache Lucene, the widely used text search library
  - Has its origins in Apache Nutch, an open source web search engine that was itself part of the Lucene project
  - "Hadoop" is the name Doug's kid gave to a stuffed yellow elephant toy.
- History
  - 2002: Nutch was started. A working crawler and search system quickly emerged, but the architecture wouldn't scale to the billions of pages on the Web.
  - 2003: Google published a paper describing the architecture of its distributed filesystem, GFS.
  - 2004: The Nutch project implemented the GFS ideas as the Nutch Distributed Filesystem (NDFS).
  - 2004: Google published the paper introducing MapReduce.
  - 2005: Nutch had a working MapReduce implementation; by the middle of that year all the major Nutch algorithms had been ported to run using MapReduce and NDFS.

A Brief History of Hadoop
- History (continued)
  - Jan. 2006: Doug Cutting joined Yahoo!, which provided a dedicated team and the resources to turn Hadoop into a system that ran at web scale.
  - Feb. 2006: Hadoop was moved out of Nutch to form an independent subproject of Lucene.
  - Feb. 2008: Yahoo! announced that its production search index was being generated by a 10,000-core Hadoop cluster.
  - Apr. 2008: Hadoop broke a world record as the fastest system to sort a terabyte of data.
  - Nov. 2008: Google reported that its MapReduce implementation sorted one terabyte in 68 seconds.
  - May 2009: Yahoo! used Hadoop to sort one terabyte in 62 seconds.

Apache Hadoop and the Hadoop Ecosystem
- The Hadoop projects covered in this book are the following (a small client-side usage sketch follows this list):
  - Common – a set of components and interfaces for distributed filesystems and general I/O
  - Avro – a serialization system for RPC and persistent data storage
  - MapReduce – a distributed data processing model and execution environment
  - HDFS – a distributed filesystem running on large clusters of machines
  - Pig – a data flow language and execution environment for large datasets
  - Hive – a distributed data warehouse providing a SQL-like query language
  - HBase – a distributed, column-oriented database
  - ZooKeeper – a distributed, highly available coordination service
  - Sqoop – a tool for efficiently moving data between relational databases and HDFS
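As a small taste of how the Common and HDFS layers look from a client program, here is a sketch that prints the contents of a file stored in HDFS using the standard org.apache.hadoop.fs.FileSystem API. The class name HdfsCat and the command-line path are illustrative assumptions; the API calls themselves (Configuration, FileSystem.get, FileSystem.open, IOUtils.copyBytes) are the ordinary public Hadoop interfaces.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml etc. from the classpath
        FileSystem fs = FileSystem.get(conf);       // the default filesystem (HDFS on a configured cluster)
        Path path = new Path(args[0]);              // e.g. /user/demo/sample.txt (hypothetical path)

        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);   // stream the file's bytes to stdout
        }
    }
}
```

On a configured cluster this would typically be packaged into a jar and launched with the hadoop command; the same code also runs against the local filesystem when no HDFS is configured, which is part of the point of the Common filesystem abstraction.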