O’Reilly – Hadoop: The Definitive Guide Ch.1 Meet Hadoop May 28 th, 2010 Taewhi Lee.


Outline
• Data!
• Data Storage and Analysis
• Comparison with Other Systems
  – RDBMS
  – Grid Computing
  – Volunteer Computing
• The Apache Hadoop Project

‘Digital Universe’ Nears a Zettabyte
• Digital Universe: the total amount of data stored in the world’s computers
• Zettabyte: 10^21 bytes >> Exabyte >> Petabyte >> Terabyte
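The unit ladder on this slide can be checked with a little arithmetic. The sketch below assumes decimal (SI) prefixes, where each step up is a factor of 1000; the names `UNITS` and `convert` are illustrative and do not come from the slides.

```python
# Decimal (SI) byte units; each step up the ladder is a factor of 1000.
UNITS = {
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
    "zettabyte": 10**21,
}

def convert(amount, from_unit, to_unit):
    """Convert an amount of data between the byte units defined above."""
    return amount * UNITS[from_unit] / UNITS[to_unit]

# Express one zettabyte in each of the smaller units.
for unit in ("exabyte", "petabyte", "terabyte"):
    print(f"1 zettabyte = {convert(1, 'zettabyte', unit):,.0f} {unit}s")
```

Under these assumptions, one zettabyte works out to a thousand exabytes, a million petabytes, or a billion terabytes.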

Flood of Data
• The NYSE generates 1 TB of new trade data per day
• Facebook hosts 10 billion photos (1 petabyte)
• The Internet Archive stores 2 petabytes of data

Individuals’ Data are Growing Apace
• It is becoming easier to take more and more photos
• Microsoft Research’s MyLifeBits Project (figure: LifeLog, my life in a terabyte)

Amount of Public Data Increases
• Available Public Data Sets on AWS
  – Annotated Human Genome
  – Public database of chemical structures
  – Various census data and labor statistics

Large Data!
• How to store and analyze large data?
• “More data usually beats better algorithms”

Current HDD
• Capacity: 1 TB
• Transfer rate: 100 MB/s
• How long does it take to read all the data off the disk?
• How about using multiple disks?
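The question on this slide can be answered with a short calculation. The sketch below uses the slide's figures (1 TB capacity, 100 MB/s transfer rate) with decimal units. The 100-disk case is an assumption added for illustration: the data is split evenly and all disks are read in parallel with no other overhead. The function name `read_time_seconds` is hypothetical.

```python
TB = 10**12  # bytes, decimal units
MB = 10**6   # bytes, decimal units

def read_time_seconds(capacity_bytes, transfer_rate_bytes_per_s, num_disks=1):
    """Time to read a dataset split evenly across num_disks drives
    that are all read in parallel (idealized: no seek or network cost)."""
    per_disk_bytes = capacity_bytes / num_disks
    return per_disk_bytes / transfer_rate_bytes_per_s

single = read_time_seconds(1 * TB, 100 * MB)
parallel = read_time_seconds(1 * TB, 100 * MB, num_disks=100)  # assumed disk count

print(f"1 disk:    {single:.0f} s (about {single / 3600:.1f} hours)")
print(f"100 disks: {parallel:.0f} s")
```

With these numbers a single disk needs 10,000 seconds, a little under three hours, while 100 disks working in parallel would need about 100 seconds in the idealized case. That gap is the motivation for spreading data over many disks.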

Problems with Multiple Disks
• Hardware failure
• Most analysis tasks need to combine data that is distributed across disks
• What Hadoop provides
  – Reliable shared storage (HDFS)
  – Reliable analysis system (MapReduce)
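To make the "combine the distributed data" point concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps. It is not the Hadoop API and shows nothing about distribution or fault tolerance. All function names and the word-count example are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, collecting (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle(pairs):
    """Group values by key, the step that brings related data together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in grouped.items()}

# Hypothetical job: count words across input lines.
def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]

def count_reducer(key, values):
    return sum(values)

lines = ["Hadoop stores data", "Hadoop analyzes data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts)
```

In a real cluster the input records would live on different disks and the mappers would run near the data. The shuffle step is where values for the same key, wherever they were produced, are brought to one reducer.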

RDBMS
(Comparison table not preserved in the transcript; its footnotes follow.)
* Low latency for point queries or updates
** Update times of a relatively small amount of data

Grid Computing
• Shared storage (SAN)
• Works well for predominantly CPU-intensive jobs
• Becomes a problem when nodes need to access large data volumes

Volunteer Computing
• Volunteers donate CPU time from their idle computers
• Work units are sent to computers around the world
• Suitable for very CPU-intensive work with small data sets
• Risky because work runs on untrusted machines

Brief History of Hadoop
• Created by Doug Cutting
• Originated in Apache Nutch (2002)
  – An open source web search engine, itself a part of the Lucene project
• NDFS (Nutch Distributed File System, 2004)
• MapReduce (2005)
• Doug Cutting joins Yahoo! (Jan 2006)
• Official start of the Apache Hadoop project (Feb 2006)
• Adoption of Hadoop by the Yahoo! Grid team (Feb 2006)

The Apache Hadoop Project
• Pig, Chukwa, Hive, HBase
• MapReduce, HDFS, ZooKeeper
• Core, Avro