Hadoop & Neptune Feb. 2009 김형준.

Presentation transcript:

Hadoop & Neptune, Feb. 2009, 김형준

The Data 'Tsunami'

More CPU, Faster Disk, Program Tuning, More Memory

Uninstall

Where? Distributed File System. How? Distributed/Parallel Computing.

Hadoop DFS: unlimited storage; no backup, self-healing; thousands of nodes. But: no POSIX, no random write.

Hadoop cluster architecture (diagram; the legend distinguishes machines from daemon processes):
Master daemons: NameNode (DFS master), JobTracker (job master), Secondary NameNode.
Slave machines (three shown), each with: DataNode (DFS slave), TaskTracker (task management), local disk.
Client API: exchanges control and data traffic with the cluster.

Hadoop MapReduce: 1 TB group by -> 10 minutes. More machines -> 1 minute.

MapReduce programming model:
map (k1, v1) → list(k2, v2)
reduce (k2, list(v2)) → result value
Word-count example:
Input: This is a book. That book is on the desk. I like that book.
map() output: (This, 1) (book, 1) (That, 1) (book, 1) … (I, 1) (that, 1) (book, 1) …
reduce() input: (book, [1,1,1]) … (is, [1,1]) … (This, [1])
reduce() output: (book, 3) … (is, 2) … (This, 1)
Execution: a distributed/parallel Map&Reduce execution platform, which handles split, partition, merge, and sort.
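The word-count example on this slide can be sketched in a few lines of Python. This is a single-process illustration of the map and reduce signatures only, not Hadoop's API: `run_mapreduce`, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping step stands in for the split, partition, merge, and sort work that the real platform performs across machines.

```python
from collections import defaultdict


def map_fn(_key, line):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line."""
    return [(word.strip(".,"), 1) for word in line.split()]


def reduce_fn(_word, counts):
    """reduce(k2, list(v2)) -> result value: sum the counts for one word."""
    return sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy single-process driver (an assumption, not the Hadoop runtime)."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)          # group values by intermediate key
    # Sorting the keys mimics the sorted order in which reducers see them.
    return {k2: reducer(k2, values) for k2, values in sorted(groups.items())}


lines = ["This is a book.", "That book is on the desk.", "I like that book."]
word_counts = run_mapreduce(enumerate(lines), map_fn, reduce_fn)
```

As on the slide, counting is case-sensitive, so "That" and "that" are separate keys.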

Hadoop cluster architecture (same diagram as the earlier slide): NameNode (DFS master), JobTracker (job master), and Secondary NameNode; slave machines each with a DataNode (DFS slave), a TaskTracker (task management), and a local disk; Client API exchanging control and data traffic.

A piece of Cake

Neptune: a database running on DFS (Hadoop). Unlimited structured data; no backup. But: no JOIN, no SQL, no multiple-row operations, no aggregation functions.

Operations: create/drop table; put/get; like/between; scan and merge scan (join); MapReduce.
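To make these operations concrete, here is a minimal in-memory sketch of a table kept sorted by row key. It is not the Neptune API: `ToyTable`, its method names, and the inclusive bounds of `between` are all assumptions made for illustration, and `like` is left out because the slide does not say how it matches. The point is that once rows are sorted by key, a join on row key can be done as a merge of two scans.

```python
import bisect


class ToyTable:
    """Hypothetical stand-in for a row-key-sorted table (not the Neptune API)."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def between(self, start, end):
        """Yield rows with start <= key <= end (inclusive bounds are assumed)."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

    def scan(self):
        """Yield every row in row-key order."""
        for key in self._keys:
            yield key, self._rows[key]


def merge_scan(left, right):
    """Join two tables on row key by advancing two sorted scans in step."""
    left_it, right_it = left.scan(), right.scan()
    l, r = next(left_it, None), next(right_it, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield l[0], l[1], r[1]
            l, r = next(left_it, None), next(right_it, None)
        elif l[0] < r[0]:
            l = next(left_it, None)
        else:
            r = next(right_it, None)


users, orders = ToyTable(), ToyTable()
for key, name in [("u3", "Park"), ("u1", "Kim"), ("u2", "Lee")]:
    users.put(key, name)
orders.put("u1", 3)
orders.put("u3", 7)

joined = list(merge_scan(users, orders))
in_range = list(users.between("u2", "u3"))
```

Each scan is read once from start to end, so the join needs no random access into either table.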

Why Neptune? Distributed/parallel processing: speed, scalability.
Diagram: TableA is split into tablets (Tablet A-1, Tablet A-2, Tablet A-3, … Tablet A-N). The user makes Map&Reduce functions and runs them on the Map&Reduce framework. The tablet list is read from the META table, and the JobTracker assigns tasks to each node: TaskTrackers run the Map tasks and the Reduce tasks. TableB (Tablet B-1, Tablet B-2) is shown next to the Reduce tasks.
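The flow in this diagram (look up tablets in a META table, run one map task per tablet, then reduce) can be sketched as follows. Everything here is hypothetical: the `META` layout, the end-key convention, and the function names are assumptions, and a thread pool stands in for TaskTrackers on separate machines.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

# Hypothetical META table: (end_row_key, tablet_name), sorted by end key.
# A tablet is assumed to hold every row key up to and including its end key.
META = [("g", "TabletA-1"), ("n", "TabletA-2"), ("t", "TabletA-3"),
        ("\uffff", "TabletA-N")]


def locate_tablet(row_key):
    """Find the tablet whose key range contains row_key."""
    end_keys = [end for end, _ in META]
    return META[bisect.bisect_left(end_keys, row_key)][1]


def split_by_tablet(row_keys):
    """Group row keys by the tablet that would serve them."""
    tablets = {name: [] for _, name in META}
    for key in row_keys:
        tablets[locate_tablet(key)].append(key)
    return tablets


def map_task(tablet_rows):
    """One map task per tablet; here it only counts the tablet's rows."""
    return len(tablet_rows)


def run_job(row_keys, workers=3):
    """Run the map tasks in parallel, then reduce by summing partial counts."""
    tablets = split_by_tablet(row_keys)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_task, tablets.values()))
    return sum(partial_counts)


total_rows = run_job(["apple", "hadoop", "neptune", "zebra"])
```

Because each tablet covers a disjoint key range, the map tasks share no state, which is what lets the work spread across nodes.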

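Neptune architecture (diagram, from top to bottom):
User application
Neptune (large-scale distributed data store): Neptune Master (two shown), TabletServer #1, TabletServer #2, … TabletServer #n; logical tables
Distributed/parallel computing platform (Hadoop); Cluster Management System
Distributed file system (Hadoop or other): physical storage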

When to use Neptune: large data; online put/get and analysis; less complex. E.g. Google Personalized Search, Google Analytics.

Finding developer

Principle of Google Infra
Cheap Hardware and Smart Software: Use cheap commodity hardware, which means frequent failure. Develop smart software for reducing the cost of failure.
Easy Management: High scalability by automatic discovery of new servers and racks. High redundancy for failure of servers, racks, even data centers.
Speed and Then More Speed: High speed with low cost. Rapid development and deployment of new products.
Use existing technologies: Use techniques from the leading edge of computer science. Use open source code as a starting point.

Google Infra (diagram of Google's software stack, labeled as Google's core competency):
Applications: batch applications, online services
Client API; Bigtable; Map & Reduce; GFS; Chubby; Cluster Mgmt
Google Linux
Hardware: low-end commodity servers, 40 or more pizza-box servers per rack

Q&A