The Exponential Growth of Data: Challenges for Google, Yahoo, Amazon & Microsoft in Web Search and Indexing


The exponential growth of data: challenges for Google, Yahoo, Amazon & Microsoft in web search and indexing
– The volume of data being made publicly available increases every year. Success in the future will be dictated to a large extent by the ability to extract value from data, including other organizations' data.
– The three dimensions of big data (the "3 Vs"): Volume, Velocity and Variety.

Data storage & analysis
– The storage capacity of hard drives has increased enormously, but access speeds have not kept up.
– A 1 TB drive is now the norm, but its transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off a single disk. Reading web-scale data this way would take far too long.
– Alternative solution: read from multiple disks in parallel.
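The arithmetic above can be checked quickly. The 1 TB size and 100 MB/s rate are the slide's figures; the 100-disk case is my own illustration of the parallel-read argument:

```python
# Figures from the slide: a 1 TB drive read sequentially at ~100 MB/s.
TB = 10**12          # bytes
MB = 10**6           # bytes

disk_size = 1 * TB
transfer_rate = 100 * MB   # bytes per second

def read_time_hours(total_bytes, rate, disks=1):
    """Time to read total_bytes when the data is striped evenly across `disks` drives."""
    return total_bytes / (rate * disks) / 3600

one_disk = read_time_hours(disk_size, transfer_rate)
hundred_disks = read_time_hours(disk_size, transfer_rate, disks=100)

print(f"1 disk:    {one_disk:.2f} hours")          # a bit under 2.8 hours
print(f"100 disks: {hundred_disks * 60:.2f} minutes")
```

Striping the same terabyte across 100 disks cuts the read to under two minutes, which is exactly the motivation for a distributed file system.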

Data storage & analysis (continued)
– Problems with reading from and writing to multiple disks: with many pieces of hardware, some component is likely to fail, so the probability of data loss is high.
– Solution to data loss: replication. This is the idea behind RAID, which relies on redundant copies.
– Analysis also needs to combine data drawn from many disks, which adds further challenges.
– What is needed: a reliable, shared storage and analysis system.

Hello, Hadoop!
– Doug Cutting's Nutch project (an open-source web search engine) needed exactly this.
– Google published papers on GFS and MapReduce: distributed data storage and distributed processing.
– Nutch adopted these ideas, and development continued at Yahoo!, where Doug Cutting worked.
– The result became Apache Hadoop, an open-source framework.
– "Hadoop" is a made-up name (the name of Cutting's son's toy elephant).
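A back-of-envelope sketch of why replication protects against data loss. The per-disk failure probability is an assumed figure for illustration, not from the slide:

```python
# If each disk independently fails during some time window with probability p,
# a block stored on r disks is lost only if all r copies fail: probability p**r.
p = 0.01   # assumed per-disk failure probability within the window

for r in (1, 2, 3):   # r = replication factor; HDFS defaults to 3 copies
    print(f"replicas={r}: P(block lost) = {p**r:.0e}")
```

With three replicas the loss probability drops from 1 in 100 to 1 in a million, under the (idealized) independence assumption; real systems also place replicas on different racks to avoid correlated failures.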

Hadoop vs. RDBMS

Hadoop:
– Best fit for ad-hoc analysis; data is written once and read many times
– Handles a variety of data, at petabyte scale, with batch analysis
– Dynamic schema; data locality; data flow is implicit
– Shared-nothing architecture; scales out on commodity hardware
– Works on key/value pairs

RDBMS:
– Good for low-latency access to organized, structured data
– Gigabytes of data; interactive as well as batch queries
– Static schema; table structure
– Scaling up is expensive

Hadoop vs. HPC, Grid & Volunteer Computing:
– HPC and Grid systems distribute work across a cluster, but for data-intensive applications network bandwidth becomes the bottleneck and compute nodes sit idle waiting for data.
– MPI (Message Passing Interface) offers flexibility but pushes the complexity of managing data flow onto the programmer.
– Volunteer computing: volunteers donate CPU cycles, not bandwidth; the computers are untrusted; and there is no data locality.

HDFS design
– Runs on top of the existing native file system
– Streaming data-access pattern: write once, read many times
– Very large files, stored on commodity hardware
– Optimized for high throughput rather than low latency
Not a good fit for:
– Lots of small files
– Low-latency data access
– Multiple writers or arbitrary file modifications

MapReduce
1) Map 2) Reduce 3) user code for the MR job
4) Automatic parallelization 5) Fault tolerance
– Jobs can be written in Java, Python, etc.
– The housekeeping (scheduling, sort/merge, re-execution on failure) is built into the framework.
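The Map/Reduce steps above can be sketched as a single-process word count. This is an illustrative emulation, not the actual Hadoop API; in a real job the framework performs the sort/merge between the two phases across machines:

```python
from itertools import groupby

def map_fn(line):
    # Map phase: emit a (key, value) pair for every word seen.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: combine all values that share a key.
    yield (word, sum(counts))

def run_job(lines):
    # The "framework": run map, then emulate the shuffle/sort, then run reduce.
    mapped = [kv for line in lines for kv in map_fn(line)]
    mapped.sort(key=lambda kv: kv[0])                 # shuffle/sort phase
    out = {}
    for word, group in groupby(mapped, key=lambda kv: kv[0]):
        for k, v in reduce_fn(word, (count for _, count in group)):
            out[k] = v
    return out

print(run_job(["Hadoop stores data", "Hadoop processes data"]))
# e.g. 'hadoop' maps to 2, 'data' maps to 2
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_job` corresponds to the housekeeping that Hadoop provides automatically, including parallelization and fault tolerance.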

HDFS architecture
– HDFS block size: 64 MB (128 MB in later versions). Why is it so large? To keep the cost of seeks small relative to the time spent transferring data.
– NameNode: holds the filesystem namespace and the mapping of blocks to DataNodes.
– Secondary NameNode: periodically merges the namespace image with the edit log (it is not a hot standby).
– Client: contacts the NameNode for metadata, then reads and writes data directly to DataNodes.
– DataNodes: store the blocks and send heartbeats and block reports to the NameNode, which uses them to drive block replication and balancing.
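A sketch of the standard argument for the large block size. The seek time and the 1% overhead target are assumed figures in the spirit of the "Definitive Guide" discussion, not from the slide:

```python
# To make seek overhead a small fraction of read time, the block must take
# much longer to transfer than it takes to seek to.
seek_time = 0.010             # assumed: ~10 ms per seek
transfer_rate = 100 * 10**6   # assumed: 100 MB/s sustained transfer
target_overhead = 0.01        # goal: seeks cost no more than ~1% of read time

# The block must take at least seek_time / target_overhead seconds to read,
# so its minimum size is that duration times the transfer rate.
min_block_bytes = transfer_rate * seek_time / target_overhead
print(f"block size >= {min_block_bytes / 10**6:.0f} MB")
```

With these figures the block needs to be around 100 MB, which is why HDFS defaults are in the 64-128 MB range rather than the few kilobytes typical of local file systems.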