Big Data & Hadoop By Mr. Nataraj

Units of digital storage (the smallest unit is the bit):
1 byte = 8 bits
1 KB (Kilobyte) = 1024 bytes = 1024 * 8 bits
1 MB (Megabyte) = 1024 KB = (1024)^2 * 8 bits
1 GB (Gigabyte) = 1024 MB = (1024)^3 * 8 bits
1 TB (Terabyte) = 1024 GB = (1024)^4 * 8 bits
1 PB (Petabyte) = 1024 TB = (1024)^5 * 8 bits
1 EB (Exabyte) = 1024 PB = (1024)^6 * 8 bits
1 ZB (Zettabyte) = 1024 EB = (1024)^7 * 8 bits
1 YB (Yottabyte) = 1024 ZB = (1024)^8 * 8 bits
1 XB ("Xenottabyte", an informal name) = 1024 YB = (1024)^9 * 8 bits
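Since each unit is just the previous one multiplied by 1024 (2^10), these sizes are easy to compute with bit shifts. A minimal Java sketch (class and variable names are illustrative, not from the slides):

```java
public class StorageUnits {
    public static void main(String[] args) {
        // Each unit is 2^10 = 1024 times the previous one,
        // so 1 KB = 1 << 10 bytes, 1 MB = 1 << 20 bytes, and so on.
        long kb = 1L << 10, mb = 1L << 20, gb = 1L << 30;
        long tb = 1L << 40, pb = 1L << 50;
        System.out.println("1 KB = " + kb + " bytes = " + (kb * 8) + " bits");
        System.out.println("1 GB = " + gb + " bytes = " + (gb * 8) + " bits");
        System.out.println("1 PB = " + pb + " bytes");
        // 1 TB expressed in 64 MB HDFS-style blocks (see the Block slide below):
        System.out.println("64 MB blocks in 1 TB: " + (tb / (64 * mb))); // 16384
    }
}
```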

 1 byte = a single character
 1 KB = a very short story
 1 MB = a small novel (6 seconds of TV-quality video)
 1 GB = a pickup truck filled with paper
 1 TB = trees made into paper
 2 PB = all US academic research libraries
 5 EB = all words ever spoken by human beings

WHAT IS BIG DATA

SOME INTERESTING FACTS
Google: about 2,000,000 queries per second
Facebook: likes per minute
Online shopping worth USD 300,000 per minute
100,000 tweets on Twitter per minute
600 new videos uploaded to YouTube per minute
Barack Obama's campaign used Big Data in his election win
Driverless cars use Big Data processing to drive

AT&T transfers about 30 petabytes of data through its networks each day.
Google processed about 24 petabytes of data per day in 2009.
The 2009 movie Avatar is reported to have taken over 1 petabyte of local storage at Weta Digital for the rendering of its 3D CGI effects.

As of January 2013, Facebook users had uploaded over 240 billion photos, with 350 million new photos arriving every day. For each uploaded photo, Facebook generates and stores four images of different sizes, which translated to a total of 960 billion images and an estimated 357 petabytes of storage.

Processing capability:
Google processes 20 PB a day
Facebook holds 2.5 PB of user data, growing by 15 TB/day
eBay holds 6.5 PB of data, growing by 50 TB/day

Doug Cutting, while working on the Lucene project (a search engine for searching documents), ran into problems of storage and computation and was looking for distributed processing. Google published a paper on GFS (the Google File System). Doug Cutting and Michael Cafarella implemented the ideas from GFS to come out with Hadoop.

WHAT IS HADOOP
A framework written in Java for running applications on large clusters of commodity hardware. It mainly contains 2 parts:
– HDFS for storing data
– Map-Reduce for processing data
It maintains fault tolerance using a replication factor.

employee.txt (eno, ename, empAge, empSal, empDes)
101,prasad,20,1000,lead

Assume you have tens of billions of such records and you would like to find all the employees above 60 years of age. How would you program that traditionally? A single machine reading at roughly 1 GB per minute gives:
10 GB = 10 minutes
1 TB = about 1,000 minutes = roughly 17 hours
Google processes 20 PB of data per day; at 1 GB per minute a single machine would need about 21 million minutes, roughly 350,000 hours (about 40 years), just to read that much. A sketch of this traditional approach follows below.

INSPIRATION FOR HADOOP
To store huge data (virtually unlimited)
To process huge data
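Back to the employee example: the traditional approach is a sequential scan of employee.txt in plain Java. Below is a minimal sketch (the file name and comma-separated layout follow the example above). However fast the code, it is bounded by one disk's read throughput, which is the bottleneck the arithmetic above describes:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class SeniorEmployeeScan {
    public static void main(String[] args) throws IOException {
        // Reads employee.txt line by line: eno,ename,empAge,empSal,empDes
        try (BufferedReader in = new BufferedReader(new FileReader("employee.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");
                // Skip malformed lines; keep employees older than 60.
                if (f.length == 5 && f[2].trim().matches("\\d+")
                        && Integer.parseInt(f[2].trim()) > 60) {
                    System.out.println(f[1]); // the employee's name
                }
            }
        }
    }
}
```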

Node: a single computer with its own processor and memory.
Cluster: a combination of nodes working as a single unit.
Commodity Hardware: cheap, not fully reliable hardware.
Replication Factor: the number of copies of each piece of data, saved in more than one place.
Data-Local Optimization: data is processed on the node where it is stored.

Block: a part of a file stored as a single unit. For example, a single 200 MB file can be split into four 50 MB blocks spread across node1, node2, and node3.
Block size: the size of data that can be stored as a single unit. In Apache Hadoop the default is 64 MB (configurable), so 1 GB in Apache Hadoop = 16 blocks, and a 65 MB file takes two blocks: 64 MB + 1 MB.
Replication: duplicating the data; the default replication factor is 3.
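The block arithmetic above can be sanity-checked with a few lines of plain Java. This is only a sketch of the math, not Hadoop API; HDFS performs the equivalent splitting internally:

```java
public class BlockMath {
    // Ceiling division: how many fixed-size blocks a file needs.
    static long blocksNeeded(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // Apache Hadoop default: 64 MB
        long oneGb     = 1024L * 1024 * 1024;
        System.out.println(blocksNeeded(oneGb, blockSize));              // 16 blocks
        System.out.println(blocksNeeded(65L * 1024 * 1024, blockSize));  // 2 blocks: 64 MB + 1 MB
        // With a replication factor of 3, each block is stored 3 times:
        System.out.println(3 * blocksNeeded(oneGb, blockSize) + " block copies for 1 GB");
    }
}
```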

SCALING
Vertical Scaling: adding more powerful hardware to an existing system. Scales only up to a certain limit.
Horizontal Scaling: adding completely new nodes to an existing cluster. Scales out to many nodes.

3 V's of Big Data
Volume: the amount of data generated
Variety: structured data (e.g. database tables) as well as unstructured data
Velocity: the frequency at which data is generated

1. Hadoop believes in scaling out instead of scaling up: when needed, buy more oxen instead of growing a more powerful ox.
2. Hadoop works on structured as well as unstructured data; an RDBMS only works with structured data. (However, nowadays many NoSQL databases have come out in the market, like MongoDB and Couchbase.)
3. Hadoop believes in key-value pairs rather than data in columns, as the sketch below shows.
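The key-value point is easiest to see in code. Below is a hedged sketch of a map task written against Hadoop's MapReduce API, redoing the employee example from earlier: each input line arrives at the mapper as a (byte offset, line) key-value pair, and the mapper emits (name, age) pairs for employees over 60. Hadoop runs one such mapper per block, so the sequential scan from the earlier slide is parallelized across the cluster. The class name and the 5-field record layout are assumptions carried over from the example above:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key: byte offset of the line; input value: the line itself.
// Output: (employee name, age) for every employee older than 60.
public class SeniorEmployeeMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] f = line.toString().split(","); // eno,ename,empAge,empSal,empDes
        if (f.length == 5 && f[2].trim().matches("\\d+")) {
            int age = Integer.parseInt(f[2].trim());
            if (age > 60) {
                context.write(new Text(f[1]), new IntWritable(age));
            }
        }
    }
}
```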

No doubt Hadoop is a framework for processing big data, but it is not the only framework to do so. Below are a few alternatives.
 Apache Spark
 GraphLab
 HPCC Systems (High-Performance Computing Cluster)
 Dryad
 Stratosphere
 Storm
 R3
 Disco
 Phoenix
 Plasma

DOWNLOADING HADOOP
You can download Hadoop from the Apache Hadoop releases page.
· 18 November, 2014: Release available
· 27 June, 2014: Release available
· 1 Aug, 2013: Release (stable) available

HADOOP
1. Storing Huge Data
2. Processing Huge Data

Hadoop Daemons
1. Name Node
2. Secondary Name Node
3. Job Tracker
4. Task Tracker
5. Data Node

Modes in Hadoop
1. Standalone Mode
2. Pseudo-Distributed Mode
3. Fully Distributed Mode

Standalone Mode
1. It is the default mode; 1 node
2. No separate daemon processes run; everything runs in a single JVM
3. Used for small development tasks, testing, and debugging

Pseudo-Distributed Mode
1. A single node, but a cluster is simulated
2. Daemons run as separate processes, in separate JVMs
3. Used for development and debugging (see the client sketch below)
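As a sketch of what pseudo-distributed mode looks like from client code: with all daemons on one machine, a Java client can point fs.defaultFS at the local HDFS daemon and list the filesystem root. The property normally lives in core-site.xml rather than in code, port 9000 is a common convention rather than a fixed value, and the hadoop-client libraries are assumed to be on the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // In pseudo-distributed mode every daemon runs on this one machine.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);
        // List whatever sits at the HDFS root to confirm the cluster is up.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}
```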

Fully Distributed Mode
1. Multiple nodes
2. Hadoop runs on a cluster of machines/nodes
3. Used in production environments

Hadoop Architecture

 Hive  Pig  Scoop  Avro  Flume  Oozie  HBase  Cassandra

Job Tracker

Job Tracker contd..

Job Tracker contd..

HDFS write