
Hadoop Introduction

Audience
Introduction of students:
– Name
– Years of experience
– Background
– Do you know Java?
– Do you know Linux?
– Any exposure to Data Warehousing?

Big Data History
Google:
– Started with MySQL for its search engine (scalability was a major issue)
– Developed its own solutions from scratch:
  Distributed file system – GFS
  Distributed processing – MapReduce
  BigTable

Big Data Characteristics – Volume, Variety and Velocity
Batch – Hadoop with MapReduce
– GFS -> HDFS
– MapReduce -> Hadoop MapReduce
Operational but not Transactional – NoSQL
– Google BigTable -> HBase

Characteristics
Volume – Huge amounts of data
Variety – Structured and semi-structured data
Velocity – Speed at which data needs to be processed

Hadoop Core Components
HDFS – Storage
YARN/MapReduce – Processing

Oracle Architecture (diagram): database servers connected to shared storage through redundant network switches (interconnect).

Hadoop Architecture (diagram): a metadata master, a helper node and a processing master, with worker nodes providing both storage and processing.

Hadoop Architecture (diagram): HDFS for storage, MapReduce for processing.

HDFS daemons (diagram): NameNode, Secondary NameNode and DataNodes, with MapReduce running on top.

Processing
– MRv1/Classic
– MRv2/YARN
* We will look into the details later

Typical Hadoop Cluster (diagram): racks of worker nodes, each running HDFS and YARN daemons, connected by network switches.

Typical Hadoop Cluster (diagram): master daemons – NameNode (NN), ResourceManager (RM) and Secondary NameNode (SNN) – with a DataNode (DN) and NodeManager (NM) on every worker node, all connected by network switches.

Hadoop Ecosystem (diagram): the Hadoop core components are the Distributed File System (HDFS) and MapReduce; ecosystem tools include Hive, Pig, Flume, Sqoop, Oozie, Mahout and HBase; Impala and Presto are non-MapReduce engines.

Differences between Oracle and Hadoop
– Oracle architecture
– Hadoop architecture
– Theoretical differences

Oracle Architecture (diagram): database servers connected to shared storage through redundant network switches (interconnect).

Oracle Architecture
Servers
– Cluster of servers with the same binaries and configuration
– All of them run the same background processes
Storage
– NFS (Network File System)
– Mounted on all the database servers
Software
– Binaries are installed on all the servers
Parameter files
– init.ora, pfile, spfile, etc.
Background processes (same on all servers)
– smon
– pmon
– etc.
Memory
– Same amount of memory on all the nodes in the cluster
Memory structures
– PGA
– SGA
  – Shared pool (cache of code such as SQL execution plans)
  – Database buffer cache
Network
– Typically three network switches: one for the interconnect between nodes, one to connect to storage, and one for public connectivity

Hadoop Architecture (diagram): a metadata master, a helper node and a processing master, with worker nodes providing both storage and processing.

Hadoop Architecture
Storage -> Hadoop Distributed File System (HDFS)
Processing
– MapReduce (the majority of Hadoop ecosystem tools use MapReduce for processing)
– Non-MapReduce

Hadoop Ecosystem (diagram): the Hadoop core components are the Distributed File System (HDFS) and MapReduce; ecosystem tools include Hive, Pig, Flume, Sqoop, Oozie, Mahout and HBase; Impala, Presto and Spark are non-MapReduce engines.

Hadoop Distributions and Hadoop Appliances (diagram): distributions such as Cloudera, Hortonworks, MapR and many more bundle Hadoop, Hive, Sqoop, monitoring and many other tools; Oracle Big Data Appliance is an example of a Hadoop appliance.

Hadoop Distributions and Hadoop Appliances
Hadoop and its ecosystem tools are Apache open source projects.
Cloudera, Hortonworks and other leading Hadoop-based technology companies contribute to these open source projects and provide training, support and services.
Cloudera has a proprietary monitoring tool developed for large Hadoop clusters; it is free for up to 50 nodes, after which a license fee applies.
Hortonworks uses Ambari, an open source monitoring tool with no license fee.

HDFS
– How do you copy files to and from HDFS?
– What is HDFS?
– What are the HDFS daemons? Explain the NameNode, Secondary NameNode and DataNode.
– What are the different parameter files? Explain the importance of the parameters. What does marking a parameter as "final" mean?
– What is a gateway node? What is its role with respect to HDFS?
– What are blocks and block size, and how is data distributed?
– What is fault tolerance? What is the role of the replication factor?
– What are the default block size and replication factor?
– Given a scenario, explain how files are stored in HDFS; work out the size of each block and the replication factor.
– How do you override parameters such as block size and replication factor while copying files? (See the sketch after this list.)
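
As a companion to the copy and override questions above, here is a minimal sketch of copying a file to and from HDFS with the Java FileSystem API while overriding the block size and replication factor. The NameNode address, the paths and the property values are hypothetical placeholders, not settings from this course.

// Minimal sketch: copy a local file into HDFS with an overridden
// block size and replication factor via the Hadoop FileSystem API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020");      // hypothetical NameNode address
        conf.setLong("dfs.blocksize", 64L * 1024 * 1024);      // 64 MB blocks (illustrative)
        conf.set("dfs.replication", "2");                      // replication factor 2 instead of the default 3

        FileSystem fs = FileSystem.get(conf);

        // Copy from the local file system into HDFS; the file is split into
        // blocks that are distributed across DataNodes and replicated
        // according to dfs.replication.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/training/input.txt"));

        // Copy back from HDFS to the local file system.
        fs.copyToLocalFile(new Path("/user/training/input.txt"),
                           new Path("/tmp/copy-of-input.txt"));
        fs.close();
    }
}

The same overrides can typically be passed to the hdfs dfs command-line utility from a gateway node using -D options (for example -D dfs.replication=2) when copying files.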

MapReduce
– How do you run a MapReduce job? (A driver sketch follows this slide.)
– What is the difference between classic and YARN?
– What are map tasks and reduce tasks?
– What are the MapReduce daemons in classic?
– What are the MapReduce daemons in YARN?
– What is the ApplicationMaster?
– What is the JobHistory Server?
– How do you troubleshoot using the logs?
– How does fault tolerance work in MapReduce?
– What is split size?
– How do you override parameters while running programs?
– What is the role of the gateway node with respect to running MapReduce jobs?
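
To ground the "how do you run a MapReduce job" question, below is a minimal sketch of a YARN-era job driver. WordCountMapper and WordCountReducer are the illustrative classes sketched after the next slide, and the split-size override is only an example of changing a parameter at submission time.

// Minimal sketch of a MapReduce driver (job submission) on YARN.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Example of overriding a parameter at submission time:
        // cap the input split size at 64 MB (value is illustrative).
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would typically be submitted from a gateway node with something like: hadoop jar wordcount.jar WordCountDriver /user/training/input /user/training/output (jar name and paths illustrative).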

MapReduce – Programming
– When we develop a MapReduce program, what are the criteria for designing the map function and the reduce function? (A mapper/reducer sketch follows this slide.)
– What are the steps involved in the development life cycle?
– What is shuffle and sort?
– What are the different input and output formats?
– What are the different key and value classes?
– Why do we have a new set of classes such as IntWritable and Text instead of the Java classes Integer and String?
– What are the steps involved in developing custom key and value classes?
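
A minimal word-count sketch of the map and reduce functions, using Hadoop's Writable wrapper types (Text, IntWritable) rather than the Java classes String and Integer. The class names are illustrative, and in practice each class would live in its own source file.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The map function is invoked once per input record (here, per line).
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);   // emit (word, 1)
            }
        }
    }
}

class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The reduce function is invoked once per unique key, with all values
        // for that key grouped together after shuffle and sort.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}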

MapReduce – Shuffle and Sort
– For a given input data set, how many times will the map function be invoked?
– What does the map output look like?
– How is the data partitioned and sorted? (A partitioner sketch follows this slide.)
– What is a spill, and how many times does it happen?
– What does the reducer input look like?
– For a given input data set, how many times will the reduce function be invoked?
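
A small illustration of where partitioning fits into shuffle and sort: a custom Partitioner that routes map output keys to reducers by their first letter. Records with the same key always reach the same reducer, where they arrive grouped and sorted by key. The class name and routing rule are purely illustrative; the default is HashPartitioner.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Illustrative rule: route words by their first letter so that each
        // reducer handles an alphabetical range of keys.
        String word = key.toString();
        if (numReduceTasks == 0 || word.isEmpty()) {
            return 0;
        }
        char first = Character.toLowerCase(word.charAt(0));
        if (first < 'a' || first > 'z') {
            return 0;   // non-alphabetic keys all go to the first reducer
        }
        return (first - 'a') % numReduceTasks;
    }
}

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class) together with job.setNumReduceTasks(n).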

Apache Hive
– What is Hive?
– How is data stored and processed in Hive?
– Where is the metadata stored?
– How are tables created in Hive? (A sketch follows this slide.)
– What is the difference between a managed table and an external table?
– How do you specify delimiters?
– What are the different file formats supported by Hive?
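
A minimal sketch of creating and querying an external Hive table from Java through the HiveServer2 JDBC interface. It assumes the hive-jdbc driver is on the classpath; the host name, credentials, table layout, delimiter and HDFS location are hypothetical placeholders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExternalTableExample {
    public static void main(String[] args) throws Exception {
        // Explicit driver load; optional with JDBC 4+ auto-registration.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2-host:10000/default", "training", "");
             Statement stmt = con.createStatement()) {

            // External table: Hive records only the metadata in the metastore;
            // dropping the table leaves the underlying HDFS files in place.
            // A managed table (without EXTERNAL) is stored under Hive's
            // warehouse directory and is deleted together with the table.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS orders (" +
                "  order_id INT, customer STRING, amount DOUBLE) " +
                "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +   // field delimiter for text data
                "STORED AS TEXTFILE " +
                "LOCATION '/user/training/orders'");

            // Queries are translated by Hive into jobs on the cluster.
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT customer, SUM(amount) FROM orders GROUP BY customer")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getDouble(2));
                }
            }
        }
    }
}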

Apache Sqoop

Apache Spark