Introduction to Hadoop


Introduction to Hadoop Jessica Krosschell

Copyright © 2014 Tiber Solutions, LLC

Agenda
Introduction and History
Framework
Real-Life Use Case / Enterprise Architecture
Projects / Demo
Lab

Sources:
http://en.wikipedia.org/wiki/Apache_Hadoop
http://hadoop.apache.org/
http://hortonworks.com/
http://www.infoq.com/articles/BigDataPlatform

Introduction and History
Open-source, Java-based framework of tools for the storage and large-scale processing of data sets on clusters of commodity hardware.
Characteristics: batch processing; massive full data scans; structured and unstructured data sources; doesn't necessarily replace an RDBMS.
Big Data challenge points: Velocity, Volume, and Variety.
History
2005: Doug Cutting and Mike Cafarella create Hadoop, based on the Google File System white papers. The team names the project "Hadoop" after Cutting's son's toy elephant.
2006: Apache Hadoop project established.
2008: Cloudera founded; Doug Cutting is chief architect.
2011: Hortonworks founded by several key Yahoo Hadoop engineers.

Introduction and History, cont.
Users of Hadoop: Yahoo, Facebook, Twitter.
Twitter uses Scribe to write logs to Hadoop and Pig to analyze the data sets.
Facebook example: https://www.facebook.com/notes/facebook-engineering/hadoop/16121578919
Example applications: mining users' behaviors to generate recommendations; searching for uncommon patterns.
Hadoop distributions: companies that build on top of Apache Hadoop, provide enterprise Hadoop, and help simplify the process. The top three are listed below.
Cloudera: most established by far; largest user base; proprietary Cloudera Management Suite.
Hortonworks: uses 100% open-source Hadoop (no additional proprietary software).
MapR: proprietary file system (MapRFS) rather than HDFS.
More: http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support

Framework
Approach: break down data and computation into smaller pieces across servers (data and analysis co-location means the analysis is moved to the data).
Fault tolerant and highly scalable: scalability is linear and scales out; if you want to double the processing, double the computers.
The Apache Hadoop project contains the following modules:
Hadoop Common: libraries and utilities needed by the other modules.
Hadoop Distributed File System (HDFS): distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster; self-healing and provides redundancy.
Hadoop MapReduce: framework for processing large data jobs across many nodes and combining the results.
Hadoop YARN (Yet Another Resource Negotiator): added as part of Hadoop 2.0; application management framework that allows Hadoop to go beyond MapReduce apps.

Framework, cont.
Data Services: projects that store, process, and access data in many ways.
Pig: scripting language for Hadoop (Pig Latin) used to analyze large data sets. The infrastructure layer consists of a compiler that produces MapReduce jobs and executes them on the Hadoop cluster. Appeals to developers more familiar with scripting languages and SQL than Java.
Hive: SQL interface for Hadoop (HiveQL, or HQL). Hortonworks provides a Hive ODBC driver that allows BI tools to connect to Hadoop.
HCatalog: metadata and table management. Enables users to access data as a set of tables without having to worry about how or where the data is stored, and enables data sharing among tools such as Pig, MapReduce, and Hive.
HBase: non-relational (NoSQL) database for interactive apps. Commonly used for predictions and recommendations (intelligent applications).
Flume: collects log files and events. The primary use case is moving web log files into Hadoop.
Sqoop: moves structured data into or out of a SQL database ("SQL to Hadoop").
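To make the Sqoop bullet concrete, a typical import invocation might look like the sketch below. The JDBC URL, database, table name, and target directory are placeholder values for illustration only, not part of this deck.

```shell
# Hypothetical Sqoop import: pull the "orders" table from a MySQL
# database into HDFS (all connection details are placeholders).
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --table orders \
  --target-dir /user/hue/orders
```

The result is a set of delimited text files in the HDFS target directory, one per map task, which can then be queried with Pig or Hive.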

Framework, cont.
Operational Services: projects for operations and management.
Ambari: management and monitoring. Makes clusters easy to operate and simplifies provisioning.
Oozie: workflow and scheduling. Coordinates jobs written in multiple languages and allows specification of ordering and dependencies between jobs.
There are many additional projects (either Apache or company-specific); the HDP framework is one example.

Enterprise Architecture

Hadoop Core Architecture - HDFS
HDFS: Java-based, distributed file system that runs on large clusters of commodity machines, providing rapid data transfer rates and uninterrupted operation in the case of node failures.
NameNode: tracks the locations of data and governs space; stores metadata about each file.
DataNode: pings the NameNode and gets instructions back; manages reads, writes, and replication.
http://hortonworks.com/hadoop/hdfs/

Hadoop Core Architecture - MapReduce
MapReduce: computational paradigm in which an application is divided into small fragments of work.
Map: function that parcels out work to the different nodes in the distributed cluster. The Map function generates key-value pairs to set up the Reduce function. As a SQL analogy, it is the SELECT clause.
Reduce: function that collates, merges, and/or aggregates the key-value pairs into a single result set. As a SQL analogy, it is the GROUP BY clause.
JobTracker: splits a job into tasks and distributes them based on availability and data location.
TaskTracker: executes tasks via a JVM and sends status back to the JobTracker.
http://hortonworks.com/hadoop/mapreduce/
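The Map/Reduce key-value flow described above can be sketched in plain Python. This is only an illustration of the pattern (real Hadoop jobs are typically written as Java Mapper/Reducer classes); the input lines reuse words from the demo's two text files.

```python
# Minimal word-count sketch of the MapReduce pattern (illustrative only).
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) key-value pair for every word -- the SELECT step."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values into one result -- the GROUP BY step."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["Hello Hadoop and goodbye", "Hello and Goodbye"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In a real cluster the map and reduce functions run on different nodes and the shuffle happens over the network; the logic per key-value pair is the same.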

Hadoop Core Architecture - YARN
YARN: cluster resource management, introduced in Hadoop 2.0.
Moves resource management out of MapReduce into YARN, allowing MapReduce to focus on data processing and letting other engines use YARN and HDFS.
Reuses the framework from MapReduce to help ensure compatibility with existing applications.
Splits the JobTracker responsibilities into separate daemons: a global ResourceManager and a per-application ApplicationMaster (job scheduling/monitoring).
http://hortonworks.com/get-started/yarn/

Demo Outline
Scenario: we are going to count the words in two text files, load the results into a table, and report from it.
Execute wordcount: command-line Hadoop; Pig script
Manage files: File Browser (Hue application); HDFS Explorer
Create and load tables: HCatalog; Hive
Report from tables: ODBC - BusinessObjects

Command Line WordCount
Use putty to log into the sandbox as root. Change directories to tiber_demo. Cat both word files.

Create the input directory in /user/hue:
hadoop fs -mkdir /user/hue/wc_input

Upload the two text files to the new directory:
hadoop fs -put word1.txt word2.txt /user/hue/wc_input

Display the contents of the directory:
hadoop fs -ls /user/hue/wc_input

Run the wordcount program:
hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar wordcount /user/hue/wc_input /user/hue/wc_output

Display the contents of the output directory:
hadoop fs -ls /user/hue/wc_output

Display the contents of the result file:
hadoop fs -cat /user/hue/wc_output/part-r-00000

word1.txt: This is a word file with Hello Hadoop and goodbye.
word2.txt: I'm using Hortonworks Sandbox to run my Hadoop demos. Hello and Goodbye.

Hue Application
Browser-based environment that enables interaction with the Hadoop cluster.
http://54.225.203.175:8000
Login User: hue
Password: caMswQlzSF6QPQHYXlws

File Browser and HDFS Explorer
We saw the command-line HDFS interaction in the previous example.
Demo: view the wordcount result files using the File Browser and HDFS Explorer.
http://54.225.203.175:8000/filebrowser/

Apache Pig
Scripting language (Pig Latin) that defines a set of transformations on a data set, such as aggregate, join, and sort. Ideal for ETL, research on raw data, and iterative data processing.
More info: http://hortonworks.com/hadoop/pig/
Demo: execute the pig_wordcount script and use the File Browser to view the results.
** If there's time, execute with additional DUMP statements to see how the data is treated.

Script:
text1 = LOAD '/user/hue/wc_input/word1.txt' USING PigStorage();
text2 = LOAD '/user/hue/wc_input/word2.txt' USING PigStorage();
mergetext = UNION text1, text2;
separatewords = FOREACH mergetext GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
groupwords = GROUP separatewords BY word;
countwords = FOREACH groupwords GENERATE COUNT(separatewords), group;
STORE countwords INTO '/user/hue/wc_output/pig_wordcount';

Apache HCatalog
Table abstraction (metadata) layer; frees the user from knowing where data is stored in HDFS.
More info: http://hortonworks.com/hadoop/hcatalog/
Demo: use HCatalog to load the first wordcount results file into the wordcount_out table; browse the data using HCatalog; show the "database" files using HDFS Explorer.

Apache Hive
Run HiveQL, a SQL-like language, to interact with Hadoop.
More info: http://hortonworks.com/hadoop/hive/
Hive Cheat Sheet: http://hortonworks.com/blog/hive-cheat-sheet-for-sql-users/
Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
Demo: create a table, load the wordcount results from the Pig script into it, and retrieve the data.
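A sketch of what the Hive demo step might look like. The table and column names are illustrative (the deck does not specify them), and it assumes the Pig output is tab-delimited text, which is PigStorage's default.

```sql
-- Hypothetical external table over the Pig word-count output.
CREATE EXTERNAL TABLE pig_wordcount (cnt INT, word STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/hue/wc_output/pig_wordcount';

-- Retrieve the most frequent words.
SELECT word, cnt FROM pig_wordcount ORDER BY cnt DESC LIMIT 10;
```

Because the table is EXTERNAL, Hive reads the files in place and dropping the table does not delete the underlying HDFS data.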

BusinessObjects Query
Hortonworks Hive ODBC Driver: http://hortonworks.com/hdp/addons/
Demo: show a universe and report that include the word count tables.

Labs
Use the sandbox environment and the hue username/password: http://54.225.203.175:8000/
Use separate databases, as well as your initials, when saving tables, scripts, etc.
Run through the following Hortonworks tutorials:
Intro to HCatalog and Pig: http://hortonworks.com/hadoop-tutorial/hello-world-an-introduction-to-hadoop-hcatalog-hive-and-pig/
HCatalog, Pig, and Hive Commands: http://hortonworks.com/hadoop-tutorial/how-to-use-hcatalog-basic-pig-hive-commands/
Load Data into Sandbox (can include Excel, if interested in ODBC): http://hortonworks.com/hadoop-tutorial/loading-data-into-the-hortonworks-sandbox/

Labs, cont.
More Hortonworks tutorials:
Pig Grunt Shell: http://hortonworks.com/hadoop-tutorial/exploring-data-apache-pig-grunt-shell/
*Note that you can either download putty or use the Hue shell in the Hue application
**Instead of the /user/hadoop directory, use /user/hue
HDFS Shell: http://hortonworks.com/hadoop-tutorial/using-commandline-manage-files-hdfs/
ODBC Driver: http://hortonworks.com/hadoop-tutorial/how-to-install-and-configure-the-hortonworks-odbc-driver-on-windows-7/
HDFS Explorer: http://hortonworks.com/hadoop-tutorial/use-hdfs-explorer-manage-files-hortonworks-sandbox/
More Hive: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-hive/
More Pig: http://hortonworks.com/hadoop-tutorial/how-to-process-data-with-apache-pig/
Whatever else you find interesting: http://hortonworks.com/tutorials/#get-started