Copyright © 2015, SAS Institute Inc. All rights reserved. THE ELEPHANT IN THE ROOM SAS & HADOOP.


ABOUT THE PRESENTER
Jim Watson
SAS Education, Canberra
Background in SAS programming, SQL programming, database processing, grid processing, et al.
With SAS since 1999

LIST OF TOPICS
What is Hadoop?
How SAS integrates with Hadoop
  HDFS
  LIBNAME Engine
  Explicit Pass-through
  High Performance Analytics

WHAT IS HADOOP?
Apache Hadoop is an open-source software framework, written in Java, for distributed storage and processing of very large datasets on computer clusters built from commodity hardware.

ADVANTAGES OF HADOOP
Some characteristics of Hadoop include:
  Open source
  Simple-to-use distributed file system
  Supports highly parallel processing
  Scalable, so it is suitable for massive amounts of data
  Designed to work on low-cost hardware
  Fault tolerant (redundant) at the data level:
    automatic replication of data
    automatic fail-over

HADOOP FUNDAMENTALS
HDFS ("Hadoop Distributed File System")
  Files are distributed across the Hadoop cluster.
Hadoop YARN
  A framework for job scheduling and cluster resource management.
MapReduce
  Files are processed locally and in parallel.
  Based on YARN.
These modules handle reading, writing and processing large files in a distributed environment. This allows the data to be exploited as if it were held on a single, massively powerful server.

HADOOP DISTRIBUTED FILE SYSTEM
HDFS is hierarchical, with Linux-style paths, file ownership and permissions.
hadoop fs commands are similar to Linux commands.
HDFS is not built into the operating system.
Files are append-only after they are written.

  $ hadoop fs -ls /user/student
  Found 4 items
  drwxr-xr-x - student1 sasapp :00 /user/student1/.Trash
  drwx student1 sasapp :05 /user/student1/.stage
  drwxr-xr-x - student1 sasapp :25 /user/student1/data
  drwxr-xr-x - student1 sasapp :59 /user/student1/users
  $ hadoop fs -mkdir /user/student1/newdir
  $

MAPREDUCE
MapReduce is a framework written in Java that is built into Hadoop. It automates the distributed processing of data files.
  map: processing of individual rows (filtering, row calculations)
  shuffle and sort: grouping rows for summarisation
  reduce: summary calculations within groups
The MapReduce framework coordinates multiple mapping, sorting, and reducing tasks that execute in parallel across the computer cluster.

WHAT’S INSIDE
[Architecture diagram. Components shown: SAS Client; SAS metadata server; SAS workspace server; Hadoop NameNode; Hive; Hadoop DataNode 1; Hadoop DataNode 2; Hadoop DataNode 3.]

PARALLEL PROCESSING EXAMPLE
A MapReduce example: summarise a detailed order table to derive total revenue by state. The table is already distributed in HDFS.

Input (order table):
  id  st  rev
  1   NC  10
  2   GA  12
  3   VA  8
  4   NC  9
  5   VA  22
  6   NC  18
  7   NC  2
  8   GA  53
  ...

Output (total revenue by state):
  st  totrev
  GA  65
  NC  39
  VA  30
  ...

PARALLEL PROCESSING EXAMPLE
[Flow diagram: File blocks -> map -> shuffle -> reduce -> output]

File blocks (id, st, rev):
  First block:  1 NSW 10;  2 QLD 12;  3 VIC 8;   4 NSW 9
  Block n:      5 VIC 22;  6 NSW 18;  7 NSW 2;   8 QLD 53

map output (st, rev, ct):
  First block:  NSW 10 1;  QLD 12 1;  VIC 8 1;   NSW 9 1
  Block n:      VIC 22 1;  NSW 18 1;  NSW 2 1;   QLD 53 1

shuffle output (st, rev, ct):
  NSW group:    NSW 10 1;  NSW 9 1;   NSW 18 1;  NSW 2 1
  VIC group:    VIC 8 1;   VIC 22 1

reduce output (st, totrev):
  NSW 39
  VIC 30
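The flow on this slide can be traced in a few lines of Python. This is only a single-process sketch of the map, shuffle and reduce phases, not Hadoop code: the function names (map_block, shuffle, reduce_group, revenue_by_state) and the two-block split are illustrative assumptions, and the rows are the ones shown on the slide.

```python
from collections import defaultdict

# Two file blocks of (id, st, rev) rows, as on the slide.
BLOCKS = [
    [(1, "NSW", 10), (2, "QLD", 12), (3, "VIC", 8), (4, "NSW", 9)],
    [(5, "VIC", 22), (6, "NSW", 18), (7, "NSW", 2), (8, "QLD", 53)],
]


def map_block(rows):
    """Map: emit (st, (rev, ct)) for every row in one block."""
    return [(st, (rev, 1)) for _id, st, rev in rows]


def shuffle(mapped_blocks):
    """Shuffle and sort: group the mapped values by state key."""
    groups = defaultdict(list)
    for mapped in mapped_blocks:
        for st, value in mapped:
            groups[st].append(value)
    return dict(sorted(groups.items()))


def reduce_group(st, values):
    """Reduce: total revenue and row count for one state."""
    totrev = sum(rev for rev, _ct in values)
    ct = sum(ct for _rev, ct in values)
    return st, totrev, ct


def revenue_by_state(blocks):
    # On a cluster each block would be mapped on the node that holds it;
    # here the blocks are simply processed one after another.
    mapped = [map_block(rows) for rows in blocks]
    grouped = shuffle(mapped)
    return [reduce_group(st, values) for st, values in grouped.items()]


if __name__ == "__main__":
    for st, totrev, ct in revenue_by_state(BLOCKS):
        print(st, totrev, ct)
```

The ct column carried through the map step is what lets a reducer report a row count (or derive an average) without rereading the source rows.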

“PIG” & “HIVE”
Pig and Hive provide less complex, higher-level programming methods for parallel processing of Hadoop data files.
  Pig: A platform for data analysis that includes stepwise procedural programming that converts to MapReduce.
  Hive: A data warehousing framework to query and manage large data sets stored in Hadoop. Provides a mechanism to structure the data and query the data using an SQL-like language called HiveQL. Most HiveQL queries are compiled into MapReduce programs.

THE HADOOP ECOSYSTEM
The Apache Hadoop core technologies of HDFS, YARN, and MapReduce, along with additional projects including Pig, Hive, and others, are collectively called the Hadoop ecosystem.

EXPLOITING THE HDFS
The Hadoop FILENAME engine
  Upload local data to Hadoop
  Read data from Hadoop
  Use normal SAS PROC and DATA steps
PROC HADOOP
  Submit HDFS commands
  Submit MapReduce and Pig programs

THE FILENAME STATEMENT & HDFS

  filename hadconfg "/workshop/hadoop_config.xml";
  filename mapres hadoop "/user/&std/data/mapoutput"
           concat cfg=hadconfg user="&std";

  data work.commonwords;
     infile mapres dlm='09'x;
     input word $ count;
     …
  run;
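For readers less familiar with the INFILE options, here is a rough Python analogue of what this DATA step does. It rests on two assumptions: that CONCAT causes every file in the HDFS directory to be read as one stream, and that each record holds a word and a count separated by a tab (dlm='09'x). The function names, part-file names and sample records are hypothetical, and a local temporary directory stands in for HDFS.

```python
import tempfile
from pathlib import Path


def read_common_words(directory):
    """Read every file in `directory` as one stream of tab-delimited
    (word, count) records, mimicking CONCAT plus dlm='09'x."""
    records = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            if not line.strip():
                continue  # skip blank lines
            word, count = line.split("\t")
            records.append((word, int(count)))
    return records


def demo():
    # Hypothetical reducer output files; a local temp dir stands in for HDFS.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "part-r-00000").write_text("the\t120\nand\t95\n")
        Path(tmp, "part-r-00001").write_text("of\t80\n")
        return read_common_words(tmp)


if __name__ == "__main__":
    for word, count in demo():
        print(word, count)
```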

PROC HADOOP
PROC HADOOP submits:
  Hadoop file system (HDFS) commands
  MapReduce programs
  Pig language code

  PROC HADOOP ;
     HDFS ;
     MAPREDUCE ;
     PIG ;
     PROPERTIES ;
  RUN;

PROC HADOOP: HDFS STATEMENTS

  HDFS COPYFROMLOCAL='local-file' OUT='output-location' ;
  HDFS COPYTOLOCAL='HDFS-file' OUT='output-location' ;
  HDFS DELETE='HDFS-file' ;
  HDFS MKDIR='HDFS-path';
  HDFS RENAME='HDFS-file' OUT='new-name';

ACCESS HIVE TABLES VIA SAS
Two main methods to exploit Hadoop Hive tables in SAS:
  The LIBNAME Engine (aka “Implicit Pass Through”)
    Assign a LIBREF to Hive and use SAS code upon the LIBREF.
    SAS code is automatically converted to Hive.
  Explicit Pass Through
    Hive code is embedded in SAS code and is submitted verbatim to Hadoop.

THE HADOOP LIBNAME ENGINE

  libname hivedb hadoop server=namenode
                 subprotocol=hive2
                 port=10000 schema=diacchad
                 user=studentX pw=StudentX;

Syntax:
  LIBNAME libref engine-name ;

Log:
  23   libname hivedb hadoop server=namenode
  24                  subprotocol=hive2
  25                  port=10000 schema=diacchad
  26                  user="&std" pw="&stdpw";
  NOTE: Libref HIVEDB was successfully assigned as follows:
        Engine:        HADOOP
        Physical Name: jdbc:hive2://namenode:10000/diacchad

LIBNAME ENGINE EXAMPLE

  options sastrace=',,,d' sastraceloc=saslog nostsuffix;

  proc means data=hivedb.order_fact sum mean;
     var total_retail_price;
  run;

  proc freq data=hivedb.order_fact;
     tables order_type;
  run;

  options sastrace=off;

LIBNAME ENGINE EXAMPLE

  NOTE: SQL generation will be used to perform the initial summarization.
  HADOOP_41: Executed: on connection 7
  select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4
    from ( select COUNT(*) as ZSQL1,
                  COUNT(*) as ZSQL2,
                  COUNT(TXT_1.`total_retail_price`) as ZSQL3,
                  SUM(TXT_1.`total_retail_price`) as ZSQL4
             from `ORDER_FACT` TXT_1 ) T1
   where T1.ZSQL1 > 0
  ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.
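The log shows that only aggregates (row count, non-missing count and sum) come back from Hive, not the detail rows. A mean can then be derived as the sum divided by the non-missing count. The sketch below reproduces the shape of that query in Python, with an in-memory SQLite table standing in for Hive. The summarize function and the sample prices are made up for illustration, and the final division is an assumption about how the returned aggregates would be combined, not a statement of SAS internals.

```python
import sqlite3


def summarize(prices):
    """Return (n_rows, n_nonmissing, total, mean) using SQL aggregates only."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE order_fact (total_retail_price REAL)")
    con.executemany("INSERT INTO order_fact VALUES (?)", [(p,) for p in prices])
    # Same aggregate shape as the logged query: COUNT(*), COUNT(col), SUM(col).
    n_rows, n_nonmissing, total = con.execute(
        "SELECT COUNT(*), COUNT(total_retail_price), SUM(total_retail_price) "
        "FROM order_fact"
    ).fetchone()
    con.close()
    # The mean is derived on the client side from the returned aggregates.
    mean = total / n_nonmissing if n_nonmissing else None
    return n_rows, n_nonmissing, total, mean


if __name__ == "__main__":
    # None plays the role of a missing value.
    print(summarize([10.0, 20.0, None, 30.0]))
```

Only one row of aggregates crosses the network in this pattern, which is the point of pushing the summarisation down to the data store.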

EXPLICIT PASS THROUGH

  proc sql;
     connect to hadoop (server=namenode subprotocol=hive2
                        schema=diacchad user="&std");
     select * from connection to hadoop
        (select employee_name, salary
           from salesstaff
          where emp_hire_date between ' ' and ' '
        );
     disconnect from hadoop;
  quit;

HIGH PERFORMANCE ANALYTICS

Interface: High-Performance Analytics Procedures
Purpose:   Perform complex analytical computations on Hadoop tables within the data nodes of the Hadoop distribution via SAS procedure language. HPDS2 allows for manipulation of data structure (column derivation).
Product:   SAS High-Performance Analytics Solutions

Interface: SAS Visual Analytics
Purpose:   A web interface to generate graphical visualizations of data distributions and relationships on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product:   SAS Visual Analytics

Interface: PROC IMSTAT
Purpose:   A programming interface to perform complex analytical calculations on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product:   SAS In-Memory Statistics

Interface: DS2
Purpose:   A SAS proprietary language for table manipulation that translates to database language and executes in parallel in the data nodes of a distributed database.
Product:   SAS In-Database Code Accelerators

Data Loader for Hadoop

HIGH PERFORMANCE ANALYTICS
[Architecture diagram. Components shown: SAS Client; SAS metadata server; SAS workspace server; Hadoop NameNode; Hive; Hadoop DataNode 1; Hadoop DataNode 2; Hadoop DataNode 3; SAS High Performance Analytics Root Node; three SAS High Performance Analytics Worker Nodes.]
SAS processes in each HDFS data node execute in parallel.

LEARNING MORE
SAS Website
SAS Education
  Introduction to SAS & Hadoop
    2-day course requiring some SAS programming and SQL knowledge
  DS2 Programming: Essentials
    2-day course, requires intermediate SAS programming knowledge
  DS2 Programming Essentials with Hadoop
    1½-day course, requires intermediate SAS programming knowledge