Copyright © 2015, SAS Institute Inc. All rights reserved. THE ELEPHANT IN THE ROOM SAS & HADOOP.


1 THE ELEPHANT IN THE ROOM: SAS & HADOOP

2 ABOUT THE PRESENTER
Jim Watson, SAS Education, Canberra
- Background in SAS programming, SQL programming, database processing, grid processing, and related areas
- With SAS since 1999

3 LIST OF TOPICS
- What is Hadoop?
- How SAS integrates with Hadoop
  - HDFS
  - LIBNAME engine
  - Explicit pass-through
  - High Performance Analytics

4 WHAT IS HADOOP?
Apache Hadoop is an open-source software framework, written in Java, for distributed storage and processing of very large datasets on computer clusters built from commodity hardware.

5 ADVANTAGES OF HADOOP
Some characteristics of Hadoop include:
- Open source
- Simple-to-use distributed file system
- Supports highly parallel processing
- Scalable, so it is suitable for massive amounts of data
- Designed to work on low-cost hardware
- Fault tolerant (redundant) at the data level
  - automatic replication of data
  - automatic fail-over

6 HADOOP FUNDAMENTALS
- HDFS ("Hadoop Distributed File System"): files are distributed across the Hadoop cluster
- Hadoop YARN: a framework for job scheduling and cluster resource management
- MapReduce: files are processed locally and in parallel; based on YARN
These modules handle reading, writing, and processing large files in a distributed environment. This allows the data to be exploited as if the cluster were a single, massively powerful server.

7 HADOOP DISTRIBUTED FILE SYSTEM
- HDFS is hierarchical, with Linux-style paths, file ownership, and permissions.
- HADOOP FS commands are similar to Linux commands.
- HDFS is not built into the operating system.
- Files are append-only after they are written.

    $ hadoop fs -ls /user/student1
    Found 4 items
    drwxr-xr-x   - student1 sasapp          0 2014-05-30 20:00 /user/student1/.Trash
    drwx------   - student1 sasapp          0 2014-05-30 10:05 /user/student1/.stage
    drwxr-xr-x   - student1 sasapp          0 2014-05-28 15:25 /user/student1/data
    drwxr-xr-x   - student1 sasapp          0 2014-05-28 13:59 /user/student1/users
    $ hadoop fs -mkdir /user/student1/newdir
    $

8 MAPREDUCE
MapReduce is a framework written in Java that is built into Hadoop. It automates the distributed processing of data files.
- map: processing of individual rows (filtering, row calculations)
- shuffle and sort: grouping rows for summarisation
- reduce: summary calculations within groups
The MapReduce framework coordinates multiple mapping, sorting, and reducing tasks that execute in parallel across the computer cluster.

9 WHAT'S INSIDE
[Architecture diagram. SAS side: SAS Client, SAS metadata server, SAS workspace server. Hadoop side: Hadoop NameNode, Hive, Hadoop DataNode 1, Hadoop DataNode 2, Hadoop DataNode 3.]

10 PARALLEL PROCESSING EXAMPLE
A MapReduce example: summarise a detailed order table to derive total revenue by state. The table is already distributed in HDFS.

Input table:
    id  st  rev
    1   NC  10
    2   GA  12
    3   VA  8
    4   NC  9
    5   VA  22
    6   NC  18
    7   NC  2
    8   GA  53
    ...

Result:
    st  totrev
    GA  65
    NC  39
    VA  30
    ...
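The three phases can be traced on the eight rows shown in the slide. The following is a minimal single-process sketch in Python, not Hadoop code: the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are illustrative assumptions, and a real MapReduce job would run each phase as distributed tasks.

```python
from collections import defaultdict

# Order rows from the slide: (id, state, revenue).
ORDERS = [
    (1, "NC", 10), (2, "GA", 12), (3, "VA", 8), (4, "NC", 9),
    (5, "VA", 22), (6, "NC", 18), (7, "NC", 2), (8, "GA", 53),
]


def map_phase(rows):
    """Map: turn each input row into a (key, value) pair."""
    return [(state, revenue) for _id, state, revenue in rows]


def shuffle_phase(pairs):
    """Shuffle and sort: group the values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: compute one summary value (total revenue) per key."""
    return {key: sum(values) for key, values in groups.items()}


totals = reduce_phase(shuffle_phase(map_phase(ORDERS)))
print(totals)  # {'GA': 65, 'NC': 39, 'VA': 30}
```

The totals match the result table on the slide: GA 65, NC 39, VA 30.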

11 PARALLEL PROCESSING EXAMPLE
[Diagram: file blocks -> map -> shuffle -> reduce -> output]

File blocks (id, st, rev):
    Block 1:  1 NSW 10;  2 QLD 12;  3 VIC 8;   4 NSW 9
    ...
    Block n:  5 VIC 22;  6 NSW 18;  7 NSW 2;   8 QLD 53

Map output per block (st, rev, ct):
    Block 1:  NSW 10 1;  QLD 12 1;  VIC 8 1;   NSW 9 1
    ...
    Block n:  VIC 22 1;  NSW 18 1;  NSW 2 1;   QLD 53 1

Shuffle output, grouped by state (st, rev, ct):
    NSW 10 1;  NSW 9 1;  NSW 18 1;  NSW 2 1
    VIC 8 1;   VIC 22 1
    ...

Reduce output (st, totrev):
    NSW 39
    VIC 30
    ...
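The block-level flow in this diagram can be imitated on one machine. In the sketch below, which is only an illustration, a thread pool stands in for the data nodes, the two blocks shown in the slide are treated as the whole file, and the names `map_block`, `shuffle`, `reduce_group`, and `run` are assumptions rather than Hadoop APIs. Each block is mapped independently to (st, rev, ct) records, the records are regrouped by state, and each group is reduced to a total and a row count.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# The two file blocks shown in the slide: rows of (id, state, revenue).
BLOCKS = [
    [(1, "NSW", 10), (2, "QLD", 12), (3, "VIC", 8), (4, "NSW", 9)],
    [(5, "VIC", 22), (6, "NSW", 18), (7, "NSW", 2), (8, "QLD", 53)],
]


def map_block(block):
    """Map one block on its own: emit (state, revenue, count=1) per row."""
    return [(state, revenue, 1) for _id, state, revenue in block]


def shuffle(mapped_blocks):
    """Bring together the records for each state from every block."""
    groups = defaultdict(list)
    for mapped in mapped_blocks:
        for state, revenue, count in mapped:
            groups[state].append((revenue, count))
    return groups


def reduce_group(records):
    """Reduce one state's records to (total revenue, number of rows)."""
    total = sum(revenue for revenue, _count in records)
    rows = sum(count for _revenue, count in records)
    return total, rows


def run(blocks):
    # Threads stand in for data nodes: each block is mapped independently.
    with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
        mapped = list(pool.map(map_block, blocks))
    groups = shuffle(mapped)
    return {state: reduce_group(recs) for state, recs in sorted(groups.items())}


result = run(BLOCKS)
print(result)  # {'NSW': (39, 4), 'QLD': (65, 2), 'VIC': (30, 2)}
```

Carrying the count alongside the revenue means a mean per state can be derived after the reduce step without revisiting the detail rows.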

12 "PIG" & "HIVE"
Pig and Hive provide less complex, higher-level programming methods for parallel processing of Hadoop data files.
- Pig: A platform for data analysis that includes stepwise procedural programming that converts to MapReduce.
- Hive: A data warehousing framework to query and manage large data sets stored in Hadoop. It provides a mechanism to structure the data and query it using an SQL-like language called HiveQL. Most HiveQL queries are compiled into MapReduce programs.

13 THE HADOOP ECOSYSTEM
The Apache Hadoop core technologies (HDFS, YARN, and MapReduce), along with additional projects including Pig, Hive, and others, are collectively called the Hadoop ecosystem.

14 EXPLOITING THE HDFS
- The Hadoop FILENAME engine
  - Upload local data to Hadoop
  - Read data from Hadoop
  - Use normal SAS PROC and DATA steps
- PROC HADOOP
  - Submit HDFS commands
  - Submit MapReduce and Pig programs

15 THE FILENAME STATEMENT & HDFS
    filename hadconfg "/workshop/hadoop_config.xml";
    filename mapres hadoop "/user/&std/data/mapoutput" concat
             cfg=hadconfg user="&std";

    data work.commonwords;
       infile mapres dlm='09'x;
       input word $ count;
       …
    run;
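The DATA step above expects each record of the map output to hold a word and a count separated by a tab character ('09'x). As a rough illustration of that record layout only, the Python sketch below parses the same shape of data; the sample lines are hypothetical, and `read_word_counts` is an illustrative helper, not part of SAS or Hadoop.

```python
import io


def read_word_counts(lines):
    """Parse tab-delimited 'word<TAB>count' records into (word, count) tuples."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")
        records.append((word, int(count)))
    return records


# Hypothetical sample standing in for the HDFS map output files.
sample = io.StringIO("the\t120\nhadoop\t42\nsas\t17\n")
commonwords = read_word_counts(sample)
print(commonwords)  # [('the', 120), ('hadoop', 42), ('sas', 17)]
```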

16 PROC HADOOP
PROC HADOOP submits:
- Hadoop file system (HDFS) commands
- MapReduce programs
- Pig language code

    PROC HADOOP ;
       HDFS ;
       MAPREDUCE ;
       PIG ;
       PROPERTIES ;
    RUN;

17 PROC HADOOP – HDFS STATEMENTS
    HDFS COPYFROMLOCAL='local-file' OUT='output-location' ;
    HDFS COPYTOLOCAL='HDFS-file' OUT='output-location' ;
    HDFS DELETE='HDFS-file' ;
    HDFS MKDIR='HDFS-path';
    HDFS RENAME='HDFS-file' OUT='new-name';

18 ACCESS HIVE TABLES VIA SAS
Two main methods to exploit Hadoop Hive tables in SAS:
- The LIBNAME engine (aka "implicit pass-through")
  - Assign a LIBREF to Hive and use SAS code against the LIBREF
  - SAS code is automatically converted to Hive
- Explicit pass-through
  - Hive code is embedded in SAS code and is submitted verbatim to Hadoop

19 THE HADOOP LIBNAME ENGINE
    libname hivedb hadoop server=namenode
            subprotocol=hive2
            port=10000 schema=diacchad
            user=studentX pw=StudentX;

General form:
    LIBNAME libref engine-name ;

SAS log:
    23   libname hivedb hadoop server=namenode
    24           subprotocol=hive2
    25           port=10000 schema=diacchad
    26           user="&std" pw="&stdpw";
    NOTE: Libref HIVEDB was successfully assigned as follows:
          Engine:        HADOOP
          Physical Name: jdbc:hive2://namenode:10000/diacchad

20 LIBNAME ENGINE EXAMPLE
    options sastrace=',,,d' sastraceloc=saslog nostsuffix;

    proc means data=hivedb.order_fact sum mean;
       var total_retail_price;
    run;

    proc freq data=hivedb.order_fact;
       tables order_type;
    run;

    options sastrace=off;

21 LIBNAME ENGINE EXAMPLE
    NOTE: SQL generation will be used to perform the initial summarization.
    HADOOP_41: Executed: on connection 7
    select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4
      from ( select COUNT(*) as ZSQL1,
                    COUNT(*) as ZSQL2,
                    COUNT(TXT_1.`total_retail_price`) as ZSQL3,
                    SUM(TXT_1.`total_retail_price`) as ZSQL4
               from `ORDER_FACT` TXT_1 ) T1
     where T1.ZSQL1 > 0

    ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.
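The log shows that only aggregates (row count, non-missing count, and sum) come back from Hive, so statistics such as the mean can be finished from a single returned row. The sketch below runs a query of the same shape against an in-memory SQLite table as a stand-in for Hive. This is an assumption for illustration only: the sample rows are made up, SQLite is not Hive, and the final division is a plausible reading of how the mean is completed, not a documented SAS internal.

```python
import sqlite3

# In-memory SQLite table standing in for the Hive table ORDER_FACT.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_fact (order_type INTEGER, total_retail_price REAL)"
)
# Hypothetical sample rows; one price is missing (NULL).
conn.executemany(
    "INSERT INTO order_fact VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, None), (3, 30.0)],
)

# Same shape as the generated query in the log: the database does the
# summarisation and returns a single row of aggregates.
row = conn.execute(
    """
    select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4
      from ( select COUNT(*) as ZSQL1,
                    COUNT(*) as ZSQL2,
                    COUNT(TXT_1.total_retail_price) as ZSQL3,
                    SUM(TXT_1.total_retail_price) as ZSQL4
               from order_fact TXT_1 ) T1
     where T1.ZSQL1 > 0
    """
).fetchone()

n_rows, _, n_nonmissing, total = row
# The client only needs the aggregates to finish the statistics.
mean = total / n_nonmissing if n_nonmissing else None
print(n_rows, n_nonmissing, total, mean)  # 4 3 60.0 20.0
conn.close()
```

COUNT(column) skips missing values while COUNT(*) does not, which is why the query asks for both.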

22 EXPLICIT PASS THROUGH
    proc sql;
       connect to hadoop (server=namenode subprotocol=hive2
                          schema=diacchad user="&std");
       select * from connection to hadoop
          (select employee_name, salary
             from salesstaff
            where emp_hire_date between '2011-01-01' and '2011-12-31'
          );
       disconnect from hadoop;
    quit;

23 HIGH PERFORMANCE ANALYTICS

Interface: High-Performance Analytics Procedures
Purpose:   Perform complex analytical computations on Hadoop tables within the data nodes of the Hadoop distribution via SAS procedure language. HPDS2 allows for manipulation of data structure (column derivation).
Product:   SAS High-Performance Analytics Solutions

Interface: SAS Visual Analytics
Purpose:   A web interface to generate graphical visualizations of data distributions and relationships on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product:   SAS Visual Analytics

Interface: PROC IMSTAT
Purpose:   A programming interface to perform complex analytical calculations on Hadoop tables pre-loaded into memory within the data nodes of the Hadoop distribution.
Product:   SAS In-Memory Statistics

Interface: DS2
Purpose:   A SAS proprietary language for table manipulation that translates to database language and executes in parallel in the data nodes of a distributed database.
Product:   SAS In-Database Code Accelerators

Data loader for Hadoop

24 HIGH PERFORMANCE ANALYTICS
[Architecture diagram. SAS side: SAS Client, SAS metadata server, SAS workspace server. Hadoop side: Hadoop NameNode with Hive and the SAS High Performance Analytics Root Node; Hadoop DataNodes 1 to 3, each with a SAS High Performance Analytics Worker Node.]
SAS processes in each HDFS data node execute in parallel.

25 LEARNING MORE
- SAS Website
- SAS Education
  - Introduction to SAS & Hadoop: 2-day course requiring some SAS programming and SQL knowledge
  - DS2 Programming: Essentials: 2-day course, requires intermediate SAS programming knowledge
  - DS2 Programming Essentials with Hadoop: 1½-day course, requires intermediate SAS programming knowledge

