1
Leveraging a Hadoop Cluster from SQL Server Integration Services
Kagan Arca, Senior Premier Field Engineer, Microsoft
2
Sponsors Main Gold Bronze Media Swag
3
Agenda
- Hadoop & HDInsight Overview
- Automation
- Data Transfer
4
Big Data is…
- Data that you think is valuable, but processing and storing it may not have been traditionally practical or affordable
- Data that you need to ask questions of to learn what questions to ask
- Data that is produced at a “fire hose” rate
Wikipedia defines Big Data in terms of the problems posed by the awkwardness of legacy tools in supporting massive datasets: Big Data is a collection of datasets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
5
Characteristics of Big Data
Volume, Variety, Velocity: Relational Data vs. Big Data

Types of Big Data. Big Data includes all types of data:
- Structured: the data has a schema, or a schema can easily be assigned to it.
- Semi-structured: has some structure, but columns are often missing or rows have their own unique columns.
- Unstructured: data that has no structure, like JPGs, PDF files, audio and video files, etc.

Big Data also has two inherent characteristics:
- Time-based: a piece of data is something known at a certain moment in time, and that time is an important element. For instance, you might live in San Francisco and tweet about a restaurant that you enjoy. If you later move to New York, the fact that you once liked a restaurant in San Francisco does not change.
- Immutable: because of its connection to a point in time, the truthfulness of the data does not change. We look at changes in Big Data as “new” entries, not “updates” of existing entries.
6
Do I have Big Data?
- Sensors
- Clicks
- Logs
- Transactional records
- Call centers
- Medical transcriptions
- Images
- Documents
- Signals from social media
- Simulations
7
The World of Data is Changing
By 2015, organizations that build a modern information management system will outperform their peers financially by 20 percent. – Gartner, Mark Beyer, “Information Management in the 21st Century”
Data explosion: 10x increase every five years, 85% from new data types (Volume, Velocity, Variety)
Easy accessibility of external data
Hadoop and the cloud: cheap, distributed storage & processing
8
What’s Hadoop? A system which:
- Uses inexpensive local storage to be fast
- Uses clusters of commodity hardware to scale out
- Is open source = HDFS + MapReduce + an entire ecosystem of tools and frameworks
9
Volume & Variety
Hadoop on Windows Server and Hadoop on Windows Azure:
- Enterprise-class security, HA & management
- Virtualization support
- Seamlessly integrated with Microsoft BI tools
- Symmetric cloud & on-premises experience
10
Hadoop On …
Hadoop on Windows:
- Dedicated workload cluster
- HDFS is the persistent data store
Hadoop on Azure:
- Transient clusters, kept as long as you want
- Persistent data in Azure storage; HDFS is a cache
Hadoop on Windows (virtualized):
- Cluster nodes on Hyper-V
- Multi-workload cluster
- Persistent data on network-attached storage
11
HDInsight: Visit HadoopOnAzure.com
12
HDInsight: Azure Storage Vault (ASV), Azure Blob Storage, Azure Flat Network Storage

Microsoft is porting the Apache Hadoop Framework so that it can run in production on Windows Server and Windows Azure, without Cygwin (a collection of tools that provides a Linux-like environment on Windows). This also includes additional components such as an interactive JavaScript console and another one for HIVE, which make the cluster accessible through a web interface in addition to the classic command line.

As previously mentioned, on Windows Server and Windows Azure, HDFS is implemented on top of NTFS. On Windows Azure, since a cluster would typically go away and be recreated for new jobs, it is also a good idea to have data stored in Windows Azure blob storage instead of HDFS, even if HDFS remains available. The main difference between the two is that HDFS data goes away when the compute nodes are shut down and the cluster is no longer needed, while Windows Azure blob storage remains available. For this reason, Hadoop on Windows Azure allows the use of the ASV URI scheme (ASV stands for Azure Storage Vault) to access blobs instead of using HDFS.

In the following picture, Windows Azure blob storage content is listed in the container “books”, in the folder “fr-fr”. Then files are copied from Windows Azure blob storage to HDFS, and the HDFS folder is listed. Under the command-line window, the storage account (deladonnee) content is seen through the CloudXplorer tool (a tool that can be downloaded from the Internet, built on top of the Windows Azure blob storage API).
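As a rough illustration of the workflow these notes describe (listing blob storage content and copying it into HDFS), the following sketch shows the kind of commands involved, run from the Hadoop command prompt or PowerShell on the cluster. The storage account, container, and file names are placeholders rather than the demo's values, and the exact asv:// URI form depends on the cluster's storage configuration.

    # List the "fr-fr" folder of a "books" container in Windows Azure blob storage via ASV.
    # Account, container, and path names are hypothetical.
    hadoop fs -ls "asv://books@mystorageaccount.blob.core.windows.net/fr-fr/"

    # Copy a file from blob storage into HDFS, then list the HDFS folder.
    hadoop fs -cp "asv://books@mystorageaccount.blob.core.windows.net/fr-fr/sample.txt" "/user/demo/books/"
    hadoop fs -ls "/user/demo/books/"

The point of the comparison is lifetime: the asv:// paths keep working after the cluster is released, while the HDFS paths do not.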
13
An E2E Approach to Big Data
[Architecture diagram] Insights layer: self-service, collaborative, mobile, real-time. Capabilities: share and govern; discover and recommend; transform and clean; data enrichment. Data management layer: Hadoop, SQL Server, PDW, etc., StreamInsight, addressing Volume, Variety, and Velocity.
14
Hadoop Framework Ecosystem
The components that will be used in this presentation are as follows:
- HDFS is the distributed file system.
- MapReduce is the batch job framework.
- PIG is a high-level language (compared to MapReduce programming) that can describe job executions and data flows. It also enables the execution of several jobs from one script.
- HIVE provides HiveQL, a SQL-like language on top of MapReduce. It is very popular because many people master SQL.
- SQOOP enables data exchange between relational databases and Hadoop clusters. The name starts like SQL and ends like HADOOP.
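To make the list concrete, here is a minimal sketch of how these tools are typically driven from the command line on the Windows distribution (and, later in the session, from SSIS tasks). The table, script, and query names are illustrative assumptions, not the demo's actual artifacts.

    # HIVE: run a HiveQL query against a hypothetical stockprices table.
    hive -e "SELECT symbol, AVG(price) FROM stockprices GROUP BY symbol;"

    # PIG: execute a script that chains several jobs into one data flow.
    pig C:\Scripts\stockflow.pig

    # SQOOP: list the import options used to exchange data with a relational database.
    sqoop help import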
15
SSIS
- Workflow engine to control ETL
- Import/export data to/from databases
- Ability to connect to external systems
- Ability to transform data between external systems and databases

SQL Server is a set of tools that includes an RDBMS, multidimensional OLAP and tabular database engines, as well as other services such as a broker service and a batch service (SQL Agent), to name a few. One of these additional components is an ETL tool called SQL Server Integration Services (SSIS).

The main goal of an ETL tool is to import and export data to and from databases. This includes the ability to connect to external systems, but also to transform the data between the external systems and the databases. SSIS can be used to import data to and from SQL Server, but also between external systems without requiring the data to go through SQL Server. For instance, SSIS can be used to move data from an FTP server to a local flat file.

SSIS also has a workflow engine to automate the different tasks (data flows, task executions, etc.) that need to be executed in an ETL job. An SSIS package execution can itself be one step of a SQL Agent job, and SQL Agent can run different jobs. An SSIS package has one control flow and as many data flows as necessary. Data flow execution is dictated by the content of the control flow.
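As a small, hedged example of that last point, an SSIS package can be launched from the command line with dtexec, and the same invocation can be configured as a SQL Agent job step; the package path below is hypothetical.

    # Run an SSIS package from the command line (or from a SQL Agent job step).
    dtexec /F "C:\SSIS\HdpOnWindows.dtsx"

    # A non-zero exit code indicates package failure, which SQL Agent reports as a failed step.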
16
Automation
- Execute Package Task component
- Data Flow Task component
- Script Task component for custom code
- Execute Process Task for PowerShell (see the sketch below)
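A minimal sketch of what the Execute Process Task option can look like follows. The executable/arguments configuration and the script path and contents are assumptions for illustration only.

    # Typical Execute Process Task configuration (hypothetical script path):
    #   Executable: powershell.exe
    #   Arguments:  -ExecutionPolicy Bypass -File C:\Scripts\LoadToHadoop.ps1

    # LoadToHadoop.ps1 - push a local extract into the cluster and fail the SSIS task on error.
    hadoop fs -put "C:\Data\stockprices.csv" "/user/demo/stockprices/"
    if ($LASTEXITCODE -ne 0) { exit 1 }   # a non-zero exit code fails the Execute Process Task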
17
Data Transfer
In the picture, Hadoop storage can be HDFS (of course!) but also Windows Azure blob storage (Azure Storage Vault, ASV). Hadoop storage can be used from Hadoop jobs (MapReduce, HIVE, PIG, etc.). The Apache Hadoop Framework also includes other URI schemes like S3 and S3N for Amazon Web Services S3 storage; however, they are not considered Hadoop storage in the scope of this presentation. It would not be efficient (for technical and economic reasons) to have a Hadoop cluster on Windows Server or Windows Azure running jobs against S3 data. Thus, in the picture, S3 is considered a data source or a data destination. Also note in the picture that SSIS and the Hadoop cluster can be on the same LAN or they can be separated by the Internet. Hosting Hadoop on Windows Server in Amazon Web Services (AWS) would not have the same level of integration as Hadoop on Windows Azure, and it is out of scope here.
18
Data Transfer: HDFS, ASV, SQOOP, Hive ODBC
19
SQOOP Task (source demo)
Using the JobClientSimplified class, the SqoopQuery method takes a query and an optional second parameter to subscribe to progress reports. In this sample, the goal is to import the users table from a SQL Azure database named contoso, as the query describes. The final Script Task logs Hadoop job progress and waits for completion, using the job ID passed in as a parameter.
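For reference, the same import expressed directly against the Sqoop command line would look roughly like this; the server name, credentials, and target directory are placeholders rather than the demo's real values.

    # Import the "users" table from the "contoso" SQL Azure database into HDFS.
    # Server, login, password, and target directory are hypothetical.
    sqoop import `
      --connect "jdbc:sqlserver://myserver.database.windows.net:1433;database=contoso;user=demo@myserver;password=<password>" `
      --table users `
      --target-dir /user/demo/users `
      -m 1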
20
Demo
21
Demo
22
When to use SSIS?
- Graphical representation of the workflow
- PIG and Oozie are more standard ways of executing a number of jobs in Hadoop
- All of them can be combined in an easy way
- SSIS is very well suited to coordinating jobs that run in Hadoop with jobs that run outside Hadoop
23
Demo: Too much talk, let's play with it! Items to demo:

Hadoop on Windows
- Connect to the SQLCSTCLI2 client and show the folder structure to the audience by running HDFS commands; show the content of the stock prices data
- Create an SSIS package called HdpOnWindows
- Upload data into the Hadoop cluster using SSIS
- Run the Sqoop code
- Process the cube and display the data in SharePoint

HDInsight
- Create an SSIS package called HDInsight
- Create a SQL Azure database
- Create the HIVE metadata
- Open the ODBC port
- Install the ODBC driver
- Create the SSIS package
- Run the SSIS package
- Create and run a SQL Azure report on top of the database

PowerShell (see the sketch below)
- Create an SSIS package
- Provision the cluster with PowerShell
- Load data
- Release the cluster with PowerShell

Pig
- Show the Pig command and create an SSIS Script Task

Hive ODBC
- Connect to the Hive table using ODBC
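A rough sketch of the provision/load/release pattern follows, using the Azure HDInsight PowerShell cmdlets of that era; cmdlet and parameter names varied across module versions, and the cluster, storage, and credential values are placeholders.

    # Provision a transient HDInsight cluster (names, sizes, and keys are hypothetical).
    $storageKey = "<storage account key>"
    $creds = Get-Credential
    New-AzureHDInsightCluster -Name "ssisdemo" -Location "North Europe" `
        -DefaultStorageAccountName "mystorage.blob.core.windows.net" `
        -DefaultStorageAccountKey $storageKey `
        -DefaultStorageContainerName "ssisdemo" `
        -ClusterSizeInNodes 4 -Credential $creds

    # ... load data and run jobs here, e.g. by executing the SSIS package ...

    # Release the cluster when the work is done; data kept in blob storage (ASV) survives.
    Remove-AzureHDInsightCluster -Name "ssisdemo"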
24
Sponsors Main Gold Bronze Media Swag
25
For more information: #sqlsatistanbul