Modern data warehouse: HDInsight
Bill Ramos | VP Consulting, Advaiya Inc
Welcome to this introduction session for the Microsoft Big Data Boot Camp. This session sets the stage for the training event. Each session follows a similar format: I'll introduce the topic and then provide a set of demonstrations of how the technology works. Let's get started.
Lab environment
Event code: provided by instructor
All files are located at:
Logistics and Introductions
Class hours: 8:00am to 5:30pm
Rest rooms
Meals: 8:00am continental breakfast; 12:00 noon lunch; 2:30pm afternoon break
Internet
Speaker notes: Cover the training facilities and the logistics. Briefly introduce yourself.
Schedule
8:00-8:15 Continental breakfast
8:30 Software check and verify Azure subscriptions
8:45 Introductions and overview for the day
9:45 Introduction to Big Data with HOL - set up Azure storage and HDInsight cluster
10:00 Break
11:30 Introduction to Map-Reduce with HOL
12:15 Lunch break and bonus Power Query session
13:30 Introduction to Hive and HiveQL with HOL
13:45-14:30 Developing Big Data Applications with .NET with HOL - LINQ to Hive
14:45-15:30 Using Sqoop and Reporting Services
15:45-16:45 Operationalize your Big Data Pipeline
17:00 Wrap up and survey
Spend about 3-5 minutes going over the course schedule. Explain that the schedule is very tight and that keeping the students on track will be very important in order to cover everything in one day. Briefly explain what will be covered in each section and what lab they will do.
Modern data warehouse: HDInsight
Introduction to Big Data
Bill Ramos | VP Consulting, Advaiya Inc
Agenda
Why Big Data?
Big Data Lambda Architecture
Getting started with Windows Azure HDInsight Service
In this introduction session, I'm going to talk about why Big Data matters. Next, I'll introduce the Lambda Architecture, a community-driven architecture that provides a framework for how various Big Data components work together for specific scenarios. I'll also show how the various Microsoft Big Data platform components like HDInsight fit into the Lambda Architecture. I'll then go over Windows Azure's high-level architecture and components, and give an overview of the Table and Blob storage components that relate to Big Data solutions. At the end, I'll demo how to create a Windows Azure storage account and HDInsight cluster.
The Business Imperative
1. Human Fault Tolerance
2. Minimize CapEx
3. Hyper Scale on Demand
4. Low Learning Curve
What is Big Data?
Data complexity: variety and velocity, at scales from megabytes to petabytes.
Terabytes (ERP/CRM): payables, payroll, inventory, contacts, deal tracking, sales pipeline.
Web 2.0: web logs, clickstream, digital marketing, search marketing, recommendations, mobile, collaboration, eCommerce, advertising.
Petabytes (Big Data): log files, spatial and GPS coordinates, data market feeds, eGov feeds, weather, text and image, wikis and blogs, RFID, devices, social sentiment, audio/video, sensors.
Key goal of slide: Communicate what Big Data is
Slide talk track: ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. Highly structured data in these systems is typically stored in SQL databases. Web 2.0 is about how people and things interact with each other or with your business. Web logs, user clickstreams, social interactions and feeds, and user-generated content are classic places to find interaction data. Ambient data trends are becoming the "Internet of Things"; Mary Meeker has predicted 10 billion connected devices. Sensors for heat, motion, and pressure, and RFID and GPS chips within things such as mobile devices, ATM machines, and even aircraft engines, provide just some examples of "things" that output ambient signals. There are multiple types of data, including personal, organizational, public, and private, so we should NOT limit our thinking to just data that flows through an organization. For example, the mortgage-related data you may have COULD benefit from being blended with external data found in Zillow. Moreover, the government has the Open Data Initiative, which means that more and more data is being made publicly available.
What is Hadoop?
Hadoop = MapReduce + HDFS
Distributed, scalable system on commodity hardware composed of:
HDFS - distributed file system
MapReduce - programming model
Ecosystem: HCatalog, Oozie, Hive, Pig, HBase (column DB), Cassandra, CouchDB, MongoDB, Mahout, R, Cascading, Flume, Sqoop, Zookeeper, Ambari, Avro
Key goal of slide: Communicate what Hadoop is.
Slide talk track: Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source solution framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts: the Hadoop Distributed File System (HDFS) is the Hadoop file system that stores large files (from gigabytes to terabytes) across multiple machines, and MapReduce is a programming model that performs filtering, sorting, and other data retrieval commands as a parallel, distributed algorithm. Other parts of the Hadoop ecosystem, such as HBase, R, Pig, Hive, Flume, Mahout, Avro, and Zookeeper, perform supplementary functions.
Hadoop capabilities
Extract, Load, Transform
Distributed Compute
Machine Learning
Predictive Analysis
Graph Processing
I see the real breakthrough insights coming when you take traditional "Business Intelligence" and add capabilities like machine learning, predictive analysis, statistical analysis, large-scale graph processing, pattern mining, trend analysis, and economic modeling, all of which are a reality in Hadoop today.
Hadoop is not…
A replacement for the data warehouse
A place to learn how to code (C#)
A place for low-latency data
Limitations: Analysis with Big Data today
Two paths, both costly: a steep learning curve on a slow, inefficient Hadoop ecosystem, or moving HDFS data into the warehouse before analysis and learning new skills alongside SQL.
Today, if you want to gain insight by using Big Data, you need to do one of two things. The business must capture, process, and store an explosion of unstructured or semi-structured data coming from scanners, devices, social media, transactions, web logs, cloud applications, and services. Many businesses are responding with a dramatic increase in the use of Hadoop clusters. This increase occurs around LEARNING, BUILDING, MANAGING, AND MAINTAINING the ecosystem of Hadoop, like HDFS, MapReduce, Hive, HBase, and others. Expanding IT skills is always a strategic decision that competes with other areas of growth; however, business intelligence is broadly used by the business, and many business leaders don't want to learn a new or separate BI tool to get access to insights that are more difficult to obtain and slower. An additional downside to standalone Hadoop is that performance suffers and it does not seamlessly integrate with, or take advantage of, the broader data warehousing systems, including ETL, SQL optimization, and MPP high availability. Some vendors account for this by building Hadoop-optimized appliances that sit alongside data warehousing appliances. However, for both IT and the business, integration requires additional skills, processes, and time. The other mechanism is to have IT manually move HDFS data into the data warehouse and join the data at the warehouse level. This approach loses many of the benefits of Hadoop clusters and significantly increases the cost of extract, transform, and load (ETL). From storage to ETL costs, the rapid change of real-time data is not cost effective. Most businesses will not abandon their relational data warehouses, but will instead add a Hadoop ecosystem to create a hybrid environment. Building applications also becomes more costly and takes longer, because two separate query and process models must be employed, which increases application development, test, maintenance, and support.
The modern data warehouse
BI & ANALYTICS: self-service, collaboration, corporate, predictive, mobile
DATA ENRICHMENT & FEDERATED QUERY: extract, transform, load; single query model; data quality; master data management
DATA MANAGEMENT & PROCESSING: non-relational, relational, analytical, streaming; internal & external
INFRASTRUCTURE
Data sources: OLTP, ERP, CRM, LOB; non-relational data: devices, web, sensors, social
Key goal of slide: To convey that the modern data warehouse is something that the traditional data warehouse must evolve into, and to have IT agree that their warehouses need to take advantage of these new technologies (specifically focusing on the middle and bottom layers).
Slide talk track: To encompass these four trends, we need to evolve our traditional data warehouse to ensure that it does not break. It needs to become the "modern data warehouse." What is the modern data warehouse? It is the new warehouse that is able to excel with these new trends and can be your warehouse now and into the future. The modern data warehouse has the ability to:
Handle all types of data. Whether it is your structured, relational data sources or your non-relational data sources, the modern data warehouse will incorporate Hadoop. It can handle real-time data by using complex event processing technologies.
Provide a way to enrich your data with Extract, Transform, Load (ETL) capabilities as well as Master Data Management (MDM) and data quality.
Provide a way for any BI tool or query mechanism to interface with all these different types of data through a single query model that leverages a single query language that users already know (for example, SQL).
Big data scenarios for HDInsight
Large amounts of logged or archived data: a small number of large files
Loosely structured data: no fixed schema
Data is written once and may only be appended
Data sets are read frequently and often in full
Examples: monitoring supply chains in retail; suspicious trading patterns in finance; air and water quality from arrays of environmental sensors
An exemplary scenario that makes the case for a Hadoop on Windows Azure application is ad hoc analysis, in batch fashion, of an entire unstructured dataset stored on Windows Azure nodes that does not require frequent updates. These conditions apply to a wide variety of activities in business, science, and governance: for example, monitoring supply chains in retail, suspicious trading patterns in finance, demand patterns for public utilities and services, air and water quality from arrays of environmental sensors, or crime patterns in metropolitan areas. Hadoop is most suitable for handling a large amount of logged or archived data that does not require frequent updating once it is written, and that is read often, typically for full analysis. This scenario is complementary to data more suitably handled by an RDBMS: smaller amounts of data (gigabytes instead of petabytes) that must be continually updated or queried for specific data points within the full dataset. An RDBMS works best with structured data organized and stored according to a fixed schema. MapReduce works well with unstructured data with no predefined schema, because it interprets data as it is processed.
Traditional DW/BI Environment
Transactional systems feed the backroom/data warehouse via ETL; the data warehouse in turn drives reporting and OLAP.
Tomorrow's DW/BI Environment
The traditional pipeline stays business critical: transactional systems feed the data warehouse via ETL, which drives reporting and OLAP. Alongside it, HDInsight ingests new data sources: social networks, sensor data, log data, RFID data, and automated data.
Microsoft Hadoop Vision
Better on Windows and Azure: Active Directory, System Center, .NET programmability
Microsoft data connectivity: SQL Server / SQL Parallel Data Warehouse; Azure Storage / Azure Data Market
Microsoft Business Intelligence (BI): Hive ODBC connectivity; BI tools for Big Data
Collaborate with and contribute to OSS: collaborate with Hortonworks; provide improvements and Windows support back to OSS
Big Data Lambda Architecture
Let's now look at the Big Data Lambda Architecture.
Big Data Lambda Architecture
Batch layer: stores the master dataset; computes arbitrary views
Speed layer: fast, incremental algorithms; the batch layer eventually overrides the speed layer
Serving layer: random access to batch views; updated by the batch layer
Talk Track: In order to make sense of how various Big Data technologies fit together, the open source community has developed what is known as the Big Data Lambda Architecture. The lambda architecture provides an architectural model that scales and has both the advantages of long-term batch processing and the freshness of a real-time system, with data updated in seconds. It solves the problem of computing arbitrary functions on arbitrary data in real time by decomposing the problem into three layers: the batch layer, the speed layer, and the serving layer. Let's take a look at each of the three layers.
The batch layer stores the master dataset for your solution, typically in append mode; that is, it handles new data coming in. The batch layer is usually a read-only database, with no random writes required. It is horizontally scalable, with unrestrained computation and high latency.
The speed layer performs stream processing and continuous computation. It provides fast, incremental algorithms; the batch layer eventually overrides the speed layer. All the complexity is isolated in the speed layer, so if anything goes wrong, it's auto-corrected. The views are stored in a read & write database (for example, MS SQL Server, a column store, Cassandra, and so on), which is much more complex than a read-only view.
The serving layer provides the merged outcome of the data streams coming from the batch layer and the speed layer. This layer queries the batch and real-time views and merges them. PolyBase is a great fit.
Key Points: The Lambda Architecture has three layers. The batch layer stores the master dataset. The speed layer does stream processing for the real-time view. The serving layer merges the outcome of the data streams coming from the batch layer and the speed layer.
References: Big Data Lambda Architecture
The Batch Layer
Stores the master dataset (in append mode)
Unrestrained computation
Horizontally scalable
High latency
Flow: incoming data streams feed the master dataset, from which batch views are computed.
Talk Track: The portion of the lambda architecture that precomputes the batch views is called the batch layer. The batch layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records. The batch layer needs to be able to do two things to do its job: store an immutable, constantly growing master dataset, and compute arbitrary functions on that dataset. The key word here is arbitrary: if you're going to precompute views on a dataset, you need to be able to do so for any view and any dataset. The nice thing about the batch layer is that it's simple to use. Batch computations are written like single-threaded programs yet automatically parallelize across a cluster of machines (see the sketch below). This implicit parallelization makes batch layer computations scale to datasets of any size. It's easy to write robust, highly scalable computations on the batch layer. The batch view enables you to get the values you need from it very quickly because it's indexed. Think of technologies like Hadoop and Pig/Hive for use on the batch layer. Data warehouse database technologies can also be associated with the batch layer.
Key Points: The batch layer stores the master dataset and precomputes batch views on that master dataset. It stores an immutable, constantly growing master dataset and computes arbitrary functions on that dataset. It is a read-only database; no random writes are required.
References: Big Data Lambda Architecture: lambda-architecture/
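To make this concrete, here is a minimal C# sketch of a batch-layer computation: precomputing a keyword-count batch view over an immutable master dataset. The record shape and sample data are hypothetical, and AsParallel() merely echoes how batch code reads single-threaded yet fans out across machines; a real batch layer would run this as a MapReduce job over data in HDFS or blob storage.

using System;
using System.Collections.Generic;
using System.Linq;

class BatchLayerSketch
{
    // One immutable record in the master dataset (hypothetical shape).
    record Message(string Keyword, int RegionId);

    static void Main()
    {
        // Append-only master dataset; in production this is petabytes in HDFS/blob storage.
        var masterDataset = new List<Message>
        {
            new("Complain", 10), new("Service", 10), new("Service", 20),
        };

        // Batch view: an arbitrary function computed over the ENTIRE dataset.
        Dictionary<string, int> batchView = masterDataset
            .AsParallel()
            .GroupBy(m => m.Keyword)
            .ToDictionary(g => g.Key, g => g.Count());

        foreach (var (keyword, count) in batchView)
            Console.WriteLine($"{keyword}: {count}");
    }
}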
The Speed Layer
Stream processing of data
Stores a limited window of data
Dynamic computation
Flow: incoming data streams are processed into real-time views (process stream, increment views, real-time increments).
Talk Track: You can think of the speed layer as similar to the batch layer in that it produces views based on the data it receives. There are some key differences, though. One big difference is that, in order to achieve the lowest latencies possible, the speed layer doesn't look at all the new data at once. Instead, it updates the real-time views as it receives new data, rather than recomputing them the way the batch layer does (a minimal sketch of this incremental update follows below). The speed layer typically requires databases that support random reads and random writes. Because these databases support random writes, they are more complex than the databases you use in the serving layer, both in terms of implementation and operation. Most of the application complexity tends to be isolated in the speed layer. Technologies typically considered for the speed layer include in-memory transaction databases and complex event processing engines.
Key Points: Stream processing and continuous computation. Transactional; stores a limited window of data, compensating for the last few hours of data. All the complexity is isolated in the speed layer, so if anything goes wrong, it's auto-corrected. Some algorithms are hard to implement in real time.
References: Big Data Lambda Architecture: lambda-architecture/
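A minimal sketch of the speed layer's incremental update, assuming an in-memory view keyed by keyword: each arriving event adjusts the real-time view in O(1) instead of triggering a recompute, and the window is discarded once the batch layer has absorbed that data.

using System.Collections.Concurrent;

class SpeedLayerSketch
{
    // Real-time view: keyword -> count for the window not yet covered by a batch view.
    private readonly ConcurrentDictionary<string, int> realtimeView = new();

    // Called once per incoming event; an increment, not a recompute.
    public void OnNewMessage(string keyword) =>
        realtimeView.AddOrUpdate(keyword, 1, (_, current) => current + 1);

    // Once the batch layer's views cover this window, drop the duplicated data.
    public void ExpireAbsorbedWindow() => realtimeView.Clear();

    public int Count(string keyword) =>
        realtimeView.TryGetValue(keyword, out var n) ? n : 0;
}

In a production speed layer the view would live in a random-read/random-write store behind a stream processor such as StreamInsight; the dictionary here is only a stand-in.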
The Serving Layer
Queries the batch and real-time views
Merges the results
Flow: batch views and real-time views flow in; querying and merging produce the output.
Talk Track: Finally, the serving layer indexes the batch views and loads them up so they can be efficiently queried to get particular values out of a view. The serving layer is typically a specialized distributed database that loads in batch views, makes them queryable, and continuously swaps in new versions of a batch view as they're computed by the batch layer. A serving layer database only requires batch updates and random reads; most notably, it does not need to support random writes. The serving layer's job is to query the batch and real-time views and merge them (see the merge sketch below). The technologies typically associated with the serving layer include online analytical processing databases like Analysis Services and PowerPivot. It can also be considered the "last mile" technology for producing usable results for your solutions.
Key Points: The serving layer queries the batch and real-time views and merges them.
References: Big Data Lambda Architecture: lambda-architecture/
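The merge step can be sketched in a few lines of C#. Both dictionaries stand in for the indexed view databases; the query sums the stable batch total with the speed layer's recent delta, which works because the batch layer eventually overrides the speed layer.

using System.Collections.Generic;

class ServingLayerSketch
{
    public static int QueryKeywordCount(
        IReadOnlyDictionary<string, int> batchView,     // precomputed, random reads only
        IReadOnlyDictionary<string, int> realtimeView,  // recent increments
        string keyword)
    {
        batchView.TryGetValue(keyword, out var batchCount);
        realtimeView.TryGetValue(keyword, out var recentCount);
        return batchCount + recentCount;  // merged answer for the query
    }
}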
Microsoft Lambda Architecture Support
Batch layer: Windows Azure HDInsight; Azure Blob storage; MapReduce, Hive, Pig, Oozie, SSIS
Speed layer: Windows Azure SQL Database; Azure tables; Memcached/MongoDB; SQL Server database engine; SQL Server VM with columnstore indexes; Analysis Services; StreamInsight
Serving layer: Azure Storage Explorer; Microsoft Excel; Power Query; PowerPivot; Power View; Power Map; Reporting Services; LINQ to Hive; Analysis Services
Talk Track: Microsoft's data platform stack fully supports each of the layers in the Big Data Lambda Architecture. For the batch layer, Microsoft provides multiple options for the storage and processing of batch-oriented data, including Windows Azure HDInsight and Azure Blob storage to hold the input data. The SQL Server data warehousing capabilities can also be associated with the batch layer. For processing the data and managing views, Microsoft supports processing of Hadoop data through MapReduce jobs along with Hive, Pig, and Oozie; for data warehousing, you can use traditional SQL views and stored procedures. For the speed layer, Microsoft supports real-time processing of data through technologies like Windows Azure SQL Database, Azure tables, Memcached/MongoDB, and the SQL Server database engine and SQL Server VM, along with columnstore indexes, Analysis Services, and StreamInsight. Finally, with the serving layer, which provides the merged outcome of the data streams coming from the batch layer and the speed layer, you can use tools like PowerPivot, Power View, Power Query, Power Map, Reporting Services, LINQ to Hive, and Analysis Services.
Key Points: Microsoft provides a complete BI solution, which can be entirely aligned with all three layers of the Lambda Architecture.
References: Big Data Lambda Architecture: lambda-architecture/
Example: Structured & Unstructured Data
Extremely large volume of unstructured web logs (6 PB Hadoop cluster)
Ad hoc analysis of logs to prototype patterns
Hadoop data cluster feeds a large 24 TB SQL Server Analysis Services cube
Business users analyze cube data with Microsoft BI tools
Yahoo, one of the pioneers of Hadoop, stores web log information in a 6 petabyte Hadoop cluster and integrates it with a 24 terabyte SQL Server Analysis Services cube for analysis with common BI tools such as Excel and PowerPivot.
Yahoo!
Batch layer: Apache Hadoop and a staging database. Serving layer: SQL Server Analysis Services (SSAS), Microsoft Excel and PowerPivot, and other BI tools and custom applications. Hadoop data reaches SQL Server through the SQL Server Connector (Hadoop Hive ODBC).
Talk Track: Using SQL Server 2008 R2, Yahoo! enhanced its Targeting, Analytics and Optimization (TAO) infrastructure (a powerful, scalable advertising analytics tool), which now takes data from a Hadoop cluster into a third-party database, where it is loaded into a SQL Server 2008 R2 Analysis Services cube. The cube then connects to client applications such as Tableau Desktop business analytics software and in-house custom applications. Employees use this software to create interactive data dashboards and perform ad hoc analysis. Microsoft has developed the SQL Server Connector for Apache Hadoop, which is designed to facilitate efficient data transfer between Hadoop and SQL Server 2008 R2 (a connectivity sketch follows below).
Data flow: Hadoop feeds a third-party database, which loads the SQL Server Analysis Services (SSAS) cube, consumed by Microsoft Excel & PowerPivot for Excel plus custom applications.
Key Points: With Big Data technology, Yahoo experienced the following benefits: improved ad campaign effectiveness and increased advertiser spending; a cube producing 24 terabytes of data quarterly, making it the world's largest SQL Server Analysis Services cube; and the ability to handle more than 3.5 billion daily ad impressions, with hourly refresh rates.
References: Microsoft case study: Yahoo! Improves Campaign Effectiveness, Boosts Ad Revenue with Big Data Solution
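Because this connectivity path is plain ODBC, a .NET client can reach Hive with nothing more than System.Data.Odbc. The sketch below assumes a pre-configured Hive ODBC DSN named "HiveDSN" and queries HDInsight's hivesampletable; both names are placeholders for whatever your environment defines.

using System;
using System.Data.Odbc;

class HiveOdbcSketch
{
    static void Main()
    {
        // "HiveDSN" is an assumed DSN configured with the Hive ODBC driver.
        using var connection = new OdbcConnection("DSN=HiveDSN");
        connection.Open();

        // HiveQL aggregation executed on the cluster, results streamed back over ODBC.
        using var command = new OdbcCommand(
            "SELECT country, COUNT(*) FROM hivesampletable GROUP BY country",
            connection);

        using OdbcDataReader reader = command.ExecuteReader();
        while (reader.Read())
            Console.WriteLine($"{reader.GetString(0)}: {reader.GetValue(1)}");
    }
}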
Ferranti Computer Systems
Batch layer: Windows Azure HDInsight. Speed layer: Reactive Extensions (Rx), SQL Server database (In-Memory OLTP). Serving layer: Microsoft Dynamics AX, SQL Server Analysis Services, SQL Server Reporting Services.
Talk Track: Ferranti and Microsoft designed a solution that uses the Windows Azure HDInsight Service and nonrelational technologies to perform fast searches on business data and provide the information to the business processes in MECOMS™ (a business support system for the energy and utility industry) and Microsoft Dynamics AX. Searches of the memory-optimized tables are distributed between groups of computers, called clusters, which are managed by HDInsight. In-Memory OLTP makes access to SQL Server databases dramatically faster by optimizing queries and procedures and moving heavily used tables into application memory; these are referred to as memory-optimized tables. Reactive Extensions (Rx) was implemented to verify and process the incoming raw data, and then to send the aggregated data to SQL Server for quick storage in memory-optimized tables. SQL Server analyzes the aggregated data and sends the results of the analysis to Microsoft Dynamics AX for demand-side business processes such as scheduling service calls, terminating service, and invoicing. HDInsight also offers full compatibility with Microsoft business intelligence technology such as SQL Server 2012 Analysis Services and SQL Server 2012 Reporting Services.
Data flow: a data feed from smart meters is processed by Reactive Extensions (Rx) and Windows Azure HDInsight into SQL Server (In-Memory OLTP), then on to Microsoft Dynamics AX, SQL Server Analysis Services, and SQL Server Reporting Services.
Key Points: With Big Data technology, Ferranti experienced the following benefits: increased sustained database write speed to 200 million rows in 15 minutes, and discovered ways to access and analyze more of the data generated by the smart meters, providing new business opportunities.
References: Microsoft Case Studies: Ferranti Computer Systems - Utilities ISV Scales to Meet Customer Needs for Storage and Analysis of Big Data /Ferranti-Computer-Systems/Utilities-ISV-Scales-to-Meet-Customer-Needs-for-Storage-and-Analysis-of-Big-Data/
Windows Azure Storage
Let's now look at Windows Azure storage.
Demo 1: Setting up the Windows Azure storage account
Batch layer: Azure Blob storage, Windows Azure Management Portal
Talk Track: That's enough talk for now; let's get to this session's demo. For each of the boot camp demos, I'll put the technologies I show off in context with the Big Data Lambda Architecture. At the end of each presentation, you will get a chance to try out the demos yourself as hands-on lab exercises. Here, we will set up a Windows Azure storage account that will be used for the batch layer. The blob store information will be served up using the Azure Storage Explorer, available on CodePlex. I'll then show how to access the storage account using the Azure Storage Explorer. In this demo, you will set up a Windows Azure storage account for your storage-related activities and discover some of the new features that a Windows Azure storage account has to offer. You will also learn to use Azure Storage Explorer to explore Windows Azure storage. Here, end users interact with Windows Azure Blob storage via the Azure Storage Explorer tool as a front-end interface.
Blob Storage Concepts
Store large amounts of unstructured text or binary data with the fastest read performance
Highly scalable, durable, and available file system
Blobs can be exposed publicly over HTTP
Securely lock down permissions to blobs
Hierarchy: a storage account (for example, Contoso) holds containers (for example, Images, Video), and containers hold blobs (for example, PIC01.JPG, VID1.AVI) stored as blocks or pages.
Talk Track: Let's now take a look at the hierarchy of Blob storage. The Blob service provides storage for entities such as binary files and text files. The REST API for the Blob service exposes two resources: containers and blobs. A container is a set of blobs; every blob must belong to a container. The Blob service defines two types of blobs: block blobs, which are optimized for streaming, and page blobs, which are optimized for random read/write operations and provide the ability to write to a range of bytes in a blob. Blobs can be read by calling the Get Blob operation; a client may read the entire blob or an arbitrary range of bytes. Block blobs less than or equal to 64 MB in size can be uploaded by calling the Put Blob operation. Block blobs larger than 64 MB must be uploaded as a set of blocks, each of which must be less than or equal to 4 MB in size. Page blobs are created and initialized with a maximum size with a call to Put Blob; to write content to a page blob, you call the Put Page operation. The maximum size currently supported for a page blob is 1 TB. CodePlex tools like the Azure Storage Explorer make managing blobs easy, and there is a rich API, exposed over REST, for managing storage from PowerShell or from code (see the client-library sketch below).
Key Points: The Blob service defines two types of blobs: block blobs and page blobs. Accessible via REST APIs, the Windows Azure storage client library, or Windows Azure drives. Stores large amounts of unstructured text or binary data with the fastest read performance. Highly scalable, durable, and available file system.
References: Data Management and Business Analytics: storage/#blob
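As a sketch of the client-library route, here is a minimal upload of a block blob with the classic Windows Azure storage client (the WindowsAzure.Storage NuGet package). The account name and key are placeholders, and UploadFromFile's exact signature varies slightly across library versions.

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlobUploadSketch
{
    static void Main()
    {
        // Placeholder credentials; substitute your storage account's name and key.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
        CloudBlobClient client = account.CreateCloudBlobClient();

        // Containers hold blobs; container names must be lowercase.
        CloudBlobContainer container = client.GetContainerReference("images");
        container.CreateIfNotExists();

        // A block blob suits file-style uploads; the library splits large
        // files into blocks (the Put Block / Put Block List path) for you.
        CloudBlockBlob blob = container.GetBlockBlobReference("PIC01.JPG");
        blob.UploadFromFile(@"C:\demo\PIC01.JPG");

        Console.WriteLine(blob.Uri);
    }
}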
Getting started with HDInsight Service
Let's now look at how to get started with the Windows Azure HDInsight Service.
Demo 2: Setting up the Windows Azure HDInsight cluster
Batch layer: Windows Azure HDInsight, Azure Blob storage, HDInsight console
Talk Track: In this demo, I'll show you how easy it is to set up an HDInsight cluster that uses Blob storage as a Hadoop file system. Here, the HDInsight cluster will be part of the batch layer, and I'll show you the essentials for accessing the cluster using the HDInsight console. A Microsoft HDInsight cluster is associated with a Windows Azure storage account or an affinity group. End users can use the HDInsight console to interact with the HDInsight cluster and with the Windows Azure storage account associated with the cluster.
Demo 3: Loading data into Windows Azure storage for use with HDInsight
Batch layer: Windows Azure HDInsight, Azure Blob storage, PowerShell
Talk Track: In the last demo for this presentation, I'll show how you can prepare and upload data into the Hadoop cluster, specifically into the Windows Azure Blob storage that is associated with our HDInsight cluster. As described in the earlier demo, the HDInsight cluster is associated with a Windows Azure storage account or an affinity group. End users can use the HDInsight console to interact with the HDInsight cluster and with the Windows Azure storage account associated with the cluster. The demo uses the PowerShell ISE console to move CSV files from local disk into Windows Azure Blob storage; a C# equivalent is sketched below.
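For readers who prefer code over the portal or PowerShell, here is a hedged C# equivalent of the demo's upload step: pushing every local CSV into the container the HDInsight cluster is bound to. Account, container, and folder names are placeholders; once uploaded, jobs can read the files at a path like wasb://<container>@<account>.blob.core.windows.net/input/.

using System;
using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class LoadClusterDataSketch
{
    static void Main()
    {
        // Placeholder credentials and container; use the cluster's storage account.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
        CloudBlobContainer container = account
            .CreateCloudBlobClient()
            .GetContainerReference("<cluster-container>");

        foreach (string csv in Directory.GetFiles(@"C:\demo\data", "*.csv"))
        {
            // Blob "folders" are just name prefixes; Hadoop treats input/ as a directory.
            CloudBlockBlob blob = container.GetBlockBlobReference(
                "input/" + Path.GetFileName(csv));
            blob.UploadFromFile(csv);
            Console.WriteLine($"Uploaded {blob.Uri}");
        }
    }
}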
Easy Access to Data, Big & Small
Let's now see how Microsoft Big Data solutions allow you to work with any data.
Easy Access to Data, Big & Small
Search, Access & Shape: simplify access to public and corporate data; easily preview, shape, and format your data
Combine with Unstructured: combine and refine data across multiple sources; gain insight across relational, unstructured, and semi-structured data
Easily Manage & Query: common management of structured and unstructured data; query across relational databases and Hadoop with a single T-SQL query
Key Features: Power Query, Windows Azure Marketplace, Windows Azure HDInsight Service, Parallel Data Warehouse with PolyBase
Talk Track: Let's now talk about how technologies like Power Query in Excel, the Windows Azure Marketplace, the HDInsight Service, and PolyBase can provide you with easy access to all data, both big and small. With Power Query, you have an intuitive and consistent experience for discovering, combining, and refining any data, including relational, structured and semi-structured, OData, Web, Hadoop, Azure Marketplace, and more. Power Query also provides you with the ability to search for public data from sources such as Wikipedia. The Windows Azure HDInsight Service makes Apache Hadoop available as a service in the cloud and provides a software framework designed to manage, analyze, and report on Big Data. As a cloud-based service, it makes these resources available in a simpler, more scalable, and cost-efficient environment. As part of Microsoft's overall Big Data strategy, SQL Server 2012 Parallel Data Warehouse includes PolyBase, a new breakthrough technology that dramatically simplifies combining non-relational data and traditional relational data for analysis. PolyBase seamlessly provides the benefits of "Big Data" without the complexities. Normally, organizations would need to burden IT with pre-populating the data warehouse with Hadoop data, or undergo extensive training on MapReduce in order to query non-relational data. PolyBase makes this easy, enabling you to rapidly query massive data sets by combining MPP data warehousing performance with Hadoop.
Key Points: Power Query: discover, search, transform, and combine data (relational, structured, and semi-structured) from across multiple sources. Windows Azure HDInsight Service: a framework to manage, analyze, and report on Big Data, using Apache Hadoop services in the cloud. SQL Server 2012 Parallel Data Warehouse (PolyBase): faster ways to combine non-relational data and traditional relational data for analysis.
Learn more
Getting Started with HDInsight
Azure HDInsight and Azure Storage /21/azure-hdinsight-and-azure-storage.aspx
Talk Track: That's it for this session. To learn more about what I just showed, check out these two resource links for Getting Started with HDInsight and Azure HDInsight and Azure Storage. Thank you!
END OF PRESENTATION
Questions?
Modern data warehouse: HDInsight
Introduction to Map Reduce
Bill Ramos | VP Consulting, Advaiya Inc
My name is Bill Ramos. In this session, I'm going to give you a broad overview of how MapReduce works and then show you an example of how to run MapReduce jobs using C# and .NET on Windows Azure HDInsight.
Agenda
Understanding MapReduce
How MapReduce Works
.NET Support for MapReduce
Demo
In this session, I'm going to ramp you up on MapReduce. You won't become an expert by the end of this module, but you will gain an understanding of the basics. If you want to learn more, partners like Hortonworks offer in-depth training on Hadoop and MapReduce. You'll also see a demonstration of how Microsoft has extended the ability to create and run MapReduce jobs using .NET and C#. All right, let's dive into MapReduce! NEXT SLIDE
Lambda Architecture Support
Batch layer: Windows Azure HDInsight; Azure Blob storage; MapReduce, Hive, Pig, Oozie, SSIS. Speed layer: Windows Azure SQL Database; Azure tables; Memcached/MongoDB; SQL Server database engine; SQL Server VM with columnstore indexes; Analysis Services; StreamInsight. Serving layer: Azure Storage Explorer; PowerShell console; Microsoft Excel; Power Query; PowerPivot; Power View; Power Map; Reporting Services; LINQ to Hive.
Talk Track: In this presentation, we will be looking primarily at the batch layer to show how MapReduce jobs work with Windows Azure HDInsight, using Azure Blob storage as an extension of the Hadoop Distributed File System.
References: Big Data Lambda Architecture: architecture/
Hadoop MapReduce
Programming framework (library and runtime) for analyzing datasets stored in HDFS
Composed of user-supplied Map and Reduce functions:
Map() - subdivide and conquer
Reduce() - combine and reduce cardinality
The pattern: 1. Divide a large problem into sub-problems (Map). 2. Perform the same function on all sub-problems (do work). 3. Combine the output from all sub-functions (Reduce) to produce the output.
Talk Track: Let's go over the main concepts of MapReduce. Hadoop MapReduce is a programming and runtime framework for analyzing datasets stored in HDFS. It's designed to let you write applications that rapidly process vast amounts of data in parallel on a large cluster of compute nodes. MapReduce jobs are composed of user-supplied Map and Reduce functions. Map functions take a divide-and-conquer approach to processing data as fast as possible; Reduce functions combine and reduce cardinality. The MapReduce framework provides all the "glue" and coordinates the execution of the Map and Reduce jobs on the cluster. Map functions take a large problem, divide it into sub-problems, and perform the same function on all sub-problems. Reduce functions perform the combine process, which takes the output from all sub-problems and creates a result set out of it. Often this output is in the form of a delimited text file that can be easily consumed by the speed and serving layers. A .NET sketch of both functions follows below.
Key Points: Map: subdivides tasks into smaller sub-tasks and spreads them across the cluster. Reduce: combines the sub-tasks' output and reduces cardinality.
References: Using MapReduce with HDInsight: us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
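The .NET demos in this boot camp used the early Microsoft .NET SDK for Hadoop (the Microsoft.Hadoop.MapReduce NuGet package). A keyword-count job in that style looked roughly like the sketch below; the base-class and method names reflect that preview-era SDK and should be treated as assumptions rather than a current, supported API.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;  // preview-era .NET SDK for Hadoop

// Map(): subdivide and conquer - called once per input line, emits (key, value) pairs.
public class KeywordMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        // Assumes the first tab-delimited field of each line is the keyword.
        string keyword = inputLine.Split('\t')[0];
        context.EmitKeyValue(keyword, "1");
    }
}

// Reduce(): combine and reduce cardinality - called once per distinct key
// with all of that key's values gathered by the shuffle/sort step.
public class KeywordReducer : ReducerCombinerBase
{
    public override void Reduce(
        string key, IEnumerable<string> values, ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString());
    }
}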
MapReduce
Rapidly process vast amounts of data in parallel on a large cluster of compute nodes
The framework schedules and monitors tasks, and re-executes failed tasks
Typically, both input and output are stored in the file system
Phases: in the Map phase, a Mapper runs on each DataNode; in the shuffle/sort step, data is shuffled across the network and sorted; in the Reduce phase, Reducers run on a subset of the nodes.
Talk Track: MapReduce is designed to process a massive amount of data across a cluster of machines, and it can handle data stored in directly attached storage on the data nodes or on a distributed shared file system like Windows Azure storage. The Hadoop framework within HDInsight schedules and monitors tasks, taking into account the physical attributes of the machines in the cluster to parallelize tasks as much as possible. If a particular task on a node fails due to a software error or hardware failure, the framework will attempt to re-execute any failed operations based on the configuration of your Hadoop cluster. The inputs and outputs are typically stored in the file system, with process filters such as compression/decompression routines, binary formats optimized for certain MapReduce jobs, or standard delimited text files. In our example, the framework chose to use the mapper code to divide the dataset across three data node machines for the map phase. Then, based on various factors like the size of the result set, the framework chose to use only two data nodes to reduce the data as part of the reduce phase. Now let's look at an example of how a sample data set would be processed as a MapReduce job within the various phases using the framework; a job-submission sketch follows below.
Key Points: Understand Map and Reduce, and how data is shuffled between Mapper and Reducer.
References: Using MapReduce with HDInsight: us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
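Submitting the job through the same preview-era SDK looked roughly like this. Hadoop.Connect(), HadoopJob, and the configuration property names are recalled from that SDK and should be treated as assumptions; the input and output paths are placeholders relative to the cluster's blob storage.

using Microsoft.Hadoop.MapReduce;

// Binds the mapper and reducer above to input/output locations.
public class KeywordCountJob : HadoopJob<KeywordMapper, KeywordReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        return new HadoopJobConfiguration
        {
            InputPath = "input/messages",      // placeholder path
            OutputFolder = "output/keywords"   // recreated on each run
        };
    }
}

class Program
{
    static void Main()
    {
        // The framework handles scheduling, monitoring, and retrying failed tasks.
        var hadoop = Hadoop.Connect();
        hadoop.MapReduceJob.ExecuteJob<KeywordCountJob>();
    }
}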
Pre-Execution: Submit a Task
Client app creates a task; the task is scheduled in the Task Manager and dispatched at the scheduled time.
Input rows contain a Keyword (Complain, Service, Warranty, Lawsuit, Tax, Support), the message Content (omitted here), and a RegionId (10, 20, 30).
Talk Track: In this example, we're going to start with a large file or set of files that contain messages. Each message may contain zero or more occurrences of a specific keyword in the list. Now I'll show you a quick technique for performing sentiment analysis that uses a MapReduce job to process this master dataset and display the count of each keyword used throughout all the messages. First, you submit a task to the Hadoop head node for processing, either through the Hadoop application programming interface or via an interactive console. The head node then figures out how to connect to the underlying data and how the data is already partitioned. Then it looks at how you requested the output format and defines the mapper and reducer tasks. [---CLICK---] The head node schedules the operations to run immediately or at a scheduled time using the TaskManager component, which dispatches the job at the scheduled time. Let's take a look at our example dataset. You can see that there are repeating data values for the RegionId column; the head node can use this as a potential partition value to parallelize the tasks. We will then count the number of occurrences of the keywords. If you're familiar with Transact-SQL, the MapReduce task is equivalent to executing this query over the Hadoop cluster:
SELECT Keyword, SUM(Occurrence) FROM Messages CROSS APPLY KeyWordCount() WHERE <Predicate> GROUP BY Keyword
[---NEXT SLIDE---]
43
Execution: Map
INPUT: The task is distributed to all member nodes; each member node now becomes a Mapper. [Diagram: the Messages dataset, split by RegionId (10, 20, 30), is distributed across Members 1..N, which become Mappers 1..N and feed Reducers 1..m.] Talk Track: Now let's take a look at how Hadoop processes the Map program. [---CLICK---] First, the TaskManager distributes portions of the master dataset to all the member nodes based on the partition value; in this case the data was partitioned by RegionId. Each member node takes on the role of a mapper. The mapper output is cached in a temporary table that's used in the next phase. There's also a streaming mode that can be used to send the results to the next step in chunks. The mappers then push the results to the reducers for the shuffle and reduce phase. [---CLICK/EXIT SLIDE---] Map Function: SELECT Keyword, SUM(Occurrence) FROM Messages CROSS APPLY KeyWordCount() GROUP BY Keyword
44
Execution: Shuffle and Reduce
INPUT: The Mapper function executes over all rows in its partition; Mappers push results to the Reducers; Reducers start processing the output from the Mappers. [Diagram: per-partition keyword counts, e.g. Mapper 1 (RegionId 10): Complain 19, Service 23, Warranty 22; Mapper 2 (RegionId 20): Service 44, Warranty 25, Lawsuit 7; Mapper 3 (RegionId 30): Complain 38, Tax 23, Support 69; the key/value pairs are redistributed across Reducers 1..m.] Talk Track: The next phase of the process is called shuffle and reduce. [---CLICK---] First, the mapper function executes over all the rows in its partition as a parallel operation. The mappers then push the results of the counts per keyword to the reducers. The reducers then process the output from the mappers. This will become clear when I demonstrate an actual MapReduce job a little later. The output is sent to the Data Summary unit for final processing. Reduce Function: SELECT Keyword, SUM(Occurrence) FROM Cache WHERE Predicate. Note that Cache will change depending on the Reducer.
45
Post-Processing: Data Summary
INPUT: Reducers carry out their operation in parallel; output from each Reducer is summed into one temporary table; output results are published into the output file. Talk Track: In this last phase… [---CLICK---] …the reducers finish carrying out their operations in parallel. You can see we have the totals for each of the keywords that each reducer computed. The Data Summary unit then takes the results from the reducers to create a temporary table for the results: Support 69, Service 67, Warranty 47, Complain 57, Lawsuit 7, Tax 23. The temporary table is sent to the desired output stream format, where you can apply statistical models to the results, generate graphs or charts, or pipe the results into another MapReduce process by persisting the temporary table and then chaining the task with another MapReduce program. Now you should have a better understanding of the overall MapReduce process.
46
Demo: The “Hello World” of MapReduce. Supplied sample on HDInsight; written in Java; source code at
In this demo, we'll show you how to run the Word Count MapReduce program, which is one of the HDInsight samples. This MapReduce program is a Java program that counts the words within the source directory. Each mapper takes a line as input and breaks it into words, then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value pair of the word and its sum.
47
Hadoop Streaming
The mapper and reducer can be written in any language: each reads its input from stdin and writes key/value pairs to stdout, and the streaming framework pipes the data between them. The C# mapper and reducer demos later in this module use exactly this contract.
49
HDInsight .NET Support for MapReduce
The “NuGet” Microsoft .NET MapReduce API for Hadoop: execute the job through PowerShell, then collect the result on HDFS or directly in WASB storage. Talk Track: You can use .NET and languages like C# to create MapReduce jobs. First, install the Microsoft .NET MapReduce API for Hadoop from NuGet. It provides an implementation of a HadoopJob. Next, execute the job via PowerShell. Finally, you just need to collect your result on HDFS. Now let's check out a demo. Key Points: NuGet packages provide the MapReduce APIs in .NET; write in Visual Studio, run on an HDInsight cluster. References: Using the Hadoop .NET SDK with the HDInsight Service us/manage/services/hdinsight/howto-net-libraries/
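As a sketch of what such a HadoopJob looks like once the package is installed: the type names and overrides below are recalled from the SDK's word-count sample and should be verified against the installed Microsoft.Hadoop.MapReduce package, and the paths are illustrative.

using System.Collections.Generic;
using System.Linq;
using Microsoft.Hadoop.MapReduce;

// Word count with the .NET MapReduce API (sketch; verify names against the package).
public class WordCountMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        foreach (var word in inputLine.Split(' '))
            context.EmitKeyValue(word, "1"); // one (word, "1") pair per word
    }
}

public class WordCountReducer : ReducerCombinerBase
{
    public override void Reduce(string key, IEnumerable<string> values,
                                ReducerCombinerContext context)
    {
        context.EmitKeyValue(key, values.Count().ToString()); // sum of the 1s
    }
}

public class WordCountJob : HadoopJob<WordCountMapper, WordCountReducer>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        return new HadoopJobConfiguration
        {
            InputPath = "/example/data/input",     // illustrative path
            OutputFolder = "/example/data/output"  // illustrative path
        };
    }
}

// Submission, as we recall it from the same SDK:
//   var hadoop = Hadoop.Connect();
//   hadoop.MapReduceJob.ExecuteJob<WordCountJob>();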
50
Demo: Creating a C# Mapper Program. Reads in census data through stdin; transforms the age-group value into a string; builds the key from STNAME, CTYNAME, and AGEGRP and the value from RESPOP; outputs the key/value pair to stdout. In this demo, we'll show you how to create a Mapper program in C#.
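A minimal sketch of that mapper as a streaming console program follows. The comma delimiter, the column positions, and the age-group labels are assumptions for illustration, not the actual layout of the lab's census file.

using System;

// Streaming mapper sketch: census rows in on stdin, key/value pairs out on stdout.
class CensusMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var cols = line.Split(',');   // assumed delimiter
            string stName = cols[3];      // STNAME  (assumed position)
            string ctyName = cols[4];     // CTYNAME (assumed position)
            string ageGrp = cols[5];      // AGEGRP  (assumed position)
            string resPop = cols[6];      // RESPOP  (assumed position)

            // Transform the age-group code into a readable string (illustrative mapping).
            string ageLabel = ageGrp == "0" ? "All ages" : "Age group " + ageGrp;

            // Key is STNAME|CTYNAME|AGEGRP; a tab separates key from value for streaming.
            Console.WriteLine($"{stName}|{ctyName}|{ageLabel}\t{resPop}");
        }
    }
}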
51
Demo: Creating a C# Reducer Program. Reads in key/value pairs through stdin; sums the RESPOP values for each key; outputs the results to stdout. In this demo, we'll show you how to create a Reducer program in C#.
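The matching reducer sketch: Hadoop streaming hands it the mapper output already sorted by key, so a single pass can sum RESPOP per key (same illustrative layout as the mapper above).

using System;

// Streaming reducer sketch: sorted key<TAB>value pairs in on stdin, totals out on stdout.
class CensusReducer
{
    static void Main()
    {
        string currentKey = null;
        long total = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (currentKey != null && parts[0] != currentKey)
            {
                Console.WriteLine($"{currentKey}\t{total}"); // key changed: emit the sum
                total = 0;
            }
            currentKey = parts[0];
            total += long.Parse(parts[1]);
        }
        if (currentKey != null)
            Console.WriteLine($"{currentKey}\t{total}");     // flush the last key
    }
}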
52
Demo: Running a C# MapReduce Job. Upload the .exes to WASB; create the job definition; start the job and wait; copy the results to the local computer to view the data. In this demo, we'll run the MapReduce job.
53
Learn more
Microsoft .NET SDK for Hadoop on CodePlex
Talk Track: Check out the links to learn more about using the Hadoop .NET SDK with the HDInsight Service as well as HDInsight with PowerShell.
54
Modern data warehouse: HDInsight. Power Query with Windows Azure Storage. Bill Ramos | VP Consulting, Advaiya Inc
55
Modern data warehouse: HDInsight. Introduction to Hive and HiveQL. Bill Ramos | VP Consulting, Advaiya Inc
In this presentation, you'll learn how to take advantage of your knowledge of Transact-SQL or other SQL languages to run MapReduce jobs on an HDInsight cluster using Hive.
56
Agenda: Hive architecture; Hive operations; demos.
Let's get started with an overview of the Hive architecture and how it works on top of a Hadoop cluster. After that we'll go over the data model for Hive tables. Then, we'll follow up with the common operations you can accomplish with Hive and how to connect client tools like Microsoft Excel using the Hive ODBC driver. Finally, we'll run through a series of demos that will help put everything into perspective.
57
Hive architecture Built on top of Hadoop to provide data management, querying, and analysis. Access and query data through simple SQL-like statements, called Hive queries. In short, Hive compiles, Hadoop executes. [Diagram: ODBC/JDBC clients, the Hive web interface (HWI), and the command line interface (CLI) sit above the Thrift server, Metastore, and the compiler/optimizer/executor, which drive the Hadoop head node, name node, and data/task nodes.] Talk Track: Hive was initially developed by Facebook so that their developers could process data across their Hadoop file system using a SQL-like query language they called HiveQL. HiveQL looks a lot like ANSI SQL, which means that if you know Transact-SQL you'll feel comfortable learning Hive. Because the results look like a standard relational database result set, various vendors have created ODBC drivers that interact with Hive results. With Hive, you can execute statements using either the Hive command-line interface on the Hadoop cluster or an interactive web console like the Hive interactive console in HDInsight. Applications can send queries and return results via either ODBC or JDBC drivers. Internally, the Hive compiler, optimizer, and executor translate HiveQL statements into a directed graph of MapReduce jobs that are submitted to the Hadoop cluster's head node for execution. As part of the optimization phase, developers can create customized mappers and reducers to extend the functionality of HiveQL. In short, Hive compiles and Hadoop executes. Key Points: HiveQL (the Hive query language) is a T-SQL-like language for querying Hadoop data in Hive; it can store and access data in text files directly in the Azure storage account. References: Using Hive with HDInsight: hive-with-hdinsight/
58
Create, load, and query Hive tables
HiveQL includes data definition language, data import/export, and data manipulation language statements. See display/Hive/LanguageManual. Create a table; import data into the Hive table; query the data using SQL-like statements, as in the sketch below. Talk Track: Now let's take a look at a standard workflow for using Hive. First, you create a table on top of your data. If you want your data to be preserved at its source location, you can create an EXTERNAL table; if you exclude the EXTERNAL clause, the data for the table is moved into the Hive data warehouse. Data can be serialized as delimited text files or as binary sequence files. The advantage of text files is that tools like Excel can load the data directly from the Hadoop file system or the Azure Blob storage used by the HDInsight cluster. Sequence files provide compression and performance advantages over text files, but they can only be consumed by external programs using the JDBC or ODBC interface. For .NET developers, you can also consume Hive data in the form of a table using the LINQ to Hive library. HiveQL also has CREATE VIEW syntax; a view can be referenced in a SELECT statement like a table, which simplifies developer queries. Data can be imported and exported with the IMPORT and EXPORT commands, and you can use the LOAD DATA INPATH command to associate a table with data in your Hadoop cluster. Finally, Hive offers a rich set of SELECT features for querying data, such as ORDER BY, GROUP BY, JOIN, UNION, and subquery clauses. For guidance on syntax and usage, refer to the Hive language manual on Apache.org. Key Points: For Hive EXTERNAL tables, the data remains outside the Hive warehouse. T-SQL skills can be used to query Hive data. References: Using Hive with HDInsight: hive-with-hdinsight/
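For reference, the three steps of that workflow look roughly like this; the statements are held as C# string constants, the way a client application might submit them, and the table, column, and path names are illustrative only.

// The create/load/query workflow as HiveQL strings (all names illustrative).
static class HiveWorkflow
{
    // 1. Create a table over existing data; EXTERNAL preserves the source files in place.
    public const string CreateTable =
        "CREATE EXTERNAL TABLE messages (keyword STRING, content STRING, regionid INT) " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
        "STORED AS TEXTFILE LOCATION '/example/data/messages'";

    // 2. Associate data already in the cluster's file system with the table.
    public const string LoadData =
        "LOAD DATA INPATH '/example/staging/messages.csv' INTO TABLE messages";

    // 3. Query with familiar SQL-like syntax.
    public const string Query =
        "SELECT regionid, keyword, COUNT(*) AS occurrence " +
        "FROM messages GROUP BY regionid, keyword";
}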
59
Demo 1: Create and Load Hive Tables
Demo 1: Create and Load Hive Tables. Batch layer, speed layer, serving layer: Windows Azure HDInsight, Hive, PowerShell console. Talk Track: In this demo, Bill will show you how to use Hive and HDInsight as part of the batch layer of the Lambda architecture to create a new master dataset and then validate the results using the HDInsight Hive console. [---CLICK---] End users can create the Hive tables using the Hive interactive console of the HDInsight cluster. The new Hive tables get created inside the HDInsight cluster. Depending on requirements, a Hive table can be partitioned, can use CASE statements in its queries, or can take the form of a bucketed table (using the CLUSTER BY clause and joins). The results are visible in the Hive terminal (remote access to the HDInsight cluster) or the Hive interactive console (the HDInsight web interface).
60
Connecting Hive Data to Excel
We'll now look at working with Hive data in Excel.
61
Using the Hive ODBC driver
A connector to Hive on HDInsight, available as part of HDInsight Hadoop clusters. It enables business intelligence, analytics, and reporting on data in Hive. Configure the Hive ODBC driver and a Hive ODBC data source. Talk Track: The Hive ODBC driver is a software library that implements the Open Database Connectivity (ODBC) API standard for the Hive database management system. This enables ODBC-compliant applications to interact seamlessly with Hive through a standard interface, as sketched below. The Microsoft Hive ODBC driver is a connector to Hive running on HDInsight clusters, and it enables business intelligence, analytics, and reporting on data in Apache Hive. Now let's check out a demo. Key Points: The Hive ODBC driver allows access to Hive data via ODBC connections. References: How to Connect Excel to Windows Azure HDInsight via HiveODBC; Load Hive tables into PowerPivot for Excel.
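Because the driver is standard ODBC, even a plain .NET client can query Hive through System.Data.Odbc. A minimal sketch, assuming a user DSN named HiveDSN has been configured and an illustrative weblogs table exists:

using System;
using System.Data.Odbc;

// Query a Hive table over the Hive ODBC driver (DSN and table are illustrative).
class HiveOdbcQuery
{
    static void Main()
    {
        using (var connection = new OdbcConnection("DSN=HiveDSN;"))
        {
            connection.Open();
            using (var command = new OdbcCommand(
                "SELECT keyword, COUNT(*) AS occurrence FROM weblogs GROUP BY keyword",
                connection))
            using (OdbcDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader.GetString(0)}\t{reader.GetInt64(1)}");
            }
        }
    }
}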
62
Demo 2: Using the Hive ODBC driver
Demo 2: Using the Hive ODBC driver. Batch layer, speed layer, serving layer: Hive, Microsoft Excel, PowerPivot. Talk Track: Now that we have our data in a Hive table, let's see how to install, configure, and use the Hive ODBC driver to load a table into PowerPivot for Excel. [---CLICK---] After establishing the ODBC connection, end users can access the Hive data using PowerPivot (from within the Excel workbook). The Hive data is moved into the Excel workbook via the Hive ODBC connection.
63
Demo 3: Using Power Query with Hive Results
Demo 3: Using Power Query with Hive Results. Batch layer, speed layer, serving layer: Azure Blob storage, Microsoft Excel, Power Query. Talk Track: The new Power Query preview for Excel lets you import data from a variety of sources and shape it so that you can analyze the results using familiar Excel features. In this example, I'll show how you can use Power Query to import Hive data stored in Windows Azure Blob storage into Excel. [---CLICK---] After connecting to the storage account, end users can access the Hive data files using Power Query (from within the Excel worksheet). The Hive data is moved into the Excel workbook from the Azure Blob storage files and analyzed with a PivotChart.
64
Learn more
HDInsight PowerShell for Hive
How to Connect Excel to Windows Azure HDInsight via HiveODBC
Talk Track: Check out these links to learn more about and get started with HDInsight PowerShell for Hive. More information is also available on connecting Excel to Windows Azure HDInsight via HiveODBC. References: HDInsight PowerShell for Hive; How to Connect Excel to Windows Azure HDInsight via HiveODBC.
65
Questions?
66
Modern data warehouse: HDInsight. Developing big data applications with .NET. Bill Ramos | VP Consulting, Advaiya Inc
Welcome to the Day 2, Module 1 session of Big Data Boot Camp, “Developing Big Data Applications for .NET Developers.”
67
Agenda: LINQ to Hive; demo.
Talk Track: In this session, you will learn about LINQ to Hive, which helps .NET developers use tools they already know to create applications that interact with HDInsight clusters more effectively. You'll learn how to use the LINQ to Hive extensions that are part of the Microsoft .NET SDK for Hadoop to create and compile LINQ queries to run against Hive data.
68
LINQ to Hive client libraries
Creates and compiles LINQ queries to use against Hive data. Translates C# or F# LINQ queries into HiveQL queries and executes them on the Hadoop cluster. [Diagram: C#, F#, and .NET code passes through the LINQ to Hive client libraries, which emit a HiveQL query that runs on the Hadoop cluster.] Talk Track: Let's first start with LINQ to Hive. LINQ to Hive lets you author Hive queries using LINQ. The LINQ query is compiled to HiveQL and then executed on your Hadoop cluster; the job is submitted using the LINQ to Hive client libraries. In short, the LINQ to Hive client library translates C# or F# LINQ queries into HiveQL queries and executes them on the Hadoop cluster. Key Points: LINQ to Hive. References: content/blob/master/ITPro/Services/hdinsight/using-hdinsight-sdk.md
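A sketch of what such a query looks like. The connection type and members (HiveConnection, GetTable, HiveRow) are recalled from the old SDK samples and should be treated as assumptions to verify against the Microsoft.Hadoop.Hive package; the endpoint, credentials, and table are illustrative.

using System;
using System.Linq;
using Microsoft.Hadoop.Hive;

// Row type describing the Hive table's columns (illustrative schema).
public class MessageRow : HiveRow
{
    public string Keyword { get; set; }
    public int RegionId { get; set; }
}

class LinqToHiveSketch
{
    static void Main()
    {
        // Endpoint and credentials are placeholders; 50111 is the usual WebHCat port.
        var hive = new HiveConnection(
            new Uri("http://mycluster.azurehdinsight.net:50111"), "admin", "password");

        // This LINQ query is translated to HiveQL and run on the cluster as MapReduce.
        var counts = from m in hive.GetTable<MessageRow>("messages")
                     group m by m.Keyword into g
                     select new { Keyword = g.Key, Occurrence = g.Count() };

        // Depending on the SDK version, an explicit ExecuteQuery() call may be
        // required before enumerating the results.
        foreach (var row in counts)
            Console.WriteLine($"{row.Keyword}: {row.Occurrence}");
    }
}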
69
Demo 1: Working with LINQ Queries
Demo 1: Working with LINQ Queries. Batch layer, speed layer, serving layer: HDInsight, Hive, LINQ to Hive. Talk Track: In this demo, I'll show you how .NET developers can use LINQ to Hive to create applications that interact with the HDInsight cluster, all using tools you already know. You'll also learn how to use the LINQ to Hive extensions that are part of the Microsoft .NET SDK for Hadoop to create and compile LINQ queries to run against Hive data. [---CLICK---] We start with a Hive table on the Microsoft HDInsight cluster. When a LINQ to Hive query is fired from the Hadoop command prompt via C# code built in Visual Studio, it sends the request to the Hive table. When the Hive table receives the request, it sends the output back to the Hadoop command prompt.
70
Learn more
LINQ to Hive wikipage?title=LINQ%20to%20Hive&referringTitle=Home
Using the Hadoop .NET SDK with the HDInsight Service hdinsight/howto-net-libraries/
That's it for this session. To learn more about what I just showed you, check out these resource links for LINQ to Hive, the Hadoop .NET SDK, and Reactive Extensions. Thank you! END OF PRESENTATION
71
Questions?
72
Modern data warehouse: HDInsight. Using Sqoop and Reporting Services. Bill Ramos | VP Consulting, Advaiya Inc
In this presentation, you'll learn how to use the reporting capabilities of SQL Server against Hadoop data.
73
Agenda: Working with Sqoop in Windows Azure HDInsight; Reporting Services; demo.
Here's what we're going to cover. First, I'll show you how to use Sqoop to move data between a relational database like Windows Azure SQL Database and Hadoop. Then, we'll talk about how to use SQL Server Reporting Services to create a report from the data you got from Hadoop. From there, we'll jump into the Windows Azure SQL Reporting technology and learn how to host reports on Windows Azure. Finally, we'll wrap things up with a demo of each of the technologies we covered.
74
Sqoop
Let's first talk about Sqoop.
75
Using Sqoop to Move Data
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. [Diagram: Sqoop commands drive map tasks that move data between Hadoop (HDFS/HBase/Hive) and enterprise data warehouses, document-based systems, and relational databases.] Talk Track: Let's start with Sqoop. Sqoop is an open-source connectivity framework that facilitates transfer between multiple relational database management systems (RDBMS) and HDFS. Sqoop uses MapReduce programs to import and export data; the imports and exports are performed in parallel with fault tolerance. The source or target files used by Sqoop can be delimited text files (for example, with commas or tabs separating each field) or binary SequenceFiles containing serialized record data. Key Points: Sqoop map tasks; Sqoop import/export commands.
76
Demo 1: Using Sqoop to Copy Data
Demo 1: Using Sqoop to Copy Data. Batch layer, speed layer, serving layer: Windows Azure HDInsight, Azure Blob storage, Hive and Sqoop, Windows Azure SQL Database, PowerShell console. Talk Track: In this demo, we're going to see how Sqoop is used to move data from a Hive table into a Windows Azure SQL database. [---CLICK---] The Sqoop export command is issued from the Hadoop command-line tool (or from PowerShell). The export command sends a request to the HDFS/Hive/HBase cluster to export the data. Once the request is processed, Sqoop runs MapReduce tasks and starts exporting the data to the SQL database. The same process works in the other direction: a Sqoop import command imports data from the SQL database into the HDFS cluster.
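For reference, the general shape of the export invocation is shown below; the connection string, table, and directory are placeholders, and only standard Sqoop flags are used.

sqoop export --connect "jdbc:sqlserver://<server>.database.windows.net;database=<db>" --username <user> --password <password> --table AggregatedResults --export-dir /hive/warehouse/aggregatedresults --input-fields-terminated-by ','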
77
Windows Azure SQL Reporting. Deprecated as of Oct 2013; discontinued for existing customers on October 31; replaced by SQL Server Reporting Services (SSRS) running on Windows Azure Virtual Machines.
Windows Azure SQL Reporting is a cloud-based reporting service for the Windows Azure platform built on SQL Server Reporting Services technologies. Important: the SQL Reporting service is available to current subscribers, but should not be used for new software development projects; the service will be discontinued on October 31. See the SQL Reporting Sunset FAQ for details. An alternative to SQL Reporting is to use one or more instances of SQL Server Reporting Services (SSRS) running on Windows Azure Virtual Machines (VMs). Using a VM, you can deploy an operational reporting solution in the cloud that supports either the Native or SharePoint mode feature set. A VM with SQL Server 2008 R2 or 2012 supports all Reporting Services features, including all supported data sources, customization and extensibility, and scheduled report execution and delivery.
78
Benefits of SSRS on Azure VMs: an extensible report server for custom reporting; scheduled report execution and delivery; integration with hybrid solutions; faster performance.
Extensible report server: SSRS on a VM supports custom code and assembly references in a report. If business requirements for a report include unique complex function evaluation or proprietary visual controls, you can provide these capabilities in code embedded in a report file or in an assembly added to the report server. Similarly, developers can replace or supplement report server operations by adding custom extensions. See Custom Code and Assembly References in Expressions and Reporting Services Extensions for details.
Scheduled report execution and delivery: in addition to on-demand reporting, SSRS on a VM supports scheduled report processing, so you can retrieve data on a schedule and control both query execution on a remote database and the timing of data transfer on your network. Scheduled reports can be delivered in various output formats to destinations other than a report server, such as e-mail or a file share, where the report is saved as PDF, Excel, or MHTML. See Schedules and Subscription and Delivery.
Integration with hybrid solutions: you can join a Windows Azure VM to your corporate network, adding capacity quickly without the burden of hardware procurement and provisioning. Joining a Windows Azure VM to your domain requires a virtual network and a dedicated VPN routing device. See Windows Azure Virtual Network Overview for more information.
Faster performance: customers who have performed side-by-side testing report better performance using SSRS on a VM. The gains are attributed to having the report server catalog reside on the local disk of the VM, and were more apparent on report servers handling larger workloads. See Use PowerShell to Create a Windows Azure VM with SSRS.
79
Learn more
Apache Sqoop Reference
Hadoop on Windows Azure - Working With Data
Running SSRS with Windows Azure VMs
Check out these links for more information on the topics we covered in this module.
80
Questions?
81
Modern data warehouse: HDInsight. Operationalize your big data pipeline. Bill Ramos | VP Consulting, Advaiya Inc
Welcome to this Big Data Boot Camp session, “Operationalize your big data pipeline.”
82
Agenda: Microsoft .NET SDK for Hadoop; WebHDFS client; WebHCat; Windows PowerShell integration.
This session shows how you can use the Microsoft .NET SDK for Hadoop to run MapReduce jobs. Specifically, we'll explore using the WebHDFS client .NET APIs to perform basic task integration, the WebHCat APIs to schedule execution tasks, and the HDInsight cmdlets in PowerShell to manage cluster activities.
83
Microsoft .NET SDK for Hadoop
Let's start with the Microsoft .NET SDK for Hadoop.
84
Microsoft .NET SDK For Hadoop
.NET client libraries for Hadoop: write MapReduce in Visual Studio using C# or F#, and debug against local data. [Diagram: the .NET Hadoop SDK in Microsoft Visual Studio submits jobs to the job tracker, which runs them on the slave nodes.] Talk Track: Let's first look at the Microsoft .NET SDK for Hadoop. The SDK provides .NET client libraries that make it easier to work with Hadoop (MapReduce) from .NET. Using this SDK, developers can quickly and easily build simple .NET-based applications that run Hive queries using the Windows Azure HDInsight Service. This enables developers to use their .NET skills to perform jobs in an HDInsight cluster. Key Points: Microsoft .NET SDK for Hadoop. References: DBI-B221_Bakhshi.pptx
85
SDK components: MapReduce library; LINQ to Hive client library; WebClient library (WebHDFS client library and WebHCat client library).
In Microsoft Visual Studio: install-package Microsoft.Hadoop.MapReduce; install-package Microsoft.Hadoop.Hive; install-package Microsoft.Hadoop.WebClient
Talk Track: Now let's talk about the various components of the .NET SDK and take a brief look at how it works. The SDK includes the MapReduce library, which simplifies writing MapReduce jobs in .NET languages using the Hadoop streaming interface. It also includes the LINQ to Hive client library, which translates C# or F# LINQ queries into HiveQL queries and executes them on the Hadoop cluster; this library can also execute arbitrary HiveQL queries from a .NET application. Finally, the SDK includes the WebClient library, which contains client libraries for WebHDFS and WebHCat. The WebHDFS client library works with files in HDFS and Windows Azure Blob storage, while the WebHCat client library manages the scheduling and execution of jobs in an HDInsight cluster. It's easy to use the .NET SDK: in a Visual Studio C# console application, install the required client libraries, develop the application using the features of the installed libraries (mapper and reducer classes), and then deploy and run it. Key Points: MapReduce library for writing MapReduce jobs in .NET languages; LINQ to Hive client library; WebHDFS client library and WebHCat client library.
86
WebClient Libraries in .NET
WebHDFS client library: works with files in HDFS and Windows Azure Blob storage. WebHCat client library: manages the scheduling and execution of jobs in an HDInsight cluster. [Diagram: WebHDFS is a scalable REST API for moving files in and out of HDFS, deleting files, and performing file and directory functions; WebHCat handles HDInsight job scheduling and execution.] Talk Track: Now we'll see the difference in functionality between the two WebClient libraries, WebHDFS and WebHCat. WebHDFS is the web service interface for HDFS. This scalable REST API enables easy access to HDFS: you can move files in and out of HDFS and delete them, taking advantage of the parallelism of the cluster, and you can perform numerous file and directory functions. The WebHCat client library, in turn, manages the scheduling and execution of jobs in an HDInsight cluster. It is important to note the difference in functionality between these two libraries: WebHDFS can be used to stage the data behind Hive tables, while WebHCat is usually used to run queries on those Hive tables. Key Points: Difference in the functionality of WebHDFS and WebHCat. References: About WebHDFS; DBI-B221_Bakhshi.pptx
87
Demo 1: Creating a Hive Table Using WebHDFS Client
Demo 1: Creating a Hive Table Using the WebHDFS Client. Batch layer, speed layer, serving layer: a .NET application (WebHDFS) and Windows Azure HDInsight copy data from the base machine to Azure Storage (Windows Azure Blob storage), then load it into a Hive table. Talk Track: Let's move on to the demos. In this first demo, I'll show you how to use the Microsoft .NET SDK for Hadoop to run MapReduce jobs. Specifically, we'll use the WebHDFS client .NET APIs to perform basic task integration. [---CLICK---] A .NET application using WebHDFS interacts with the HDInsight cluster to copy data from the base machine to Azure Storage. [---CLICK---] The data is then loaded into Hive tables. Key Points: A .NET application (WebHDFS) interacting with an HDInsight cluster.
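Under the covers, the WebHDFS client library issues documented WebHDFS REST calls. A minimal sketch using a raw HttpClient against two of those operations follows; the host, port, user, and paths are illustrative, and the authentication a real HDInsight cluster requires is omitted.

using System;
using System.Net.Http;
using System.Threading.Tasks;

// Two documented WebHDFS REST operations via HttpClient (endpoints illustrative).
class WebHdfsSketch
{
    static async Task Main()
    {
        var http = new HttpClient();
        string nameNode = "http://namenode:50070"; // assumed default WebHDFS port

        // MKDIRS: create a staging directory (PUT, answered directly by the name node).
        HttpResponseMessage mkdir = await http.PutAsync(
            $"{nameNode}/webhdfs/v1/example/staging?op=MKDIRS&user.name=hadoop", null);
        Console.WriteLine(await mkdir.Content.ReadAsStringAsync()); // {"boolean":true}

        // LISTSTATUS: list the directory we just created.
        string listing = await http.GetStringAsync(
            $"{nameNode}/webhdfs/v1/example/staging?op=LISTSTATUS&user.name=hadoop");
        Console.WriteLine(listing); // JSON FileStatuses payload

        // Uploading a file (op=CREATE) is a two-step call: the name node replies with
        // a redirect to a data node, which receives the actual bytes.
    }
}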
88
Demo 2: Performing a Remote Job with WebHCat
Demo 2: Performing a Remote Job with WebHCat. Batch layer, speed layer, serving layer: a .NET application (WebHCat) and Windows Azure HDInsight. Talk Track: Next we'll explore how to use WebHCat to provide an abstract view of data on a Hadoop cluster and to coordinate the activities of different tools. [---CLICK---] A .NET application using WebHCat queries the Hive table from .NET code. [---CLICK---] The query output comes back to the application. Key Points: Interacting with Hive tables from a .NET application (WebHCat).
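A sketch of the same round trip against WebHCat's documented REST endpoints (Templeton), again with a raw HttpClient; the cluster URI, credentials, query, and status directory are illustrative, and HDInsight's basic-authentication requirement is reduced to a single header line.

using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Threading.Tasks;

// Submit a Hive query through WebHCat (Templeton) REST (all names illustrative).
class WebHCatSketch
{
    static async Task Main()
    {
        var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue(
            "Basic", Convert.ToBase64String(Encoding.UTF8.GetBytes("admin:password")));

        string baseUri = "https://mycluster.azurehdinsight.net/templeton/v1";

        // POST /templeton/v1/hive with the query in the "execute" field.
        var form = new FormUrlEncodedContent(new Dictionary<string, string>
        {
            ["user.name"] = "admin",
            ["execute"] = "SELECT keyword, COUNT(*) FROM messages GROUP BY keyword",
            ["statusdir"] = "/example/hivejobstatus" // stdout/stderr and exit code land here
        });
        HttpResponseMessage response = await http.PostAsync($"{baseUri}/hive", form);
        Console.WriteLine(await response.Content.ReadAsStringAsync()); // {"id":"job_..."}

        // Poll GET /templeton/v1/jobs/<id>?user.name=admin until the job completes,
        // then read the results from the status directory in storage.
    }
}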
89
Windows PowerShell Integration
Let's now look at how PowerShell works with the SDK.
90
Windows PowerShell Integration
Manage an HDInsight cluster using a local management console. PowerShell scripts can build projects, import data into HDFS, and run samples, giving you repeatable management through scripting: develop the PowerShell scripts, run them in the local management console, and manage the HDInsight cluster. Talk Track: Let's see how you can manage an HDInsight cluster from a local management console through Windows PowerShell. With PowerShell scripts, you can perform tasks like building projects, importing data into HDFS, and running jobs. Key Points: Manage an HDInsight cluster using a local management console through PowerShell.
91
Demo 3: Integrating PowerShell with HDInsight
Demo 3: Integrating PowerShell with HDInsight. Batch layer, speed layer, serving layer: PowerShell integration with Windows Azure HDInsight. Talk Track: Finally, in this last demo, I'll show you how to use the HDInsight cmdlets in PowerShell to manage cluster activities. [---CLICK---] The HDInsight cmdlets in PowerShell query the HDInsight cluster. [---CLICK---] You get the desired output. The workflow: create a cluster, run the MapReduce program, view progress, view the results on Azure Storage, and delete the cluster. For example: New-MapReduceStreamingJob -Input "/example/data/gutenberg/davinci.txt" -Output "/example/data/streamingoutput/wc.txt" -Mapper cat.exe -Reducer wc.exe -File "hdfs:///example/apps/wc.exe,hdfs:///example/apps/cat.exe" [-Define <# delimited key=value pairs>] Key Points: Use of PowerShell cmdlets to manage an HDInsight cluster.
92
Learn more
Microsoft .NET SDK For Hadoop
Managing Your HDInsight Cluster with PowerShell
That's it for this session. To learn more about what I just showed you, check out these resource links for the Hadoop .NET SDK and Windows PowerShell. Thank you! END OF PRESENTATION
93
Questions?
94
Check out the MVA recordings
END OF PRESENTATION
95
http://bit.ly/1d2QLWS Survey