Big Data Analytics with HDInsight

Presentation transcript:

Big Data Analytics with HDInsight. Modules: Microsoft Big Data Fundamentals; HDInsight makes Hadoop Easy; Hive and Tez: Querying/Curating Big Data; Pig, Sqoop, Oozie and Mahout: Working with Hadoop projects; HBase: A new paradigm; Storm Essentials.

2/15/2018 Big Data Analytics with HDInsight, Module 1 – Microsoft Big Data Fundamentals. Matt Winkler, Principal PM Manager, Microsoft. Nishant Thacker, Technical Product Manager, Microsoft. © 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Microsoft Big Data Fundamentals: Why Hadoop; Getting started with HDInsight; Why HDInsight.

Why Hadoop: Breaking points of traditional approach; Introducing Apache Hadoop.

Breaking points of traditional approach. Pipeline stages: source systems (OLTP, ERP, CRM, LOB); ETL and staging; data warehouse; BI & analytics (dashboards, reporting).

Breaking points of traditional approach. Breaking point 1, increasing data volumes: 50x data growth 2010-2020; 1 trillion web pages; 40 ZB digital universe in 2020.

Breaking points of traditional approach. Breaking point 1, increasing data volumes. Breaking point 2, real-time data: 204M emails sent every minute; 231B US e-commerce in 2012; 340M tweets sent every day.

Breaking points of traditional approach. Breaking point 1, increasing data volumes. Breaking point 2, real-time data. Breaking point 3, new data types: 15x machine-generated data by 2020; 2.4M Facebook content per minute; 1.3M hours on Skype per hour. New data sources: devices, web, sensors, social.

Breaking points of traditional approach. Breaking point 1, increasing data volumes. Breaking point 2, real-time data. Breaking point 3, new data types. Breaking point 4, cloud-born data: $100B spend on cloud; 40% of CRM sold is SaaS; 50% of large orgs have hybrid by 2017.

What if you could handle big data? Data volume grows with data complexity (variety and velocity):
Megabytes to gigabytes (ERP/CRM): payables, payroll, inventory, contacts, deal tracking, sales pipeline.
Terabytes (Web 2.0): web logs, digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce.
Petabytes: log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video.

Why Hadoop: Breaking points of traditional approach; Introducing Apache Hadoop.

Introducing Apache Hadoop. Apache open source project. Highly scalable distributed file system (HDFS). Distributed processing on data nodes. Addresses data volumes, data variety, and data velocity.

Data volume. Hadoop stores files in a distributed file system. Storage and computation are distributed across many servers. Files can be spread out over multiple nodes. Hadoop can store very large amounts of data. The combined storage resource can grow with demand from a few nodes to thousands of nodes. Scales out linearly. Very large files are supported, including those larger than the capacity of a single node.
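As a rough illustration of how a file larger than one node can be spread over several, the sketch below cuts a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size, the replication factor of 3, the node names, and the round-robin placement are all assumptions made for the example; real HDFS placement also takes rack topology and free space into account.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified stand-in for a real placement policy (assumption).
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds node count")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


MB = 1024 * 1024
# Hypothetical 1000 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(file_size=1000 * MB, block_size=128 * MB)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id] // MB, "MB", holders)
```

Because no single node has to hold every block, adding nodes adds both capacity and parallel read bandwidth, which is the scale-out behavior the slide describes.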

Data variety. Hadoop stores files (non-relational store). Files could have a variety of semi-structured or unstructured data. Previously, these files may not have been seen as providing value or insights. Today, new business questions and insights are being uncovered through data science.
Sentiment: understand how your customers feel about your brand and products, right now.
Clickstream: capture and analyze website visitors’ data trails and optimize your website.
Sensors: discover patterns in data streaming automatically from remote sensors and machines.
Geographic: analyze location-based data to manage operations where they occur.
Server logs: research logs to diagnose process failures and prevent security breaches.
Unstructured: understand patterns in files across millions of web pages, emails, and documents.

Data velocity. Hadoop can stream live data and process it in real time. Hadoop can act as a scalable event stream ingestion layer. Hadoop can do near-real-time in-stream processing. Diagram: data input (applications, devices, HTTP); incoming; event broker; stream processing; outgoing.

Hadoop is a platform with a portfolio of projects. Governed by the Apache Software Foundation (ASF). Comprises the core services of MapReduce, HDFS, and YARN. In addition to the core, includes functions across: data services, which allow you to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop); and operational services, which help manage the cluster (Ambari, Falcon, and Oozie).
Governance and integration (data workflow, lifecycle and governance): Falcon, Sqoop, Flume, NFS, WebHDFS.
Data access (on YARN, the data operating system): batch (MapReduce); script (Pig); SQL (Hive/Tez, HCatalog); NoSQL (HBase, Accumulo); stream (Storm); search (Solr); others (Spark, in-memory, ISV engines).
Data management: HDFS (Hadoop Distributed File System).
Security (authentication, authorization, accounting, data protection): storage: HDFS; resources: YARN; access: Hive, …; pipeline: Falcon; cluster: Knox.
Operations: provision, manage, and monitor (Ambari, Zookeeper); scheduling (Oozie).

A Hadoop distribution is a package of projects, tested for consistency across the entire package.
Components: Hadoop and YARN, Tez, Pig, Hive and HCatalog, HBase, Phoenix, Accumulo, Storm, Mahout, Solr, Falcon, Sqoop, Flume, Ambari, Oozie, Zookeeper, Knox.
HDP 2.1 (April 2014): 0.4.0 0.12.1 0.13.0 0.98.0 4.0.0 1.5.1 0.9.1 0.9.0 4.7.2 0.5.0 1.4.4 1.4.0 3.4.5 .0.4.0 2.4.0
HDP 2.0 (October 2013): 2.2.0 0.12.0 0.96.1 0.8.0 1.4.4 1.3.0 3.3.2 3.4.5 .0.4.0
HDP 1.3 (May 2013): 1.1.2 011.0 0.11.0 0.94.6 0.7.0 1.4.3 1.3.1 1.2.5 3.3.2 3.4.5 .0.4.0
Component categories: data management, data access, governance and integration, operations, security.

Getting Started with HDInsight: Introducing Azure HDInsight; 100% Apache Hadoop; Powered by the cloud; Immersive insights; Getting started with use cases.

Microsoft + Hortonworks: promoting open Hadoop. Engineering alignment. Corporate alignment. Field alignment.

HDInsight supports Hive: SQL-like queries on Hadoop data in HDInsight. HDInsight provides an easy-to-use graphical query interface for Hive. HiveQL is a SQL-like language (subset of SQL). Hive structures include well-understood database concepts such as tables, rows, columns, and partitions. Queries are compiled into MapReduce jobs that are executed on Hadoop.
Dramatic performance gains with Stinger/Tez. Stinger is a Microsoft, Hortonworks and OSS driven initiative to bring interactive queries to Hive. It brings query execution engine technology from Microsoft SQL Server to Hive. Performance gains of up to 100x. Microsoft contribution to Apache code. Hadoop 2.0.
Sample query run time: Hive 10: 1400s. HDP 1.3 / Hive 11: 44.3s (32x speedup). HDP 2.0: 35.1s (40x speedup). HDP 2.1: 15s (100x).
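To make "compiled into MapReduce jobs" concrete, here is a minimal in-memory sketch of how a grouping query such as SELECT country, COUNT(*) FROM page_views GROUP BY country could break down into map, shuffle, and reduce steps. The table name, its columns, and the rows are hypothetical, and the functions only imitate the phases; this is not Hive's actual query planner.

```python
from collections import defaultdict

# Hypothetical rows of a table page_views(country, url).
rows = [
    ("US", "/home"),
    ("DE", "/home"),
    ("US", "/cart"),
    ("US", "/home"),
    ("DE", "/help"),
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for country, _url in rows:
        yield country, 1


def shuffle_phase(pairs):
    """Collect emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key, which is what COUNT(*) requires."""
    return {key: sum(values) for key, values in grouped.items()}


counts = reduce_phase(shuffle_phase(map_phase(rows)))
print(counts)
```

On a cluster the map step would run on the nodes that hold the data and the reduce step would run per key group, so the same query shape scales with the number of nodes.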

HDInsight supports HBase: NoSQL database on data in HDInsight. Columnar, NoSQL database. Runs on top of the Hadoop Distributed File System (HDFS). Provides flexibility in that new columns can be added to column families at any time. Diagram components: Name Node, Job Tracker, HMaster, coordination, Data Node, Task Tracker, Region Server.
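The point that new columns can be added to column families at any time can be illustrated with a toy in-memory model. The class below is purely hypothetical and is not the HBase client API; it assumes column families are fixed when the table is created while column qualifiers inside a family are free-form per row.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy table: row key -> column family -> column qualifier -> value."""

    def __init__(self, families):
        # Column families are declared up front (assumption for this sketch).
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        """Store a value; any qualifier is accepted within a known family."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier, default=None):
        """Read a value without creating missing rows or columns."""
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier, default)


table = ColumnFamilyTable(["profile", "activity"])
table.put("user1", "profile", "name", "Ada")
table.put("user2", "profile", "name", "Lin")
# A new column appears for one row only, with no schema change.
table.put("user2", "profile", "twitter", "@lin")

print(table.get("user2", "profile", "twitter"))
print(table.get("user1", "profile", "twitter"))
```

Rows that never received the new column simply do not store it, which is why sparse, evolving data fits this model well.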

HDInsight supports Mahout: machine learning library. A library of machine learning algorithms to execute on data in HDFS. Algorithms are not dependent on the size of the data and can scale with large datasets. Library includes: collaborative filtering, classification, clustering, dimensionality reduction, topic models.

HDInsight supports Storm: stream analytics for near-real-time processing. Consumes millions of real-time events from a scalable event broker (e.g., Apache Kafka, Azure Event Hubs). Performs time-sensitive computation. Outputs to persistent stores, dashboards or devices. Topology components: spout, bolt.
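The spout and bolt roles can be sketched with plain Python generators: a spout is the source of the stream and a bolt consumes tuples and emits results. The example below is not the Storm API; the event format (timestamp in seconds, device id), the function names, and the 10-second tumbling window are assumptions chosen to show one kind of time-sensitive computation. It also assumes events arrive in timestamp order.

```python
from collections import Counter


def sensor_spout(events):
    """Spout: the source of the stream. Here it replays a fixed list."""
    for event in events:
        yield event


def window_count_bolt(stream, window_seconds):
    """Bolt: count events per device in tumbling windows of window_seconds."""
    current_window = None
    counts = Counter()
    for timestamp, device in stream:
        window = timestamp // window_seconds
        if current_window is not None and window != current_window:
            # The window closed: emit its start time and counts downstream.
            yield current_window * window_seconds, dict(counts)
            counts = Counter()
        current_window = window
        counts[device] += 1
    if counts:
        # Flush the last, possibly partial, window.
        yield current_window * window_seconds, dict(counts)


# Hypothetical events: (timestamp in seconds, device id).
events = [(0, "a"), (3, "b"), (7, "a"), (12, "a"), (14, "b"), (21, "b")]
results = list(window_count_bolt(sensor_spout(events), window_seconds=10))

for window_start, per_device in results:
    print(window_start, per_device)
```

In a real topology the emitted window results would go on to a persistent store, dashboard, or device rather than being printed.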

Automatic geo-redundancy. HDInsight automatically replicates data. Automatic geo-replication of data. Data only replicates within the same geopolitical boundary (i.e., country or region).

Deployed in minutes. HDInsight in the cloud bypasses the need for deployment expertise: Hadoop is non-trivial to install and get up and running, and there is an education gap in the IT community regarding Hadoop. HDInsight is deployed in minutes: spin up any number of Hadoop nodes on demand; up and running in a few clicks (and within minutes).

Low cost. HDInsight is billed by usage: clusters can be deleted when no longer used. No additional price for support: Azure Support includes Hadoop support, so what usually costs thousands of dollars per node is included.

Connect cloud Hadoop with on-premises. Hybrid = on-premises + cloud. Cloud: HDInsight. On-premises: Hadoop software, appliances, APS. Hortonworks on-prem Hadoop moves data to HDInsight. Analytics Platform System can query HDInsight and join with on-prem data.

Scenarios for deploying Hadoop as hybrid. Cloud develop/POC (HDInsight). Cloud bursting (HDInsight). Cloud backup/archive (HDInsight). On-premises: Hadoop software, appliances, APS.

Bringing Hadoop to a billion people. Excel as the BI tool for everyone. Power BI for collaboration and new experiences. 1 billion Microsoft Office users. Office 365 is our fastest-growing commercial product ever. Connect to HDInsight, analyze, visualize, share, ask, access. Scalable, manageable, trusted.

The City of Barcelona uses HDInsight to collect, analyze, and generate insights from data collected from social media feeds, GPS signals, and government systems. “We can gain the insight needed to distribute bicycles in different ways so that people can use them to connect with other forms of transportation such as buses and trains…(creating) a more sustainable model.” Lluis Sanz Marco, Director of Information Municipal, City of Barcelona.

Hadoop scenario 1: pre-process ETL. Shift the pre-processing of ETL in the staging data warehouse to Hadoop. Shifts high-cost data warehousing to lower-cost Hadoop clusters. Pipeline: source systems (OLTP, ERP, CRM, LOB) and new data (devices, web, sensors, social); ETL; data warehouse; BI & analytics (dashboards, reporting).

Hadoop scenario 2: hot and cold storage. Offload large volumes of historical data into cold storage with Hadoop. Keep the data warehouse for hot data to allow BI and analytics. When data from cold storage is needed, it can be moved back into the warehouse. Pipeline: source systems (OLTP, ERP, CRM, LOB) and new data (devices, web, sensors, social); ETL and staging; hot data in the data warehouse and cold data in Hadoop; BI & analytics (dashboards, reporting).
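The hot and cold split can be sketched as a simple age-based partition with a recall step. The 90-day threshold, the record layout, and the function names below are assumptions for illustration only; the slide does not say how the boundary between hot and cold data is chosen.

```python
from datetime import date, timedelta


def split_hot_cold(records, today, hot_days):
    """Partition records into hot (recent) and cold (older than hot_days)."""
    cutoff = today - timedelta(days=hot_days)
    hot = [r for r in records if r["day"] >= cutoff]
    cold = [r for r in records if r["day"] < cutoff]
    return hot, cold


def recall(cold, start, end):
    """Select cold records dated within [start, end] to load back into the warehouse."""
    return [r for r in cold if start <= r["day"] <= end]


# Hypothetical sales records.
records = [
    {"day": date(2014, 6, 1), "amount": 120},
    {"day": date(2014, 1, 15), "amount": 80},
    {"day": date(2013, 3, 10), "amount": 45},
]

hot, cold = split_hot_cold(records, today=date(2014, 6, 30), hot_days=90)
restored = recall(cold, date(2013, 1, 1), date(2013, 12, 31))

print(len(hot), len(cold), restored)
```

Here `hot` stands for what stays in the data warehouse, `cold` for what is offloaded to Hadoop, and `restored` for the slice moved back when a report needs older data.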

Industry use cases of Hadoop.
Financial services: new account risk screens; fraud prevention; trading risk; maximize deposit spread; insurance underwriting; accelerate loan processing.
Retail: 360° view of the customer; analyze brand sentiment; localized, personalized promotions; website optimization; optimal store layout.
Telecom: call detail records (CDRs); infrastructure investment; next product to buy (NPTB); real-time bandwidth allocation; new product development.
Manufacturing: supplier consolidation; supply chain and logistics; assembly line quality assurance; proactive maintenance; crowd source quality assurance.
Healthcare: genomic data for medical trials; monitor patient vitals; reduce re-admittance rates; store medical research data; recruit cohorts for pharmaceutical trials.
Utilities, oil and gas: smart meter stream analysis; slow oil well decline curves; optimize lease bidding; compliance reporting; proactive equipment repair; seismic image processing.
Public sector: analyze public sentiment; protect critical networks; prevent fraud and waste; crowd source reporting for repairs to infrastructure; fulfill open records requests.

Why HDInsight? Challenges with implementing Hadoop; Why Hadoop in the cloud?

Challenges with implementing Hadoop: up-front hardware costs; capacity planning; Hadoop expertise.

Why Hadoop in the cloud? Benefits of cloud: no hardware costs ($0); unlimited elastic scale; pay only for what you need; deployed in minutes; auto geo-redundancy.
