1
Big Data Analytics with HDInsight
Microsoft Big Data Fundamentals
HDInsight makes Hadoop Easy
Hive and Tez - Querying/Curating Big Data
Pig, Sqoop, Oozie and Mahout – Working with Hadoop projects
HBase - A new paradigm
Storm Essentials
2
2/15/2018
Big Data Analytics with HDInsight
Module 1 – Microsoft Big Data Fundamentals
Matt Winkler, Principal PM Manager, Microsoft
Nishant Thacker, Technical Product Manager, Microsoft
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
3
Microsoft Big Data Fundamentals
Why Hadoop
Getting started with HDInsight
Why HDInsight
4
Why Hadoop
Breaking points of traditional approach
Introducing Apache Hadoop
5
Breaking points of traditional approach
Traditional pipeline: Source systems (OLTP, ERP, CRM, LOB) → ETL (staging) → Data warehouse → BI & analytics (dashboards, reporting)
6
Breaking points of traditional approach
1. Increasing data volumes
50x data growth
1 trillion web pages
40 ZB digital universe in 2020
Traditional pipeline: Source systems (OLTP, ERP, CRM, LOB) → ETL (staging) → Data warehouse → BI & analytics (dashboards, reporting)
7
Breaking points of traditional approach
1. Increasing data volumes
2. Real-time data
204M emails sent every minute
231B US e-commerce in 2012
340M tweets sent every day
Traditional pipeline: Source systems (OLTP, ERP, CRM, LOB) → ETL (staging) → Data warehouse → BI & analytics (dashboards, reporting)
8
Breaking points of traditional approach
1. Increasing data volumes
2. Real-time data
3. New data types
15x machine-generated data by 2020
2.4M pieces of Facebook content per minute
1.3M hours on Skype per hour
New data sources: Devices, Web, Sensors, Social
Traditional pipeline: Source systems (OLTP, ERP, CRM, LOB) → ETL (staging) → Data warehouse → BI & analytics (dashboards, reporting)
9
Breaking points of traditional approach
1. Increasing data volumes
2. Real-time data
3. New data types
4. Cloud-born data
$100B spend on cloud
40% of CRM sold is SaaS
50% of large organizations have hybrid by 2017
New data sources: Devices, Web, Sensors, Social
Traditional pipeline: Source systems (OLTP, ERP, CRM, LOB) → ETL (staging) → Data warehouse → BI & analytics (dashboards, reporting)
10
What if you could handle big data?
Chart: data volume by source type
Petabytes: Log files, Spatial & GPS coordinates, Data market feeds, eGov feeds, Weather, Text/image, Click stream, Wikis/blogs, Sensors/RFID/devices, Social sentiment, Audio/video
Terabytes (Web 2.0): Web logs, Digital marketing, Search marketing, Recommendations, Advertising, Mobile, Collaboration, eCommerce
Gigabytes (ERP/CRM): Payables, Payroll, Inventory, Contacts, Deal tracking, Sales pipeline
Megabytes
Axis: Data complexity (variety and velocity)
11
Why Hadoop
Breaking points of traditional approach
Introducing Apache Hadoop
12
Introducing Apache Hadoop
Apache open source project
Highly scalable distributed file system (HDFS)
Distributed processing on data nodes
Data volumes
Data variety
Data velocity
13
Data volume
Hadoop stores files in a distributed file system
Storage and computation are distributed across many servers
Files can be spread out over multiple nodes
Hadoop can store very large amounts of data
Combined storage resource can grow with demand from a few nodes to thousands of nodes
Scales out linearly
Very large files are supported, including those larger than the capacity of a single node
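To make the idea of spreading one file over many nodes concrete, here is a minimal Python sketch. It is not HDFS code: the 128 MB block size, the three replicas, the node names, and the round-robin placement are all assumptions chosen for illustration, and real HDFS placement also takes racks and free space into account.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, nodes, replicas=3):
    """Assign each block to `replicas` distinct nodes (toy round-robin policy)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replicas)]
        for block in range(num_blocks)
    }


# Hypothetical cluster and file; sizes are in MB.
nodes = ["node1", "node2", "node3", "node4"]
blocks = split_into_blocks(file_size=300, block_size=128)
placement = place_blocks(len(blocks), nodes)

print(blocks)        # [128, 128, 44]
print(placement[0])  # ['node1', 'node2', 'node3']
```

Because no block has to fit the whole file, a file can be larger than any single node's disk, and adding nodes adds both storage and places to run computation.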
14
Data variety
Hadoop stores files (non-relational store)
Files could have a variety of semi-structured or unstructured data
Previously, these files may not have been seen as providing value or insights
Today, new business questions and insights are being uncovered through data science
Sentiment: Understand how your customers feel about your brand and products, right now
Clickstream: Capture and analyze website visitors’ data trails and optimize your website
Sensors: Discover patterns in data streaming automatically from remote sensors and machines
Geographic: Analyze location-based data to manage operations where they occur
Server logs: Research logs to diagnose process failures and prevent security breaches
Unstructured: Understand patterns in files across millions of web pages, emails, and documents
15
Data velocity
Hadoop can stream live data and process it in real time
Hadoop can provide scalable event stream ingestion
Hadoop can do near-real-time in-stream processing
Diagram labels: Data input (applications, devices), HTTP, Incoming, Event broker, Stream processing, Outgoing
16
Hadoop is a platform with a portfolio of projects
Governed by the Apache Software Foundation (ASF)
Comprises core services of MapReduce, HDFS, and YARN
In addition to the core, includes functions across:
Data services, which allow you to manipulate and move data (Hive, HBase, Pig, Flume, Sqoop)
Operational services, which help manage the cluster (Ambari, Falcon, and Oozie)
Diagram:
Governance and integration (data workflow, lifecycle and governance): Falcon, Sqoop, Flume, NFS, WebHDFS
Data access (on YARN, the data operating system): Batch: MapReduce; Script: Pig; SQL: Hive/Tez, HCatalog; NoSQL: HBase, Accumulo; Stream: Storm; Search: Solr; Others: Spark, in-memory, ISV engines
Data management: HDFS (Hadoop Distributed File System)
Security (authentication, authorization, accounting, data protection): Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox
Operations (provision, manage, and monitor): Ambari, Zookeeper; Scheduling: Oozie
17
A Hadoop distribution is a package of projects
Tested for consistency across entire package
Projects: Hadoop and YARN, Tez, Pig, Hive and HCatalog, HBase, Phoenix, Accumulo, Storm, Mahout, Solr, Falcon, Sqoop, Flume, Ambari, Oozie, Zookeeper, Knox
HDP 2.1 (April 2014): 0.4.0 0.12.1 0.13.0 0.98.0 4.0.0 1.5.1 0.9.1 0.9.0 4.7.2 0.5.0 1.4.4 1.4.0 3.4.5 .0.4.0 2.4.0
HDP 2.0 (October 2013): 2.2.0 0.12.0 0.96.1 0.8.0 1.4.4 1.3.0 3.3.2 3.4.5 .0.4.0
HDP 1.3 (May 2013): 1.1.2 0.11.0 0.11.0 0.94.6 0.7.0 1.4.3 1.3.1 1.2.5 3.3.2 3.4.5 .0.4.0
Project categories: Data management, Data access, Governance and integration, Operations, Security
18
Getting Started with HDInsight
Introducing Azure HDInsight
100% Apache Hadoop
Powered by the cloud
Immersive insights
Getting started with use cases
19
Microsoft + Hortonworks Promoting open Hadoop
Engineering alignment
Corporate alignment
Field alignment
20
HDInsight supports Hive
SQL-like queries on Hadoop data in HDInsight
HDInsight provides an easy-to-use graphical query interface for Hive
HiveQL is a SQL-like language (subset of SQL)
Hive structures include well-understood database concepts such as tables, rows, columns, and partitions
Queries are compiled into MapReduce jobs that are executed on Hadoop
Dramatic performance gains with Stinger/Tez
Stinger is a Microsoft, Hortonworks, and OSS-driven initiative to bring interactive queries to Hive
Brings query execution engine technology from Microsoft SQL Server to Hive
Performance gains of up to 100x
Microsoft contribution to Apache code
Chart (Hadoop 2.0), sample query run time: Hive 10: 1400s; HDP 1.3 / Hive 11: 44.3s (32x speedup); HDP 2.0: 35.1s (40x speedup); HDP 2.1: 15s (100x)
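As a rough illustration of what "compiled into MapReduce jobs" means, the Python sketch below runs the three phases that a simple grouped count would need. The query, the `clicks` table, and the `page` column are hypothetical, and Hive's actual planner produces far more elaborate jobs; this only shows the shape of the computation.

```python
from collections import defaultdict

# Hypothetical HiveQL query, for illustration only:
#   SELECT page, COUNT(*) FROM clicks GROUP BY page;


def map_phase(rows):
    """Map: emit one (key, 1) pair per input row."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs):
    """Shuffle: bring all values for the same key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


clicks = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]
counts = reduce_phase(shuffle(map_phase(clicks)))
print(counts)  # {'/home': 2, '/cart': 1}
```

On a cluster, the map and reduce steps run in parallel on many nodes and the shuffle moves data between them, which is where much of the latency of MapReduce-based Hive queries comes from.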
21
HDInsight supports HBase
NoSQL database on data in HDInsight
Columnar, NoSQL database
Runs on top of the Hadoop Distributed File System (HDFS)
Provides flexibility in that new columns can be added to column families at any time
Diagram labels: Name Node, Job Tracker, HMaster, Coordination; Data Node, Task Tracker, Region Server
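The point about adding columns at any time can be sketched with a toy in-memory table in Python. `ToyTable` and its methods are hypothetical names for illustration and are not the HBase client API; the sketch assumes only that column families are fixed when the table is created while column qualifiers inside a family are free-form.

```python
class ToyTable:
    """Toy model of a column-family store (not the HBase API)."""

    def __init__(self, column_families):
        # Column families are declared up front.
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        """Store a cell; any qualifier is accepted inside a known family."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {})[(family, qualifier)] = value

    def get(self, row_key, family, qualifier):
        """Return the cell value, or None if the row lacks that column."""
        return self.rows.get(row_key, {}).get((family, qualifier))


table = ToyTable(["profile"])
table.put("user1", "profile", "name", "Ada")
table.put("user2", "profile", "name", "Lin")
table.put("user2", "profile", "twitter", "@lin")  # new column, no schema change

print(table.get("user2", "profile", "twitter"))  # @lin
print(table.get("user1", "profile", "twitter"))  # None
```

Rows that never wrote the new column simply have no cell for it, so nothing has to be migrated when a column appears.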
22
HDInsight supports Mahout
Machine learning library
A library of machine learning algorithms to execute on data in HDFS
Algorithms are not dependent on size of data and can scale with large datasets
Library includes: Collaborative Filtering, Classification, Clustering, Dimensionality Reduction, Topic Models
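For readers new to collaborative filtering, the first algorithm family listed above, here is a small single-machine Python sketch based on item co-occurrence. It does not use Mahout; the basket data and the function names are made up for illustration, and Mahout's distributed implementations differ in both scoring and scale.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(baskets):
    """Count how often each ordered pair of items appears in the same basket."""
    counts = Counter()
    for items in baskets:
        for a, b in permutations(set(items), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user_items, counts, top_n=2):
    """Score unseen items by co-occurrence with the user's items."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a in user_items and b not in user_items:
            scores[b] += n
    return [item for item, _ in scores.most_common(top_n)]


# Hypothetical purchase history.
baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "jam"]]
print(recommend({"milk"}, cooccurrence(baskets)))  # ['bread', 'eggs']
```

The counting step is a natural fit for MapReduce, since each basket can be processed independently and the pair counts summed afterwards.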
23
HDInsight supports Storm
Stream analytics for near-real-time processing
Consumes millions of real-time events from a scalable event broker (e.g., Apache Kafka, Azure Event Hubs)
Performs time-sensitive computation
Outputs to persistent stores, dashboards, or devices
Diagram labels: Spout, Bolt
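The spout and bolt roles named on this slide can be mimicked with Python generators: a spout emits tuples and each bolt consumes one stream and emits another. This is a conceptual sketch only, not the Storm API; the sensor names, readings, and the threshold of 50 are invented for the example.

```python
def event_spout(events):
    """Spout: emits tuples from a source (here, an in-memory list)."""
    for event in events:
        yield event


def threshold_bolt(stream, limit):
    """Bolt: passes on only the readings above `limit`."""
    for device, reading in stream:
        if reading > limit:
            yield device, reading


def count_bolt(stream):
    """Bolt: keeps a running count of alerts per device."""
    counts = {}
    for device, _ in stream:
        counts[device] = counts.get(device, 0) + 1
        yield device, counts[device]


# Hypothetical sensor readings.
events = [("sensor-a", 71), ("sensor-b", 40), ("sensor-a", 90)]
alerts = list(count_bolt(threshold_bolt(event_spout(events), limit=50)))
print(alerts)  # [('sensor-a', 1), ('sensor-a', 2)]
```

Each tuple flows through the whole chain as soon as it arrives, which is the property that separates stream processing from batch jobs that wait for a complete data set.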
24
Automatic Geo-Redundancy
HDInsight auto-replicates data
Automatic geo-replication of data
Data replicates only within the same geopolitical boundary (e.g., country or region)
25
Deployed in minutes
HDInsight in the cloud bypasses deployment expertise
Hadoop is non-trivial to install and get up and running
Education gap in IT community regarding Hadoop
HDInsight is deployed in minutes
Spin up any number of Hadoop nodes on demand
Up and running in a few clicks (and within minutes)
26
Low cost
HDInsight is billed by usage
Clusters can be deleted when no longer used
No additional price for support
Azure Support includes Hadoop support
What usually costs thousands of dollars per node is included
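The effect of usage-based billing is easy to see with a little arithmetic. In the Python sketch below, the per-node hourly rate, the cluster size, and the usage pattern are all invented numbers, not Azure prices, and storage charges are ignored.

```python
def cluster_cost(nodes, hours, rate_per_node_hour):
    """Compute cost for a cluster billed per node-hour."""
    return nodes * hours * rate_per_node_hour


RATE = 0.50  # hypothetical price per node-hour, not an Azure price

# Same 8-node cluster over a 30-day month.
always_on = cluster_cost(nodes=8, hours=24 * 30, rate_per_node_hour=RATE)
on_demand = cluster_cost(nodes=8, hours=4 * 30, rate_per_node_hour=RATE)

print(always_on)  # 2880.0
print(on_demand)  # 480.0
```

Under these assumptions, deleting the cluster outside a four-hour daily window cuts the compute bill to one sixth, which is the point of being able to delete clusters when they are no longer used.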
27
Connect cloud Hadoop with on-premises
Hybrid = on-premises + cloud
Cloud: HDInsight
On-premises: Hadoop software, appliances (APS)
Hortonworks on-premises Hadoop moves data to HDInsight
Analytics Platform System (APS) can query HDInsight and join the results with on-premises data
28
Scenarios for deploying Hadoop as hybrid
Develop/POC in the cloud (HDInsight)
Cloud bursting (HDInsight)
Cloud backup/archive (HDInsight)
On-premises: Hadoop software, appliances (APS)
29
Bringing Hadoop to a billion people
Excel as the BI tool for everyone
Power BI for collaboration & new experiences
1 billion Microsoft Office users
Office 365 is our fastest-growing commercial product ever
Connect to HDInsight, Analyze, Visualize, Share, Ask, Access
Scalable, manageable, trusted
30
The City of Barcelona uses HDInsight to collect, analyze, and generate insights with data collected from social media feeds, GPS signals, and government systems.
“We can gain the insight needed to distribute bicycles in different ways so that people can use them to connect with other forms of transportation such as busses and trains…(creating) a more sustainable model.”
Lluis Sanz Marco, Director of Information Municipal, City of Barcelona
31
Hadoop scenario 1—pre-process ETL
Shift the pre-processing of ETL in the staging data warehouse to Hadoop
Shifts high-cost data warehousing to lower-cost Hadoop clusters
Diagram: Source systems (OLTP, ERP, CRM, LOB) and new data (Devices, Web, Sensors, Social) → ETL → Data warehouse → BI & analytics (dashboards, reporting)
32
Hadoop scenario 2—hot and cold storage
Offload large volumes of historical data into cold storage with Hadoop
Keep the data warehouse for hot data to allow BI and analytics
When data from cold storage is needed, it can be moved back into the warehouse
Diagram: Source systems (OLTP, ERP, CRM, LOB) and new data (Devices, Web, Sensors, Social) → ETL (staging) → Data warehouse (hot data in DW) → BI & analytics (dashboards, reporting); cold data in Hadoop
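One simple way to decide what counts as hot or cold is an age cutoff. The Python sketch below splits records on a 90-day window; the window, the record layout, and the dates are assumptions for illustration, and a real deployment would pick its own policy and move data between systems rather than between lists.

```python
from datetime import date, timedelta


def split_hot_cold(records, today, hot_days=90):
    """Split records into hot (recent) and cold (older than `hot_days`)."""
    cutoff = today - timedelta(days=hot_days)
    hot = [r for r in records if r["date"] >= cutoff]
    cold = [r for r in records if r["date"] < cutoff]
    return hot, cold


# Hypothetical sales records.
records = [
    {"id": 1, "date": date(2014, 6, 1)},
    {"id": 2, "date": date(2013, 1, 15)},
]
hot, cold = split_hot_cold(records, today=date(2014, 6, 30))

print([r["id"] for r in hot])   # [1]
print([r["id"] for r in cold])  # [2]
```

Here `hot` would stay in the data warehouse and `cold` would be offloaded to Hadoop, to be loaded back only when a query needs it.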
33
Industry use cases of Hadoop
Financial services: New account risk screens, Fraud prevention, Trading risk, Maximize deposit spread, Insurance underwriting, Accelerate loan processing
Retail: 360° view of the customer, Analyze brand sentiment, Localized, personalized promotions, Website optimization, Optimal store layout
Telecom: Call detail records (CDRs), Infrastructure investment, Next product to buy (NPTB), Real-time bandwidth allocation, New product development
Manufacturing: Supplier consolidation, Supply chain and logistics, Assembly line quality assurance, Proactive maintenance, Crowd source quality assurance
Healthcare: Genomic data for medical trials, Monitor patient vitals, Reduce re-admittance rates, Store medical research data, Recruit cohorts for pharmaceutical trials
Utilities, oil and gas: Smart meter stream analysis, Slow oil well decline curves, Optimize lease bidding, Compliance reporting, Proactive equipment repair, Seismic image processing
Public sector: Analyze public sentiment, Protect critical networks, Prevent fraud and waste, Crowd source reporting for repairs to infrastructure, Fulfill open records requests
34
Why HDInsight?
Challenges with implementing Hadoop
Why Hadoop with cloud?
35
Challenges with implementing Hadoop
Up-front hardware costs
Capacity planning
Hadoop expertise
36
Why Hadoop in the cloud?
Benefits of cloud:
No hardware costs ($0)
Unlimited elastic scale
Pay only for what you need
Deployed in minutes
Auto geo-redundancy