Data Platform and Analytics Foundational Training

1 Data Platform and Analytics Foundational Training
Microsoft C+E Technology Training
Solution Area: Data Analytics
Solution: Big Data
Technology: Hadoop on Azure
[Speaker Name]

2 What is Big Data?

3 The business imperative
1. Increasing data volumes
2. Increasing complexity of data and analysis
3. Changing economics and emerging technologies

4 A new set of questions
Social and web analytics: What's the social sentiment for my brand or products?
Live data feeds: How do I optimize my fleet based on weather and traffic patterns?
Advanced analytics: How do I better predict future outcomes?

5 What is big data?
Big data solutions deal with the complexities of:
Volume (size)
Variety (structure)
Velocity (speed)

6 What is big data?
Chart axes: data volume (megabytes, gigabytes, terabytes, petabytes) against data complexity (variety and velocity)
ERP/CRM (gigabytes): payables; payroll; inventory; contacts; deal tracking; sales pipeline
Web 2.0 (terabytes): web logs; digital marketing; search marketing; recommendations; advertising; mobile; collaboration; eCommerce
Big data (petabytes): log files; spatial and GPS coordinates; data market feeds; eGov feeds; weather; text/image; clickstream; wikis/blogs; sensors/RFID/devices; social sentiment; audio/video

7 Common big-data customer scenarios
IT infrastructure optimization; legal discovery; social network analysis; traffic flow optimization; web app optimization; churn analysis; natural resource exploration; weather forecasting; healthcare outcomes; fraud detection; life sciences research; advertising analysis; equipment monitoring; smart meter monitoring
Store now, question later

8 Introducing Apache Hadoop

9 Introducing Apache Hadoop
An Apache open source project
Highly scalable distributed file system (HDFS)
Distributed processing on data nodes
Key attributes: open source; highly scalable; runs on commodity hardware; redundant and reliable (no data loss); batch-processing centric, using a "MapReduce" processing paradigm

10 Hadoop is not…
A replacement for a data warehouse
A place to learn how to code C#
A place for low-latency data

11 Business applications of Hadoop
Financial services: new account risk screens; fraud prevention; trading risk; maximum deposit spread; insurance underwriting; accelerated loan processing
Retail: 360° view of customer; analysis of brand sentiment; localized, personalized promotions; website optimization; optimal store layout
Telecom: call detail records (CDRs); infrastructure investment; next product to buy (NPTB); real-time bandwidth allocation; new product development
Manufacturing: supplier consolidation; supply chain and logistics; assembly-line quality assurance; proactive maintenance; crowdsource quality assurance
Healthcare: genomic data for medical trials; patient vitals monitoring; reduced readmittance rates; storage of medical research data; recruitment of cohorts for pharmaceutical trials
Utilities, oil, and gas: smart meter-stream analysis; slow oil-well decline curves; optimized lease bidding; compliance reporting; proactive equipment repair; seismic image processing
Public sector: analysis of public sentiment; protected critical networks; fraud and waste prevention; crowdsource reporting for repairs to infrastructure; fulfillment of open records requests

12 Hadoop Components

13 Hadoop – What is it?
A highly reliable, distributed, and parallel programming framework for analyzing big data
A Java-based, open source Apache project
Capable of running on a variety of hardware platforms, including clusters of commodity hardware
The Hadoop core includes:
A scalable, reliable file system (HDFS)
A framework that enables development of programs based on the MapReduce (MR) or directed acyclic graph (DAG) model
YARN, a distributed resource manager that allocates and controls access to cluster resources
In addition to the core, Hadoop has a rich ecosystem that supports SQL/NoSQL, streaming, real-time, and interactive applications
Diagram (Hadoop core): HDFS (redundant, reliable storage); YARN (cluster resource manager); MapReduce and Tez (data processing framework)

14 Hadoop MapReduce concept
Programming framework (library and runtime) for analyzing data sets stored in HDFS
Composed of user-supplied Map and Reduce functions:
Map(): Subdivide and conquer. Divide a large problem into sub-problems and perform the same function on all sub-problems.
Reduce(): Combine and reduce cardinality. Combine the output from all sub-functions into the final output.
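The Map and Reduce roles described on this slide can be illustrated with a small in-memory word count. This is only a sketch of the programming model in plain Python, not Hadoop's API: the function names `map_words`, `shuffle`, `reduce_counts`, and `run_job` are hypothetical, and a real job would read its input from HDFS and run across many nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: apply the same function to every sub-problem (one line)."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key, as the framework does between steps."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce step: combine all values for one key into a single result."""
    return word, sum(counts)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce sequentially on an in-memory input."""
    mapped = (pair for line in lines for pair in map_words(line))
    grouped = shuffle(mapped)
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


# Hypothetical two-line input standing in for files stored in HDFS.
lines = ["big data big clusters", "data nodes store data"]
word_counts = run_job(lines)
print(word_counts)
```

Because each line is mapped independently and each key is reduced independently, both steps could in principle be spread across many machines, which is the property the slide is describing.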

15 Introducing Azure HDInsight

16 Azure HDInsight – What is it?
A standard Apache Hadoop distribution offered as a managed service on Microsoft Azure
Based on Hortonworks Data Platform (HDP)
Provisioned as clusters on Azure that can run on Windows or Linux servers
Offers a capacity-on-demand, pay-as-you-go pricing model
Integrates with:
Azure Blob Storage and Azure Data Lake Store for Hadoop Distributed File System (HDFS) storage
Azure Portal for management and administration
Visual Studio for application development tooling
In addition to the core, HDInsight supports the Hadoop ecosystem

17 HDInsight and Hadoop ecosystem
Diagram components:
Pipeline/workflow (Oozie); graph (Pegasus); stats processing (RHadoop); machine learning (Mahout)
Metadata (HCatalog); NoSQL database (HBase); scripting (Pig); query (Hive); distributed processing (MapReduce)
Event-driven processing; real-time processing (Storm); event pipeline (Event Hub/Flume)
YARN; distributed storage (HDFS)
Data integration (ODBC/Sqoop/REST); relational (SQL Server); PDW PolyBase
Business intelligence (Excel, Power BI, SSAS); JavaScript; C#, F#, .NET
Monitoring and deployment (System Center); Active Directory (security); world's data (Azure Data Marketplace); Azure Storage Vault (ASV)
Legend: red = core Hadoop; blue = data processing; gray = Microsoft integration points and value adds; orange = data movement; green = packages

18 HDInsight: Built for Windows or Linux
Managed and supported by Microsoft
Familiarity of Windows
Reuse of common tools, documentation, and samples from the Hadoop/Linux ecosystem
Addition of Hadoop projects that were authored on Linux to HDInsight
Easier transition from on-premises to cloud

19 HDInsight supports Hive
SQL-like queries on Hadoop data in HDInsight
HDInsight provides an easy-to-use graphical query interface for Hive
HiveQL is a SQL-like language (subset of SQL)
Hive structures include well-understood database concepts such as tables, rows, columns, and partitions
Queries are compiled into MapReduce jobs that are executed on Hadoop
Dramatic performance gains with Stinger/Tez
Stinger is a Microsoft, Hortonworks, and OSS-driven initiative to bring interactive queries to Hive
Brings query execution engine technology from Microsoft SQL Server to Hive
Performance gains up to 100x
Microsoft contribution to Apache code
Chart: sample query times (Hadoop 2.0)
Configuration        Query time   Speedup
Hive 10              1400s        (baseline)
HDP 1.3 / Hive 11    44.3s        32x
HDP 2.0              35.1s        40x
HDP 2.1              15s          100x
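The speedup labels on this slide can be checked with simple arithmetic. The sketch below assumes the chart's timings pair with its labels in the order they appear (1400s for Hive 10 as the baseline, then 44.3s, 35.1s, and 15s); that pairing is an inference from the transcript, not something the slide states explicitly.

```python
# Query times in seconds, paired with chart labels in the order they appear
# in the transcript (an assumption; the original chart layout is lost).
baseline_seconds = 1400.0  # Hive 10
timings = {
    "HDP 1.3 / Hive 11": 44.3,
    "HDP 2.0": 35.1,
    "HDP 2.1": 15.0,
}

# Speedup = baseline time divided by the new time.
speedups = {label: baseline_seconds / seconds for label, seconds in timings.items()}

for label, factor in speedups.items():
    print(f"{label}: {factor:.1f}x faster than the baseline")
```

Under that pairing the first two ratios round to the slide's 32x and 40x, and the last works out to roughly 93x, which the slide presents as 100x.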

20 HDInsight supports HBase
NoSQL database on data in HDInsight
Columnar, NoSQL database
Runs on top of the Hadoop Distributed File System (HDFS)
Provides flexibility for new columns to be added to column families at any time
Diagram components: name node; data node; JobTracker; TaskTracker; HMaster; region server; coordination
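The point about adding new columns to column families at any time can be pictured with a toy data structure. This is a plain-Python illustration of the data model only, not the HBase client API; the `ToyTable` class and its `put` and `get` methods are hypothetical names chosen for this sketch.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


class ToyTable:
    """Toy model: column families are fixed up front, columns are not."""

    def __init__(self, column_families: Iterable[str]) -> None:
        self.column_families = frozenset(column_families)
        # row key -> {"family:qualifier": value}
        self.rows: Dict[str, Dict[str, Any]] = defaultdict(dict)

    def put(self, row_key: str, family: str, qualifier: str, value: Any) -> None:
        """Store a cell; any qualifier is accepted inside a known family."""
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key: str) -> Dict[str, Any]:
        """Return a copy of all cells stored for one row."""
        return dict(self.rows.get(row_key, {}))


table = ToyTable(["info"])
table.put("user1", "info", "name", "Ada")
# A new column appears on a later row without any schema change.
table.put("user2", "info", "email", "user2@example.com")
print(table.get("user1"))
print(table.get("user2"))
```

Rows in the same table can therefore carry different sets of columns, while the set of column families stays fixed.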

21 HDInsight supports Mahout
Machine learning library
A library of machine learning algorithms to execute on data in HDFS
Algorithms are not dependent on size of data and can scale with large data sets
Library includes: collaborative filtering, classification, clustering, dimensionality reduction, topic models

22 HDInsight supports Storm
Stream analytics for near real-time processing
Consumes millions of real-time events from a scalable event broker (e.g., Apache Kafka, Azure Event Hub)
Performs time-sensitive computation
Outputs to persistent stores, dashboards, or devices
Customizable with Java + .NET
Deeply integrated with Visual Studio
Diagram stages:
Event producers: applications; devices; sensors; web and social
Collection: cloud gateways (web APIs); field gateways
Event queuing system: Event Hubs; Kafka/RabbitMQ/ActiveMQ
Transformation (stream processing): Apache Storm on HDInsight; Azure Stream Analytics
Long-term storage (storage adapters): HDFS; HBase; Azure Storage; Azure DBs
Presentation and action: live dashboards; web/thick client dashboards; search and query; data analytics (Excel); devices to take action
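The flow on this slide (events arrive from a broker, a time-sensitive computation runs, results go to a store or dashboard) can be sketched with Python generators. This is a toy model of the spout-and-bolt idea, not the Storm API; `sensor_spout`, `threshold_bolt`, and `count_bolt` are hypothetical names, and the events and the 85.0 limit are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator

Event = Dict[str, object]


def sensor_spout(events: Iterable[Event]) -> Iterator[Event]:
    """Spout: emits events one at a time, standing in for an event broker."""
    for event in events:
        yield event


def threshold_bolt(stream: Iterable[Event], limit: float) -> Iterator[Event]:
    """Bolt: passes through only readings above the limit."""
    for event in stream:
        if float(event["temperature"]) > limit:
            yield event


def count_bolt(stream: Iterable[Event]) -> Counter:
    """Bolt: counts alerts per device, standing in for a dashboard sink."""
    counts: Counter = Counter()
    for event in stream:
        counts[str(event["device"])] += 1
    return counts


# Made-up sample events; a real topology would consume an unbounded stream.
events = [
    {"device": "a", "temperature": 71.0},
    {"device": "b", "temperature": 88.5},
    {"device": "a", "temperature": 93.2},
    {"device": "b", "temperature": 90.1},
]
alerts = count_bolt(threshold_bolt(sensor_spout(events), limit=85.0))
print(dict(alerts))
```

Each stage consumes the previous stage's output lazily, so an event is processed as soon as it arrives rather than after a batch completes.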

23 HDInsight supports Spark
In-memory processing on multiple workloads
Single execution model for multiple tasks (SQL query, Spark Streaming, machine learning, and graph)
Up to 100x faster processing performance
Developer friendly (Java, Python, Scala)
BI tool of choice (Power BI, Tableau, Qlik, SAP)
Notebook experience (Jupyter/iPython, Zeppelin)
Components: Spark SQL; Spark Streaming; machine learning (MLlib); graph (GraphX)

24 Microsoft makes Hadoop easier
Deep Visual Studio integration
Debug Hive jobs through YARN logs or troubleshoot Storm topologies
Visualize Hadoop clusters, tables, and storage
Submit Hive queries and Storm topologies (C# or Java spouts/bolts)
IntelliSense

25 Azure HDInsight Positioning

26 Why Microsoft Azure?
Azure services: Storage; HDInsight; Data Factory; ML; Stream Analytics; Database; DocumentDB; Search; Event Hubs
On-premises Hadoop: software; appliances
Azure facts: more than 4 trillion objects in Azure; 300,000 to 1M+ requests per second; compute and storage double every 6 months

27 No hardware challenges
HDInsight in the cloud bypasses hardware costs ($0 hardware costs): hardware acquisition; hardware maintenance; performance tuning
HDInsight in the cloud bypasses capacity planning (unlimited scale): spin up any number of Hadoop nodes on demand; go from tens to thousands of nodes

28 Mission-critical, enterprise-ready
Managed Hadoop, backed by SLA
Three nines of availability: 99.9% uptime
HDInsight auto-replicates data
Automatic geo-replication of data
Data replicates only within the same geopolitical boundary (e.g., country or region)
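To make "three nines" concrete, the downtime that a 99.9% availability figure permits can be worked out directly. This is a back-of-the-envelope sketch; the 365-day year and 30-day month are simplifying assumptions, and the actual SLA terms define how uptime is measured.

```python
AVAILABILITY = 0.999  # "three nines", as stated on the slide

MINUTES_PER_YEAR = 365 * 24 * 60   # assumes a 365-day year
MINUTES_PER_MONTH = 30 * 24 * 60   # assumes a 30-day month


def allowed_downtime_minutes(availability: float, period_minutes: int) -> float:
    """Minutes of downtime that still meet the availability target."""
    return (1.0 - availability) * period_minutes


yearly_minutes = allowed_downtime_minutes(AVAILABILITY, MINUTES_PER_YEAR)
monthly_minutes = allowed_downtime_minutes(AVAILABILITY, MINUTES_PER_MONTH)

print(f"Per year:  {yearly_minutes / 60:.2f} hours")
print(f"Per month: {monthly_minutes:.1f} minutes")
```

Under these assumptions, 99.9% uptime allows a little under nine hours of downtime per year, or roughly 43 minutes in a 30-day month.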

29 Maintenance done for you
Minimal IT resources for upgrades/patching
OS patching and security updates are done automatically
Minimal IT resources to update Hadoop versions
New Hadoop versions are released rapidly throughout the year
Always be on the latest version of Hadoop, without effort
HDInsight adds the latest version of Hadoop for you, along with O/S upgrades and O/S patching
Timeline: HDInsight on Hadoop 1.1.2 (Oct 2013); HDInsight on Hadoop 2.2 (April 2014); HDInsight on Hadoop 2.4 (June 2014)

30 Low cost
HDInsight is billed by usage
Clusters can be deleted when no longer used
No additional price for support
Azure Support includes Hadoop support
What usually costs thousands of dollars per node is included
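The effect of usage-based billing can be shown with a small cost comparison. Every number below is hypothetical: the hourly rate, node count, and job duration are placeholders chosen for easy arithmetic, not Azure prices, and the sketch ignores storage and other charges.

```python
def cluster_cost(nodes: int, hours: float, rate_per_node_hour: float) -> float:
    """Compute cost for a cluster billed per node-hour (simplified model)."""
    return nodes * hours * rate_per_node_hour


# Hypothetical inputs, not real pricing.
NODES = 8
RATE_PER_NODE_HOUR = 0.50   # placeholder currency units per node-hour
JOB_HOURS = 6               # cluster created for one job, then deleted
MONTH_HOURS = 30 * 24       # cluster left running for a 30-day month

on_demand_cost = cluster_cost(NODES, JOB_HOURS, RATE_PER_NODE_HOUR)
always_on_cost = cluster_cost(NODES, MONTH_HOURS, RATE_PER_NODE_HOUR)

print(f"Run one job, then delete: {on_demand_cost:.2f}")
print(f"Keep running all month:   {always_on_cost:.2f}")
```

The gap between the two figures is the saving the slide refers to when it notes that clusters can be deleted when no longer used.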

31 © 2016 Microsoft Corporation. All rights reserved
© 2016 Microsoft Corporation. All rights reserved. Microsoft, Windows, Microsoft Azure, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION
