Introduction to Azure Data Lake

Presentation transcript:

Introduction to Azure Data Lake, Oskari Heikkinen

Sponsors

Machine Learning & Data Science Conference, 9/8/2019 1:53 AM
Oskari Heikkinen, Director, Microsoft Azure at CGI; Microsoft P-TSP, Cloud Analytics
oskari.heikkinen@cgi.com, +358 40 561 8481
https://www.linkedin.com/in/oskariheikkinen/
© 2015 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Compute Storage

Azure Data Lake Background: Cosmos at Microsoft

Azure Data Lake Storage (Gen 1): a hyperscale repository for big data analytics workloads
Store any data in its native format
Hadoop file system (HDFS) for the cloud
Enterprise grade
No limits to scale
Optimized for analytic workload performance

Data Lake Storage (Gen 1): Basics
Unlimited storage: no limit on account size; individual files can be petabytes in size.
Optimized for analytics: built for analytics systems that require massive throughput, and optimized for parallel computation over petabytes of data.
High availability: data is replicated automatically, with three copies within a single region; 99.9% SLA.

Data Lake Storage (Gen 1): Data Security
Encryption: TLS for data in transit; transparent server-side encryption, with service-managed keys or customer-managed keys in Azure Key Vault.
Authentication and authorization: Azure Active Directory; POSIX-style access control lists on folders and files.
Auditing: audit logs for all operations; the audit logs can be analysed with U-SQL.
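To make "POSIX-style access control lists" concrete, here is a deliberately simplified sketch of how a permission check on a file or folder might be evaluated. The `Acl` class and `is_allowed` function are hypothetical names invented for illustration; the model ignores groups, masks and default ACLs, and it is not the service's actual evaluation algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class Acl:
    """Hypothetical, simplified ACL for one file or folder."""
    owner: str
    owner_perms: str = "rwx"                          # permissions of the owning user
    named_users: dict = field(default_factory=dict)   # user name -> permission string
    other_perms: str = "---"                          # everyone else


def is_allowed(acl: Acl, user: str, needed: str) -> bool:
    """Return True if `user` holds every permission letter in `needed`."""
    if user == acl.owner:
        perms = acl.owner_perms
    elif user in acl.named_users:
        perms = acl.named_users[user]
    else:
        perms = acl.other_perms
    return all(letter in perms for letter in needed)


# Usage: alice owns the folder, bob may only read and traverse it.
folder_acl = Acl(owner="alice", named_users={"bob": "r-x"})
print(is_allowed(folder_acl, "bob", "r"))   # read access
print(is_allowed(folder_acl, "bob", "w"))   # write access
```

In the real service the identities come from Azure Active Directory, as the slide notes; the string user names here are only placeholders.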

Data Lake Storage (Gen 1): a large file
Files are split into extents, each up to 2 GB in size. For availability and reliability, extents are replicated (three copies). Splitting a file into extents also enables parallelized reads.
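The extent figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below assumes that 2 GB means 2 × 2^30 bytes and that every extent is filled to the maximum, neither of which the slide states; `extent_layout` is a hypothetical helper, not a service API.

```python
EXTENT_SIZE_BYTES = 2 * 1024**3   # assumed: "up to 2 GB" taken as 2 GiB, always full
REPLICAS = 3                      # from the slide: extents are stored as three copies


def extent_layout(file_size_bytes: int) -> dict:
    """Estimate the extent count and raw replicated footprint of one file."""
    if file_size_bytes < 0:
        raise ValueError("file size must be non-negative")
    # Integer ceiling division avoids floating-point rounding on huge files.
    extents = (file_size_bytes + EXTENT_SIZE_BYTES - 1) // EXTENT_SIZE_BYTES
    return {
        "extents": extents,
        "max_parallel_readers": extents,            # at most one reader per extent
        "raw_bytes_stored": file_size_bytes * REPLICAS,
    }


# Usage: a 1 TiB file under these assumptions.
layout = extent_layout(1024**4)
print(layout["extents"], "extents,", layout["raw_bytes_stored"], "raw bytes")
```

Under these assumptions the number of extents is also an upper bound on how many readers can work on the file at the same time, which is why large files are attractive for parallel analytics.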

Large files provide parallelism opportunities: each extent of a file can be processed by its own vertex.

Parallel writing: the front-end machines of a web service upload their log files to Azure Data Lake simultaneously.

Azure Data Lake Storage (Gen 1) Architecture

Key takeaway?

Data Lake Storage Gen1 compared with Azure Blob Storage
Scenarios: Gen1 is optimized for analytics; Blob Storage is general-purpose bulk storage.
Structure: Gen1 has a hierarchy on a file system; Blob Storage is a flat-namespace object store.
Size limits: Gen1 No*; Blob Storage ~4.77 TB per file, 500 TB per storage account.
Geo-redundancy: Gen1 LRS; Blob Storage LRS, ZRS, GRS, RA-GRS.
HDFS client: Gen1 Yes; Blob Storage Yes.
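The difference between a hierarchical file system and a flat namespace shows up most clearly when a folder is renamed. The following is a conceptual sketch only: plain Python dictionaries stand in for the two storage models, and both function names are hypothetical rather than part of any Azure SDK.

```python
def rename_prefix_flat(store: dict, old: str, new: str) -> int:
    """Rename a 'folder' in a flat object store keyed by full path.

    Every object under the prefix must be rewritten; returns how many were.
    """
    moved = [key for key in store if key.startswith(old + "/")]
    for key in moved:
        store[new + key[len(old):]] = store.pop(key)
    return len(moved)


def rename_dir_hierarchical(tree: dict, parent_path: list, old: str, new: str) -> int:
    """Rename a directory in a nested tree; only one entry changes."""
    node = tree
    for name in parent_path:
        node = node[name]
    node[new] = node.pop(old)
    return 1


# Usage: the same rename touches two objects in the flat model, one entry in the tree.
flat = {"logs/2019/a.csv": b"", "logs/2019/b.csv": b"", "other/c.csv": b""}
tree = {"logs": {"2019": {"a.csv": b"", "b.csv": b""}}, "other": {"c.csv": b""}}
print(rename_prefix_flat(flat, "logs", "archive"))
print(rename_dir_hierarchical(tree, [], "logs", "archive"))
```

In the flat model the cost of the rename grows with the number of objects under the prefix, whereas in the hierarchical model it stays constant, which matters for analytics jobs that move whole output folders into place.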

Data Lake Storage Gen1 compared with Azure Blob Storage (security)
Authentication: Gen1 Azure Active Directory; Blob Storage access keys / SAS tokens.
Authorization: Gen1 POSIX-style ACLs; Blob Storage access keys / SAS tokens.
Data encryption: Gen1 transparent server-side encryption; Blob Storage Storage Service Encryption.
Connection protocols: Gen1 HTTPS; Blob Storage HTTP / HTTPS.
Firewall: Gen1 Yes; Blob Storage Yes.

Data Lake Storage Gen 2

Data Lake Gen2: Combining the best of both?

Blob Storage, Data Lake Gen1 and Data Lake Gen2 compared
Authentication: Blob Storage access keys / SAS tokens; Gen1 Azure AD; Gen2 Azure AD.
Structure: Blob Storage flat namespace; Gen1 hierarchical file system; Gen2 both.
Size limits: Blob Storage ~4.77 TB per file, 500 TB per account; Gen1 No*; Gen2 ~4.77 TB per file.
Geo-redundancy: Blob Storage LRS, ZRS, GRS, RA-GRS; Gen1 LRS; Gen2 LRS, ZRS, GRS, RA-GRS.
Hot/cold storage tiers: Blob Storage Yes; Gen1 No; Gen2 Yes.
Price*: Blob Storage 16.6 € / TB; Gen1 32.9 € / TB; Gen2 16.6 € / TB.
*Prices per month in West Europe for LRS on 24.2.2019.
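As a worked example of what the per-terabyte prices above mean for a given data volume, the sketch below multiplies them out. It uses only the figures from the slide (monthly LRS prices in West Europe on 24.2.2019) and assumes cost scales linearly with stored volume; transaction and egress charges are not modelled, and `monthly_cost_eur` is a hypothetical helper.

```python
# Prices from the slide, in euros per TB per month (LRS, West Europe, 24.2.2019).
PRICE_EUR_PER_TB_MONTH = {
    "Blob Storage": 16.6,
    "Data Lake Gen1": 32.9,
    "Data Lake Gen2": 16.6,
}


def monthly_cost_eur(service: str, terabytes: float) -> float:
    """Storage-only monthly cost, assuming linear pricing (an assumption)."""
    if terabytes < 0:
        raise ValueError("terabytes must be non-negative")
    return round(PRICE_EUR_PER_TB_MONTH[service] * terabytes, 2)


# Usage: compare the three services for 100 TB of stored data.
for name in PRICE_EUR_PER_TB_MONTH:
    print(f"{name}: {monthly_cost_eur(name, 100)} EUR per month")
```

At these list prices Gen1 storage costs roughly twice as much as Blob Storage or Gen2 for the same volume, which is one reason the deck positions Gen2 as combining the best of both.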

Storage Best Practices
Design folder hierarchy structure
Split into several services
Service level limits
Gen2: disaster recovery

Services for processing Big Data

HDInsight

Azure HDInsight: Hadoop as a Service on Azure
Fully managed Hadoop and Spark for the cloud
100% open source Hortonworks Data Platform
Cluster up and running in 20 minutes
Supported by Microsoft with a 99.9% SLA
Familiar BI tools for analysis
Open source notebooks for interactive data science
63% lower TCO than deploying Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”

History: why do we have Big Data technologies today?
MapReduce compared with RDBMS
Data volume: MapReduce petabyte scale; RDBMS gigabyte scale.
Access mode: MapReduce batch; RDBMS interactive and batch.
Updates: MapReduce write once, read many; RDBMS write many, read many.
Structure: MapReduce schema-on-read; RDBMS schema-on-write.
Integrity: MapReduce low; RDBMS high.
Scaling: MapReduce linear; RDBMS nonlinear.
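For readers who have not seen the MapReduce model in action, the classic word count can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop code; all function names are chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line: str) -> list:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs) -> dict:
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list) -> tuple:
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines) -> dict:
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Usage: the input is read as raw text and interpreted only at processing time.
print(word_count(["big data on Azure", "Big lake of data"]))
```

Because each line is mapped independently and each key is reduced independently, both phases can be spread across many machines, which is where the linear scaling in the comparison comes from. The raw text input is also a small instance of schema-on-read: structure is imposed when the data is processed, not when it is stored.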

Apache Hive: Enterprise Data Warehousing
2006: Hive incubated at Facebook
2010: Top-level Apache project
2012: ODBC/JDBC drivers released
2013: Hive introduces Tez, vectorization, ORC
2015: Hive introduces ACID
2016: In-memory through LLAP

Execution engines and LLAP

Azure Databricks

Azure Databricks: Spark as a Service on Azure
Azure Databricks is a first-party service on Azure. Unlike with other clouds, it is not an Azure Marketplace offering or a third-party hosted service. Azure Databricks is integrated seamlessly with Azure services:
Azure Portal: the service can be launched directly from the Azure Portal.
Azure Storage Services: directly access data in Azure Blob Storage and Azure Data Lake Store.
Azure Active Directory: used for user authentication, eliminating the need to maintain two separate sets of users in Databricks and Azure.
Azure SQL DW and Azure Cosmos DB: enable you to combine structured and unstructured data for analytics.
Apache Kafka for HDInsight: enables you to use Kafka as a streaming data source or sink.
Azure Billing: you get a single bill from Azure, which eliminates the need to create a separate account with Databricks.
Azure Power BI: for rich data visualization.

Apache Spark: a unified, open source, parallel data processing framework for Big Data analytics
Spark unifies: batch processing, interactive SQL, real-time processing, machine learning, deep learning, graph processing.
Cluster managers: YARN, Mesos, Standalone Scheduler.
Libraries: Spark MLlib (machine learning), Spark Structured Streaming (stream processing).

General Spark Cluster Architecture
Components: data sources (HDFS, SQL, NoSQL, …), cluster manager, worker nodes (cache, tasks), driver program (SparkContext).
The driver runs the user's main function and executes the various parallel operations on the worker nodes. The results of the operations are collected by the driver.
The worker nodes read and write data from/to data sources, including HDFS. Worker nodes also cache transformed data in memory as RDDs (Resilient Distributed Datasets).
Worker nodes and the driver node execute as VMs in public clouds (AWS, Azure).
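The division of labour between driver and workers can be imitated in plain Python. The sketch below is an analogy, not Spark code: threads from the standard library stand in for worker nodes, there is no cluster manager, caching or fault tolerance, and the names `partition`, `task` and `driver` are chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data: list, num_partitions: int) -> list:
    """Split data into contiguous partitions of near-equal size."""
    size, remainder = divmod(len(data), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < remainder else 0)
        parts.append(data[start:end])
        start = end
    return parts


def task(part: list) -> int:
    """The work one worker performs on its partition: sum of squares."""
    return sum(x * x for x in part)


def driver(data: list, num_workers: int = 4) -> int:
    """Driver: partition the data, hand tasks to workers, collect the results."""
    parts = partition(data, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_results = list(pool.map(task, parts))
    return sum(partial_results)   # final aggregation happens on the driver


# Usage: sum of squares of 0..9 computed across four simulated workers.
print(driver(list(range(10))))
```

As in the architecture described above, the workers only ever see their own partition, and the driver is the single place where partial results come together.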

Catalyst query optimizer

DEMO: HDInsight & Databricks

External Metastore

Call to Action
Read how these work: HDFS, YARN, Spark.
Learning by doing: start playing around with the services.

Thank you! 