Modelling data & best practices for Azure Cosmos DB SQL API

BRK3184: Modelling data & best practices for Azure Cosmos DB SQL API. 2019, Manchester Central. Theo van Kraay, Data Solution Architect, Microsoft.

Why is data modelling important? Performance & Cost

How many of you have…? Used Cosmos DB? Proof of Concept? Production? Modelled data in relational databases? Modelled data in non-relational databases? (4/22/2019 8:03 AM. © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.)

Agenda: Cosmos DB (intro). Data modelling strategies. Choosing the right partition key. Optimize your Cosmos DB query performance. Indexing.

Azure Cosmos DB: a globally distributed, massively scalable, multi-model database service. Data models: key-value, column-family, document, graph. APIs shown: MongoDB, Table API. Guaranteed low latency at the 99th percentile. Elastic scale-out of storage & throughput. Five well-defined consistency models. Turnkey global distribution. Comprehensive SLAs.

Data modelling

Data modelling is just as important as with relational data! There’s still a schema, it is just enforced at the application level. Plan upfront for best performance.
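Because the schema is enforced at the application level, one option is to validate each document before it is written. The following is a minimal sketch of that idea, not something shown in the talk: `MENU_ITEM_SCHEMA` and `validate` are hypothetical names, and the field set is an assumption for a menu-item document.

```python
from typing import Any, Dict, List

# Hypothetical schema for a menu-item document: field name -> expected type.
MENU_ITEM_SCHEMA: Dict[str, type] = {
    "id": str,
    "ItemName": str,
    "ItemDescription": str,
}


def validate(doc: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a list of problems; an empty list means the document conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in doc:
            problems.append(f"missing field: {field}")
        elif not isinstance(doc[field], expected):
            problems.append(f"{field}: expected {expected.__name__}")
    return problems
```

An application would call `validate` before every write and reject the document when the returned list is not empty.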

2 Extremes. Normalize everything: ORM, SQL. Embed as 1 piece: NoSQL.

Contoso Restaurant menu
Relational modelling: each menu item has a reference to a category.
Menu Item: ID, Item Name, Item Description, Item Category ID
Category: Category Name, Category Description
Non-relational modelling: each menu item is a self-contained document.
{
  "ID": 1,
  "ItemName": "hamburger",
  "ItemDescription": "cheeseburger, no cheese",
  "Category": {
    "Name": "sandwiches",
    "Description": "2 pieces of bread + filling"
  }
}

Modelling challenges: “Where are my joins?!?” #1: To de-normalize, or normalize? To embed, or to reference? #2: Put data types in the same collection, or in different ones?

Modelling challenge #1: To embed or reference?
Embed:
{
  "menuID": 1,
  "menuName": "Lunch menu",
  "items": [
    {"ID": 1, "ItemName": "hamburger", "ItemDescription": ...},
    {"ID": 2, "ItemName": "cheeseburger", "ItemDescription": ...}
  ]
}
Reference:
{
  "menuID": 1,
  "menuName": "Lunch menu",
  "items": [ {"ID": 1}, {"ID": 2} ]
}
{"ID": 1, "ItemName": "hamburger", "ItemDescription": ...}
{"ID": 2, "ItemName": "cheeseburger", "ItemDescription": ...}
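The practical difference between the two shapes is the number of reads needed to render the menu. The sketch below counts reads against an in-memory stand-in for a container; `CountingStore` and the document ids are hypothetical, and the item descriptions elided on the slide are simply left out.

```python
class CountingStore:
    """In-memory stand-in for a container that counts document reads."""

    def __init__(self, docs):
        self.docs = docs
        self.reads = 0

    def read(self, doc_id):
        self.reads += 1
        return self.docs[doc_id]


# Hypothetical document ids; the bodies mirror the menu example.
DOCS = {
    "menu-embedded": {"menuID": 1, "menuName": "Lunch menu", "items": [
        {"ID": 1, "ItemName": "hamburger"},
        {"ID": 2, "ItemName": "cheeseburger"}]},
    "menu-referenced": {"menuID": 1, "menuName": "Lunch menu",
                        "items": [{"ID": 1}, {"ID": 2}]},
    "item-1": {"ID": 1, "ItemName": "hamburger"},
    "item-2": {"ID": 2, "ItemName": "cheeseburger"},
}


def item_names_embedded(store):
    """One read: the item details travel with the menu."""
    menu = store.read("menu-embedded")
    return [item["ItemName"] for item in menu["items"]]


def item_names_referenced(store):
    """One read for the menu plus one read per referenced item."""
    menu = store.read("menu-referenced")
    return [store.read(f"item-{ref['ID']}")["ItemName"] for ref in menu["items"]]
```

With two items, the embedded shape costs one read and the referenced shape costs three; the gap grows with the number of referenced items.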

When to embed #1: “Data that is queried together, should live together”. E.g. in a recipe, ingredients are always queried with the item.
{
  "ID": 1,
  "ItemName": "hamburger",
  "ItemDescription": "cheeseburger, no cheese",
  "Category": "sandwiches",
  "CategoryDescription": "2 pieces of bread + filling",
  "Ingredients": [
    {"ItemName": "bread", "calorieCount": 100, "Qty": "2 slices"},
    {"ItemName": "lettuce", "calorieCount": 10, "Qty": "1 slice"},
    {"ItemName": "tomato", "calorieCount": 10, "Qty": "1 slice"},
    {"ItemName": "patty", "calorieCount": 700, "Qty": "1"}
  ]
}

When to embed #2: child data is dependent on a parent. Items Ordered depends on Order.
{
  "id": "Order1",
  "customer": "Customer1",
  "orderDate": "2018-09-26",
  "itemsOrdered": [
    {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1},
    {"ID": 2, "ItemName": "cheeseburger", "Price": 9.50, "Qty": 499}
  ]
}

When to embed #3: 1:1 relationship. All customers have an email, a phone and a loyalty number, so these are 1:1 relationships.
{
  "id": "1",
  "name": "Alice",
  "email": "alice@contoso.com",
  "phone": "555-5555",
  "loyaltyNumber": 13838359,
  "addresses": [
    {"street": "1 Contoso Way", "city": "Seattle"},
    {"street": "15 Fabrikam Lane", "city": "Orlando"}
  ]
}

When to embed #4, #5: similar rate of updates (does the data change at the same pace?) and 1:few relationships. Email and addresses don’t change too often.
{
  "id": "1",
  "name": "Alice",
  "email": "alice@contoso.com",
  "addresses": [
    {"street": "1 Contoso Way", "city": "Seattle"},
    {"street": "15 Fabrikam Lane", "city": "Orlando"}
  ]
}

When to embed, summary: Data from entities is queried together. Child data is dependent on a parent. 1:1 relationship. Similar rate of updates (does the data change at the same pace?). 1:few (the set of values is bounded). Embedding usually provides better read performance. Follow the guidelines above to minimize the trade-off in write performance.

Demo

When to reference #1: 1 : many (unbounded relationship)
Embed:
{
  "id": "1",
  "name": "Alice",
  "email": "alice@contoso.com",
  "Orders": [
    {
      "id": "Order1",
      "orderDate": "2018-09-18",
      "itemsOrdered": [
        {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1},
        {"ID": 2, "ItemName": "cheeseburger", "Price": 9.50, "Qty": 499}
      ]
    },
    ...
    {
      "id": "OrderNfinity",
      "orderDate": "2018-09-20",
      "itemsOrdered": [
        {"ID": 1, "ItemName": "hamburger", "Price": 9.50, "Qty": 1}
      ]
    }
  ]
}
Embed:
{ "id": "1", "name": "Alice", "email": "alice@contoso.com", "Orders": ["Order1", "Order2", ..., "OrderNfinity"] }

When to reference #1: 1 : many (unbounded relationship)
{ "id": "Order1", "orderDate": "2018-09-18", "itemsOrdered": [ {"ID": 2, "Name": "cheeseburger", "Price": 9.50, "Qty": 499} ] }
…
{ "id": "Order100", "orderDate": "2018-09-20", "itemsOrdered": [ {"ID": 1, "Name": "hamburger", "Price": 9.50, "Qty": 1} ] }
{ "id": "1", "name": "Alice", "email": "alice@contoso.com", "Orders": ["Order1", ..., "Order100"] }
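To see why an unbounded relationship argues against embedding, it helps to measure how the serialized customer document grows with the number of embedded orders. This is a rough sketch under stated assumptions: `MAX_ITEM_BYTES` is an assumed per-item size limit of 2 MB (check the current service limits), and `make_order` and `embedded_customer_size` are hypothetical helpers.

```python
import json

# Assumed per-item size limit; verify against the current service limits.
MAX_ITEM_BYTES = 2 * 1024 * 1024


def make_order(n):
    """Build one small order document, shaped like the example orders."""
    return {"id": f"Order{n}", "orderDate": "2018-09-18",
            "itemsOrdered": [{"ID": 1, "ItemName": "hamburger",
                              "Price": 9.50, "Qty": 1}]}


def embedded_customer_size(order_count):
    """Serialized size in bytes of a customer with all orders embedded."""
    customer = {"id": "1", "name": "Alice", "email": "alice@contoso.com",
                "Orders": [make_order(n) for n in range(1, order_count + 1)]}
    return len(json.dumps(customer).encode("utf-8"))


def order_size(n):
    """Serialized size in bytes of a single, separately stored order."""
    return len(json.dumps(make_order(n)).encode("utf-8"))
```

The embedded document grows linearly with the order count and eventually passes the assumed limit, while each referenced order document stays small no matter how many orders exist.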

When to reference #2: data changes at different rates. Number of orders and amount spent will likely change faster than email, so reference these.
{
  "id": "1",
  "name": "Alice",
  "email": "alice@contoso.com",
  "stats": [
    {"TotalNumberOrders": 100},
    {"TotalAmountSpent": 550}
  ]
}

When to reference #3: many : many relationships
{ "id": "speaker1", "name": "Alice", "email": "alice@contoso.com", "sessions": [ {"id": "session1"}, {"id": "session2"} ] }
{ "id": "session1", "name": "Modelling Data 101", "speakers": [ {"id": "speaker1"}, {"id": "speaker2"} ] }
{ "id": "speaker2", "name": "Bob", "email": "bob@contoso.com", "sessions": [ {"id": "session1"}, {"id": "session4"} ] }

When to reference #4: what is referenced is heavily referenced by many others. Here, session is referenced by speakers and attendees. This allows you to update Session independently.
{ "id": "speaker1", "name": "Alice", "email": "alice@contoso.com", "sessions": [ {"id": "session1"}, {"id": "session2"} ] }
{ "id": "session1", "name": "Modelling Data 101", "speakers": [ {"id": "speaker1"}, {"id": "speaker2"} ] }
{ "id": "attendee1", "name": "Eve", "email": "eve@contoso.com", "bookmarkedSessions": [ {"id": "session1"}, {"id": "session4"} ] }

When to reference, summary: 1 : many (unbounded relationship). many : many relationships. Data changes at different rates. What is referenced is heavily referenced by many others. Referencing typically provides better write performance, but may require more network calls for reads.

But wait, you can do both! Embed frequently used data, but use a reference to get all info.
Speaker:
{ "id": "speaker1", "name": "Alice", "email": "alice@contoso.com", "address": "1 Microsoft Way", "sessions": [ {"id": "session1"}, {"id": "session2"} ] }
Session:
{ "id": "s1", "name": "Modelling Data 101", "speakers": [ {"id": "speaker1", "name": "Alice"}, {"id": "speaker2", "name": "Bob"} ] }
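The cost of the hybrid approach is that an embedded copy has to be kept in step with its source. The sketch below shows one way an application might propagate a speaker rename to the sessions that embed the name; `rename_speaker` and the in-memory dictionaries are hypothetical stand-ins for real container reads and writes.

```python
# Hypothetical in-memory stand-ins for the speaker and session documents.
SPEAKERS = {
    "speaker1": {"id": "speaker1", "name": "Alice",
                 "sessions": [{"id": "session1"}, {"id": "session2"}]},
}
SESSIONS = {
    "session1": {"id": "session1", "name": "Modelling Data 101",
                 "speakers": [{"id": "speaker1", "name": "Alice"},
                              {"id": "speaker2", "name": "Bob"}]},
    "session2": {"id": "session2", "name": "Partitioning 101",
                 "speakers": [{"id": "speaker1", "name": "Alice"}]},
}


def rename_speaker(speakers, sessions, speaker_id, new_name):
    """Update the speaker, then every session that embeds the speaker's name.

    Returns the number of embedded copies that were rewritten.
    """
    speaker = speakers[speaker_id]
    speaker["name"] = new_name
    touched = 0
    for ref in speaker["sessions"]:
        for embedded in sessions[ref["id"]]["speakers"]:
            if embedded["id"] == speaker_id:
                embedded["name"] = new_name
                touched += 1
    return touched
```

One rename becomes one write to the speaker plus one write per session that embeds the name, which is why this pattern suits data that is read often and changed rarely.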

Modelling challenge #2: Heterogeneous vs homogeneous data. Should I put my data in the same collection, or in multiple collections? Homogeneous: 1 type per collection. Heterogeneous: multiple types per collection.

To improve performance, consider multiple types in the same collection. Introduce a “type” property.
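As a rough illustration of the “type” property, the sketch below keeps customer and order documents in one list that stands in for a single collection, and filters by type within one customer's data. The `customerId` partition key, the documents and the `query` helper are all assumptions made for the example.

```python
# Hypothetical documents sharing one collection, partitioned by "customerId".
DOCS = [
    {"id": "customer1", "type": "customer", "customerId": "customer1", "name": "Alice"},
    {"id": "Order1", "type": "order", "customerId": "customer1", "orderDate": "2018-09-18"},
    {"id": "Order2", "type": "order", "customerId": "customer1", "orderDate": "2018-09-20"},
    {"id": "customer2", "type": "customer", "customerId": "customer2", "name": "Bob"},
]


def query(docs, customer_id, doc_type=None):
    """Return one customer's documents, optionally filtered by type.

    Roughly what a SQL API query such as
    SELECT * FROM c WHERE c.customerId = "customer1" AND c.type = "order"
    expresses.
    """
    return [d for d in docs
            if d["customerId"] == customer_id
            and (doc_type is None or d["type"] == doc_type)]
```

Calling `query(DOCS, "customer1")` returns the customer together with its orders in one pass, while `query(DOCS, "customer1", "order")` narrows the result to the orders only.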

Demo – Benefits of Introducing “type” Property

Choosing the Right Partitioning Key

Cosmos DB Container (e.g. Collection). Partitioning scheme: top-most design decision in Cosmos DB.

Why is the choice of partition key so important? It enables your data in Cosmos DB to scale. It has a large impact on the performance of the system.

What can go wrong? Hot partitions. Exceeding the 10GB logical partition limit. A choice that forces many cross-partition queries for the workload.

Partitioning – behind the scenes Logical partition: Stores all data associated with the same partition key value Physical partition: Fixed amount of reserved SSD-backed storage + compute. Cosmos DB distributes logical partitions among a smaller number of physical partitions. From your perspective: define 1 partition key per container

Cosmos DB Container (e.g. Collection) Partition Key: User Id

Cosmos DB Container (e.g. Collection). Partition Key: User Id. Logical partitioning abstraction: hash(User Id). Behind the scenes: physical partition sets. Pseudo-random distribution of data over the range of possible hashed values.
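The hash-then-range idea can be sketched in a few lines. This is only an illustration of the principle: SHA-256 is a stand-in for whatever hash the service actually uses, the hash space is assumed to be 32 bits wide, and `hash_key`, `physical_partition` and `distribution` are hypothetical names.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2**32  # assumed width of the hash space, for illustration only


def hash_key(partition_key):
    """Map a partition key value to an integer in [0, HASH_SPACE)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def physical_partition(partition_key, partition_count):
    """Split the hash space into equal contiguous ranges; return the range index."""
    range_size = HASH_SPACE // partition_count
    # min() keeps the last few hash values inside the final range.
    return min(hash_key(partition_key) // range_size, partition_count - 1)


def distribution(keys, partition_count):
    """Count how many keys land in each physical partition."""
    return Counter(physical_partition(k, partition_count) for k in keys)
```

The same key always maps to the same range, and a large set of distinct keys spreads roughly evenly across the ranges, which is the pseudo-random distribution the slide describes.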

Behind the Scenes: Physical Partition Sets. hash(User Id) maps to Range 1, Range 2, …, Range n, which are held by Physical Partition 1, Physical Partition 2, …, Physical Partition n. Example user IDs spread across the ranges: Dharma, Andrew, Shireesh, Karthik, Rimma, Mike, …, Bob, Alice, Carol, … Frugal # of partitions based on actual storage and throughput needs (yielding scalability with low total cost of ownership). What happens when partitions need to grow?

Behind the Scenes: Physical Partition Sets. Partition ranges can be dynamically sub-divided to seamlessly grow the database as the application grows, while sedulously maintaining high availability. Here Range X is sub-divided into Range X1 and Range X2, and Partition X (Dharma, Shireesh, Karthik, Rimma, Alice, Carol, …) becomes Partition X1 + Partition X2. Best of all: partition management is completely taken care of by the system. You don’t have to lift a finger… the database takes care of you.
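A split of one hash range into two halves can be sketched as follows. This is a toy model under stated assumptions, not the service's implementation: SHA-256 stands in for the real hash, the hash space is assumed to be 32 bits, and `owner`, `split`, `RANGE_1` and `RANGE_X` are hypothetical names.

```python
import hashlib


def hash_key(key):
    """Map a key to a 32-bit integer (SHA-256 is a stand-in hash)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:4], "big")


def owner(key, ranges):
    """Return the [lo, hi) range that holds the key's hash."""
    h = hash_key(key)
    for lo, hi in ranges:
        if lo <= h < hi:
            return (lo, hi)
    raise ValueError("ranges do not cover the hash space")


def split(ranges, target):
    """Replace `target` with its two halves; every other range is untouched."""
    lo, hi = target
    mid = (lo + hi) // 2
    result = []
    for r in ranges:
        result.extend([(lo, mid), (mid, hi)] if r == target else [r])
    return result


RANGE_1 = (0, 2**31)
RANGE_X = (2**31, 2**32)
USERS = ["Dharma", "Andrew", "Shireesh", "Karthik", "Rimma",
         "Mike", "Bob", "Alice", "Carol"]

before = {u: owner(u, [RANGE_1, RANGE_X]) for u in USERS}
after = {u: owner(u, split([RANGE_1, RANGE_X], RANGE_X)) for u in USERS}
```

Only the keys that lived in the split range move, each to exactly one of the two halves; keys in other ranges keep their owner, which is what lets a split happen without disturbing the rest of the data.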

Partition Design. IMPORTANT TO SELECT THE “RIGHT” PARTITION KEY: partition keys act as a means for efficiently routing queries and as a boundary for multi-record transactions. KEY MOTIVATIONS: distribute requests; distribute storage; intelligently route queries for efficiency.

Partition Design. Example: Contoso Connected Car. EXAMPLE SCENARIO: Contoso Connected Car is a vehicle telematics company. They are planning to store vehicle telemetry data from millions of vehicles every second in Azure Cosmos DB to power predictive maintenance, fleet management, and driver risk analysis. The partition key we select will be the scope for multi-record transactions. WHAT ARE A FEW POTENTIAL PARTITION KEY CHOICES? Vehicle Model; Current Time; Device Id; Composite Key (Device ID + Current Time).

Partition Key Choices. Example: Contoso Connected Car.
VEHICLE MODEL (e.g. Model A): Most auto manufacturers only have a couple dozen models. This will create a fixed number of logical partition key values, and is potentially the least granular option. Depending on how uniform sales are across the various models, this introduces possibilities for hot partition keys on both storage and throughput. [Charts: storage distribution and throughput distribution across models Rodel, Prisma, Turlic, Coash]
CURRENT MONTH (e.g. 2018-04): Auto manufacturers have transactions occurring throughout the year. This will create a more balanced distribution of storage across partition key values. However, most business transactions occur on recent data, creating the possibility of a hot partition key for the current month on throughput. [Charts: storage distribution and throughput distribution across months 2018-03, 2018-04, 2018-05, 2018-06]

Partition Key Choices. Example: Contoso Connected Car.
DEVICE ID (e.g. Device123): Each car would have a unique device ID. This creates a large number of partition key values and would have a significant amount of granularity. Depending on how many transactions occur per vehicle, it is possible for a specific partition key to reach the storage limit per partition key. [Charts: storage distribution and throughput distribution across device IDs C49E27EB, FE53547A, E84906BE, 4376B4BC]
COMPOSITE KEY (Device ID + Time): This composite option increases the granularity of partition key values by combining the current month and a device ID. Specific partition key values have less of a risk of hitting storage limitations as they only relate to a single month of data for a specific vehicle. Throughput in this example would be distributed more to logical partition key values for the current month. [Charts: storage distribution and throughput distribution across keys C49E27EB-2018-05, C49E27EB-2018-06, 4376B4BC-2018-05, 4376B4BC-2018-06]
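A composite key of this kind is usually built by the application and stored as its own property on each document. The sketch below builds keys in the `C49E27EB-2018-05` form shown in the chart labels and compares how many documents fall under each key value; the readings are made-up sample data and `composite_key` and `documents_per_key` are hypothetical helpers.

```python
from collections import Counter
from datetime import datetime


def composite_key(device_id, timestamp):
    """Build a synthetic partition key such as 'C49E27EB-2018-05'."""
    return f"{device_id}-{timestamp:%Y-%m}"


def documents_per_key(readings, key_fn):
    """Count how many readings fall under each partition key value."""
    return Counter(key_fn(r) for r in readings)


# Hypothetical telemetry readings: (device id, timestamp).
READINGS = [
    ("C49E27EB", datetime(2018, 5, 1, 8, 0)),
    ("C49E27EB", datetime(2018, 5, 2, 8, 0)),
    ("C49E27EB", datetime(2018, 6, 1, 8, 0)),
    ("4376B4BC", datetime(2018, 6, 1, 9, 0)),
]

by_device = documents_per_key(READINGS, lambda r: r[0])
by_composite = documents_per_key(READINGS, lambda r: composite_key(*r))
```

Keying by device alone puts every reading for a vehicle under one value, while the composite key starts a new value each month, so no single key value accumulates a vehicle's whole history.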

Best Practices: design goals for choosing a good partition key. Distribute the overall request + storage volume. Avoid “hot” partition keys. Partition key is the scope for [efficient] queries and transactions. Queries can be intelligently routed via partition key. Omitting the partition key on a query requires fan-out.
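The difference between a routed query and a fan-out can be shown with a toy container. This is a sketch only: `PartitionedContainer` is a hypothetical in-memory model, `userId` is an assumed partition key, and CRC32 stands in for the service's hash because it is stable across runs.

```python
import zlib


class PartitionedContainer:
    """Toy container: documents are spread over partitions by hashed key."""

    def __init__(self, partition_count, key_field="userId"):
        self.partitions = [[] for _ in range(partition_count)]
        self.key_field = key_field

    def _index(self, key_value):
        # crc32 is a stand-in hash, chosen because it is stable across runs.
        return zlib.crc32(key_value.encode("utf-8")) % len(self.partitions)

    def insert(self, doc):
        self.partitions[self._index(doc[self.key_field])].append(doc)

    def query(self, predicate, partition_key=None):
        """Return (matching documents, number of partitions visited)."""
        if partition_key is not None:
            targets = [self.partitions[self._index(partition_key)]]  # routed
        else:
            targets = self.partitions  # fan-out to every partition
        matches = [d for part in targets for d in part if predicate(d)]
        return matches, len(targets)
```

Both calls return the same documents, but the query that supplies the partition key visits one partition while the one that omits it visits all of them.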

Steps for success. Ballpark scale needs (size/throughput). Understand the workload: # of reads/sec vs writes/sec. Use the 80/20 rule to help optimize the bulk of the workload. For reads, understand the top X queries (look for common filters). For writes, understand transactional needs. General tips: Don’t be afraid of having too many partition keys. Partition keys are logical. More partition keys => more scalability.

Performance tuning

General performance tips. Use a single CosmosClient. Provide the partition key. Use direct mode if possible. Relax consistency as needed. Use point reads, readDocumentAsync(), if possible. Tune the indexing policy for faster writes. Cross-partition queries: tune MaxDegreeOfParallelism (number of partitions that can be queried in parallel) and MaxBufferedItemCount (number of items buffered client side; useful for cross-partition aggregates).
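The first tip, a single client per application, can be sketched without any SDK. The example below is an illustration of the pattern only: `FakeCosmosClient` is a hypothetical stand-in for a real SDK client, `get_client` is a hypothetical helper, and the endpoint is a placeholder.

```python
from functools import lru_cache


class FakeCosmosClient:
    """Stand-in for a real SDK client; counts how many are constructed."""

    instances = 0

    def __init__(self, endpoint):
        FakeCosmosClient.instances += 1
        self.endpoint = endpoint


@lru_cache(maxsize=None)
def get_client(endpoint):
    """Create the client on first use and reuse it for every later call."""
    return FakeCosmosClient(endpoint)


first = get_client("https://example.invalid")
second = get_client("https://example.invalid")
```

Both calls return the same object, so connections and cached metadata held by a real client would be shared across requests instead of being rebuilt each time.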

Thank you! Email us: askcosmosdb@microsoft.com