A Deep-Dive with Azure DocumentDB: Partitioning, Data Modelling, and Geo Replication Andrew Liu andrl@microsoft.com.


Session objectives and takeaways
Yesterday's session:
  Why do I <3 Azure DocumentDB, and what does it solve? (Hint: Volume, Velocity, Variety)
  Overview and evolution of the NoSQL landscape
  What is Azure DocumentDB, and when is it a good fit?
Objectives for today:
  Everything you need to know to be successful on DocumentDB!
  Best practices, including: Data Modeling, Queries & Indexing, Geo-Replication
  Roadmap!
© 2014 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

A brief recap for those who missed yesterday…

Gartner’s 3Vs of Big Data
Volume
  How can my app deal with a massive volume of data and throughput?
  How do I elastically scale my database?
Velocity
  How do I write responsive apps?
  How do I make data available where my users are?
  How do I write highly available apps?
Variety
  How do I deal with schema changes?
  How do I iterate rapidly?
  What data models work at scale?

Common scenarios + use cases
Retail
  Product Catalog
  Ordering and Payment Pipelines
  Personalization
  Customer 360 View
IoT / Sensor Data
  Telemetry + Event Store
  Telematics
  Device Registry
Ad Technology + Social Analytics
  User behavior telemetry
  Recommendations
Gaming
  Multiplayer Games
  Social Gameplay
  Leaderboards
  Game Analytics

DocumentDB Capabilities
Elastic and limitless global scale
  Independently scale throughput and storage, locally and globally
  Transparent partition management and routing
SQL and JavaScript, schema free
  Automatic tree-path-based indexing
  No schemas or secondary indices required upfront
  SQL and JavaScript language-integrated queries
  Hash, range, and spatial
  Multi-document, JavaScript language-integrated transactions
Guaranteed low latency
  <10 ms reads / <15 ms writes at P99
  Requests are served from the local region
  Write-optimized, latch-free database engine designed for SSDs and low-latency access
  Synchronous and automatic document indexing at sustained ingestion rates
Multiple consistency levels
  Multiple well-defined consistency levels
  Intuitive programming model for relaxed consistency models
  Clear PACELC tradeoffs and 99.99% availability SLAs

DocumentDB 101 (ish)

Architecture (Behind the Scenes)
DocumentDB service is manifested as an overlay network with ring topology (aka federation)
Resources are partitioned; they span federations, datacenters and regions
Partitions are made highly available by replica-sets
A replica in turn hosts the DocumentDB database engine and implements the replication protocol and local persistence
[Diagram labels: region, datacenter, federation, FD (physical); resource, partition-set, partition, replica (logical)]

Resource Model
Partitioning model
  Grid partitioning: horizontal based on hash/range, and vertical across regions
  Each partition is made highly available via a replica set
Resource model
  Stateless interaction (HTTP and TCP)
  Hierarchical overlay atop the partitioning model
Resources
  Identified by their logical and stable URI
  Represented as JSON documents
  Partitioned, spanning machines, clusters and regions
[Diagram: a DocumentDB collection as a partition set; replica-sets give local distribution, and regions (US-East, US-West, N Europe) give global distribution]

Let’s talk about…
Microsoft Ignite 2016
Modeling Data
Query and Indexing
Global Distribution
Tips and Best Practices
Everything you need to know to build blazing fast, planet-scale applications!

Collections != Tables
Collections do NOT enforce schema
Co-locate multiple types in a collection
Annotate documents with a "type" property

Co-locating types in the same collection
Ability to query across multiple entity types with a single network request.

For example, we have two types of documents: cat and person.

    { "id": "Ralph", "type": "Cat", "familyId": "Liu", "fur": { "length": "short", "color": "brown" } }
    { "id": "Andrew", "type": "Person", "familyId": "Liu", "worksOn": "DocumentDB" }

We can query both types of documents without needing a JOIN simply by running a query without a filter on type:

    SELECT * FROM c WHERE c.familyId = "Liu"

If we wanted to filter on type = "Person", we can simply add a filter on type to our query:

    SELECT * FROM c WHERE c.familyId = "Liu" AND c.type = "Person"
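The two queries above can be mimicked with a plain Python list standing in for the collection. This is only a sketch of the idea, not how DocumentDB executes queries; the `query` helper is a hypothetical name introduced here.

```python
# In-memory stand-in for one collection holding two entity types.
collection = [
    {"id": "Ralph", "type": "Cat", "familyId": "Liu",
     "fur": {"length": "short", "color": "brown"}},
    {"id": "Andrew", "type": "Person", "familyId": "Liu",
     "worksOn": "DocumentDB"},
]


def query(docs, **filters):
    """Return documents whose top-level properties equal every filter."""
    return [d for d in docs if all(d.get(k) == v for k, v in filters.items())]


# SELECT * FROM c WHERE c.familyId = "Liu"
family = query(collection, familyId="Liu")

# SELECT * FROM c WHERE c.familyId = "Liu" AND c.type = "Person"
people = query(collection, familyId="Liu", type="Person")
```

Because both entity types live in the same list, the first call returns the cat and the person in one pass, and the second narrows the result by adding one more filter.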

Co-locating types in the same collection
Ability to query across multiple entity types with a single network request.
Ability to perform transactions across multiple types
Cost: every collection has one or more physical partitions underneath

Let's talk about partitioning.

Two Dimensions: Throughput and Storage

Measuring Throughput (Request Units)
Request Unit/sec (RU) is the normalized currency (% IOPS, % CPU, % Memory)
Operations consume request units (RUs): READ (GET), INSERT (POST), REPLACE (PUT), Query, …
Replica gets a fixed budget of request units
Requests get rate limited if they exceed the SLA
Customers pay for reserved request units by the hour
[Diagram labels: Incoming Requests, Replica, Min RU/sec, Max RU/sec, Quiescent, No throttling, Rate limit]
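The "fixed budget of request units" idea can be sketched as a per-second counter. This is a toy model under assumed numbers, not the service's actual rate limiter; the class name, the 10 RU/sec budget, and the 4 RU charge per request are all hypothetical.

```python
class RequestUnitBudget:
    """Toy per-second RU budget for a single replica (illustrative only)."""

    def __init__(self, ru_per_second):
        self.ru_per_second = ru_per_second
        self.window = None   # the one-second window currently being metered
        self.consumed = 0.0  # RUs spent inside that window

    def try_consume(self, now, charge):
        """Return True if the request fits the budget, False if rate limited."""
        window = int(now)
        if window != self.window:  # a new second starts with a fresh budget
            self.window = window
            self.consumed = 0.0
        if self.consumed + charge > self.ru_per_second:
            return False
        self.consumed += charge
        return True


budget = RequestUnitBudget(ru_per_second=10)
results = [
    budget.try_consume(now=0.1, charge=4),  # accepted, 4 RUs used
    budget.try_consume(now=0.5, charge=4),  # accepted, 8 RUs used
    budget.try_consume(now=0.9, charge=4),  # rejected, would reach 12 RUs
    budget.try_consume(now=1.2, charge=4),  # accepted, a new second began
]
```

A client that sees a rejected request would normally back off and retry once the next window opens.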

Partitioning Model
[Diagram: a collection divided into Partition 1 … Partition n]

Partitioning Model
Partition Key = city
[Diagram: partition key values (Houston, London, Chicago, New Delhi, Mumbai, Paris, New York, Boston, Berlin, …) spread across Partition 1, Partition 2, … Partition i, … Partition n]
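A rough sketch of how a partition key value such as a city might be mapped to a partition. The hash function and the partition count of 4 are assumptions made for illustration; the service's real hashing scheme is internal and not described in this deck.

```python
import hashlib


def partition_for(partition_key, partition_count):
    """Map a partition key value to a partition index via a stable hash."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


cities = ["Houston", "London", "Chicago", "New Delhi", "Mumbai",
          "Paris", "New York", "Boston", "Berlin"]

# Every document with the same city always lands on the same partition.
placement = {city: partition_for(city, partition_count=4) for city in cities}
```

The property that matters is determinism: the same key value always resolves to the same partition, so reads and writes for one city never need to look anywhere else.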

Overall request volume should scale across Partition Keys
[Diagram: Partition 1, Partition 2, … Partition i, … Partition n]


Individual queries should minimize cross-partition lookups
[Diagram: Partition 1, Partition 2, … Partition i, … Partition n]

Partition Key Design Goals
Load balances the overall request volume (at any given time)
Avoids having to fan out a given query request across all partitions
  The SDK is "smart": just be sure to include partitionKey in the WHERE clause
For a given partition key value, try to stay under 1 GB and 1,000 RU/sec
Partition key choice ideally reflects transactional needs
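Why the WHERE clause matters can be sketched as a routing decision: a filter on the partition key lets a query target one partition, while any other query has to visit all of them. The partition count, the `city` key, and both function names are hypothetical, and this is not the SDK's actual routing code.

```python
import hashlib

PARTITION_COUNT = 4  # hypothetical number of physical partitions


def partition_for(key):
    """Stable hash of a partition key value onto a partition index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PARTITION_COUNT


def partitions_to_visit(filters, partition_key_property="city"):
    """Return the partitions a query must touch, given its equality filters."""
    if partition_key_property in filters:
        return [partition_for(filters[partition_key_property])]  # targeted
    return list(range(PARTITION_COUNT))                          # fan-out


targeted = partitions_to_visit({"city": "Paris", "type": "Person"})
fan_out = partitions_to_visit({"type": "Person"})
```

The targeted query touches a single partition no matter how large the collection grows, whereas the cost of the fan-out query rises with the partition count.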

Choosing a Partition Key
Understand the data access patterns and optimize for the most frequently run operations
  Look at your Top N queries for commonly filtered fields
General tips:
  Avoid HOT partitions
  Aim for high cardinality… more partition key values = happiness
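One way to compare candidate partition keys is to replay a sample of requests and check how many distinct values each key has and how much traffic the busiest value receives. The sample data and the `key_stats` helper below are made up for illustration.

```python
from collections import Counter


def key_stats(request_keys):
    """Summarize how evenly requests spread across partition key values."""
    counts = Counter(request_keys)
    hottest_key, hottest_count = counts.most_common(1)[0]
    return {
        "cardinality": len(counts),
        "hottest_key": hottest_key,
        "hottest_share": hottest_count / len(request_keys),
    }


# Hypothetical request logs: the partition key value seen on each request.
by_country = ["US"] * 8 + ["FR"] * 2
by_user = ["user%d" % i for i in range(10)]

skewed = key_stats(by_country)  # few values, one of them takes most traffic
even = key_stats(by_user)       # many values, traffic spread evenly
```

A key with low cardinality and a large hottest share is a likely hot partition; a key with many values and a small hottest share spreads load better.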

Let’s talk about object model
"With great power comes great responsibility" - Uncle Ben
DocumentDB gives you the power of a schema-agnostic database.
Just because you can de-normalize… doesn't mean you should do so blindly.

How do approaches differ?

Data normalization
ORM
Come as you are

Modeling Data: The Relational Way
[Diagram: relational tables Person (Id), PersonContactDetailLnk (PersonId, ContactDetailId), ContactDetail (Id), Address (Id), ContactDetailType (Id)]

Modeling Data: The Document Way
    {
      "id": "0ec1ab0c-de08-4e42-a ",
      "addresses": [
        { "street": "1 Redmond Way", "city": "Redmond", "state": "WA", "zip": 98052 }
      ],
      "contactDetails": [
        { "type": "home", "detail": " " },
        { "type": " ", "detail": ... }
      ]
    }

[Diagram: Person (Id) contains Addresses (Address …, Address …) and ContactDetails (ContactDetail …)]

To embed, or to reference, that is the question

Data modeling with denormalization
    {
      "id": "1",
      "firstName": "Thomas",
      "lastName": "Andersen",
      "addresses": [
        { "line1": "100 Some Street", "line2": "Unit 1", "city": "Seattle", "state": "WA", "zip": … }
      ],
      "contactDetails": [
        { … },
        { "phone": " ", "extension": 5555 }
      ]
    }

Try to model your entity as a self-contained document
Generally, use embedded data models when:
  There are "contains" relationships between entities
  There are one-to-few relationships between entities
  Embedded data changes infrequently
  Embedded data won’t grow without bounds
  Embedded data is integral to data in a document
Denormalizing typically provides better read performance

Data modeling with referencing
In general, use normalized data models when:
  Write performance is more important than read performance
  Representing one-to-many relationships
  Representing many-to-many relationships
  Related data changes frequently
Provides more flexibility than embedding
More round trips to read data

User document:
    { "id": "xyz", "username": "user xyz" }
Address document:
    { "id": "address_xyz", "userid": "xyz", "address": { … } }
Contact details document:
    { "id": "contact_xyz", "userid": "xyz", …, "phone": " " }

Normalizing typically provides better write performance
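The "more round trips" cost of referencing can be sketched by counting reads against a toy key-value store. The `Store` class and the sample address and phone values are invented for this illustration; each `get` call stands for one network request.

```python
class Store:
    """Hypothetical document store keyed by id; counts every read."""

    def __init__(self, docs):
        self.docs = {d["id"]: d for d in docs}
        self.reads = 0

    def get(self, doc_id):
        self.reads += 1
        return self.docs[doc_id]


# Embedded model: everything about the user lives in one document.
embedded = Store([
    {"id": "xyz", "username": "user xyz",
     "address": {"city": "Seattle"}, "phone": "555-0100"},
])

# Referenced model: the same data split across three documents.
referenced = Store([
    {"id": "xyz", "username": "user xyz"},
    {"id": "address_xyz", "userid": "xyz", "address": {"city": "Seattle"}},
    {"id": "contact_xyz", "userid": "xyz", "phone": "555-0100"},
])

profile_embedded = embedded.get("xyz")  # one round trip

profile_referenced = {                  # three round trips
    **referenced.get("xyz"),
    "address": referenced.get("address_xyz")["address"],
    "phone": referenced.get("contact_xyz")["phone"],
}
```

Both models yield the same profile, but the referenced one pays for it with extra reads. In exchange, an address change rewrites only the small address document.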

Hybrid models
No magic bullet
Hybrid approach:
  Model on a property level (as opposed to record level)
  Optimize your data model for your workload… (as opposed to blindly following types)
  Segment data based on mutability

Author document:
    {
      "id": "1",
      "firstName": "Thomas",
      "lastName": "Andersen",
      "countOfBooks": 3,
      "books": [1, 2, 3],
      "images": [ {"thumbnail": "…"}, {"profile": "…"} ]
    }

Book document:
    {
      "id": 1,
      "name": "DocumentDB 101",
      "authors": [
        {"id": 1, "name": "Thomas Andersen", "thumbnail": "…"},
        {"id": 2, "name": "William Wakefield", "thumbnail": "…"}
      ]
    }

Query and Indexing

Documents as Trees
JavaScript object literals
JSON serializable values (aka JSON Infoset)

    {
      "locations": [
        { "country": "Germany", "city": "Berlin" },
        { "country": "France", "city": "Paris" }
      ],
      "headquarter": "Belgium",
      "exports": [ { "city": "Moscow" }, { "city": "Athens" } ]
    }

[Diagram: JSON document as tree. The root has children locations, headquarter and exports; array elements are keyed by index, and the leaves hold the values Germany, Berlin, France, Paris, Belgium, Moscow, Athens]
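The tree view of a JSON document can be made concrete by walking the document and emitting one path per leaf. This is a minimal sketch; the `$` root marker and `/` separator are notation chosen here, and `paths` is a hypothetical helper.

```python
def paths(node, prefix="$"):
    """Yield (path, leaf value) pairs for every leaf in a JSON-like value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from paths(child, prefix + "/" + key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from paths(child, prefix + "/" + str(index))
    else:
        yield prefix, node


company = {
    "locations": [{"country": "Germany", "city": "Berlin"},
                  {"country": "France", "city": "Paris"}],
    "headquarter": "Belgium",
    "exports": [{"city": "Moscow"}, {"city": "Athens"}],
}

leaves = list(paths(company))
```

Interior nodes (property names and array indexes) contribute the structure of each path, and the instance values sit at the leaves.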

Query
SQL:
    SELECT C.locations
    FROM company C
    WHERE C.headquarter = "Belgium"

JavaScript:
    function businessLogic() {
        var country = "Belgium";
        __.filter(function(x) { return x.headquarter === country; });
    }

Input documents:
    {
      "locations": [ { "country": "Germany", "city": "Berlin" }, { "country": "France", "city": "Paris" } ],
      "headquarter": "Belgium",
      "exports": [ { "city": "Moscow" }, { "city": "Athens" } ]
    }
    {
      "locations": [ { "country": "Germany", "city": "Bonn", "revenue": 200 } ],
      "headquarter": "Italy",
      "exports": [ { "city": "Berlin", "dealers": [ { "name": "Hans" } ] }, { "city": "Athens" } ]
    }

Query result:
    { "results": [ { "locations": [ {"country":"Germany","city":"Berlin"}, {"country":"France","city":"Paris"} ] } ] }

[Diagram: the input documents and the query result drawn as trees]
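For readers less familiar with either syntax, the same filter-and-project step can be written in plain Python over the two input documents. This only mirrors the query's meaning; it says nothing about how the engine evaluates it.

```python
companies = [
    {"locations": [{"country": "Germany", "city": "Berlin"},
                   {"country": "France", "city": "Paris"}],
     "headquarter": "Belgium",
     "exports": [{"city": "Moscow"}, {"city": "Athens"}]},
    {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
     "headquarter": "Italy",
     "exports": [{"city": "Berlin", "dealers": [{"name": "Hans"}]},
                 {"city": "Athens"}]},
]

# SELECT C.locations FROM company C WHERE C.headquarter = "Belgium"
results = [{"locations": c["locations"]}
           for c in companies
           if c["headquarter"] == "Belgium"]
```

Only the first document has its headquarter in Belgium, so the result holds that document's locations array and nothing else.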

Query
UDF:
    {
      "id": "GermanTax",
      "body": "function GermanTax(income) { if(income < 1000) return income * 0.1; else if(income < 10000) return income * 0.2; return income * 0.4; }"
    }

SQL:
    SELECT location.city, GermanTax(location.revenue) AS Tax
    FROM location IN company.locations
    WHERE location.revenue > 100

Input documents:
    {
      "locations": [ { "country": "Germany", "city": "Berlin" }, { "country": "France", "city": "Paris" } ],
      "headquarter": "Belgium",
      "exports": [ { "city": "Moscow" }, { "city": "Athens" } ]
    }
    {
      "locations": [ { "country": "Germany", "city": "Bonn", "revenue": 200 } ],
      "headquarter": "Italy",
      "exports": [ { "city": "Berlin", "dealers": [ { "name": "Hans" } ] }, { "city": "Athens" } ]
    }

Query result:
    { "results": [ {"city":"Bonn","Tax":20} ] }

[Diagram: the input documents and the query result drawn as trees]
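The arithmetic behind the Tax value can be checked with a Python port of the UDF body and a comprehension that mirrors the query. The port and the `german_tax` name are illustrative only; in DocumentDB the UDF runs as JavaScript inside the database.

```python
def german_tax(income):
    """Python port of the GermanTax UDF body shown on the slide."""
    if income < 1000:
        return income * 0.1
    elif income < 10000:
        return income * 0.2
    return income * 0.4


companies = [
    {"locations": [{"country": "Germany", "city": "Berlin"},
                   {"country": "France", "city": "Paris"}],
     "headquarter": "Belgium"},
    {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
     "headquarter": "Italy"},
]

# SELECT location.city, GermanTax(location.revenue) AS Tax
# FROM location IN company.locations
# WHERE location.revenue > 100
results = [
    {"city": location["city"], "Tax": german_tax(location["revenue"])}
    for company in companies
    for location in company["locations"]
    if "revenue" in location and location["revenue"] > 100
]
```

Only the Bonn location has a revenue property above 100; its revenue of 200 falls in the lowest bracket, so the tax is 200 x 0.1 = 20, matching the result on the slide.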

Schema Agnostic Indexing
Support for rich hierarchical, relational and analytical queries
Different path encodings depending on index type:
  Range (>, <, !=) and ORDER BY queries
  Wildcard queries
  Spatial queries
Support for multi-tenancy requires a fixed upper bound on index size
Logically the index is a union of all the document trees
Structure contributed by the interior nodes, instance values are the leaves
Columnar index for fast scans
Structural information and instance values are normalized into a unifying concept of JSON-Path
Dynamic encoding of postings list (E-WAH/differential)
[Diagram: documents 1 and 2 share common structure (location, country, Germany); an inverted index maps terms such as $/location/0/, location/0/country/, location/0/city/, 0/country/Germany, 1/country/France, 0/city/Moscow, 0/dealers/0, … to postings lists of document ids such as 1, 2]
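The "union of all the document trees" can be sketched as an inverted index from path/value terms to the ids of the documents that contain them. This toy version ignores the path encodings and postings-list compression mentioned above; `paths` and `build_index` are hypothetical helpers.

```python
from collections import defaultdict


def paths(node, prefix="$"):
    """Yield (path, leaf value) pairs for every leaf in a JSON-like value."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from paths(child, prefix + "/" + key)
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from paths(child, prefix + "/" + str(index))
    else:
        yield prefix, node


def build_index(documents):
    """Map each path/value term to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for path, value in paths(doc):
            index[path + "/" + str(value)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


documents = {
    1: {"locations": [{"country": "Germany", "city": "Berlin"},
                      {"country": "France", "city": "Paris"}],
        "headquarter": "Belgium"},
    2: {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
        "headquarter": "Italy"},
}

index = build_index(documents)
```

An equality query then becomes a lookup: the term for "first location's country is Germany" resolves directly to the postings list of matching document ids.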

Queries that use the index
Equality: =
Range: <, >, <=, >=
ORDER BY
String operators: STARTSWITH
Spatial operators: ST_WITHIN, ST_DISTANCE, ST_INTERSECTS, …
Array operators: ARRAY_CONTAINS
Schema operators: IS_DEFINED, IS_NUMBER, IS_STRING, …

Indexing Policies

Configuration               | Level          | Options
Automatic                   | Per collection | True (default) or False. Override with each document write.
Indexing Mode               | Per collection | Consistent or Lazy. Lazy for eventual updates/bulk ingestion.
Included and excluded paths | Per path       | Individual path or recursive includes (? and *).
Indexing Type               | Per path       | Supports Hash (default) and Range. Hash for equality, Range for range queries.
Indexing Precision          | Per path       | Supports 3 to 7 per path. Trades off storage, query RUs and write RUs.

Indexing Paths

Path: /
  Default path for collection. Recursive and applies to whole document tree.
Path: /"prop"/?
  Serve queries like the following (with Hash or Range types respectively):
    SELECT * FROM collection c WHERE c.prop = "value"
    SELECT * FROM collection c WHERE c.prop > 5
Path: /"prop"/*
  All paths under the specified label.
Path: /"prop"/"subprop"/
  Used during query execution to prune documents that do not have the specified path.
Path: /"prop"/"subprop"/?
  Serve queries (with Hash or Range types respectively):
    SELECT * FROM collection c WHERE c.prop.subprop = "value"
    SELECT * FROM collection c WHERE c.prop.subprop > 5

Global Distribution

Multi-region DocumentDB databases
[Diagram: a DocumentDB collection provisioned with 2M RUs, drawn as a partition set of replica-sets (local distribution) replicated across US-East, US-West and India (global distribution), with primary and secondary replica-sets]
Total RUs = Provisioned RUs x Number of regions
In this example: 2M RUs x 3 regions = 6M RUs
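The formula is simple enough to check directly. A small sketch, with a hypothetical function name:

```python
def total_request_units(provisioned_rus, region_count):
    """Total RUs = provisioned RUs x number of regions."""
    return provisioned_rus * region_count


# The slide's example: 2M RUs provisioned, replicated to 3 regions.
total = total_request_units(2_000_000, 3)
```

Each region receives the full provisioned throughput, so adding a region adds capacity in that region and does not split the existing budget.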

Programmable data consistency
Strong consistency, high latency
Eventual consistency, low latency
"It's hard to write distributed apps."

Consistency Levels
PACELC Theorem and the associated tradeoffs

Consistency Levels
Strong, Eventual, Bounded Staleness, and Session
Left to right: weaker consistency, better read scalability, lower write latency
  Strong
  Bounded Staleness: consistent prefix reads. Reads lag behind writes by K prefixes or T interval
  Session: monotonic reads, monotonic writes and read-your-writes guarantee
  Eventual
[Diagram: for each level, clients reading from primary (P) and secondary (S) replicas]
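The guarantees for Session and Bounded Staleness can be illustrated with a toy model in which each replica tracks the latest write version it has applied. This is a sketch of the guarantees only, not the replication protocol; every name below is hypothetical.

```python
class Replica:
    """A replica that has applied writes up to some version (toy model)."""

    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        self.version = version
        self.value = value


def session_read(replicas, session_version):
    """Read from the first replica that has caught up to the session's last write."""
    for replica in replicas:
        if replica.version >= session_version:
            return replica.value
    raise RuntimeError("no replica has caught up yet")


def within_bounded_staleness(primary, replica, k):
    """True if the replica lags the primary by at most k versions."""
    return primary.version - replica.version <= k


primary, secondary = Replica(), Replica()
primary.apply(1, "v1")
secondary.apply(1, "v1")
primary.apply(2, "v2")   # the secondary has not replicated this write yet

session_version = 2      # the client remembers its own latest write
value = session_read([secondary, primary], session_version)
stale_value = secondary.value  # what an eventual read could return
```

The session read skips the lagging secondary, so the client always sees its own write. An eventual read may still return the older value. The bounded-staleness check simply caps how far behind a replica may be.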

General Tips

General Tips: Low latency
Create a singleton instance of DocumentClient for an app server instance
[Diagram: Server Instance holding DocumentClient _client]

Warm up the DocumentClient cache by calling DocumentClient.OpenAsync() upon start of your app server:
    async Task ServerStart()
    {
        ...
        await _client.OpenAsync();
    }

Use Direct Connectivity and TCP for the .NET SDK:
    ConnectionPolicy policy = new ConnectionPolicy
    {
        ConnectionMode = ConnectionMode.Direct,
        ConnectionProtocol = Protocol.Tcp
    };
    return new DocumentClient(endpoint, key, policy);

Use Direct Connectivity and HTTPS for the Java SDK

General Tips: Throughput
Use relaxed consistency levels for efficient utilization of provisioned throughput

Subscribe for changes via change feed APIs instead of polling and reading the entire feed:
    GET
    x-ms-max-item-count: 1
    If-None-Match: "28535"
    A-IM: Incremental feed
    x-ms-documentdb-partitionkeyrangeid: 16
    ...

If you intend to use DocumentDB as a KV store, you can tell the system to drop the secondary indexes. This will also save storage:
    POST .../colls
    { ... indexingPolicy : { IndexingMode : "None" … } }
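The benefit of an incremental feed over re-reading everything can be sketched with an append-only log and a continuation position, which plays the role the If-None-Match etag plays in the request above. This is a toy model, not the change feed API; `ChangeFeed` and its methods are hypothetical.

```python
class ChangeFeed:
    """Toy append-only log of changes; positions act like etags."""

    def __init__(self):
        self.log = []

    def append(self, document):
        self.log.append(document)

    def read(self, continuation=0, max_item_count=None):
        """Return (changes after `continuation`, new continuation)."""
        end = len(self.log)
        if max_item_count is not None:
            end = min(end, continuation + max_item_count)
        return self.log[continuation:end], end


feed = ChangeFeed()
feed.append({"id": "a"})
feed.append({"id": "b"})

first, token = feed.read()                     # initial read: everything so far
feed.append({"id": "c"})
second, token = feed.read(continuation=token)  # only what changed since
third, token = feed.read(continuation=token)   # nothing new
```

Each read returns only the documents added since the previous position, so a consumer that stores the token never re-reads the whole feed.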

Roadmap 2017

Change Feed
Distributed replication log
Keep your cache or data warehouse up to date
Perform notifications on changes
Perform streaming aggregation
Lambda pattern with significantly lower TCO
Single scalable database solution for both ingestion and query

Aggregates at global scale
Low-latency aggregates at any scale
Supported via an updatable column store index at global scale
Deeply integrated with the latch-free, log-structured database engine
Preview now available

Spark connector for DocumentDB
RDD and Dataset-based connectors available
Native integration with Spark SQL
Direct mapping to DocumentDB partitions
Natively leverage the DocumentDB index
Predicate pushdown
Public release in H1 CY2017

Pricing and scaling improvements
Enable bursting up to 10x for spiky workloads
Reduced starting price for partitioned collections (4x)
Create up to 10 TB collections without a support ticket
Deprecating S1 to S3 offers
Bursting available H1 2017

Graph APIs
SQL and Gremlin query
Independently scalable graph engine using TinkerPop
Optimized query engine for relationship traversals
Schema freedom for ad-hoc expansion of attributes on nodes and edges
Limitless scale to support massive graphs
Same NoSQL stack

Session objectives and takeaways
Everything you need to know to be successful on DocumentDB!
Best practices, including: Data Modeling, Partitioning, Geo-Replication
Roadmap!

Continue your Ignite learning path
Visit Channel 9 to access a wide range of Microsoft training and event recordings
Head to the TechNet Eval Centre to download trials of the latest Microsoft products
Visit Microsoft Virtual Academy for free online training

Q&A
askdocdb@microsoft
Follow @DocumentDB
Use #DocumentDB
documentdb.com
#azure-documentdb
