Distributed Systems Tutorial 9 – Windows Azure Storage written by Alex Libov Based on SOSP 2011 presentation winter semester, 2011-2012.

Windows Azure Storage (WAS)
- A scalable cloud storage system
- In production since November 2008
- Used inside Microsoft for applications such as social networking search, serving video, music and game content, managing medical records and more
- Thousands of customers outside Microsoft
- Anyone can sign up over the Internet to use the system

WAS Abstractions
- Blobs – File system in the cloud
- Tables – Massively scalable structured storage
- Queues – Reliable storage and delivery of messages
- A common usage pattern: incoming and outgoing data is shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs.

Design goals
- Highly Available with Strong Consistency
  - Provide access to data in face of failures/partitioning
- Durability
  - Replicate data several times within and across data centers
- Scalability
  - Need to scale to exabytes and beyond
  - Provide a global namespace to access data around the world
  - Automatically load balance data to meet peak traffic demands

Global Partitioned Namespace
http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName
- <service> can be blob, table or queue.
- AccountName is the customer selected account name for accessing storage.
  - The AccountName specifies the data center where the data is stored.
  - An application may use multiple AccountNames to store its data across different locations.
- PartitionName locates the data once a request reaches the storage cluster.
- When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition.
  - The system supports atomic transactions across objects with the same PartitionName value.
  - The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account.
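The naming scheme above can be illustrated by splitting a URL into its parts. This is only a sketch: the function name parse_was_url and the example account and object names are made up for illustration, and the host is assumed to have exactly the shape AccountName.<service>.core.windows.net.

```python
from urllib.parse import urlparse

SERVICES = {"blob", "table", "queue"}


def parse_was_url(url):
    """Split a WAS-style URL into (account, service, partition, object_name).

    Hypothetical helper: it only mirrors the naming scheme on the slide.
    """
    parsed = urlparse(url)
    labels = (parsed.hostname or "").split(".")
    # Expected host shape: AccountName.<service>.core.windows.net
    if len(labels) != 5 or labels[2:] != ["core", "windows", "net"]:
        raise ValueError(f"unexpected host: {parsed.hostname!r}")
    account, service = labels[0], labels[1]
    if service not in SERVICES:
        raise ValueError(f"unknown service: {service!r}")
    # The first path segment is the PartitionName; the rest, if any, is the ObjectName.
    partition, _, object_name = parsed.path.lstrip("/").partition("/")
    return account, service, partition, object_name or None
```

The ObjectName comes back as None when the URL stops at the PartitionName, matching the note that the ObjectName is optional.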

Storage Stamps
- A storage stamp is a cluster of N racks of storage nodes.
- Each rack is built out as a separate fault domain with redundant networking and power.
- Clusters typically range from 10 to 20 racks with 18 disk-heavy storage nodes per rack.
- The first generation storage stamps hold approximately 2PB of raw storage each.
- The next generation stamps hold up to 30PB of raw storage each.

High Level Architecture
[Figure: the Storage Location Service sits above the storage stamps. Data access reaches each stamp through a load balancer (LB). Every stamp contains Front-Ends, a Partition Layer and a Stream Layer. Intra-stamp replication happens in the Stream Layer; inter-stamp (geo) replication happens between the Partition Layers of different stamps.]
- Access blob storage via the URL: http://<AccountName>.blob.core.windows.net/

Storage Stamp Architecture – Stream Layer
- Append-only distributed file system
- All data from the Partition Layer is stored into files (extents) in the Stream layer
- An extent is replicated 3 times across different fault and upgrade domains
  - With random selection for where to place replicas
- Checksum all stored data
  - Verified on every client read
- Re-replicate on disk/node/rack failure or checksum mismatch
[Figure: Stream Layer (Distributed File System), showing masters (M) coordinated via Paxos and a set of Extent Nodes (EN).]
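Replica placement as described above (3 copies, different fault and upgrade domains, random selection) might be sketched as follows. The slides do not give the real placement algorithm; the function place_replicas, the node names and the greedy strategy are assumptions for illustration only.

```python
import random


def place_replicas(nodes, copies=3, rng=random):
    """Pick `copies` extent nodes with pairwise distinct fault and upgrade domains.

    nodes: dict mapping node name -> (fault_domain, upgrade_domain).
    Greedy sketch: shuffle the candidates, then take each node whose domains
    are still unused. A greedy pass can fail even when a valid placement
    exists, which is acceptable for an illustration.
    """
    candidates = list(nodes)
    rng.shuffle(candidates)  # random selection of where replicas go
    chosen, used_fault, used_upgrade = [], set(), set()
    for name in candidates:
        fault_domain, upgrade_domain = nodes[name]
        if fault_domain in used_fault or upgrade_domain in used_upgrade:
            continue
        chosen.append(name)
        used_fault.add(fault_domain)
        used_upgrade.add(upgrade_domain)
        if len(chosen) == copies:
            return chosen
    raise RuntimeError("not enough nodes in distinct domains")
```

Passing a seeded random.Random instance as rng makes the choice reproducible when experimenting.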

Storage Stamp Architecture – Partition Layer
- Provides transaction semantics and strong consistency for Blobs, Tables and Queues
- Stores and reads the objects to/from extents in the Stream layer
- Provides inter-stamp (geo) replication by shipping logs to other stamps
- Scalable object index via partitioning
[Figure: Partition Layer, showing a set of Partition Servers, the Partition Master and the Lock Service.]

Storage Stamp Architecture – Front End Layer
- Stateless Servers
- Authentication + authorization
- Request routing

Storage Stamp Architecture
[Figure: an incoming write request enters the Front End Layer (FE servers), passes to the Partition Layer (Partition Servers, Partition Master, Lock Service) and then to the Stream Layer (masters coordinated via Paxos, Extent Nodes); an Ack is returned to the caller.]

Partition Layer – Scalable Object Index
- 100s of billions of blobs, entities and messages across all accounts can be stored in a single stamp
  - Need to efficiently enumerate, query, get, and update them
  - Traffic pattern can be highly dynamic
    - Hot objects, peak load, traffic bursts, etc.
- Need a scalable index for the objects that can
  - Spread the index across 100s of servers
  - Dynamically load balance
  - Dynamically change what servers are serving each part of the index based on load

Scalable Object Index via Partitioning
- Partition Layer maintains an internal Object Index Table for each data abstraction
  - Blob Index: contains all blob objects for all accounts in a stamp
  - Table Entity Index: contains all table entities for all accounts in a stamp
  - Queue Message Index: contains all messages for all accounts in a stamp
- Scalability is provided for each Object Index
  - Monitor load to each part of the index to determine hot spots
  - Index is dynamically split into thousands of Index RangePartitions based on load
  - Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load

Partition Layer – Index Range Partitioning
[Figure: within a storage stamp, the Partition Master maintains the Partition Map for the Blob Index, and the Front-End Server uses a copy of that map to route requests to Partition Servers. Partition Map: A-H: PS1, H’-R: PS2, R’-Z: PS3.]
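The lookup a Front-End performs against the Partition Map can be sketched with a sorted list of range upper bounds. The class below is an illustration, not the production data structure: PartitionMap, lookup and split are hypothetical names, keys are compared by their first letter only, and "PS4" is an invented server used to show a load-driven split.

```python
import bisect


class PartitionMap:
    """Maps key ranges to partition servers, as in A-H: PS1, H'-R: PS2, R'-Z: PS3."""

    def __init__(self, ranges):
        # ranges: list of (inclusive_upper_bound, server), sorted by bound.
        self._bounds = [upper for upper, _ in ranges]
        self._servers = [server for _, server in ranges]

    def lookup(self, partition_name):
        # Simplifying assumption: only the first letter decides the range.
        letter = partition_name[0].upper()
        i = bisect.bisect_left(self._bounds, letter)
        if i == len(self._bounds):
            raise KeyError(partition_name)
        return self._servers[i]

    def split(self, at, new_server):
        """Split the range containing `at`; keys up to `at` move to new_server."""
        i = bisect.bisect_left(self._bounds, at)
        if i == len(self._bounds) or self._bounds[i] == at:
            raise ValueError(f"cannot split at {at!r}")
        self._bounds.insert(i, at)
        self._servers.insert(i, new_server)


pmap = PartitionMap([("H", "PS1"), ("R", "PS2"), ("Z", "PS3")])
```

Calling pmap.split("M", "PS4") would hand the lower half of the H’-R range to the new server while the rest stays on PS2, which is the kind of dynamic change the previous slide describes.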

Partition Layer – RangePartition
- A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data.
- A RangePartition consists of its own set of streams in the stream layer, and the streams belong solely to that RangePartition:
  - Metadata Stream – The root stream for a RangePartition. The Partition Master (PM) assigns a partition to a Partition Server (PS) by providing the name of the RangePartition’s metadata stream.
  - Commit Log Stream – A commit log that stores the recent insert, update, and delete operations applied to the RangePartition since its last checkpoint was generated.
  - Row Data Stream – Stores the checkpoint row data and index for the RangePartition.
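The interplay of commit log and checkpoint can be sketched in a few lines. This is a toy model under stated assumptions: plain Python lists and dicts stand in for the Commit Log Stream and Row Data Stream, and the method names (put, delete, get, checkpoint, reload) are invented for the example.

```python
_TOMBSTONE = object()  # marks a deleted key in the memory table


class RangePartition:
    def __init__(self):
        self.row_data = {}      # stands in for the Row Data Stream (last checkpoint)
        self.commit_log = []    # stands in for the Commit Log Stream
        self.memory_table = {}  # changes made since the last checkpoint

    def _apply(self, op, key, value):
        self.memory_table[key] = _TOMBSTONE if op == "delete" else value

    def put(self, key, value):
        self.commit_log.append(("put", key, value))  # log first, then apply
        self._apply("put", key, value)

    def delete(self, key):
        self.commit_log.append(("delete", key, None))
        self._apply("delete", key, None)

    def get(self, key):
        if key in self.memory_table:
            value = self.memory_table[key]
            return None if value is _TOMBSTONE else value
        return self.row_data.get(key)

    def checkpoint(self):
        """Fold recent changes into the row data and start a fresh log."""
        for key, value in self.memory_table.items():
            if value is _TOMBSTONE:
                self.row_data.pop(key, None)
            else:
                self.row_data[key] = value
        self.memory_table = {}
        self.commit_log = []

    def reload(self):
        """Rebuild the memory table by replaying the log over the checkpoint."""
        self.memory_table = {}
        for op, key, value in self.commit_log:
            self._apply(op, key, value)
```

Only operations logged since the last checkpoint need to be replayed on reload, which is why the log holds just the recent changes.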

Stream Layer
- Append-Only Distributed File System
- Streams are very large files
  - Has a file system like directory namespace
- Stream Operations
  - Open, Close, Delete Streams
  - Rename Streams
  - Concatenate Streams together
  - Append for writing
  - Random reads

Stream Layer Concepts
- Block
  - Min unit of write/read
  - Checksum
  - Up to N bytes (e.g. 4MB)
- Extent
  - Unit of replication
  - Sequence of blocks
  - Size limit (e.g. 1GB)
  - Sealed/unsealed
- Stream
  - Hierarchical namespace
  - Ordered list of pointers to extents
  - Append/Concatenate
[Figure: stream //foo/myfile.data holding pointers to extents E1, E2, E3 and E4, each extent being a sequence of blocks.]
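The three concepts map naturally onto small data classes. The sketch below is illustrative only: CRC32 is used as a stand-in checksum (the slides do not say which checksum is used), the extent_limit field is a made-up parameter, and sealing on size is the only sealing trigger modeled here.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Block:
    """Minimum unit of read/write, carrying its own checksum."""
    data: bytes
    checksum: int = field(init=False)

    def __post_init__(self):
        self.checksum = zlib.crc32(self.data)

    def verify(self):
        return zlib.crc32(self.data) == self.checksum


@dataclass
class Extent:
    """Unit of replication: a sequence of blocks, sealed or unsealed."""
    blocks: list = field(default_factory=list)
    sealed: bool = False

    @property
    def size(self):
        return sum(len(block.data) for block in self.blocks)


@dataclass
class Stream:
    """Ordered list of pointers to extents; only the last one accepts appends."""
    name: str
    extent_limit: int = 1 << 30  # e.g. 1GB
    extents: list = field(default_factory=list)

    def append(self, data):
        block = Block(data)
        last = self.extents[-1] if self.extents else None
        if last is None or last.sealed or last.size + len(data) > self.extent_limit:
            if last is not None:
                last.sealed = True  # a full extent is sealed before moving on
            last = Extent()
            self.extents.append(last)
        last.blocks.append(block)
        return block
```

With a small extent_limit it is easy to observe an extent being sealed and a new one appended to the stream.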

Creating an Extent
[Figure: the Partition Layer asks the Stream Master (SM, coordinated via Paxos) to create a stream/extent. The SM allocates the extent replica set across extent nodes: EN1 as Primary, EN2 and EN3 as Secondaries (Secondary A and Secondary B).]

Replication Flow
[Figure: the Partition Layer sends an Append to the Primary (EN1); the Secondaries (EN2 as Secondary A, EN3 as Secondary B) also receive the data, and an Ack is returned to the Partition Layer.]

Providing Bit-wise Identical Replicas
- Want all replicas for an extent to be bit-wise the same, up to a committed length
  - Want to store pointers from the partition layer index to an extent+offset
  - Want to be able to read from any replica
- Replication flow
  - All appends to an extent go to the Primary
  - Primary orders all incoming appends and picks the offset for the append in the extent
  - Primary then forwards offset and data to secondaries
  - Primary performs in-order acks back to clients for extent appends
    - Primary returns the offset of the append in the extent
  - An extent offset can commit back to the client once all replicas have written that offset and all prior offsets have also already been completely written
    - This represents the committed length of the extent
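The replication flow above can be simulated in a single process. This is a simplified sketch: Replica and PrimaryExtent are hypothetical names, and forwarding is synchronous, so in-order acks hold trivially; a real system would pipeline appends and track acknowledgements per offset.

```python
class Replica:
    """One copy of an extent; only accepts writes at its current end."""

    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset, payload):
        if offset != len(self.data):
            raise ValueError("out-of-order append")
        self.data += payload


class PrimaryExtent:
    """Primary orders appends, picks offsets and forwards them to secondaries."""

    def __init__(self, secondaries):
        self.local = Replica()
        self.secondaries = secondaries
        self.commit_length = 0

    def append(self, payload):
        offset = len(self.local.data)  # the primary picks the offset
        self.local.write_at(offset, payload)
        for secondary in self.secondaries:
            secondary.write_at(offset, payload)  # forward offset and data
        # Every replica has written this offset and all prior ones,
        # so the append is committed and the offset can be returned.
        self.commit_length = offset + len(payload)
        return offset
```

Because every replica writes the same bytes at the same offset, an extent+offset pointer stored in the partition layer index is valid on any replica.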

Dealing with Write Failures
Failure during append:
1. Ack from primary lost when going back to partition layer
   - Retry from partition layer can cause multiple blocks to be appended (duplicate records)
2. Unresponsive/Unreachable Extent Node (EN)
   - Append will not be acked back to partition layer
   - Seal the failed extent
   - Allocate a new extent and append immediately
[Figure: stream //foo/myfile.dat with pointers to extents E1, E2, E3 and E4; after the failure a pointer to a new extent E5 is added.]
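The second failure case (seal the failed extent, allocate a new one, append immediately) can be sketched as below. The classes are hypothetical stand-ins, and an unreachable extent node is modeled by a simple flag that makes the append raise.

```python
class AppendFailed(Exception):
    pass


class Extent:
    def __init__(self, name):
        self.name = name
        self.records = []
        self.sealed = False
        self.reachable = True  # flip to False to simulate an unreachable EN

    def append(self, record):
        if self.sealed or not self.reachable:
            raise AppendFailed(self.name)
        self.records.append(record)


class Stream:
    def __init__(self, first_extent_number=1):
        self._next_number = first_extent_number
        self.extents = [self._allocate()]

    def _allocate(self):
        extent = Extent(f"E{self._next_number}")
        self._next_number += 1
        return extent

    def append(self, record):
        """Append to the last extent; on failure seal it and use a new extent."""
        try:
            self.extents[-1].append(record)
        except AppendFailed:
            self.extents[-1].sealed = True
            self.extents.append(self._allocate())
            self.extents[-1].append(record)
        return self.extents[-1].name
```

The first failure case is not modeled here: a retry after a lost ack would simply append the same record again, which is how duplicate records arise.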

Extent Sealing (Scenario 1)
[Figure, step 1: after an Append from the Partition Layer to the replica set (Primary, Secondary A, Secondary B on EN1-EN3, with EN4 also shown), the Stream Master (SM) asks for the current length, receives 120, and issues Seal Extent: sealed at 120.]
[Figure, step 2: an extent node syncs with the SM and its replica is likewise sealed at 120.]

Extent Sealing (Scenario 2)
[Figure, step 1: after an Append from the Partition Layer, the SM asks for the current length and receives 120 and 100 from different replicas; it issues Seal Extent: sealed at 100.]
[Figure, step 2: an extent node syncs with the SM and its replica is likewise sealed at 100.]
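Both scenarios are consistent with the SM sealing at the smallest length reported by the replicas it can reach. The sketch below encodes that rule as an assumption; seal_extent and the EN names in the examples are hypothetical, and the figures do not say which node reported which length.

```python
def seal_extent(reported_lengths):
    """Choose the sealed length for an extent.

    reported_lengths: dict mapping extent node name -> current length,
    or None when the SM could not reach that node.
    Assumed rule: seal at the smallest length reported by a reachable node.
    """
    reachable = [n for n in reported_lengths.values() if n is not None]
    if not reachable:
        raise RuntimeError("no replica reachable; cannot seal")
    return min(reachable)


# Scenario 1: every reachable replica reports 120.
scenario_1 = seal_extent({"EN1": 120, "EN2": 120, "EN3": None})
# Scenario 2: reachable replicas disagree (120 and 100).
scenario_2 = seal_extent({"EN1": 120, "EN2": None, "EN3": 100})
```

A replica that was unreachable during sealing would later sync with the SM and adopt the sealed length, as the second step of each figure shows.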

Providing Consistency for Data Streams
- For Data Streams, the Partition Layer only reads from offsets returned from successful appends
  - Committed on all replicas
- Row and Blob Data Streams
  - Offset valid on any replica
[Figure: replica set EN1 (Primary), EN2 (Secondary A), EN3 (Secondary B) under a network partition: the Partition Server (PS) can talk to EN3 but the SM cannot. It is still safe to read from EN3.]

Providing Consistency for Log Streams
- Logs are used on partition load
  - Commit and Metadata log streams
- Check commit length first
- Only read from
  - An unsealed replica if all replicas have the same commit length
  - A sealed replica
[Figure: replica set EN1 (Primary), EN2 (Secondary A), EN3 (Secondary B) under a network partition: the PS can talk to EN3 but the SM cannot. The Partition Server checks the commit length, the SM seals the extent, and EN1 and EN2 are used for loading.]
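The read rule for log streams can be written as a small predicate. ReplicaState and safe_to_read are illustrative names, and the lengths in the example are made up; the logic only restates the two bullets above.

```python
from dataclasses import dataclass


@dataclass
class ReplicaState:
    name: str
    sealed: bool
    commit_length: int


def safe_to_read(replica, all_replicas):
    """Return True if a log-stream extent may be read from `replica`.

    A sealed replica is always safe. An unsealed replica is safe only when
    every replica reports the same commit length.
    """
    if replica.sealed:
        return True
    return len({r.commit_length for r in all_replicas}) == 1


# Hypothetical state after the SM sealed the extent on EN1 and EN2
# while EN3 was cut off from the SM.
replicas = [
    ReplicaState("EN1", sealed=True, commit_length=100),
    ReplicaState("EN2", sealed=True, commit_length=100),
    ReplicaState("EN3", sealed=False, commit_length=120),
]
usable = [r.name for r in replicas if safe_to_read(r, replicas)]
```

In this example only the sealed replicas qualify, matching the figure's "use EN1, EN2 for loading".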

Summary
- Highly Available Cloud Storage with Strong Consistency
- Scalable data abstractions to build your applications
  - Blobs – Files and large objects
  - Tables – Massively scalable structured storage
  - Queues – Reliable delivery of messages
- More information at:
  - http://www.sigops.org/sosp/sosp11/current/2011-Cascais/11-calder-online.pdf
