Distributed Systems Tutorial 9 – Windows Azure Storage. Written by Alex Libov, based on the SOSP 2011 presentation. Winter semester, 2011-2012.

Windows Azure Storage (WAS)
- A scalable cloud storage system
- In production since November 2008
- Used inside Microsoft for applications such as social networking search; serving video, music, and game content; managing medical records; and more
- Thousands of customers outside Microsoft
- Anyone can sign up over the Internet to use the system

WAS Abstractions
- Blobs – file system in the cloud
- Tables – massively scalable structured storage
- Queues – reliable storage and delivery of messages
- A common usage pattern: incoming and outgoing data is shipped via Blobs, Queues provide the overall workflow for processing the Blobs, and intermediate service state and final results are kept in Tables or Blobs (a minimal sketch of this pattern follows)
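A minimal, illustrative sketch of that Blob/Queue/Table pattern using plain in-memory stand-ins; none of the names below are the actual WAS API.

```python
# Hypothetical in-memory stand-ins for the three WAS abstractions.
blobs = {}   # blob name -> raw data (incoming/outgoing data)
queue = []   # messages that drive the processing workflow
table = {}   # intermediate state and final results

def ingest(name, data):
    blobs[name] = data             # 1. ship incoming data in as a blob
    queue.append({"blob": name})   # 2. enqueue a message pointing at it

def worker():
    while queue:
        msg = queue.pop(0)              # 3. the queue drives the workflow
        data = blobs[msg["blob"]]
        table[msg["blob"]] = len(data)  # 4. keep the result in a table (or blob)

ingest("video-0001", b"...raw bytes...")
worker()
print(table)   # {'video-0001': 15}
```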

Design goals
- Highly available with strong consistency
  - Provide access to data in the face of failures/partitioning
- Durability
  - Replicate data several times within and across data centers
- Scalability
  - Need to scale to exabytes and beyond
  - Provide a global namespace to access data around the world
  - Automatically load balance data to meet peak traffic demands

Global Partitioned Namespace
- http(s)://AccountName.<service>.core.windows.net/PartitionName/ObjectName, where <service> can be blob, table, or queue (example URLs are sketched below)
- AccountName is the customer-selected account name for accessing storage
  - The account name specifies the data center where the data is stored
  - An application may use multiple AccountNames to store its data across different locations
- PartitionName locates the data once a request reaches the storage cluster
  - When a PartitionName holds many objects, the ObjectName identifies individual objects within that partition
  - The system supports atomic transactions across objects with the same PartitionName value
  - The ObjectName is optional since, for some types of data, the PartitionName uniquely identifies the object within the account
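A short sketch of how a client might form these URLs, assuming the three-part format above; the account, partition, and object values are made up.

```python
def was_url(account: str, service: str, partition: str, obj: str = "") -> str:
    """Build a WAS request URL: AccountName.<service>.core.windows.net/PartitionName[/ObjectName]."""
    assert service in ("blob", "table", "queue")
    url = f"https://{account}.{service}.core.windows.net/{partition}"
    return f"{url}/{obj}" if obj else url

# ObjectName is optional: the first example addresses data by PartitionName alone.
print(was_url("myaccount", "blob", "photos-video1"))
# https://myaccount.blob.core.windows.net/photos-video1
print(was_url("myaccount", "table", "customers", "row-17"))
# https://myaccount.table.core.windows.net/customers/row-17
```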

Storage Stamps
- A storage stamp is a cluster of N racks of storage nodes
- Each rack is built out as a separate fault domain with redundant networking and power
- Clusters typically range from 10 to 20 racks, with 18 disk-heavy storage nodes per rack
- The first-generation storage stamps hold approximately 2 PB of raw storage each
- The next-generation stamps hold up to 30 PB of raw storage each

High Level Architecture
- [Diagram: storage stamps, each fronted by a load balancer (LB) and containing Front-Ends, a Partition Layer, and a Stream Layer with intra-stamp replication; a Storage Location Service directs data access to stamps; inter-stamp (geo) replication runs between stamps; blob storage is accessed via the account URL]

Storage Stamp Architecture – Stream Layer
- Append-only distributed file system
- All data from the Partition Layer is stored into files (extents) in the Stream Layer
- An extent is replicated 3 times across different fault and upgrade domains
  - With random selection for where to place replicas
- Checksum all stored data (a read-verification sketch follows)
  - Verified on every client read
  - Re-replicate on disk/node/rack failure or checksum mismatch
- [Diagram: the Stream Layer (distributed file system) consists of a Paxos-replicated manager and Extent Nodes (EN)]
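A minimal sketch of the "checksum verified on every read" idea, assuming each block stores a CRC computed when it was appended; the Block type and read_block function are illustrative.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Block:
    data: bytes
    checksum: int   # CRC32 recorded when the block was appended

def read_block(block: Block) -> bytes:
    # Verify the stored checksum before returning data to the client.
    if zlib.crc32(block.data) != block.checksum:
        # In WAS a mismatch (or a disk/node/rack failure) triggers re-replication
        # from a healthy replica; here we just signal the corruption.
        raise IOError("checksum mismatch: read from / re-replicate via another replica")
    return block.data

good = Block(b"hello", zlib.crc32(b"hello"))
print(read_block(good))   # b'hello'
```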

Storage Stamp Architecture – Partition Layer
- Provides transaction semantics and strong consistency for Blobs, Tables and Queues
- Stores and reads the objects to/from extents in the Stream Layer
- Provides inter-stamp (geo) replication by shipping logs to other stamps
- Scalable object index via partitioning
- [Diagram: the Partition Layer consists of a Partition Master, a Lock Service, and many Partition Servers]

Storage Stamp Architecture – Front-End Layer
- Stateless servers
- Authentication + authorization
- Request routing

Storage Stamp Architecture
- [Diagram: an incoming write request arrives at a Front-End (FE) server in the Front-End Layer, is routed to the owning Partition Server in the Partition Layer (Partition Master, Lock Service, Partition Servers), which writes to the Stream Layer (Paxos-replicated manager plus Extent Nodes) before the ack flows back]

Partition Layer – Scalable Object Index
- 100s of billions of blobs, entities, and messages across all accounts can be stored in a single stamp
  - Need to efficiently enumerate, query, get, and update them
- Traffic patterns can be highly dynamic
  - Hot objects, peak load, traffic bursts, etc.
- Need a scalable index for the objects that can
  - Spread the index across 100s of servers
  - Dynamically load balance
  - Dynamically change which servers are serving each part of the index based on load

Scalable Object Index via Partitioning
- The Partition Layer maintains an internal Object Index Table for each data abstraction
  - Blob Index: contains all blob objects for all accounts in a stamp
  - Table Entity Index: contains all table entities for all accounts in a stamp
  - Queue Message Index: contains all messages for all accounts in a stamp
- Scalability is provided for each Object Index
  - Monitor the load to each part of the index to determine hot spots
  - The index is dynamically split into thousands of Index RangePartitions based on load (a split sketch follows)
  - Index RangePartitions are automatically load balanced across servers to quickly adapt to changes in load
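A minimal sketch of load-based splitting of a hot RangePartition; the load threshold and the median-key split rule are illustrative assumptions, not the actual WAS policy.

```python
SPLIT_THRESHOLD = 1000.0   # requests/sec at which a range counts as "hot" (made up)

def maybe_split(rp):
    """rp: {"low": str, "high": str, "load": float, "keys": sorted list of keys}."""
    if rp["load"] < SPLIT_THRESHOLD:
        return [rp]
    mid = rp["keys"][len(rp["keys"]) // 2]   # split near the median key so load roughly halves
    left = {"low": rp["low"], "high": mid, "load": rp["load"] / 2,
            "keys": [k for k in rp["keys"] if k <= mid]}
    right = {"low": mid, "high": rp["high"], "load": rp["load"] / 2,
             "keys": [k for k in rp["keys"] if k > mid]}
    return [left, right]   # the halves can now be balanced onto different partition servers

hot = {"low": "A", "high": "H", "load": 2400.0, "keys": ["Apple", "Cat", "Dog", "Fish"]}
print([(p["low"], p["high"]) for p in maybe_split(hot)])   # [('A', 'Dog'), ('Dog', 'H')]
```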

Partition Layer – Index Range Partitioning
- [Diagram: inside a storage stamp, the Blob Index is split into RangePartitions, e.g. A-H served by PS1, H'-R by PS2, and R'-Z by PS3; the Partition Master assigns RangePartitions to Partition Servers, and Front-End servers hold the same Partition Map to route requests (a lookup sketch follows)]
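A minimal sketch of routing a request with such a Partition Map, assuming it is kept as sorted (inclusive upper bound, partition server) pairs like the A-H/H'-R/R'-Z example above; the class and names are illustrative.

```python
import bisect

class PartitionMap:
    def __init__(self, ranges):
        # ranges: list of (inclusive_upper_bound_key, partition_server), sorted by key
        self._bounds = [k for k, _ in ranges]
        self._servers = [s for _, s in ranges]

    def lookup(self, partition_name: str) -> str:
        # Find the first range whose upper bound is >= the requested key.
        i = bisect.bisect_left(self._bounds, partition_name)
        return self._servers[min(i, len(self._servers) - 1)]

pmap = PartitionMap([("H", "PS1"), ("R", "PS2"), ("Z", "PS3")])
print(pmap.lookup("Koala"))   # PS2, since "H" < "Koala" <= "R"
```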

Partition Layer – RangePartition
- A RangePartition uses a Log-Structured Merge-Tree to maintain its persistent data (a write-path sketch follows)
- A RangePartition consists of its own set of streams in the Stream Layer, and the streams belong solely to that RangePartition
  - Metadata Stream – the root stream for a RangePartition; the PM assigns a partition to a PS by providing the name of the RangePartition's metadata stream
  - Commit Log Stream – a commit log used to store the recent insert, update, and delete operations applied to the RangePartition since the last checkpoint was generated for the RangePartition
  - Row Data Stream – stores the checkpoint row data and index for the RangePartition
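A minimal sketch of the LSM-style write path implied above: mutations are appended to the commit log, applied to an in-memory table, and periodically checkpointed into the row data stream. Plain Python lists and dicts stand in for the streams; this is illustrative, not the WAS implementation.

```python
class RangePartitionSketch:
    def __init__(self):
        self.commit_log = []    # stand-in for the append-only commit log stream
        self.checkpoints = []   # stand-in for checkpoints in the row data stream
        self.memory_table = {}  # recent mutations held in memory

    def write(self, row_key, row):
        self.commit_log.append(("put", row_key, row))   # durable append first
        self.memory_table[row_key] = row                # then update the in-memory table

    def checkpoint(self):
        # Persist the memory table into the row data stream; commit-log entries
        # covered by this checkpoint can then be truncated.
        self.checkpoints.append(dict(self.memory_table))
        self.memory_table.clear()
        self.commit_log.clear()

    def read(self, row_key):
        # Newest data wins: memory table first, then checkpoints from newest to oldest.
        if row_key in self.memory_table:
            return self.memory_table[row_key]
        for cp in reversed(self.checkpoints):
            if row_key in cp:
                return cp[row_key]
        return None
```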

Stream Layer
- Append-only distributed file system
- Streams are very large files
  - With a file-system-like directory namespace
- Stream operations (an interface sketch follows)
  - Open, close, delete streams
  - Rename streams
  - Concatenate streams together
  - Append for writing
  - Random reads
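A minimal sketch of the namespace-level operations, using a plain dict as a stand-in for the stream directory; the class and method names are illustrative, not the real stream layer API.

```python
class StreamDirectory:
    def __init__(self):
        self.streams = {}   # stream name -> ordered list of extents

    def create(self, name):
        self.streams[name] = []

    def delete(self, name):
        del self.streams[name]

    def rename(self, old, new):
        self.streams[new] = self.streams.pop(old)

    def concatenate(self, dst, src):
        # In this sketch, concatenation just splices the source's ordered
        # extent list onto the destination; no block data is touched.
        self.streams[dst].extend(self.streams.pop(src))
```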

Stream Layer Concepts
- Block
  - Minimum unit of write/read
  - Checksummed
  - Up to N bytes (e.g. 4 MB)
- Extent
  - Unit of replication
  - Sequence of blocks
  - Size limit (e.g. 1 GB)
  - Sealed/unsealed
- Stream
  - Hierarchical namespace
  - Ordered list of pointers to extents
  - Append/concatenate
- [Diagram: the stream //foo/myfile.data holds ordered pointers to extents E1-E4, each extent being a sequence of blocks; a data-model sketch follows]
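A minimal sketch of the block/extent/stream model, using the example limits from the slide (4 MB blocks, 1 GB extents); the classes are illustrative stand-ins, not the WAS data structures.

```python
import zlib
from dataclasses import dataclass, field

BLOCK_LIMIT = 4 * 1024 * 1024       # example max block size (4 MB)
EXTENT_LIMIT = 1024 * 1024 * 1024   # example extent size limit (1 GB)

@dataclass
class Block:
    data: bytes
    checksum: int                   # verified on every read

@dataclass
class Extent:
    blocks: list = field(default_factory=list)
    sealed: bool = False            # once sealed, an extent is immutable

    def size(self) -> int:
        return sum(len(b.data) for b in self.blocks)

@dataclass
class Stream:
    name: str                       # e.g. "//foo/myfile.data"
    extents: list = field(default_factory=list)   # ordered pointers to extents

    def append(self, data: bytes) -> None:
        assert len(data) <= BLOCK_LIMIT
        if not self.extents or self.extents[-1].sealed:
            self.extents.append(Extent())          # appends always go to the last extent
        last = self.extents[-1]
        last.blocks.append(Block(data, zlib.crc32(data)))
        if last.size() >= EXTENT_LIMIT:
            last.sealed = True                     # seal it; the next append opens a new extent
```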

Creating an Extent
- [Diagram: the Partition Layer asks the Paxos-replicated Stream Master (SM) to create a stream/extent; the SM allocates the extent replica set across Extent Nodes, e.g. EN1 as primary and EN2, EN3 as secondaries]

Replication Flow
- [Diagram: the Partition Layer sends an append to the primary (EN1); the primary forwards it to the secondaries (EN2, EN3), and the ack flows back to the Partition Layer]

Providing Bit-wise Identical Replicas
- Want all replicas of an extent to be bit-wise the same, up to a committed length
- Want to store pointers from the partition layer index to an extent+offset
- Want to be able to read from any replica
- Replication flow (sketched below)
  - All appends to an extent go to the primary
  - The primary orders all incoming appends and picks the offset for each append in the extent
  - The primary then forwards the offset and data to the secondaries
  - The primary performs in-order acks back to clients for extent appends
  - The primary returns the offset of the append in the extent
  - An extent offset can commit back to the client once all replicas have written that offset and all prior offsets have also been completely written
  - This represents the committed length of the extent
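A minimal sketch of the primary's append ordering and the committed length; in-memory lists stand in for the three replicas, and the synchronous loop stands in for forwarding to the secondaries. Illustrative only.

```python
class PrimaryExtentSketch:
    def __init__(self, replicas):
        self.replicas = replicas     # e.g. [primary_store, secondary_a, secondary_b]
        self.next_offset = 0         # the primary picks the offset for every append
        self.committed_length = 0    # replicas are bit-wise identical up to here

    def append(self, data: bytes) -> int:
        offset = self.next_offset
        self.next_offset += len(data)
        # Forward (offset, data) to every replica; all must write it before the commit.
        for replica in self.replicas:
            replica.append((offset, data))
        # Appends are acked in order, so all prior offsets are already committed
        # and the committed length advances past this append.
        self.committed_length = offset + len(data)
        return offset                # offset returned to the partition layer / client

ext = PrimaryExtentSketch([[], [], []])
ext.append(b"block-1")
print(ext.committed_length)   # 7
```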

Dealing with Write Failures
- Failure during append
  1. Ack from primary lost when going back to the partition layer
     - A retry from the partition layer can cause multiple blocks to be appended (duplicate records)
  2. Unresponsive/unreachable Extent Node (EN)
     - The append will not be acked back to the partition layer
     - Seal the failed extent
     - Allocate a new extent and append immediately
- [Diagram: the stream //foo/myfile.dat now points to extents E1-E4 plus a newly allocated extent E5 that takes further appends]

Extent Sealing (Scenario 1)
- [Diagram: on an append failure, the Partition Layer asks the Paxos-replicated Stream Master (SM) to seal the extent; the SM asks the extent nodes for their current length, the reachable replicas report 120, and the extent is sealed at 120]

Extent Sealing (Scenario 1, continued)
- [Diagram: the extent node that missed the seal later syncs with the SM and learns that the extent was sealed at 120]

Extent Sealing (Scenario 2)
- [Diagram: the SM asks the extent nodes for their current length; the reachable replicas report 120 and 100, and the SM seals the extent at 100, the smallest reported commit length (the seal-length rule is sketched after the two scenarios)]

Extent Sealing (Scenario 2, continued)
- [Diagram: the remaining extent node later syncs with the SM and learns that the extent was sealed at 100; its replica is brought in line with the sealed length]
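A minimal sketch of the seal-length decision illustrated by the two scenarios, assuming the Stream Master collects the current commit length from each reachable replica; illustrative only.

```python
def choose_seal_length(reported_lengths):
    """Seal at the smallest commit length reported by the reachable replicas.

    This loses no acknowledged data: an append is only acked to the partition
    layer once it has been written on every replica, so nothing that was acked
    can lie beyond the smallest reported length.
    """
    return min(reported_lengths)

print(choose_seal_length([120, 120]))   # Scenario 1: sealed at 120
print(choose_seal_length([120, 100]))   # Scenario 2: sealed at 100
```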

Providing Consistency for Data Streams
- For data streams, the Partition Layer only reads from offsets returned by successful appends
  - Committed on all replicas
  - Row and blob data streams
  - The offset is valid on any replica
- [Diagram: a network partition in which the Partition Server can talk to EN3 but the SM cannot; it is still safe for the Partition Server to read from EN3]

Providing Consistency for Log Streams
- Logs are used on partition load
  - Commit and metadata log streams
- Check the commit length first
- Only read from (the replica-selection rule is sketched below)
  - An unsealed replica, if all replicas have the same commit length
  - A sealed replica, otherwise
- [Diagram: during the same network partition (the Partition Server can talk to EN3, the SM cannot), the commit lengths are checked, the extent is sealed, and EN1 and EN2 are used for loading]
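A minimal sketch of that replica-selection rule for loading a log stream, assuming each replica reports its commit length and whether its extent is sealed; the structure and names are illustrative.

```python
def pick_log_replica(replicas):
    """replicas: list of dicts like {"name": "EN1", "commit_length": 120, "sealed": False}."""
    lengths = {r["commit_length"] for r in replicas}
    if len(lengths) == 1:
        # All replicas agree on the commit length: any replica, even an unsealed one, is safe.
        return replicas[0]
    # Otherwise only a sealed replica may be read (seal the extent first if needed).
    sealed = [r for r in replicas if r["sealed"]]
    if not sealed:
        raise RuntimeError("commit lengths differ: seal the extent before loading the log")
    return sealed[0]

replicas = [{"name": "EN1", "commit_length": 120, "sealed": True},
            {"name": "EN2", "commit_length": 120, "sealed": True},
            {"name": "EN3", "commit_length": 100, "sealed": False}]
print(pick_log_replica(replicas)["name"])   # EN1
```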

Summary
- Highly available cloud storage with strong consistency
- Scalable data abstractions to build your applications
  - Blobs – files and large objects
  - Tables – massively scalable structured storage
  - Queues – reliable delivery of messages
- More information at: Cascais/11-calder-online.pdf