Distributed Data Structures for Internet Services
G22.3250-001, Robert Grimm, New York University (with some slides by Steve Gribble)

Altogether Now: The Three Questions
- What is the problem?
- What is new or different or notable?
- What are the contributions and limitations?

Clusters, Clusters, Clusters
- Let's broaden the goals for cluster-based services
  - Incremental scalability
  - High availability
  - Operational manageability
  - And also data consistency
- But what to do if the data has to be persistent?
  - TACC works best for read-only data
  - Porcupine works best for a limited group of services: email, news, bulletin boards, calendaring

Enter Distributed Data Structures (DDS)
- In-memory, single-site application interface
- Persistent, distributed, replicated implementation
- Clean consistency model
  - Atomic operations (but no transactions)
  - Independent of accessing nodes (functional homogeneity)
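
To make the programming model concrete, here is a minimal sketch of what such an interface might look like. The type and method names are assumptions for illustration, not the published DDS API.

```java
// Illustrative sketch only: the names below are assumptions, not the published
// DDS API. The point is the programming model: the service sees an ordinary,
// single-site hash table, while the implementation behind the interface is
// partitioned, replicated, and persistent.
public interface DistributedHashTable {

    // Each operation is atomic: it takes effect on all replicas of the
    // affected partition or not at all. There are no multi-operation
    // transactions.
    byte[] get(byte[] key) throws DDSException;

    void put(byte[] key, byte[] value) throws DDSException;

    void remove(byte[] key) throws DDSException;
}

/** Thrown when an operation cannot be completed atomically. */
class DDSException extends Exception {
    DDSException(String message) { super(message); }
}
```

Because every node runs the same library against the same shared state (functional homogeneity), any front end can service any request.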

DDS's as an Intermediate Design Point
- Relational databases
  - Strong guarantees (ACID)
  - But also high overhead, complexity
  - Logical structure very much independent of physical layout
- Distributed data structures
  - Atomic operations, one-copy equivalence
  - Familiar, frequently used interface: hash table, tree, log
- Distributed file systems
  - Weak guarantees (e.g., close/open consistency)
  - Low-level interface with little data independence
  - Applications impose structure on directories, files, bytes

Design Principles
- Separate concerns
  - Service code implements the application
  - Storage management is reusable, recoverable
- Appeal to properties of clusters
  - Generally secure and well-administered
  - Fast network, uninterruptible power
- Design for high throughput and high concurrency
  - Use event-driven implementation
  - Make it easy to compose components
  - Make it easy to absorb bursts (in event queues)
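
A minimal sketch of the event-driven, queue-composed style these principles call for. The class and names are hypothetical, and the paper's actual implementation (built on JDK 1.1-era user-level threads) differs in detail.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: one event-driven component with a bounded input queue.
// Components compose by enqueueing events onto each other's queues, and the
// queues absorb bursts instead of the system spawning a thread per request.
final class EventStage implements Runnable {

    private final BlockingQueue<Runnable> events;

    EventStage(int capacity) {
        this.events = new ArrayBlockingQueue<>(capacity);
    }

    // offer() returns false when the queue is full; that is the natural hook
    // for admission control (reject or degrade rather than queue forever).
    boolean enqueue(Runnable event) {
        return events.offer(event);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                events.take().run();  // handle one event at a time
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```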

Assumptions
- No network partitions within the cluster
  - Highly redundant network
- DDS components are fail-stop
  - Components implemented to terminate themselves
- Failures are independent
- Messaging is synchronous
  - Bounded time for delivery
- Workload has no extreme hotspots (for the hash table)
  - Population density over the key space is even
  - Working set of hot keys is larger than the number of cluster nodes

Distributed Hash Tables (in a Cluster…)

DHT Architecture

Cluster-Wide Metadata Structures

Metadata Maps
- Why is two-phase commit acceptable for DDS's?
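
As a rough sketch of how the cluster-wide maps are consulted on a write, and why two-phase commit is tolerable here: the type and method names below are assumptions, and flat Java maps stand in for the paper's data partition and replica group maps.

```java
import java.util.List;
import java.util.Map;

// Hedged sketch: find the replicas of a key's partition via the cluster-wide
// maps, then two-phase-commit the write across them.
final class DhtWriter {

    /** One storage brick; prepare/commit/abort form the two-phase protocol. */
    interface Brick {
        boolean prepare(long keyHash, byte[] value);
        void commit(long keyHash);
        void abort(long keyHash);
    }

    private final Map<Long, String> partitionOfPrefix;           // data partition map (stand-in)
    private final Map<String, List<Brick>> replicasOfPartition;  // replica group map

    DhtWriter(Map<Long, String> dp, Map<String, List<Brick>> rg) {
        this.partitionOfPrefix = dp;
        this.replicasOfPartition = rg;
    }

    void put(byte[] key, byte[] value) {
        long h = hash(key);
        String partition = partitionOfPrefix.get(h >>> 56);      // toy 8-bit prefix
        List<Brick> replicas = replicasOfPartition.get(partition);

        // Phase 1: every replica must agree to apply the write.
        for (Brick b : replicas) {
            if (!b.prepare(h, value)) {
                for (Brick r : replicas) r.abort(h);
                throw new IllegalStateException("put aborted");
            }
        }
        // Phase 2: commit everywhere. This is tolerable because replica groups
        // are small, the cluster network is fast and assumed partition-free,
        // and each operation touches a single key, so commits are short.
        for (Brick b : replicas) {
            b.commit(h);
        }
    }

    private static long hash(byte[] key) {
        long h = 1125899906842597L;                               // toy hash, illustration only
        for (byte b : key) h = 31 * h + b;
        return h;
    }
}
```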

Recovery

Experimental Evaluation
- Cluster of 28 2-way SMPs and 38 4-way SMPs
  - For a total of 208 Pentium CPUs
  - 2-way SMPs: 500 MB RAM, 100 Mb/s switched Ethernet
  - 4-way SMPs: 1 GB RAM, 1 Gb/s switched Ethernet
- Implementation written in Java
  - Sun's JDK 1.1.7v3, OpenJIT, Linux user-level threads
- Load generators run within the cluster
  - 80 nodes necessary to saturate 128 storage bricks

Scalability: Reads and Writes

Graceful Degradation (Reads)

Unexpected Imbalance (Writes)
- What's going on?

Capacity

Recovery Behavior
- (Timeline annotations from the figure: 1 brick fails, recovery, GC in action, buffer cache warm-up, back to normal)

So, All Is Good?

Assumptions Considered Harmful!
- Central insight, based on experience with DDS
  - "Any system that attempts to gain robustness solely through precognition is prone to fragility"
- In other words
  - Complex systems are so complex that they are impossible to understand completely, especially when operating outside their expected range

Assumptions in Action
- Bounded synchrony
  - Timeout four orders of magnitude higher than the common-case round-trip time
  - But garbage collection may take a very long time
  - The result is a catastrophic drop in throughput
- Independent failures
  - Race condition in two-phase commit caused a latent memory leak (10 KB/minute under normal operation)
  - All bricks failed predictably within minutes of each other
    - After all, they were started at about the same time
  - The result is a catastrophic loss of data

Assumptions in Action (cont.)
- Fail-stop components
  - Session layer uses the synchronous connect() method
  - Another graduate student adds a firewalled machine to the cluster, resulting in nodes locking up for 15 minutes at a time
  - The result is a catastrophic corruption of data
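
One concrete mitigation for this particular failure mode is to bound connection setup with an explicit timeout instead of relying on the fail-stop assumption. A hedged sketch follows; the class name and timeout parameter are illustrative, and Socket.connect with a timeout only appeared in Java 1.4, so it was not available to the original JDK 1.1 implementation.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: a session layer that bounds connection setup instead of assuming
// peers are fail-stop. A firewalled host that silently drops packets now
// produces a fast, observable error rather than a node locked up for minutes.
final class BoundedConnect {

    static Socket connect(String host, int port, int timeoutMillis) throws IOException {
        Socket socket = new Socket();
        try {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return socket;
        } catch (IOException e) {
            socket.close();   // fail fast and surface the error to the caller
            throw e;
        }
    }
}
```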

What Can We Do?
- Systematically overprovision the system
  - But doesn't that mean predicting the future, again?
- Use admission control
  - But this can still result in livelock, only later…
- Build introspection into the system
  - Need to easily quantify behavior in order to adapt
- Close the control loop
  - Make the system adapt automatically (but see previous)
- Plan for failures
  - Use transactions, checkpoint frequently, reboot proactively
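
A minimal sketch of what "build introspection in and close the control loop" can mean in practice; the observed signal, threshold, and sampling period below are all illustrative assumptions rather than anything prescribed by the paper.

```java
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch: sample an observable signal (a stage's queue length)
// and adapt at run time (shed load) instead of relying on design-time
// predictions about offered load.
final class LoadController implements Runnable {

    private final BlockingQueue<?> queue;     // the signal being observed
    private final int shedThreshold;          // illustrative policy knob
    private volatile boolean shedding;

    LoadController(BlockingQueue<?> queue, int shedThreshold) {
        this.queue = queue;
        this.shedThreshold = shedThreshold;
    }

    /** Consulted on the request path: drop or degrade work while true. */
    boolean shouldShed() {
        return shedding;
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            shedding = queue.size() > shedThreshold;   // close the loop
            try {
                Thread.sleep(100);                      // sampling period (ms)
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```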

What Do You Think?