Kitsuregawa Laboratory Confidential. © 2007 Kitsuregawa Laboratory, IIS, University of Tokyo. [hoshino] paper summary: dynamo

Dynamo: Amazon's Highly Available Key-value Store
Summary
星野 喬, March 5, 2008

Outline

Background and goals
Features and advantages
Architecture
Evaluation results
Related work
Summary
Comments

Background and goals

Amazon's e-commerce services must serve
  –best seller lists, shopping carts, customer preferences, session management, sales rank, product catalog
  –for tens of millions of customers at peak times
The platform is required to provide
  –high reliability and availability
  –high scalability
A relational database is inefficient for this workload
  –complex querying and management functionality
  –expensive hardware and highly skilled personnel
Dynamo can manage the tradeoffs between
  –availability, consistency, cost-effectiveness and performance

Dynamo's features

Interface
  –key-value storage system
  –always writable
  –application-assisted conflict resolution
  –SLAs consider service latency at the 99.9th percentile
Architecture
  –consistent hashing (zero-hop DHT)
  –object versioning
  –quorum-like synchronization
  –gossip-based distributed failure detection
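
To make "consistent hashing (zero-hop DHT)" concrete, here is a minimal sketch, not Dynamo's actual code. It assumes every node holds the full token table, so a lookup is a local search with no forwarding hops. The `Ring` class, the `node-*` names, and the one-token-per-node simplification are all hypothetical.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a 128-bit position on the ring using MD5."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")


class Ring:
    """Hypothetical full-membership ring: every node holds the whole table."""

    def __init__(self, nodes):
        # One token per node for brevity; sorted so lookups can bisect.
        self._tokens = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._tokens]

    def owner(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        i = bisect.bisect_right(self._positions, ring_position(key))
        return self._tokens[i % len(self._tokens)][1]


before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"cart:{i}" for i in range(1000)]
moved = [k for k in keys if before.owner(k) != after.owner(k)]
# Every key that changed owner moved to the newly added node.
print(len(moved), all(after.owner(k) == "node-d" for k in moved))
```

Adding `node-d` only takes keys from its clockwise successor. All other keys keep their owner, which is the property behind "incremental scalability".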

Dynamo's advantages

Table 1: Summary of techniques used in Dynamo and their advantages.

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates.
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available.
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background.
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information.
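
The "anti-entropy using Merkle trees" row can be illustrated with a small sketch. It assumes both replicas cover the same key range with the same number of leaves, and it uses SHA-256 as the hash; both are choices made for this example, not details from the slides. `build_tree` and `differing_leaves` are hypothetical helper names.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return tree levels, from leaf hashes (levels[0]) up to the root (levels[-1])."""
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        # Hash adjacent pairs; an odd node at the end is paired with itself.
        level = [_h(level[i] + level[min(i + 1, len(level) - 1)])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_leaves(a, b, depth=None, index=0):
    """Descend from the root, visiting only subtrees whose hashes differ."""
    if depth is None:
        depth = len(a) - 1
    if a[depth][index] == b[depth][index]:
        return []  # identical subtree: nothing below needs to be transferred
    if depth == 0:
        return [index]
    out = []
    for child in (2 * index, 2 * index + 1):
        if child < len(a[depth - 1]):
            out += differing_leaves(a, b, depth - 1, child)
    return out


replica_1 = [b"k0=v0", b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"]
replica_2 = [b"k0=v0", b"k1=v1", b"k2=STALE", b"k3=v3", b"k4=v4"]
t1, t2 = build_tree(replica_1), build_tree(replica_2)
print(differing_leaves(t1, t2))  # [2]
```

If the roots match, the replicas are in sync after a single comparison. Otherwise only the branches that differ are walked.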

Details

Data access interface
  –get(key)
    key: hashed with MD5 to a 128-bit identifier
  –put(key, context, object)
    context: version information etc.
Replication scheme
  –Virtual nodes
Consistency scheme
  –Majority voting (quorum) system
  –(N, R, W) where R + W > N
Data store in each node
  –Berkeley Database
  –MySQL
Collision reconciliation
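
Why the condition R + W > N matters can be checked by brute force. The sketch below enumerates every possible read set of size R and write set of size W over N replicas and tests whether they always share at least one replica. The function name and the (N, R, W) triples are illustrative only.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Check by enumeration that every R-node read set meets every W-node write set."""
    replicas = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(replicas, r)
               for writes in combinations(replicas, w))


for n, r, w in [(3, 2, 2), (3, 1, 3), (3, 1, 1), (3, 1, 2)]:
    # Columns: configuration, whether R + W > N, whether overlap is guaranteed.
    print((n, r, w), r + w > n, quorums_always_overlap(n, r, w))
```

The last two columns agree for every configuration: overlap is guaranteed exactly when R + W > N, so a read then contacts at least one replica that took part in the latest successful write.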

Collision reconciliation

Write
  –Allow internal versioning
Read
  –If a collision is detected, repair (reconciliation) occurs
Vector clock
  –Acts like a timestamp for each version
  –{[node, count]}
Reconciliation patterns
  –Business logic specific
  –Timestamp based
  –High performance read
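
The {[node, count]} vector clock can be sketched as a plain dictionary. This is a minimal illustration: the helper names (`descends`, `compare`, `merge`, `increment`) and the node names `sx`, `sy`, `sz` are hypothetical, and the actual merging of object values is left to the application, as the slide notes.

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has (a >= b on every node)."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "conflict"  # concurrent versions: reconciliation is needed


def merge(a: dict, b: dict) -> dict:
    """Clock covering both versions: element-wise maximum of the two clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with the given node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


v1 = {"sx": 1}
v2 = increment(v1, "sx")   # same node writes again
v3 = increment(v2, "sy")   # two nodes update v2 independently...
v4 = increment(v2, "sz")   # ...producing concurrent versions
print(compare(v2, v1))     # a supersedes b
print(compare(v3, v4))     # conflict
reconciled = increment(merge(v3, v4), "sx")  # clock after a reconciling write via sx
print(compare(reconciled, v3), "/", compare(reconciled, v4))
```

A version whose clock is superseded can be discarded automatically. Only the "conflict" case reaches the reconciliation patterns listed on the slide.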

Results: 99.9th percentile performance

[Figure] x-axis tick: 12 hours
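
To see why the 99.9th percentile is tracked rather than the average, here is a small sketch using the nearest-rank percentile definition. The latency samples are made up for illustration and are not measurements from the paper.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the sample at rank ceil(p% of the sample count)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic data: 9,980 fast requests and 20 slow ones (values in ms).
latencies_ms = [10] * 9980 + [300] * 20
print("mean  :", mean(latencies_ms))
print("p50   :", percentile(latencies_ms, 50))
print("p99.9 :", percentile(latencies_ms, 99.9))
```

The mean and median barely register the slow requests, while the 99.9th percentile exposes them.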

Results: write buffering effects

Tradeoff: buffered writes may be lost if a server crashes

Results: load distribution

[Figure] x-axis tick: 30 min
out-of-balance: load more than 15% above or below the average
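
The out-of-balance criterion above can be written as a short function. This is a sketch of that definition only; the per-node request counts are invented for the example.

```python
def out_of_balance_fraction(loads, threshold=0.15):
    """Fraction of nodes whose load deviates from the average by more than threshold."""
    avg = sum(loads) / len(loads)
    off = [x for x in loads if abs(x - avg) > threshold * avg]
    return len(off) / len(loads)


# Hypothetical requests per node over one 30-minute window.
loads = [100, 104, 96, 130, 70]
print(out_of_balance_fraction(loads))  # 0.4
```

Here the average is 100, so the nodes at 130 and 70 fall outside the 15% band and two of the five nodes count as out of balance.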

Results: partitioning scheme

Strategy 1: T random tokens per node and partition by token value
Strategy 2: T random tokens per node and equal-sized partitions
Strategy 3: Q/S tokens per node, equal-sized partitions (Q: number of partitions, S: number of nodes)
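
Strategy 3 can be sketched as follows, assuming Q is divisible by S and keys are placed by MD5. The values of Q, the node names, and the shuffle-and-deal assignment are illustrative assumptions, not the paper's exact procedure.

```python
import hashlib
import random

Q = 12  # number of equal-sized partitions (illustrative value)
nodes = ["node-a", "node-b", "node-c"]  # S = 3 hypothetical nodes


def assign_partitions(q, nodes, seed=0):
    """Give each node q/len(nodes) partitions, chosen at random."""
    if q % len(nodes) != 0:
        raise ValueError("Q must be divisible by S in this sketch")
    order = list(range(q))
    random.Random(seed).shuffle(order)
    # Deal the shuffled partitions out round-robin so every node gets Q/S.
    return {p: nodes[i % len(nodes)] for i, p in enumerate(order)}


def partition_of(key: str, q: int) -> int:
    """Index of the equal-sized slice of the 128-bit hash space containing the key."""
    pos = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    return (pos * q) >> 128


assignment = assign_partitions(Q, nodes)


def owner(key: str) -> str:
    return assignment[partition_of(key, Q)]


print(partition_of("cart:42", Q), owner("cart:42"))
```

Because partition boundaries are fixed, adding or removing a node only changes which node owns which partition. The partitions themselves never have to be recomputed.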

Results: location of coordination

Table 2: Performance of client-driven and server-driven coordination approaches.
Columns: 99.9th percentile read latency (ms) | 99.9th percentile write latency (ms) | Average read latency (ms) | Average write latency (ms)
Rows: Server-driven | Client-driven

Related work

Peer-to-peer systems
  –Freenet, Gnutella
  –Pastry, Chord
  –Oceanstore, PAST
Distributed file systems and databases
  –Ficus, Coda, Farsite, GoogleFS, Bayou
  –FAB, Antiquity
  –Bigtable
Dynamo's features
  –always writable
  –assumes all nodes are trusted
  –requires neither hierarchical namespaces nor complex relational schema
  –suitable for latency-sensitive applications

Summary

Dynamo
  –highly available and scalable data store for Amazon.com's e-commerce platform
  –incrementally scalable
  –customizable to meet desired durability and consistency SLAs (parameters N, R, and W)

Comments

Differences from GoogleFS & Bigtable
  –Metadata management
    Google: centralized on a master server (bottleneck problem)
    Amazon: distributed across all nodes (routing problem)
    Each has pros and cons, but while GFS/Bigtable can reach ~10,000 nodes, Dynamo with a zero-hop DHT tops out at ~1,000 nodes at best; beyond that, a hierarchy is needed
  –Conflict handling
    Google: restricts writes to appends for fast processing
    Amazon: tolerates inconsistency at write time and repairs at read time
    Both are significant novelties
Predictions
  –Will it coexist with relational databases, or eat into them?
    For transactional workloads, RDBMS is safe for the time being (secondary index, foreign key)
    For data warehousing, RDBMS is at a disadvantage (losing out to MapReduce and the like)