Cassandra and Sigmod contest Cloud computing group Haiping Wang
Outline Cassandra Cassandra overview Data model Architecture Read and write Sigmod contest 2009 Sigmod contest 2010
Cassandra overview Highly scalable, distributed Eventually consistent Structured key-value store Dynamo + bigtable P2P Random reads and random writes Java
Data Model KEY ColumnFamily1 Name : MailList Type : Simple Sort : Name Name : tid1 Value : TimeStamp : t1 Name : tid2 Value : TimeStamp : t2 Name : tid3 Value : TimeStamp : t3 Name : tid4 Value : TimeStamp : t4 ColumnFamily2 Name : WordList Type : Super Sort : Time Name : aloha ColumnFamily3 Name : System Type : Super Sort : Name Name : hint1 Name : hint2 Name : hint3 Name : hint4 C1 V1 T1 C2 V2 T2 C3 V3 T3 C4 V4 T4 Name : dude C2 V2 T2 C6 V6 T6 Column Families are declared upfront Columns are added and modified dynamically SuperColumns are added and modified dynamically Columns are added and modified dynamically
Cassandra Architecture
Cassandra API Data structures Exceptions Service API ConsistencyLevel(4) Retrieval methods(5) Range query: returns matching keys(1) Modification methods(3) Others
Cassandra commands
Partitioning and replication(1) Consistent hashing DHT Balance Monotonicity Spread Load Virtual nodes Coordinator Preference list
01 1/2 F E D C B A N=3 h(key2) h(key1) 9 Partitioning and replication(2)
Data Versioning Always writeable Mulitple versions – put() return before all replicas – get() many versions Vector clocks Reconciliation during reads by clients
Vector clock List of (node, counter) pairs E.g. [x,2][y,3] vs. [x,3][y,4][z,1] [x,1][y,3] vs. [z,1][y,3] Use timestamp E.g. D([x,1]:t1,[y,1]:t2) Remove the oldest version when reach a thresthold
Vector clock Return all the objects at the leaves D3,4([Sx,2],[Sy,1],[Sz,1]) Single new version
Excution operations Two strategies – A generic load balancer based on load balance Easy,not have to link any code specific – Directory to the node Achieve lower latency
Put() operation client coordinator PN-1 P2 P1 w-1 responses Object with vector clock
Cluster Membership Gossip protocol State disseminated in O(logN) rounds Increase its heartbeat counter and send its list to another every T seconds Merge operations
Failure Data center(s) failure – Multiple data centers Temporary failure Permanent failure – Merkle tree
Temporary failure
Merkle tree
Boolom filter a space-efficient probabilistic data structure used to test whether an element is a member of a set false positive
Compactions K1 K2 K3 -- Sorted K2 K10 K30 -- Sorted K4 K5 K10 -- Sorted MERGE SORT K1 K2 K3 K4 K5 K10 K30 Sorted K1 Offset K5 Offset K30 Offset Bloom Filter Loaded in memory Index File Data File D E L E T E D
Write Key (CF1, CF2, CF3) Commit Log Binary serialized Key ( CF1, CF2, CF3 ) Memtable ( CF1) Memtable ( CF2) Data size Number of Objects Lifetime Dedicated Disk --- BLOCK Index Offset, Offset K 128 Offset K 256 Offset K 384 Offset Bloom Filter (Index in memory) Data file on disk
Read Query Closest replica Cassandra Cluster Replica A Result Replica BReplica C Digest Query Digest Response Result Client Read repair if digests differ
Outline Cassandra Cassandra overview Data model Architecture Read and write Sigmod contest 2009 Sigmod contest 2010
Sigmod contest 2009 Task overview API Data structure Architecture Test
Task overview Index system for main memory data Running on multi-core machine Many threads with multiple indices Serialize execution of user-specified transactions Basic function exact match queries,range queries, updates inserts, deletes
API
Record
HashTable
HashShared
TxnState
IdxState Keep track of an index Created openIndex() Destroyed closeIndex() Inherited by IdxStateType Contains pointers pointing to – a hashtable – a FixedAllocator – a Allocator – a array with the type of action
Architecture
IndexManager
DeadLockDetector
Transactor a HashOnlyGet object with type TxnState
Allocator Allocate the memory for the payloads Use pools and linked list Pool sized --the max length of payload is 100 The payloads with the same payload are in the same list
Unit Tests three threads, run over three indices the primary thread – create the primary index – inserts, deletes and accesses data in the primary index the second thread – simultaneously runs some basic tests over a separate index the third thread – ensure the transactional guarantees – Continuously queries the primary index
Outline Cassandra Cassandra overview Data model Architecture Read and write Sigmod contest 2009 Sigmod contest 2010
Task overview Implement a simple distributed query executor with the help of the in-memory index Given centralized query plans, translate them into distributed query plans Given a parsed SQL query, return the right results Data stored on disk, the indexes are all in memory Measure the total time costs
SQL query form SELECT alias_name.field_name,... FROM table_name AS alias_name,… WHERE condition1 AND... AND conditionN Condition alias_name.field_name = fixed value alias_name.field_name > fixed value alias_name.field_name1 =alias_name.field_name2
Initialization phase
Connection phase
Query phase
Closing phase
Tests An initial computation On synthetic and real-world datasets Tested on a single machine Tested on an ad-hoc cluster of peers Passed a collection of unit tests, provided with an Amazon Web Services account of a 100 USD value
Benchmarks(stag1) Assume a partition always cover the entire table, the data is not replicated. Unit-tests Benchmarks – On a single node, selects with an equal condition on the primary key – On a single node, selects with an equal condition on an indexed field – On a single node, 2 to 5 joins on tables of different size – On a single node, 1 join and a "greater than" condition on an indexed field – On three nodes, one join on two tables of different size, the two tables being on two different nodes
Benchmarks(stag2) Tables are now stored on multiple nodes Part of a table, or the whole table may be replicated on multiple nodes Queries will be sent in parallel up to 50 simultaneous connections Benchmarks – Selects with an equal condition on the primary key, the values being uniformly distributed – Selects with an equal condition on the primary key, the values being non- uniformly distributed – Multiple joins on tables separated on different nodes
Important Dates
Thank you!!!