1
Cassandra Training: Introduction & Data Modeling
2
Aims
By the end of today you should know:
- How Cassandra organises data
- How to configure replicas
- How to choose between consistency and availability
- How to efficiently model data for both reads and writes
- You need to consider Active-Active scenarios
- Who to ask to help you & sign off on your data model
HINT: Ask Neil directly or
Introduction to Cassandra
3
Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples
4
Agenda - Introduction
- Elevator Pitch
- Brewer’s Theorem & Tuneable Consistency
- Distributed Hash Table 101
- Write path
- Read path
- TTL, Deletion & Tombstones
- Background Processes
- Data Model in 5mins
- Thrift vs CQL
- Maintaining Consistency
- Scaling Cassandra
5
Agenda – Advanced Topics
- Data Modelling Key Concepts
- Time Series Modelling
- Wide rows
- Compound Keys
- Code example
- Performance Tuning Levers
- What is DataStax Enterprise?
- Multi DC Support
- Virtual Nodes
- Nodetool
6
Elevator Pitch: What?
- Write path optimised
- Eventually consistent (ms)
- Distributed Hash Table
- Highly durable
- Tunable consistency
7
Elevator Pitch: Why?
- Linear horizontal read & write scaling
- Data is important and should always be there
- Often we don’t need a consistency guarantee
- Let me choose my tradeoff
8
Elevator Pitch: How?
- Data partitioned internally across nodes
- Writes must just hit the commit log
- Store data read-optimised to minimise read & write work: no indexes to update, no query to plan
- Specify agreement (consistency) per query
9
Elevator Pitch: What it’s Not
- No support for transactions - atomicity, isolation mostly not available
- Not a silver bullet - easy to design a poorly-performing data model
10
DHT 101
- Each physical node is assigned a token
- Each node owns the range from the previous node's token (exclusive) up to its own token (inclusive)
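The ownership rule above can be sketched in a few lines of Python. This is an illustration only: the four node names and the tiny 0..99 token space are made up for readability, and real rings use a far larger token space.

```python
from bisect import bisect_left

# Hypothetical four-node ring on a toy token space of 0..99.
RING = [(0, "node1"), (25, "node2"), (50, "node3"), (75, "node4")]
TOKENS = [token for token, _ in RING]


def owner(token):
    """Return the node that owns `token`.

    A node owns the range (previous token, own token], so the owner is the
    first node whose token is >= the lookup token, wrapping past the end.
    """
    index = bisect_left(TOKENS, token)
    return RING[index % len(RING)][1]


print(owner(25))  # node2: a node's own token is inside its range
print(owner(26))  # node3: just past node2's token
print(owner(80))  # node1: wraps around the ring
```

Token 80 lands beyond the highest token, so the lookup wraps to the first node; this wrap-around is what makes the structure a ring rather than a line.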
11
Cassandra Write Path
- The coordinator will send the update to two nodes, starting at the owning node and working clockwise
12
Cassandra Write Path
- A 128-bit hash of the partition key is used to compute its token
- Keys are therefore distributed randomly around the ring
- If a replica is unavailable - Hinted Handoff
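A rough sketch of the placement described on the write path slides: hash the key to a token, find the owning node, then walk clockwise until the replication factor is met. The ring layout and node names are hypothetical, and the modulo reduction of the MD5 digest is a simplification rather than the exact arithmetic of Cassandra's Random Partitioner.

```python
import hashlib
from bisect import bisect_left

TOKEN_SPACE = 2 ** 127  # assumed token range for this sketch

# Hypothetical ring: four nodes with evenly spaced tokens.
RING = [(i * TOKEN_SPACE // 4, f"node{i + 1}") for i in range(4)]


def token_for(key: bytes) -> int:
    """Hash a partition key with MD5 (128 bits) and reduce it to the token space."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % TOKEN_SPACE


def replicas(key: bytes, ring, rf: int = 2):
    """Return `rf` nodes: the owner of the key's token, then its clockwise neighbours."""
    tokens = [token for token, _ in ring]
    start = bisect_left(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]


print(replicas(b"user42", RING, rf=2))
```

Because the hash output is effectively random, consecutive keys such as user41 and user42 usually land on unrelated parts of the ring, which is why row order cannot be relied on.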
13
Cassandra Write Path: Concepts
- The Snitch – proximity
- Random Partitioner – key -> token
- Replication Factor – how many replicas
- Gossip – discovery protocol
14
Cassandra Write Path
- SSTables are sequential and immutable
- Data may reside across SSTables
- SSTables are periodically compacted together
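To make the compaction idea concrete, here is a toy model in which each SSTable is a dict of column name to (timestamp, value). The merge rule shown, newest timestamp wins, is a simplified stand-in for Cassandra's reconciliation; the data is invented.

```python
def compact(*sstables):
    """Merge several immutable SSTables into one new table.

    For a column present in more than one table, the version with the
    newest timestamp is kept. Inputs are never modified.
    """
    merged = {}
    for table in sstables:
        for name, (timestamp, value) in table.items():
            if name not in merged or timestamp > merged[name][0]:
                merged[name] = (timestamp, value)
    # Columns are kept sorted by name, as they would be on disk.
    return dict(sorted(merged.items()))


older = {"name": (1, "alice"), "email": (1, "a@example.com")}
newer = {"name": (2, "alicia")}  # a later update, flushed to a second SSTable

print(compact(older, newer))
# {'email': (1, 'a@example.com'), 'name': (2, 'alicia')}
```

Until compaction runs, a read of this row has to consult both tables, which is why the read path tries to minimise the number of SSTables touched.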
15
Cassandra Read Path
- Data read command sent to closest replica - snitch
- Digest commands sent to other replicas – CL
- Read Repair Chance 10% - digest all replicas
16
Start & Interrogate C*
vagrant box add dse.box
mkdir ~/vagrant
curl > ~/vagrant/dse.tar.gz
cd ~/vagrant && tar xzvf dse.tar.gz
cd dse && vagrant up
vagrant ssh node1
nodetool ring
17
Cassandra Read Path: Read Mechanics
- Find Candidate SSTables - Bloom Filters
- Seek Through SSTables
- Memory Mapped Files
- Check Memtable
-> minimise sstables for best efficiency
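The Bloom filter step can be illustrated with a minimal filter. This is a teaching sketch, not Cassandra's implementation: the bit-array size, the number of hashes, the use of SHA-256 and the SSTable names are all assumptions.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits: int = 1024, hashes: int = 3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: bytes) -> None:
        for position in self._positions(key):
            self.bits |= 1 << position

    def might_contain(self, key: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(key))


def candidate_sstables(key: bytes, sstables):
    """Keep only the SSTables whose filter says the key might be present."""
    return [name for name, bloom in sstables if bloom.might_contain(key)]


with_row = BloomFilter()
with_row.add(b"row1")
empty = BloomFilter()

print(candidate_sstables(b"row1", [("sstable-1", with_row), ("sstable-2", empty)]))
# ['sstable-1']
```

A "no" from the filter is definitive, so that SSTable is skipped without any disk seek; a "yes" still has to be confirmed by reading the file.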
18
Deletion & Tombstones
- Deleted data marked as removed – tombstone
- Stops zombie data – distributed system
- Tombstones collected after a few days – configurable
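Why a delete has to be stored rather than simply applied can be shown with a small reconciliation sketch. The model is deliberately simplified: each replica holds at most one (timestamp, value) pair for a column, and TOMBSTONE is a made-up marker object.

```python
TOMBSTONE = object()  # hypothetical marker meaning "this column was deleted"


def reconcile(*versions):
    """Return the live value among replica versions, or None if deleted or absent.

    Each version is a (timestamp, value) pair, or None when the replica has
    nothing stored. The newest timestamp wins, including when it is a tombstone.
    """
    present = [v for v in versions if v is not None]
    if not present:
        return None
    _, value = max(present, key=lambda v: v[0])
    return None if value is TOMBSTONE else value


stale_replica = (1, "alice")        # this replica missed the delete
deleted_replica = (2, TOMBSTONE)    # this replica recorded the delete

print(reconcile(stale_replica, deleted_replica))  # None: the tombstone wins
print(reconcile(stale_replica, None))             # alice: zombie data returns
```

The second call shows what happens if the tombstone is collected before the stale replica has been repaired: nothing newer remains to outvote the old value, so the deleted data comes back.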
19
Brewer’s Theorem
Distributed Data – only 2 at a time –
- Consistency
- Availability
- Partition Tolerance
20
Brewer’s Theorem
- CA - normal operation, no partition, consistency and availability provided
21
Brewer’s Theorem
- AP - partition occurs, maintaining two mutable, disconnected state copies breaks consistency, availability is conserved
22
Brewer’s Theorem
- CP - partition occurs, to maintain consistency we need to take one side offline, sacrificing availability
23
Tuneable Consistency
- Cassandra Consistency Level
- Specify how many nodes must agree on each read/write
- Choose consistency or availability: CL.LOCAL_QUORUM, CL.ONE
- Eventual consistency will bring both sides into agreement eventually
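The tradeoff can be reduced to simple arithmetic on replica counts. The helper names below are made up for this sketch; the rule they encode is the usual overlap condition, where a read is guaranteed to see the latest write when the replicas read plus the replicas written exceed the replication factor.

```python
def quorum(rf: int) -> int:
    """Number of replicas that form a majority for replication factor `rf`."""
    return rf // 2 + 1


def reads_see_latest_write(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > rf


rf = 3
print(quorum(rf))                                          # 2
print(reads_see_latest_write(quorum(rf), quorum(rf), rf))  # True: quorum reads and writes
print(reads_see_latest_write(1, 1, rf))                    # False: ONE/ONE favours availability
print(rf - quorum(rf))                                     # 1: replicas that can be down at quorum
```

With RF=3, quorum operations stay consistent while tolerating one unavailable replica; reading and writing at ONE tolerates two, at the cost of possibly stale reads until the replicas converge.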
24
Background Processes
- SSTables Compacted Periodically
- Size-Tiered Compaction
  – default, no compaction guarantee
- Leveled-Compaction
  – better chance of tombstone compaction
  – more continual compaction, 2x I/O
  – impact on online
  – use for update-heavy workloads
  – creates many SSTables
25
Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples
26
Data Model: Keyspace
- Analogous to Database/Schema
- Segregate Applications
- Replication configured at this level
27
Data Model: Column Family
- Analogous to Table
- Contains many rows
- Caches configurable at this level
28
Data Model: Row
- Each one has a partition key - hash
- Has many columns – up to 2Bn
- Columns don’t have to be defined ahead of time
- Rows in the same CF can have different columns
- No sorting by rows, model ordering in rows
29
Data Model: Columns
- Sorted by name before being written to SSTable
- Name and Value are typed
- Values can be type-validated
- Column update is timestamped
- Can have TTL
30
Data Model: Counter Columns
- Distributed counters
- Can get false counts
31
Data Model: Super Columns – Don’t Use
- Blob of columns stored inside a single column
- Have to read and write whole blob
- Memory intensive
- Conflicts resolved for whole blob - bad
32
Secondary Indices
- Can define an index on a column
- Cassandra will maintain an inverted index
- Use sparingly
- Low Cardinality Columns Only
- Often better to maintain your own view
33
Thrift vs CQL
Thrift
- Original interface, hash style syntax
CQL
- SQL-like syntax but highly limited
- Sent over Thrift but plans for own protocol
34
Maintaining Consistency
- Consistency Level
- Used on read & write operations
- ONE, TWO, LOCAL_QUORUM, ALL, ANY
- Do you really need a consistency guarantee?
35
Scaling Cassandra
- Imagine RF=3, Quorum, Nodes=6
- Each query impacts 2 nodes sync
- Each write will touch all 3 nodes, though async
- To scale writes add more nodes
- To scale reads, add more replicas
36
Advanced Topics
- Data Modelling
- Wide Rows & Clustering
- Performance
- Solr 4 & Hadoop Integration
37
Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples
38
Data Modelling
- Concepts that Drive Data Modeling
- Time-series Modeling
- Wide Rows (Composite Columns)
- Compound Keys & CQL3
39
Data Modelling - Concepts
- Rows in same CF will live on different nodes
- High cost of multi-get
- De-normalise your data into rows
- Don’t Put Consistent Load on Single Row
  – Will heat up replica nodes
40
Data Modelling - Concepts
- Writes to Single Row Atomic & Isolated
- Columns are Ordered
- Column Range Slicing Efficient
- Mutating data often needs compaction tuning
41
Wide Rows: Efficient Reads
- Store how you want to fetch
- Fetch most efficient over few rows
- Store what you want to fetch in few rows
42
Time Series
- Use Timestamp for Column Name – ordered
- Range slicing efficient
- Can limit row length by using date partition key e.g
43
Composite Columns
- e.g. time1:log_class, time1:log_message, time2:log_class, time2:log_message
44
Time Series
- Writing to a Single Row Hotspots
- Use Round Robin Over Rows
- e.g :1, :2, etc…
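The two time series techniques, a date in the partition key to bound row length and round robin over several rows to avoid a write hotspot, can be combined in a key-building helper. The key layout used here, a date followed by a bucket number, is a hypothetical format chosen for illustration, as is the bucket count of 4.

```python
from datetime import date, datetime


def row_key(timestamp: datetime, sequence: int, buckets: int = 4) -> str:
    """Build a row key for a write.

    The date bounds how long any one row can grow; the bucket suffix,
    picked round robin from a running sequence number, spreads the
    day's writes over `buckets` rows (and so over several replicas).
    """
    bucket = sequence % buckets + 1
    return f"{timestamp:%Y-%m-%d}:{bucket}"


def keys_for_day(day: date, buckets: int = 4):
    """All row keys that must be read to reassemble one day's data."""
    return [f"{day:%Y-%m-%d}:{bucket}" for bucket in range(1, buckets + 1)]


print(row_key(datetime(2013, 5, 1, 12, 0), sequence=0))  # 2013-05-01:1
print(row_key(datetime(2013, 5, 1, 12, 1), sequence=5))  # 2013-05-01:2
print(keys_for_day(date(2013, 5, 1)))
```

The cost of spreading writes is paid on reads: a query for one day now has to fetch every bucket and merge the results by column timestamp.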
45
Compound Keys
- Compound Key in CQL3
- Partition Key is the row key
- Compound Key = Partition Key + Composite Key
- e.g. partition key = , composite key = time1 => time1:name, time1:msg, time2:name, time2:msg
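One way to picture the mapping is to flatten CQL3-style rows into the wide-row layout by hand. The sketch below is a conceptual model only, not how the storage engine is implemented; the column names and the partition key value "d1" are invented.

```python
def to_wide_row(rows, partition_col, clustering_col):
    """Group CQL-style rows by partition key and flatten each into one wide row.

    Every remaining column becomes a cell named (clustering value, column name),
    and cells are kept sorted by that composite name.
    """
    storage = {}
    for row in rows:
        cells = storage.setdefault(row[partition_col], {})
        for column, value in row.items():
            if column not in (partition_col, clustering_col):
                cells[(row[clustering_col], column)] = value
    return {key: dict(sorted(cells.items())) for key, cells in storage.items()}


rows = [
    {"day": "d1", "log_id": "time1", "name": "a", "msg": "x"},
    {"day": "d1", "log_id": "time2", "name": "b", "msg": "y"},
]

for cell_name, value in to_wide_row(rows, "day", "log_id")["d1"].items():
    print(cell_name, value)
```

Both logical rows end up in a single physical row keyed by the partition key, with the clustering value as the leading part of each column name, so a range query on the clustering column becomes an ordered column slice.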
46
Agenda – 100ft
- Quick Introduction
- Data Structures
- Efficient Data Modeling
- Data Modeling Examples
47
Working with CQL
cqlsh -3 192.168.33.21
CREATE KEYSPACE my_app_data
  WITH strategy_class = SimpleStrategy
  AND strategy_options:replication_factor = 2;
DESCRIBE KEYSPACE my_app_data;
48
Compound Keys
USE my_app_data;
CREATE COLUMNFAMILY logs (
  day text,          -- partition key
  log_id timeuuid,   -- clustering column
  log_class text,
  log_message text,
  primary key (day, log_id)
);
DESCRIBE columnfamilies;
49
Compound Keys
INSERT INTO logs (day,log_id,log_class,log_message)
  VALUES (' ', ' :05:00', 'error', 'it broke') USING CONSISTENCY ONE;
INSERT INTO logs (day,log_id,log_class,log_message)
  VALUES (' ', ' :05:00', 'error', 'it broke again') USING CONSISTENCY QUORUM;
50
Compound Keys
SELECT * FROM logs USING CONSISTENCY ONE WHERE day=' ';
SELECT * FROM logs USING CONSISTENCY QUORUM WHERE day=' ' AND log_id > ' :00:00';
TRY WITH CL.TWO: vagrant suspend node2
Setting CL and range querying columns, losing consistency
51
Compound Keys
cassandra-cli -h
use my_app_data;
list logs;
See the raw Cassandra data
52
Code Example - Clients: Hector
- Solid Java Client
- In Use in Production
- Round Robin
- Node Discovery
53
Code Example - Clients: Astyanax
- Netflix Open Source Library
- Simpler APIs
54
Code Example
Example: Storing Payment Methods
55
Code Example: Requirements
- Store 1-10 payment methods
- Use a single row
56
Code Example: Non-CQL
Define a composite column class
public static final class Composite {
    @Component(ordinal = 0) String paymentUuid;
    @Component(ordinal = 1) String field;
57
Code Example: Writing Data
UUID paymentUUID = TimeUUIDUtils.getUniqueTimeUUIDinMillis();
String sPaymentUUID = paymentUUID.toString();
batch.withRow(PAYMENTS_CF, userId)
    .putColumn(new Composite(sPaymentUUID, "pvtoken"), paymentInfo.pvToken, null)
    .putColumn(new Composite(sPaymentUUID, "name"), paymentInfo.name, null)
    .putColumn(new Composite(sPaymentUUID, "number"), paymentInfo.number, null)
58
Code Example: Reading Data
Need some logic to handle record boundaries
//handle the payment info boundary
if (lastSeen != null && !column.getName().getPaymentUuid().equals(lastSeen)) {
    payments.add(payment);
    payment = new PaymentInfo();
    payment.paymentUUID = UUID.fromString(column.getName().paymentUuid);
}
lastSeen = column.getName().getPaymentUuid();
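The same boundary handling can be written more compactly when the language offers a grouping primitive. The sketch below is in Python rather than the Astyanax Java of the slide, models each column as a ((payment_uuid, field), value) pair, and uses invented sample data; it relies on the columns arriving sorted by composite name, as they do within a row.

```python
from itertools import groupby


def group_payments(columns):
    """Rebuild payment records from one wide row's columns.

    `columns` is an iterable of ((payment_uuid, field), value) pairs, already
    sorted by composite name. Consecutive columns sharing the first component
    belong to the same payment record.
    """
    payments = []
    for payment_uuid, group in groupby(columns, key=lambda column: column[0][0]):
        payment = {"paymentUUID": payment_uuid}
        payment.update({name[1]: value for name, value in group})
        payments.append(payment)
    return payments


columns = [
    (("u1", "name"), "Visa"),
    (("u1", "number"), "4111"),
    (("u2", "name"), "Amex"),
]

for payment in group_payments(columns):
    print(payment)
```

Grouping also emits the final record automatically, whereas a hand-written "last seen" loop needs an extra step after the loop to add the record still being built.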
59
Code Example
A Bit Messy
60
Code Example: CQL3
- Need to define a Schema
- Cassandra needs it to split up the row for us
61
Code Example: Schema
create table paymentinfo_cql (
  user text,
  paymentid timeuuid,
  name text,
  number text,
  pvtoken text,
  primary key (user,paymentid)
);
62
Code Example: Inserting Data
insert into paymentinfo_cql (
  user, paymentid, name, number, pvtoken
) values (
  '%1$s','%2$s','%3$s','%4$s','%5$s'
)
63
Code Example: Reading Data
select * from paymentinfo_cql where user='%s'
64
Multi Datacentre Support
- Cassandra RF=2 (availability), Solr RF=1 (offline search)
- RFs set per keyspace and per logical datacentre
65
Multi Datacentre Support
- Both DCs participate in same ring
- Cassandra walks clockwise as normal to fulfill RFs
66
Performance Tuning Levers: Memory Mapped Files
- SSTables memory mapped
- Visible as high virtual memory consumption
- Read fastest when working set fits in free RAM
67
Performance Tuning Levers: Row Cache
- Saves locating SSTables, seeking, reconciliation
- Off-heap – IPC marshaling penalty
- Whole row in memory
- Good for small numbers of hot rows – Gaussian dist.
68
Performance Tuning Levers: Key Cache
- Saves seeking through SSTables
- Beneficial for large SSTables - tiered compaction
- On-heap
69
Performance Tuning Levers
- Cache hit-rates exposed over JMX
70
Performance Tuning Levers
- Take care using memory that might be stolen from the read path (VirtMem)
71
Solr 4.0 Integration (DataStax Enterprise)
- Near-realtime indexing
- Columns are available to Solr to index
- Indexes maintained in original file format
- Supports distributed search
- Use Cassandra API or Solr API
72
Hadoop Integration (DataStax Enterprise)
- DataStax implements HDFS on Cassandra – CFS
- Use H* or C* API
- No ETL
- Map operations are sent to replicas
- Reduce back to the task owner
73
Virtual Nodes
Problem #1: Adding New Nodes
74
Virtual Nodes
- Wish to add node
- Ring already loaded
- Minimise streaming caused by moves
- Could put it in between 2 existing nodes – only helps a small range (this sucks)
75
Virtual Nodes
- Double size of ring
- Minimise streaming caused by moves
- Don’t want to have to buy 2 x servers each time (also sucks)
76
Virtual Nodes
- Choose to rebalance the ring
- Load already warranted expansion
- Now adding streaming load
77
Virtual Nodes
Problem #2: Replacing Failed Nodes
78
Virtual Nodes
- Node fails
- Remaining replica heats up
79
Virtual Nodes
- Bootstrap another
- Now node 20 starts streaming => FIRE!
80
Virtual Nodes
The Solution
81
Virtual Nodes
- Slice each node into 256 token ranges
82
Virtual Nodes
- Randomly distribute tokens to other nodes
83
Virtual Nodes
- Each colour represents a node
- Each node owns an even, random distribution of the ring
84
Virtual Nodes
- Replacing a node
- Can stream from every node
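A small simulation shows why virtual nodes let a replacement stream from many peers. It is a sketch under stated assumptions: six hypothetical node names, a 64-bit token space, tokens drawn from a seeded random generator, and "peers" approximated as the owners of ring positions adjacent to a node's ranges.

```python
import random


def build_ring(nodes, tokens_per_node=256, token_space=2 ** 64, seed=0):
    """Assign each node `tokens_per_node` random tokens; return the sorted ring."""
    rng = random.Random(seed)
    ring = [
        (rng.randrange(token_space), node)
        for node in nodes
        for _ in range(tokens_per_node)
    ]
    return sorted(ring)


def streaming_peers(ring, node):
    """Nodes owning a range adjacent to any of `node`'s ranges on the ring."""
    peers = set()
    for index, (_, owner) in enumerate(ring):
        if owner == node:
            peers.add(ring[index - 1][1])                # previous position (wraps)
            peers.add(ring[(index + 1) % len(ring)][1])  # next position (wraps)
    peers.discard(node)
    return peers


nodes = [f"n{i}" for i in range(1, 7)]
single_token_ring = build_ring(nodes, tokens_per_node=1)
vnode_ring = build_ring(nodes, tokens_per_node=256)

print(len(streaming_peers(single_token_ring, "n1")))  # at most 2 neighbours
print(len(streaming_peers(vnode_ring, "n1")))         # typically every other node
```

With one token per node, a replacement can only draw on its immediate ring neighbours, which is the "remaining replica heats up" problem from the earlier slides; with many small ranges the load is spread across the whole cluster.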
85
Nodetool & Opscenter
- Do stuff with your deployment
- watch "nodetool ring"
- Useful overview of the ring – tokens, health
- Opscenter
86
Aims
By the end of today you should know:
- How Cassandra organises data
- How to configure replicas
- How to choose between consistency and availability
- How to efficiently model data for both reads and writes
- You need to consider Active-Active scenarios
- Who to ask to help you & sign off on your data model
HINT: Ask Neil directly or
87
Questions
Code Example
htraining.s3.amazonaws.com/cassandra-training.pptx