Introduction & Data Modeling

Name: Introduction & Data Modeling
Uploaded: 2017-09-07T22:32:54+00:00
Duration: PTM28S17
Channel: Edgar Shaw
Description: Introduction & Data Modeling

Cassandra Training Introduction & Data Modeling

Aims
By the end of today you should know:
How Cassandra organises data
How to configure replicas
How to choose between consistency and availability
How to efficiently model data for both reads and writes
You need to consider Active-Active scenarios
Who to ask to help you & sign off on your data model
HINT: Ask Neil directly or
Introduction to Cassandra

Agenda - 100ft
Quick Introduction
Data Structures
Efficient Data Modeling
Data Modeling Examples

Agenda - Introduction
Elevator Pitch
Brewer’s Theorem & Tuneable Consistency
Distributed Hash Table 101
Write path
Read path
TTL, Deletion & Tombstones
Background Processes
Data Model in 5mins
Thrift vs CQL
Maintaining Consistency
Scaling Cassandra

Agenda - Advanced Topics
Data Modelling Key Concepts
Time Series Modelling
Wide rows
Compound Keys
Code example
Performance Tuning Levers
What is DataStax Enterprise?
Multi DC Support
Virtual Nodes
Nodetool

Elevator Pitch
What?
Write path optimised
Eventually consistent (ms)
Distributed Hash Table
Highly durable
Tunable consistency

Elevator Pitch
Why?
Linear horizontal read & write scaling
Data is important and should always be there
Often we don’t need a consistency guarantee: let me choose my tradeoff

Elevator Pitch
How?
Data partitioned internally across nodes
Writes must just hit the commit log
Store data read-optimised to minimise read & write work: no indexes to update, no query to plan
Specify agreement (consistency) per query

Elevator Pitch
What it’s Not
No support for transactions - atomicity, isolation mostly not available
Not a silver bullet - easy to design a poorly-performing data model

DHT 101
Each physical node is assigned a token
Each node owns the range from the previous node's token up to its own token
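
The ownership rule on this slide can be sketched in a few lines of Java. This is a toy model, not Cassandra's implementation: the class name TokenRing, the node names and the small integer tokens are all made up for illustration. It only shows the lookup "first token at or after the key's token, wrapping around at the end of the ring".

```java
import java.util.Map;
import java.util.TreeMap;

// Toy token ring (illustrative only): each node has one token and owns
// the range (previous token, own token].
final class TokenRing {
    private final TreeMap<Long, String> nodesByToken = new TreeMap<>();

    void addNode(String name, long token) {
        nodesByToken.put(token, name);
    }

    // Owner = first node whose token is >= the key's token, wrapping to
    // the lowest token when the key falls past the last node.
    String ownerOf(long keyToken) {
        if (nodesByToken.isEmpty()) {
            throw new IllegalStateException("ring has no nodes");
        }
        Map.Entry<Long, String> entry = nodesByToken.ceilingEntry(keyToken);
        if (entry == null) {
            entry = nodesByToken.firstEntry();
        }
        return entry.getValue();
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode("node1", 0L);
        ring.addNode("node2", 25L);
        ring.addNode("node3", 50L);
        ring.addNode("node4", 75L);
        System.out.println(ring.ownerOf(30L)); // falls in (25, 50]
        System.out.println(ring.ownerOf(90L)); // wraps around the ring
    }
}
```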

Cassandra Write Path
The coordinator will send the update to two nodes, starting at the owning node and working clockwise
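
A minimal sketch of "start at the owning node and work clockwise", assuming a ring held as a sorted map from token to node name. ReplicaPlacement and replicasFor are hypothetical names, and the sketch ignores racks and datacentres, which real placement strategies can take into account.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative replica selection: owner first, then the next nodes
// clockwise, until rf distinct nodes have been collected.
final class ReplicaPlacement {
    static List<String> replicasFor(TreeMap<Long, String> ring, long keyToken, int rf) {
        // Tokens >= keyToken come first, then the wrap-around part of the ring.
        List<String> clockwise = new ArrayList<>(ring.tailMap(keyToken, true).values());
        clockwise.addAll(ring.headMap(keyToken, false).values());

        List<String> replicas = new ArrayList<>();
        for (String node : clockwise) {
            if (replicas.size() == rf) {
                break;
            }
            if (!replicas.contains(node)) {
                replicas.add(node);
            }
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(0L, "node1");
        ring.put(25L, "node2");
        ring.put(50L, "node3");
        ring.put(75L, "node4");
        // With rf = 2 the update goes to the owner and its clockwise neighbour.
        System.out.println(replicasFor(ring, 30L, 2));
    }
}
```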

Cassandra Write Path
128-bit hash of the partition key used to compute its token
Keys are therefore distributed randomly around the ring
If a replica is unavailable - Hinted Handoff
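
To illustrate how a 128-bit hash spreads keys around the ring, the sketch below hashes a key with MD5 and reads the digest as a non-negative integer. Md5Token is a hypothetical name, and the byte-to-integer conversion is a simplification that is not guaranteed to match the Random Partitioner's exact token arithmetic.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative key -> token mapping using a 128-bit MD5 digest.
final class Md5Token {
    static BigInteger tokenFor(String partitionKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(partitionKey.getBytes(StandardCharsets.UTF_8));
            // Signum 1 treats the 16 digest bytes as an unsigned 128-bit value.
            return new BigInteger(1, digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Similar keys land on unrelated tokens, which is what spreads load.
        System.out.println(tokenFor("user:1"));
        System.out.println(tokenFor("user:2"));
    }
}
```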

Cassandra Write Path
Concepts
The Snitch - proximity
Random Partitioner - key -> token
Replication Factor - how many replicas
Gossip - discovery protocol

Cassandra Write Path
SSTables are sequential and immutable
Data may reside across SSTables
SSTables are periodically compacted together
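
Because data for one row may be spread over several immutable SSTables, compaction has to merge them and keep only the newest version of each column. The sketch below models an SSTable as a sorted map from column name to a timestamped cell; Compaction and Cell are hypothetical names, and real compaction also handles tombstones, TTLs and on-disk layout.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative compaction: merge sorted, immutable tables into a new one,
// keeping the cell with the highest write timestamp per column name.
final class Compaction {
    static final class Cell {
        final String value;
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    static TreeMap<String, Cell> compact(List<TreeMap<String, Cell>> sstables) {
        TreeMap<String, Cell> merged = new TreeMap<>();
        for (TreeMap<String, Cell> sstable : sstables) {
            for (Map.Entry<String, Cell> e : sstable.entrySet()) {
                // On a clash, the newer timestamp wins (ties keep the first seen).
                merged.merge(e.getKey(), e.getValue(),
                        (kept, candidate) -> candidate.timestamp > kept.timestamp ? candidate : kept);
            }
        }
        return merged;
    }
}
```

The inputs are never modified; the result is a new table, which mirrors the immutability noted on the slide.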

Cassandra Read Path
Data read command sent to closest replica, as chosen by the snitch
Digest commands sent to other replicas, as many as the CL requires
Read Repair Chance 10% - digest all replicas

Start & Interrogate C*
vagrant box add dse.box
mkdir ~/vagrant
curl > ~/vagrant/dse.tar.gz
cd ~/vagrant && tar xzvf dse.tar.gz
cd dse && vagrant up
vagrant ssh node1
nodetool ring

Cassandra Read Path
Read Mechanics
Find Candidate SSTables - Bloom Filters
Seek Through SSTables
Memory Mapped Files
Check Memtable
-> minimise sstables for best efficiency
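
A Bloom filter lets the read path skip SSTables that cannot contain a key: a "no" is definite, a "yes" only means the SSTable is a candidate. The sketch below is a generic textbook Bloom filter, not Cassandra's implementation; SimpleBloomFilter and its hashing scheme are assumptions made for illustration.

```java
import java.util.BitSet;

// Generic Bloom filter sketch: k bit positions per key, derived from two
// cheap hashes (double hashing). False positives are possible, false
// negatives are not.
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashCount;

    SimpleBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    void add(String key) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(index(key, i));
        }
    }

    // false -> the key was definitely never added; true -> it may have been.
    boolean mightContain(String key) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }
}
```

One such filter per SSTable would be consulted before any seek, so only SSTables answering "maybe" are read from disk.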

Deletion & Tombstones
Deleted data marked as removed - tombstone
Stops zombie data - distributed system
Tombstones collected after a few days - configurable
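
Why a tombstone is needed can be shown with a small last-write-wins sketch. Here a delete is stored as a timestamped marker rather than as the absence of data, so a replica that missed the delete cannot bring the value back. Tombstones, Cell and reconcile are hypothetical names; this is a simplified model of reconciliation, not Cassandra's code.

```java
// Illustrative last-write-wins reconciliation with tombstones.
final class Tombstones {
    static final class Cell {
        final String value;   // null marks a tombstone
        final long timestamp;

        Cell(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }

        static Cell tombstone(long timestamp) {
            return new Cell(null, timestamp);
        }

        boolean isTombstone() {
            return value == null;
        }
    }

    // Picks the newer of two replica versions; null means "replica has nothing".
    static Cell reconcile(Cell a, Cell b) {
        if (a == null) {
            return b;
        }
        if (b == null) {
            return a;
        }
        return b.timestamp > a.timestamp ? b : a;
    }

    public static void main(String[] args) {
        Cell stale = new Cell("it broke", 1L);
        Cell deleted = Cell.tombstone(2L);
        // With a tombstone the delete wins over the stale replica.
        System.out.println(reconcile(stale, deleted).isTombstone());
        // Without one (data simply missing) the stale value would come back.
        System.out.println(reconcile(stale, null).value);
    }
}
```

This is also why tombstones are only collected after a delay: every replica needs a chance to see the delete first.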

Brewer’s Theorem
Distributed Data - only 2 at a time:
Consistency
Availability
Partition Tolerance

Brewer’s Theorem
CA - normal operation, no partition, consistency and availability provided

Brewer’s Theorem
AP - partition occurs; maintaining two mutable, disconnected state copies breaks consistency, availability is conserved

Brewer’s Theorem
CP - partition occurs; to maintain consistency we need to take one side offline, sacrificing availability

Tuneable Consistency
Cassandra Consistency Level
Specify the number of nodes that must agree on a read/write
Choose consistency or availability: CL.LOCAL_QUORUM, CL.ONE
Eventual consistency will bring both sides back into agreement

Background Processes
SSTables Compacted Periodically
Size-Tiered Compaction - default, no compaction guarantee
Leveled Compaction - better chance of tombstone compaction
  - more continual compaction, 2x I/O
  - impact on online
  - use for update-heavy workloads
  - creates many SSTables

Data Model
Keyspace
Analogous to Database/Schema
Segregate Applications
Replication configured at this level

Data Model
Column Family
Analogous to Table
Contains many rows
Caches configurable at this level

Data Model
Row
Each one has a partition key - hash
Has many columns - up to 2Bn
Columns don’t have to be defined ahead of time
Rows in the same CF can have different columns
Rows are not sorted; model ordering within rows

Data Model
Columns
Sorted by name before being written to SSTable
Name and Value are typed
Values can be type-validated
Column update is timestamped
Can have TTL

Data Model
Counter Columns
Distributed counters
Can get false counts

Data Model
Super Columns - Don’t Use
Blob of columns stored inside a single column
Have to read and write whole blob
Memory intensive
Conflicts resolved for whole blob - bad

Secondary Indices
Can define an index on a column
Cassandra will maintain an inverted index
Use sparingly
Low Cardinality Columns Only
Often better to maintain your own view

Thrift vs CQL
Thrift: original interface, hash style syntax
CQL: SQL-like syntax but highly limited; sent over Thrift but plans for own protocol

Maintaining Consistency
Consistency Level
Used on read & write operations
ONE, TWO, LOCAL_QUORUM, ALL, ANY
Do you really need a consistency guarantee?

Scaling Cassandra
Imagine RF=3, Quorum, Nodes=6
Each query impacts 2 nodes sync
Each write will touch all 3 nodes, though async
To scale writes, add more nodes
To scale reads, add more replicas
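
The "2 nodes sync" figure follows from the usual quorum arithmetic: a quorum is a majority of the replicas, floor(RF / 2) + 1. The sketch below (ConsistencyMath is a hypothetical helper, not a driver API) also checks the standard overlap rule: a read is guaranteed to see the latest acknowledged write when read replicas + write replicas > RF.

```java
// Quorum arithmetic for a given replication factor (illustrative helper).
final class ConsistencyMath {
    // Majority of the replicas: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every read set must intersect every write set.
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println(q); // 2
        System.out.println(readsSeeLatestWrite(q, q, rf)); // true
        System.out.println(readsSeeLatestWrite(1, 1, rf)); // false
    }
}
```

With RF=3, reading and writing at quorum touches 2 replicas synchronously each way, and 2 + 2 > 3, whereas ONE/ONE trades that guarantee for availability.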

Advanced Topics
Data Modelling
Wide Rows & Clustering
Performance
Solr 4 & Hadoop Integration

Data Modelling
Concepts that Drive Data Modeling
Time-series Modeling
Wide Rows (Composite Columns)
Compound Keys & CQL3

Data Modelling - Concepts
Rows in same CF will live on different nodes
High cost of multi-get
De-normalise your data into rows
Don’t put consistent load on a single row: it will heat up the replica nodes

Data Modelling - Concepts
Writes to a single row are atomic & isolated
Columns are ordered
Column range slicing is efficient
Mutating data often needs compaction tuning

Wide Rows
Efficient Reads
Store how you want to fetch
Fetch most efficient over few rows
Store what you want to fetch in few rows

Time Series
Use Timestamp for Column Name - ordered
Range slicing efficient
Can limit row length by using date partition key e.g

Composite Columns
Composite Column
e.g. time1:log_class, time1:log_message, time2:log_class, time2:log_message

Time Series
Writing to a single row creates hotspots
Use Round Robin Over Rows
e.g :1, :2, etc…
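
One way to combine the two time-series ideas (a date in the row key to bound row length, plus a round-robin suffix to avoid a single hot row) is sketched below. The key format "day:shard", the class name TimeSeriesRowKeys and the shard count are assumptions for illustration; the slide's own key prefix is not shown in the source.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative row-key generator: "<UTC day>:<shard>", cycling through
// shards 1..n so consecutive writes land on different rows.
final class TimeSeriesRowKeys {
    private final int shards;
    private final AtomicLong counter = new AtomicLong();

    TimeSeriesRowKeys(int shards) {
        if (shards < 1) {
            throw new IllegalArgumentException("shards must be >= 1");
        }
        this.shards = shards;
    }

    String nextRowKey(Instant eventTime) {
        LocalDate day = eventTime.atZone(ZoneOffset.UTC).toLocalDate();
        long shard = counter.getAndIncrement() % shards + 1;
        return day + ":" + shard;
    }

    public static void main(String[] args) {
        TimeSeriesRowKeys keys = new TimeSeriesRowKeys(2);
        Instant t = Instant.parse("2013-02-14T10:05:00Z");
        System.out.println(keys.nextRowKey(t));
        System.out.println(keys.nextRowKey(t));
    }
}
```

The cost of this design is on the read side: a query for one day has to fetch every shard row for that day and merge the results.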

Compound Keys
Compound Key in CQL3
Partition Key is the row key
Compound Key = Partition Key + Composite Key
e.g. partition key = , composite key = time1 => time1:name, time1:msg, time2:name, time2:msg

Working with CQL
cqlsh -3 192.168.33.21
CREATE KEYSPACE my_app_data
  WITH strategy_class = SimpleStrategy
  AND strategy_options:replication_factor = 2;
DESCRIBE KEYSPACE my_app_data;

Compound Keys
USE my_app_data;
CREATE COLUMNFAMILY logs (
  day text,          -- partition key
  log_id timeuuid,   -- clustering column
  log_class text,
  log_message text,
  primary key (day, log_id)
);
DESCRIBE columnfamilies;

Compound Keys
INSERT INTO logs (day,log_id,log_class,log_message)
  VALUES (' ', ' :05:00', 'error', 'it broke') USING CONSISTENCY ONE;
INSERT INTO logs (day,log_id,log_class,log_message)
  VALUES (' ', ' :05:00', 'error', 'it broke again') USING CONSISTENCY QUORUM;

Compound Keys
SELECT * FROM logs USING CONSISTENCY ONE WHERE day=' ';
SELECT * FROM logs USING CONSISTENCY QUORUM WHERE day=' ' AND log_id > ' :00:00';
TRY WITH CL.TWO: vagrant suspend node2
Setting CL and range querying columns, losing consistency

Compound Keys
See the raw Cassandra data
cassandra-cli -h
use my_app_data;
list logs;

Code Example - Clients
Hector
Solid Java Client
In Use in Production
Round Robin
Node Discovery

Code Example - Clients
Astyanax
Netflix Open Source Library
Simpler APIs

Code Example
Example: Storing Payment Methods

Code Example
Requirements
Store 1-10 payment methods
Use a single row

Code Example
Non-CQL
Define a composite column class
public static final class Composite {
    @Component(ordinal = 0) String paymentUuid;
    @Component(ordinal = 1) String field;
    // remainder of the class is not shown in the transcript
}

Code Example
Writing Data
UUID paymentUUID = TimeUUIDUtils.getUniqueTimeUUIDinMillis();
String sPaymentUUID = paymentUUID.toString();
batch.withRow(PAYMENTS_CF, userId)
    .putColumn(new Composite(sPaymentUUID, "pvtoken"), paymentInfo.pvToken, null)
    .putColumn(new Composite(sPaymentUUID, "name"), paymentInfo.name, null)
    .putColumn(new Composite(sPaymentUUID, "number"), paymentInfo.number, null)

Code Example
Reading Data
Need some logic to handle record boundaries
// handle the payment info boundary
if (lastSeen != null && !column.getName().getPaymentUuid().equals(lastSeen)) {
    payments.add(payment);
    payment = new PaymentInfo();
    payment.paymentUUID = UUID.fromString(column.getName().paymentUuid);
}
lastSeen = column.getName().getPaymentUuid();
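
The boundary handling on this slide can be seen end to end in a self-contained sketch. It does not use Astyanax: Column is a hypothetical stand-in for a composite column (paymentUuid, field) with a value, and each payment is collected into a plain map. It relies on the columns arriving sorted by composite name, so all fields of one payment are adjacent.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative regrouping of a wide row's composite columns into records.
final class PaymentGrouping {
    static final class Column {
        final String paymentUuid;
        final String field;
        final String value;

        Column(String paymentUuid, String field, String value) {
            this.paymentUuid = paymentUuid;
            this.field = field;
            this.value = value;
        }
    }

    static List<Map<String, String>> group(List<Column> sortedColumns) {
        List<Map<String, String>> payments = new ArrayList<>();
        Map<String, String> current = null;
        String lastSeen = null;
        for (Column column : sortedColumns) {
            // A change of paymentUuid marks the start of the next record.
            if (!column.paymentUuid.equals(lastSeen)) {
                current = new LinkedHashMap<>();
                current.put("paymentUuid", column.paymentUuid);
                payments.add(current);
                lastSeen = column.paymentUuid;
            }
            current.put(column.field, column.value);
        }
        return payments;
    }
}
```

Adding each record when it starts, rather than when the next one begins, avoids having to remember to flush the final record after the loop.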

Code Example
A Bit Messy

Code Example
CQL3
Need to define a Schema
Cassandra needs it to split up the row for us

Code Example
Schema
create table paymentinfo_cql (
  user text,
  paymentid timeuuid,
  name text,
  number text,
  pvtoken text,
  primary key (user,paymentid)
);

Code Example
Inserting Data
insert into paymentinfo_cql (
  user, paymentid, name, number, pvtoken
) values (
  '%1$s','%2$s','%3$s','%4$s','%5$s'
)

Code Example
Reading Data
select * from paymentinfo_cql where user='%s'

Multi Datacentre Support
Cassandra RF=2 (availability), Solr RF=1 (offline search)
RFs set per keyspace and per logical datacentre

Multi Datacentre Support
Both DCs participate in same ring
Cassandra walks clockwise as normal to fulfill RFs

Performance Tuning Levers
Memory Mapped Files
SSTables memory mapped
Visible as high virtual memory consumption
Reads are fastest when the working set fits in free RAM

Row Cache
Saves locating SSTables, seeking, reconciliation
Off-heap - IPC marshaling penalty
Whole row in memory
Good for small numbers of hot rows - Gaussian dist.

Key Cache
Saves seeking through SSTables
Beneficial for large SSTables - tiered compaction
On-heap

Cache hit-rates exposed over JMX

Take care using memory that might be stolen from the read path (VirtMem)

Solr 4.0 Integration
DataStax Enterprise
Near-realtime indexing
Columns are available to Solr to index
Indexes maintained in original file format
Supports distributed search
Use Cassandra API or Solr API

Hadoop Integration
DataStax Enterprise
DataStax implements HDFS on Cassandra - CFS
Use H* or C* API
No ETL
Map operations are sent to replicas
Reduce back to the task owner

Virtual Nodes
Problem #1: Adding New Nodes

Virtual Nodes
Wish to add node
Ring already loaded
Minimise streaming caused by moves
Could put it in between 2 existing nodes - only helps a small range (this sucks)

Virtual Nodes
Double size of ring
Minimise streaming caused by moves
Don’t want to have to buy 2 x servers each time (also sucks)

Virtual Nodes
Choose to rebalance the ring
Load already warranted expansion
Now adding streaming load

Virtual Nodes
Problem #2: Replacing Failed Nodes

Virtual Nodes
Node fails
Remaining replica heats up

Virtual Nodes
Bootstrap another
Now node 20 starts streaming => FIRE!

Virtual Nodes
The Solution

Virtual Nodes
Slice each node into 256 token ranges

Virtual Nodes
Randomly distribute tokens to other nodes

Virtual Nodes
Each colour represents a node
Each node owns an even, random distribution of the ring

Virtual Nodes
Replacing a node
Can stream from every node
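
The effect of giving every node many random tokens can be simulated in a few lines. The sketch below uses a toy 40-bit token space and a seeded random generator, both assumptions chosen for illustration rather than Cassandra's real token range or allocation logic; VirtualNodes, buildRing and ownership are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Illustrative vnode simulation: each node gets many random tokens, and
// ownership is the summed size of the ranges ending at its tokens.
final class VirtualNodes {
    static final long RING_SIZE = 1L << 40; // toy token space

    static TreeMap<Long, String> buildRing(String[] nodes, int tokensPerNode, long seed) {
        Random random = new Random(seed);
        TreeMap<Long, String> ring = new TreeMap<>();
        for (String node : nodes) {
            int assigned = 0;
            while (assigned < tokensPerNode) {
                long token = random.nextLong() >>> 24; // uniform in [0, 2^40)
                if (ring.putIfAbsent(token, node) == null) {
                    assigned++; // skip the rare token collision
                }
            }
        }
        return ring;
    }

    // Fraction of the ring owned by each node; the fractions sum to 1.
    static Map<String, Double> ownership(TreeMap<Long, String> ring) {
        Map<String, Double> share = new HashMap<>();
        long previous = ring.lastKey() - RING_SIZE; // wrap-around predecessor
        for (Map.Entry<Long, String> e : ring.entrySet()) {
            double fraction = (double) (e.getKey() - previous) / RING_SIZE;
            share.merge(e.getValue(), fraction, Double::sum);
            previous = e.getKey();
        }
        return share;
    }

    public static void main(String[] args) {
        String[] nodes = {"node1", "node2", "node3", "node4"};
        System.out.println(ownership(buildRing(nodes, 256, 42L)));
    }
}
```

Because each node's ranges are scattered around the ring, their neighbours are spread over all the other nodes, which is why a replacement node can stream from every node instead of overloading one.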

Nodetool & Opscenter
Do stuff with your deployment
watch “nodetool ring”
Useful overview of the ring - tokens, health
Opscenter

Aims
By the end of today you should know:
How Cassandra organises data
How to configure replicas
How to choose between consistency and availability
How to efficiently model data for both reads and writes
You need to consider Active-Active scenarios
Who to ask to help you & sign off on your data model
HINT: Ask Neil directly or

Questions
Code Example
htraining.s3.amazonaws.com/cassandra-training.pptx
