Cassandra for SQL Server Professionals

Agenda
- Introduction
- Relational Can Be Hard
- Introducing Cassandra
- Storage and Data Model
- Cassandra vs SQL Server

Thank You to All Our Sponsors!

Who am I?
Dan Mallott
- Twitter: @DanielMallott
- Github:
- LinkedIn:
- Principal for West Monroe Partners
- In the industry since 2011
- Primary experience with SQL Server, starting with SQL Server 2005
- Also worked with Oracle, PostgreSQL, and Cassandra
- Been both a DBA and a developer
- Have a couple of Microsoft certifications
- DataStax Certified Professional

RDBMS has been great…
Motivations
- Tremendously successful for the past 40+ years
- Keeps data consistent
- Great with OLTP
- Provides access through a well-defined query grammar
- Manipulates data with widely adopted programming languages
- Many people have the required skills to develop and maintain it
- Data in a single location (with replication to other sites)

When Keeping It Re(lation)al Goes Wrong
Motivations
- Velocity: how frequently are we being updated?
  - Batch -> Periodic -> near Real Time -> Real Time
  - Instead of daily measurements, we now get the same measurement in fractions of a second
- Volume: how many bytes?
  - The average dataset is growing quickly past GBs to TBs and PBs
- Variety: what data format and structure?
  - From a DB record to a document, picture, webpage, tweet, video

When Keeping It Re(lation)al Goes Wrong
We present to you “When Keeping It Relational Goes Wrong”

When Keeping It Re(lation)al Goes Wrong (cont’d)
- ACID is a LIE!
  - Replication lag = async operations from master to slave
  - Reads to the slave might yield old data
  - Consistency is lost
- Normalizing data hurts!
  - Data needs to be normalized at write time
  - Joins can be a very expensive operation
  - Diagram: User (User_ID, Username, …) joined to Videos (Video_ID, Uploaded_By_ID, …): BAD
More likely than not, this is production data. Replication is necessary, and it leads to ACID falling apart. It is satisfying to have a normalized database, but at the end of the day it's all about query speed. Any query that requires both user and video information will require at least 1 join, which means trouble for our database. So what can we do about all the stress our poor database is under?

Scale UP vs Scale OUT
Option 1: Scale UP
- Single huge powerful computer
- Can be expensive! You reach an economic limit (~250GB) when your data can’t all be in memory. This limit changes over time as memory and computing power become more economical.
- Can’t have a single point of failure, so you need a DR solution, which doubles the expense
Option 2: Scale OUT
- Multiple similar machines that are connected and distribute the workload and storage
- Capacity scales close to linearly with the number of machines, i.e. to do 3 times the calculations of a single machine, you need 2 more machines
- Wait, how is this all coordinated?

- Recent attempts to improve RDBMS performance have actually been small steps towards distributed computing
- Sharding + data replication is a nightmare
- Adding shards requires manually moving data
- Failover will almost always result in some amount of downtime

Introducing Cassandra

What is Cassandra?
- Fast, distributed database built for High Availability and Linear Scalability
- No Single Points of Failure (no leader election; nodes calculate where data is in the cluster)
- Multi-Data Center with replication between Data Centers
- Runs on commodity hardware
- Ease of operational management

Who uses Cassandra?
- Catalog and Playlists
- 1 trillion transactions per day, 10 million per second
- Message Delivery
- 100k nodes
- Playlists and Collections
- Personalization and Recommendation Engines
- Messaging

Who else uses Cassandra?

Storage Model
How Cassandra stores data for fast reading and writing

The Cassandra Data Model – Replication Factor
- Each Keyspace (analogous to a Schema or Database) has its own Replication Factor, defined at create time
- The Replication Factor defines how many copies of the data reside in the cluster
- Put a different way, the Replication Factor defines how many nodes the data is replicated to
- Higher Replication Factors result in higher reliability and can result in faster reads, but may slow writes and increase network traffic
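
To make the Replication Factor concrete, here is a minimal Python sketch of how copies of a partition might be assigned to nodes. It is an illustration, not Cassandra's implementation: the node names, the `token` and `replicas_for` helpers, and the use of MD5 as the hash are all assumptions made for the example. The idea it shows is placing nodes on a ring and walking clockwise from the partition's position until RF nodes have been picked.

```python
import hashlib
from bisect import bisect_right
from typing import List


def token(value: str) -> int:
    """Map a value to a position on the ring (MD5 is a stand-in hash for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


def replicas_for(partition_key: str, nodes: List[str], replication_factor: int) -> List[str]:
    """Return the nodes that would each hold one copy of the partition.

    Nodes are sorted by their token to form a ring; starting at the first node
    past the partition's token, take `replication_factor` consecutive nodes.
    """
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication factor must be between 1 and the number of nodes")
    ring = sorted((token(name), name) for name in nodes)
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(partition_key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


# Hypothetical five-node cluster and a keyspace with RF = 3.
cluster = ["node-a", "node-b", "node-c", "node-d", "node-e"]
copies = replicas_for("user-42", cluster, replication_factor=3)
```

With RF = 3, `copies` names three distinct nodes, so the partition stays readable if one of them is lost; the price is that every write has to reach three nodes instead of one.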

The Cassandra Data Model – Partition Key
- Responsible for data distribution across partitions in the cluster
- Can be composed of one or more columns
- Nodes in the cluster each store several partitions
- Ideal situation: one partition per query
- Diagram: videos_by_user (User_id, Uploaded_time, Title, Video_id, …), with columns marked K (partition key) and C ↑ / C ↓ (clustering columns, ascending / descending)
- Note that each Keyspace (database/schema) has its own Replication Factor (RF)

The Cassandra Data Model
- De-normalization is the norm
- Replicated data
- Primary key is used for uniqueness AND to satisfy queries efficiently
  - Partition Key: 1 or more fields
  - Clustering Columns: 0 or more fields
- Diagram (columns marked K, C ↑, C ↓):
  - videos_by_user (User_id, Uploaded_time, Title, Video_id, …)
  - videos_by_actor (Actor_id, Title, Video_id, …)
  - videos_by_tag (Tag, Uploaded_time, Video_id, …)
  - videos_by_genre (Genre, Uploaded_time, Video_id, …)
The structure of your data allows queries to be efficient by searching only 1 or a few partitions. This results in composite primary keys: you are effectively grouping and sorting your data by the way you set up the primary key of a table. Table names give hints as to how they are structured and what they are used for, because Cassandra table naming conventions are based on the primary key.

The Cassandra Data Model – Clustering Columns
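- Responsible for sorting data within the partition
- Additional clustering column(s) are used to ensure each record is unique
- Can do equality or range searches on clustering columns
- Diagram: videos_by_user (User_id, Uploaded_time, Title, Video_id, Username, …), with columns marked K (partition key), C ↑ / C ↓ (clustering columns, ascending / descending), and S (static)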
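
The sketch below illustrates, in plain Python rather than CQL, why sorted clustering columns make range searches cheap. The rows, dates, and the `range_query` helper are hypothetical; the only point is that when rows inside a partition are already ordered by the clustering column, a range search is two binary searches and a slice instead of a full scan.

```python
from bisect import bisect_left, bisect_right

# One partition of a hypothetical videos_by_user table (a single user_id).
# Rows are (uploaded_time, title) and are kept sorted by uploaded_time,
# which plays the role of the clustering column.
partition = sorted([
    ("2016-03-01", "Intro to CQL"),
    ("2016-01-15", "Cats"),
    ("2016-02-10", "Dogs"),
])


def range_query(rows, low, high):
    """Return the rows whose clustering value lies in [low, high]."""
    keys = [row[0] for row in rows]
    return rows[bisect_left(keys, low):bisect_right(keys, high)]


february_onward = range_query(partition, "2016-02-01", "2016-12-31")
only_january_15 = range_query(partition, "2016-01-15", "2016-01-15")
```

An equality search is the same operation with `low == high`, which is why both kinds of search are supported on clustering columns.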

Building a Cassandra Primary Key
- The first field listed in the PRIMARY KEY tuple is the partition key
- The rest of the fields are clustering columns
- Use a nested tuple to create a composite partition key
- The inner parentheses are only necessary if there is more than one column in the partition key
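
The rules above can be sketched as a small Python helper that splits the contents of a PRIMARY KEY tuple into its partition key and clustering columns. This is a toy parser written for illustration (the function name `split_primary_key` is made up, and it does not handle quoted identifiers or anything beyond simple column names); it is not how Cassandra parses CQL.

```python
def split_primary_key(definition: str):
    """Split the text inside PRIMARY KEY(...) into (partition_key, clustering_columns).

    "year, runtime, id"          -> partition key is the first field
    "(user_id, year), uploaded"  -> nested tuple is a composite partition key
    """
    body = definition.strip()
    if body.startswith("("):
        # Nested tuple: everything up to its closing parenthesis is the partition key.
        close = body.index(")")
        partition_key = [col.strip() for col in body[1:close].split(",")]
        rest = body[close + 1:]
    else:
        # No nested tuple: only the first field is the partition key.
        first, _, rest = body.partition(",")
        partition_key = [first.strip()]
    clustering_columns = [col.strip() for col in rest.split(",") if col.strip()]
    return partition_key, clustering_columns


simple = split_primary_key("year, runtime, id")
composite = split_primary_key("(user_id, year), uploaded_time")
```

Here `simple` has a one-column partition key followed by two clustering columns, while `composite` has a two-column partition key followed by one clustering column.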

What is the table name and primary key of this table?
Data Modeling
- Example table in which year is the partition key
- Partition keys act as pointers to the data we’re looking for
- The partition key is hashed to find exactly which partition the data is stored on
- Basically a giant hash table
TABLE videos_by_year ……. PRIMARY KEY(year, runtime, id)
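
The "giant hash table" idea can be sketched in a few lines of Python. Everything here is a simplifying assumption for illustration: the node names, the `node_for`, `insert`, and `select_by_year` helpers, and the use of MD5 with a modulo (a real cluster assigns token ranges to nodes rather than taking a modulo). The sketch mirrors the videos_by_year key, with year as the partition key and (runtime, id) as the clustering columns.

```python
import hashlib
from collections import defaultdict

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster


def node_for(partition_key) -> str:
    """Hash the partition key to decide which node owns the partition."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Each node holds a mapping of partition key -> rows in that partition.
storage = {node: defaultdict(list) for node in NODES}


def insert(year, runtime, video_id):
    """Write a row into the partition for `year` on the node that owns it."""
    rows = storage[node_for(year)][year]
    rows.append((runtime, video_id))
    rows.sort()  # keep rows ordered by the clustering columns (runtime, id)


def select_by_year(year):
    """Read one partition: a single hash lookup, no scan of other nodes."""
    return list(storage[node_for(year)].get(year, []))
```

A read such as `select_by_year(2015)` touches exactly one node and one partition, which is the access pattern the table name videos_by_year advertises.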

CQL syntax and special data types
CQL         SQL
Keyspace    Database/Schema
Table       Table
Text        Varchar
Timestamp   Datetimeoffset
UUID        Uniqueidentifier
TIMEUUID    -
List        -
Set         -
Map         -
Blob        Varbinary
Inet        -
Tuple       -
Diagram: videos_by_user (User_id, Uploaded_time, Title, Video_id, Username, Description), with columns marked K, C ↑, C ↓
Cassandra is a NoSQL database, so lists, sets, and maps are supported. CQL types correlate to Java types.

User Defined Types
Store complex types as a single field
- Great for storing related fields of information
- Can only be used within the keyspace in which it was defined
- Fields within a UDT can be accessed with dot notation (home_address.zipcode)

Cassandra vs SQL Server

Comparison of SQL Server and Cassandra
                    SQL Server                              Cassandra
Database Model      Relational Database                     Wide Column Store
Licensing           Closed-source, commercial licensing     Open-source licensing
Supported OS’s      Microsoft Windows, Linux                Linux, OSX, Microsoft Windows
Programming         T-SQL, .NET Languages                   CQL (Cassandra Query Language)
Transactions        Full Transaction Support with ACID      No Transactions (all operations are independent)
Data Partitioning   Horizontal Partitioning via Filegroups  Sharding
                    Sharding via Federation

Reasons to Use Cassandra
1 Need ease of scaling and/or linear performance gains from scaling
2 Need to support a very high volume (100,000s of transactions per second) with low latency
3 Need to support a large (10s to 100s of Terabytes) and/or constantly growing volume of data with no loss in performance
4 Must have 0 downtime: the application must be available at all times and survive Data Center failures
5 Must have high-performance multi-geographic replication
6 Data model lends itself to the changes needed for Cassandra performance

Reasons to (Still) Use SQL Server
1 Maintaining ACIDity is a business requirement
2 Only need to support a small to medium volume of transactions and/or data
3 Performance is not bound by hardware and is meeting business needs
4 Some downtime is acceptable
5 Must run in a Windows-only environment and/or must have GUI tool support
6 Application Developers can’t (or won’t!) make changes to take advantage of Cassandra
Cassandra does not fully support ACID:
- Consistent: no. It is not a relational database and does not support foreign keys or join operations
- Atomic: not necessarily
- Isolated: most modifications are, except batch operations that modify more than one partition
- Durable: all operations are

How do I learn more?!?
https://docs.datastax.com
- Online courses: prepare you for certifications and give you a base of understanding
- Quick tutorials: for doing specific things (deploy to Azure, etc.); all very short videos
- Get Cassandra & DataStax certified

West Monroe Partners is Hiring!

Questions?
- More detail on Solr, Spark
- What about this Graph thing?
