Cassandra – A Decentralized Structured Storage System Lecturer : Prof. Kyungbaek Kim Presenter : I Gde Dharma Nugraha.

Presentation transcript:

Cassandra – A Decentralized Structured Storage System Lecturer : Prof. Kyungbaek Kim Presenter : I Gde Dharma Nugraha

Outline: Introduction, History, Data Model, System Architecture, Cassandra Configuration, CQL (Cassandra Query Language), Cassandra Driver, Practical Example.

Introduction: Apache Cassandra™ is a massively scalable open source NoSQL database. It is well suited to managing large amounts of data across multiple data centers and the cloud. Cassandra delivers continuous availability, linear scalability, and operational simplicity across many commodity servers with no single point of failure (SPOF), along with a powerful data model designed for flexibility and fast response times.

Introduction (Cont’d) Cassandra has a “masterless” architecture. Cassandra provides customizable replication, storing redundant copies of data across nodes that participate in a Cassandra ring.

History: Cassandra was created to power Facebook's Inbox Search feature. Facebook open-sourced Cassandra in 2008, and it subsequently became an Apache Incubator project. In 2010, Cassandra graduated to a top-level Apache project; regular updates and releases followed.

General Design Features: Emphasis on performance over analysis, although analysis tools such as Hadoop are still supported. Organization: Rows are organized into tables. The first component of a table's primary key is the partition key. Rows are clustered by the remaining columns of the key. Columns may be indexed separately from the primary key. Tables may be created, dropped, and altered at runtime without blocking queries. Language: CQL (Cassandra Query Language) was introduced; it is similar to SQL, which flattens the learning curve.

Data Model: A table is a multi-dimensional map indexed by a key (the row key). Columns are grouped into column families, of which there are two types: simple and super (nested column families). Each column has a name, a value, and a timestamp. A row is a collection of columns, each labeled with a name.

Data Model (figure): a keyspace, with its settings, contains column families, each with its own settings; each column holds a name, a value, and a timestamp.

Data Model: Cassandra Row. The value of a row is itself a sequence of key-value pairs. Such nested key-value pairs are columns, where the key is the column name. A row must contain at least one column.
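The nested-map view of a row can be made concrete with a small sketch. This is only an illustration using plain Python dictionaries, not how Cassandra stores data internally; the names `Column`, `put`, and `users`, and the sample values, are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """One column: a name, a value, and a timestamp (hypothetical helper)."""
    name: str
    value: str
    timestamp: int  # supplied by the writer


def put(column_family, row_key, name, value, timestamp):
    """Store a column under row_key; the column name is the nested key."""
    row = column_family.setdefault(row_key, {})
    row[name] = Column(name, value, timestamp)


# A column family modeled as: row key -> {column name -> Column}
users = {}
put(users, "row1", "bio", "RoomMate", 1)
put(users, "row1", "active", "true", 2)
```

Looking up `users["row1"]["bio"]` returns the whole column, so the value and its timestamp travel together.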

Data Model Example of Column

Data Model: Keyspace. A keyspace groups column families together. It is only a logical grouping of column families and provides an isolated scope for names.

System Architecture: The ring represents a cyclic range of token values (i.e., the token space). Each node is assigned a position on the ring based on its token. Nodes communicate with one another using a gossip protocol. Data is first written to the commit log for durability and then to the memtable; once the memtable is full, its contents are written to an SSTable on disk.
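The commit log, memtable, and SSTable sequence can be sketched as a toy in-memory model. This is a simplified illustration under stated assumptions: the class name `Node`, the `memtable_limit` parameter, and the use of Python lists in place of files are all hypothetical, and real flushing is driven by size thresholds rather than an entry count.

```python
class Node:
    """Toy model of one node's local write path (not Cassandra's real code)."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []   # append-only log, written first for durability
        self.memtable = {}     # in-memory, mutable structure
        self.sstables = []     # immutable, sorted snapshots standing in for disk files
        self.memtable_limit = memtable_limit  # assumed flush threshold (entry count)

    def write(self, key, value):
        self.commit_log.append((key, value))  # step 1: durability
        self.memtable[key] = value            # step 2: in-memory update
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable as a sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed entries are no longer needed for recovery.
        self.commit_log.clear()

    def read(self, key):
        # Check memory first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

Reading newest-first means a later write to the same key shadows older flushed values without the older SSTable ever being modified.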

System Architecture: Important keywords
Node: Where the data is stored. It is the basic infrastructure component of Cassandra.
Data center: A collection of related nodes. A data center can be a physical data center or a virtual data center.
Cluster: A cluster contains one or more data centers. It can span physical locations.
Commit log: All data is written first to the commit log for durability. After all its data has been flushed to SSTables, it can be archived, deleted, or recycled.
Table: A collection of ordered columns fetched by row. A row consists of columns and has a primary key. The first part of the key is a column name.
SSTable: A sorted string table (SSTable) is an immutable data file to which Cassandra writes memtables periodically. SSTables are append-only, stored on disk sequentially, and maintained for each Cassandra table.

System Architecture involves: Partitioning (how data is partitioned across nodes), Replication (how data is duplicated across nodes), and Cluster Membership (how nodes are added to or removed from the cluster).

System Architecture: Partitioning. Nodes are logically structured in a ring topology. The hashed value of the key associated with a data partition is used to assign it to a node in the ring. The hash space wraps around after its maximum value, which is what gives the ring structure. Cassandra offers three partitioners: Murmur3Partitioner, RandomPartitioner, and ByteOrderedPartitioner.
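A minimal sketch of ring-based partitioning follows. It assumes an MD5-based token function and, purely for illustration, derives each node's token by hashing the node's name; in a real cluster tokens are assigned through configuration. The names `token`, `Ring`, and `owner` are hypothetical.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position in the token space (MD5 is an assumption here)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class Ring:
    def __init__(self, node_names):
        # Each node sits at one position on the ring, sorted by token.
        self._entries = sorted((token(name), name) for name in node_names)
        self._tokens = [t for t, _ in self._entries]

    def owner(self, key: str) -> str:
        """Return the first node at or after the key's token, wrapping around."""
        i = bisect.bisect_left(self._tokens, token(key))
        return self._entries[i % len(self._entries)][1]


ring = Ring(["A", "B", "C", "D"])
first_copy_node = ring.owner("user:1")
```

The modulo on the index is the wrap-around: a key whose token is larger than every node token is assigned to the node with the smallest token.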

System Architecture: Replication. Each data item is replicated at N nodes, where N is the replication factor. There are different replication policies. Rack Unaware: replicate data at the N-1 successive nodes after its coordinator. Rack Aware: uses ZooKeeper to choose a leader, which tells nodes the ranges they are replicas for. Datacenter Aware: similar to Rack Aware, but the leader is chosen at the data center level instead of the rack level.
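The Rack Unaware policy can be sketched as a walk around the ring: the coordinator holds the first copy and the next N-1 nodes hold the rest. The ring layout, the integer tokens, and the function name `replicas` below are hypothetical values chosen for the example.

```python
import bisect


def replicas(ring, key_token, n):
    """Return the coordinator plus its n-1 successors on the ring.

    ring is a list of (token, node) pairs sorted by token.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)  # coordinator index
    count = min(n, len(ring))  # cannot place more replicas than there are nodes
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


# Hypothetical four-node ring with evenly spaced tokens.
RING = [(0, "A"), (25, "B"), (50, "C"), (75, "D")]

print(replicas(RING, 30, 3))  # ['C', 'D', 'A']
print(replicas(RING, 80, 2))  # ['A', 'B']
```

The second call shows the wrap-around case: a token past the last node is coordinated by the first node on the ring.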

System Architecture: Gossip Protocol. A network communication protocol inspired by real-life rumour spreading. Communication is periodic, pairwise, and inter-node. Low-frequency communication keeps the cost low. Peers are selected at random. Example: node A wishes to search for a pattern in the data. Round 1: node A searches locally and then gossips with node B. Round 2: nodes A and B gossip with C and D. Round 3: nodes A, B, C, and D gossip with four other nodes, and so on. Round-by-round doubling makes the protocol very robust.
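The round-by-round spread can be simulated in a few lines. This is a simplified push-only model with random peer selection, which is an assumption of the sketch; with random peers the informed set at most doubles per round rather than doubling exactly. The function name `gossip_rounds` and its parameters are hypothetical.

```python
import random


def gossip_rounds(num_nodes, seed=0):
    """Simulate push gossip; return the informed-node count after each round."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    informed = {0}             # node 0 starts with the information
    history = [len(informed)]
    while len(informed) < num_nodes:
        newly_informed = set()
        for node in informed:
            peer = rng.randrange(num_nodes)  # random peer selection
            if peer != node:
                newly_informed.add(peer)
        informed |= newly_informed
        history.append(len(informed))
    return history


history = gossip_rounds(16)
```

Each entry in `history` is at most twice the previous one, since every informed node contacts a single peer per round.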

System Architecture: Cluster Membership. Uses Scuttlebutt (a gossip protocol) to manage nodes. Gossip is used for node membership and to transmit system control state. A node's failure state is given by a variable phi (Φ), which expresses how likely it is that the node has failed (a suspicion level) instead of a simple binary up/down value. This type of system is known as an Accrual Failure Detector.

System Architecture: Accrual Failure Detector. If a node is faulty, the suspicion level Φ(t) increases monotonically with time: Φ(t) → k as t → k, where k is a threshold (dependent on system load) at which the node is considered dead. If the node is correct, Φ is held at a constant set by the application; generally Φ(t) = 0.
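One way to compute a suspicion level is sketched below. It assumes heartbeat inter-arrival times follow an exponential distribution, so that Φ is the negative base-10 logarithm of the probability that a heartbeat is merely late; that modeling choice, the class name, and the method names are assumptions of this sketch rather than something stated in the slides.

```python
import math


class AccrualFailureDetector:
    """Toy phi calculator for one monitored node (exponential model assumed)."""

    def __init__(self):
        self.last_heartbeat = None
        self.intervals = []  # observed gaps between consecutive heartbeats

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0  # not enough history to suspect anything
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(still alive) = exp(-elapsed / mean); phi = -log10 of that probability.
        return elapsed / (mean * math.log(10))


detector = AccrualFailureDetector()
for t in [0.0, 1.0, 2.0, 3.0]:
    detector.heartbeat(t)
```

A caller would compare `detector.phi(now)` against its chosen threshold; the value is 0 right after a heartbeat and grows for as long as the node stays silent.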

System Architecture: Local Persistence. Relies on the local file system for data persistence. A write operation happens in two steps: first a write to the commit log on the node's local disk, then an update of the in-memory data structure. A read operation looks up the in-memory data structure first before looking up files on disk. A Bloom filter (a summary of the keys in a file, kept in memory) is used to avoid looking up files that do not contain the key.
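The role of the Bloom filter on the read path can be illustrated with a minimal implementation. The sizes, the use of SHA-256 to derive bit positions, and the names `BloomFilter`, `add`, and `might_contain` are assumptions of this sketch, not the parameters Cassandra uses.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # False means the key is definitely absent, so the file can be skipped.
        return all(self.bits[p] for p in self._positions(key))


sstable_filter = BloomFilter()
for stored_key in ["k1", "k2", "k3"]:
    sstable_filter.add(stored_key)
```

On a read, a file is opened only when `might_contain` returns True; a False answer is always correct, which is what makes skipping the file safe.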

System Architecture Write Path

System Architecture Read Path

System Architecture Example write and read process. Data Model

System Architecture Write Process

System Architecture Replication Process

System Architecture Read Process

Cassandra Configuration: Key components for configuring Cassandra
Gossip: A peer-to-peer communication protocol used to discover and share location and state information about the other nodes in a cluster. Gossip information is also persisted locally by each node so that it can be used immediately when the node restarts.
Partitioner: Determines how data is distributed across the nodes in the cluster and on which node the first copy of the data is placed.
Replication factor: The total number of replicas across the cluster.

Cassandra Configuration: Key components for configuring Cassandra (cont'd)
Replica placement strategy: Cassandra stores copies (replicas) of data on multiple nodes to ensure reliability and fault tolerance; the placement strategy determines which nodes hold those replicas.
Snitch: Groups machines into data centers and racks (the topology), which the replication strategy uses to place replicas.
The cassandra.yaml configuration file: The main configuration file, used to set the initialization properties for a cluster, caching parameters for tables, properties for tuning and resource utilization, timeout settings, client connections, backups, and security.

CQL = Cassandra Query Language: The default and primary interface to the Cassandra DBMS. It provides SQL-like commands. CQL and SQL share the same abstract idea of a table made up of columns and rows. The main difference from SQL is that CQL does not support joins or subqueries. To start, run cqlsh in a terminal window; the command is in the bin directory.

CQL = Cassandra Query Language: Creating and updating a keyspace. A Cassandra keyspace is a namespace that defines how data is replicated on nodes.
To create a keyspace:
cqlsh> CREATE KEYSPACE demodb WITH REPLICATION = {'class' : 'SimpleStrategy', 'replication_factor' : 1};
To update a keyspace:
cqlsh> ALTER KEYSPACE demodb WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'replication_factor' : 2};
(With NetworkTopologyStrategy, the replication factor is normally given per data center name; whether a bare 'replication_factor' option is accepted depends on the Cassandra version.)
To use a keyspace:
cqlsh> USE demodb;

CQL – Cassandra Query Language Creating Tables: CREATE TABLE users( varchar, bio varchar, birthday timestamp, active boolean, PRIMARY KEY ( ));

CQL – Cassandra Query Language Inserting Data: **timestamp fields are specified in milliseconds since epoch. INSERT INTO users ( , bio, birthday, active) VALUES ‘RoomMate’, ‘ , true);

CQL – Cassandra Query Language Querying Tables: A SELECT statement reads one or more records from a Cassandra table (column family) and returns a result set of rows. SELECT * FROM users; SELECT FROM users WHERE active = true;

Cassandra Driver: To connect from a programming language, Cassandra provides driver packages. The languages supported by Cassandra drivers include C#, Java, Node.js, and Python. URL for Cassandra driver download:

Reference
Lakshman, Avinash, and Prashant Malik. "Cassandra: a decentralized structured storage system." ACM SIGOPS Operating Systems Review 44.2 (2010).
Hewitt, Eben. Cassandra: the definitive guide. O'Reilly Media.
/cassandra/gettingStartedCassandraIntro.html
ql_intro_c.html
ava-driver/2.1/java-driver/whatsNew2.html
apache-cassandra-and-java/

Installation guide and practical example.