Experience Cassandra Wenjing wu 2011-5-17. outline About Cassandra Data Model Deployment Client Programming An example: implementing a name space Stress.


Experience Cassandra Wenjing wu

Outline
About Cassandra
Data model
Deployment
Client programming
An example: implementing a name space
Stress tests

What is Cassandra (1)
Decentralized, fault-tolerant, scalable, durable distributed hash storage
Originally developed by Facebook, now maintained by Apache
Some big users: Cloudkick, Digg, Facebook, Twitter, Rackspace, Cisco, etc.
A combination of Bigtable and Dynamo
Like a big hash table (both 2- and 3-dimensional)

What is Cassandra (2)
Eventual consistency
CAP theorem: AP, but with configurable tradeoffs between availability (A) and consistency (C)
Easy to deploy
Rich client APIs for your own application, easy to install and use
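The tradeoff between A and C is commonly reasoned about in terms of replica counts: with N replicas, a read that contacts R replicas is guaranteed to see a write acknowledged by W replicas when R + W > N. The following is a minimal sketch of that arithmetic in plain Python; the function names are illustrative assumptions and are not part of any Cassandra API.

```python
def quorum(n):
    """Smallest majority of n replicas."""
    return n // 2 + 1


def overlaps(n, r, w):
    """True when any r replicas read must include one of the w replicas written."""
    return r + w > n


n = 3  # replicas per row
print(quorum(n))                          # 2
print(overlaps(n, 1, 1))                  # False: fast, but reads may be stale
print(overlaps(n, quorum(n), quorum(n)))  # True: quorum reads and writes
print(overlaps(n, 1, n))                  # True: cheap reads, expensive writes
```

Lower R and W favour availability and latency; raising them until R + W > N favours consistency.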

Data model (1)
Not SQL
Queries are supported on a single index only:
– select username from user where city='beijing' (Yes)
– select username from user where city='beijing' and age='28' (No!)
No joins, no complicated queries
Useful when the use case fits

Data model (2)
Keyspace: one per application, equivalent to a database
Column: an attribute of the structured data; has a name, a value, and a timestamp; equivalent to a column of a table
– (column=username, value=tom, timestamp= )
Column family: a series of columns like the one above. For example, a column family User:
– (column=username, value=tom, timestamp= )
– (column= , timestamp= )
– (column=city, value=beijing, timestamp= )

Data model (3)
A row: identified by a key; instantiates one or more of the columns in the column family:
– RowKey: userkey1
– (column=username, value=tom, timestamp= )
– (column= , timestamp= )
The application creates the key for each row (it must be unique; a UUID is usually used to avoid collisions)
Each row can have a different number of columns within the column family
Analogous to a 2-dimensional hash table:
– User{row_key1}{username}=tom

Data model (4)
Super column family
– Each column of a super column family (a super column) itself holds a set of columns
3-dimensional hash table
– Person{row_key1}{user}{user_name}=tom
– Person{row_key2}{manager}{user_name}=Alice
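The hash-table analogy can be made concrete with nested dictionaries. This is only a plain-Python model of the data layout, not Cassandra code; the `column` helper and the `userkey_` prefix are assumptions made for illustration.

```python
import time
import uuid


def column(value):
    """Pair a value with a write timestamp in microseconds."""
    return (value, int(time.time() * 10**6))


# Column family (2 dimensions): row key -> column name -> (value, timestamp)
User = {}
row_key = "userkey_" + str(uuid.uuid1())  # application-generated unique key
User[row_key] = {
    "username": column("tom"),
    "city": column("beijing"),
}
# Rows in the same column family need not have the same columns.
User["userkey_" + str(uuid.uuid1())] = {"username": column("alice")}

# Super column family (3 dimensions):
# row key -> super column name -> column name -> (value, timestamp)
Person = {
    "row_key1": {"user": {"user_name": column("tom")}},
    "row_key2": {"manager": {"user_name": column("Alice")}},
}

print(User[row_key]["username"][0])                   # tom
print(Person["row_key2"]["manager"]["user_name"][0])  # Alice
```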

Deployment (1)
Pretty easy!
– wget ra/0.7.5/apache-cassandra bin.tar.gz
– tar zxvf apache-cassandra bin.tar.gz
– cd apache-cassandra
– sudo mkdir -p /var/log/cassandra
– sudo chown -R `whoami` /var/log/cassandra
– sudo mkdir -p /var/lib/cassandra
– sudo chown -R `whoami` /var/lib/cassandra

Deployment (2)
Start the service:
– bin/cassandra -f
Connect with the client:
– bin/cassandra-cli --host localhost --port 9160
How to start:
– create keyspace Keyspace1;
– use Keyspace1;
– create column family Users with comparator=UTF8Type and default_validation_class=UTF8Type;
– set Users[jsmith][first] = 'John';
– set Users[jsmith][last] = 'Smith';
What do you see?
– get Users[jsmith];
– => (column=last, value=Smith, timestamp= )
– => (column=first, value=John, timestamp= )

Run over a cluster
Configuration file: conf/cassandra.yaml
– listen_address: fst01.ihep.ac.cn (for gossip)
– rpc_address: fst01.ihep.ac.cn (for clients)
– seeds:
    - fst02.ihep.ac.cn
    - fst03.ihep.ac.cn
    - fst04.ihep.ac.cn
Test the cluster:
– bin/nodetool --host fst01.ihep.ac.cn ring

Client programming
Rich client options (C/Java/PHP/Perl/Python)
Python client driver: pycassa
Easy to install
– Install with easy_install (easy_install must already be installed):
  $ easy_install pycassa
– Manual install:
  $ easy_install thrift05
  $ git clone git://github.com/pycassa/pycassa.git
  $ cd pycassa/
  $ sudo python setup.py install

API examples
>>> import pycassa
>>> pool = pycassa.connect('Keyspace1', ['localhost:9160'])
>>> col_fam = pycassa.ColumnFamily(pool, 'User')
>>> col_fam.insert('user_key1', {'username': 'tom'})
>>> col_fam.get('user_key1')
>>> col_fam.remove('user_key1')

An example: implementing a name space
Use pycassa to implement a name space
Similar to the ext3 file system, where inodes represent metadata
Two column families (Directory, FFile) are used to describe the metadata
CF Directory columns include:
– Metadata: create/modify/access time, owner, group
– Contents of the directory: subdirectory names, file names

Directory (1)
Row key: dir_key1
Column      Value
Owner       filestore
Group       filestore
testdir1    dir_keyxxxxx1
testdir2    dir_keyxxxxx2
testfile1   file_keyyyyyy1

Directory (2)
A row:
– RowKey: dirkey_372c5d e0-bc71-001a64631cb0
– => (column=dir3, value=3e180f00-459b-11e a64631cb0, timestamp= )
– => (column=f2, value=c69f2ac2-45a6-11e0-9c79-001a64631cb0, timestamp= )
– => (column=f3, value=ddd77c2e-45a5-11e0-934f-001a64631cb0, timestamp= )
– => (column=group, value=root, timestamp= )
– => (column=owner, value=root, timestamp= )
– => (column=p3, value=edf0ed73-45a6-11e0-bf90-001a64631cb0, timestamp= )

FFile (1)
CF FFile stores the metadata and contents of a specific file
FFile columns include:
– Metadata: create/modify/access time, owner, group, size, checksum
– Contents of the file

FFile (2)
Row key: file_key_yyyy1
Column      Value
owner       filestore
group       filestore
size        1023
content     Bla bla….

FFile (3)
A row:
– RowKey: filekey_edf0ed73-45a6-11e0-bf90-001a64631cb0
– => (column=content, value=
     localhost.localdomain localhost
     lcg002.ihep.ac.cn lcg002
     lwn011.ihep.ac.cn lwn011
     ...., timestamp= )
– => (column=group, value=root, timestamp= )
– => (column=owner, value=root, timestamp= )
– => (column=size, value=11281, timestamp= )

Name space operations
– fs_ls (list a dir/file)
– fs_mkdir (make a dir)
– fs_rename (rename a file/dir)
– fs_mv (move a file/dir to another file/dir)
– fs_rm (remove a file/dir)
– fs_cpw (write a file to the storage)
– fs_cpr (read a file from the storage)

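How does it work
Example path: /testdir1/testfile11

Directory row, key dir_key1 (the root):
owner        filestore
group        filestore
testdir1     dir_keyxxx1
testdir2     dir_keyxxx2

Directory row, key dir_keyxxx1:
owner        filestore
testdir12    dir_keyxxx4
testfile11   file_keyyyy1
testfile12   file_keyyyy2

FFile row, key file_keyyyy1:
owner        filestore
group        filestore
size         1023
content      This is a test file….

The lookup follows the keys row by row: dir_key1 (testdir1) → dir_keyxxx1 (testfile11) → file_keyyyy1.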
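The lookup of /testdir1/testfile11 can be sketched as a walk over rows, one row read per path component. The dictionaries below are in-memory stand-ins for the Directory and FFile column families, filled with the sample data from this slide; `resolve`, `ROOT_KEY`, and `METADATA` are hypothetical names, and a real client would fetch each row from Cassandra instead.

```python
ROOT_KEY = "dir_key1"
METADATA = {"owner", "group"}  # column names that are not directory entries

# Stand-in for CF Directory: row key -> {column name: value}
Directory = {
    "dir_key1": {"owner": "filestore", "group": "filestore",
                 "testdir1": "dir_keyxxx1", "testdir2": "dir_keyxxx2"},
    "dir_keyxxx1": {"owner": "filestore", "testdir12": "dir_keyxxx4",
                    "testfile11": "file_keyyyy1", "testfile12": "file_keyyyy2"},
}
# Stand-in for CF FFile
FFile = {
    "file_keyyyy1": {"owner": "filestore", "group": "filestore",
                     "size": "1023", "content": "This is a test file"},
}


def resolve(path):
    """Return the row key that an absolute path points to."""
    key = ROOT_KEY
    for name in [p for p in path.split("/") if p]:
        row = Directory.get(key)
        if row is None:
            # The previous component was a file (or a dangling key).
            raise NotADirectoryError(path)
        if name in METADATA or name not in row:
            raise FileNotFoundError(path)
        key = row[name]  # follow the stored key to the next row
    return key


key = resolve("/testdir1/testfile11")
print(key)                    # file_keyyyy1
print(FFile[key]["content"])  # This is a test file
```

One caveat of this layout, visible in the sketch: metadata and directory entries share the same column name space, so an entry named "owner" or "group" would collide with the metadata columns unless names are prefixed or otherwise separated.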

How to implement?
mkdir: fs_mkdir /testdir1/testdir2/testdir3 (/testdir1/testdir2 already exists)
– 1. Generate a key for the new entry: new_key=dirkey_`uuid`
– 2. Walk from the root directory (/, whose key is dirkey_1) to get the key of the parent directory (testdir2); assume that key is dirkey_XXX
– 3. Insert a column into the parent directory's row (testdir2, key dirkey_XXX): the column name is the name of the directory being created (testdir3) and its value is new_key
– 4. Create a new row for the new directory, with all the metadata columns (owner, group)
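The four steps above can be sketched as follows. This is a minimal illustration against an in-memory dictionary standing in for the Directory column family, not the author's implementation; `lookup_dir`, the sample keys `dirkey_a` and `dirkey_b`, and the default owner and group are assumptions. With pycassa, each dictionary read or write would become a `get` or `insert` on the column family.

```python
import uuid

ROOT_KEY = "dirkey_1"
METADATA = {"owner", "group"}

# Stand-in for CF Directory; /testdir1/testdir2 already exists.
Directory = {
    ROOT_KEY: {"owner": "root", "group": "root", "testdir1": "dirkey_a"},
    "dirkey_a": {"owner": "root", "group": "root", "testdir2": "dirkey_b"},
    "dirkey_b": {"owner": "root", "group": "root"},
}


def lookup_dir(components):
    """Walk from the root and return the key of the directory reached."""
    key = ROOT_KEY
    for name in components:
        row = Directory[key]
        if name in METADATA or name not in row:
            raise FileNotFoundError("/" + "/".join(components))
        key = row[name]
        if key not in Directory:  # the entry is a file, not a directory
            raise NotADirectoryError("/" + "/".join(components))
    return key


def fs_mkdir(path, owner="root", group="root"):
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise FileExistsError(path)  # "/" always exists
    *parents, name = parts
    new_key = "dirkey_" + str(uuid.uuid1())  # step 1: key for the new entry
    parent_key = lookup_dir(parents)         # step 2: find the parent's key
    if name in Directory[parent_key]:
        raise FileExistsError(path)
    Directory[parent_key][name] = new_key    # step 3: link it in the parent
    Directory[new_key] = {"owner": owner, "group": group}  # step 4: new row
    return new_key


key = fs_mkdir("/testdir1/testdir2/testdir3")
print(Directory["dirkey_b"]["testdir3"] == key)  # True
```

Steps 3 and 4 are two separate row writes, so a failure between them would leave the parent pointing at a row that does not exist yet; writing the new row first is one way to narrow that window.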

Stress test
Testbed: a small cluster
– 4-node cluster
– Replication factor of 3
– One client
Test methodology:
– Operation sequence: mkdir / touch a file / list dir and file
– Directory depth of 4: /dir1/dir2/dir3/dir4
Test result:
– Finished operations (mkdir, create file, list dir, list file) in seconds, second for each operation sequence
– Another test (more than 10 million operations) failed due to a memory crash