ZhangGang, Fabio, Deng Ziyan 2013.01.24. 2/31 NoSQL Introduction to Cassandra Data Model Design Implementation.

2/31 Outline: NoSQL; Introduction to Cassandra; Data Model Design; Implementation

4/31 Learning NoSQL. What is NoSQL, and how does it differ from an RDBMS? We used Redis to get familiar with some interesting features of NoSQL, then compared the options to choose one for DIRAC. What do we need? High scalability, fast writes and reads, big data… Four candidates: Riak, CouchDB, Hadoop HBase and Cassandra. We chose Hadoop HBase to explore first.

6/31 Hadoop HBase. Modeled after Google's BigTable. HBase must be installed on top of HDFS. Deployment and maintenance are much more complicated than for Cassandra, so we then turned to Cassandra.

8/31 Some features of Cassandra. Schema flexibility. BigTable-like features: columns and column families. Key/value pairs: row/column pairs. Secondary indexes. Writes are much faster than reads. All nodes are similar, so there is no single point of failure. Tunable trade-offs for distribution and replication. All of our research is based on standalone mode; production needs a cluster.

9/31 RDBMS: use the join operation, increase normalization and reduce redundancy. NoSQL (Cassandra): to get better performance and high scalability, get rid of the join operation, which means denormalizing the data and maintaining multiple copies of it (increasing redundancy).

10/31 Data model. Cassandra stores data in a multidimensional hash table: [keyspace][columnfamily][row][column]. Some concepts in Cassandra: keyspace, column family, row, column. A keyspace corresponds to a database in an RDBMS, and a column family to a table.
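The four-level addressing can be pictured with nested Python dictionaries. This is only a toy model of the idea, not how Cassandra stores data internally; the keyspace name "accounting" and the row key "user1" are made up for illustration.

```python
# Toy model of [keyspace][columnfamily][row][column] addressing.
# "accounting" and "user1" are illustrative names, not taken from the slides.
from collections import defaultdict


def nested():
    """Return a dict that creates missing levels on first access."""
    return defaultdict(nested)


store = nested()
store["accounting"]["cum_groupby_user"]["user1"]["2008-01-01"] = 120.0
store["accounting"]["cum_groupby_user"]["user1"]["2008-01-02"] = 75.5

# Reading one row yields its columns; sorting by column name here
# imitates the way columns are kept ordered by name within a row.
row = store["accounting"]["cum_groupby_user"]["user1"]
columns_sorted = sorted(row.items())
```

Each lookup narrows the scope by one level, which is why a single row can be fetched without touching any other row.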

11/31 Another structure: the super column family. It can be thought of as a map of maps.

12/31 Model the query first. With Cassandra we model the queries and let the data be organized around them: think of the most common query paths the application will use, then create the column families needed to support them. Aggregate key pattern: this pattern fuses two scalar values together with a separator to create an aggregate, for example the column name CPUTime:2008-01-01 with its value.
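A minimal sketch of the aggregate key pattern, assuming ":" as the separator (as in the CPUTime:2008-01-01 example). The helper names are hypothetical.

```python
# Aggregate key pattern: fuse two scalar values with a separator.
# The function names are illustrative; only the "CPUTime:2008-01-01"
# format comes from the slide.
SEPARATOR = ":"


def make_aggregate_key(quantity, day):
    """Build a column name such as 'CPUTime:2008-01-01'."""
    return f"{quantity}{SEPARATOR}{day}"


def split_aggregate_key(key):
    """Recover (quantity, day); split once so the day part stays whole."""
    quantity, day = key.split(SEPARATOR, 1)
    return quantity, day


key = make_aggregate_key("CPUTime", "2008-01-01")
parts = split_aggregate_key(key)
```

The separator must not occur in the first value, otherwise the key cannot be split back unambiguously.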

13/31 A special column: the counter column. A counter is a special kind of column used to store a number that incrementally counts the occurrences of a particular event or process. Set the value type to “CounterColumnType” when creating the column family. A write increases or decreases the value rather than replacing it. Example: for site “BES” on “2012-10-10” there are 10 jobs (so the day has 10 CPUTime values): CPUTime:2012-10-01 val=sum(10 CPUTime)
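The increment-instead-of-replace behaviour can be imitated in plain Python. The class below is an in-memory stand-in, not a Cassandra client, and the per-job CPU time of 3600 is an invented value used only to make the sum visible.

```python
# In-memory stand-in for one row of a counter column family.
# CounterRow and the 3600 s per job are assumptions for illustration.
from collections import defaultdict


class CounterRow:
    def __init__(self):
        self._columns = defaultdict(int)

    def add(self, column, value=1):
        """Increase (or, with a negative value, decrease) a counter."""
        self._columns[column] += value

    def get(self, column):
        """Read a counter; a counter never written reads as 0."""
        return self._columns.get(column, 0)


bes = CounterRow()
for cpu_time in [3600] * 10:          # ten jobs on the same day
    bes.add("CPUTime:2012-10-01", cpu_time)

total = bes.get("CPUTime:2012-10-01")
```

Ten writes leave a single column holding their sum, which is what the slide's val=sum(10 CPUTime) expresses.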

15/31 Model the query first. Four factors determine a plot: start time, end time, the plot to generate, and the groupby. The data a plot needs is grouped by something, so we preprocess the data.

16/31 Create a CF for CPUTime grouped by user. CF: standard column family. Row key: user. Column name: startTime. Column value: CPUTime. Problem: bad performance, and the column name is not unique.
  user 1  columns: startTime 1, …, startTime n
          values:  value, …
  …
  user n  columns: startTime 2, …, NULL
          values:  value, …, NULL

17/31 The first improvement. CF: counter column family. Row key: user. Column name: startTime/86400 (aggregate by day). Column value: the sum of CPUTime within a day. Problem: each plot needs its own CF.
  user 1  columns: 2008-01-01, …, 2013-01-24
          values:  value, …
  …
  user n  columns: 2009-05-17, 2009-05-20, NULL
          values:  value, NULL
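The day aggregation startTime/86400 can be sketched as follows, assuming startTime is a Unix timestamp in seconds and that days are cut at UTC midnight. Neither assumption is stated on the slide, and the function names are hypothetical.

```python
# Bucket a start time by day, assuming Unix seconds and UTC day boundaries.
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def day_bucket(start_time):
    """Integer day number: every start time within one day maps to it."""
    return int(start_time) // SECONDS_PER_DAY


def bucket_to_date(bucket):
    """Render a day number as YYYY-MM-DD for use as a column name."""
    moment = datetime.fromtimestamp(bucket * SECONDS_PER_DAY, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


midnight = 1199145600                 # 2008-01-01 00:00:00 UTC
same_day = day_bucket(midnight) == day_bucket(midnight + 86399)
label = bucket_to_date(day_bucket(midnight))
```

Because all jobs of one day share a column name, a counter column can accumulate their CPUTime, and the column name becomes unique per row.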

18/31 In Cassandra there are two methods to solve the problem: use a super column family, or use an aggregate key.

19/31 The second improvement. CF: counter column family. Row key: user. Column name: aggregate key, e.g. (CPUTime,2012-01-01), (DiskSpace,2012-01-01). Column value: the sum of the values within a day.
  user 1  columns: CPUTime:2008-01-01, …, DiskSpace:2013-01-24
          values:  value, …
  …
  user n  columns: CPUTime:2009-05-17, JobCount:2009-05-20, NULL
          values:  value, NULL
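The two options (super column family versus aggregate key) hold the same information in different shapes. The sketch below contrasts them with plain dictionaries; the values and the flatten helper are invented for illustration.

```python
# One user's row in both layouts. The numbers are made up.

# Super column family: a map of maps (quantity -> day -> value).
super_row = {
    "CPUTime": {"2012-01-01": 150.0},
    "DiskSpace": {"2012-01-01": 2.5},
}


def flatten(row, separator=":"):
    """Turn the map of maps into one flat map keyed by aggregate keys."""
    flat = {}
    for quantity, days in row.items():
        for day, value in days.items():
            flat[f"{quantity}{separator}{day}"] = value
    return flat


# Aggregate key layout: a single level of columns in an ordinary CF.
flat_row = flatten(super_row)
```

With aggregate keys one column family can serve several plots, since the quantity name is part of each column name.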

20/31 Create a CF to store the raw data. Store the raw data for future use. The columns are specified when the CF is created. Row key: timestamp, of type DoubleType. Disk space: 21 GB, versus 4 GB in MySQL.

21/31 It is a static column family.
  Biao    JobClass  User  …  Site  …  CPUTime
  row 1   value     …     …
  row 2   value     …     …
  …
  row n   value     …     …

23/31 Create a column family for each “groupby”:
  cum_groupby_user            4.2 MB
  cum_groupby_site            11 MB
  cum_groupby_processingtype  428 KB
  cum_groupby_country         3.9 MB
  cum_groupby_grid            1.2 MB
  cum_groupby_usergroup       828 KB

24/31 Communicating with Cassandra. pycassa: a Python client for Apache Cassandra. Input data: when a new record arrives, insert the data into all the CFs at the same time. Performance: 4 CFs, 1100 records, about 18 s. Input data into raw_data_cf (a standard CF): pycassa.ColumnFamily.insert(key,columns). Input data into groupby_cfs (counter CFs): pycassa.ColumnFamily.add(key,column,value).
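The write path can be sketched without a running cluster by using an in-memory stand-in that exposes the same two calls, insert(key, columns) and add(key, column, value). FakeColumnFamily, the record fields and the store_record helper are assumptions for illustration; the real code would use pycassa.ColumnFamily objects instead.

```python
# Fan one incoming record out to the raw CF and to every groupby CF.
# FakeColumnFamily is an in-memory stand-in, not part of pycassa.
from collections import defaultdict
from datetime import datetime, timezone


class FakeColumnFamily:
    def __init__(self):
        self.rows = defaultdict(dict)

    def insert(self, key, columns):
        """Standard CF: set (overwrite) the given columns of a row."""
        self.rows[key].update(columns)

    def add(self, key, column, value=1):
        """Counter CF: increment a column instead of replacing it."""
        self.rows[key][column] = self.rows[key].get(column, 0) + value


raw_data_cf = FakeColumnFamily()
# One counter CF per groupby field; only two are shown here.
groupby_cfs = {"User": FakeColumnFamily(), "Site": FakeColumnFamily()}


def store_record(record):
    """Write the raw record, then bump the daily sum in each groupby CF."""
    raw_data_cf.insert(record["timestamp"], record)
    moment = datetime.fromtimestamp(record["timestamp"], tz=timezone.utc)
    column = "CPUTime:" + moment.strftime("%Y-%m-%d")
    for field, cf in groupby_cfs.items():
        cf.add(record[field], column, record["CPUTime"])


store_record({"timestamp": 1199145600.0, "User": "user1",
              "Site": "BES", "CPUTime": 3600})
store_record({"timestamp": 1199145660.0, "User": "user1",
              "Site": "BES", "CPUTime": 3600})
```

Every record costs one write per column family, which is the price of denormalization: reads for a plot later touch only the CF that matches the groupby.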

25/31 Example of the data in one row (figure not included in the transcript).

26/31 Retrieve data and generate a plot. start_time, end_time: determine the time span. generate: determines the columns within the time span. groupby: decides which CF should be chosen.
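How the four parameters drive a read can be sketched over plain dictionaries. The function name, the sample data and the date-string format are assumptions; against a real cluster the range filter would loosely correspond to a column slice between a start and an end column name.

```python
# Select the CF by groupby, then keep the columns of one quantity
# whose day lies in [start_day, end_day]. Sample data is invented.
def plot_series(cfs, groupby, generate, start_day, end_day):
    cf = cfs[groupby]                       # groupby picks the CF
    first = f"{generate}:{start_day}"       # generate picks the columns
    last = f"{generate}:{end_day}"
    series = {}
    for row_key, columns in cf.items():
        # ISO dates sort lexicographically in chronological order,
        # so a string range comparison selects the time span.
        points = {name.split(":", 1)[1]: value
                  for name, value in sorted(columns.items())
                  if first <= name <= last}
        if points:
            series[row_key] = points
    return series


cfs = {"User": {
    "user1": {"CPUTime:2008-01-01": 10, "CPUTime:2008-01-02": 20,
              "CPUTime:2008-01-05": 5, "DiskSpace:2008-01-01": 1},
    "user2": {"DiskSpace:2008-01-01": 3},
}}
result = plot_series(cfs, "User", "CPUTime", "2008-01-01", "2008-01-02")
```

Each entry of the result is one curve of the plot: a row key with its per-day values inside the requested time span.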

27/31 Badger01 hardware. CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz. CPU cores: 4. Memory: 16 GB. Comparison: the left plot is from the LHCb web portal; the right plot is generated by Cassandra on badger01.

28/31 Generating the same plot on badger01 using MySQL takes about 30 s.

31/31 Thanks
