Published by Margery Shaw. Modified over 9 years ago.
1
ZhangGang, Fabio, Deng Ziyan 2013.01.24
2
2/31 NoSQL Introduction to Cassandra Data Model Design Implementation
3
3/31
4
4/31 Learning NoSQL
- What is NoSQL?
- How it differs from an RDBMS
- Use Redis to get familiar with some interesting features of NoSQL
- Compare the options and choose one for DIRAC
- What do we need? High scalability, fast writes and reads, big data…
- Four candidates: Riak, CouchDB, Hadoop HBase, and Cassandra
- First, choose Hadoop HBase to explore
5
5/31
6
6/31 Hadoop HBase
- Modeled after Google's BigTable
- HBase must be installed on top of HDFS
- Deployment and maintenance are much more complicated than for Cassandra
- Then, turn to Cassandra
7
7/31
8
8/31 Some features of Cassandra
- Schema flexibility
- BigTable-like features: columns, column families
- Key/value pairs: row/column pairs
- Secondary indexes
- Writes are much faster than reads
- All nodes are similar: no single point of failure
- Tunable trade-offs for distribution and replication
- All of this research is based on standalone mode; production needs a cluster
9
9/31
- RDBMS: uses the join operation, increases normalization, and reduces redundancy
- NoSQL (Cassandra): to get better performance and high scalability, drops the join operation, which means denormalizing the data and maintaining multiple copies of it (increasing redundancy)
10
10/31 Data model
- Cassandra stores data in a multidimensional hash table: [keyspace][columnfamily][row][column]
- Some concepts in Cassandra: keyspace, column family, row, column
- Rough RDBMS analogies: keyspace -> database, column family -> table
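The four-level addressing on this slide can be pictured with nested dictionaries. This is only an in-memory sketch, not how Cassandra stores data on disk; the keyspace name `accounting` and the stored value are assumptions made up for illustration, while `cum_groupby_user` is a column family name that appears later in the deck.

```python
# Hypothetical in-memory model of [keyspace][columnfamily][row][column].
store = {
    "accounting": {                   # keyspace (assumed name), like a database
        "cum_groupby_user": {         # column family, like a table
            "user1": {                # row key
                "2013-01-24": 120.0,  # column name -> column value (made-up value)
            },
        },
    },
}


def get_column(store, keyspace, column_family, row, column):
    """Look up one value by walking the four levels in order."""
    return store[keyspace][column_family][row][column]


value = get_column(store, "accounting", "cum_groupby_user", "user1", "2013-01-24")
print(value)
```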
11
11/31 Another structure: super column family
- Can be thought of as a map of maps
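A "map of maps" can be sketched as one extra dictionary level inside each row. The layout below is an assumption for illustration only (super columns named after quantities, sub-columns named after days, made-up values); the slide does not say how the super columns were actually named.

```python
# Hypothetical layout: row key -> super column -> sub-column -> value.
super_cf = {
    "user1": {
        "CPUTime": {"2008-01-01": 12.0, "2008-01-02": 1.0},
        "DiskSpace": {"2008-01-01": 3.5},
    },
}


def get_subcolumns(cf, row_key, super_column):
    """Return the inner map stored under one super column of one row."""
    return cf[row_key][super_column]


cpu_by_day = get_subcolumns(super_cf, "user1", "CPUTime")
print(sorted(cpu_by_day))
```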
12
12/31 Model the query first
- With Cassandra we model the queries first and let the data be organized around them
- Think of the most common query paths the application will use, then create the column families needed to support them
Aggregate key pattern
- This pattern fuses two scalar values together with a separator to create an aggregate key
- Like: CPUTime:2008-01-01 (column name), value (column value)
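A minimal sketch of the aggregate key pattern, assuming ":" as the separator (as in the `CPUTime:2008-01-01` example) and ISO-formatted dates. The helper names `make_aggregate_key` and `split_aggregate_key` are hypothetical.

```python
SEPARATOR = ":"


def make_aggregate_key(plot_type, day):
    """Fuse two scalar values into one column name."""
    return f"{plot_type}{SEPARATOR}{day}"


def split_aggregate_key(key):
    """Recover the two scalars; split only on the first separator."""
    plot_type, day = key.split(SEPARATOR, 1)
    return plot_type, day


key = make_aggregate_key("CPUTime", "2008-01-01")
parts = split_aggregate_key(key)
print(key, parts)
```

Because ISO dates sort lexicographically, keys built this way sort first by quantity and then by day, which is what makes range reads over one quantity possible.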
13
13/31 A special column: counter column
- A counter is a special kind of column that stores a number which incrementally counts the occurrences of a particular event or process
- Set the value type to “CounterColumnType” when creating the column family
- The value is increased or decreased, not replaced
- Example: for site “BES”, at “2012-10-10”, 10 jobs (meaning the day has 10 CPUTime values): CPUTime:2012-10-01, val = sum(10 CPUTime)
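The "increase or decrease, not replace" behaviour can be sketched with a small in-memory class. `CounterRow` is a hypothetical stand-in, not a Cassandra or pycassa type, and the ten CPU times are made-up numbers.

```python
from collections import defaultdict


class CounterRow:
    """In-memory stand-in for one row of a counter column family."""

    def __init__(self):
        self.columns = defaultdict(int)

    def add(self, column, value=1):
        # A counter is adjusted by a delta; it is never overwritten.
        self.columns[column] += value
        return self.columns[column]


bes = CounterRow()
column = "CPUTime:2012-10-10"
job_cpu_times = [5, 3, 8, 2, 7, 1, 4, 6, 9, 5]  # 10 jobs, illustrative values

for cpu_time in job_cpu_times:
    bes.add(column, cpu_time)
total_after_jobs = bes.columns[column]

# A negative delta decreases the counter.
total_after_correction = bes.add(column, -5)
print(total_after_jobs, total_after_correction)
```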
14
14/31
15
15/31 Model the query first
- Four factors determine a plot: start time, end time, the plot to generate, and the groupby
- The data that a plot needs is always grouped by something
- So the data is preprocessed
16
16/31 Create a CF for CPUTime groupby user
- CF: standard column family
- Row key: user
- Column name: startTime
- Column value: CPUTime
- Problem: bad performance, and the column name is not unique
Row layout:
user 1 | startTime 1 | … | startTime n
       | value       | … |
…
user n | startTime 2 | … | NULL
       | value       | … | NULL
17
17/31 The first improvement
- CF: counter column family
- Row key: user
- Column name: startTime/86400 (aggregate by day)
- Column value: the sum of CPUTime within a day
- Problem: one CF for one plot
Row layout:
user 1 | 2008-01-01 | … | 2013-01-24
       | value      | … |
…
user n | 2009-05-17 | 2009-05-20 | NULL
       | value      |            | NULL
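The day bucketing (startTime/86400) and the per-day sum can be sketched as follows. This assumes `startTime` is a Unix timestamp in seconds and that days are cut at UTC midnight; the slide does not state the time zone, and the records are made up.

```python
import calendar
import datetime
from collections import defaultdict

SECONDS_PER_DAY = 86400


def day_bucket(start_time):
    """Map a Unix timestamp to its UTC day, formatted as YYYY-MM-DD."""
    day_index = int(start_time) // SECONDS_PER_DAY
    day_start = datetime.datetime.fromtimestamp(
        day_index * SECONDS_PER_DAY, tz=datetime.timezone.utc
    )
    return day_start.strftime("%Y-%m-%d")


def aggregate_by_day(records):
    """Sum CPUTime per user and per day, like a counter CF keyed by user."""
    totals = defaultdict(lambda: defaultdict(float))
    for user, start_time, cpu_time in records:
        totals[user][day_bucket(start_time)] += cpu_time
    return totals


midnight = calendar.timegm((2008, 1, 1, 0, 0, 0))  # 2008-01-01 00:00:00 UTC
records = [
    ("user1", midnight + 10, 5.0),
    ("user1", midnight + 3600, 7.0),
    ("user1", midnight + SECONDS_PER_DAY, 1.0),
]
totals = aggregate_by_day(records)
print(dict(totals["user1"]))
```

Many raw records collapse into one column per day, which keeps rows short and makes column names unique within a row.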
18
18/31 In Cassandra, there are two methods to solve the problem
- Use a super column family
- Use an aggregate key
19
19/31 The second improvement
- CF: counter column family
- Row key: user
- Column name: aggregate key, e.g. (CPUTime,2012-01-01), (DiskSpace,2012-01-01)
- Column value: the summed value within a day
Row layout:
user 1 | CPUTime:2008-01-01 | … | DiskSpace:2013-01-24
       | value              | … |
…
user n | CPUTime:2009-05-17 | JobCount:2009-05-20 | NULL
       | value              |                     | NULL
20
20/31 Create a CF to store the raw data
- Store the raw data for future use
- The columns are specified when the CF is created
- Row key: timestamp, of type DoubleType
- Disk space: 21GB (in MySQL: 4GB)
21
21/31 It is a static column family
Table | JobClass | User | … | Site | … | CPUTime
row 1 | value    | …    | …
row 2 | value    | …    | …
…
row n | value    | …    | …
…
22
22/31
23
23/31 Create a column family for each “groupby”
- cum_groupby_user: 4.2MB
- cum_groupby_site: 11MB
- cum_groupby_processingtype: 428KB
- cum_groupby_country: 3.9MB
- cum_groupby_grid: 1.2MB
- cum_groupby_usergroup: 828KB
24
24/31 Communicate with Cassandra
- pycassa: a Python client for Apache Cassandra
Input data
- When a new record arrives, insert its data into all the CFs at the same time
- Performance: 4 CFs, 1100 records, about 18s
- Input data into raw_data_cf (a standard CF): pycassa.ColumnFamily.insert(key, columns)
- Input data into groupby_cfs (counter CFs): pycassa.ColumnFamily.add(key, column, value)
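The write path described here (one insert into the raw CF, one counter add per groupby CF) can be sketched without a running cluster. `StandardCF` and `CounterCF` below are hypothetical in-memory stand-ins whose `insert(key, columns)` and `add(key, column, value)` methods only mirror the call shapes quoted on the slide; a real pycassa `ColumnFamily` needs a connection pool and a Cassandra server. The record fields and the `store_record` helper are also assumptions.

```python
from collections import defaultdict


class StandardCF:
    """Stand-in for a standard column family: columns are overwritten."""

    def __init__(self):
        self.rows = {}

    def insert(self, key, columns):
        self.rows.setdefault(key, {}).update(columns)


class CounterCF:
    """Stand-in for a counter column family: columns are incremented."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(float))

    def add(self, key, column, value=1):
        self.rows[key][column] += value


raw_data_cf = StandardCF()
groupby_cfs = {"User": CounterCF(), "Site": CounterCF()}  # one CF per groupby


def store_record(record):
    """Write one record to the raw CF and to every groupby counter CF."""
    raw_data_cf.insert(record["timestamp"], dict(record))
    column = "CPUTime:" + record["day"]  # aggregate key
    for field, cf in groupby_cfs.items():
        cf.add(record[field], column, record["CPUTime"])


store_record({"timestamp": 1.0, "day": "2012-10-10", "User": "user1",
              "Site": "BES", "CPUTime": 4.0})
store_record({"timestamp": 2.0, "day": "2012-10-10", "User": "user1",
              "Site": "OTHER", "CPUTime": 6.0})
print(groupby_cfs["User"].rows["user1"]["CPUTime:2012-10-10"])
```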
25
25/31 The data in one row looks like:
26
26/31 Retrieve data and generate a plot
- start_time, end_time: determine the time span
- generate: determines the columns, together with the time span
- groupby: decides which CF should be chosen
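How the four parameters could select data can be sketched over plain dictionaries. This assumes aggregate-key column names of the form `quantity:YYYY-MM-DD` and a dictionary of groupby CFs; `select_columns` and the sample values are hypothetical, and the string range comparison only imitates a sorted column slice rather than calling Cassandra.

```python
def select_columns(groupby_cfs, groupby, generate, start_day, end_day):
    """Pick the CF by groupby, then keep columns for one quantity and time span."""
    cf = groupby_cfs[groupby]
    low = f"{generate}:{start_day}"
    high = f"{generate}:{end_day}"
    result = {}
    for row_key, columns in cf.items():
        # Column names sort by quantity, then by ISO date, so a string
        # range selects exactly one quantity within the time span.
        series = {
            name.split(":", 1)[1]: value
            for name, value in sorted(columns.items())
            if low <= name <= high
        }
        if series:
            result[row_key] = series
    return result


cfs = {
    "User": {
        "user1": {
            "CPUTime:2008-01-01": 12.0,
            "CPUTime:2008-01-02": 1.0,
            "CPUTime:2008-02-01": 9.0,
            "DiskSpace:2008-01-01": 3.5,
        },
        "user2": {"DiskSpace:2008-01-01": 2.0},
    },
}
plot_data = select_columns(cfs, "User", "CPUTime", "2008-01-01", "2008-01-31")
print(plot_data)
```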
27
27/31 Badger01 hardware
- CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
- CPU cores: 4
- Memory: 16GB
Comparison
- Left: the plot taken from the LHCb web portal
- Right: the plot generated from Cassandra on badger01
28
28/31 Generating the same plot on badger01 using MySQL: about 30s
29
29/31
30
30/31
31
31/31 Thanks