Published by Shon Day. Modified over 9 years ago.
1
Discussion Zhang Gang 2012/11/8
2
Discussion 1. Progress on HBase 2. Cassandra or HBase
3
HBase Schema Design HBase reference guide: how to design a good HBase schema. – row key – column family
4
HBase Schema Design row key: – Monotonically increasing keys or time-series keys can cause writes to pile up on a single region. – Randomizing the input records so that they are not in sorted order can mitigate this, but it is best to avoid using a timestamp or a sequence as the row key. – At present I use startTime (a timestamp) as the row key; in future I will explore whether there is a better replacement.
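One common way to keep a timestamp in the row key without piling writes onto one region is to prefix it with a small hash-derived salt. The following is a minimal sketch of that idea, not the schema used in these slides; the bucket count of 16, the function name and the key layout are assumptions made for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # assumed number of salt buckets (hypothetical value)


def salted_row_key(start_time: int, num_buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix a timestamp with a hash-derived bucket number.

    Consecutive timestamps hash to different buckets, so they sort into
    different key ranges instead of all landing at the end of the table.
    """
    raw = str(start_time).encode("ascii")
    # Derive the bucket from the key itself so the same startTime
    # always maps to the same salted key.
    bucket = int(hashlib.md5(raw).hexdigest(), 16) % num_buckets
    return b"%02d-" % bucket + raw


if __name__ == "__main__":
    for ts in (1352332800, 1352332801, 1352332802):
        print(salted_row_key(ts))
```

The trade-off is that a time-range query can no longer be served by a single scan: it needs one scan per bucket, with the results merged on the client side.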
5
HBase Schema Design column family: – I was wrong about the schema with two column families. – HBase currently does not do well with anything above two or three column families. – Try to make do with one column family in your schemas if you can. – If you have thousands or even millions of columns, you can consider having more than one column family. We only have 21 columns, so one column family is enough and is the best choice.
6
HBase Schema Design Optimization (minimize row and column sizes) – In HBase, every value is stored as a cell that is accompanied by its row key, column name, and timestamp, so long row keys and column names waste a lot of space (see the later slides). – column family: keep the name as short as possible. – row key length: keep it as short as is reasonable while still being useful for the required data access.
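The cost of long names can be estimated with a little arithmetic. The sketch below assumes the classic HBase KeyValue layout (no tags, no compression or block encoding), in which every cell repeats its row key, family and qualifier next to the value; the column names and values are hypothetical.

```python
def keyvalue_size(row: bytes, family: bytes, qualifier: bytes, value: bytes) -> int:
    """Approximate on-disk size in bytes of one cell (assumed KeyValue layout)."""
    # Fixed overhead: key length (4) + value length (4) + row length (2)
    # + family length (1) + timestamp (8) + key type (1).
    fixed = 4 + 4 + 2 + 1 + 8 + 1
    return fixed + len(row) + len(family) + len(qualifier) + len(value)


if __name__ == "__main__":
    row = b"1352332800"   # a startTime-style row key
    value = b"CERN"       # hypothetical 4-byte value

    short = keyvalue_size(row, b"d", b"site", value)
    long_ = keyvalue_size(row, b"job_details", b"execution_site_name", value)
    print("short names:", short, "bytes per cell")
    print("long names: ", long_, "bytes per cell")
```

With a 4-byte value, most of each cell is key material either way, which is why shortening the family and qualifier names pays off on every single cell.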
7
Sqoop I have successfully configured Sqoop on my PC. On the farm it raises an exception, "access denied for user 'zhang'", but the data still seems to be transferred successfully. Command: – sqoop import --connect jdbc:
8
Sqoop Sqoop on my PC: – test: 81,280 records, 45.1613 s – test: 215,500 records, 73.2617 s – test: 1,539,763 records, 310 s – then: 35,427,339 records, 12,350.60 s (about 3.43 h) – the HBase table size is about 35 GB, larger than the MySQL table (5 GB), so designing a good schema is necessary.
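The timings above are easier to compare as throughput. This is a small sketch using only the figures on the slide; it assumes the duration of the full import is 12,350.60 s, which is what "about 3.43 h" implies.

```python
# (records, seconds) for each Sqoop run on the PC, taken from the slide.
RUNS = [
    (81_280, 45.1613),
    (215_500, 73.2617),
    (1_539_763, 310.0),
    (35_427_339, 12_350.60),  # assumption: consistent with "about 3.43 h"
]


def records_per_second(records: int, seconds: float) -> float:
    """Average import throughput for one run."""
    return records / seconds


def size_ratio(hbase_gb: float, mysql_gb: float) -> float:
    """How many times larger the HBase table is than the MySQL table."""
    return hbase_gb / mysql_gb


if __name__ == "__main__":
    for records, seconds in RUNS:
        rate = records_per_second(records, seconds)
        print(f"{records:>12,} records in {seconds:>10.2f} s: {rate:8.0f} records/s")
    print(f"HBase / MySQL size ratio: {size_ratio(35, 5):.1f}x")
```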
9
Sqoop Sqoop on the farm: – two exceptions: – then found access denied – import: 35,427,339 records, 5120 s (about 1.4 h) – hbase-name: 'hb_type_job' – row-key: 'startTime' – column-family: 'd'
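The three settings listed above correspond to Sqoop's `--hbase-table`, `--hbase-row-key` and `--column-family` options. The sketch below assembles such a command line as an argument list. The JDBC URL and the source table name are hypothetical placeholders, since the actual command is not shown in full, and this is not necessarily the exact command that was run.

```python
import shlex


def build_sqoop_import(jdbc_url: str, user: str, source_table: str) -> list:
    """Build an argument list for a MySQL-to-HBase Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,           # placeholder URL, see below
        "--username", user,
        "-P",                            # prompt for the password
        "--table", source_table,         # hypothetical MySQL table name
        "--hbase-table", "hb_type_job",  # target HBase table (from the slide)
        "--hbase-row-key", "startTime",  # column used as the row key
        "--column-family", "d",          # single, short column family
    ]


if __name__ == "__main__":
    cmd = build_sqoop_import(
        "jdbc:mysql://db.example.org/jobs",  # hypothetical host and database
        "zhang",
        "type_job",                          # hypothetical source table
    )
    print(shlex.join(cmd))
```

Building the command as a list keeps quoting explicit, and the same list could be handed to `subprocess.run` on a machine where Sqoop is installed.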
10
Sqoop
11
Cassandra or HBase
12
Review our requirements: – big data: 5 GB now, growing by 1.5 GB per year, which is not very big. – high scalability: we want the database we choose to scale well (many candidates have this feature). – write/read: we read more than we write (one of the reasons we chose HBase before).
13
Cassandra or HBase Cassandra: – Written in: Java – Main point: best of BigTable and Dynamo – Tunable trade-offs for distribution and replication (N, R, W) – Querying by column, range of keys – BigTable-like features: columns, column families – Has secondary indices – Writes are much faster than reads (!) – Map/reduce possible with Apache Hadoop – All nodes are similar, as opposed to Hadoop/HBase – Gossip protocol, multi data center, no single point of failure
14
Cassandra or HBase (C = Cassandra, H = HBase) – C has only one type of node; all nodes are similar. H consists of several different types of nodes (Master/RegionServer). – H must be deployed on top of HDFS; compared with this, C is much simpler. – Data consistency of C is tunable (N, W, R). – H has better support for map/reduce. – H provides the developer with row locking facilities, whereas C does not; C just uses timestamps. – C has better I/O performance and better scalability, but is not good at range scans. – CAP: C focuses on AP and H focuses on CP. – H can be queried through an SQL-like interface (Hive), so H supports SQL-style queries.
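The tunable consistency (N, W, R) mentioned above follows the usual quorum rule: with N replicas, a read of R replicas is guaranteed to see the latest write acknowledged by W replicas whenever R + W > N. The sketch below only illustrates that rule; the function name is an assumption and it does not talk to a cluster.

```python
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """Return True if every read set of size r overlaps every write set of size w.

    n: replication factor, r: replicas consulted per read,
    w: replicas that must acknowledge a write.
    """
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must be between 1 and n")
    return r + w > n


if __name__ == "__main__":
    n = 3
    print(read_sees_latest_write(n, r=2, w=2))  # quorum reads and writes: True
    print(read_sees_latest_write(n, r=1, w=1))  # fastest, eventual consistency: False
    print(read_sees_latest_write(n, r=1, w=3))  # cheap reads, expensive writes: True
```

Since our workload reads more than it writes, the last setting (small R, large W) is the kind of trade-off this tuning makes possible.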
15
Cassandra or HBase – The structure of C is simple, so deployment and maintenance are simple (saving money and time); compared with C, H is much more complex to deploy and maintain. – H may be more suitable for data warehousing and for large-scale data processing and analysis, while C is more suitable for real-time transaction processing and serving interactive data.
17
Cassandra or HBase The possibility that we start to explore Cassandra: – simpler than Hadoop/HBase. – written in Java (same as HBase). – pycassa: a Python client library for Apache Cassandra. Problem: there does not seem to be a ready-made tool for transferring data from MySQL to Cassandra.
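Without a ready-made tool, the transfer could be scripted by hand. The sketch below covers only the reshaping step: it turns MySQL rows (as dictionaries) into a mapping of row key to columns. pycassa's `ColumnFamily.batch_insert` is understood to accept a mapping of this shape, but that call is left out so the sketch runs without a cluster. The sample column names other than startTime are hypothetical.

```python
def rows_to_batch(rows, key_column="startTime"):
    """Reshape MySQL rows into {row_key: {column_name: value}}.

    Values are converted to strings and NULLs are skipped, since a
    column-family store simply omits absent columns.
    Rows that share the same key overwrite each other.
    """
    batch = {}
    for row in rows:
        key = str(row[key_column])
        batch[key] = {
            name: str(value)
            for name, value in row.items()
            if name != key_column and value is not None
        }
    return batch


if __name__ == "__main__":
    sample = [  # hypothetical rows, as a MySQL dict cursor might return them
        {"startTime": 1352332800, "site": "A", "status": "done"},
        {"startTime": 1352332801, "site": "B", "status": None},
    ]
    for key, columns in rows_to_batch(sample).items():
        print(key, columns)
```

Because rows with the same startTime overwrite each other here, a real transfer would need a row key that is unique per record.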