
1 Zhang Gang 2012.9.27

2 Big data; high scalability; write once, read many times; ……. (to be added)

3 Features

4 Written in: Erlang & C, some JavaScript
Main point: fault tolerance
Principles from Amazon's Dynamo paper
Tunable trade-offs for distribution and replication (N, R, W)
Map/reduce in JavaScript or Erlang
Masterless multi-site replication
Language support: includes Python
Supports full-text search, indexing and querying with the Riak Search server
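
The tunable (N, R, W) trade-off above can be illustrated with a toy quorum store. This is only a sketch of the Dynamo-style idea, not Riak code: the replicas are plain dicts and the version is a timestamp, both invented for illustration.

import time


class QuorumStore:
    """Toy simulation of Dynamo-style tunable N/R/W quorums (not Riak code)."""

    def __init__(self, n=3, r=2, w=2):
        self.n, self.r, self.w = n, r, w
        # One plain dict per "replica node"; a real system would use the network.
        self.replicas = [dict() for _ in range(n)]

    def put(self, key, value):
        """Write to replicas until W of them have acknowledged."""
        version = time.time()
        acks = 0
        for replica in self.replicas:
            replica[key] = (version, value)
            acks += 1
            if acks >= self.w:
                return True
        return False

    def get(self, key):
        """Ask replicas until R have answered; the newest version wins."""
        answers = []
        for replica in self.replicas:
            if key in replica:
                answers.append(replica[key])
            if len(answers) >= self.r:
                break
        if not answers:
            return None
        return max(answers, key=lambda a: a[0])[1]


# Choosing R + W > N (here 2 + 2 > 3) gives read-your-writes consistency;
# lowering R or W trades consistency for latency and availability.
store = QuorumStore(n=3, r=2, w=2)
store.put("job:42", {"site": "LCG.CERN.ch", "cputime": 1200})
print(store.get("job:42"))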

5 Best used: If you want something Cassandra-like (Dynamo-like), but no way you're gonna deal with the bloat and complexity. If you need very good single-site scalability, availability and fault-tolerance, but you're ready to pay for multi-site replication. For example: Point-of-sales data collection. Factory control systems. Places where even seconds of downtime hurt. Could be used as a well-update-able web server.

6 Written in: Erlang
Main point: embrace the web, ease of use
Document-oriented; data format: JSON
Bi-directional replication, designed with off-line operation in mind
MVCC - write operations do not block reads
Needs compacting from time to time
Views: embedded map/reduce
Built for off-line use
Automatically replicates all the data to all servers
Supports ACID semantics
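
Because CouchDB's interface is plain HTTP and JSON, a short sketch with the Python requests library shows the "embrace the web" point and the embedded map/reduce views. It assumes a CouchDB server on localhost:5984; the database name, document fields and view are invented for illustration.

import requests

BASE = "http://127.0.0.1:5984"
DB = "accounting_demo"                       # hypothetical database name

requests.put("%s/%s" % (BASE, DB))           # create the database (no-op if it exists)

# Store a JSON document; CouchDB assigns _id and a _rev (the MVCC revision).
doc = {"type": "job", "site": "LCG.CERN.ch", "cputime": 1200}
resp = requests.post("%s/%s" % (BASE, DB), json=doc)
print(resp.json())                           # {'ok': True, 'id': ..., 'rev': ...}

# Views are map/reduce functions stored in a design document (JavaScript by default).
design = {
    "views": {
        "cputime_by_site": {
            "map": "function(doc) { if (doc.type == 'job') emit(doc.site, doc.cputime); }",
            "reduce": "_sum",
        }
    }
}
requests.put("%s/%s/_design/stats" % (BASE, DB), json=design)

# Query the view: total CPU time per site.
result = requests.get("%s/%s/_design/stats/_view/cputime_by_site" % (BASE, DB),
                      params={"group": "true"})
print(result.json())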

7 Best used: The replication and synchronization capabilities of CouchDB make it ideal for use on mobile devices, where the network connection is not guaranteed but the application must keep working offline. For accumulating, occasionally changing data on which pre-defined queries are to be run. Places where versioning is important. For example: CRM, CMS systems. Master-master replication is an especially interesting feature, allowing easy multi-site deployments.

8 Written in: Java
Main point: best of BigTable and Dynamo
Tunable trade-offs for distribution and replication (N, R, W)
Querying by column or by range of keys
BigTable-like features: columns, column families
Has secondary indices
Writes are much faster than reads (!)
Map/reduce possible with Apache Hadoop
All nodes are similar, as opposed to Hadoop/HBase
Gossip protocol, multi-data-center support, no single point of failure
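
A rough sketch of the BigTable-like data model mentioned above: a row key maps to sorted columns inside a column family, each cell carrying a timestamp used for last-write-wins reconciliation, and the sort order is what makes querying by a range of column names cheap. Plain Python dicts and lists only; this is not a Cassandra client, and all names are invented.

import bisect
import time

# column_family[row_key] -> sorted list of (column_name, (value, timestamp))
column_family = {}


def insert(row_key, column, value):
    """Insert or update one cell, keeping columns sorted by name."""
    columns = column_family.setdefault(row_key, [])
    names = [c[0] for c in columns]
    entry = (column, (value, time.time()))
    i = bisect.bisect_left(names, column)
    if i < len(names) and names[i] == column:
        # Newer timestamp wins (last-write-wins), mirroring replica reconciliation.
        if entry[1][1] >= columns[i][1][1]:
            columns[i] = entry
    else:
        columns.insert(i, entry)


def get_slice(row_key, start_col, end_col):
    """Because columns are kept sorted, a contiguous range can be sliced out."""
    columns = column_family.get(row_key, [])
    names = [c[0] for c in columns]
    lo = bisect.bisect_left(names, start_col)
    hi = bisect.bisect_right(names, end_col)
    return [(name, value) for name, (value, _ts) in columns[lo:hi]]


insert("user:zhang", "2012-09-01:cputime", 1200)
insert("user:zhang", "2012-09-02:cputime", 900)
insert("user:zhang", "2012-09-15:cputime", 4100)
print(get_slice("user:zhang", "2012-09-01", "2012-09-10"))   # first two columns only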

9 Best used: When you write more than you read (logging). If every component of the system must be in Java. ("No one gets fired for choosing Apache's stuff.") For example: banking, financial industry. Writes are faster than reads, so one natural niche is real-time data analysis.

10 Written in: Java
Main point: billions of rows x millions of columns
Modeled after Google's BigTable
Uses Hadoop's HDFS as storage
Map/reduce with Hadoop
Optimizations for real-time queries
A high-performance Thrift gateway (access interface)
Cascading, Hive, and Pig source and sink modules
Random-access performance is like MySQL
A cluster consists of several different types of nodes (Master/RegionServer)
Does not scale down to small installations
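
As one way to picture the Thrift gateway mentioned above, here is a sketch using happybase, a third-party Python Thrift client (an assumption; the slide only says "Thrift gateway"). The host, table, column family and row keys are invented for illustration.

import happybase

# Hypothetical Thrift gateway host; 9090 is the usual Thrift port.
connection = happybase.Connection("hbase-thrift-host", port=9090)

# HBase tables are created with their column families declared up front.
if b"accounting_demo" not in connection.tables():
    connection.create_table("accounting_demo", {"cf": dict()})

table = connection.table("accounting_demo")

# Row keys are byte strings; cells live under "family:qualifier".
table.put(b"site=CERN|2012-09-27", {b"cf:cputime": b"1200", b"cf:jobs": b"35"})
table.put(b"site=IN2P3|2012-09-27", {b"cf:cputime": b"800", b"cf:jobs": b"12"})

# Random access by row key ...
print(table.row(b"site=CERN|2012-09-27"))

# ... and ordered range scans over row keys, which is where HBase shines.
for key, data in table.scan(row_prefix=b"site="):
    print(key, data)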

11 Best used: Hadoop is probably still the best way to run Map/Reduce jobs on huge datasets. Best if you use the Hadoop/HDFS stack already. For example: Analysing log data.
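
For the log-analysis example, a minimal Hadoop Streaming job in Python might look like the following. The log format (whitespace-separated fields with the site name in the third field) is an assumption made purely for illustration, not the presenter's actual job.

# mapper.py -- emit one count per log record, keyed by site.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) >= 3:
        site = fields[2]                    # assumed position of the site field
        print("%s\t1" % site)               # key<TAB>value for the shuffle phase

# reducer.py -- sum the counts per site; Hadoop delivers mapper output sorted
# by key, so equal keys arrive consecutively.
import sys

current_site, count = None, 0
for line in sys.stdin:
    site, value = line.rstrip("\n").split("\t")
    if site != current_site:
        if current_site is not None:
            print("%s\t%d" % (current_site, count))
        current_site, count = site, 0
    count += int(value)
if current_site is not None:
    print("%s\t%d" % (current_site, count))

The two scripts would be launched with the hadoop-streaming jar, passing them via -mapper/-reducer/-file and the HDFS input and output directories via -input/-output (paths depending on the installation).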

12 Comparison

13 Points that favor CouchDB: it is a document store; offline replication; it embraces the web. However, it automatically replicates all the data to all servers, which is impractical for a very large number of replicas and for very large databases. This feature may be unsuitable for the DIRAC Accounting System. So, compared with CouchDB, I think Cassandra wins.

14 Both are architecturally strongly influenced by Dynamo, and both go beyond Dynamo in providing a richer-than-pure-K/V data model. Points that favor Cassandra: speed; support for clusters spanning multiple data centers; big names using it (Digg, Twitter, Facebook, WebEx, ...). Points that favor Riak: map/reduce support out of the box (Cassandra can do it with Hadoop map/reduce). So, I think Cassandra wins again.

15 C (Cassandra) has only one type of node; all nodes are similar. H (HBase) consists of several different types of nodes (Master/RegionServer).
H must be deployed over HDFS; compared with this, C is much simpler.
Data consistency in C is tunable (N, W, R).
H has better support for map/reduce.
H provides the developer with row-locking facilities, whereas Cassandra does not; C just uses timestamps.
C has better I/O performance and better scalability, but is not good at range scans.
CAP: C focuses on AP while H focuses on CP.
H has an SQL-compatible interface (Hive), so H supports SQL.

16 The structure of C is simple, so deployment and maintenance are simple; compared with C (which saves money and time), H is much more complex to deploy and maintain. But we already have a Hadoop cluster here. H may be more suitable for data warehousing and large-scale data processing and analysis, while C is more suitable for real-time transaction processing and serving interactive data.

17 HBase has been recognized by the WLCG Database Technical Evolution Group as having the greatest potential impact on the LHC experiments of all the NoSQL technologies. The CERN IT organization is setting up a cluster to try it. So, for an Accounting system, I think HBase may be a good choice.

19 CouchDB: CMS uses CouchDB in production for parts of its Data and Workflow Management systems, in particular for some queues and for the job state machine. The installation has 3 replicas of a CouchDB database at CERN and 4 replicas of the same database at Fermilab.

20 HBase: HBase is used in production by ATLAS in its Distributed Data Manager, DQ2, for both log analysis and accounting, on a 12-node cluster. Their method for producing the accounting summary was 8 to 20 times faster than the same method on their shared Oracle system, depending on the HDFS replication level.

21 Cassandra: Cassandra is used in production by ATLAS PanDA monitoring. They chose to host it at BNL on only 3 nodes that are quite high-powered: each node has 24 cores and 1 terabyte of RAID0 solid-state disks (SSDs).

22 Script: use the records in the type_* tables to draw some pie plots.

23 FUNCTION:

def generatePlotByTime(groupby, generate, keyTableName, startTime, endTime):
    """The main function: generate a plot from the given parameters."""

def getTrueValue(keyTableName, index):
    """Select the key tables to get the true value by index."""

Calling like this:

generatePlotByTime('Site', 'CPUTime', 'ac_key_Lhcb-Production_job_Site',
                   '2010-6-20', '2012-6-20')
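
A possible fleshed-out version of these two functions is sketched below. Everything in it is an assumption made for illustration: the MySQL connection details, the layout of the ac_type_* fact table (integer indices for the group-by keys plus numeric columns such as CPUTime and DiskSpace), the ac_key_* tables (id -> value), and the use of MySQLdb and matplotlib; the slides do not show the real implementation.

import MySQLdb
import matplotlib.pyplot as plt

# Assumed accounting database; the credentials are placeholders.
_conn = MySQLdb.connect(host="localhost", user="dirac", passwd="***", db="accounting")
_cursor = _conn.cursor()


def getTrueValue(keyTableName, index):
    """Select the key table to turn an integer index into its real value."""
    _cursor.execute("SELECT value FROM %s WHERE id = %%s" % keyTableName, (index,))
    row = _cursor.fetchone()
    return row[0] if row else "unknown(%s)" % index


def generatePlotByTime(groupby, generate, keyTableName, startTime, endTime):
    """The main function: sum the `generate` quantity per `groupby` key between
    the two dates and draw a pie plot (assumed table and column names)."""
    _cursor.execute(
        "SELECT %s, SUM(%s) FROM ac_type_job "                # assumed fact table name
        "WHERE startTime >= %%s AND endTime <= %%s GROUP BY %s"
        % (groupby, generate, groupby),
        (startTime, endTime))
    labels, sizes = [], []
    for index, total in _cursor.fetchall():
        labels.append(getTrueValue(keyTableName, index))      # index -> readable name
        sizes.append(float(total))

    plt.figure()
    plt.pie(sizes, labels=labels, autopct="%1.1f%%")
    plt.title("%s grouped by %s" % (generate, groupby))
    plt.savefig("%s_by_%s.png" % (generate, groupby))


# Same call pattern as on the slide:
# generatePlotByTime('Site', 'CPUTime', 'ac_key_Lhcb-Production_job_Site',
#                    '2010-6-20', '2012-6-20')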

24 DiskSpace grouped by Site: cost about 97.39 s

25 CPU time grouped by User: cost about 97.03 s

26 CPU time grouped by UserGroup: cost about 93.62 s

27 DiskSpace grouped by ProcessingType: cost 94.82 s

28 Processing time: 86.64 s

29 Processing time: 97.69 s

30 Thanks

