Published by Marlene Cook. Modified over 9 years ago.
1
Analytics: SQL or NoSQL?
Richard Taylor, Chair, Business Intelligence SIG
2
The NoSQL Movement
Meetup June 11, 2009 in San Francisco
NoSQL name proposed by Eric Evans
2004 BigTable (Google)
2007 Dynamo (Amazon)
2008 Cassandra (Facebook)
Hadoop/HBase (Yahoo)
Project Voldemort (LinkedIn)
NoSQL Conferences
3
Relational Database/SQL
4
Database Timeline (1970 to 2010)
1969 CODASYL: Network database, Schema, DDL/DML
1970 Codd: Relational Model
1974 SEQUEL
1979 Oracle
1980 Gray: Transaction
1981 Bernstein and Goodman: Multi-version Concurrency Control
1989 SQL-89
1992 SQL-92
1995 Bernstein et al: Critique of ANSI SQL Isolation Levels
1999 SQL:1999: Object Relational
2003 SQL:2003: Analytics extensions
5
Relational Model
Table – n-tuple
Row, Column
Normalized data, “Atomic”
Key: Primary Key, Foreign Key, Multi-column Key
Relationship on key
Operations on tables: select, project, join
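The three table operations named on this slide can be sketched in a few lines of Python. This is only an illustration: the `stores` and `sales` relations, their column names, and the helper functions are assumptions made up for the example, and `project` keeps duplicate rows (as SQL does) rather than enforcing strict set semantics.

```python
# Two tiny relations as lists of rows; all table and column names are hypothetical.
stores = [
    {"store_key": 1, "city": "Oakland"},
    {"store_key": 2, "city": "San Jose"},
]
sales = [
    {"receipt": 100, "store_key": 1, "revenue": 12.5},
    {"receipt": 101, "store_key": 2, "revenue": 7.0},
    {"receipt": 102, "store_key": 1, "revenue": 3.25},
]

def select(rows, predicate):
    """Select: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """Project: keep only the named columns of each row (duplicates are kept)."""
    return [{c: row[c] for c in columns} for row in rows]

def join(left, right, key):
    """Equi-join: match a foreign key in `left` to a primary key in `right`."""
    index = {row[key]: row for row in right}
    return [{**l, **index[l[key]]} for l in left if l[key] in index]

# Receipts and revenue for sales made in Oakland stores.
oakland_sales = project(
    select(join(sales, stores, "store_key"), lambda r: r["city"] == "Oakland"),
    ["receipt", "revenue"],
)
```

The relationship between the two tables exists only through the shared `store_key` value, which is the "relationship on key" idea from the slide.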
6
SQL
Designed for Transaction Processing
Good
–Easily handles simple cases
–Everyone has a Query Language
Bad
–Data access language (not Turing complete)
–Declarative Language (4GL)
–Impedance mismatch with procedural languages
–Complicated cases get repetitive
7
Normalization
Refine design of structured data
–“Atomic”
–No repeating groups
–Data item depends on key (and nothing else)
Avoid modification anomalies
–Ensure every data item is stored only once
Avoid bias to any particular pattern of querying
–Allow data to be accessed from every angle
Denormalization
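A small sketch of the modification anomaly that normalization is meant to avoid. The rows, keys, and city names below are invented for illustration; the point is only that a value stored twice can end up in two conflicting states, while a value stored once cannot.

```python
# Denormalized: the store's city is repeated on every sale row (hypothetical data).
denormalized = [
    {"receipt": 100, "store_key": 1, "city": "Oakland"},
    {"receipt": 102, "store_key": 1, "city": "Oakland"},
]
# An update that touches only one of the copies...
denormalized[0]["city"] = "Berkeley"
# ...leaves store 1 with two conflicting cities: a modification anomaly.
cities_for_store_1 = {row["city"] for row in denormalized if row["store_key"] == 1}

# Normalized: the city is stored once and depends only on the store key.
stores = {1: {"city": "Oakland"}}
sales = [
    {"receipt": 100, "store_key": 1},
    {"receipt": 102, "store_key": 1},
]
stores[1]["city"] = "Berkeley"  # a single update, visible from every sale
normalized_cities = {stores[sale["store_key"]]["city"] for sale in sales}
```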
8
Star Schema Example
Fact Table: Date_key, Store_key, Promotion_key, Product_key, Receipt_number, Quantity, Revenue, Unit_price
Dimension tables: Product, Store, Promotion, Date
Date: Date_key, Day_in_week, Day_in_month, Day_in_year, Day_name, Week_in_month, Week_in_year, Month_nbr, Month_name, Quarter, Year, Holiday, Holiday_desc, …
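A typical star-schema query joins the fact table to one or more dimensions and aggregates a measure. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the slide's tables; the table names `sales_fact` and `date_dim`, the reduced column set, and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key   INTEGER PRIMARY KEY,
    day_name   TEXT,
    month_name TEXT,
    year       INTEGER
);
CREATE TABLE sales_fact (
    date_key       INTEGER REFERENCES date_dim(date_key),
    store_key      INTEGER,
    promotion_key  INTEGER,
    product_key    INTEGER,
    receipt_number INTEGER,
    quantity       INTEGER,
    revenue        REAL,
    unit_price     REAL
);
-- Made-up sample rows.
INSERT INTO date_dim VALUES
    (1, 'Monday', 'June', 2009),
    (2, 'Tuesday', 'June', 2009),
    (3, 'Wednesday', 'July', 2009);
INSERT INTO sales_fact VALUES
    (1, 10, 0, 500, 9001, 2, 20.0, 10.0),
    (2, 10, 0, 501, 9002, 1, 5.0, 5.0),
    (3, 11, 0, 500, 9003, 3, 30.0, 10.0);
""")

# Revenue per month: join the fact table to the Date dimension, then aggregate.
revenue_by_month = conn.execute("""
    SELECT d.month_name, SUM(f.revenue)
    FROM sales_fact AS f
    JOIN date_dim AS d ON d.date_key = f.date_key
    GROUP BY d.month_name
    ORDER BY d.month_name
""").fetchall()
```

Slicing by store, product, or promotion follows the same pattern with a different dimension table in the join.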
9
Database Summary
Costs
–Fixed schema
–Normalization
–Transform data on load
–Cost of scaling
–Problems with large objects
–Complicated software
Benefits
–Mature technology
–Precise querying
–Star Schema – historic data
10
Tuple Store/NoSQL
11
Tuple Storage Systems
Google Database System
–Chubby – Lock/metadata manager
–Google File System – Distributed file system
–Bigtable – Tuple storage on GFS
–Map Reduce – Data processing on tuples
Other tuple stores
–Voldemort – Amazon Dynamo
–Cassandra
–HBase
–Hypertable
12
Tuple Store Model
One Table
Operate on Map: Set of (Key, Value)
Structured Key: (Key, Column, Timestamp)
Unstructured Value
Operations: select, project
Map Reduce
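The model on this slide can be pictured as a single map from a structured key to an opaque value. The following is a minimal in-memory sketch, not how any of the named systems is implemented; the row keys, column names, and helper functions are hypothetical.

```python
# One table: a map from (row_key, column, timestamp) to an uninterpreted value.
store = {
    ("com.example/index", "contents", 1): b"<html>v1</html>",
    ("com.example/index", "contents", 2): b"<html>v2</html>",
    ("com.example/index", "language", 1): b"en",
    ("org.example/home", "contents", 1): b"<html>home</html>",
}

def select(store, row_key):
    """Select: all entries whose structured key starts with the given row key."""
    return {k: v for k, v in store.items() if k[0] == row_key}

def project(store, column):
    """Project: all entries for one column, across every row."""
    return {k: v for k, v in store.items() if k[1] == column}

def latest(store, row_key, column):
    """Most recent version of one cell, or None if the cell does not exist."""
    versions = [(k[2], v) for k, v in store.items()
                if k[0] == row_key and k[1] == column]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]

newest_index = latest(store, "com.example/index", "contents")
```

The store never looks inside the value; all structure lives in the key, which is why select and project are the only operations available without further processing.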
13
Map Reduce
Define two functions
–Map
  Input: tuple
  Output: list of tuples
–Reduce
  Input: key, list of values
  Output: list or tuple
Specify a cluster
Specify input and output tuple stores
Framework does the rest
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { list(v3) } -> { (k2, v3) }
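The three stages above can be imitated on a single machine to make the data flow concrete. This is a toy, single-process sketch under the assumption that everything fits in memory; a real framework distributes each stage across a cluster. The function name `map_reduce` and the word-count usage are illustrative, not taken from the slides.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce over an iterable of (k1, v1) pairs."""
    # Map: each (k1, v1) produces a list of (k2, v2).
    # Group: collect every v2 under its k2, giving (k2, list(v2)).
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce: each (k2, list(v2)) produces one v3, stored as (k2, v3).
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

# Usage: word count over two tiny "documents".
documents = [("doc1", "a b a"), ("doc2", "b c")]

def emit_words(doc_id, text):
    return [(word, 1) for word in text.split()]

def sum_counts(word, counts):
    return sum(counts)

word_counts = map_reduce(documents, emit_words, sum_counts)
```

Only `emit_words` and `sum_counts` are specific to the problem; the grouping step in the middle is what the framework provides.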
14
Map Reduce Example
For each web page count the number of pages that reference that page
Input tuple store is WWW: { (URL, Web Page) }
Map Function: for each anchor on web page, emit (anchorURL, 1)
Reduce Function: emit (anchorURL, sum(list))
{ Map(k1, v1) } -> { list(k2, v2) }
{ list(k2, v2) } -> { (k2, list(v2)) }
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
Output tuple store is { (URL, count) }
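A single-machine sketch of this example, using the standard library's `html.parser` to find anchors. The three-page `www` input and the names `AnchorCollector`, `map_page`, and `reduce_counts` are made up for illustration; the grouping loop stands in for what a Map Reduce framework would do across a cluster.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def map_page(url, html):
    """Map: for each anchor on the page, emit (anchorURL, 1)."""
    collector = AnchorCollector()
    collector.feed(html)
    return [(href, 1) for href in collector.hrefs]

def reduce_counts(anchor_url, ones):
    """Reduce: emit the sum of the 1s collected for this URL."""
    return sum(ones)

# Hypothetical input tuple store: (URL, Web Page).
www = [
    ("http://a.example/",
     '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>'),
    ("http://b.example/", '<a href="http://c.example/">c</a>'),
    ("http://c.example/", "<p>no links</p>"),
]

# Group mapped values by key, then reduce each group.
grouped = defaultdict(list)
for url, html in www:
    for anchor_url, one in map_page(url, html):
        grouped[anchor_url].append(one)
inbound = {u: reduce_counts(u, ones) for u, ones in grouped.items()}
```

As written, this counts anchors rather than distinct referring pages, so a page that links to the same target twice is counted twice. Pages with no inbound links do not appear in the output at all.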
15
Example in SQL
For each web page count the number of pages that reference that page
-- page is the referencing URL, ref_page is the referenced URL
CREATE TABLE links (
    page     VARCHAR(255) NOT NULL,
    ref_page VARCHAR(255) NOT NULL,
    PRIMARY KEY (page, ref_page)
);
SELECT ref_page, COUNT(DISTINCT page)
FROM links
GROUP BY ref_page;
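To try the SQL version, the statements can be run against an in-memory SQLite database from Python. The sample rows below are invented, `TEXT` is used as a stand-in column type for URLs, and an `ORDER BY` is added only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE links (
        page     TEXT NOT NULL,   -- referencing URL
        ref_page TEXT NOT NULL,   -- referenced URL
        PRIMARY KEY (page, ref_page)
    )
""")
# Hypothetical link data: a links to b and c, b links to c.
conn.executemany(
    "INSERT INTO links VALUES (?, ?)",
    [
        ("http://a.example/", "http://b.example/"),
        ("http://a.example/", "http://c.example/"),
        ("http://b.example/", "http://c.example/"),
    ],
)

inbound_counts = conn.execute("""
    SELECT ref_page, COUNT(DISTINCT page)
    FROM links
    GROUP BY ref_page
    ORDER BY ref_page
""").fetchall()
```

Because (page, ref_page) is the primary key, each referencing page appears at most once per target, so `DISTINCT` does not change the counts here; it documents the intent.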
16
Tuple Store Summary
Semi-structured data
–No need to normalize data
Simple implementations
–Cheap, fast, scalable
Map Reduce Processing
–Simple programming (for geeks)
Issues
–No guidance from schema
–No model for historic data
Hadoop wins Sort Benchmark
17
Synthesis
18
Summary
SQL
–Structured data
–Precise
–Historic data
–Needs transformation
–Scalability issues
NoSQL
–Cheap
–Scalable
–Handles large data
19
Enterprise Model
Data areas: Money, Content, Analytics
Storage: Relational DB, NoSQL, ?
Metadata?
Issues:
- Data volume
- Query requirements
20
Analytics Architecture
Tuple Store, Map Reduce Processing: TB+/day
RDB Data Warehouse: GB++/day
Outputs: Cubes, Reports, etc.
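One way to read this architecture is that bulk processing shrinks high-volume raw data into aggregates, which are then loaded into a relational warehouse for precise querying. The sketch below follows that reading only as an assumption: a `Counter` stands in for the Map Reduce stage, SQLite stands in for the data warehouse, and the event data and the `page_hits` table are made up.

```python
import sqlite3
from collections import Counter

# Raw events as they might sit in a tuple store: (day, page). Hypothetical data.
raw_events = [
    ("2009-06-11", "/home"),
    ("2009-06-11", "/home"),
    ("2009-06-11", "/about"),
    ("2009-06-12", "/home"),
]

# Stand-in for the Map Reduce stage: reduce raw events to one count per (day, page).
daily_hits = Counter(raw_events)

# Stand-in for the RDB data warehouse: load only the much smaller aggregate.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE page_hits (day TEXT, page TEXT, hits INTEGER, "
    "PRIMARY KEY (day, page))"
)
warehouse.executemany(
    "INSERT INTO page_hits VALUES (?, ?, ?)",
    [(day, page, hits) for (day, page), hits in daily_hits.items()],
)

# A report is then an ordinary SQL query over the aggregate.
report = warehouse.execute(
    "SELECT day, SUM(hits) FROM page_hits GROUP BY day ORDER BY day"
).fetchall()
```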
21
Summary
It is all about structured data
–How much do we want?
–How much can we afford?