NoSQL for the SQL Server Pro

Presentation transcript:

NoSQL for the SQL Server Pro Lynn Langit Feb 2013 – SDC, Sweden

Is NoSQL just Hadoop?
HUGE hype factor over the last few years
http://hadoop.apache.org/
http://en.wikipedia.org/wiki/Apache_Hadoop
Apache Hadoop is a software framework that:
- supports data-intensive distributed applications under a free license
- enables applications to work with thousands of nodes and petabytes of data
- was inspired by Google's MapReduce and Google File System (GFS) papers

Hadoop in the Enterprise http://www.oracle.com/technetwork/bdc/hadoop-loader/overview/index.html http://www.microsoft.com/download/en/details.aspx?id=27584

Working with Hadoop
Common Tools / Languages
- Java (JDK) / Eclipse
- MapReduce: Map (query/format), Reduce (aggregate); plug-in for Eclipse (Java)
- Pig (ETL -- Java)
- Hive (HQL Query)
- HBase tables
Others
- Mahout (analyze)
- Karmasphere (analyze)
- R (analyze)
http://hortonworks.com/technology/hortonworksdataplatform/
More about HBase, from the O’Reilly ‘Getting Ready for BigData’ report: “Enter HBase, a column-oriented database that runs on top of HDFS. Modeled after Google’s BigTable, the project’s goal is to host billions of rows of data for rapid access. MapReduce can use HBase as both a source and a destination for its computations, and Hive and Pig can be used in combination with HBase. In order to grant random access to the data, HBase does impose a few restrictions: performance with Hive is 4-5 times slower than plain HDFS, and the maximum amount of data you can store is approximately a petabyte, versus HDFS’ limit of over 30PB.”
http://www.cloudera.com/
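The Map (query/format) and Reduce (aggregate) split listed above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and a real cluster would run the map and reduce steps on many nodes in parallel.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: aggregate all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Calling `word_count(["big data", "Big table"])` counts "big" twice because the map step lower-cases each word before emitting it.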

Demo - HDInsight - Cluster Allocation https://www.hadooponazure.com/Account Demo - http://www.youtube.com/watch?v=ugi9C6s_sH4

What is the relationship? NoSQL BigData

BigData = Exponentially More Data Retail Example -> ‘Feedback Economy’ Number of transactions Number of behaviors (collected every minute) From the O’Reilly / Strata “Getting Ready for Big Data” Report… “the three Vs of volume, velocity and variety are commonly used to characterize different aspects of big data”

BigData = ‘Next State’ Questions What could happen? Why didn’t this happen? When will the next new thing happen? What will the next new thing be? What happens? Collecting Behavioral data

Demo - HDInsight - MapReduce https://www.hadooponazure.com/Account Demo - http://www.youtube.com/watch?v=XcHz8aUDDN8 and http://www.youtube.com/watch?v=c7oHntP8HBI

Hitting (Relational) Walls
- CA: Highly-available consistency
- CP: Enforced consistency
- AP: Eventual consistency
http://berb.github.com/diploma-thesis/original/061_challenge.html
http://nosqltips.blogspot.com/2011/04/cap-diagram-for-distribution.html
http://blog.mccrory.me/2010/11/03/cap-theorem-and-the-clouds/
ACID vs BASE
ACID
RDBMS are the predominant database systems currently in use for most applications, including web applications. Their data model and their internals are strongly connected to transactional behaviour when operating on data. However, transactional behaviour is not solely related to RDBMS, but is also used for other systems. A set of properties describes the guarantees that database transactions generally provide in order to maintain the validity of data [Hae83].
Atomicity: This property determines that a transaction executes in an "all or nothing" manner. A transaction can either be a single operation, or a sequence of operations or sub-transactions. As a result, a transaction either succeeds and the database state is changed, or the database remains unchanged, and all operations are dismissed.
Consistency: In the context of transactions, we define consistency as the transition from one valid state to another, never leaving behind an invalid state when executing transactions.
Isolation: The concept of isolation guarantees that no transaction sees premature operations of other running transactions. In essence, this prevents conflicts between concurrent transactions.
Durability: The durability property assures persistence of executed transactions. Once a transaction has committed, the outcome of the transaction, such as state changes, is kept, even in case of a crash or other failures.
Strongly adhering to the principles of ACID results in an execution order that has the same effect as a purely serial execution. In other words, there is always a serially equivalent order of transactions that represents the exact same state [Dol05].
It is obvious that ensuring a serializable order negatively affects performance and concurrency, even when a single machine is used. In fact, some of the properties are often relaxed to a certain extent in order to improve performance. A weaker isolation level between transactions is the most used mechanism to speed up transactions and their throughput. Stepwise, a transactional system can leave serializability and fall back to the weaker isolation levels repeatable reads, read committed and read uncommitted. These levels gradually remove range locks, read locks and write locks (in that order). As a result, concurrent transactions are less isolated and can see partial results of other transactions, yielding so-called read phenomena. Some implementations also weaken the durability property by not guaranteeing to write directly to disk. Instead, committed states are buffered in memory and eventually flushed to disk. This heavily decreases latencies at the cost of data integrity.
Consistency is still a core property of the ACID model that cannot be relaxed easily. The mutual dependencies of the properties make it impossible to remove a single property without affecting the others. Referring back to the CAP theorem, we have seen the trade-off between consistency and availability regarding distributed database systems that must tolerate partitions. In case we choose a database system that follows the ACID paradigm, we cannot guarantee high availability anymore. The usage of ACID as part of a distributed system yields the need of distributed transactions or similar mechanisms for preserving the transactional properties when state is shared and sharded onto multiple nodes.
Now let us reconsider what would happen if we evict the burden of distributed transactions. As we are talking about distributed systems, we have no global shared state by default. The only knowledge we have is a per-node knowledge of its own past.
Having no global time, no global now, we cannot inherently have atomic operations on system level, as operations occur at different times on different machines. This softens isolation and we must leave the notion of global state for now. Having no immediate, global state of the system in turn endangers durability.
In conclusion, building distributed systems adhering to the ACID paradigm is a demanding challenge. It requires complex coordination and synchronization efforts between all involved nodes, and generates considerable communication overhead within the entire system. It is not for nothing that distributed transactions are feared by many architects [Hel09,Alv11]. Although it is possible to build such systems, some of the original motivations for using a distributed database system have been mitigated on this path. Isolation and serializability contradict scalability and concurrency. Therefore, we will now consider an alternative model for consistency that sacrifices consistency for other properties that are interesting for certain systems.
BASE
This alternative consistency model is basically the subsumption of properties resulting from a system that provides availability and partition tolerance, but no strong consistency [Pri08]. While a strong consistency model as provided by ACID implies that all subsequent reads after a write yield the new, updated state for all clients and on all nodes of the system, this is weakened for BASE. Instead, the weak consistency of BASE comes with an inconsistency window, a period in which the propagation of the update is not yet guaranteed.
Basically available: The availability of the system even in case of failures represents a strong feature of the BASE model.
Soft state: No strong consistency is provided and clients must accept stale state under certain circumstances.
Eventually consistent: Consistency is provided in a "best effort" manner, yielding a consistent state as soon as possible.
The optimistic behaviour of BASE represents a best effort approach towards consistency, but is also said to be simpler and faster. Availability and scaling capacities are primary objectives at the expense of consistency. This has an impact on the application logic, as the developer must be aware of the possibility of stale data. On the other hand, favoring availability over consistency also has benefits for the application logic in some scenarios. For instance, a partial system failure of an ACID system might reject write operations, forcing the application to handle the data to be written somehow. A BASE system in turn might always accept writes, even in case of network failures, but they might not be visible to other nodes immediately.
The applicability of relaxed consistency models depends very much on the application requirements. Strict constraints of balanced accounts for a banking application do not fit eventual consistency naturally. But many web applications built around social objects and user interactions can actually accept slightly stale data. When the inconsistency window is on average smaller than the time between request/response cycles of user interactions, a user might not even realize any kind of inconsistency at all.
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
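The inconsistency window described in the notes above can be made concrete with a toy model. The sketch below is an illustrative assumption, not how any particular NoSQL product replicates data: the class names `Replica` and `EventuallyConsistentStore` and the explicit `propagate()` call are hypothetical stand-ins for background replication.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy BASE-style store: writes are always accepted, then propagated later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # updates not yet propagated: the inconsistency window

    def write(self, key, value):
        # The write is accepted immediately by the first replica only.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit any replica and can therefore return stale state.
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        # Stand-in for background replication; closes the inconsistency window.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()
```

Between `write()` and `propagate()`, a read against replica 0 sees the new value while reads against the other replicas still see the old state (soft state). After `propagate()`, all replicas agree (eventual consistency).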

So many NoSQL options
More than just the Elephant in the room
- 120+ types of NoSQL databases
- http://hadoop.apache.org/ & http://www.mongodb.org/
- Wikipedia - http://en.wikipedia.org/wiki/NoSQL
- List of NoSQL databases - http://nosql-database.org/
- The good, the bad - http://www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772

Flavors of NoSQL http://bigdatanerd.wordpress.com/2012/01/04/why-nosql-part-2-overview-of-data-modelrelational-nosql/ http://docs.jboss.org/hibernate/ogm/3.0/reference/en-US/html_single/

Key / Value Database
- Schema-less
- State (Persistent or Volatile)
- Examples: AWS DynamoDB, Riak
http://en.wikipedia.org/wiki/Project_Voldemort
http://aws.amazon.com/
http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/Introduction.html
http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html
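To show what "schema-less" means for a key/value store, here is a minimal in-memory (volatile) sketch in Python. The `KeyValueStore` class and its method names are hypothetical and are not the DynamoDB or Riak API. The store only knows keys, and each value is an opaque item that can have any shape.

```python
class KeyValueStore:
    """Toy volatile key/value store: no schema, lookups by key only."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # Any value shape is accepted; the store never inspects it.
        self._items[key] = value

    def get(self, key, default=None):
        return self._items.get(key, default)

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._items.pop(key, None)
```

Two items in the same store can look completely different, for example `put("cart:42", ["book", "pen"])` next to `put("user:7", {"name": "Ann"})`. The trade-off is that there are no queries other than lookup by key.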

Column Database
- Wide, sparse column sets
- Examples: Cassandra, HBase, BigTable, GAE HR DS, Azure Tables, SQL 2012 Tabular Model
http://stage.hypertable.com/index.php/documentation/architecture/
http://code.google.com/appengine/
http://code.google.com/appengine/articles/datastore/overview.html

More about Column Databases
Type A: Column-families
- Non-relational
- Sparse
- Examples: HBase, Cassandra, xVelocity (SQL 2012 Tabular)
Type B: Column-stores
- Relational
- Dense
- Example: SQL Server 2012 Columnstore index
http://cwebbbi.wordpress.com/2012/02/14/so-what-is-the-bi-semantic-model/
http://www.databasejournal.com/features/mssql/understanding-new-column-store-index-of-sql-server-2012.html
http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-major-types-of_29.html
http://ayende.com/blog/4500/that-no-sql-thing-column-family-databases
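The "wide, sparse" idea behind column-family stores can be sketched as a map from row key to only the columns that row actually has. This is an illustrative Python model with a hypothetical `ColumnFamilyTable` class. It does not reflect the storage engine or API of HBase or Cassandra.

```python
from collections import defaultdict


class ColumnFamilyTable:
    """Toy sparse table: each row stores only the columns that were written."""

    def __init__(self):
        # row key -> {column name -> value}; absent columns take no space
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self._rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Use .get() so reading a missing row does not create it.
        return self._rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        """Return the column names present for one row, sorted."""
        return sorted(self._rows.get(row_key, {}))
```

Unlike a dense relational table, two rows here need not share any columns, and a column that was never written for a row is simply absent rather than stored as NULL.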

Demo - Document Database (MongoDB)
- document-oriented (collection of JSON documents) w/semi-structured data
- Encodings include BSON, JSON, XML…
- binary forms (PDF, Microsoft Office documents -- Word, Excel…)
http://en.wikipedia.org/wiki/MongoDB
http://www.mongodb.org/downloads
http://www.mongodb.org/display/DOCS/Drivers
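A collection of JSON documents with semi-structured data might look like the sketch below. It uses only the Python standard library rather than a MongoDB driver, so the `find` helper and the sample documents are assumptions for illustration. Only the `_id` field name follows MongoDB's convention.

```python
import json

# Two documents in the same collection with different fields (semi-structured).
documents = [
    {"_id": 1, "name": "Ann", "tags": ["sql", "bi"]},
    {"_id": 2, "name": "Bob", "address": {"city": "Stockholm"}},
]


def find(collection, **criteria):
    """Return documents whose top-level fields equal every given criterion."""
    return [
        doc
        for doc in collection
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


def to_json(doc):
    """Serialize one document to a JSON string with stable key order."""
    return json.dumps(doc, sort_keys=True)
```

`find(documents, name="Bob")` returns the second document, nested address included, without any table definition having declared that an `address` field exists.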

Demo - Graph Database (Neo4j)
- a lot of many-to-many relationships
- recursive self-joins
- when your primary objective is quickly finding connections, patterns and relationships between the objects within lots of data
http://www.infinitegraph.com/what-is-a-graph-database.html
http://en.wikipedia.org/wiki/Graph_database
http://www.freebase.com/
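"Quickly finding connections" is a traversal problem, which in a relational database turns into the recursive self-joins mentioned above. The sketch below shows the traversal itself as a breadth-first search over an adjacency list in plain Python. It is not Neo4j code, and the `shortest_path` function and the sample `friends` graph are made up for illustration.

```python
from collections import deque

# Hypothetical social graph: person -> people they are connected to.
friends = {
    "ann": ["bob", "cid"],
    "bob": ["ann", "dee"],
    "cid": ["ann"],
    "dee": ["bob", "eve"],
    "eve": ["dee"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: return the shortest chain of connections, or None."""
    if start == goal:
        return [start]
    visited = {start}
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for neighbor in graph.get(path[-1], ()):
            if neighbor in visited:
                continue
            if neighbor == goal:
                return path + [neighbor]
            visited.add(neighbor)
            queue.append(path + [neighbor])
    return None
```

Each hop in the returned path corresponds to one more self-join in SQL. A graph database keeps the relationships as first-class data so that following them does not require a join per hop.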

So which type of NoSQL? Back to CAP…
Consistency, Availability, Partitioning
- CP = NoSQL/column: Hadoop, Big Table, HBase, MemCacheDB
- CA = SQL/RDBMS: SQL Server / Oracle, MySQL
- AP = NoSQL/document or key/value: DynamoDB, CouchDB, Cassandra, Voldemort
http://amitpiplani.blogspot.com/2010/05/u-pick-2-selection-for-nosql-providers.html

Which type of NoSQL for which type of data?

Type of data                Type of NoSQL solution    Example
Log files                   Wide Column               HBase
Product Catalogs            Key Value on disk         DynamoDB
User profiles               Key Value in memory       Redis
Startups                    Document                  MongoDB
Social media connections    Graph                     Neo4j
LOB w/Transactions          NONE! Use RDBMS           SQL Server

Cloud-hosted NoSQL up to 50x CHEAPER http://lynnlangit.wordpress.com/2011/11/09/relational-cloud-storage-is-50x-more-expensive-than-nosql/

The reality…two pivots Storage Methods SQL (RDBMS) NoSQL Storage Locations On premises Cloud-hosted

NoSQL (Cloud) BLOB Storage Buckets
- Amazon – S3 or Glacier: the gold standard
- Google – Cloud Storage: free for developers
- Microsoft Azure BLOBS
- DropBox, Box…
http://code.google.com
- Access via REST APIs
- Very cheap, but not much functionality included
- Lots of code to write for application development
- But…can be a good backup solution

Cloud-hosted RDBMS
- AWS RDS – SQL Server, mySQL, Oracle: medium cost; solid feature set, i.e. backup, snapshot; use existing tooling
- Google – mySQL: lowest cost; most limited RDBMS functionality
- Microsoft – SQLAzure: highest cost
http://code.google.com

Demo - AWS RDS SQL Server, MySQL or Oracle Essential to understand pricing models

Cloud Offerings – RDBMS AND NoSQL

                            AWS                                 Google                                  Microsoft
Cloud RDBMS                 RDS – all major                     mySQL                                   SQL Azure
NoSQL buckets               S3 or Glacier                       Cloud Storage                           Azure Blobs
NoSQL databases             DynamoDB                            H/R Data on GAE                         Azure Tables
Streaming ML or (Mahout)    Custom EC2                          Prospective Search & Prediction API     StreamInsight
Document or Graph           MongoDB on EC2                      Freebase                                MongoDB on Windows Azure
Hadoop                      Elastic MapReduce using S3 & EC2    none                                    HDInsight
Dremel/Warehousing          RedShift                            BigQuery

Hadoop on AWS - http://wiki.apache.org/hadoop/AmazonEC2

Data Scientists…
About Data Science -- http://www.romymisra.com/the-new-job-market-rulers-data-scientists/
R language - http://www.r-project.org/
Infer.NET - http://research.microsoft.com/en-us/um/cambridge/projects/infernet/
There is a plethora of languages to access, manipulate and process BigData. These languages fall into a few categories:
- RESTful – simple, standards
- ETL – Pig (Hadoop) is an example
- Query – Hive (again Hadoop), lots of *QL
- Analyze – R, Mahout, Infer.NET, DMX, etc. Applying statistical (data-mining) algorithms to the data output

Comparing… http://rickosborne.org/download/SQL-to-MongoDB.pdf
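As one example of the kind of SQL-to-MongoDB mapping this slide points to, the sketch below runs the same question two ways: as SQL against an in-memory SQLite table, and as a MongoDB-style filter document using the `$gt` operator. The `matches` function is a tiny hand-written stand-in that understands only equality and `$gt`. It is an assumption for illustration and not the MongoDB query engine, and the sample `users` data is invented.

```python
import sqlite3

users = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": 27},
    {"name": "Cid", "age": 41},
]


def sql_names_older_than(min_age):
    """SQL form: SELECT name FROM users WHERE age > ?"""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO users VALUES (:name, :age)", users)
    rows = conn.execute(
        "SELECT name FROM users WHERE age > ? ORDER BY name", (min_age,)
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


def matches(doc, query):
    """Stand-in matcher: supports field equality and the $gt operator only."""
    for field, condition in query.items():
        if isinstance(condition, dict):
            if "$gt" in condition and not (
                field in doc and doc[field] > condition["$gt"]
            ):
                return False
        elif doc.get(field) != condition:
            return False
    return True


def doc_names_matching(query):
    """Document form: the filter is itself a document, e.g. {"age": {"$gt": 30}}."""
    return sorted(doc["name"] for doc in users if matches(doc, query))
```

In the MongoDB shell the document form would be written as `db.users.find({age: {$gt: 30}})`. The point of the comparison is that the WHERE clause becomes a nested document rather than a string of SQL.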

Karmasphere Studio for AWS http://www.youtube.com/watch?v=gjsMDAcI1Mo - analyst http://www.youtube.com/watch?v=_MT04szKlyo http://aws.amazon.com/articles/9574327584309154

Hadoop Connector to Excel http://www.microsoft.com/en-us/bi/default.aspx http://dennyglee.com/ Demos -   http://www.youtube.com/watch?v=djfpPsGwm6A and http://www.youtube.com/watch?v=uh9bKWO1K7U

Google BigQuery Hadoop-like (Dremel) based service For massive amounts of data SQL-like query language http://research.google.com/pubs/pub36632.html

Dremel Realized => Impala Interactive Hadoop? http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

Other types of cloud data services
- Hosting public datasets: pay to read; earn revenue by offering for read
- Cleaning / matching (your) data
- ETL – Microsoft Data Explorer, Google Refine
- Data Quality – Windows Azure Data Market, InfoChimps, DataMarket.com
- DataMarkets – InfoChimps, Factual, DataMarket, Windows Azure Data Marketplace, Wolfram Alpha, Datasift
http://www.microsoft.com/en-us/sqlazurelabs/default.aspx and http://www.microsoft.com/en-us/sqlazurelabs/labs/dataexplorer.aspx
https://datamarket.azure.com/
http://www.freebase.com/
http://code.google.com/p/google-refine/

NoSQL To-Do List
- Understand CAP & types of NoSQL databases
- Use NoSQL when business needs designate
- Use the right type of NoSQL for your business problem
- Try out NoSQL on the cloud: quick and cheap for behavioral data; mashup cloud datasets; good for specialized use cases, i.e. dev, test, training environments
- Learn NoSQL access technologies: new query languages, i.e. MapReduce, R, Infer.NET; new query tools (vendor-specific) – Google Refine, Amazon Karmasphere, Microsoft Excel connectors, etc.

The Changing Data Landscape: NoSQL, RDBMS, Other Services ©2011 DevelopMentor

www.TeachingKidsProgramming.org
Free Courseware (recipes)
Do a Recipe, Teach a Kid (Ages 10++)
Java or Microsoft SmallBasic

Toward Data Craftsmanship…
- Follow me @LynnLangit
- RSS my blog www.LynnLangit.com
- Hire me: to help build your BI/Big Data solution; to teach your team next gen BI; to learn more about using NoSQL solutions