Big Data Tools Overview Avi Freedman ServerCentral Technology Executives Club November 13, 2013.

What is Big Data?
Canonical definition:
• Volume: billions or trillions of rows
• Variety: different schemas
• Velocity: hundreds of thousands of records/sec
Traditional systems have difficulty handling data in these dimensions, even with scale-up, partitioning and sharding. Clustered/scale-out solutions are required to solve old problems in new ways.
• But that causes problems meeting traditional database integrity requirements.
Our focus will be on open source and < $1 million (min) technology stacks.
• Not focusing on traditional BI and Teradata/traditional “Make SQL work” solutions

Big Data Search Trends
[Figure: search-interest chart for the terms “Big Data”, “Data Mining” and “Semantics”. Source: Google Trends]

Tech Background: Scale-up vs Scale-out
To scale up, you buy a bigger machine. But there are limits to how far you can go…
Scaling out with traditional software designed for single-machine architectures is typically done by:
• making read replicas (doesn’t help with volume or write-heavy workloads);
• clustering with master/master architectures, which still doesn’t help with volume and can increase latency;
• or sharding or partitioning…

Tech Background: Sharding/Partitioning
When you shard a database, you split it into subsets of the data across machines, typically based on the key (so names starting with “A-C” go one place, “D-F” another, etc). This can be difficult to do manually.
Partitioning is usually implemented by slicing the database into separate tables (often all on the same machine) by time.
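A minimal sketch of the two routing schemes described on this slide, assuming a key-range layout like the “A-C”, “D-F” example and monthly time partitions. The shard names, boundary letters and the `events_` table prefix are hypothetical and only for illustration.

```python
import bisect
import datetime

# Hypothetical layout: each boundary is the first letter of the NEXT shard,
# so shard-0 holds A-C, shard-1 holds D-F, shard-2 holds G-I, shard-3 the rest.
SHARD_BOUNDARIES = ["D", "G", "J"]
SHARD_NAMES = ["shard-0", "shard-1", "shard-2", "shard-3"]


def shard_for(name: str) -> str:
    """Route a key to a shard based on its first letter (range sharding)."""
    first = name[:1].upper()
    return SHARD_NAMES[bisect.bisect_right(SHARD_BOUNDARIES, first)]


def partition_for(day: datetime.date) -> str:
    """Pick a per-month table name for a record (time partitioning)."""
    return f"events_{day:%Y_%m}"
```

With this layout, `shard_for("Alice")` and `shard_for("Carol")` land on the same shard while `shard_for("Dave")` lands on the next one. The manual difficulty the slide mentions shows up as soon as a boundary has to move, because every key in the affected range must be migrated.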

Tech Background: ACID and CAP
ACID
• Atomicity (transactions are all or nothing)
• Consistency (every transaction leaves the data in a valid state)
• Isolation (concurrent transactions don’t affect each other)
• Durability (transactions, once committed, are permanent)
The CAP Theorem says a distributed system can’t guarantee all 3 of:
• Consistency (all nodes see the same data)
• Availability (every request gets a response)
• Partition tolerance (the system keeps working when the network splits nodes apart)
Big Data solutions typically “relax” ACID and are subject to the CAP Theorem.
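To make atomicity and consistency concrete, here is a small sketch using Python's built-in `sqlite3` module on a single machine. The `accounts` table, the account names and the `transfer` helper are assumptions made up for this example; the point is only that a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The CHECK constraint is the "consistency" rule: no negative balances.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(src: str, dst: str, amount: int) -> bool:
    """Move funds atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances() -> dict:
    """Return the current balances as a plain dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

Calling `transfer("alice", "bob", 150)` credits bob first, then fails the balance check on alice, and the rollback undoes the credit. A store without transactions would have to detect and repair that half-applied state in application code.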

Big Data Technologies
Map/Reduce (Hadoop)
• Hadoop, HPCC
• (emerging) Streaming Databases
NoSQL
• Key/Value Store
• Document/Schema-Free Databases
• Columnar (Dremel, Impala, Drill)
• Graph databases (for social media)
NewSQL
Revival of Classic SQL DBs

NoSQL Introduction
• No stored procedures
• Partial to full SQL
• Clustered
• High volume
• Not ACID (Problem for… Funds transfer, power failures, selling the last item twice)
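The “selling the last item twice” problem can be sketched with a versioned check-and-set, the kind of basic primitive some non-ACID stores expose in place of transactions. `VersionedStore` is a hypothetical single-process toy written for this example, not the API of any particular product, and it is not thread-safe.

```python
class VersionedStore:
    """Toy key/value store where every write must name the version it read."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); unknown keys are version 0 with no value."""
        return self._data.get(key, (0, None))

    def check_and_set(self, key, expected_version, value) -> bool:
        """Write only if nobody else has written since expected_version."""
        version, _ = self._data.get(key, (0, None))
        if version != expected_version:
            return False
        self._data[key] = (version + 1, value)
        return True


store = VersionedStore()
store.check_and_set("stock:widget", 0, 1)  # one widget left in stock

# Two checkout processes read the same state before either one writes.
version_a, stock_a = store.read("stock:widget")
version_b, stock_b = store.read("stock:widget")

sale_a = store.check_and_set("stock:widget", version_a, stock_a - 1)
sale_b = store.check_and_set("stock:widget", version_b, stock_b - 1)
```

Here `sale_a` succeeds and `sale_b` is rejected because the version moved on, so the second buyer has to re-read and sees zero stock. With a plain unconditional write, both sales would have gone through.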

NoSQL: Map/Reduce
• Currently mainly used for batch processing, but streaming is being grafted on.
• Older versions had single points of failure, but newer versions have implemented system-wide redundancy.
• Not ACID, though there is some basic “check and set” functionality in underlying databases.
• There are higher-level query interfaces (Hive, which is SQL-like, and Pig).
• Latency is typically VERY high (minutes) to get query results.
• And to be efficient, the map/reduce processes are usually written in Java, which is an obstacle for use in many environments.
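For readers who have not seen the programming model, the following is a single-process sketch of the map, shuffle and reduce phases using word counting as the job. It illustrates the idea only; it does not use Hadoop, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, which is a large part of why latency is measured in minutes rather than milliseconds.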

NoSQL: Key/Value Store
One of the first examples of “NoSQL” software was the set of systems developed to deal with Key/Value lookups. In this kind of system, you get to set, delete, or read a key (like “cloud services”) and get one value (like “are fun”). Values can be lists or even more complex data structures.
Sample applications:
• Web cookies
• State for massively online games
• Real-time ad placement
• Fraud and intrusion detection monitoring
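The whole interface described above fits in a few lines. This is an in-memory sketch with a hypothetical `KeyValueStore` class; real products add clustering, expiry and (in some cases) persistence on top of the same three operations.

```python
class KeyValueStore:
    """Minimal in-memory key/value store: set, read and delete by key."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        """Store a value (a string, list or nested structure) under a key."""
        self._data[key] = value

    def get(self, key, default=None):
        """Read the value for a key, or a default if the key is absent."""
        return self._data.get(key, default)

    def delete(self, key) -> bool:
        """Remove a key; report whether it was present."""
        if key in self._data:
            del self._data[key]
            return True
        return False


kv = KeyValueStore()
kv.set("cloud services", "are fun")
kv.set("session:42", {"user": "avi", "cart": ["widget"]})
```

Note that there is no way to ask “which keys have a cart containing a widget”; lookups go through the key only, which is what keeps these systems simple and fast.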

NoSQL: Key/Value Store
Leading packages that implement Key/Value stores are:
• memcached (which clusters but isn’t persistent)
• redis (clusters and is persistent to disk)
• riak (clusters, persistent to disk, goes up the chain a bit, but not as performant if disk I/O kicks in)

NoSQL: Document DBs
• Related to Key/Value stores.
• Typically a superset where you get a key, but the value can be a large structured set of data (a “document”).
• Usually have more sophisticated ability to do pattern-matching lookups.
• MongoDB is the thought leader, if not the market leader.
• Riak and Couchbase are second in the space.
• All are still evolving, not perfect, and require some tuning.
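A rough sketch of what “pattern-matching lookups” add over a plain key/value store: the store can look inside the value. The sample documents and the `find` helper are invented for illustration and do not reflect the query syntax of MongoDB or any other product.

```python
import re

# Documents are schema-free: each one may carry different fields.
documents = {
    "user:1": {"name": "Ada", "email": "ada@example.com", "tags": ["admin"]},
    "user:2": {"name": "Bob", "email": "bob@example.org"},
    "user:3": {"name": "Cy", "phone": "555-0100"},
}


def find(docs, field, pattern):
    """Return the keys of documents whose string field matches a regex."""
    regex = re.compile(pattern)
    return [
        key
        for key, doc in docs.items()
        if isinstance(doc.get(field), str) and regex.search(doc[field])
    ]
```

For example, `find(documents, "email", r"@example\.com$")` selects only the first user, and the third user is skipped without error even though it has no email field at all. This version scans every document; a real document database would normally use an index for such lookups.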

NoSQL: Columnar DBs
• Older-generation columnar databases like HBase (part of the Hadoop ecosystem) were clustered but not fast enough to ‘move the needle’.
• Newer implementations inspired by Google’s Dremel, like Apache Drill and Cloudera Impala, are orders of magnitude faster (in the seconds for some queries), also cluster, and can deal even better with large ingest volumes and variable schemas.
• Apache Drill is just out in alpha, and Impala has yet to achieve the performance of Google’s hosted Dremel service.
• But these systems may be the closest to threatening the typical Aster and even core Teradata use cases.

NoSQL: Clustered SQL
• Cassandra does offer SQL-like access (through its CQL query language) and clusters, but is not ACID.
• Used by many web-scale companies.
• Also has a relatively steep learning curve, though there are commercial providers to assist.

NoSQL: Graph DBs
• Systems like neo4j have evolved to deal with problems that arise in heavily-connected data where one is looking for instances or patterns in the relationships between items.
• One key space where they are used is in social networks and for evil government projects.
• Very specialized, and we don’t see many instances deployed in the enterprise.
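As an example of a relationship pattern, here is a sketch of a “friends of friends” query over a tiny in-memory social graph. The adjacency-set representation and the names are assumptions for this example; a graph database such as neo4j would express the same traversal in its own query language and run it over far larger data.

```python
# Undirected friendship graph as adjacency sets (hypothetical sample data).
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}


def friends_of_friends(g, person):
    """People exactly two hops away: friends of friends who aren't friends yet."""
    direct = g.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= g.get(friend, set())
    return two_hops - direct - {person}
```

In a relational database the same question needs a self-join per hop, which is why heavily connected data gets awkward there as the number of hops grows.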

NewSQL
NewSQL was coined recently to describe databases that attempt to cluster (scale-out) and maintain ACID properties. Two leaders are:
• FoundationDB (currently has a 96-core and 100TB limit; does not require all data to be in RAM)
• VoltDB (doing complex work requires Java skills, and it is costly because all data must fit in RAM across the nodes)
People are watching this space with interest, but many are dubious about how fast they will develop into truly scalable offerings. Both offer easy access for download and testing in customer environments.

Revival of Classic SQL DBs
• Microsoft, Oracle/mysql, postgres, and MariaDB projects/companies are all thinking about and implementing more scale-out functionality.
• Most of the initial approaches seem to be automating the process of sharding and partitioning databases.
• We see people trying this most in the mysql community, but the vast majority are still sharding and replicating to deal with scale.

Gotchas to Watch For
ACID Compliance?
• Do you get transactional ‘correctness’?
• Or is the system ‘eventually consistent’?
• At high constant volumes, eventual consistency may never catch up.
Ease of use
• Non-SQL systems (like Map/Reduce) can be difficult to learn and train for.
• And many Big Data systems can be difficult to learn/install for DB admins.
• Commercial solutions can address this but also can cost 10x.
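The “may never catch up” point is simple arithmetic, sketched below under the assumption that writes arrive at a constant rate and replicas apply them at a constant rate. The function name and the rates are made up for illustration.

```python
def replication_backlog(write_rate, apply_rate, seconds, start=0):
    """Number of writes still waiting to reach replicas after some seconds."""
    backlog = start
    for _ in range(seconds):
        # Each second new writes arrive and replicas drain what they can.
        backlog = max(0, backlog + write_rate - apply_rate)
    return backlog


# Replicas slower than the write stream: the backlog grows without bound.
falling_behind = replication_backlog(write_rate=100_000, apply_rate=90_000, seconds=60)

# Replicas faster than the write stream: an existing backlog drains to zero.
catching_up = replication_backlog(
    write_rate=100_000, apply_rate=120_000, seconds=60, start=500_000
)
```

In the first case the replicas are 600,000 writes behind after a minute and the gap keeps widening, so “eventually” never arrives unless the write rate drops. In the second case the 500,000-write backlog is gone within the minute.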

Gotchas to Watch For
What are your application requirements of the data backends?
Application support
• If you don’t code your applications, will they support using a big data solution on the backend?
• It’s sometimes possible to write ‘adapters’ underneath commercial applications, but this is dangerous as the applications may change their schemas or methods without notifying users.

Questions? Avi Freedman ServerCentral Technology Executives Club November 13, 2013