AN OVERVIEW OF DATABASES FOR THE BIG DATA ECOSYSTEM Keith W. Hare JCC Consulting, Inc. September 20, /20/2016Copyright 2016, JCC Consulting, Inc.

Slides:



Advertisements
Similar presentations
C6 Databases.
Advertisements

Big Data Management and Analytics Introduction Spring 2015 Dr. Latifur Khan 1.
Advanced Topics COMP163: Database Management Systems University of the Pacific December 9, 2008.
Chapter 14 The Second Component: The Database.
Definition of terms Definition of terms Explain business conditions driving distributed databases Explain business conditions driving distributed databases.
A Study in NoSQL & Distributed Database Systems John Hawkins.
D4M – Signal Processing On Databases
This presentation was scheduled to be delivered by Brian Mitchell, Lead Architect, Microsoft Big Data COE Follow him Contact him.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
Databases with Scalable capabilities Presented by Mike Trischetta.
SC32 WG2 Metadata Standards Tutorial Metadata Registries and Big Data WG2 N1945 June 9, 2014 Beijing, China.
Big Data. What is Big Data? Big Data Analytics: 11 Case Histories and Success Stories
NoSQL Databases Oracle - Berkeley DB Rasanjalee DM Smriti J CSC 8711 Instructor: Dr. Raj Sunderraman.
NoSQL Databases Oracle - Berkeley DB. Content A brief intro to NoSQL About Berkeley Db About our application.
Database Design and Management CPTG /23/2015Chapter 12 of 38 Functions of a Database Store data Store data School: student records, class schedules,
C6 Databases. 2 Traditional file environment Data Redundancy and Inconsistency: –Data redundancy: The presence of duplicate data in multiple data files.
Big Data Analytics Large-Scale Data Management Big Data Analytics Data Science and Analytics How to manage very large amounts of data and extract value.
MANAGING DATA RESOURCES ~ pertemuan 7 ~ Oleh: Ir. Abdul Hayat, MTI.
Big traffic data processing framework for intelligent monitoring and recording systems 學生 : 賴弘偉 教授 : 許毅然 作者 : Yingjie Xia a, JinlongChen a,b,n, XindaiLu.
BACS 287 Big Data & NoSQL 2016 by Jones & Bartlett Learning LLC.
What we know or see What’s actually there Wikipedia : In information technology, big data is a collection of data sets so large and complex that it.
IoT Meets Big Data Standardization Considerations
Big Data Analytics Platforms. Our Team NameApplication Viborov MichaelApache Spark Bordeynik YanivApache Storm Abu Jabal FerasHPCC Oun JosephGoogle BigQuery.
Big Data Analytics with Excel Peter Myers Bitwise Solutions.
Information Eastman. Business Process Skills Order to Cash, Forecasting & Budgeting, etc. Process Modeling Project Management Technical Skills.
Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.
Copyright © 2016 Pearson Education, Inc. Modern Database Management 12 th Edition Jeff Hoffer, Ramesh Venkataraman, Heikki Topi CHAPTER 11: BIG DATA AND.
Group members: Phạm Hoàng Long Nguyễn Huy Hùng Lê Minh Hiếu Phan Thị Thanh Thảo Nguyễn Đức Trí 1 BIG DATA & NoSQL Topic 1:
MarkLogic The Only Enterprise NoSQL Database Presented by: Aashi Rastogi ( ) Sanket Patel ( )
BIG DATA/ Hadoop Interview Questions.
Abstract MarkLogic Database – Only Enterprise NoSQL DB Aashi Rastogi, Sanket V. Patel Department of Computer Science University of Bridgeport, Bridgeport,
Managing Data Resources File Organization and databases for business information systems.
© 2017 by McGraw-Hill Education. This proprietary material solely for authorized instructor use. Not authorized for sale or distribution in any manner.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
Big Data-An Analysis. Big Data: A definition Big data is a collection of data sets so large and complex that it becomes difficult.
Data Analytics (CS40003) Introduction to Data Lecture #1
Dr.S.Sridhar,Ph.D., RACI(Paris),RZFM(Germany),RMR(USA),RIEEEProc.
Database Systems: Design, Implementation, and Management Tenth Edition
NO SQL for SQL DBA Dilip Nayak & Dan Hess.
Big Data is a Big Deal!.
Big Data Enterprise Patterns
MapReduce Compiler RHadoop
An Open Source Project Commonly Used for Processing Big Data Sets
Docker Birthday #3.
MongoDB Er. Shiva K. Shrestha ME Computer, NCIT
Chapter 14 Big Data Analytics and NoSQL
Open Source distributed document DB for an enterprise
Chris Menegay Sr. Consultant TECHSYS Business Solutions
R-GMA as an example of a generic framework for information exchange
Spark Presentation.
NOSQL.
Chapter 13 The Data Warehouse
Introduction to Networks
NOSQL databases and Big Data Storage Systems
Data Warehouse.
Software Architecture in Practice
Database Performance Tuning and Query Optimization
1 Demand of your DB is changing Presented By: Ashwani Kumar
MANAGING DATA RESOURCES
Ch 4. The Evolution of Analytic Scalability
Overview of big data tools
Database Systems Summary and Overview
IST346: Scalability.
Chapter 11 Database Performance Tuning and Query Optimization
AGENDA Buzz word. AGENDA Buzz word What is BIG DATA ? Big Data refers to massive, often unstructured data that is beyond the processing capabilities.
Big DATA.
Data Analysis and R : Technology & Opportunity
NoSQL databases An introduction and comparison between Mongodb and Mysql document store.
Database management systems
Presentation transcript:

AN OVERVIEW OF DATABASES FOR THE BIG DATA ECOSYSTEM Keith W. Hare JCC Consulting, Inc. September 20, /20/2016Copyright 2016, JCC Consulting, Inc.

Abstract The ultimate goal of big data techniques is to be able to identify useful, usable information in a timely fashion – actionable analytics Prerequisites to producing actionable analytics are Ability to analyze lots of disparate data Ability to discover, access, store and retrieve lots of data This presentation provides an overview of data storage and retrieval in a big data ecosystem Focus on the characteristics, not the implementations Useful for understanding how the pieces should fit together Addresses the prerequisites not the end goal 09/20/2016Copyright 2016, JCC Consulting, Inc. 2

Who am I? Senior Consultant with JCC Consulting, Inc. since 1985 High performance database systems Replicating data between database systems SQL Standards committees since 1988 Convenor, ISO/IEC JTC1 SC32 WG3, since 2005 Vice Chair, ANSI INCITS DM32.2, since 2003 Vice Chair, INCITS Big Data Technical Committee since 2015 Education Muskingum College, 1980, BS in Biology and Computer Science Ohio State, 1985, Masters in Computer & Information Science 3 09/20/2016Copyright 2016, JCC Consulting, Inc.

Topics Why is “Big Data” Different? Big Data Buzzwords High Level View Data Distribution Integrating Data from Multiple Sources Data Query Languages Big Data Eco-system Products Summary “Let’s do a deep dive in the Big Data and drill down until we hyperlocalize some disruptive technologies.” (See 09/20/2016Copyright 2016, JCC Consulting, Inc. 4

Why is “Big Data” Different? Often defined in terms of Vs: Volume – exceed capacity of a single “computer” Velocity – speed at which data is generated Variety – new types of data Variability – speed at which data changes Veracity – quality & provenance Visualization – meaningful presentation Value – actionable analytics Focus on primary data rather than extract, load, and transform (ETL) In many ways, “Big Data” is what we have always been doing, only bigger and more complex. 09/20/2016Copyright 2016, JCC Consulting, Inc. 5

Big Data: Driving Forces Inexpensive storage of large volumes of data Inexpensive compute power Next Generation Analytics Moving from off-line to in-line embedded analytics Explaining what happened Predicting what will happen Operating on Data at rest – stored someplace Data in motion – streaming Multiple disparate data sources Look at available data and wonder what answers are hidden there Copyright 2016, JCC Consulting, Inc. 6 09/20/2016

Big Data: Working Definition Requirements cannot be met on a single computer Variety, Volume, Velocity, Variability, Availability Imprecise terms, but useful for understanding problem space All relative – what was impossible yesterday is Big Data today and will be trivial tomorrow Distribute data storage to support volume & velocity Replicate data storage to provide availability Distribute processing Apply compute power in parallel Avoid moving data across the network – move the answers Copyright 2016, JCC Consulting, Inc. 7 09/20/2016

Data Volume – How Big is Big? Gigabyte – 1000**3 Terabytes –1000**4 Petabytes – 1000**5 Exabyte – 1000**6 Zettabyte – 1000**7 Yottabyte – 1000**8 Brontobyte* – 1000**9 Gegobyte* – 1000**10 09/20/2016Copyright 2016, JCC Consulting, Inc. 8 *This terminology is still subject to change.

Big Data Buzzwords NoSQL Databases Sharding Map-Reduce Schema-less New SQL 09/20/2016Copyright 2016, JCC Consulting, Inc. 9

Big Data Buzzwords – NoSQL Originally did not include SQL Rejected complexity of SQL language Rejected overhead and limitations of SQL Databases Now Not Only SQL Turns out that SQL is a powerful language for specifying queries Potentially useful data storage and retrieval techniques 09/20/2016Copyright 2016, JCC Consulting, Inc. 10

Sharding Partitioning data across multiple servers Scaling out Once the data is sharded, send queries to data with Map Reduce 09/20/2016Copyright 2016, JCC Consulting, Inc. 11

Big Data Buzzwords – Map Reduce Patented algorithm for: partitioning queries to run on multiple nodes in parallel Integrating the results Map Reduce details originally created by developer Operations can (and should) be generated by database software 09/20/2016Copyright 2016, JCC Consulting, Inc. 12

Big Data Buzzwords – Schema-less Reduce development time by eliminating up-front schema design Schema information still exists Embedded in the data Embedded in the code to support an API Pinned to a developer’s wall Reinventing databases from the 1960s 09/20/2016Copyright 2016, JCC Consulting, Inc. 13

Big Data Buzzwords – New SQL Combine powerful SQL query language with performance benefits of NoSQL databases Support ACID transactions 09/20/2016Copyright 2016, JCC Consulting, Inc. 14

High level view “Big Data” Data Types Data Storage Models When is data accessed? Data Distribution Integrating Data From Multiple Sources Variety of Data Sets/Sources Variety of Data Source Ownership Data query languages 09/20/2016Copyright 2016, JCC Consulting, Inc. 15

“Big Data” Data Types Traditional Data Types Character Numerical Date/Time/Timestamp Large Objects – LOB/BLOB/CLOB “Big Data” Data Types Multi-dimensional arrays Images/video Documents Loosely formatted data Objects Spatial Copyright 2016, JCC Consulting, Inc /20/2016

Data Storage Models Row Store – Tabular Column Store Key Value Document XML JSON – Java Script Object Notation BSON – Binary JSON Graph Multi dimensional array Object 09/20/2016Copyright 2016, JCC Consulting, Inc. 17

When is data accessed? After being stored Before (or instead of) being stored – Streaming data 09/20/2016Copyright 2016, JCC Consulting, Inc. 18

Data Distribution Single node – vertical scaling Clustered Replicated Horizontally distributed & replicated – horizontal scaling 09/20/2016Copyright 2016, JCC Consulting, Inc. 19

Vertical Scaling Buy a bigger server More CPUs Faster CPUs More Memory More storage Argument for vertical scaling Cores per CPU chip are increasing – 22 cores/CPU Configurable memory is increasing – 2 terabytes/server Storage capacity is increasing – 15.3 Terabyte SSDs Faster networks – 20 Gigabit network adapters Not all problems can be solved with vertical scaling 09/20/2016Copyright 2016, JCC Consulting, Inc. 20

Potential Problems with Vertical Scaling What if data storage breaks? What if server breaks? What if data center breaks? What if a single server cannot handle CPU load? What if a network cannot handle the traffic? What if data doesn’t fit? Horizontal scaling and replication solve these issues but introduce additional complexities. 09/20/2016Copyright 2016, JCC Consulting, Inc. 21

Horizontally Scaled Data Source Copyright 2016, JCC Consulting, Inc. 22 Computer 1Computer 2Computer N Horizontally Scaled Storage & Analytics Horizontal Scaling is one solution to the data volume challenge. 09/20/2016

Horizontal Distribution Levels Single Server Cluster of servers Multiple servers/clusters in a Datacenter Multiple datacenters on a Continent Multiple continents on a Planet Lets not think small Multiple planets in a Solar System Multiple solar systems in a Galaxy Still some challenges around network latency 09/20/2016Copyright 2016, JCC Consulting, Inc. 23

Horizontal Distribution and Replication Distribute processing Distribute query and analysis Map-Reduce algorithms Transmit results, not the entire data set Replicate data for fault tolerance and performance Lots of complexities that do not fit in timeframe for this talk 09/20/2016Copyright 2016, JCC Consulting, Inc. 24

Integrating Data From Multiple Sources Discovering that data exists Data location Access method(s) Understanding what data is available and what it means Schema can programmatically queried Ontologies to identify comparable data Security requirements Privacy requirements Business details Identifying possible operations/analysis Integrating the resulting analysis Challenging problems in this area 09/20/2016Copyright 2016, JCC Consulting, Inc. 25

Data Source Registry N Data Source Registry 2 Integrating Multiple Data Sources Copyright 2016, JCC Consulting, Inc. 26 Computer 1 Computer 2Computer N Horizontally Scaled Storage & Analytics Computer 1 Computer 2Computer N Horizontally Scaled Storage & Analytics Computer 1 Computer 2Computer N Horizontally Scaled Storage & Analytics Analytics Engine Data Source 1 Data Source 2 Data Source N Data Source Registry 1 Disclaimer: This diagram assists in identifying requirements. It is not intended to be a full processing model. 09/20/2016

Variety of Data Representations Tabular data – relations Designed, cleansed, curated Spatial data Images & Video Well defined structures Need additional domain information aerial photos, faces, stars, etc. XML – may have well defined DTD Store everything now, figure it out later JSON/BSON E.g. network packet logs Multiple data models to handle data diversity Copyright 2016, JCC Consulting, Inc /20/2016

Variety of Data Source Ownership Self Owned Publically Available Data for hire Derived Data Copyright 2016, JCC Consulting, Inc /20/2016

Data Source Registry Requirements Language/Interface for registering data source Support for discovering and identifying available data sources Content of the data source Semantics and Syntax of data Available analytic routines Security/Privacy restrictions Provenance of the data Information about connecting to data source Business agreement information Costs Use Restrictions Service Level Agreements Potentially use block chaining (distributed ledger) for agreement Standards support integration of multiple data sources Copyright 2016, JCC Consulting, Inc /20/2016

Data Query Languages JDBC – SQL queries from Java SPARQL – Graph query language XQuery – XML Product and application specific APIs Specify how to access the data SQL – specify what data is needed, not how to access it Traditional Tables with rows & columns Expanded to support: XML JSON Polymorphic Table Functions Multi-dimensional Arrays 09/20/2016Copyright 2016, JCC Consulting, Inc. 30

Data Analysis and Visualization R statistics package Others? 09/20/2016Copyright 2016, JCC Consulting, Inc. 31

Big Data Eco-system Products Open Source Products Minimal upfront license costs Minimal documentation Minimal support Multiple products in the ecosystem Lots of time and effort to implement and deploy Commercial off the shelf (COTS) Products Potentially expensive license costs Documentation Support Lots of time and effort to implement and deploy Commercial products integrating Open Source products 09/20/2016Copyright 2016, JCC Consulting, Inc. 32

Summary In many ways, “Big Data” is the same as we’ve always been doing. Focus on analysis rather than transaction processing New software and techniques for horizontal distribution New buzzwords New datatypes Distribute processing and integrate results One challenge is integrating data from multiple sources Locating the data Understanding what the data contains Requesting and integrating analysis Big Data is a tool – ultimate goal is actionable analytics 09/20/2016Copyright 2016, JCC Consulting, Inc. 33

Questions? Keith W. Hare JCC Consulting, Inc. 600 Newark Granville Road P.O. Box 381 Granville, OH USA 09/20/2016Copyright 2016, JCC Consulting, Inc. 34

References May 2014 “Understanding Big Data: The Seven V’s”, Eileen McNulty Eileen McNulty. National Research Council “Frontiers in Massive Data Analysis”, Washington, D.C., The National Academies Press. May 2014, “Big Data: Seizing Opportunities, Preserving Values”, Executive Office of the President. eport_may_1_2014.pdf eport_may_1_2014.pdf 2015, “ISO/IEC JTC1 Big Data Preliminary Report 2014” September 2015, NIST Big Data Public Working Group Reports (NIST.SP , 2, 3, 4, 5, 6, 7) Copyright 2016, JCC Consulting, Inc /20/2016