Big Data and NoSQL BUS 782.

What is Big Data? https://www.youtube.com/watch?v=c4BwefH5Ve8
- Employee-generated data
- User-generated data
- Machine-generated data
Big Data Analytics: 11 Case Histories and Success Stories https://www.youtube.com/watch?annotation_id=annotation_3535169775&feature=iv&src_vid=c4BwefH5Ve8&v=t4wtzIuoY0w

Big Data Sizes
- Gigabyte
- Terabyte: a terabyte USB drive
- Petabyte: Wal-Mart handles more than 1 million customer transactions every hour, feeding databases of more than 2.5 petabytes
- Exabyte: the traffic flowing over the internet totals about 700 exabytes annually
- Zettabyte

Big Data: Some Facts
- The world's information is doubling every two years
- The world generated 1.8 ZB of information in 2011
- Cisco predicts that by 2016 global IP traffic will reach 1.3 zettabytes
- There will be 19 billion networked devices by 2016
- 70% of this data is generated by individuals, as opposed to enterprises and organizations

Big Data Sources
- Web sites
- Social media
- Machine-generated data
- RFID
- Image, video, and audio
- Etc.

Big Data Challenges
Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery, and process optimization. The "3Vs":
- Volume: size >= 30-50 TB
- Velocity: processing speed
- Variety: structured data (able to fit in a database table) and unstructured data

Do Companies Care about Data?
Not really. What they care about are Key Performance Indicators (KPIs). Some examples of KPIs:
- Revenue
- Profit
- Revenue per customer/employee
- Customer attrition: the loss of clients or customers
Big Data is only useful if it helps drive KPIs.

Big Data to KPIs

Applications
- Text mining: deriving high-quality information from text; text categorization, text clustering, concept/entity extraction, sentiment analysis, etc.
- Web mining: Web usage mining, Web content mining
- Social media mining: Salesforce Radian6 Social Marketing Cloud http://www.youtube.com/watch?v=EH1dcFh_-I4

Advantages of Relational Databases
- Well-defined database schema
- Flexible query language
- Maintain database consistency in business transactions: concurrent database processing with multiple users, reading/updating, locking

Transaction ACID Properties
- Atomic: a transaction cannot be subdivided; all or nothing.
- Consistent: constraints hold before and after the transaction; a transaction transforms a database from one consistent state to another consistent state.
- Isolated: transactions execute independently of one another; database changes are not revealed to users until after the transaction has completed.
- Durable: database changes are permanent and must not be lost.
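Atomicity, in particular, can be demonstrated with Python's built-in sqlite3 module. This is a minimal sketch; the accounts table and the amounts are hypothetical, chosen only to illustrate the all-or-nothing guarantee:

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 100)])
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        # Simulate a crash before the matching credit to bob ever runs.
        raise RuntimeError("simulated failure mid-transfer")
except RuntimeError:
    pass  # the half-finished transfer was rolled back as a single unit

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 100} -- the debit did not survive
```

Because the debit and the failure happened inside one transaction, the database returns to its prior consistent state rather than leaving alice 50 short.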

Problems with Relational Databases in Managing Big Data
- High overhead in maintaining database consistency
- Do not support unstructured data search very well (i.e., Google-type searching)
- Do not handle data in unexpected formats well
- Do not scale well to very large databases: expensive "scale up" (adding processors and storage), slow query response time, data must move to the server, server failure
Organizations such as Facebook, Yahoo, Google, and Amazon were among the first to decide that relational databases were not good solutions for the volumes and types of data they were dealing with.

What Is Needed in a New Approach
- Deal with data sizes never imagined before.
- Hardware failure should be expected, not exceptional.
- Data has gravity: compute has to move to the data.

What is Hadoop?
- Open source project by the Apache Foundation
- Based on papers published by Google: the Google File System (Oct 2003) and MapReduce (Dec 2004)
- Consists of two core components: the Hadoop Distributed File System (storage) and MapReduce (compute)

How Hadoop Fits the New Approach
- Runs on clusters of low-cost commodity servers, so it can accommodate petabytes of data cost-effectively.
- Embraces partial failures.
- Data locality: computation runs on the local node where the data resides.
- Scales horizontally ("scale out").
- A Hadoop file is distributed (stored across many servers) and replicated (kept in many copies).

Hadoop HDFS: Hadoop Distributed File System
- Based on GFS
- Designed to store very large amounts of data (TBs and PBs) and much larger file sizes
- Write-once, read-many-times access pattern
- Designed to run on clusters of commodity hardware; uses replication for reliability
- Allows data to be read and processed locally
- Supports limited operations on files: write, delete, append, and read, but no updates

MapReduce: a programming model for distributed processing of data
Rather than take the conventional step of moving data over a network to be processed by software, MapReduce moves the processing software to the data. Each node both stores and computes, and does its best to process local data. MapReduce has two main phases: Map and Reduce.

Example: Word Count
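The word-count example above can be sketched in plain Python. This is a conceptual sketch only: real Hadoop jobs implement Mapper and Reducer classes (typically in Java), and the framework itself performs the shuffle between the two phases across the cluster.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework
    does between the Map and Reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Reduce: sum the per-word counts into a final total."""
    return word, sum(counts)

# Hypothetical input split into lines, standing in for HDFS blocks.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = chain.from_iterable(map_phase(line) for line in lines)
result = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
print(result)
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

In a real cluster, many mappers run in parallel on local data blocks, and each reducer receives all pairs for its subset of keys.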

Hadoop Ecosystem
- HBase: a column-oriented data store
- Hive: provides a SQL-like query capability
- Pig: a high-level language for creating MapReduce jobs
- HCatalog: takes Hive's metadata and makes it available across the Hadoop ecosystem
- Mahout: a library of algorithms for clustering, classification, and filtering
- Sqoop: accelerates bulk loads of data between Hadoop and RDBMSs
- Flume: streams large volumes of log data from multiple sources into Hadoop

NoSQL Database
NoSQL ("Not Only SQL") is a broad class of database management systems identified by non-adherence to the widely used relational database management system model. They are useful when working with huge quantities of data whose nature does not require a relational model.

Types of NoSQL Databases
- Column-oriented database. Example: Cassandra
- Document-oriented database. Examples: MongoDB, CouchDB; data stored in JSON (JavaScript Object Notation) format

JSON, JavaScript Object Notation (http://www.w3schools)
JSON example:
{"employees":[
  {"firstName":"John", "lastName":"Doe"},
  {"firstName":"Anna", "lastName":"Smith"},
  {"firstName":"Peter", "lastName":"Jones"}
]}
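The employees document above can be parsed with Python's standard json module, which maps JSON objects to dicts and JSON arrays to lists:

```python
import json

# The employees document from the slide, as a JSON string.
text = '''{"employees":[
  {"firstName":"John",  "lastName":"Doe"},
  {"firstName":"Anna",  "lastName":"Smith"},
  {"firstName":"Peter", "lastName":"Jones"}
]}'''

doc = json.loads(text)            # JSON object -> Python dict
first = doc["employees"][0]       # JSON array  -> Python list
print(first["firstName"], first["lastName"])  # John Doe
```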

Cassandra is essentially a key-value store. This means that all data is stored in one 'table', each row of which is uniquely identified by a key, with a JSON representation. https://blog.safaribooksonline.com/2012/12/11/modeling-data-in-cassandra/
{
  "user1": {
    "Bio": {
      "name": "Shaneeb Kamran",
      "age": 23
    }
  },
  "user2": {
    "name": "Salman ul Haq",
    "profession": "Developer",
    "Education": {
      "bachelors": "NUST"
    }
  }
}

Column Data Model (http://www.sinbadsoft)
A column is a key-value pair consisting of three elements:
1. Unique name: used to reference the column
2. Value: the content of the column
3. Timestamp: used to determine the valid content
- Column family: a container for columns sorted by their names. Column families are referenced and sorted by row keys.
- Super column: a sorted associative array of columns. Example: a multi-value attribute.
- Super column family: a container for super columns sorted by their names. Super column families are referenced and sorted by row keys.
- Keyspace: the top-level element; a container for column families.
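The hierarchy above (keyspace, column family, row key, column with name/value/timestamp) can be mirrored with nested Python dicts. This is an illustrative sketch only, not a real Cassandra driver API; the keyspace and row names are hypothetical:

```python
import time

now = int(time.time())

keyspace = {                      # keyspace: top-level container
    "Users": {                    # column family, holding rows by row key
        "user1": {                # row key -> map of columns by name
            # each column carries a value and a timestamp
            "name": {"value": "Shaneeb Kamran", "timestamp": now},
            "age":  {"value": 23,               "timestamp": now},
        },
    },
}

# Reading a column: in Cassandra, the timestamp is what decides which
# version of a column's value is current when replicas disagree.
column = keyspace["Users"]["user1"]["name"]
print(column["value"])  # Shaneeb Kamran
```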

[Diagrams: Column Family; Super Column Family]

Migrate a Relational Database Structure into a NoSQL Cassandra Structure
http://www.divconq.com/2010/migrate-a-relational-database-structure-into-a-nosql-cassandra-structure-part-i/
{
  "biologicalfeatures": {
    "forests": {
      "forest003": { "name": "Black Forest",   "trees": "two million",    "bushes": "three million" },
      "forest045": { "name": "100 Acre Woods", "trees": "four thousand",  "bushes": "five thousand" },
      "forest127": { "name": "Lonely Grove",   "trees": "none",           "bushes": "one hundred" }
    },
    "famoustrees": {
      "tree12345": { "forestID": "forest003", "name": "Der Tree",         "species": "Red Oak" },
      "tree12399": { "forestID": "forest045", "name": "Happy Hunny Tree", "species": "Willow" },
      "tree32345": { "forestID": "forest003", "name": "Das Ubertree",     "species": "Blue Spruce" }
    }
  }
}

Document Database: MongoDB (http://docs.mongodb)
MongoDB stores business subjects in documents. A document is the basic unit of data in MongoDB. Documents are analogous to JSON objects but exist in the database in a more type-rich format known as BSON (Binary JSON), a binary-encoded serialization of JSON-like documents. The structure of MongoDB documents determines how the application represents relationships between data: references and embedded documents.

Example using reference
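The reference pattern can be sketched with plain Python dicts standing in for MongoDB documents (no server needed). A post document points at its author by id, as a normalized reference, and the application performs the second lookup itself; the field names here are illustrative, not MongoDB's:

```python
# Two "collections": users and posts. The post stores only the author's
# id, not the author's data -- a reference, like a foreign key.
users = {
    "u1": {"_id": "u1", "name": "Rusty"},
}
posts = [
    {"_id": "p1", "author_id": "u1", "subject": "I like Plankton"},
]

# Resolving the reference takes a second lookup -- the application-side
# equivalent of the join a relational database would perform for you.
post = posts[0]
author = users[post["author_id"]]
print(author["name"])  # Rusty
```

References keep documents small and avoid duplication, at the cost of extra round trips when related data is read together.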

Embedded Data Models
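The embedded alternative can be sketched the same way: the author and comments live inside the post document itself, so a single read fetches everything. The field names are again illustrative:

```python
# One self-contained document: related data is nested directly inside it.
post = {
    "_id": "p1",
    "subject": "I like Plankton",
    "author": {"name": "Rusty"},             # embedded sub-document
    "comments": [                            # embedded array of sub-documents
        {"name": "Anna", "text": "Me too!"},
    ],
}

# No second lookup needed: data that is read together is stored together.
print(post["author"]["name"], len(post["comments"]))  # Rusty 1
```

Embedding favors read performance for data accessed as a unit; references favor consistency when the same data is shared across many documents.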

CouchDB A CouchDB document is a JSON object that consists of named fields. Field values may be strings, numbers, dates, or even ordered lists and associative maps. An example of a document would be a blog post: { "Subject": "I like Plankton", "Author": "Rusty", "PostedDate": "5/23/2006", "Tags": ["plankton", "baseball", "decisions"], "Body": "I decided today that I don't like baseball. I like plankton." }

Problems with NoSQL Databases
- They do not support transaction consistency the way relational database systems do.
- There is no standard query language for NoSQL databases.

NewSQL Databases http://en.wikipedia.org/wiki/NewSQL NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional database system.

Approaches of NewSQL Systems
1. Distributed cluster of shared-nothing nodes: each node owns a subset of the data. These databases include components such as distributed concurrency control and distributed query processing.
2. Transparent sharding: these systems provide a sharding middleware layer to automatically split databases across multiple nodes.
3. Highly optimized SQL engines
4. In-memory database
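The core of approach 2, transparent sharding, is a routing function in the middleware layer that maps each row key to the node that owns it. A hedged sketch, with hypothetical node names:

```python
import hashlib

# Hypothetical cluster of three shard nodes.
NODES = ["node0", "node1", "node2"]

def shard_for(key: str) -> str:
    """Deterministically map a row key to one of the cluster's nodes
    by hashing the key and taking it modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# The same key always routes to the same node, so reads find their writes
# without the application knowing how the data is partitioned.
print(shard_for("customer:42") == shard_for("customer:42"))  # True
```

Production systems typically use consistent hashing or range partitioning instead of a plain modulo, so that adding a node does not remap most keys; this sketch shows only the routing idea.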

In-Memory Database
An in-memory database is a database management system that primarily relies on main memory for computer data storage, in contrast with systems that employ a disk storage mechanism. Main-memory databases are faster than disk-optimized databases, which makes them good for Big Data analytics. Some use non-volatile main memory modules that retain data even when electrical power is removed.
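Python's sqlite3 ":memory:" mode is a convenient, small-scale stand-in for the idea: the entire database lives in RAM, so no disk I/O sits on the query path. Unlike a production in-memory DBMS, this sketch is volatile and single-process; the sales table is hypothetical:

```python
import sqlite3

# The whole database is held in main memory, not on disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)",
               [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# Analytical queries run entirely against RAM-resident data.
total_east = db.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'east'").fetchone()[0]
print(total_east)  # 150.0
```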

SAP HANA, High-Speed Analytical Appliance
SAP HANA is an in-memory, column-oriented, relational database management system developed and marketed by SAP. HANA's architecture is designed to handle both high transaction rates and complex query processing on the same platform. SAP claims performance up to 10,000 times faster than standard disk-based systems, allowing companies to analyze data in a matter of seconds instead of hours.