CPS 216: Advanced Database Systems Shivnath Babu.



Outline for Today
What this class is about: Data management
What we will cover in this class
Logistics

What does a Database System mean to you?
(Hint: What are they used for? Give examples)

Data Management
[Diagram: a User/Application sends a Query to the DataBase Management System (DBMS), which manages the Data]

Example: At a Company

Employee
ID    Name   DeptID   Salary   …
10    Nemo   12       120K     …
20    Dory   156      79K      …
40    Gill   89       76K      …
52    Ray    34       85K      …
…     …      …        …        …

Department
ID    Name        …
12    IT          …
34    Accounts    …
89    HR          …
156   Marketing   …
…     …           …

Query 1: Is there an employee named “Nemo”?
Query 2: What is “Nemo’s” salary?
Query 3: How many departments are there in the company?
Query 4: What is the name of “Nemo’s” department?
Query 5: How many employees are there in the “Accounts” department?
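As a rough illustration, the five queries can be written in SQL and run with Python's built-in sqlite3 module. This is only a sketch: it loads just the rows visible on the slide (the real tables continue past the "…" rows, so the counts hold only for this sample), it assumes "120K" means 120000, and the helper name `scalar` is made up for the example.

```python
import sqlite3

# In-memory database holding only the rows shown on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Department (ID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Employee (
        ID INTEGER PRIMARY KEY,
        Name TEXT,
        DeptID INTEGER REFERENCES Department(ID),
        Salary INTEGER  -- assumption: "120K" on the slide means 120000
    );
    INSERT INTO Department VALUES
        (12, 'IT'), (34, 'Accounts'), (89, 'HR'), (156, 'Marketing');
    INSERT INTO Employee VALUES
        (10, 'Nemo', 12, 120000), (20, 'Dory', 156, 79000),
        (40, 'Gill', 89, 76000), (52, 'Ray', 34, 85000);
""")


def scalar(sql, *params):
    """Hypothetical helper: run a query and return its single value."""
    return conn.execute(sql, params).fetchone()[0]


# Query 1: is there an employee named "Nemo"? (1 = yes, 0 = no)
q1 = scalar("SELECT EXISTS(SELECT 1 FROM Employee WHERE Name = ?)", "Nemo")
# Query 2: what is Nemo's salary?
q2 = scalar("SELECT Salary FROM Employee WHERE Name = ?", "Nemo")
# Query 3: how many departments are there?
q3 = scalar("SELECT COUNT(*) FROM Department")
# Query 4: what is the name of Nemo's department? (needs a join)
q4 = scalar(
    "SELECT d.Name FROM Employee e JOIN Department d ON e.DeptID = d.ID "
    "WHERE e.Name = ?", "Nemo")
# Query 5: how many employees are in the "Accounts" department?
q5 = scalar(
    "SELECT COUNT(*) FROM Employee e JOIN Department d ON e.DeptID = d.ID "
    "WHERE d.Name = ?", "Accounts")

print(q1, q2, q3, q4, q5)
```

Queries 1 to 3 touch a single table, while Queries 4 and 5 must combine Employee and Department through DeptID.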

DataBase Management System (DBMS)
[Diagram: a High-level Query Q goes into the DBMS, which works on the Data and returns the Answer]
Translates Q into best execution plan for current conditions, runs plan

Example: Store that Sells Cars

Cars
Make     Model    OwnerID
Honda    Accord   12
Toyota   Camry    34
Mini     Cooper   89
Honda    Accord   156
…        …        …

Owners
ID    Name   Age
12    Nemo   22
34    Ray    42
89    Gill   36
156   Dory   21
…     …      …

Query: Owners of Honda Accords who are <= 23 years old

Execution plan:
Filter (Make = Honda and Model = Accord) on Cars
Join (Cars.OwnerID = Owners.ID)
Filter (Age <= 23)

Join output:
Make     Model    OwnerID   ID    Name   Age
Honda    Accord   12        12    Nemo   22
Honda    Accord   156       156   Dory   21
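The plan on this slide can be sketched as a pipeline of small Python operators. This is an illustration, not how a real DBMS is implemented: the function names are made up, and the choice of a hash join is an assumption, since the slide only says "Join".

```python
# Rows visible on the slide; the real tables continue past the "…" rows.
cars = [
    {"Make": "Honda", "Model": "Accord", "OwnerID": 12},
    {"Make": "Toyota", "Model": "Camry", "OwnerID": 34},
    {"Make": "Mini", "Model": "Cooper", "OwnerID": 89},
    {"Make": "Honda", "Model": "Accord", "OwnerID": 156},
]
owners = [
    {"ID": 12, "Name": "Nemo", "Age": 22},
    {"ID": 34, "Name": "Ray", "Age": 42},
    {"ID": 89, "Name": "Gill", "Age": 36},
    {"ID": 156, "Name": "Dory", "Age": 21},
]


def scan(table):
    """Leaf operator: produce every row of a table."""
    yield from table


def filter_rows(rows, predicate):
    """Filter operator: keep only rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def hash_join(left, right, left_key, right_key):
    """Join operator (hash join is an assumption; the slide just says Join)."""
    index = {}
    for r in right:  # build a hash index on the right input
        index.setdefault(r[right_key], []).append(r)
    for l in left:   # probe the index with each left row
        for r in index.get(l[left_key], []):
            yield {**l, **r}


# Plan: Filter(Make, Model) -> Join(OwnerID = ID) -> Filter(Age <= 23)
accords = filter_rows(
    scan(cars), lambda r: r["Make"] == "Honda" and r["Model"] == "Accord")
joined = hash_join(accords, scan(owners), "OwnerID", "ID")
young_owners = filter_rows(joined, lambda r: r["Age"] <= 23)

result = [row["Name"] for row in young_owners]
print(result)
```

Filtering Cars before the join means only two rows reach the join instead of four, which is the kind of decision a query optimizer makes when it picks an execution plan.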

DataBase Management System (DBMS)
[Diagram: a High-level Query Q goes into the DBMS, which works on the Data and returns the Answer]
Translates Q into best execution plan for current conditions, runs plan
Keeps data safe and correct despite failures, concurrent updates, online processing, etc.

DBMS is multi-user
Example:
    Get account balance from database;
    If balance > amount of withdrawal then
        balance = balance - amount of withdrawal;
        dispense cash;
        store new balance into database;
Homer at ATM1 withdraws $100
Marge at ATM2 withdraws $50
Initial balance = $400, final balance = ?
–Should be $250 no matter who goes first

Final balance = $250
Homer withdraws $100:
    read balance; $400
    if balance > amount then balance = balance - amount; $300
    write balance; $300
Marge withdraws $50 (after Homer has finished):
    read balance; $300
    if balance > amount then balance = balance - amount; $250
    write balance; $250

Final balance = $300
(Both read $400 before either writes; Homer's write lands last and overwrites Marge's.)
Homer withdraws $100:
    read balance; $400
    if balance > amount then balance = balance - amount; $300
    write balance; $300
Marge withdraws $50:
    read balance; $400
    if balance > amount then balance = balance - amount; $350
    write balance; $350

Final balance = $350
(Both read $400 before either writes; Marge's write lands last and overwrites Homer's.)
Homer withdraws $100:
    read balance; $400
    if balance > amount then balance = balance - amount; $300
    write balance; $300
Marge withdraws $50:
    read balance; $400
    if balance > amount then balance = balance - amount; $350
    write balance; $350
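The three outcomes can be reproduced with a small deterministic simulation that replays a fixed interleaving of the read, compute, and write steps. This is only a sketch: `run_schedule` and the schedule names are made up, and the intermediate step order in the two bad schedules is one possible interleaving, since all that matters is that both reads happen before either write.

```python
def run_schedule(schedule, amounts, initial=400):
    """Replay (who, step) pairs against one shared balance."""
    db_balance = initial
    local = {}  # each customer's private copy of the balance
    for who, step in schedule:
        if step == "read":
            local[who] = db_balance
        elif step == "compute":
            if local[who] > amounts[who]:
                local[who] -= amounts[who]
        elif step == "write":
            db_balance = local[who]
    return db_balance


amounts = {"Homer": 100, "Marge": 50}
steps = ("read", "compute", "write")

# Serial schedule: Homer finishes before Marge starts.
serial = [("Homer", s) for s in steps] + [("Marge", s) for s in steps]

# Both read $400 first; Homer's write is last, so Marge's update is lost.
homer_writes_last = [
    ("Homer", "read"), ("Marge", "read"),
    ("Homer", "compute"), ("Marge", "compute"),
    ("Marge", "write"), ("Homer", "write"),
]

# Both read $400 first; Marge's write is last, so Homer's update is lost.
marge_writes_last = [
    ("Homer", "read"), ("Marge", "read"),
    ("Homer", "compute"), ("Marge", "compute"),
    ("Homer", "write"), ("Marge", "write"),
]

balance_serial = run_schedule(serial, amounts)
balance_homer_last = run_schedule(homer_writes_last, amounts)
balance_marge_last = run_schedule(marge_writes_last, amounts)
print(balance_serial, balance_homer_last, balance_marge_last)
```

Only the serial schedule gives the correct $250. In the other two, cash worth $150 is dispensed but the stored balance drops by only $100 or $50.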

Concurrency control in DBMS
Similar to concurrent programming problems
–But data is not all in main-memory
Appears similar to file system concurrent access?
–Approach taken by MySQL initially; now MySQL offers better alternatives
But want to control at much finer granularity
Or else one withdrawal would lock up all accounts!
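The point about granularity can be illustrated with in-memory threads: one lock per account instead of one lock for the whole table. This is an analogy under stated assumptions, not how a DBMS lock manager works (a DBMS must also handle data on disk, deadlocks, and transactions that span several rows). The `Bank` class and the account names are hypothetical.

```python
import threading


class Bank:
    """Toy in-memory bank with one lock per account (fine granularity)."""

    def __init__(self, balances):
        self.balances = dict(balances)
        # A single global lock would serialize every withdrawal; per-account
        # locks let withdrawals on different accounts proceed independently.
        self.locks = {account: threading.Lock() for account in balances}

    def withdraw(self, account, amount):
        # Read, check, and write happen as one unit for this account only.
        with self.locks[account]:
            balance = self.balances[account]
            if balance > amount:
                self.balances[account] = balance - amount
                return True
            return False


bank = Bank({"simpsons": 400, "flanders": 900})  # hypothetical accounts
threads = [
    threading.Thread(target=bank.withdraw, args=("simpsons", 100)),  # Homer
    threading.Thread(target=bank.withdraw, args=("simpsons", 50)),   # Marge
    threading.Thread(target=bank.withdraw, args=("flanders", 200)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(bank.balances)
```

Homer and Marge contend for the same lock, so their withdrawals cannot interleave, while the withdrawal on the other account never waits for either of them.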

Recovery in DBMS
Example: balance transfer
    decrement the balance of account X by $100;
    increment the balance of account Y by $100;
Scenario 1: Power goes out after the first instruction
Scenario 2: DBMS buffers and updates data in memory (for efficiency); before the updates are written back to disk, power goes out
Log updates; undo/redo during recovery
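A minimal sketch of log-based recovery for the two scenarios follows. It is deliberately simplified and rests on several assumptions: every update is logged with its old and new value before the data is changed, no two concurrent transactions update the same item, and the log itself survives the crash. The record layout, the `recover` function, and the starting balances ($500 and $200) are all made up for illustration.

```python
def recover(disk, log):
    """Return a consistent state from on-disk data plus the update log.

    Log records (hypothetical layout):
      ("update", txn, key, old_value, new_value)
      ("commit", txn)
    """
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    state = dict(disk)

    # Redo pass: reapply updates of committed transactions in log order,
    # in case they were still only in memory when the power went out.
    for rec in log:
        if rec[0] == "update" and rec[1] in committed:
            _, _, key, _, new_value = rec
            state[key] = new_value

    # Undo pass: roll back updates of uncommitted transactions, newest first.
    for rec in reversed(log):
        if rec[0] == "update" and rec[1] not in committed:
            _, _, key, old_value, _ = rec
            state[key] = old_value

    return state


# Scenario 1: the decrement of X reached disk, then the power went out.
disk_1 = {"X": 400, "Y": 200}
log_1 = [("update", "T1", "X", 500, 400)]
after_1 = recover(disk_1, log_1)

# Scenario 2: the transfer committed, but neither update reached disk.
disk_2 = {"X": 500, "Y": 200}
log_2 = [
    ("update", "T1", "X", 500, 400),
    ("update", "T1", "Y", 200, 300),
    ("commit", "T1"),
]
after_2 = recover(disk_2, log_2)

print(after_1, after_2)
```

In Scenario 1 the half-finished transfer is undone, and in Scenario 2 the committed transfer is redone. Either way the total across X and Y stays at $700.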


Summary of modern DBMS features
Persistent storage of data
Logical data model; declarative queries and updates → physical data independence
Multi-user concurrent access
Safety from system failures
Performance, performance, performance
–Massive amounts of data (terabytes to petabytes)
–High throughput (thousands to millions of transactions per minute)
–High availability (≥ …% uptime)

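Modern DBMS Architecture
[Diagram: Applications issue SQL to the DBMS. Inside the DBMS, the Parser produces a logical query plan, the Query Optimizer turns it into a physical query plan, and the Query Executor runs it by making access method API calls to the Storage Manager. Below the DBMS, storage system API calls and file system API calls lead through the OS to the Disk(s).]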

World of “Big Data”
Numbers reported by Google from 2007:
–Data processed per month is 400 PB (PetaBytes)
–Average job size is 180 GB
For 180 GB of data, it takes:
–30 minutes to read from disk (about 100 MB/s, as implied by 180 GB in 30 minutes)
–600 minutes to download at 5 MB/s
Big data is hard to move (but easy to store: a few cents per GB)
Can throw parallelism at the problem
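The arithmetic behind these numbers is easy to check. The sketch below assumes 1 GB = 1000 MB and a disk read rate of 100 MB/s, which is the rate implied by reading 180 GB in 30 minutes. The 100-disk figure is a hypothetical, idealized example of "throwing parallelism at the problem", with the data split evenly and no coordination overhead.

```python
def transfer_minutes(size_gb, rate_mb_per_s):
    """Minutes needed to move size_gb at rate_mb_per_s (1 GB = 1000 MB)."""
    return size_gb * 1000 / rate_mb_per_s / 60


job_size_gb = 180

# Reading from one disk at an assumed 100 MB/s.
disk_minutes = transfer_minutes(job_size_gb, 100)

# Downloading over a 5 MB/s link.
download_minutes = transfer_minutes(job_size_gb, 5)

# Idealized parallel read: the data is spread evenly over 100 disks
# (hypothetical number) that are all read at the same time.
parallel_disk_minutes = disk_minutes / 100

print(disk_minutes, download_minutes, parallel_disk_minutes)
```

Under these assumptions a 30-minute sequential read shrinks to well under a minute, which is why systems for data of this size spread both storage and computation across many machines.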

Word Count over a Given Set of Web Pages
Input pages:
    see bob throw
    see spot run
Per-page counts:
    see 1, bob 1, throw 1
    see 1, spot 1, run 1
Combined counts:
    bob 1, run 1, see 2, spot 1, throw 1
Can we do word count in parallel?

The MapReduce Framework (pioneered by Google)
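For the two pages of the word-count slide, the map, shuffle, and reduce phases can be sketched in a single Python process. This shows only the programming model, not Google's or Hadoop's implementation: the function names are made up, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(page):
    """Map: emit a (word, 1) pair for every word on one page."""
    return [(word, 1) for word in page.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one word into a total."""
    return word, sum(counts)


pages = ["see bob throw", "see spot run"]

# Each map call looks at one page only, so the calls are independent
# and could run in parallel on different machines.
mapped = [pair for page in pages for pair in map_phase(page)]

# Each reduce call looks at one word only, so these are independent too.
word_counts = dict(
    reduce_phase(word, counts)
    for word, counts in sorted(shuffle(mapped).items())
)
print(word_counts)
```

The only step that needs data from every page is the shuffle, which the framework performs on the programmer's behalf.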

Automatic Parallel Execution in MapReduce
Handles failures automatically, e.g., restarts tasks if a node fails; runs multiple copies of the same task to avoid a slow task slowing down the whole job
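The "restarts tasks if a node fails" behavior can be sketched as a simple retry loop. This is a toy model, not the scheduler of any real MapReduce system: `run_with_retries`, `FlakyTask`, and the limit of three attempts are all hypothetical, and a node failure is simulated by raising an exception.

```python
def run_with_retries(task, max_attempts=3):
    """Run task(); if it fails, restart it, up to max_attempts times.

    Returns (result, number_of_attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except RuntimeError:
            if attempt == max_attempts:
                raise  # give up: the whole job fails


class FlakyTask:
    """Simulated task that fails a fixed number of times, then succeeds."""

    def __init__(self, failures):
        self.remaining_failures = failures

    def __call__(self):
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated node failure")
        return "done"


result, attempts = run_with_retries(FlakyTask(failures=2))
print(result, attempts)
```

Restarting is safe here because a task reads its input and produces its output without side effects, so running it again gives the same result.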

Course Outline
Principles of query processing (25%)
–Indexes
–Query execution plans and operators
–Query optimization
Data storage (15%)
–Databases vs. filesystems (Google FileSystem, Hadoop Distributed FileSystem)
–Row-oriented vs. column-oriented storage
–Flash memory and Solid State Drives
Scalable data processing (30%)
–Parallel query plans and operators
–Systems based on MapReduce
–Scalable key-value stores
Concurrency control and recovery (15%)
–Consistency models for data (ACID, BASE, Serializability)
–Write-ahead logging
Information retrieval and Data mining (15%)
–Web search (Google PageRank, inverted indexes)
–Association rules and clustering

Course Logistics
Useful reference: Database Systems: The Complete Book, by H. Garcia-Molina, J. D. Ullman, and J. Widom
Web site:
Grading:
–Project 40%
–Homework Assignments 15%
–Midterm 20%
–Final 25%

Summary: Data Management is Important
Core aspect of most sciences and engineering today
Core need in industry (esp. “big data”)
Cool mix of theory and systems
Chances are you will find something interesting even if your primary interest is elsewhere