CPS 516: Data-intensive Computing Systems Instructor: Shivnath Babu TA: Jie Li.

Slides:



Advertisements
Similar presentations
Three Perspectives & Two Problems Shivnath Babu Duke University.
Advertisements

Large Scale Computing Systems
Introduction to Advanced Computing Platforms for Data Analysis Ruoming Jin.
Quiz 2 Review. For which of the following attributes would a hash- index most likely be a better fit than a B+-tree index? A. Social Security Number B.
Paula Ta-Shma, IBM Haifa Research 1 “Advanced Topics on Storage Systems” - Spring 2013, Tel-Aviv University Big Data and.
Cs44321 CS4432: Database Systems II Lecture #19 Database Consistency and Transactions Professor Elke A. Rundensteiner.
MAD SKILLS NEW ANALYSIS PRACTICES FOR BIG DATA XXXXXXXXXX JEFF COHENGREENPLUM BRIAN DOLANFOX AUDIENCE NETWORK MARK DUNLAPEVERGREEN TECHNOLOGIES JOE HELLERSTEINUC.
Cs4432recovery1 CS4432: Database Systems II Lecture #19 Database Consistency and Violations? Professor Elke A. Rundensteiner.
1 ICS 223: Transaction Processing and Distributed Data Management Winter 2008 Professor Sharad Mehrotra Information and Computer Science University of.
ETM Hadoop. ETM IDC estimate put the size of the “digital universe” at zettabytes in forecasting a tenfold growth by 2011 to.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
1: IntroductionData Management & Engineering1 Course Overview: CS 395T Semantic Web, Ontologies and Cloud Databases Daniel P. Miranker Objectives: Get.
CPS 216: Data-intensive Computing Systems Shivnath Babu.
CERN IT Department CH-1211 Geneva 23 Switzerland t XLDB 2010 (Extremely Large Databases) conference summary Dawid Wójcik.
CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu.
Scale-out databases for CERN use cases Strata Hadoop World London 6 th of May,2015 Zbigniew Baranowski, CERN IT-DB.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
DLRL Cluster Matt Bollinger, Joseph Pontani, Adam Lech Client: Sunshin Lee CS4624 Capstone Project March 3, 2014 Virginia Tech, Blacksburg, VA.
Distributed and Parallel Processing Technology Chapter1. Meet Hadoop Sun Jo 1.
CSCI-2950u :: Data-Intensive Scalable Computing Rodrigo Fonseca (rfonseca)
Page 1 Course Description CPS510 Database Systems Fall 2004 School of Computer Science Ryerson University.
Introduction. 
CPS 216: Advanced Database Systems Shivnath Babu.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
CPS 216: Advanced Database Systems Shivnath Babu Fall 2006.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
CSET 3300 Databases & ER Data Models. Databases A database is a collection of data (information). A DataBase Management System (DBMS) is a software system.
O’Reilly – Hadoop: The Definitive Guide Ch.1 Meet Hadoop May 28 th, 2010 Taewhi Lee.
CPS216: Advanced Database Systems Notes 02:Query Processing (Overview) Shivnath Babu.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
CPS 216: Advanced Database Systems Class Project Shivnath Babu.
Introduction to Database Systems1. 2 Basic Definitions Mini-world Some part of the real world about which data is stored in a database. Data Known facts.
Hadoop implementation of MapReduce computational model Ján Vaňo.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Database Systems Lecture 1. In this Lecture Course Information Databases and Database Systems Some History The Relational Model.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Nov 2006 Google released the paper on BigTable.
History & Motivations –RDBMS History & Motivations (cont’d) … … Concurrent Access Handling Failures Shared Data User.
CPS 216: Advanced Database Systems Shivnath Babu.
CPS 216: Advanced Database Systems Shivnath Babu.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
ORACLE & VLDB Nilo Segura IT/DB - CERN. VLDB The real world is in the Tb range (British Telecom - 80Tb using Sun+Oracle) Data consolidated from different.
Database Management Systems.  Instructor: Yrd. Doç. Dr. Cengiz Örencik   Course material.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
《数据库系统原理》 Principles of Database Systems. Textbook A First Course in Database Systems (Third Edition) J. D. Ullman, J. Widom 机械工业出版社, Lu Chaojun,
《数据库系统原理》 Principles of Database Systems. Textbook A First Course in Database Systems (Third Edition) J. D. Ullman, J. Widom 机械工业出版社, Lu Chaojun,
Data Engineering Shivnath Babu Introduction to Parallel Execution.
1 HBASE – THE SCALABLE DATA STORE An Introduction to HBase XLDB Europe Workshop 2013: CERN, Geneva James Kinley EMEA Solutions Architect, Cloudera.
CS422 Principles of Database Systems Introduction to NoSQL Chengyu Sun California State University, Los Angeles.
Understanding DBMSs. Data Management Data Query Application DataBase Management System (DBMS)
By: Joel Dominic and Carroll Wongchote 4/18/2012.
BIG DATA/ Hadoop Interview Questions.
Lecture 1 Book: Hadoop in Action by Chuck Lam Online course – “Cloud Computing Concepts” lecture notes by Indranil Gupta.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
CPSC-310 Database Systems
Database System Concepts
Introduction to MapReduce and Hadoop
Central Florida Business Intelligence User Group
Ch 4. The Evolution of Analytic Scalability
Overview of big data tools
Zoie Barrett and Brian Lam
Dep. of Information Technology By: Raz Dara Mohammad Amin
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Presentation transcript:

CPS 516: Data-intensive Computing Systems Instructor: Shivnath Babu TA: Jie Li

A Brief History Relational database management systems Time Let us first see what a relational database system is

Data Management Data Query User/Application DataBase Management System (DBMS)

Example: At a Company IDNameDeptIDSalary… 10Nemo12120K… 20Dory15679K… 40Gill8976K… 52Ray3485K… …………… IDName… 12IT… 34Accounts… 89HR… 156Marketing… ……… EmployeeDepartment Query 1: Is there an employee named “Nemo”? Query 2: What is “Nemo’s” salary? Query 3: How many departments are there in the company? Query 4: What is the name of “Nemo’s” department? Query 5: How many employees are there in the “Accounts” department?

DataBase Management System (DBMS)High-level Query Q DBMS Data Answer Translates Q into best execution plan for current conditions, runs plan

Example: Store that Sells Cars MakeModelOwnerID HondaAccord12 ToyotaCamry34 MiniCooper89 HondaAccord156 ……… IDNameAge 12Nemo22 34Ray42 89Gill36 156Dory21 ……… Cars Owners Filter (Make = Honda and Model = Accord) Join (Cars.OwnerID = Owners.ID) MakeModelOwnerIDIDNameAge HondaAccord12 Nemo22 HondaAccord156 Dory21 Owners of Honda Accords who are <= 23 years old Filter (Age <= 23)

DataBase Management System (DBMS)High-level Query Q DBMS Data Answer Translates Q into best execution plan for current conditions, runs plan Keeps data safe and correct despite failures, concurrent updates, online processing, etc.

A Brief History Relational database management systems Time Semi-structured and unstructured data (Web) Hardware developments Developments in system software Changes in data sizes Assumptions and requirements changed over time

Big Data: How much data? Google processes 20 PB a day (2008) Wayback Machine has 3 PB TB/month (3/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) Facebook has 36 PB of user data TB/day (6/2010) CERN’s LHC: 15 PB a year (any day now) LSST: 6-10 PB a year (~2015) 640K ought to be enough for anybody. From

From:

NEW REALITIES TB disks < $100 Everything is data Rise of data-driven culture Very publicly espoused by Google, Wired, etc. Sloan Digital Sky Survey, Terraserver, etc. The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data. Welcome to the Petabyte Age. From:

FOX AUDIENCE NETWORK Greenplum parallel DB 42 Sun X4500s (“Thumper”) each with: GB drives 16GB RAM 2 dual-core Opterons Big and growing 200 TB data (mirrored) Fact table of 1.5 trillion rows Growing 5TB per day 4-7 Billion rows per day Also extensive use of R and Hadoop As reported by FAN, Feb, 2009 From: Yahoo! runs a 4000 node Hadoop cluster (probably the largest). Overall, there are 38,000 nodes running Hadoop at Yahoo!

A SCENARIO FROM FAN Open-ended question about statistical densities (distributions) How many female WWF fans under the age of 30 visited the Toyota community over the last 4 days and saw a Class A ad? How are these people similar to those that visited Nissan? From:

MULTILINGUAL DEVELOPMENT SQL or MapReduce Sequential code in a variety of languages Perl Python Java R Mix and Match! SE HABLA MAPREDUCE SQL SPOKEN HERE QUI SI PARLA PYTHON HIER JAVA GESPROCKEN R PARLÉ ICI From:

From:

What we will cover Scalable data processing (40%) –Parallel query plans and operators –Systems based on MapReduce –Scalable key-value stores –Processing rapid, high-speed data streams Principles of query processing (35%) –Indexes –Query execution plans and operators –Query optimization Data storage (15%) –Databases Vs. Filesystems (Google/Hadoop Distributed FileSystem) –Data layouts (row-stores, column-stores, partitioning, compression) Concurrency control and recovery (10%) –Consistency models for data (ACID, BASE, Serializability) –Write-ahead logging

Course Logistics Web: Books: –(Recommended) Hadoop: The Definitive Guide, by Tom White –Database Systems: The Complete Book, by H. Garcia-Molina, J. D. Ullman, and J. Widom Grading: –Project 25% (Hopefully, on Amazon Cloud!) –Homeworks 25% –Midterm 25% –Final 25%

See Course Web Page For Course outline Homeworks Projects Tentative date for midterm