CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu.

Slides:



Advertisements
Similar presentations
 Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware  Created by Doug Cutting and.
Advertisements

CSE 5522: Survey of Artificial Intelligence II: Advanced Techniques Instructor: Alan Ritter TA: Fan Yang.
CPS 516: Data-intensive Computing Systems Instructor: Shivnath Babu TA: Jie Li.
Paula Ta-Shma, IBM Haifa Research 1 “Advanced Topics on Storage Systems” - Spring 2013, Tel-Aviv University Big Data and.
MAD SKILLS NEW ANALYSIS PRACTICES FOR BIG DATA XXXXXXXXXX JEFF COHENGREENPLUM BRIAN DOLANFOX AUDIENCE NETWORK MARK DUNLAPEVERGREEN TECHNOLOGIES JOE HELLERSTEINUC.
1 ICS 223: Transaction Processing and Distributed Data Management Winter 2008 Professor Sharad Mehrotra Information and Computer Science University of.
ETM Hadoop. ETM IDC estimate put the size of the “digital universe” at zettabytes in forecasting a tenfold growth by 2011 to.
CPS216: Advanced Database Systems (Data-intensive Computing Systems) How MapReduce Works (in Hadoop) Shivnath Babu.
Introduction to Data Science Kamal Al Nasr, Matthew Hayes and Jean-Claude Pedjeu Computer Science and Mathematical Sciences College of Engineering Tennessee.
1: IntroductionData Management & Engineering1 Course Overview: CS 395T Semantic Web, Ontologies and Cloud Databases Daniel P. Miranker Objectives: Get.
CPS 216: Data-intensive Computing Systems Shivnath Babu.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
Cloud Computing. Cloud Computing Overview Course Content
Data Mining on the Web via Cloud Computing COMS E6125 Web Enhanced Information Management Presented By Hemanth Murthy.
Ch 4. The Evolution of Analytic Scalability
CSCI-2950u :: Data-Intensive Scalable Computing Rodrigo Fonseca (rfonseca)
Facebook (stylized facebook) is a Social Networking System and website launched in February 2004, operated and privately owned by Facebook, Inc. As.
Introduction. 
CPS 216: Advanced Database Systems Shivnath Babu.
Cloud Distributed Computing Environment Content of this lecture is primarily from the book “Hadoop, The Definite Guide 2/e)
MapReduce April 2012 Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
Map Reduce and Hadoop S. Sudarshan, IIT Bombay
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
HBase A column-centered database 1. Overview An Apache project Influenced by Google’s BigTable Built on Hadoop ▫A distributed file system ▫Supports Map-Reduce.
CPS 216: Advanced Database Systems Shivnath Babu Fall 2006.
MapReduce: Hadoop Implementation. Outline MapReduce overview Applications of MapReduce Hadoop overview.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Introduction to Apache Hadoop Zibo Wang. Introduction  What is Apache Hadoop?  Apache Hadoop is a software framework which provides open source libraries.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
CSET 3300 Databases & ER Data Models. Databases A database is a collection of data (information). A DataBase Management System (DBMS) is a software system.
O’Reilly – Hadoop: The Definitive Guide Ch.1 Meet Hadoop May 28 th, 2010 Taewhi Lee.
CPS216: Advanced Database Systems Notes 02:Query Processing (Overview) Shivnath Babu.
CSED421 Database Systems Lab. Welcome Lab Class –Library 501, Fri 9:00 – 10:40 Teacher Assistants – 안석현, 이상훈 –{ashworld, –IDS.
Overviews of ITCS 6161/8161: Advanced Topics on Database Systems Dr. Jianping Fan Department of Computer Science UNC-Charlotte
Introduction. » How the course works ˃Homework ˃Project ˃Exams ˃Grades » prerequisite ˃CSCI 6441: Mandatory prerequisite ˃Take the prereq or get permission.
Grid Computing at Yahoo! Sameer Paranjpye Mahadev Konar Yahoo!
CPS 216: Advanced Database Systems Class Project Shivnath Babu.
Understanding the field & setting expectations.  Personal  International  UNT Alumni (Mathematics)  Academic  Economics & Mathematics  Professional.
Hadoop implementation of MapReduce computational model Ján Vaňo.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Database Systems Lecture 1. In this Lecture Course Information Databases and Database Systems Some History The Relational Model.
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
CPS 216: Advanced Database Systems Shivnath Babu.
CPS 216: Advanced Database Systems Shivnath Babu.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
ORACLE & VLDB Nilo Segura IT/DB - CERN. VLDB The real world is in the Tb range (British Telecom - 80Tb using Sun+Oracle) Data consolidated from different.
{ Tanya Chaturvedi MBA(ISM) Hadoop is a software framework for distributed processing of large datasets across large clusters of computers.
Mining of Massive Datasets Edited based on Leskovec’s from
Data Engineering Shivnath Babu Introduction to Parallel Execution.
1 HBASE – THE SCALABLE DATA STORE An Introduction to HBase XLDB Europe Workshop 2013: CERN, Geneva James Kinley EMEA Solutions Architect, Cloudera.
CS422 Principles of Database Systems Introduction to NoSQL Chengyu Sun California State University, Los Angeles.
Understanding DBMSs. Data Management Data Query Application DataBase Management System (DBMS)
By: Joel Dominic and Carroll Wongchote 4/18/2012.
BIG DATA/ Hadoop Interview Questions.
1 Cloud-Native Data Warehousing Bob Muglia. 2 Scenarios with affinity for cloud Gartner 2016 Predictions: By 2018, six billion connected things will be.
Lecture 1 Book: Hadoop in Action by Chuck Lam Online course – “Cloud Computing Concepts” lecture notes by Indranil Gupta.
CPSC-310 Database Systems
Advanced Topics in Concurrency and Reactive Programming: Case Study – Google Cluster Majeed Kassis.
Hadoop.
Introduction to MapReduce and Hadoop
Central Florida Business Intelligence User Group
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Ch 4. The Evolution of Analytic Scalability
Overview of big data tools
Zoie Barrett and Brian Lam
Dep. of Information Technology By: Raz Dara Mohammad Amin
Welcome! Knowledge Discovery and Data Mining
Analysis of Structured or Semi-structured Data on a Hadoop Cluster
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Presentation transcript:

CPS 216: Advanced Database Systems (Data-intensive Computing Systems) Shivnath Babu

A Brief History Relational database management systems Time Let us first see what a relational database system is

Data Management Data Query User/Application DataBase Management System (DBMS)

Example: At a Company IDNameDeptIDSalary… 10Nemo12120K… 20Dory15679K… 40Gill8976K… 52Ray3485K… …………… IDName… 12IT… 34Accounts… 89HR… 156Marketing… ……… EmployeeDepartment Query 1: Is there an employee named “Nemo”? Query 2: What is “Nemo’s” salary? Query 3: How many departments are there in the company? Query 4: What is the name of “Nemo’s” department? Query 5: How many employees are there in the “Accounts” department?

DataBase Management System (DBMS)High-level Query Q DBMS Data Answer Translates Q into best execution plan for current conditions, runs plan

Example: Store that Sells Cars MakeModelOwnerID HondaAccord12 ToyotaCamry34 MiniCooper89 HondaAccord156 ……… IDNameAge 12Nemo22 34Ray42 89Gill36 156Dory21 ……… Cars Owners Filter (Make = Honda and Model = Accord) Join (Cars.OwnerID = Owners.ID) MakeModelOwnerIDIDNameAge HondaAccord12 Nemo22 HondaAccord156 Dory21 Owners of Honda Accords who are <= 23 years old Filter (Age <= 23)

DataBase Management System (DBMS)High-level Query Q DBMS Data Answer Translates Q into best execution plan for current conditions, runs plan Keeps data safe and correct despite failures, concurrent updates, online processing, etc.

A Brief History Relational database management systems Time Semi-structured and unstructured data (Web) Hardware developments Developments in system software Changes in data sizes Assumptions and requirements changed over time

Big Data: How much data? Google processes 20 PB a day (2008) Wayback Machine has 3 PB TB/month (3/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) Facebook has 36 PB of user data TB/day (6/2010) CERN’s LHC: 15 PB a year (any day now) LSST: 6-10 PB a year (~2015) 640K ought to be enough for anybody. From

From:

NEW REALITIES TB disks < $100 Everything is data Rise of data-driven culture Very publicly espoused by Google, Wired, etc. Sloan Digital Sky Survey, Terraserver, etc. The quest for knowledge used to begin with grand theories. Now it begins with massive amounts of data. Welcome to the Petabyte Age. From:

THE NEW PRACTITIONERS Hal Varian, UC Berkeley, Chief Google “Looking for a career where your services will be in high demand? … Provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s ubiquitous and cheap? Data. And what is complementary to data? Analysis. the sexy job in the next ten years will be statisticians From:

THE NEW PRACTITIONERS Aggressively Datavorous Statistically savvy Diverse in training, tools From:

FOX AUDIENCE NETWORK Greenplum parallel DB 42 Sun X4500s (“Thumper”) each with: GB drives 16GB RAM 2 dual-core Opterons Big and growing 200 TB data (mirrored) Fact table of 1.5 trillion rows Growing 5TB per day 4-7 Billion rows per day Also extensive use of R and Hadoop As reported by FAN, Feb, 2009 From: Yahoo! runs a 4000 node Hadoop cluster (probably the largest). Overall, there are 38,000 nodes running Hadoop at Yahoo!

A SCENARIO FROM FAN Open-ended question about statistical densities (distributions) How many female WWF fans under the age of 30 visited the Toyota community over the last 4 days and saw a Class A ad? How are these people similar to those that visited Nissan? From:

MULTILINGUAL DEVELOPMENT SQL or MapReduce Sequential code in a variety of languages Perl Python Java R Mix and Match! SE HABLA MAPREDUCE SQL SPOKEN HERE QUI SI PARLA PYTHON HIER JAVA GESPROCKEN R PARLÉ ICI From:

From:

Teaching/Learning Methodology Relational database management systems Time Semi-structured and unstructured data (Web) Hardware developments Developments in system software Changes in data sizes Assumptions and requirements changed over time

Course Outline Principles of query processing (30%) –Indexes –Query execution plans and operators –Query optimization Data storage (10%) –Databases Vs. filesystems (Google/Hadoop Distributed FileSystem) –Flash memory and Solid State Drives Scalable data processing (35%) –Parallel query plans and operators –Systems based on MapReduce –Scalable key-value stores Concurrency control and recovery (15%) –Consistency models for data (ACID, BASE, Serializability) –Write-ahead logging Information retrieval and Data mining (10%) –Web search (Google PageRank, inverted indexes) –Association rules and clustering

Course Logistics Web: TA: Gang Luo References: –Hadoop: The Definitive Guide, by Tom White –Database Systems: The Complete Book, by H. Garcia- Molina, J. D. Ullman, and J. Widom Grading: –Project 35% (Hopefully, on Amazon Cloud!) –Homework Assignments 15% –Midterm 25% –Final 25%