Mike Carey Information Systems Group Computer Science Department UC Irvine.

2  Carnegie-Mellon University, 1975-80  B.S. and M.S. Student, EE/ECE  UC Berkeley, 1980-83  Ph.D. Student, CS  University of Wisconsin, 1983-95  Assistant/Associate/Full Professor, CS  IBM, 1995-2000  Industrial Researcher & Software R&D Manager  Propel Software, 2000-01  Startup Company Fellow/CTO/VP of Software  BEA Systems, Inc., 2001-08 (acquired by Oracle)  Industrial Software Architect & Sr. Engineering Director  And now I’m here…  Trivia tidbit: Here’s a photo of my first (ever) CS TA

3  Okay, so just what is a database system?  Based on lecture notes from the UW-Madison database curriculum, as immortalized in Database Management Systems (Ramakrishnan & Gehrke, a.k.a. “the Cow book”)  The database field is a vertical slice of all of CS!  You’ll see what I mean (and why)…  What’s exciting in “database systems” today?  UCI Information Systems Group (ISG) and beyond!

4  So what’s a database?  A very large, integrated collection of data  Usually a model of a real-world enterprise or a history of real-world events  Entities (e.g., students, courses, Facebook users, …)  Relationships (e.g., Susan is taking CS 234, Susan is a friend of Lynn, Mike filed a grade change for Lynn, …)  What’s a database management system (DBMS)?  A software system designed to store, manage, and provide access to one or more such databases

5  Files: manual coding over byte streams. The majority of application development effort goes towards building and then maintaining data access logic.  Early DBMS technologies (CODASYL/IMS): records and pointers. Large, carefully tuned data access programs that have dependencies on physical access paths, indexes, etc.  Relational DB systems: declarative approach. Tables + views bring “data independence”, with details left to the system. Designed to simplify data-centric application development.  New Data ???

6  Data independence  Efficient (and automatic) data access  Reduced application development time  Data integrity and security  Uniform data administration  Concurrent access and recovery from crashes
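As a rough illustration of the integrity and crash-recovery benefits listed above, the sketch below uses Python's built-in sqlite3 module. The table layout, the CHECK constraint on gpa, and the helper name try_batch are assumptions made for this example, not part of the lecture. It shows a two-statement transaction being undone as a unit when the second statement violates a constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: the CHECK constraint is an assumed integrity rule.
conn.execute(
    "CREATE TABLE Students ("
    " sid TEXT PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " gpa REAL CHECK (gpa BETWEEN 0.0 AND 4.0))"
)
conn.execute("INSERT INTO Students VALUES ('s1', 'Susan', 3.7)")
conn.commit()


def try_batch(conn):
    """Apply two changes as one transaction; undo both if either fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("UPDATE Students SET gpa = 3.9 WHERE sid = 's1'")
            # This row violates the CHECK constraint, so the DBMS rejects it.
            conn.execute("INSERT INTO Students VALUES ('s2', 'Lynn', 7.5)")
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_batch(conn)
rows = conn.execute("SELECT sid, gpa FROM Students ORDER BY sid").fetchall()
print(ok, rows)  # the earlier UPDATE was rolled back along with the bad INSERT
```

The application never writes its own undo logic here: the rollback of the already-applied UPDATE is done by the database system.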

7  Shift from computation to information  At the “low end”: explosion of the web (a mess!)  At the “high end”: scientific applications  Datasets increasing in diversity and volume  Digital libraries, interactive video, social media, genomic data, big science data, … ... need for DBMS exploding!  DBMS field encompasses most of CS  OS, languages, theory, AI, multimedia, logic, … ?!

8  A data model is a collection of concepts for describing data (to one another or to a DBMS)  A schema is a description of a particular collection of data, using a given data model  The relational model is the most widely used data model today  Relation – basically a table with rows and (named) columns  Schema – describes the tables and their columns

9  Many views of one conceptual (logical) schema and an underlying physical schema  Views describe how different users or groups see the data  Conceptual schema defines the logical structure of the database  Physical schema describes the files and indexes used “under the covers”  [Figure: View 1, View 2, and View 3 sit on top of the Conceptual Schema, which sits on top of the Physical Schema. Figure labels: Bits, On-Disk Data Structures, Logical Model, Lies!]

10  Conceptual schema:  Students(sid: string, name: string, login: string, age: integer, gpa: real)  Courses(cid: string, cname: string, credits: integer)  Enrolled(sid: string, cid: string, grade: string)  Physical schema:  Relations each stored as unordered files  Have indexes on first and third columns of Students  External schema (a.k.a. view):  CourseInfo(cid: string, cname: string, enrollment: integer)
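The three schema levels on this slide could be written down roughly as follows, using Python's built-in sqlite3 module. This is a sketch under assumptions: the index names are invented, SQLite's TEXT/INTEGER/REAL stand in for string/integer/real, and the body of the CourseInfo view is a guess, since the slide gives only its column signature.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conceptual schema: the three relations from the slide.
CREATE TABLE Students (sid TEXT, name TEXT, login TEXT, age INTEGER, gpa REAL);
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

-- Physical schema: indexes on the first and third columns of Students.
-- (Index names are hypothetical.)
CREATE INDEX students_sid   ON Students(sid);
CREATE INDEX students_login ON Students(login);

-- External schema: one assumed definition of the CourseInfo view.
-- LEFT JOIN keeps courses with zero enrollments; that choice is ours.
CREATE VIEW CourseInfo AS
  SELECT c.cid AS cid, c.cname AS cname, COUNT(e.sid) AS enrollment
  FROM Courses c LEFT JOIN Enrolled e ON e.cid = c.cid
  GROUP BY c.cid, c.cname;
""")

# The system catalog records objects from all three levels.
catalog = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY type, name"
).fetchall()
for obj_type, name in catalog:
    print(obj_type, name)
```

Listing sqlite_master shows that tables, indexes, and views are all just catalog entries; applications normally name only the tables and views.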

11  Applications are insulated from how data is actually structured and stored!  Logical data independence: Protection from changes in the logical structure of data  Physical data independence: Protection from changes in the physical structure of data  One of the most important benefits of using a DBMS!  Allows changes to be made w/o application rewrites
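Physical data independence can be seen in a small experiment, sketched here with Python's built-in sqlite3 module. The sample rows and the index name enrolled_cid are made up for the illustration. The same query text is run before and after the physical schema changes: the answer stays the same, and only the access plan chosen by the system differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT)")
# Hypothetical data: 100 enrollments split evenly between two courses.
conn.executemany(
    "INSERT INTO Enrolled VALUES (?, ?, ?)",
    [(f"s{i}", "c1" if i % 2 else "c2", "A") for i in range(100)],
)

# The application's query; it never mentions files or indexes.
QUERY = "SELECT COUNT(*) FROM Enrolled WHERE cid = 'c1'"


def run(conn):
    """Return the query answer and the plan SQLite chose for it."""
    count = conn.execute(QUERY).fetchone()[0]
    plan = " ".join(
        row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)
    )
    return count, plan


before_count, before_plan = run(conn)

# Change the physical schema only; the query text is untouched.
conn.execute("CREATE INDEX enrolled_cid ON Enrolled(cid)")

after_count, after_plan = run(conn)
print(before_count, "|", before_plan)
print(after_count, "|", after_plan)
```

The exact wording of the plan text varies across SQLite versions, so it is printed rather than relied upon.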

12  User query (in SQL, against the external schema):  SELECT c.cid, c.enrollment FROM CourseInfo c WHERE c.cname = 'Computer Game Design'  Equivalent query (against the conceptual schema):  SELECT e.cid, count(*) FROM Enrolled e, Courses c WHERE e.cid = c.cid AND c.cname = 'Computer Game Design' GROUP BY e.cid  Under the hood (against the physical schema):  Access Courses – use index on cname to find associated cid  Access Enrolled – use index on cid to count the enrollments
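To check that the view query and the base-table query really return the same answer, one could run both against a tiny database. The sketch below uses Python's built-in sqlite3 module. The sample rows, the index names, and the view definition are assumptions for illustration, and the base-table query is written in a form SQLite accepts (count(*) and GROUP BY e.cid).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Courses  (cid TEXT, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);

-- Indexes matching the "under the hood" steps (names are hypothetical).
CREATE INDEX courses_cname ON Courses(cname);
CREATE INDEX enrolled_cid  ON Enrolled(cid);

-- Assumed view definition; the slides give only its signature.
CREATE VIEW CourseInfo AS
  SELECT c.cid AS cid, c.cname AS cname, COUNT(*) AS enrollment
  FROM Enrolled e, Courses c
  WHERE e.cid = c.cid
  GROUP BY c.cid, c.cname;

-- Made-up sample data.
INSERT INTO Courses VALUES ('c1', 'Computer Game Design', 4),
                           ('c2', 'Databases', 4);
INSERT INTO Enrolled VALUES ('s1', 'c1', 'A'), ('s2', 'c1', 'B'),
                            ('s1', 'c2', 'A');
""")

VIEW_QUERY = (
    "SELECT c.cid, c.enrollment FROM CourseInfo c "
    "WHERE c.cname = 'Computer Game Design'"
)
BASE_QUERY = (
    "SELECT e.cid, COUNT(*) FROM Enrolled e, Courses c "
    "WHERE e.cid = c.cid AND c.cname = 'Computer Game Design' "
    "GROUP BY e.cid"
)

view_rows = conn.execute(VIEW_QUERY).fetchall()
base_rows = conn.execute(BASE_QUERY).fetchall()
print(view_rows, base_rows)

# Ask SQLite how it would evaluate the base-table query.
for row in conn.execute("EXPLAIN QUERY PLAN " + BASE_QUERY):
    print(row[3])
```

Whether the printed plan actually uses both indexes is up to SQLite's optimizer; the point is that the user wrote neither access path.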

13  A typical DBMS has a layered architecture  The figure doesn’t show the concurrency control and recovery components  This is one of several possible architectures; each actual system has its own variations  [Figure: queries enter at the top and pass down through the layers: Query Optimization and Execution, Relational Operators, Files and Access Methods, Buffer Management, Disk Space Management, DB. Note: these layers must consider concurrency control and recovery.]
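To make the layering concrete, here is a toy sketch in Python. It does not reproduce any real system: every class and function name is invented for illustration, the "disk" is a dictionary, the buffer manager uses a simple least-recently-used policy, and query optimization, concurrency control, and recovery are left out entirely. Each layer talks only to the one directly beneath it, except that the heap file also asks the disk layer to allocate new pages.

```python
from collections import OrderedDict

PAGE_SIZE = 4  # records per page; tiny on purpose


class DiskSpaceManager:
    """Lowest layer: allocates, reads, and writes whole pages."""

    def __init__(self):
        self.pages = {}  # page_id -> list of records (stands in for a disk)

    def allocate(self):
        page_id = len(self.pages)
        self.pages[page_id] = []
        return page_id

    def read(self, page_id):
        return list(self.pages[page_id])

    def write(self, page_id, records):
        self.pages[page_id] = list(records)


class BufferManager:
    """Keeps a bounded number of pages in memory, evicting the LRU page."""

    def __init__(self, disk, capacity=2):
        self.disk, self.capacity = disk, capacity
        self.frames = OrderedDict()  # page_id -> in-memory copy of the page

    def get(self, page_id):
        if page_id in self.frames:
            self.frames.move_to_end(page_id)  # mark as most recently used
        else:
            if len(self.frames) >= self.capacity:
                victim, records = self.frames.popitem(last=False)
                self.disk.write(victim, records)  # write back on eviction
            self.frames[page_id] = self.disk.read(page_id)
        return self.frames[page_id]


class HeapFile:
    """Files and access methods: an unordered file of records."""

    def __init__(self, buffer, disk):
        self.buffer, self.disk = buffer, disk
        self.page_ids = []

    def insert(self, record):
        last_full = (
            self.page_ids
            and len(self.buffer.get(self.page_ids[-1])) >= PAGE_SIZE
        )
        if not self.page_ids or last_full:
            self.page_ids.append(self.disk.allocate())
        self.buffer.get(self.page_ids[-1]).append(record)

    def scan(self):
        for page_id in self.page_ids:
            yield from self.buffer.get(page_id)


# Relational operators: work on streams of records, not on pages.
def select(rows, predicate):
    return (r for r in rows if predicate(r))


def project(rows, columns):
    return (tuple(r[c] for c in columns) for r in rows)


disk = DiskSpaceManager()
enrolled = HeapFile(BufferManager(disk, capacity=2), disk)
for i in range(10):
    enrolled.insert((f"s{i}", "c1" if i % 2 == 0 else "c2", "A"))

# Roughly: SELECT sid FROM Enrolled WHERE cid = 'c1'
result = list(project(select(enrolled.scan(), lambda r: r[1] == "c1"), [0]))
print(result)
```

With ten records and four records per page, the file spans three pages while the buffer holds only two, so the scan forces evictions and write-backs that the operators above never see.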

14  “I like programming languages and compilers”  Consider high-level, declarative languages like SQL  “I like low-level operating systems issues”  DBMSs manage records, memory, locks, logs, …  “I really want to work on distributed systems”  Distributed and parallel database systems are rife with distributed algorithms and systems issues (!)  “Data structure and algorithm design is really cool”  Database indexes are data structures on disk (or flash)  (And so on!)

15  The Web is full of database challenges (“Big Data”!)  A box for keywords only goes so far… ▪ How can I query the web, e.g., “Find me 5-string Fender bass guitars for sale in the $1000-1500 price range”  Click streams and social networks generate lots of data ▪ How can I query and analyze all that data (e.g., to act on it)?  Ubiquitous computing is data-rich, too  Build, deploy, and use location-based data services  Query and aggregate streams of sensor or video data  There’s data everywhere, and of all shapes and sizes  How do we integrate it, e.g., for rapid crisis response?  And when we do, how do we ensure privacy/security?

16  Data store for low-latency, high-traffic Web sites  Only have a few hundred milliseconds to generate an entire page  Data heavily cached outside the DBMS today, which is “far from ideal”  Data systems for offline/batch-oriented processing  I mentioned this before: clickstream analysis, graph analysis, etc.  Potentially interested in faster, approximate answers  Would like to do this in real time as well, as data arrives  Hardware trends (always) present new opportunities  Flash storage, for example  Multicore CPUs (nobody uses them very well yet)  Cool open source work at Facebook related to DBs  Hive: Open source SQL on top of Hadoop  Cassandra: Large-scale distributed storage for semistructured data

17  ASTERIX Goal: To ingest, digest, persist, index, manage, query, analyze, and publish massive quantities of semi-structured information…  (ADM = ASTERIX Data Model, AQL = ASTERIX Query Language)  [Figure: a cluster of nodes, each with CPU(s), main memory, disk, and ADM data, connected by a hi-speed interconnect. Inputs: data loads & feeds from external sources (XML, JSON, …), plus AQL queries & scripting requests and programs. Output: data publishing to external sources and apps.]

18  A DBMS is for storing and querying big datasets  Benefits of using a DBMS are many: enables rapid development of new applications (“what, not how”), recovers after crashes, supports (safe) concurrent access, helps maintain data integrity and security, …  Levels of schema abstraction → data independence  DB research is a vertical slice of all of CS (“for data”)  Big Data experts are in high industrial demand!  Data is what it’s all about today! So, consider taking our three classes: CS 122A/B/C.


