CS 540 Database Management Systems



1 CS 540 Database Management Systems
Course overview

2 Welcome to CS540! Instructor: Arash Termehchy
Assistant Professor at EECS. Research on data management and analytics. Information & Data Management and Analytics (IDEA) Lab.

3 The Era of Big Data
Technological shifts, e.g., mobile devices, have created a staggering number of enormous data sets. These bring both opportunities and challenges.

4 Opportunities: unreasonable effectiveness of data
A. Halevy, et al. The Unreasonable Effectiveness of Data, IEEE Intelligent Systems, 2009. Observation from working with large datasets at Google: more data generally outperforms complex statistical models in data-centric prediction and discovery. Conclusion: usually, there is no need for overly complex statistical models.

5 Opportunities are priceless! The story of John Snow
“In the mid-1850s, Dr. John Snow plotted cholera deaths on a map, and in the corner of a particularly hard-hit buildings was a water pump. A 19th-century version of Big Data, which suggested an association between cholera and the water pump.” Integrating data sets has saved millions of lives!

6 Paradigm shifting influence on scientific discovery
“The Fourth Paradigm: Data-Intensive Scientific Discovery”, Jim Gray. The four paradigms: empirical, theoretical, computational, data-centric. The Sloan Sky Server database is a top-cited resource in the field of astronomy. Astronomical observation => database query. Spread of diseases by analyzing the Google query log. Personalized medicine, drug discovery, …

7 Challenges: data volume
Sloan Sky Server will soon store 30 terabytes per day. The Large Hadron Collider can generate 500 exabytes per day. 90% of the world's data was generated in the last two years (2013). Every two years: ten times more data.

8 Challenges: data variety/ diversity
Database systems used to deal with a single static database. Now we need to transform and/or integrate a large number of evolving data sets, which is impossible to do manually. “A data integration expert is never without a job.”

9 Challenges: usability
“… (in the next few years) we project a need for 1.5 million additional analysts in the United States who can analyze data effectively …”, McKinsey Big Data Study, 2012. Current systems are not built for scientists and normal users. “It may take a PhD in computer science to successfully deploy a data analytics algorithm!”

10 The notion of database management system (DBMS)
Data processing used to be mostly ad-hoc programming. W. McGee, Generalization: Key to Successful Electronic Data Processing, Journal of the ACM, 1959. Generalization, aka abstraction/data modeling. File: a sequence of records. Operations: sort, select part of the file, … Makes data management and processing usable: people can learn and use the abstraction instead of developing new data processing programs. How to build models that provide nice generalizations? How to implement them efficiently?
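A minimal sketch of the "file as a sequence of records" abstraction described on this slide. The record fields, the sample data, and the helper names `select` and `sort_by` are illustrative assumptions, not taken from McGee's paper or from the course materials. The point is only that generic operations can be written once and reused over any file of records.

```python
from operator import itemgetter

# Hypothetical sample "file": a sequence of records (here, dicts).
employees = [
    {"name": "Ada", "dept": "eng", "salary": 120},
    {"name": "Bob", "dept": "ops", "salary": 90},
    {"name": "Cy", "dept": "eng", "salary": 100},
]

def select(records, predicate):
    """Return the part of the file whose records satisfy the predicate."""
    return [r for r in records if predicate(r)]

def sort_by(records, field):
    """Return the records ordered by the given field (ascending)."""
    return sorted(records, key=itemgetter(field))

# Compose the generic operations instead of writing a new ad-hoc program.
eng_by_salary = sort_by(select(employees, lambda r: r["dept"] == "eng"), "salary")
names = [r["name"] for r in eng_by_salary]
print(names)
```

The same two helpers work unchanged on any other list of records, which is the usability argument the slide makes for generalization.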

11 Abstraction is the key
How to develop usable abstractions for our data? Data models and query languages: relational data model, graph data model, … How to implement these abstractions efficiently? Database system internals: storage management, indexing, …
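One way to see the split between the abstraction and its implementation is a small sketch using Python's built-in `sqlite3` module. The table name, column names, and sample rows are illustrative assumptions. The query is written once against the relational abstraction; adding an index changes only the internal access path, not the query text or its answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE star (id INTEGER, name TEXT, magnitude REAL)")
# Illustrative sample rows, assumed for this sketch.
conn.executemany(
    "INSERT INTO star VALUES (?, ?, ?)",
    [(1, "Sirius", -1.46), (2, "Vega", 0.03), (3, "Polaris", 1.98)],
)

# The query is expressed against the abstraction (a relation), not the storage.
QUERY = "SELECT name FROM star WHERE magnitude < 1.0 ORDER BY name"

before = [row[0] for row in conn.execute(QUERY)]

# Change the internals: add an index on the filtered column.
conn.execute("CREATE INDEX star_magnitude ON star (magnitude)")

after = [row[0] for row in conn.execute(QUERY)]
print(before == after)
conn.close()
```

Whether the system actually uses the index is a decision left to its query optimizer; the user-facing query stays the same either way.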

12 What this course is not about
We do not discuss the basic concepts: ER model, relational model, relational algebra, SQL, database design, database programming. You should know them already. If you do not, drop the course and take CS 340. We do not discuss how to tune or implement an application using MySQL, Oracle, …

13 Topics
How to develop usable abstractions for our data? Data independence principle, relational data model, graph data model. How to implement these abstractions efficiently? Storage management and indexing, query processing algorithms, query optimization, concurrency control and recovery, parallel and distributed data processing, data transformation & integration.

14 This is a research-oriented course
Learn & discuss the concepts and algorithms: read and summarize classic and new research papers, and discuss them in our lectures. Develop data systems: apply the lessons learned to interesting data problems.

15 Learning the fundamentals: paper review
Read and summarize the papers before the lecture: What is the main problem discussed in the paper? Why is it important? What are the main ideas of the proposed solutions? What are the final results of the paper? References on the course website on “how to read scientific papers”. Post a 300-word summary of the paper on Piazza by 12:00 pm on the day of the class. Use private posts. One paper per lecture, marked by * on the course website. You can skip two reviews. 10% of the total grade.

16 Learning the fundamentals: Lectures
Review and discuss the papers. Slides will be available on the website after the class. Provide the road map for studying. Attendance is not required but encouraged. Participate and ask questions!

17 Apply your understanding: Assignments
Six assignments: announced on Piazza, posted on the course website. Both written and programming. Submit using TEACH. Write using a word processor and submit as PDF. Start early! 25% of the overall grade.

18 Learning the fundamentals: Exam
Midterm exam in class. Closed books and notes. Tests your knowledge of the papers and subjects discussed in the class. 30% of the total grade. No final exam (instead you work on your projects).

19 Apply your understanding: project
System/research project on data management/analytics. System: build a rather complex system using available methods. Solve a real problem over large data sets. More challenging than a well-defined implementation: identify design choices and resolve tradeoffs. Research: build a system with some novel ideas. Identify an interesting problem and read the state-of-the-art papers on the problem. Propose and implement some new ideas to solve the problem. Groups of 1 – 3 students. Larger groups are not allowed. 35% of the total grade.

20 Project themes
You should pick a project on the following themes. Data interaction: an interactive query interface that learns from previous interactions; an interactive and usable query interface that is easier to use than SQL; an effective and efficient keyword, visual, or natural language interface. Data cleaning & transformation: most datasets are not clean (missing values, …), and most data sets are not in relational format (online posts, spreadsheets, ...). A system that cleans or restructures data and loads it into a relational database to get interesting insights. Reduce the manual cleaning/restructuring burden.

21 Project themes
Data integration: a system that integrates multiple datasets into one relational database. Reduce the amount of manual work in integration. Sample: how to integrate online posts from different websites. Predictive modeling: a system that learns predictive models over large relational databases (automatic feature extraction, relational learning, deep learning), or a system that efficiently performs probabilistic inference over relational databases.

22 Project themes
Your project may combine multiple themes. Example: a system that learns a predictive model over multiple relational databases combines the data integration and predictive modeling themes. A lecture will go over sample projects next week.

23 Project milestones
Project proposal: group members; what do you want to solve? Relevant references; which tools, data sets, and systems will you use? Midterm presentation: 7 minutes (3 + 4 Q&A). Detailed description of the problem. Your approach to solve it. Review of the related work. Your progress, challenges, and your plan to solve them. Presentation in class: 15 minutes ( Q&A). Final report: problem & solution, detailed comparison with the related work, analysis of empirical studies, conclusion.

24 Project
Discuss your ideas and progress with the course staff during the term. Graded based on technical depth and presentation. Check out the course website for more information. Start early!

25 Communication
Communicate with the course staff. TAs: Jose Picado, Parisa Ataie. Piazza: preferred method of communication. Office hours: Arash: Thursday 4:30 – 5:30 pm; Jose: Tuesday/Thursday 10 – 11 am; Parisa: Wednesday 9 – 10 am. Email the staff for other types of questions; use the [cs540] tag in the subject line. Communicate with your peers on course materials and lectures. Check Piazza and the course website for announcements, course policies, and schedule.

26 What is next?
A classic paper on the relational model by its inventor. Physical and logical data independence and how they have evolved.

