CS 69995 & CS 79995 ST: Probabilistic Data Management
  Xiang Lian
  Department of Computer Science, Kent State University
  Email: xlian@kent.edu
  Homepage: http://www.cs.kent.edu/~xlian/
Probabilistic Data Management
  An Overview of Probabilistic Data Management
  Data Uncertainty Model
  Probabilistic Query Answering Over Probabilistic and Uncertain Databases
  Probabilistic Graph Databases
  Data Quality in Probabilistic Databases
Background Needed
  Probability & statistics (math)
  Database techniques (e.g., indexing)
  Programming (e.g., C++, Java, or Python)
  You need to be able to look up how to get things done (for example, by reading papers and surveys from online resources such as digital libraries, Google, and Wikipedia)
Skills
  This is a seminar course, in which you will learn how to do research:
  Lecture
  Literature review (survey)
  Project report
  Presentations & demonstrations
  Research collaborations
Study Group
  Please form a team of 2-3 members
  The workload should be distributed evenly among team members
  Each team needs to finish 1 survey + 1 project report + 1 presentation + 1 bonus presentation (optional):
    A survey on a selected research topic
    A project report (including introduction, problem definition, related work, the proposed approaches, experimental evaluation, and conclusions)
    A presentation & demonstration of your research project
    An optional presentation on 1-2 existing research papers in your selected research direction (20-25 minutes)
Survey & Research Project
  I will post a reading list of papers
    It does not include all related work, only a few typical papers in different research directions
    You need to search digital libraries (e.g., ACM Digital Library, IEEE Xplore) and the Web to find more related work in each direction
  You need to decide which topics/problems you want to study
  Please make an appointment with me to discuss your team's research direction (within the first 3 weeks of the semester; on or before Sept. 14)
Scoring and Grading
  5% - Attendance & Questions
  50% - 5 Homeworks (10 points each)
  15% - 1 Survey on papers for the selected research topics in recent database conferences/journals
  20% - Research Project Report
    Code and report for the research project in paper format
  10% - Presentations & Demonstration
    Presentation and demonstration for the proposed research project
  5% - Bonus Points, rated by other team members
  10% - (Optional) Presentation for 1-2 related works in the selected research direction
Scoring and Grading (cont'd)
  B = 80 - 89
  C = 70 - 79
  D = 60 - 69
  F = <60
  The maximum score you can get is: 115!
Use of the Textbook
  No textbooks!!
  Reference books
    Charu C. Aggarwal. Managing and Mining Uncertain Data. Springer Publishing Company, 2009. ISBN: 978-0-387-09689-6 (Print), 978-0-387-09690-2 (Online). https://link.springer.com/book/10.1007%2F978-0-387-09690-2
    Lei Chen and Xiang Lian. Query Processing over Uncertain Databases. In Synthesis Lectures on Data Management, Vol. 4, No. 6, pages 1-101, Morgan & Claypool Publishers, 2012. ISBN: 9781608458929. http://www.morganclaypool.com/doi/abs/10.2200/S00465ED1V01Y201212DTM033
    Dan Suciu, Dan Olteanu, Christopher Ré, and Christoph Koch. Probabilistic Databases. In Synthesis Lectures on Data Management, Morgan & Claypool Publishers, 2011. ISBN-13: 978-1608456802, ISBN-10: 1608456803. http://www.morganclaypool.com/doi/abs/10.2200/S00362ED1V01Y201105DTM016
Online Resources
  The only resources are papers!!
  ACM Digital Library: http://dl.acm.org/
  IEEE Xplore Digital Library: http://ieeexplore.ieee.org/Xplore/home.jsp
  DBLP: http://dblp.uni-trier.de/
  Database Conferences: SIGMOD, PVLDB, ICDE, EDBT, CIKM
  Database Journals: TODS, VLDBJ, TKDE
The Schedule for the Class
  I expect to give lectures introducing the concepts and techniques of probabilistic data management during the first 2 months (September & October)
  Then, each team will submit a survey of related work in the literature (October)
  Finally, each team will identify research problems and find solutions. You need to write a project report in paper format, run experiments (comparing against existing approaches), and present/demonstrate your paper in class (November & December)
Advice & Suggestions
  Editor tools (for the survey and project report): LaTeX vs. MS Word
  Survey
    Check the "Related Work" sections of the most recent papers to find more related papers
    Read the abstracts/introductions of papers, and classify the papers into categories (this will help you later to identify problems that have not been solved before)
  Project Report
    Even if you are not familiar with some topics, try to read as much related work as possible to understand the general problems and solutions in these topics (you can skip parts that are too hard to understand)
    Stick to the problem you want to solve, and use any resource you can find to solve it (note: DO NOT simply apply previous techniques to your problem, since that does not count as your contribution!!)
Advice & Suggestions (cont'd)
  Project Report
    Introduction
    Related work
    Problem definition
    Solutions
    Experiments
    Conclusions
  In the project, please add a module to visualize your experimental results
Advice & Suggestions (cont'd)
  Do not copy from any source (even for the survey)
  Any form of academic dishonesty is strictly forbidden and will be punished to the maximum extent
  Allowing another student to copy one's work will be treated as an act of academic dishonesty, leading to the same penalty as copying
Advice & Suggestions (cont'd)
  If the resulting surveys and papers are novel and of high quality, I highly recommend that you submit them to database conferences or journals
  After this class, self-motivated, hardworking, and creative students with good performance on surveys/papers may have the chance to join my lab (Big Data Science Research Lab)!
Examples of Probabilistic Data (1)

  Witnessed Person   Location   t.p
  PID1               A          0.9
  PID2               B          0.2
  PID3                          0.1

  Person ID   Zip code   Disease
  PID1        44224      (pneumonia, 0.3), (flu, 0.7)
  PID2        44242      (AIDS, 0.9)
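The two tables above can be read as two kinds of uncertainty: the first attaches an existence probability (t.p) to each whole tuple, while the second attaches a probability to each candidate value of one attribute. The following is a minimal Python sketch of that reading, not a method taken from the slides. The variable and function names are hypothetical, and the independence of tuples used in `prob_at_least_one` is an assumption made only for illustration.

```python
from math import prod

# Tuple-level uncertainty: each tuple carries an existence probability (t.p).
# PID3's location is missing in the source table, so it is left as None.
witnessed = [
    {"person": "PID1", "location": "A", "p": 0.9},
    {"person": "PID2", "location": "B", "p": 0.2},
    {"person": "PID3", "location": None, "p": 0.1},
]

# Attribute-level uncertainty: the Disease attribute is a set of
# (value, probability) alternatives for each person.
patients = {
    "PID1": {"zip": "44224", "disease": {"pneumonia": 0.3, "flu": 0.7}},
    "PID2": {"zip": "44242", "disease": {"AIDS": 0.9}},
}


def expected_count(tuples):
    """Expected number of tuples that exist (linearity of expectation)."""
    return sum(t["p"] for t in tuples)


def prob_at_least_one(tuples):
    """Probability that at least one tuple exists.

    ASSUMPTION: tuples exist independently of each other.
    """
    return 1.0 - prod(1.0 - t["p"] for t in tuples)


def prob_disease(pid, disease):
    """Probability that person `pid` has `disease`; 0.0 if it is not listed."""
    return patients[pid]["disease"].get(disease, 0.0)
```

Under these assumptions, `expected_count(witnessed)` sums the three existence probabilities, and `prob_disease("PID1", "flu")` simply looks up the stored alternative. Note that PID2's alternatives sum to 0.9, so the remaining 0.1 is left unassigned here rather than guessed.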
Examples of Probabilistic Data (2) GPS samples Location data are imprecise
Examples of Probabilistic Data (3)
  Inaccuracy of data integration
  Unreliability of data sources
  Data inconsistency
  …
Queries
  k-nearest neighbor query
  Range query
  Top-k query
  Skyline query
  …
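As an illustration of how one of these query types changes when the data are uncertain, here is a minimal Python sketch of a probabilistic range query. It assumes each uncertain object is represented by a list of equally weighted (x, y) location samples (as with GPS samples), and that the query returns every object whose probability of lying in a rectangle is at least a threshold `alpha`. The function names, the sample-based representation, and the threshold semantics are assumptions for this sketch, not definitions from the slides.

```python
def prob_in_range(samples, lo, hi):
    """Fraction of equally weighted (x, y) samples inside the rectangle [lo, hi]."""
    (x_lo, y_lo), (x_hi, y_hi) = lo, hi
    inside = sum(
        1 for x, y in samples
        if x_lo <= x <= x_hi and y_lo <= y <= y_hi
    )
    return inside / len(samples)


def prob_range_query(objects, lo, hi, alpha):
    """Return {object id: probability} for objects with probability >= alpha."""
    result = {}
    for oid, samples in objects.items():
        p = prob_in_range(samples, lo, hi)
        if p >= alpha:
            result[oid] = p
    return result


# Hypothetical data: two objects with four location samples each.
objects = {
    "o1": [(1.0, 1.0), (1.2, 0.9), (2.5, 2.0), (0.8, 1.1)],
    "o2": [(3.0, 3.0), (1.5, 1.5), (3.2, 2.9), (3.1, 3.3)],
}

# Query rectangle [0, 2] x [0, 2] with threshold 0.5.
answers = prob_range_query(objects, (0.0, 0.0), (2.0, 2.0), alpha=0.5)
print(answers)  # {'o1': 0.75}
```

Here o1 has three of its four samples inside the rectangle (probability 0.75) and qualifies, while o2 has only one (probability 0.25) and is filtered out. This brute-force scan touches every sample of every object; avoiding that cost on large data is the efficiency question that indexing and pruning techniques address.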
Probabilistic Query Processing on Uncertain Data
  How to efficiently answer probabilistic queries over large-scale uncertain data?
  How to retrieve accurate query answers with confidence guarantees?