CSCI5570 Large Scale Data Processing Systems
Course Overview
Instructor: Prof. James Cheng
Course Webpage
- Check the course webpage regularly: http://www.cse.cuhk.edu.hk/~jcheng/5570.html
- Remark: I prefer to put the course webpage under my own directory to make it easier for off-campus access.
Topics Overview
Topic | Tentative Schedule
Introduction and Course Project | Week 1
Prerequisite: Relational Database Systems & Distributed Database Systems | Weeks 2-4 (Self Reading)
Distributed Data Analytics Systems | Weeks 2-5
NoSQL | Weeks 5-8
NewSQL | Weeks 9-10
Distributed Graph Processing Systems | Weeks 11-12
Distributed Stream Processing Systems | Weeks 12-13
Other Large Scale Data Processing Systems | ???
Distributed Machine Learning Systems | Course Project
Prerequisites
- Fundamental concepts of distributed database systems, prerequisite to NoSQL and NewSQL as well as other distributed data processing systems
- Parallel query processing
- Distributed query processing
Distributed Data Analytics Systems
- Focus on state-of-the-art big data platforms, widely adopted by industry (e.g., Hadoop, Spark) or best in research (e.g., Naiad, Husky)
- Fundamental concepts of big data analytics systems
- Applications (too ad hoc to teach them all, but you can try them out with the systems taught in class):
  - Data collecting, data extraction, data cleaning, ...
  - Machine learning (e.g., classification, clustering, recommendation, feature selection, dimensionality reduction, ...)
  - OLAP, data cube
  - Data mining
  - Graph analytics (including social network analysis)
  - Similarity search (e.g., scalable locality sensitive hashing)
NoSQL/NewSQL
- Relational databases are the foundation of western civilization, but now is the era of NoSQL databases
- NoSQL databases such as MongoDB, Cassandra, and CouchDB are rapidly taking large shares of the market from traditional vendors such as Oracle
- A must-learn for big data analytics
- NewSQL databases try to combine the strengths of both traditional DBMSs and NoSQL
Distributed Graph Processing Systems
- Graph data: web graphs, online social networks, mobile communication networks, financial networks, biological networks, neural networks, ...
- Distributed systems that make the analysis of these large scale graphs/networks possible
- Key techniques and algorithms for large scale graph data processing
Distributed Stream Processing Systems
- Streaming data has become common today, e.g., tweets, news feeds, ...
- How to analyze such massive high-speed data in real time?
- Key techniques and applications
Distributed Data Storage Systems
- How to store massive volumes of different types of data, retrieve them, and update them efficiently?
- How to handle consistency issues?
- How to handle availability issues?
Reading List
- A list of papers for each topic (except for the older topics such as Relational Database Systems and Distributed Database Systems) will be released weekly
Reference
- Database Systems: The Complete Book, Second edition (Prentice Hall)
- Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom
Reference
- Database Management Systems, Third edition
- Raghu Ramakrishnan, Johannes Gehrke
Assessment Criteria
- Survey paper: 30 marks
- Select one of the following topics: (1) Distributed Data Analytics Systems, (2) NoSQL, (3) NewSQL, (4) Distributed Graph Processing Systems, (5) Distributed Stream Processing Systems, or (6) any other related topic (please seek the approval of the course instructor first)
- Write a survey paper on this topic
- The survey paper must cover most of the seminal and state-of-the-art works related to the topic, including:
  - a clear introduction to each of these works
  - a description of the problems they solved and their main ideas
  - a comparative analysis highlighting the strengths and limitations of these works
  - your own conclusions and comments on the topic and its future development, etc.
- Deadline: Nov 30, 2017 HK time (submit a pdf file with filename “Lastname Firstname” to jcheng@cse.cuhk.edu.hk with email title “5570 survey Lastname Firstname”)
Assessment Criteria
- Course Project: 70 marks
- See details in the project specification
Assessment Criteria
- You will receive an F grade for the course if your score for the survey paper is less than 10 marks, OR your score for the course project is less than 30 marks
- You will receive at least a B- if your score for the survey paper is at least 20 marks, AND your score for the course project is at least 40 marks
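The two grade rules on this slide leave a gap: scores that clear the F thresholds but miss either B- threshold are not covered. The sketch below encodes only the stated thresholds to make that explicit. The function name `grade_bound` and the returned strings are hypothetical, chosen for illustration; the actual grade in the uncovered range is not specified here.

```python
def grade_bound(survey: float, project: float) -> str:
    """Return what the two stated rules guarantee for a pair of scores.

    survey is out of 30 marks, project is out of 70 marks.
    """
    if not (0 <= survey <= 30 and 0 <= project <= 70):
        raise ValueError("survey is out of 30 marks, project out of 70")
    # Rule 1: falling below either minimum means an F for the course.
    if survey < 10 or project < 30:
        return "F"
    # Rule 2: meeting both higher thresholds guarantees at least a B-.
    if survey >= 20 and project >= 40:
        return "B- or better"
    # Neither rule applies; the slide does not state the outcome here.
    return "not determined by these rules"


if __name__ == "__main__":
    for scores in [(25, 45), (9, 60), (15, 50)]:
        print(scores, "->", grade_bound(*scores))
```

For example, a survey score of 15 with a project score of 50 avoids the F rule but does not meet the B- guarantee, so the stated rules alone do not fix the grade.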
Academic Honesty
- Plagiarism, cheating, and misconduct in tests/exams should be reported to the Faculty Disciplinary Committee for handling.
- University Guidelines to Academic Honesty: http://www.cuhk.edu.hk/policy/academichonesty/
Student/Faculty Expectations
- Let’s join hands to create a positive, respectful, and engaged academic environment inside and outside the classroom.
- Full version of Student/Faculty Expectations on Teaching and Learning: http://www.erg.cuhk.edu.hk/upload/StaffStudentExpectations.pdf