CSCI5570 Large Scale Data Processing Systems


CSCI5570 Large Scale Data Processing Systems Course Overview Instructor: Prof. James Cheng

Course Webpage
Check the course webpage regularly: http://www.cse.cuhk.edu.hk/~jcheng/5570.html
Remark: I prefer to put the course webpage under my own directory to make it easier for off-campus access.

Topics Overview
Topic | Tentative Schedule
Introduction and Course Project | Week 1
Prerequisite: Relational Database Systems & Distributed Database Systems | Weeks 2-4, Self Reading
Distributed Data Analytics Systems | Weeks 2-5
NoSQL | Weeks 5-8
NewSQL | Weeks 9-10
Distributed Graph Processing Systems | Weeks 11-12
Distributed Stream Processing Systems | Weeks 12-13
Other Large Scale Data Processing Systems | ???
Distributed Machine Learning Systems | Course Project

Prerequisites
Fundamental concepts of distributed database systems, prerequisite to NoSQL and NewSQL, as well as other distributed data processing systems
Parallel query processing
Distributed query processing

Distributed Data Analytics Systems
Focus on state-of-the-art big data platforms, widely adopted by industry (e.g., Hadoop, Spark) or best in research (e.g., Naiad, Husky)
Fundamental concepts of big data analytics systems
Applications (too ad hoc to teach them all, but you can try them out with the systems taught in class):
  Data collection, data extraction, data cleaning, ...
  Machine learning (e.g., classification, clustering, recommendation, feature selection, dimensionality reduction, ...)
  OLAP, data cube
  Data mining
  Graph analytics (including social network analysis)
  Similarity search (e.g., scalable locality-sensitive hashing)

NoSQL/NewSQL
Relational databases are the foundation of western civilization, but now is the era of NoSQL databases
NoSQL databases, such as MongoDB, Cassandra, and CouchDB, are rapidly taking large shares of the market from traditional vendors such as Oracle
A must-learn for big data analytics
NewSQL databases try to combine the strengths of both traditional DBMSs and NoSQL

Distributed Graph Processing Systems
Graph data: web graphs, online social networks, mobile communication networks, financial networks, biological networks, neural networks, ...
Distributed systems that make the analysis of these large scale graphs/networks possible
Key techniques and algorithms for large scale graph data processing

Distributed Stream Processing Systems
Streaming data has become common today, e.g., tweets, news feeds, ...
How to analyze such massive high-speed data in real time?
Key techniques and applications

Distributed Data Storage Systems
How to store massive volumes of different types of data, retrieve them, and update them efficiently?
How to handle consistency issues?
How to handle availability issues?

Reading List A list of papers for each topic (except for the older topics such as Relational Database Systems and Distributed Database Systems) will be released weekly

Reference
Database Systems: The Complete Book, second edition (Prentice Hall)
Hector Garcia-Molina, Jeffrey Ullman, Jennifer Widom

Reference
Database Management Systems, third edition
Raghu Ramakrishnan, Johannes Gehrke

Assessment Criteria
Survey paper: 30 marks
Select one of the following topics: (1) Distributed Data Analytics Systems, (2) NoSQL, (3) NewSQL, (4) Distributed Graph Processing Systems, (5) Distributed Stream Processing Systems, or (6) any other related topic (please seek the approval of the course instructor first)
Write a survey paper on this topic
The survey paper must cover most of the seminal and state-of-the-art works related to the topic, including a clear introduction to each of these works, a description of the problems they solve and their main ideas, a comparative analysis highlighting their strengths and limitations, and your own conclusions and comments on the topic and its future development
Deadline: Nov 30, 2017 HK time (submit a pdf file with filename “Lastname Firstname” to jcheng@cse.cuhk.edu.hk with email title “5570 survey Lastname Firstname”)

Assessment Criteria Course Project: 70 marks See details in the project specification

Assessment Criteria
You will receive an F grade for the course if your score for the survey paper is less than 10 marks, OR your score for the course project is less than 30 marks
You will receive at least a B- if your score for the survey paper is at least 20 marks, AND your score for the course project is at least 40 marks
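
The two grading rules above can be read as a small decision function. The following is only an illustrative sketch, not an official grading tool: the function name `grade_guarantee` and the returned strings are hypothetical, and scores that fall between the two rules are reported as undetermined because the slides do not say how they are graded.

```python
def grade_guarantee(survey: float, project: float) -> str:
    """Apply the two rules stated on the slide (hypothetical helper).

    survey  : survey paper score, out of 30 marks
    project : course project score, out of 70 marks
    """
    if not (0 <= survey <= 30 and 0 <= project <= 70):
        raise ValueError("survey must be in [0, 30] and project in [0, 70]")

    # Rule 1: failing either component fails the course.
    if survey < 10 or project < 30:
        return "F"

    # Rule 2: meeting both thresholds guarantees a floor of B-.
    if survey >= 20 and project >= 40:
        return "at least B-"

    # The slides give no rule for the remaining cases.
    return "not determined by the stated rules"


if __name__ == "__main__":
    for s, p in [(9, 70), (20, 40), (15, 50)]:
        print(s, p, grade_guarantee(s, p))
```

Note that the first rule uses OR while the second uses AND, so a strong project cannot compensate for a survey paper scoring below 10 marks.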

Academic Honesty
Plagiarism, cheating, and misconduct in tests/exams should be reported to the Faculty Disciplinary Committee for handling.
University Guidelines to Academic Honesty: http://www.cuhk.edu.hk/policy/academichonesty/

Student/Faculty Expectations
Let’s join hands to create a positive, respectful, and engaged academic environment inside and outside the classroom.
Full version of Student/Faculty Expectations on Teaching and Learning: http://www.erg.cuhk.edu.hk/upload/StaffStudentExpectations.pdf