CSE572: Data Mining by H. Liu
Huan Liu, CSE, CEAS, ASU
Agenda
  Self-introduction of all
  What’s the course
  Questions/Answers
  Organization
9/10/2019 CSE572: Data Mining by H. Liu
Contents of basic and advanced topics
  Classification, Clustering, Association, and Applications
Format – An interactive course with ample opportunities to work, create and share
  Paper reading, discussion, project, presentation, or any learning activities you can suggest
Assessment
  Class participation, assignments, quizzes, a course project, presentations, 1 or 2 exams
You: our future successful data miner, likely a zillionaire
TA: Alan Zheng Zhao
Me: Huan Liu
Where: Brickyard 566
When: see the course website, or by appointment
“No pain, no gain”, or “As you sow, so you shall reap”; we will also learn the principle of “No Free Lunch”.
MyASU will be used, so make sure your address is correct and you won’t miss important announcements.
Course Format
What is the effective teaching of graduate data mining? Your feedback is keenly sought.
Current research papers: the main categories can be found on the course web site.
You can choose one of the textbooks listed. It is an entry point for you to access related subjects.
The truth is, it is a fast-changing field.
Everyone is expected to read research papers and participate in class discussion.
Paper presentations. Project presentations.
Presentations will also be evaluated in class.
Point distribution (tentative)
Projects (35%)
Reading/presentation assignment (10%)
Exam(s) (40%)
Assignments (15%), and class participation, quizzes (up to 10% extra credit)
Late penalty, YES, increased exponentially.
Academic integrity
Research paper reading
We will provide a reading list, and you can also choose your favorite.
All are expected to search for and read the selected papers.
  What is it about (e.g., key idea, basic algorithm)?
  What are points to discuss and improve?
  What can we do with it?
What to submit? (see more on the class website)
  A brief report that describes the above and 2 questions suitable for quizzes/tests with solutions
  A set of presentation slides for 20 minutes
Due date: TBA, use digital drop box
Grading criteria include (1) quality of additional papers you select, (2) slides for presentation, (3) the report, and (4) oral presentation.
Oral presentations will be selected among the best submissions, and presenters will be given extra credit based on their presentation.
Presentations can start as early as September, if possible.
Project
Proposal
  Proposal presentation, discussion, revision
  A project worth the effort of a semester’s work
Progress report
Final report
Class presentation and/or demo
One key goal of this course is to take advantage of your intelligence and (limited) experience to expand your knowledge in creating something useful and interesting.
Topic Distribution (tentative)
Categories of interest (including design and implementation)
Data and application security
Data mining and privacy
Data reduction and selection
Streaming data reduction
Dealing with large data (column- & row-wise)
Search bias, overfitting
Learning algorithms
Ensemble methods
Semi-supervised learning
Active learning and co-training
Bioinformatics or others
A discussion board will be created, if needed.
Your first assignment – to think
Think about what you want to accomplish.
List 2 of your areas of interest (don’t be restricted by the previous list; this is one of the rare opportunities that allow you to day-dream and earn grade points).
Pick an area of interest and choose a general topic for paper presentation.
Submission via MyASU or hardcopy.
Complete the above and submit it in the 3rd class (next Tuesday, Sept 5th).
2nd Assignment due on Sept 13
First, state your category of interest.
Second, form presentation groups (3-4 a group).
Third, each student picks a paper from the given category and finds 1 high-quality relevant paper.
  Submit it through MyASU.
  The first student in each category will present the given paper (s/he does not need to look for another paper).
  The TA will organize, help you, and compile a list of all papers at the end.
Write a summary for your selected paper including
  What is it about
  Why is it significant and relevant
  Where is it published and when
Introduction
The need for data mining
Data mining
Text mining
Image mining
Web mining (log, link, content)
Bioinformatics
Many products and abundant applications
Where do we stand
What is data mining
Data mining is
  the extraction of useful patterns from data sources, e.g., databases, texts, web, images;
  the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.
Patterns (1)
Patterns are the relationships and summaries derived through a data mining exercise.
Patterns must be:
  valid
  novel
  potentially useful
  understandable
Patterns (2)
Patterns are used for
  prediction or classification
  describing the existing data
  segmenting the data (e.g., the market)
  profiling the data (e.g., your customers)
  detection (e.g., intrusion, fault, anomaly)
Data mining typically deals with data that have already been collected for some purpose other than data mining.
Data miners usually have no influence on data collection strategies.
Large bodies of data cause new problems: representation, storage, retrieval, analysis, ...
Even with a very large data set, we are usually faced with just a sample from the population.
Data exist in many types (continuous, nominal) and forms (credit card usage records, supermarket transactions, government statistics, text, images, medical records, human genome databases, molecular databases).
Typical DM tasks
Classification: mining patterns that can classify future data into known classes.
Association rule mining: mining any rule of the form X → Y, where X and Y are sets of data items.
Clustering: identifying a set of similar groups in the data.
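To make the rule form X → Y concrete, here is a minimal sketch of how such a rule is commonly scored with support and confidence. These two measures are the standard ones for association rules, but the slide itself does not define them; the function names and the toy transactions below are made up for illustration.

```python
def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return count_containing(itemset, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    x_count = count_containing(x, transactions)
    if x_count == 0:
        return 0.0  # rule never applies, so report zero confidence
    xy_count = count_containing(set(x) | set(y), transactions)
    return xy_count / x_count


# Hypothetical market-basket data (not from the course).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer"}),
    frozenset({"milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers"}),
    frozenset({"bread", "milk", "beer"}),
]

print(support({"bread", "milk"}, transactions))       # 0.6
print(confidence({"bread"}, {"milk"}, transactions))  # 0.75
```

Here the rule {bread} → {milk} holds in 3 of the 4 transactions that contain bread, so its confidence is 0.75.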
Sequential pattern mining: a sequential rule A → B says that event A will be immediately followed by event B with a certain confidence.
Deviation/anomaly/exception detection: discovering the most significant changes in data.
Data visualization (or visual analytics): using graphical methods to show patterns in data.
High performance computing
Bioinformatics
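One simple way to read "A will be immediately followed by B with a certain confidence" is sketched below. It assumes confidence means the fraction of occurrences of A whose very next event is B; the slide does not give a formula, so this definition, the function name, and the sample event logs are illustrative assumptions.

```python
def sequential_rule_confidence(a, b, sequences):
    """Fraction of occurrences of event a that are immediately followed by b.

    Assumption: an occurrence of a at the end of a sequence counts as
    "not followed by b".
    """
    a_count = 0
    ab_count = 0
    for seq in sequences:
        for i, event in enumerate(seq):
            if event == a:
                a_count += 1
                if i + 1 < len(seq) and seq[i + 1] == b:
                    ab_count += 1
    return ab_count / a_count if a_count else 0.0


# Hypothetical session logs, one list of events per user session.
sessions = [
    ["login", "search", "buy"],
    ["login", "search", "logout"],
    ["search", "buy"],
    ["login", "logout"],
]

# "search" occurs 3 times and is immediately followed by "buy" twice.
print(sequential_rule_confidence("search", "buy", sessions))
```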
Why data mining
Rapid computerization of businesses produces huge amounts of data.
How to make best use of data?
A growing realization: knowledge discovered from data can be used for competitive advantage and to increase business intelligence.
There are problems that might not be suitable for data mining – Top 10 Statistics Problems for CapitalOne (Bill Khan’s invited talk at SIGKDD’06)
Make use/sense of your data assets
Many interesting things you want to find cannot be found using database queries:
  “Find me people likely to buy my products”
  “Who are likely to respond to my promotion”
Quickly identify underlying relationships and respond to emerging opportunities
Why now and for the near future
The data is abundant.
The data is being collected or warehoused.
The computing power is affordable.
The competitive pressure is increasing.
Data mining tools have become available.
New challenges
  New data types evolve
  New applications emerge
DM fields
Data mining is an emerging multi-disciplinary field:
  Statistics
  Machine learning
  Databases
  Visualization
  OLAP and data warehousing
  High-performance computing
  ...
Summary
What is data mining?
  KDD - knowledge discovery in databases: non-trivial extraction of implicit, previously unknown and potentially useful information
Why do we need data mining?
  Wide use of computer systems - data explosion - knowledge is power – but we’re data rich, knowledge poor – useful, understandable and actionable knowledge
  ...
Data mining is not plug-and-play, so we are not done yet and need to continue this class …
An Overview of KDD Process (Guess which is which)
Various databases
Integration
Data warehouse
Preprocessing
Data mining
Post-processing
Knowledge
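The stages named on this slide can be sketched as a tiny pipeline. This is only an illustration: it assumes the stages run in the order listed (the slide leaves the ordering as an exercise), and every function body, field name, and record below is a made-up stand-in for what a real system would do at that stage.

```python
from collections import Counter


def integrate(sources):
    """Integration: merge records from several databases into one warehouse."""
    return [record for source in sources for record in source]


def preprocess(records):
    """Preprocessing: drop records that have a missing field."""
    return [r for r in records if all(v is not None for v in r.values())]


def mine(records):
    """Data mining: count how often each (city, product) pair occurs."""
    return Counter((r["city"], r["product"]) for r in records)


def postprocess(patterns, min_count):
    """Post-processing: keep only patterns seen at least min_count times."""
    return {p: c for p, c in patterns.items() if c >= min_count}


# Two hypothetical source databases.
store_a = [
    {"city": "Tempe", "product": "laptop"},
    {"city": "Tempe", "product": "laptop"},
    {"city": "Mesa", "product": None},  # incomplete record
]
store_b = [
    {"city": "Tempe", "product": "laptop"},
    {"city": "Mesa", "product": "phone"},
]

warehouse = integrate([store_a, store_b])
clean = preprocess(warehouse)
knowledge = postprocess(mine(clean), min_count=2)
print(knowledge)  # {('Tempe', 'laptop'): 3}
```

Each stage takes the previous stage's output, which is the point of the diagram: what comes out as "knowledge" depends on every step before the mining itself.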
Web mining – an application
The Web is a massive database
  Semi-structured data
  XML and RDF
Web mining
  Content
  Structure
  Usage
Link analysis