Published by Allen Alan Neal. Modified over 8 years ago.
1
CIS4930 Introduction to Data Mining
Final Review
Peixiang Zhao
Tallahassee, Florida, 2016
2
Final Exam
Time: Wednesday 4/27/2016, 5:30pm to 7:30pm
  – Plan your time well
Venue: LOV 301, in-class exam
Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
Any form of cheating on the examination will result in a zero grade and will be reported to the university
3
Final Exam
Bring your FSU ID to attend the final exam
40% of your final score
Coverage
  – All materials taught in class AND in the textbook, from Introduction through Clustering
4
Format
One set of true/false questions with brief answers
  – e.g., k-means can be used to cluster datasets with any arbitrary shape
  – Answer: False. Because ……
Short-answer questions
  – e.g., What are the key differences between decision-tree-based classification and kNN classification?
Several more questions
  – e.g., Compute frequent itemsets and strong association rules
100 points
I believe you have enough time (120 minutes)
5
Final Exam
How to do well in the exam?
  – Review the materials carefully and make sure you understand them
      Both in the slides and in the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your fellow students
  – Discuss with the TA and me
      Monday: 2pm-4pm
  – Relax
6
Final Exam
7
What is Data Mining
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
Typical procedure
  – Data → Knowledge → Action/Decision → Goal
Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection
8
Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining
9
Data
Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot
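As a quick refresher on the basic statistics listed above, here is a minimal sketch using Python's standard `statistics` module. The sample values are hypothetical, and the quartiles use the module's default "exclusive" method; other quartile conventions (including the one in the textbook) may give slightly different Q1 and Q3 on small samples.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
data = [2, 4, 4, 5, 7, 9, 10, 12]

mean = statistics.mean(data)
median = statistics.median(data)
mode = statistics.mode(data)

# With n=4, statistics.quantiles returns the three quartile cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1  # interquartile range

variance = statistics.variance(data)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)      # sample standard deviation

# Common boxplot convention (an assumption here, not from the slides):
# values more than 1.5 * IQR outside the quartiles are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]
```

The fences are the values a boxplot typically uses to decide which points to draw individually.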
10
Similarity
Proximity measure for binary attributes
  – Contingency table; symmetric, asymmetric measures; Jaccard coefficient
Minkowski distance
  – Metric
  – Manhattan, Euclidean, supremum distance
  – Cosine similarity
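The proximity measures above can be sketched in a few lines of plain Python. The function names are illustrative, not from the slides, and the vectors are assumed to have equal length.

```python
import math

def simple_matching(a, b):
    """Symmetric measure for binary vectors: fraction of positions that agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def jaccard(a, b):
    """Asymmetric measure for binary vectors: ignores 0-0 matches."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Returning 0.0 for two all-zero vectors is a convention chosen here.
    return both / either if either else 0.0

def minkowski(x, y, p):
    """Minkowski distance: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def supremum(x, y):
    """Limit of the Minkowski distance as p grows: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

For the points (1, 2) and (4, 6), the Manhattan, Euclidean, and supremum distances are 7, 5, and 4 respectively.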
11
Data Preprocessing
Data quality
Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
Clean noisy data
  – Binning, regression, clustering, human inspection
Handling redundancy in data integration
  – Correlation analysis
      χ² (chi-square) test
      Covariance analysis
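To make the χ² correlation test for nominal attributes concrete, here is a small sketch that computes the statistic from a contingency table. The 2×2 counts are hypothetical, and `chi_square` is an illustrative helper name.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows are values of attribute A, columns values of B.
observed = [[250, 200],
            [50, 1000]]
chi2 = chi_square(observed)  # roughly 507.94
```

A 2×2 table has 1 degree of freedom, where the critical value at the 0.001 significance level is about 10.83, so a statistic this large would lead us to reject independence and treat the two attributes as strongly correlated.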
12
Data Preprocessing
Data reduction
  – Dimensionality reduction
      Curse of dimensionality
      PCA vs. SVD
      Feature selection
  – Numerosity reduction
      Regression
      Histogram, clustering, sampling
  – Data compression
Data transformation
  – Normalization
  – Discretization
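The two data transformation items can be illustrated with a short sketch: min-max and z-score normalization, plus equal-width binning as one simple discretization scheme. The function names and sample values are assumptions for illustration, and the code assumes the values are not all identical.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def equal_width_bins(values, k):
    """Discretize into k equal-width intervals; returns a bin index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

# Hypothetical attribute values.
incomes = [200, 300, 400, 600, 1000]
scaled = min_max(incomes)
standardized = z_score(incomes)
bins = equal_width_bins(incomes, 4)
```

Min-max scaling is sensitive to outliers because the extremes define the range, which is one reason z-score normalization is often preferred when the minimum and maximum are unknown or unreliable.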
13
Frequent Pattern Mining
Definition
  – Frequent itemsets
      Closed itemsets
      Maximal itemsets
  – Association rules
      Support, confidence
Complexity
  – The overall search space formulated as a lattice
Methods
  – Apriori
  – FP-Growth
  – Eclat
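To tie the definitions together, the following sketch enumerates the whole itemset lattice by brute force on a tiny hypothetical database and then picks out the closed and maximal frequent itemsets. It is only feasible for toy data, since the lattice has 2^n nodes, which is exactly why Apriori, FP-Growth, and Eclat exist.

```python
from itertools import combinations

# Hypothetical transaction database and minimum support count.
transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a", "d"}]
MIN_COUNT = 2

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Walk every non-empty node of the lattice and keep the frequent ones.
items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        count = support_count(frozenset(combo))
        if count >= MIN_COUNT:
            frequent[frozenset(combo)] = count

# Closed: no proper superset has the same support.
closed = {s for s, c in frequent.items()
          if not any(s < t and c == ct for t, ct in frequent.items())}

# Maximal: no proper superset is frequent at all.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

def confidence(antecedent, consequent):
    """conf(X -> Y) = support(X ∪ Y) / support(X)."""
    return support_count(antecedent | consequent) / support_count(antecedent)
```

Here seven itemsets are frequent, three of them are closed ({a}, {a, b}, {a, b, c}), and only {a, b, c} is maximal, which shows that maximal itemsets are a subset of closed itemsets.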
14
Apriori
The downward closure property
  – Or anti-monotone property of support
Apriori algorithm
  – Candidate generation
      Self-join
  – Frequency counting
      Hash tree
Further improvement
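The candidate generation step (self-join followed by pruning with the downward closure property) can be sketched as below. This is a simplified version under stated assumptions: itemsets are sorted tuples, and support counting uses a plain scan over the transactions instead of the hash tree mentioned above.

```python
from itertools import combinations

def apriori_gen(prev_frequent):
    """Build size-k candidates from frequent (k-1)-itemsets (sorted tuples)."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i, p in enumerate(prev):
        for q in prev[i + 1:]:
            # Self-join: merge two itemsets that share their first k-2 items.
            if p[:-1] != q[:-1]:
                break  # sorted order: no later itemset shares this prefix
            candidate = p + (q[-1],)
            # Prune: every (k-1)-subset must itself be frequent.
            if all(sub in prev_set
                   for sub in combinations(candidate, len(candidate) - 1)):
                candidates.append(candidate)
    return candidates

def apriori(transactions, min_count):
    """Level-wise search; returns {itemset tuple: support count}."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [(item,) for item in items]
    frequent = {}
    while candidates:
        # Plain scan for counting; a hash tree would speed this step up.
        counts = {c: sum(1 for t in transactions if t.issuperset(c))
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        candidates = apriori_gen(level)
    return frequent
```

For example, from the frequent 2-itemsets (a, b), (a, c), (b, d) the self-join proposes (a, b, c), but pruning discards it because (b, c) is not frequent.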
15
FP-Growth
Major philosophy
  – Grow long patterns from short ones using local frequent items only
FP-tree
  – Augmented prefix tree
  – Properties
      Completeness and non-redundancy
FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition
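As a reminder of how the FP-tree compresses a database, here is a minimal construction sketch. It covers only the prefix tree with counts; the header table, node links, and the recursive mining step are omitted. The class and function names are illustrative, and ties in item frequency are broken alphabetically as an arbitrary choice.

```python
from collections import Counter

class FPNode:
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> FPNode

def build_fp_tree(transactions, min_count):
    """Insert each transaction's frequent items in descending frequency order."""
    counts = Counter(item for t in transactions for item in set(t))
    root = FPNode()
    for t in transactions:
        kept = [item for item in set(t) if counts[item] >= min_count]
        kept.sort(key=lambda item: (-counts[item], item))
        node = root
        for item in kept:
            # Shared prefixes reuse existing nodes and only bump their counts.
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

def paths(node, prefix=()):
    """List (path, count) for every node, in a fixed order, for inspection."""
    out = []
    for item in sorted(node.children):
        child = node.children[item]
        path = prefix + (item,)
        out.append((path, child.count))
        out.extend(paths(child, path))
    return out

# Hypothetical database; item d is infrequent and is dropped from the tree.
db = [["a", "b", "c"], ["a", "b", "c"], ["a", "b"], ["a", "d"], ["b", "c"]]
tree = build_fp_tree(db, min_count=2)
```

Five transactions collapse into five nodes here: the branch a:4 → b:3 → c:2 plus a separate branch b:1 → c:1 for the transaction that does not start with a.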
16
ECLAT
Vertical representation of transactional DB
  – Tid-lists
Algorithm
  – DFS-like
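The vertical representation and the DFS-like search can be sketched as follows. This is a simplified illustration, not the full algorithm from the lecture: tid-lists are Python sets, and the support of an extended itemset is the size of the intersection of two tid-lists.

```python
def eclat(transactions, min_count):
    """Depth-first mining over tid-lists; returns {itemset tuple: support}."""
    # Vertical layout: map each item to the ids of transactions containing it.
    tidlists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tids) pairs that are frequent together with prefix.
        for idx, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            next_candidates = []
            for other, other_tids in candidates[idx + 1:]:
                shared = tids & other_tids  # tid-list intersection
                if len(shared) >= min_count:
                    next_candidates.append((other, shared))
            extend(itemset, next_candidates)

    start = sorted(
        ((item, tids) for item, tids in tidlists.items()
         if len(tids) >= min_count),
        key=lambda pair: pair[0],
    )
    extend((), start)
    return frequent
```

No database rescans are needed after the tid-lists are built, at the cost of keeping the intermediate tid-lists in memory.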
17
Association Rules
The number of association rules can be exponentially large!
Algorithm
Pattern evaluation
  – Is confidence always an interesting measure for association analysis?
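One way to see why confidence alone can mislead is to compare it with lift, which divides the confidence by the baseline frequency of the consequent. The counts below are hypothetical, and `rule_metrics` is an illustrative helper name.

```python
def rule_metrics(n_both, n_antecedent, n_consequent, n_total):
    """Support, confidence, and lift of a rule X -> Y from raw counts."""
    support = n_both / n_total
    confidence = n_both / n_antecedent
    # Lift compares P(Y | X) with the unconditional P(Y).
    lift = confidence / (n_consequent / n_total)
    return support, confidence, lift

# Hypothetical survey of 100 people for the rule {tea} -> {coffee}:
# 20 drink tea, 90 drink coffee, 15 drink both.
support, confidence, lift = rule_metrics(
    n_both=15, n_antecedent=20, n_consequent=90, n_total=100
)
```

The rule has a high confidence of 0.75, yet its lift is below 1 (about 0.83): 90% of all people drink coffee anyway, so knowing that someone drinks tea actually lowers the chance that they drink coffee.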
18
Classification
Problem definition
  – Training & test
Classification models
  – Decision tree: Gini index, information gain, error rate
  – Naïve Bayes
  – kNN
  – SVM
Ensemble methods
  – Bagging
  – Boosting
Model evaluation
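The three impurity measures used for decision tree splits can be computed directly from the class counts at a node. A minimal sketch follows; the function names are illustrative and the example counts are hypothetical.

```python
import math

def gini(counts):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; empty classes contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def classification_error(counts):
    """Error rate of predicting the majority class."""
    return 1.0 - max(counts) / sum(counts)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = sum(parent)
    weighted = sum(sum(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical split of a node with 5 positives and 5 negatives.
gain = information_gain([5, 5], [[4, 1], [1, 4]])  # roughly 0.278 bits
```

All three measures are 0 for a pure node and reach their maximum for a uniform class distribution (0.5, 1.0, and 0.5 respectively in the two-class case).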
19
Clustering
Definition
Types of clustering
Methods
  – k-means
  – Hierarchical clustering
  – DBSCAN
  – Graph-based clustering
Impossibility for clustering
Cluster validity
Semi-supervised clustering
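For k-means, the alternation between the assignment step and the centroid update step (Lloyd's iteration) can be sketched in plain Python. This is a minimal version under stated assumptions: the initial centroids are passed in explicitly instead of being chosen at random, and the points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its old centroid (one of several options).
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (2, 1)])
```

Even with both initial centroids inside the same group, this run recovers the two groups after a few iterations. Because every point is assigned to the nearest centroid, the resulting clusters are convex regions, which is why k-means struggles with arbitrarily shaped clusters.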