Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Final Review Peixiang Zhao.

2 Final Exam
Time: Wednesday 4/27/2016, 5:30pm to 7:30pm
  – Plan your time well
Venue: LOV 301, in-class exam
Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
Any form of cheating on the examination will result in a zero grade and will be reported to the university

3 Final Exam
Bring your FSU ID to attend the final exam
The exam counts for 40% of your final score
Coverage
  – All materials taught in class AND in the textbook, from Introduction through Clustering

4 Format
One set of true/false questions with brief answers
  – e.g., k-Means can be used to cluster datasets with any arbitrary shape
  – Answer: False. Because ……
Short-answer questions
  – e.g., What are the key differences between decision-tree-based classification and kNN classification?
Several more questions
  – e.g., Compute frequent itemsets and strong association rules
100 points
I believe you have enough time (120 minutes)

5 Final Exam
How to do well in the exam?
  – Review the materials carefully and make sure you understand them
      – Both in the slides and in the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your peers
  – Discuss with the TA and me
      – Monday: 2pm-4pm
  – Relax

7 What is Data Mining
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
Typical procedure
  – Data → Knowledge → Action/Decision → Goal
Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection

8 Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

9 Data
Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot
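
As a quick refresher on the basic statistics listed above, here is a minimal sketch using only Python's standard `statistics` module. The data values and the helper name `summarize` are made up for illustration, and the 1.5 × IQR outlier fence is the usual boxplot convention, assumed here rather than taken from the slides.

```python
import statistics

def summarize(values):
    """Basic descriptive statistics for a list of numbers (illustrative helper)."""
    # Quartiles with linear interpolation between data points ("inclusive" method).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "mean": statistics.mean(values),
        "median": median,
        "modes": statistics.multimode(values),   # all most-frequent values
        "q1": q1,
        "q3": q3,
        "iqr": iqr,
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "stdev": statistics.stdev(values),
        # Boxplot convention (an assumption here): beyond 1.5 * IQR from the quartiles.
        "outliers": [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr],
    }

# Hypothetical data set with one unusually large value.
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = summarize(data)
print(summary)
```

Note that different textbooks and libraries interpolate quartiles slightly differently, so Q1 and Q3 computed by hand may differ a little from the values printed here.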

10 Similarity
Proximity measure for binary attributes
  – Contingency table; symmetric, asymmetric measures; Jaccard coefficient
Minkowski distance
  – Metric
  – Manhattan, Euclidean, supremum distance
  – Cosine similarity
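
The measures above can be sketched in a few lines of plain Python. This is only an illustration: the function names and sample vectors are invented for the example, and the binary measures follow the usual contingency-table counts (f11 = both 1, f00 = both 0, f10 and f01 = mismatches).

```python
import math

def minkowski(x, y, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean, p=inf is supremum."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def binary_counts(x, y):
    """Contingency-table counts (f11, f10, f01, f00) for two 0/1 vectors."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00

def simple_matching(x, y):
    """Symmetric measure: 0-0 matches count as agreement."""
    f11, f10, f01, f00 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def jaccard(x, y):
    """Asymmetric measure: 0-0 matches are ignored."""
    f11, f10, f01, _ = binary_counts(x, y)
    return f11 / (f11 + f10 + f01)

print(minkowski((1, 2), (3, 5), 1))         # Manhattan
print(minkowski((1, 2), (3, 5), 2))         # Euclidean
print(minkowski((1, 2), (3, 5), math.inf))  # supremum
print(cosine_similarity((3, 4), (4, 3)))
print(simple_matching([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
print(jaccard([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```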

11 Data Preprocessing
Data quality
Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
Cleaning noisy data
  – Binning, regression, clustering, human inspection
Handling redundancy in data integration
  – Correlation analysis
      – χ² (chi-square) test
      – Covariance analysis
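
To make the χ² test concrete, the sketch below computes the statistic for a contingency table of two nominal attributes: expected counts come from the row and column totals under independence, and χ² sums (observed − expected)² / expected over all cells. The counts and the function name `chi_square` are illustrative assumptions, not data from the course.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = attribute A (yes/no), columns = attribute B (yes/no).
observed = [[250, 200],
            [50, 1000]]
stat = chi_square(observed)
print(round(stat, 2))
```

A large statistic relative to the χ² critical value for (rows − 1) × (columns − 1) degrees of freedom suggests the two attributes are correlated, so one of them may be redundant after integration.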

12 Data Preprocessing
Data reduction
  – Dimensionality reduction
      – Curse of dimensionality
      – PCA vs. SVD
      – Feature selection
  – Numerosity reduction
      – Regression
      – Histogram, clustering, sampling
  – Data compression
Data transformation
  – Normalization
  – Discretization
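
Two common normalization schemes, min-max and z-score, are sketched below in plain Python. The function names and sample values are made up for illustration, and the z-score here uses the population standard deviation, which is one of several conventions.

```python
import statistics

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score_normalize(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical attribute values.
incomes = [200, 300, 400, 600, 1000]
scaled = min_max_normalize(incomes)
standardized = z_score_normalize(incomes)
print(scaled)
print(standardized)
```

Min-max normalization preserves the relative spacing of the values but is sensitive to outliers, whereas z-score normalization is usually preferred when the true minimum and maximum are unknown.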

13 Frequent Pattern Mining
Definition
  – Frequent itemsets
      – Closed itemsets
      – Maximal itemsets
  – Association rules
      – Support, confidence
Complexity
  – The overall search space formulated as a lattice
Methods
  – Apriori
  – FP-Growth
  – Eclat
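
The definitions above can be checked on a toy database with a brute-force sketch. It enumerates the whole itemset lattice, so it is only meant for tiny examples; the transactions, the threshold, and all function names are assumptions made for illustration.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_count):
    """Brute-force walk over the itemset lattice (exponential in the number of items)."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = support_count(frozenset(combo), transactions)
            if count >= min_count:
                frequent[frozenset(combo)] = count
    return frequent

def closed_and_maximal(frequent):
    """Closed: no proper superset has the same support.
    Maximal: no proper superset is frequent at all."""
    closed, maximal = set(), set()
    for itemset, count in frequent.items():
        supersets = [s for s in frequent if itemset < s]
        if not supersets:
            maximal.add(itemset)
        if all(frequent[s] < count for s in supersets):
            closed.add(itemset)
    return closed, maximal

def confidence(antecedent, consequent, transactions):
    """conf(A -> B) = support(A ∪ B) / support(A)."""
    return (support_count(antecedent | consequent, transactions)
            / support_count(antecedent, transactions))

# Hypothetical transaction database.
db = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}, {"c"}]
freq = frequent_itemsets(db, min_count=2)
closed, maximal = closed_and_maximal(freq)
print(len(freq), sorted(map(sorted, closed)), sorted(map(sorted, maximal)))
print(confidence(frozenset("ab"), frozenset("c"), db))
```

On this toy database every maximal itemset is also closed, which holds in general: maximal itemsets are a subset of closed itemsets, which are a subset of frequent itemsets.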

14 Apriori
The downward closure property
  – Or anti-monotone property of support
Apriori algorithm
  – Candidate generation
      – Self-join
  – Frequency counting
      – Hash tree
Further improvement
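
A minimal level-wise sketch of the idea follows: candidates of size k+1 come from self-joining frequent k-itemsets that share their first k−1 items, and any candidate with an infrequent k-subset is pruned using downward closure. Counting is done with a plain scan over the transactions instead of a hash tree to keep the example short; the data set and function names are illustrative assumptions.

```python
from itertools import combinations

def generate_candidates(frequent_k, k):
    """Self-join frequent k-itemsets (sorted tuples), then prune by downward closure."""
    prev = sorted(frequent_k)
    prev_set = set(prev)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 1] == b[:k - 1]:           # join step: same (k-1)-prefix
                cand = a + (b[-1],)
                # Prune step: every k-subset of the candidate must be frequent.
                if all(sub in prev_set for sub in combinations(cand, k)):
                    candidates.add(cand)
    return candidates

def apriori(transactions, min_count):
    """Return {itemset (sorted tuple): support count} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    level = {}
    for item in items:
        count = sum(1 for t in transactions if item in t)
        if count >= min_count:
            level[(item,)] = count
    frequent, k = {}, 1
    while level:
        frequent.update(level)
        next_level = {}
        for cand in generate_candidates(level.keys(), k):
            # Simple linear scan; a hash tree would speed up this counting step.
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= min_count:
                next_level[cand] = count
        level, k = next_level, k + 1
    return frequent

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
result = apriori(baskets, min_count=3)
for itemset, count in sorted(result.items()):
    print(itemset, count)
```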

15 FP-Growth
Major philosophy
  – Grow long patterns from short ones using local frequent items only
FP-tree
  – Augmented prefix tree
  – Properties
      – Completeness and non-redundancy
FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition
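
The sketch below covers only the data-structure half of FP-growth: building an FP-tree and reading off the conditional pattern base of one item. The recursive mining step is left out for brevity. The class and function names, the tie-breaking rule (alphabetical among equally frequent items), and the toy transactions are all assumptions made for the example.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item, a count, and links to parent/children."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    """Return (root, header) where header maps each frequent item to its tree nodes."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in counts.items() if c >= min_count}
    root = FPNode(None, None)
    header = {item: [] for item in frequent}
    for t in transactions:
        # Keep only frequent items, ordered by descending support (ties: alphabetical).
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1
            node = child
    return root, header

def conditional_pattern_base(item, header):
    """Prefix paths leading to each occurrence of item, with that occurrence's count."""
    base = []
    for node in header[item]:
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((path[::-1], node.count))
    return base

# Hypothetical transactions.
db = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"], ["b", "c"]]
root, header = build_fp_tree(db, min_count=2)
print(conditional_pattern_base("c", header))
```

Completeness can be seen in the tree itself: the counts of all nodes for an item add up to that item's support, so no frequency information is lost by compressing the database into the tree.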

16 ECLAT
Vertical representation of transactional DB
  – Tid-lists
Algorithm
  – DFS-like
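
A compact sketch of the vertical approach: each item is mapped to its tid-list (here a Python set of transaction ids), and the support of a longer itemset is the size of the intersection of tid-lists, explored depth-first. The function name `eclat` and the sample baskets are illustrative assumptions.

```python
def eclat(transactions, min_count):
    """Return {itemset (sorted tuple): support count} using tid-list intersections."""
    # Build the vertical representation: item -> set of transaction ids.
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def extend(prefix, candidates):
        # candidates: (item, tids of prefix + item), all already frequent.
        for idx, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            suffix = []
            for other, other_tids in candidates[idx + 1:]:
                shared = tids & other_tids       # tid-list intersection
                if len(shared) >= min_count:
                    suffix.append((other, shared))
            if suffix:
                extend(itemset, suffix)          # depth-first growth

    start = sorted(((i, t) for i, t in tidlists.items() if len(t) >= min_count),
                   key=lambda pair: pair[0])
    extend((), start)
    return frequent

# Hypothetical market-basket data.
baskets = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
result = eclat(baskets, min_count=3)
for itemset, count in sorted(result.items()):
    print(itemset, count)
```

Unlike Apriori, no pass over the original transactions is needed after the tid-lists are built; the trade-off is that the intermediate tid-lists can consume a lot of memory.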

17 Association Rules
The number of association rules can be exponentially large!
Algorithm
Pattern evaluation
  – Is confidence always an interesting measure for association analysis?
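
Both points above can be illustrated numerically. The sketch first counts the possible rules over d items with the standard closed form 3^d − 2^(d+1) + 1, then shows a rule whose confidence looks high while its lift is below 1, meaning the antecedent actually makes the consequent less likely. The counts are invented for the example, and lift is used here as one common alternative measure, not necessarily the one emphasized in class.

```python
def num_rules(d):
    """Number of rules A -> B with A, B non-empty and disjoint over d items."""
    return 3 ** d - 2 ** (d + 1) + 1

def rule_measures(n, n_a, n_b, n_ab):
    """Support, confidence, and lift of A -> B from raw counts.

    n: total transactions, n_a: contain A, n_b: contain B, n_ab: contain both.
    """
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)   # > 1 positive, < 1 negative correlation
    return support, confidence, lift

print(num_rules(6))

# Hypothetical survey: 1000 people, 200 drink tea, 800 drink coffee, 150 drink both.
support, confidence, lift = rule_measures(n=1000, n_a=200, n_b=800, n_ab=150)
print(support, confidence, lift)
```

Here the rule tea → coffee has 75% confidence, yet 80% of all people drink coffee anyway, so knowing that someone drinks tea lowers the chance that they drink coffee. Confidence alone ignores the baseline support of the consequent.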

18 Classification
Problem definition
  – Training & Test
Classification models
  – Decision tree: Gini index, information gain, error rate
  – Naïve Bayes
  – KNN
  – SVM
Ensemble methods
  – Bagging
  – Boosting
Model evaluation
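
The three impurity measures named for decision trees can be sketched as follows, together with information gain for a candidate split. The function names and the toy labels are assumptions made for illustration.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Relative frequency of each class label."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini index: 1 - sum of squared class probabilities."""
    return 1 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Entropy in bits: -sum p * log2(p)."""
    return -sum(p * math.log2(p) for p in class_probabilities(labels))

def classification_error(labels):
    """Error rate of always predicting the majority class."""
    return 1 - max(class_probabilities(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical node with 5 positive and 5 negative records, split into two children.
parent = ["+"] * 5 + ["-"] * 5
left = ["+", "+", "+", "+", "-"]
right = ["+", "-", "-", "-", "-"]
print(gini(parent), entropy(parent), classification_error(parent))
print(information_gain(parent, [left, right]))
```

All three measures are zero for a pure node and largest when the classes are evenly mixed; a split is chosen to maximize the reduction in impurity.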

19 Clustering
Definition
Types of clustering
Methods
  – K-means
  – Hierarchical clustering
  – DBSCAN
  – Graph based clustering
Impossibility for clustering
Cluster validity
Semi-supervised clustering
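
As a reminder of how K-means iterates, here is a minimal sketch of the standard assign-then-update loop (Lloyd's algorithm) in plain Python. Random initialization from the data points, the parameter names, and the sample points are assumptions for the example; it requires Python 3.8+ for `math.dist`.

```python
import math
import random

def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    return [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def kmeans(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # pick k distinct data points
    for _ in range(max_iters):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                # Update step: centroid becomes the mean of its members.
                new_centroids.append(tuple(sum(col) / len(members)
                                           for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:         # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)

# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids), labels)
```

Because each point is assigned to its nearest centroid, the resulting clusters are convex regions, which is why K-means struggles with arbitrarily shaped clusters where a density-based method such as DBSCAN may do better.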

