Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Midterm Review Peixiang Zhao.


1 Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Midterm Review Peixiang Zhao

2 Midterm Exam
Time: Wednesday 3/2/2016, 5:15pm to 6:30pm
  – Plan your time well
Venue: LOV 301, in-class exam
Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
Any form of cheating on the examination will result in a zero grade and will be reported to the university

3 Midterm Exam
15% of your final score
Format
  1. True/False questions with explanations
  2. Short-answer questions: testing for basic concepts
     Make your answers clear and succinct
     Example 1: What is the difference between Apriori and FP-Growth?
     Example 2: Compute the Manhattan distance between data points
Coverage
  – From “Introduction” to “Frequent Pattern Mining”

4 Midterm Exam
How to do well in the midterm exam?
  – Review the materials carefully and make sure you understand them
      – Both the slides and the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your fellow students
  – Discuss with the TA and me
  – Relax

5 What is Data Mining?
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
Typical procedure
  – Data → Knowledge → Action/Decision → Goal
Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection

6 Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

7 Data
Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot
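
The basic statistics listed above can be computed with Python's standard `statistics` module. The sketch below is illustrative only: the sample values and the helper names `five_number_summary` and `iqr_outliers` are assumptions, and the 1.5 × IQR fence is the usual boxplot convention rather than something stated on the slide.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the numbers a boxplot draws."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

def iqr_outliers(values):
    """Flag values more than 1.5 * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample, chosen only for illustration.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("mean:", statistics.mean(salaries))
print("median:", statistics.median(salaries))
print("modes:", statistics.multimode(salaries))
print("variance:", statistics.pvariance(salaries))
print("std dev:", statistics.pstdev(salaries))
print("five-number summary:", five_number_summary(salaries))
print("outliers:", iqr_outliers(salaries))
```

Different quantile conventions give slightly different Q1 and Q3 values, so an exam answer should follow the definition used in class.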

8 Similarity
Proximity measure for binary attributes
  – Contingency table; symmetric, asymmetric measures; Jaccard coefficient
Minkowski distance
  – Metric
  – Manhattan, Euclidean, supremum distance
  – Cosine similarity
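
As a study aid, the measures above can be written out in a few lines. This is a minimal sketch with assumed function names and made-up vectors: Manhattan and Euclidean are Minkowski distances with p = 1 and p = 2, the supremum distance is the limit as p grows, and the Jaccard coefficient ignores 0/0 matches, which is why it suits asymmetric binary attributes.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def supremum(x, y):
    """Supremum (L-max) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: q / (q + r + s)."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    differ = sum(1 for a, b in zip(x, y) if a != b)
    return both / (both + differ)

# Hypothetical data points.
p1, p2 = (1, 2), (3, 5)
print("Manhattan:", minkowski(p1, p2, 1))
print("Euclidean:", minkowski(p1, p2, 2))
print("Supremum:", supremum(p1, p2))
print("Cosine:", cosine_similarity((3, 4), (6, 8)))
print("Jaccard:", jaccard([1, 0, 1, 0, 0, 0], [1, 0, 1, 0, 0, 1]))
```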

9 Data Preprocessing
Data quality
Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
Clean noisy data
  – Binning, regression, clustering, human inspection
Handling redundancy in data integration
  – Correlation analysis
      – χ² (chi-square) test
      – Covariance analysis
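
Both redundancy checks named above reduce to short formulas. The sketch below assumes a 2×2 contingency table of illustrative counts for the χ² test (nominal attributes) and two short made-up numeric columns for covariance and Pearson correlation; the function names are not from the course material.

```python
import math

def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (count - expected) ** 2 / expected
    return stat

def covariance(x, y):
    """Population covariance of two numeric attributes."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)

def pearson(x, y):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

# Illustrative counts: rows and columns are the values of two nominal attributes.
table = [[250, 200],
         [50, 1000]]
print("chi-square:", round(chi_square(table), 2))

# Hypothetical numeric attributes.
a = [1, 2, 3, 4, 5]
b = [2, 4, 6, 8, 10]
print("covariance:", covariance(a, b))
print("correlation:", pearson(a, b))
```

A large χ² value relative to the critical value for the table's degrees of freedom suggests the two attributes are not independent, so one of them may be redundant.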

10 Data Preprocessing
Data reduction
  – Dimensionality reduction
      – Curse of dimensionality
      – PCA vs. SVD
      – Feature selection
  – Numerosity reduction
      – Regression
      – Histogram, clustering, sampling
  – Data compression

11 Principal Component Analysis (PCA)
Motivation and objective
  – The direction with the largest projected variance is called the first principal component
  – The orthogonal direction that captures the second largest projected variance is called the second principal component
  – and so on…
General procedure
  – Preprocessing
  – Compute the covariance matrix
  – Derive eigenvectors for projection
Relationship between PCA and SVD
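
The general procedure above can be traced on a tiny 2-D dataset. This is a sketch under stated assumptions: the data are made up, the helper names are hypothetical, and power iteration is swapped in for a full eigendecomposition so that the example needs only the standard library. It recovers the first principal component only.

```python
import math
import statistics

def center(rows):
    """Preprocessing: subtract the per-attribute mean."""
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance_matrix(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_component(cov, iterations=200):
    """Power iteration: repeatedly apply cov and renormalize.

    Converges to the eigenvector with the largest eigenvalue as long as the
    start vector is not orthogonal to it (assumed here).
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return eigenvalue, v

# Hypothetical 2-D points lying roughly along a diagonal.
points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

centered = center(points)
cov = covariance_matrix(centered)
eigenvalue, pc1 = first_component(cov)
projected = [sum(r[j] * pc1[j] for j in range(len(pc1))) for r in centered]

print("first principal component:", pc1)
print("variance along it:", eigenvalue)
```

The variance of the projected values equals the leading eigenvalue, which is the sense in which the first principal component captures the largest projected variance.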

12 Numerosity Reduction
Parametric method
  – Regression
Non-parametric method
  – Histogram
      – Equal-width
      – Equal-frequency
  – Sampling
      – Simple, sampling without replacement, stratified sampling
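
The sampling schemes above map directly onto Python's `random` module. The sketch below is illustrative: the records, the strata, and the proportional allocation rule (round each stratum's share, keep at least one record) are assumptions made for the example.

```python
import random
from collections import defaultdict

def sample_without_replacement(population, k, rng):
    """Simple random sample: each record can be drawn at most once."""
    return rng.sample(population, k)

def sample_with_replacement(population, k, rng):
    """Simple random sample where a record may be drawn repeatedly."""
    return rng.choices(population, k=k)

def stratified_sample(records, key, fraction, rng):
    """Draw the same fraction from every stratum so small groups survive."""
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for label in sorted(strata):
        group = strata[label]
        size = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, size))
    return sample

# Hypothetical customer records: (id, age group), skewed 80/20.
records = [(i, "young") for i in range(80)] + [(i, "senior") for i in range(80, 100)]
rng = random.Random(42)  # fixed seed so runs are repeatable

print(sample_without_replacement(records, 5, rng))
print(sample_with_replacement(records, 5, rng))
print(len(stratified_sample(records, key=lambda r: r[1], fraction=0.1, rng=rng)))
```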

13 Data Transformation
Normalization
  – Min-max
  – Z-score
  – Decimal scaling
Discretization
  – Binning
      – Equal-width
      – Equal-depth
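
Each normalization and binning scheme above is a one-line formula. The following sketch uses a small made-up list of prices; the function names are assumptions, and the z-score uses the population standard deviation, so check which definition the course uses.

```python
import math
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean, std = statistics.mean(values), statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |v| below 1."""
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        index = min(int((v - lo) / width), k - 1)  # max value goes in last bin
        bins[index].append(v)
    return bins

def equal_depth_bins(values, k):
    """Split the sorted values into k bins holding about the same count."""
    data = sorted(values)
    size = math.ceil(len(data) / k)
    return [data[i:i + size] for i in range(0, len(data), size)]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # hypothetical values

print("min-max:", min_max(prices))
print("z-score:", z_score(prices))
print("decimal scaling:", decimal_scaling(prices))
print("equal-width:", equal_width_bins(prices, 3))
print("equal-depth:", equal_depth_bins(prices, 3))
```

Equal-width bins can end up with very different counts when the data are skewed, while equal-depth bins keep the counts balanced at the cost of uneven interval widths.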

14 Frequent Pattern Mining
Definition
  – Frequent itemsets
      – Closed itemsets
      – Maximal itemsets
  – Association rules
      – Support, confidence
Complexity
  – The overall search space formulated as a lattice
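
The definitions above can be checked by brute force on a small database. The sketch below enumerates the whole itemset lattice, which is exactly the exponential search that Apriori, FP-Growth, and ECLAT avoid; the transactions and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical market-basket database.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def confidence(antecedent, consequent):
    """conf(A -> B) = support(A and B) / support(A)."""
    return support_count(antecedent | consequent) / support_count(antecedent)

def frequent_itemsets(min_count):
    """Enumerate the full lattice and keep itemsets meeting min_count."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            itemset = frozenset(combo)
            count = support_count(itemset)
            if count >= min_count:
                result[itemset] = count
    return result

def closed_and_maximal(frequent):
    """Closed: no superset with equal support. Maximal: no frequent superset."""
    closed, maximal = set(), set()
    for itemset, count in frequent.items():
        supersets = [other for other in frequent if itemset < other]
        if not supersets:
            maximal.add(itemset)
        if all(frequent[other] < count for other in supersets):
            closed.add(itemset)
    return closed, maximal

frequent = frequent_itemsets(min_count=3)
closed, maximal = closed_and_maximal(frequent)
print("frequent:", len(frequent), "closed:", len(closed), "maximal:", len(maximal))
print("conf({milk, diaper} -> {beer}):",
      confidence(frozenset({"milk", "diaper"}), frozenset({"beer"})))
```

Every maximal itemset is also closed, so the maximal set is the most compact summary and the closed set is the most compact one that still preserves support counts.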

15 Apriori
The downward closure property
  – Or anti-monotone property of support
Apriori algorithm
  – Candidate generation
      – Self-join
  – Frequency counting
      – Hash tree
Further improvement
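
A compact version of the level-wise loop may help when reviewing. This is a sketch, not the reference algorithm from the lecture: candidates come from a self-join on a shared (k-2)-prefix and are pruned with downward closure, but frequency counting uses a plain scan of the database instead of a hash tree. The transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Plain database scan; a hash tree would speed this step up.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = {}
    k = 1
    while level:
        frequent.update(level)
        k += 1
        previous = sorted(tuple(sorted(s)) for s in level)
        candidates = set()
        for a, b in combinations(previous, 2):
            if a[:-1] == b[:-1]:  # self-join: same first k-2 items
                candidate = frozenset(a) | frozenset(b)
                # Downward closure: every (k-1)-subset must be frequent.
                if all(frozenset(sub) in level
                       for sub in combinations(candidate, k - 1)):
                    candidates.add(candidate)
        level = count(candidates)
    return frequent

# Hypothetical market-basket database.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

for itemset, n in sorted(apriori(transactions, 3).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```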

16 FP-Growth
Major philosophy
  – Grow long patterns from short ones using local frequent items only
FP-tree
  – Augmented prefix tree
  – Properties
      – Completeness and non-redundancy
FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition
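
The pattern-growth idea above can be sketched in under fifty lines. The version below is a simplification under stated assumptions: it builds an FP-tree with per-item node lists, extracts each item's conditional pattern base, and recurses on it, but it omits the single-path early termination shortcut. Class and function names, and the transactions, are hypothetical.

```python
from collections import defaultdict

class Node:
    """FP-tree node: a prefix-tree node augmented with a count."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted, min_count):
    """Build an FP-tree from (items, count) pairs; return item counts and node lists."""
    freq = defaultdict(int)
    for items, count in weighted:
        for item in set(items):
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted:
        # Keep only locally frequent items, most frequent first.
        ordered = sorted((i for i in set(items) if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return freq, header

def fp_growth(weighted, min_count, suffix=frozenset()):
    """Return {frozenset: support count} by recursive projection."""
    freq, header = build_tree(weighted, min_count)
    patterns = {}
    for item, support in freq.items():
        pattern = suffix | {item}
        patterns[pattern] = support
        conditional = []  # conditional pattern base for `item`
        for node in header[item]:
            path, parent = [], node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        patterns.update(fp_growth(conditional, min_count, pattern))
    return patterns

# Hypothetical market-basket database.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]
patterns = fp_growth([(t, 1) for t in transactions], min_count=3)
print(len(patterns), "frequent itemsets")
```

No candidate itemsets are generated: each recursive call only sees items that are frequent within the current projection, which is the "local frequent items only" point on the slide.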

17 ECLAT
Vertical representation of transactional DB
  – Tid-lists
Algorithm
  – DFS-like
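
With tid-lists, support counting becomes set intersection. The sketch below is one plausible reading of the DFS-like algorithm named above, using Python sets as tid-lists; the function name and the transactions are hypothetical.

```python
def eclat(transactions, min_count):
    """Return {frozenset: support count} using vertical tid-lists and DFS."""
    # Vertical representation: item -> set of transaction ids containing it.
    tidlists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def dfs(prefix, candidates):
        for index, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[frozenset(itemset)] = len(tids)
            # Extend only with later items; support = size of the intersection.
            extensions = []
            for other, other_tids in candidates[index + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    extensions.append((other, shared))
            dfs(itemset, extensions)

    start = sorted(((item, tids) for item, tids in tidlists.items()
                    if len(tids) >= min_count), key=lambda pair: pair[0])
    dfs((), start)
    return frequent

# Hypothetical market-basket database.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "cola"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "cola"},
]

for itemset, n in sorted(eclat(transactions, 3).items(),
                         key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```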

18 Association Rules
The number of association rules can be exponentially large!
Algorithm
Pattern evaluation
  – Is confidence always an interesting measure for association analysis?
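
One way to see why confidence can mislead is to compare it with the consequent's baseline frequency, which is what lift does. The counts below are illustrative only, and lift is offered as one common alternative measure; the slide itself does not say which measures the lecture covered.

```python
def rule_measures(n_total, n_a, n_b, n_ab):
    """Support, confidence, and lift of the rule A -> B from raw counts."""
    support = n_ab / n_total
    confidence = n_ab / n_a
    lift = confidence / (n_b / n_total)  # equals P(A and B) / (P(A) * P(B))
    return support, confidence, lift

# Illustrative counts: 5000 students, 3000 play basketball (A),
# 3750 eat cereal (B), 2000 do both.
support, confidence, lift = rule_measures(5000, 3000, 3750, 2000)
baseline = 3750 / 5000

print(f"support    = {support:.2f}")
print(f"confidence = {confidence:.3f}")
print(f"P(cereal)  = {baseline:.2f}")
print(f"lift       = {lift:.3f}")
```

Here the rule has fairly high confidence, yet it is lower than the share of all students who eat cereal, so the lift falls below 1 and the two items are in fact negatively correlated.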
