CIS4930 Introduction to Data Mining: Midterm Review
Peixiang Zhao
Tallahassee, Florida, 2016
Midterm Exam
Time: Wednesday 3/2/2016, 5:15pm to 6:30pm
– Plan your time well
Venue: LOV 301, in-class exam
Closed book, closed note, but you can bring a one-page cheat sheet (A4, double-sided)
– Plan your strategy well
No calculators or other electronic devices
– Laptops, iPads, smart phones, etc. are prohibited
Any form of cheating on the examination will result in a zero grade and will be reported to the university
Midterm Exam
15% of your final score
Format
1. True/False questions with explanations
2. Short-answer questions testing basic concepts
Make your answers clear and succinct
– Example 1: What is the difference between Apriori and FP-Growth?
– Example 2: Compute the Manhattan distance between data points
Coverage
– From “Introduction” to “Frequent Pattern Mining”
Midterm Exam
How to do well in the midterm exam?
– Review the materials carefully and make sure you understand them, both in the slides and in the textbook
– Reexamine the homework and make sure you can work out the solutions independently
– Discuss with your peer students
– Discuss with the TA and me
– Relax
What is Data Mining?
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
– a.k.a. KDD (knowledge discovery in databases)
Typical procedure
– Data → Knowledge → Action/Decision → Goal
Representative examples
– Frequent pattern & association rule mining
– Classification
– Clustering
– Outlier detection
Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables
– Classification
– Regression
– Outlier detection
Description methods: find human-interpretable patterns that describe the data
– Clustering
– Association rule mining
Data
Types of attributes
– Nominal, ordinal, interval, ratio
– Discrete, continuous
Basic statistics (see the sketch below)
– Mean, median, mode
– Quantiles: Q1, Q3; IQR
– Variance; standard deviation
Visualization tools
– Boxplot
– Histogram
– Q-Q plot
– Scatter plot
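A minimal Python sketch of the basic statistics listed above; the attribute values are made-up example data, not from the course.

```python
import numpy as np

# Made-up numeric attribute values, just for illustration
x = np.array([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])

mean = x.mean()
median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
iqr = q3 - q1                         # interquartile range
var = x.var()                         # (population) variance
std = x.std()                         # (population) standard deviation

print(mean, median, q1, q3, iqr, var, std)
```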
Similarity
Proximity measures for binary attributes
– Contingency table; symmetric vs. asymmetric measures; Jaccard coefficient
Minkowski distance
– Metric properties
– Manhattan, Euclidean, supremum distance
Cosine similarity
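A minimal sketch of these proximity measures in Python; the vectors are made-up examples (the Manhattan case is the kind of computation Example 2 above asks for).

```python
import numpy as np

# Two made-up numeric points
x = np.array([1.0, 2.0, 0.0, 3.0])
y = np.array([4.0, 0.0, 1.0, 1.0])

manhattan = np.abs(x - y).sum()              # Minkowski with p = 1
euclidean = np.sqrt(((x - y) ** 2).sum())    # Minkowski with p = 2
supremum  = np.abs(x - y).max()              # Minkowski as p -> infinity
cosine    = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Jaccard coefficient for two made-up binary (asymmetric) attribute vectors
a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
jaccard = np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

print(manhattan, euclidean, supremum, cosine, jaccard)
```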
Data Preprocessing
Data quality
Major tasks in data preprocessing
– Cleaning, integration, reduction, transformation, discretization
Cleaning noisy data
– Binning, regression, clustering, human inspection
Handling redundancy in data integration
– Correlation analysis: χ² (chi-square) test (see the sketch below)
– Covariance analysis
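A minimal sketch of the χ² correlation test on a hypothetical 2×2 contingency table; the counts are invented for illustration.

```python
import numpy as np

# Made-up observed counts for two binary attributes
observed = np.array([[250, 200],
                     [ 50, 1000]])

row_sums = observed.sum(axis=1, keepdims=True)
col_sums = observed.sum(axis=0, keepdims=True)
total = observed.sum()

# Expected counts under the independence assumption
expected = row_sums * col_sums / total
chi2 = ((observed - expected) ** 2 / expected).sum()

print(chi2)   # a large value suggests the two attributes are correlated
```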
Data Preprocessing
Data reduction
– Dimensionality reduction
  Curse of dimensionality
  PCA vs. SVD
  Feature selection
– Numerosity reduction
  Regression
  Histogram, clustering, sampling
– Data compression
Principal Component Analysis (PCA)
Motivation and objective
– The direction with the largest projected variance is called the first principal component
– The orthogonal direction that captures the second largest projected variance is called the second principal component
– and so on…
General procedure
– Preprocessing
– Compute the covariance matrix
– Derive eigenvectors for projection
Relationship between PCA and SVD
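A minimal NumPy sketch of this procedure on randomly generated data; the final check illustrates the PCA/SVD relationship (the right singular vectors of the centered data are the principal directions, and the squared singular values divided by n-1 equal the covariance eigenvalues).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # made-up data: rows = points, cols = attributes

# 1. Preprocessing: center each attribute
Xc = X - X.mean(axis=0)

# 2. Compute the covariance matrix
cov = np.cov(Xc, rowvar=False)

# 3. Derive eigenvectors; sort by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first k principal components
k = 2
Z = Xc @ eigvecs[:, :k]

# Relationship to SVD of the centered data matrix
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
print(np.allclose(S ** 2 / (len(X) - 1), eigvals))   # True
```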
Numerosity Reduction
Parametric methods
– Regression
Non-parametric methods
– Histogram
  Equal-width
  Equal-frequency
– Sampling (see the sketch below)
  Simple random sampling, sampling w/o replacement, stratified sampling
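A minimal sketch of the sampling schemes; the records and class labels are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.arange(1000)                      # 1000 hypothetical records
labels = np.where(data < 900, 'A', 'B')     # imbalanced classes: 90% A, 10% B

# Simple random sampling with replacement
srswr = rng.choice(data, size=100, replace=True)

# Simple random sampling without replacement
srswor = rng.choice(data, size=100, replace=False)

# Stratified sampling: sample each class in proportion to its size
parts = []
for cls in np.unique(labels):
    members = data[labels == cls]
    n = int(round(100 * len(members) / len(data)))
    parts.append(rng.choice(members, size=n, replace=False))
stratified = np.concatenate(parts)

print(len(srswr), len(srswor), len(stratified))
```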
Data Transformation
Normalization
– Min-max
– Z-score
– Decimal scaling
Discretization
– Binning
  Equal-width
  Equal-depth
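A minimal sketch of the normalization and binning methods above; the attribute values are made up.

```python
import numpy as np

x = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # made-up attribute values

# Min-max normalization to [0, 1]
minmax = (x - x.min()) / (x.max() - x.min())

# Z-score normalization
zscore = (x - x.mean()) / x.std()

# Decimal scaling: divide by 10^j, with j the smallest integer so all values fall in (-1, 1)
j = int(np.ceil(np.log10(np.abs(x).max() + 1)))
decimal = x / (10 ** j)

# Equal-width binning into 3 bins
width_edges = np.linspace(x.min(), x.max(), 4)
equal_width_bins = np.digitize(x, width_edges[1:-1])

# Equal-depth (equal-frequency) binning into 3 bins
depth_edges = np.percentile(x, [100 / 3, 200 / 3])
equal_depth_bins = np.digitize(x, depth_edges)

print(minmax, zscore, decimal, equal_width_bins, equal_depth_bins)
```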
Frequent Pattern Mining
Definitions
– Frequent itemsets
  Closed itemsets
  Maximal itemsets
– Association rules
  Support, confidence (see the sketch below)
Complexity
– The overall search space formulated as a lattice
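A minimal sketch of computing support and confidence on a small hypothetical transaction database (the transactions and the rule are made up).

```python
transactions = [
    {'bread', 'milk'},
    {'bread', 'diaper', 'beer', 'eggs'},
    {'milk', 'diaper', 'beer', 'cola'},
    {'bread', 'milk', 'diaper', 'beer'},
    {'bread', 'milk', 'diaper', 'cola'},
]

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {milk, diaper} -> {beer}
antecedent = {'milk', 'diaper'}
consequent = {'beer'}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)

print(rule_support, confidence)   # 0.4 and 0.666...
```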
Apriori
The downward closure property
– a.k.a. the anti-monotone property of support
Apriori algorithm
– Candidate generation
  Self-join
– Frequency counting
  Hash tree
Further improvements
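A minimal sketch of Apriori: level-wise candidate generation by self-join, pruning via downward closure, then frequency counting. The transactions and min_sup are made up, and the hash-tree counting optimization is omitted.

```python
from itertools import combinations

transactions = [
    {'A', 'B', 'C'}, {'A', 'B'}, {'A', 'C', 'D'},
    {'B', 'C'}, {'A', 'B', 'C', 'D'},
]
min_sup = 2   # absolute support threshold

def count(candidates):
    """Keep only candidates reaching the minimum support."""
    return {c for c in candidates
            if sum(set(c) <= t for t in transactions) >= min_sup}

# L1: frequent 1-itemsets
items = sorted({i for t in transactions for i in t})
Lk = count(tuple([i]) for i in items)

frequent = set(Lk)
k = 2
while Lk:
    # Self-join: merge frequent (k-1)-itemsets sharing their first k-2 items
    prev = sorted(Lk)
    candidates = {tuple(sorted(set(a) | set(b)))
                  for a in prev for b in prev
                  if a < b and a[:k - 2] == b[:k - 2]}
    # Prune candidates with any infrequent (k-1)-subset (downward closure)
    candidates = {c for c in candidates
                  if all(sub in Lk for sub in combinations(c, k - 1))}
    Lk = count(candidates)
    frequent |= Lk
    k += 1

print(sorted(frequent))
```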
FP-Growth
Major philosophy
– Grow long patterns from short ones using local frequent items only
FP-tree
– Augmented prefix tree
– Properties: completeness and non-redundancy
FP-growth algorithm
– Progressive subspace projection
– Early termination condition
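A minimal sketch of FP-tree construction only (two scans: find the frequent items, then insert transactions in descending frequency order); the recursive FP-growth mining step is omitted, and the transactions and min_sup are made up.

```python
from collections import Counter, defaultdict

transactions = [
    ['f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'],
    ['a', 'b', 'c', 'f', 'l', 'm', 'o'],
    ['b', 'f', 'h', 'j', 'o'],
    ['b', 'c', 'k', 's', 'p'],
    ['a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'],
]
min_sup = 3

# Scan 1: find frequent items and their global frequencies
counts = Counter(item for t in transactions for item in t)
frequent = {i for i, c in counts.items() if c >= min_sup}

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

root = Node(None, None)
header = defaultdict(list)   # item -> list of its nodes (header table)

# Scan 2: insert each transaction, items sorted by descending frequency
for t in transactions:
    items = sorted((i for i in t if i in frequent),
                   key=lambda i: (-counts[i], i))
    node = root
    for i in items:
        if i not in node.children:
            child = Node(i, node)
            node.children[i] = child
            header[i].append(child)
        node = node.children[i]
        node.count += 1

# Completeness and non-redundancy: shared prefixes are stored exactly once
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```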
ECLAT
Vertical representation of the transactional DB
– Tid-lists
Algorithm
– DFS-like
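A minimal sketch of the ECLAT idea: build tid-lists and extend itemsets DFS-style by intersecting them. The transactions and min_sup are made up.

```python
transactions = {
    1: {'A', 'B', 'C'}, 2: {'A', 'B'}, 3: {'A', 'C', 'D'},
    4: {'B', 'C'}, 5: {'A', 'B', 'C', 'D'},
}
min_sup = 2

# Vertical representation: item -> set of transaction ids (tid-list)
tidlists = {}
for tid, items in transactions.items():
    for i in items:
        tidlists.setdefault(i, set()).add(tid)

frequent = {}

def eclat(prefix, candidates):
    """DFS: extend prefix with each candidate, intersecting tid-lists."""
    for i, (item, tids) in enumerate(candidates):
        if len(tids) >= min_sup:
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Build the next level by intersecting with the remaining candidates
            nxt = [(other, tids & other_tids)
                   for other, other_tids in candidates[i + 1:]]
            eclat(itemset, nxt)

eclat((), sorted(tidlists.items()))
print(frequent)
```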
Association Rules
The number of association rules can be exponentially large!
Algorithm
Pattern evaluation
– Is confidence always an interesting measure for association analysis? (see the sketch below)
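A minimal sketch of why a high-confidence rule can still be uninteresting, using lift on made-up counts.

```python
n = 1000          # total transactions (made-up numbers)
n_A = 800         # transactions containing A
n_B = 900         # transactions containing B
n_AB = 720        # transactions containing both A and B

support_AB = n_AB / n             # 0.72
confidence = n_AB / n_A           # P(B | A) = 0.9, looks strong
lift = confidence / (n_B / n)     # 0.9 / 0.9 = 1.0

print(support_AB, confidence, lift)
# High confidence, yet lift = 1 means A and B are independent:
# the rule A -> B adds no information beyond B's own frequency.
```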