Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Midterm Review Peixiang Zhao.

Midterm Exam
Time: Wednesday 3/2/2016, 5:15pm-6:30pm
  – Plan your time well
Venue: LOV 301, in-class exam
Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
Any form of cheating on the examination will result in a zero grade and will be reported to the university

Midterm Exam
15% of your final score
Format
  1. True/False questions with explanations
  2. Short-answer questions: testing for basic concepts
Make your answers clear and succinct
  – Example 1: What is the difference between Apriori and FP-Growth?
  – Example 2: Compute the Manhattan distance between data points
Coverage
  – From “Introduction” to “Frequent Pattern Mining”

Midterm Exam
How to do well in the midterm exam?
  – Review the materials carefully and make sure you understand them
      – Both in the slides and in the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your fellow students
  – Discuss with the TA and me
  – Relax

What is Data Mining
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
Typical procedure
  – Data → Knowledge → Action/Decision → Goal
Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection

Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

Data
Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot
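As a quick refresher, the basic statistics above can be computed with Python's standard `statistics` module. This is only a sketch: the sample values are made up for illustration, and the "inclusive" quartile method is one of several conventions, so a textbook or exam answer may use a slightly different rule for Q1 and Q3.

```python
import statistics

# Hypothetical sample, chosen only for illustration.
values = [4, 8, 15, 16, 23, 42, 8]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
# The "inclusive" method treats the data as the whole population.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range

variance = statistics.variance(values)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)      # square root of the sample variance

print(mean, median, mode, q1, q3, iqr, variance, std_dev)
```

A boxplot draws exactly these quantities: the box spans Q1 to Q3, the line inside is the median, and points far outside the IQR are flagged as potential outliers.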

Similarity
Proximity measure for binary attributes
  – Contingency table; symmetric, asymmetric measures; Jaccard coefficient
Minkowski distance
  – Metric
  – Manhattan, Euclidean, supremum distance
  – Cosine similarity
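The measures listed above can be sketched in a few lines of plain Python. The function names and the sample points are illustrative choices, not taken from the slides; vectors are assumed to have equal length, and the Jaccard function assumes 0/1 attributes with at least one 1 present.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1) between equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def manhattan(x, y):
    """Minkowski distance with p = 1."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Minkowski distance with p = 2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def supremum(x, y):
    """Limit of the Minkowski distance as p goes to infinity."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors: f11 / (f11 + f10 + f01).

    Negative matches (0, 0) are ignored, which makes it an asymmetric measure.
    """
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)
    return f11 / (f11 + mismatches)

# Hypothetical data points.
p1, p2 = (1, 2), (4, 6)
print(manhattan(p1, p2), euclidean(p1, p2), supremum(p1, p2))
```

For these two points the per-coordinate differences are 3 and 4, which makes the three distances easy to check by hand.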

Data Preprocessing
Data quality
Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
Clean noisy data
  – Binning, regression, clustering, human inspection
Handling redundancy in data integration
  – Correlation analysis
      – χ² (chi-square) test
      – Covariance analysis
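For the correlation analysis items, a minimal sketch of the Pearson chi-square statistic (for two nominal attributes) and the sample covariance (for two numeric attributes) might look like this. The contingency-table counts are hypothetical, and the code only computes the statistic; comparing it against a critical value for the chosen significance level is left out.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

def covariance(xs, ys):
    """Sample covariance of two equal-length numeric attributes (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical 2x2 contingency table of observed counts.
observed = [[250, 200],
            [50, 1000]]
print(round(chi_square(observed), 2))
print(covariance([1, 2, 3], [2, 4, 6]))
```

A large chi-square value suggests the two attributes are correlated, so one of them may be redundant after integration. A positive covariance means the two attributes tend to rise together; a value near zero does not by itself prove independence.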

Data Preprocessing
Data reduction
  – Dimensionality reduction
      – Curse of dimensionality
      – PCA vs. SVD
      – Feature selection
  – Numerosity reduction
      – Regression
      – Histogram, clustering, sampling
  – Data compression

Principal Component Analysis (PCA)
Motivation and objective
  – The direction with the largest projected variance is called the first principal component
  – The orthogonal direction that captures the second largest projected variance is called the second principal component
  – and so on…
General procedure
  – Preprocessing
  – Compute the covariance matrix
  – Derive eigenvectors for projection
Relationship between PCA and SVD
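The general procedure and the PCA/SVD relationship can be written out as follows. This is a sketch in standard notation, not necessarily the notation used in lecture: it assumes n data points x_i in d dimensions, takes mean-centering as the preprocessing step, and uses the sample covariance with the 1/(n-1) factor.

```latex
% Preprocessing: center the data (rows of X are the centered points)
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
X = \begin{bmatrix} (x_1 - \bar{x})^{\top} \\ \vdots \\ (x_n - \bar{x})^{\top} \end{bmatrix}
\in \mathbb{R}^{n \times d}

% Covariance matrix
C = \frac{1}{n-1}\, X^{\top} X

% Principal components: eigenvectors of C, ordered by eigenvalue
C v_k = \lambda_k v_k, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

% Projection onto the first r principal components, V_r = [v_1, \dots, v_r]
z_i = V_r^{\top} (x_i - \bar{x})

% Relationship to SVD: if X = U \Sigma V^{\top}, then
C = \frac{1}{n-1}\, V \Sigma^{\top} U^{\top} U \Sigma V^{\top}
  = V \, \frac{\Sigma^{\top}\Sigma}{n-1} \, V^{\top}
\quad\Longrightarrow\quad
\lambda_k = \frac{\sigma_k^{2}}{n-1}
```

In words: the right singular vectors of the centered data matrix are the principal components, and the squared singular values, divided by n-1, are the variances captured along them.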

Numerosity Reduction
Parametric method
  – Regression
Non-parametric method
  – Histogram
      – Equal-width
      – Equal-frequency
  – Sampling
      – Simple, sampling w/o replacement, stratified sampling
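The difference between equal-width and equal-frequency histograms is easiest to see on a small example. The sketch below uses hypothetical values and helper names of my own choosing; it assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_bins(values, k):
    """Split the value range [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k  # assumes hi > lo
    bins = [[] for _ in range(k)]
    for v in sorted(values):
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k bins holding (nearly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

# Hypothetical sorted attribute values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))
print(equal_frequency_bins(prices, 3))
```

Equal-width bins can end up very unevenly filled when the data is skewed, while equal-frequency bins adapt their boundaries to the data.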

Data Transformation
Normalization
  – Min-max
  – Z-score
  – Decimal scaling
Discretization
  – Binning
      – Equal-width
      – Equal-depth
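The three normalization schemes can be sketched as below. The function names and sample values are illustrative. The z-score version uses the population standard deviation, which is an assumption; some texts divide by n - 1 instead. Min-max assumes the values are not all equal.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map [min, max] of the data onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer such that max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical attribute values.
print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```

Min-max is sensitive to outliers because a single extreme value stretches the range; z-score is the usual choice when the true minimum and maximum are unknown.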

Frequent Pattern Mining
Definition
  – Frequent itemsets
      – Closed itemsets
      – Maximal itemsets
  – Association rules
      – Support, confidence
Complexity
  – The overall search space formulated as a lattice
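To make the definitions concrete, here is a brute-force sketch on a tiny, hypothetical transaction database. It enumerates the whole itemset lattice, which is exponential in the number of items, so it is only meant to illustrate the definitions and not to be an efficient miner. All names are my own.

```python
from itertools import combinations

# Hypothetical toy transaction database.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}]

def support_count(itemset, db):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in db if set(itemset) <= t)

def confidence(antecedent, consequent, db):
    """conf(X -> Y) = support(X and Y together) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support_count(both, db) / support_count(antecedent, db)

def frequent_itemsets(db, min_count):
    """Walk the full itemset lattice and keep itemsets meeting the threshold."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            count = support_count(combo, db)
            if count >= min_count:
                result[frozenset(combo)] = count
    return result

def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of the same support."""
    return {s for s, c in frequent.items()
            if not any(s < t and c == frequent[t] for t in frequent)}

def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < t for t in frequent)}

frequent = frequent_itemsets(db, min_count=2)
closed = closed_itemsets(frequent)
maximal = maximal_itemsets(frequent)
print(len(frequent), sorted(map(sorted, closed)), sorted(map(sorted, maximal)))
```

On this data every maximal itemset is also closed, but not the other way round, which is the general containment: maximal ⊆ closed ⊆ frequent.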

Apriori
The downward closure property
  – Or anti-monotone property of support
Apriori algorithm
  – Candidate generation
      – Self-join
  – Frequency counting
      – Hash tree
Further improvement
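A compact sketch of the level-wise Apriori loop is given below, with the self-join and the downward-closure pruning made explicit. It is a simplification: frequency counting is done with a plain scan over the database rather than a hash tree, and the toy database and function names are hypothetical.

```python
from itertools import combinations

def apriori_gen(prev_frequent, k):
    """Generate size-k candidates from frequent (k-1)-itemsets given as sorted tuples."""
    prev = sorted(prev_frequent)
    prev_set = set(prev)
    candidates = []
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            # Self-join: merge two itemsets that share their first k-2 items.
            if a[:k - 2] == b[:k - 2]:
                cand = a + (b[-1],)
                # Prune: by downward closure, every (k-1)-subset must be frequent.
                if all(sub in prev_set for sub in combinations(cand, k - 1)):
                    candidates.append(cand)
    return candidates

def count(itemset, db):
    """Support count by a linear scan (a hash tree would speed this up)."""
    return sum(1 for t in db if set(itemset) <= t)

def apriori(db, min_count):
    """Return a dict mapping each frequent itemset (sorted tuple) to its support count."""
    items = sorted(set().union(*db))
    current = [(i,) for i in items if count((i,), db) >= min_count]
    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = count(itemset, db)
        k += 1
        candidates = apriori_gen(current, k)
        current = [c for c in candidates if count(c, db) >= min_count]
    return frequent

# Hypothetical toy transaction database.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(db, min_count=2)
print(frequent)
```

Note that the candidate {A, B, C} is never generated here: {A, B} is infrequent, so the self-join has nothing to extend, which is the downward closure property at work.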

FP-Growth
Major philosophy
  – Grow long patterns from short ones using local frequent items only
FP-tree
  – Augmented prefix tree
  – Properties
      – Completeness and non-redundancy
FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition
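The following is a minimal sketch of FP-growth, assuming transactions are passed as (items, count) pairs so that the same routine can build both the global FP-tree and the conditional ones. It omits the single-path shortcut often used as an early termination condition; recursion simply stops when a conditional pattern base is empty. Class and function names, and the toy database, are hypothetical.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, its count, a parent link, and child links."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build an FP-tree; return (root, header table, frequent item counts)."""
    freq = defaultdict(int)
    for items, count in weighted_transactions:
        for item in items:
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_count}
    root = Node(None, None)
    header = defaultdict(list)  # item -> all tree nodes carrying that item
    for items, count in weighted_transactions:
        # Keep local frequent items only, most frequent first (ties by name).
        ordered = sorted((i for i in items if i in freq),
                         key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += count
    return root, header, freq

def fp_growth(weighted_transactions, min_count, suffix=()):
    """Yield (itemset as sorted tuple, support count) for every frequent itemset."""
    _, header, freq = build_tree(weighted_transactions, min_count)
    for item, nodes in header.items():
        pattern = (item,) + suffix
        yield tuple(sorted(pattern)), freq[item]
        # Conditional pattern base: the prefix path above each node of `item`.
        conditional = []
        for node in nodes:
            path = []
            parent = node.parent
            while parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        # Project onto the subspace of patterns ending in `pattern` and recurse.
        yield from fp_growth(conditional, min_count, pattern)

# Hypothetical toy transaction database, each transaction with weight 1.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
patterns = dict(fp_growth([(t, 1) for t in db], min_count=2))
print(patterns)
```

Unlike Apriori, no candidates are generated and tested: each recursive call works only on the items that are frequent within the current projected database.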

ECLAT
Vertical representation of transactional DB
  – Tid-lists
Algorithm
  – DFS-like
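A short sketch of the ECLAT idea: convert the database to tid-lists, then explore itemsets depth-first, computing each support by intersecting tid-lists instead of rescanning the transactions. The function name and the toy database are hypothetical, and tid-lists are stored as Python sets for simplicity.

```python
def eclat(db, min_count):
    """Depth-first frequent itemset mining over tid-lists (vertical format)."""
    # Vertical representation: item -> set of ids of transactions containing it.
    tidlists = {}
    for tid, transaction in enumerate(db):
        for item in transaction:
            tidlists.setdefault(item, set()).add(tid)

    frequent = {}

    def dfs(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)  # support = length of the tid-list
            # Extend with later items only, so each itemset is visited once.
            extensions = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids
                if len(shared) >= min_count:
                    extensions.append((other, shared))
            dfs(itemset, extensions)

    start = sorted(((i, t) for i, t in tidlists.items() if len(t) >= min_count),
                   key=lambda pair: pair[0])
    dfs((), start)
    return frequent

# Hypothetical toy transaction database.
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = eclat(db, min_count=2)
print(frequent)
```

The trade-off is memory: tid-lists for frequent items in a large database can be long, although they shrink quickly as the search goes deeper.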

Association Rules
The number of association rules can be exponentially large!
Algorithm
Pattern evaluation
  – Is confidence always an interesting measure for association analysis?
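One way to see why confidence alone can mislead is to compare it with lift, a common alternative measure that is not named on the slide and is used here only as an illustration. The counts below are hypothetical.

```python
def confidence(n_xy, n_x):
    """conf(X -> Y) = P(Y | X), from raw transaction counts."""
    return n_xy / n_x

def lift(n_xy, n_x, n_y, n):
    """lift(X -> Y) = P(X and Y) / (P(X) * P(Y)); a value of 1 means independence."""
    return (n_xy / n) / ((n_x / n) * (n_y / n))

# Hypothetical counts: n transactions in total, n_x contain X,
# n_y contain Y, and n_xy contain both.
n, n_x, n_y, n_xy = 1000, 600, 750, 400

rule_confidence = confidence(n_xy, n_x)
rule_lift = lift(n_xy, n_x, n_y, n)
baseline = n_y / n  # P(Y) without knowing anything about X
print(rule_confidence, baseline, rule_lift)
```

With these numbers the rule X -> Y has a confidence of about 67%, which looks strong, yet Y occurs in 75% of all transactions. Knowing X actually makes Y less likely, and the lift below 1 exposes the negative correlation that confidence hides.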
