Tallahassee, Florida, 2016 CIS4930 Introduction to Data Mining Midterm Review Peixiang Zhao.



Midterm Exam
Time: Wednesday 3/2/2016, 5:15pm to 6:30pm
  – Plan your time well
Venue: LOV 301, in-class exam
Closed book, closed notes, but you can bring a one-page cheat sheet (A4, double-sided)
  – Plan your strategy well
No calculators or other electronic devices
  – Laptops, iPads, smartphones, etc. are prohibited
Any form of cheating on the examination will result in a zero grade and will be reported to the university

Midterm Exam
15% of your final score
Format
  1. True/False questions with explanations
  2. Short-answer questions: testing for basic concepts
Make your answers clear and succinct
  Example 1: What is the difference between Apriori and FP-Growth?
  Example 2: Compute the Manhattan distance between data points
Coverage
  – From “Introduction” to “Frequent Pattern Mining”

Midterm Exam
How to do well in the midterm exam?
  – Review the materials carefully and make sure you understand them
      – Both in the slides and in the textbook
  – Reexamine the homework and make sure you can work out the solutions independently
  – Discuss with your peers
  – Discuss with the TA and me
  – Relax

What is Data Mining
Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  – a.k.a. KDD (knowledge discovery in databases)
Typical procedure
  – Data → Knowledge → Action/Decision → Goal
Representative examples
  – Frequent pattern & association rule mining
  – Classification
  – Clustering
  – Outlier detection

Data Mining Tasks
Prediction methods: use some variables to predict unknown or future values of other variables
  – Classification
  – Regression
  – Outlier detection
Description methods: find human-interpretable patterns that describe the data
  – Clustering
  – Association rule mining

Data
Types of attributes
  – Nominal, ordinal, interval, ratio
  – Discrete, continuous
Basic statistics
  – Mean, median, mode
  – Quantiles: Q1, Q3; IQR
  – Variance; standard deviation
Visualization tools
  – Boxplot
  – Histogram
  – Q-Q plot
  – Scatter plot
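
The basic statistics listed above can be checked by hand on a small sample. The sketch below uses Python's standard `statistics` module; the sample values and the choice of the "inclusive" quantile method are assumptions made for illustration, and other quantile conventions give slightly different Q1 and Q3.

```python
import statistics

def summarize(values):
    """Return the summary statistics named on the slide for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2 (median), Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                            # interquartile range
        "variance": statistics.variance(values),   # sample variance (divides by n - 1)
        "stdev": statistics.stdev(values),         # sample standard deviation
    }

# Hypothetical sample, chosen only so the numbers are easy to verify by hand.
data = [2, 4, 4, 5, 7, 9, 11]
summary = summarize(data)
print(summary)
```

A boxplot draws exactly these quantities: the box spans Q1 to Q3, the line inside it is the median, and points far outside the box (commonly more than 1.5 × IQR away) are flagged as potential outliers.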

Similarity
Proximity measure for binary attributes
  – Contingency table; symmetric, asymmetric measures; Jaccard coefficient
Minkowski distance
  – Metric
  – Manhattan, Euclidean, supremum distance
  – Cosine similarity
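
As a quick refresher, the measures above might be computed as follows. This is a minimal sketch; the function names are illustrative and the inputs are assumed to be equal-length numeric (or 0/1) sequences.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance of order p: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def supremum(x, y):
    """Supremum (Chebyshev) distance: the limit of Minkowski as p grows."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard(x, y):
    """Jaccard coefficient for binary vectors (asymmetric: 0/0 matches are ignored)."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0

p1, p2 = (1, 2), (4, 6)
print(minkowski(p1, p2, 1))   # Manhattan: |1-4| + |2-6|
print(minkowski(p1, p2, 2))   # Euclidean
print(supremum(p1, p2))
print(jaccard([1, 0, 1, 0], [1, 1, 0, 0]))
```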

Data Preprocessing
Data quality
Major tasks in data preprocessing
  – Cleaning, integration, reduction, transformation, discretization
Cleaning noisy data
  – Binning, regression, clustering, human inspection
Handling redundancy in data integration
  – Correlation analysis
      – χ² (chi-square) test
      – Covariance analysis
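
For the χ² test on two nominal attributes, the statistic is the sum over all cells of (observed − expected)² / expected, where the expected count assumes independence. A minimal sketch follows; the contingency table counts are hypothetical and only the statistic is computed (looking up the p-value is left out).

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows and columns are the values of two nominal attributes.
table = [[250, 200],
         [50, 1000]]
stat = chi_square(table)
print(round(stat, 2))
```

A large statistic relative to the χ² distribution with (rows − 1) × (columns − 1) degrees of freedom suggests the attributes are correlated, so one of them may be redundant.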

Data Preprocessing
Data reduction
  – Dimensionality reduction
      – Curse of dimensionality
      – PCA vs. SVD
      – Feature selection
  – Numerosity reduction
      – Regression
      – Histogram, clustering, sampling
  – Data compression

Principal Component Analysis (PCA)
Motivation and objective
  – The direction with the largest projected variance is called the first principal component
  – The orthogonal direction that captures the second largest projected variance is called the second principal component
  – and so on…
General procedure
  – Preprocessing
  – Compute the covariance matrix
  – Derive eigenvectors for projection
Relationship between PCA and SVD
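
The general procedure above can be sketched in plain Python. This is an illustration under stated assumptions rather than the course's reference method: preprocessing is taken to mean mean-centering only, and the leading eigenvector is found by power iteration instead of a full eigendecomposition. The function names and the data are hypothetical.

```python
def covariance_matrix(rows):
    """Mean-center the data, then return the sample covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(rows, iterations=200):
    """Approximate the top eigenvector of the covariance matrix by power iteration.

    Assumes the data is not constant and that the all-ones start vector is not
    orthogonal to the first principal component.
    """
    cov = covariance_matrix(rows)
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Projected variance along v is the corresponding eigenvalue v^T C v.
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
    return v, variance

# Hypothetical points lying exactly on the line y = 2x.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, variance = first_principal_component(points)
print(direction, variance)
```

On the PCA/SVD relationship: if the centered data matrix has SVD X = UΣVᵀ, the covariance matrix is XᵀX/(n − 1) = V(Σ²/(n − 1))Vᵀ, so the right singular vectors are the principal components and the squared singular values, divided by n − 1, are the projected variances.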

Numerosity Reduction
Parametric method
  – Regression
Non-parametric method
  – Histogram
      – Equal-width
      – Equal-frequency
  – Sampling
      – Simple random sampling, sampling without replacement, stratified sampling
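
The difference between equal-width and equal-frequency histograms is easiest to see on a small sorted sample. The sketch below is illustrative only; the values and the bucket count are assumptions.

```python
def equal_width_bins(values, k):
    """Split the value range into k intervals of equal width (assumes max > min)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value would land in bin k, so clamp it into the last bin.
        idx = min(int((v - lo) / width), k - 1)
        bins[idx].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split the sorted values into k buckets holding (nearly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

# Hypothetical sorted sample.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(equal_width_bins(prices, 3))
print(equal_frequency_bins(prices, 3))
```

Equal-width buckets can end up very unevenly filled when the data is skewed, whereas equal-frequency buckets adapt their boundaries to the data.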

Data Transformation
Normalization
  – Min-max
  – Z-score
  – Decimal scaling
Discretization
  – Binning
      – Equal-width
      – Equal-depth
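
The three normalization schemes can be written in a few lines each. This is a minimal sketch: the z-score version uses the population standard deviation, which is an assumption (some texts use the sample standard deviation), and the inputs are assumed to be non-constant.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    biggest = max(abs(v) for v in values)
    j = 0
    while biggest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(min_max([10, 20, 30]))
print(z_score([2, 4, 4, 4, 5, 5, 7, 9]))
print(decimal_scaling([-986, 917]))
```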

Frequent Pattern Mining
Definition
  – Frequent itemsets
      – Closed itemsets
      – Maximal itemsets
  – Association rules
      – Support, confidence
Complexity
  – The overall search space formulated as a lattice
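
To make the definitions concrete, the sketch below enumerates the whole itemset lattice by brute force, which is exponential in the number of items and only workable on toy data. The transactions and the minimum support count are hypothetical.

```python
from itertools import combinations

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets(transactions, min_count):
    """Brute-force walk over the lattice of all non-empty itemsets."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            count = support_count(frozenset(combo), transactions)
            if count >= min_count:
                result[frozenset(combo)] = count
    return result

def closed_itemsets(frequent):
    """Frequent itemsets with no proper superset of the same support."""
    return {s for s, c in frequent.items()
            if not any(s < t and c == frequent[t] for t in frequent)}

def maximal_itemsets(frequent):
    """Frequent itemsets with no frequent proper superset."""
    return {s for s in frequent if not any(s < t for t in frequent)}

def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs ∪ rhs) / support(lhs)."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

# Hypothetical transaction database.
db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("a")]
freq = frequent_itemsets(db, min_count=2)
closed = closed_itemsets(freq)
maximal = maximal_itemsets(freq)
print(len(freq), closed, maximal)
```

Every maximal itemset is closed, but not the other way around; closed itemsets preserve all support information, while maximal itemsets only preserve which itemsets are frequent.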

Apriori
The downward closure property
  – Or anti-monotone property of support
Apriori algorithm
  – Candidate generation
      – Self-join
  – Frequency counting
      – Hash tree
Further improvement
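
A compact sketch of the level-wise Apriori loop follows. It simplifies two steps named above, and these simplifications are assumptions of the sketch: the self-join is done by taking unions of frequent (k−1)-itemsets rather than joining on a shared prefix, and counting is a linear scan instead of a hash tree. The transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {frequent itemset: support count} using level-wise search."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]          # simplified self-join
                # Downward closure: every (k-1)-subset must itself be frequent.
                if len(union) == k and all(
                        frozenset(sub) in current for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Frequency counting by scanning the database (no hash tree here).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Hypothetical transaction database.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(db, min_count=2)
print(result)
```

In this run the candidates {1, 2, 3} and {1, 3, 5} are pruned before any counting because {1, 2} and {1, 5} are infrequent, which is exactly what the downward closure property buys.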

FP-Growth
Major philosophy
  – Grow long patterns from short ones using local frequent items only
FP-tree
  – Augmented prefix tree
  – Properties
      – Completeness and non-redundancy
FP-growth algorithm
  – Progressive subspace projection
  – Early termination condition
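
The following is a simplified FP-growth sketch, not the full algorithm: it builds an FP-tree with a header table, extracts each item's conditional pattern base, and recurses on it, but it omits the single-path early termination shortcut. The class and function names, the tie-breaking order, and the transactions are assumptions made for illustration.

```python
from collections import defaultdict

class Node:
    """One FP-tree node: an item, a count, a parent link, and child links."""
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}

def build_tree(weighted_transactions, min_count):
    """Build an FP-tree from (items, count) pairs; return (header table, item supports)."""
    freq = defaultdict(int)
    for items, count in weighted_transactions:
        for item in items:
            freq[item] += count
    freq = {i: c for i, c in freq.items() if c >= min_count}
    header = defaultdict(list)           # item -> all tree nodes carrying that item
    root = Node(None, None)
    for items, count in weighted_transactions:
        # Keep locally frequent items, most frequent first (ties broken by item value).
        ordered = sorted((i for i in items if i in freq), key=lambda i: (-freq[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, freq

def fp_growth(weighted_transactions, min_count, suffix=frozenset()):
    """Return {frequent itemset: support count}, growing patterns from the suffix."""
    result = {}
    header, freq = build_tree(weighted_transactions, min_count)
    for item, nodes in header.items():
        pattern = suffix | {item}
        result[pattern] = freq[item]
        # Conditional pattern base: the prefix path above each node, weighted by its count.
        base = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        if base:
            # Subspace projection: mine the database conditioned on `pattern`.
            result.update(fp_growth(base, min_count, pattern))
    return result

# Hypothetical transaction database; each transaction starts with weight 1.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
patterns = fp_growth([(t, 1) for t in db], min_count=2)
print(patterns)
```

Unlike Apriori, no candidate itemsets are generated and tested against the database; each recursive call works only on the items that are frequent within its own projected database.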

ECLAT
Vertical representation of transactional DB
  – Tid-lists
Algorithm
  – DFS-like
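
A minimal ECLAT sketch: each item is mapped to the set of transaction ids containing it, and the support of a longer itemset is the size of the intersection of its members' tid-lists. The depth-first recursion below is one straightforward way to organize the search; the names and data are hypothetical.

```python
def eclat(transactions, min_count):
    """Return {frequent itemset: support count} using tid-list intersections."""
    # Vertical layout: item -> set of ids of the transactions that contain it.
    tidlists = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    result = {}

    def extend(prefix, candidates):
        # candidates: (item, tidset) pairs that are frequent together with prefix.
        for i, (item, tids) in enumerate(candidates):
            pattern = prefix | {item}
            result[pattern] = len(tids)
            deeper = []
            for other, other_tids in candidates[i + 1:]:
                shared = tids & other_tids        # support via intersection
                if len(shared) >= min_count:
                    deeper.append((other, shared))
            extend(pattern, deeper)               # depth-first descent

    start = sorted(((i, t) for i, t in tidlists.items() if len(t) >= min_count),
                   key=lambda pair: pair[0])
    extend(frozenset(), start)
    return result

# Hypothetical transaction database.
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
found = eclat(db, min_count=2)
print(found)
```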

Association Rules
The number of association rules can be exponentially large!
Algorithm
Pattern evaluation
  – Is confidence always an interesting measure for association analysis?
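
One way to see why confidence alone can mislead is to compare it with lift, the ratio of the rule's confidence to the baseline frequency of the consequent. Lift is brought in here as an assumption about what the pattern evaluation question is pointing at, and the counts below are hypothetical.

```python
def rule_metrics(n_total, n_x, n_y, n_xy):
    """Support, confidence, and lift of the rule X -> Y from raw transaction counts."""
    support = n_xy / n_total
    confidence = n_xy / n_x
    lift = confidence / (n_y / n_total)   # values below 1 indicate negative correlation
    return support, confidence, lift

# Hypothetical counts: 1000 transactions, 200 contain X, 900 contain Y, 150 contain both.
support, confidence, lift = rule_metrics(1000, 200, 900, 150)
print(support, confidence, lift)
```

Here the rule X → Y has 75% confidence, yet Y appears in 90% of all transactions, so seeing X actually makes Y less likely. A rule can pass a confidence threshold and still be uninteresting or misleading.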
