Midterm Review

1 - Intro
Data Mining vs. Statistics
– Predictive vs. experimental; data-driven vs. hypothesis-driven
Different types of data
Data Mining pitfalls
– With lots of data you can find anything
Data privacy and security
– Good and bad examples

2 - EDA and Visualization
Good visualization is good analysis
Examples of visualization
– 1-d, 2-d, multivariate
– Histograms, boxplots, scatterplots, density estimates, etc.
– Overplotting with many points
– Conditional plots (small multiples)
– Good and bad examples
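The basic plot types above can be sketched in Python. This is an illustrative example only (the slides name no tools); it assumes matplotlib and NumPy are available and uses synthetic data:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2 * x + rng.normal(scale=0.5, size=500)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(12, 3))
ax1.hist(x, bins=30)                # 1-d: histogram of x
ax1.set_title("Histogram")
ax2.boxplot([x, y])                 # 1-d summaries, side by side
ax2.set_title("Boxplots")
ax3.scatter(x, y, s=5, alpha=0.3)   # 2-d; low alpha mitigates overplotting
ax3.set_title("Scatterplot")
fig.savefig("eda.png")
```

The `alpha=0.3` setting in the scatterplot is one standard remedy for the overplotting problem mentioned above.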

3 - Data Mining Concepts
Preparing data for analysis
– How do you deal with missing data?
– What are good transformations?
– How do you deal with outliers?
Data reduction
– Reducing n: sampling, subsetting
– Reducing p:
  – Principal components: finding projections that preserve variance
    – A scree plot shows how much variance each principal component accounts for
  – MDS:
    – Needs a distance matrix
    – Minimizes a 'stress' function
    – Mostly used for visualization and EDA
In-sample vs. out-of-sample evaluation
– In-sample: must penalize for complexity
– Out-of-sample: use cross-validation to evaluate predictive performance
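The PCA/scree-plot idea can be made concrete with a small sketch (an illustrative example, assuming scikit-learn and NumPy; the slides prescribe no implementation). The data are synthetic: a rank-2 signal plus a little noise, so the first two components should account for nearly all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 200 samples, 5 features; only 2 underlying directions carry signal
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.1 * rng.normal(size=(200, 5))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_  # these are the heights in a scree plot
print(np.round(ratios, 3))
```

Plotting `ratios` against component index gives the scree plot; the sharp drop after the second component is the usual cue for choosing the reduced dimension.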

3 - Data Mining Concepts (continued)
Complexity/performance tradeoff
Evaluating classification models
– Accuracy (how many did I get right): not the best choice
– Precision/recall or sensitivity/specificity tradeoff
– Sweeping the classification threshold traces out the ROC curve
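To see the threshold tradeoff numerically, here is a small illustrative example (assuming scikit-learn; the labels and scores are made up for the demonstration). Each threshold yields one accuracy/precision/recall operating point; sweeping it traces the ROC curve, which the AUC summarizes:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.8, 0.85, 0.9])

for thr in (0.3, 0.5, 0.7):
    y_pred = (scores >= thr).astype(int)   # one operating point per threshold
    print(f"thr={thr}: acc={accuracy_score(y_true, y_pred):.2f} "
          f"prec={precision_score(y_true, y_pred):.2f} "
          f"rec={recall_score(y_true, y_pred):.2f}")

auc = roc_auc_score(y_true, scores)        # threshold-free summary
print("AUC:", auc)
```

Raising the threshold typically trades recall for precision, which is exactly the tradeoff the slide refers to.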

4 - Regression
Linear regression
– What is it, what are the assumptions, and how do you check them?
– Model selection: exhaustive or greedy (forward/backward selection) search
Extensions of linear regression
– Non-linear in form, linear in the parameters
– Generalized Linear Models
  – Logistic regression
  – Poisson regression
– Shrinkage
  – Ridge regression
  – Lasso regression
  – Profile plots show the trace of the parameter estimates
– Principal component regression
– Nonparametric models
  – Smoothing splines
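The contrast between the two shrinkage methods can be shown in a few lines (an illustrative sketch, assuming scikit-learn; the data and penalty strengths are made up). Ridge shrinks all coefficients toward zero, while the lasso shrinks and sets some exactly to zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(2)
n, p = 100, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[0], beta[1] = 3.0, -2.0            # only the first two predictors matter
y = X @ beta + rng.normal(scale=0.5, size=n)

ols   = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)     # shrinks every coefficient toward 0
lasso = Lasso(alpha=0.2).fit(X, y)      # shrinks, and zeros out weak ones

print("OLS  :", np.round(ols.coef_, 2))
print("Ridge:", np.round(ridge.coef_, 2))
print("Lasso:", np.round(lasso.coef_, 2))
```

Refitting over a grid of `alpha` values and plotting each coefficient's path gives the profile (trace) plots mentioned above.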

5 - Classification
Categorical or binary response – 'supervised' learning
LDA: fit a parametric model to each class
Classification (decision) trees
– Binary splits on any predictor X
– Best split found algorithmically, using Gini or entropy to maximize purity
– Best tree size can be found via cross-validation
– Can be unstable
K-Nearest Neighbors
– Tradeoff of large vs. small k
Probabilistic models
– Bayes error rate: best possible error if the model is correct
– Naïve Bayes: independence assumption on p(x_i | c)
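Two of the classifiers above, evaluated the way the earlier slide recommends (cross-validation), can be sketched as follows. This is an illustrative example assuming scikit-learn and its bundled iris data; the tree depth and k are arbitrary choices, not values from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # size controls complexity
knn  = KNeighborsClassifier(n_neighbors=5)                  # k controls smoothness

# 5-fold cross-validation estimates out-of-sample accuracy
tree_acc = cross_val_score(tree, X, y, cv=5).mean()
knn_acc  = cross_val_score(knn, X, y, cv=5).mean()
print(f"tree: {tree_acc:.3f}  knn: {knn_acc:.3f}")
```

Varying `max_depth` (or `n_neighbors`) over a grid and cross-validating each setting is how the "best size" and the large-vs-small-k tradeoff are resolved in practice.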

6 - Clustering
No response variable – 'unsupervised' learning
Needs distance measures
– Euclidean, cosine, Jaccard, edit, ordinal and categorical
K-means
– Select an initial solution
– Classify points, then re-calculate the means
Hierarchical clustering
– Solutions for all k from 1 to n
– The dendrogram is an effective visualization
– Different distance functions (linkages) will result in different clusterings
Probabilistic
– Mixture models fit using the EM algorithm
– Model-based clustering
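The two alternating k-means steps on the slide (classify points, then re-calculate the means) can be sketched from scratch with NumPy. A minimal illustration on made-up, well-separated data; the function name and iteration count are our own choices:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    # initial solution: k distinct data points as starting centers
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # step 1: classify each point to its nearest center (Euclidean)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 2: re-calculate each mean (skip empty clusters)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),   # blob near (0, 0)
               rng.normal(5, 0.3, (50, 2))])  # blob near (5, 5)
labels, centers = kmeans(X, k=2)
```

Because k-means depends on the initial solution, it is common to rerun it with several seeds and keep the lowest within-cluster sum of squares.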