Linear methods: Regression & Discrimination Sec 4.6.

Corrections & Hints The formula for the entropy of a split (Lec 4, Sep 2) is incorrect. Should be: $H_{\text{split}} = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of the complete data that ends up in branch i.
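
For concreteness, a small worked example of this formula (the 30%/70% split is hypothetical, not from the lecture):

```latex
% Hypothetical two-way split sending 30% of the data to branch 1 and 70% to branch 2
H_{\text{split}} = -\bigl(0.3\log_2 0.3 + 0.7\log_2 0.7\bigr)
                 \approx 0.521 + 0.360
                 \approx 0.881 \text{ bits}
```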

Corrections & Hints Best way to do a random shuffle of a data set? Given: a data matrix X with one instance per row. Want something like: the same matrix with its rows in a uniformly random order.

Corrections & Hints Usual way is an index shuffle: generate an index vector and a random vector, then sort on the random vector; finally, use the "sorted" index vector as an index into the rows of X (see the sketch below).
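
A minimal NumPy sketch of the index shuffle (the array names X, idx, and r are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20).reshape(5, 4)   # toy data matrix: 5 instances, 4 features

# Index shuffle: pair each row index with a random key, sort on the keys,
# then use the permuted index vector to reorder the rows of X.
idx = np.arange(X.shape[0])       # index vector [0, 1, ..., N-1]
r = rng.random(X.shape[0])        # random vector, one key per row
order = idx[np.argsort(r)]        # "sort" the index vector on the random vector
X_shuffled = X[order]             # use the permuted indices to index into X
```

(In modern NumPy, rng.permutation(len(X)) produces the same kind of permuted index vector in one call.)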

Corrections & Hints Concavity: the definition given in the homework is buggy; the final criterion is wrong. A more usual definition is: g() is concave iff, for all $x_1, x_2$ and all $\lambda \in [0, 1]$, $g(\lambda x_1 + (1-\lambda) x_2) \ge \lambda\, g(x_1) + (1-\lambda)\, g(x_2)$.
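
A quick numerical check of the definition (the example functions are mine, not from the homework):

```latex
% g(x) = -x^2 with x_1 = 0,\; x_2 = 2,\; \lambda = 1/2:
g\!\left(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 2\right) = -1
  \;\ge\; \tfrac{1}{2}\,g(0) + \tfrac{1}{2}\,g(2) = -2 \quad\checkmark
% whereas g(x) = x^2 gives 1 \ngeq 2 at the same points, so x^2 is convex, not concave.
```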

Corrections & Hints Missing values: both hepatitis and hypothyroid have many missing attribute (feature) values. You have 3 choices:
- Drop all points (instances) with missing values; report the effects on your data/results.
- Handle missing attributes as in Sec 4.7 or 4.1 of the text; report (if you can) on the effects.
- Find new, similar data sets with no missing values; many are available in the UCI repository or other data sets from the Weka group.

Reminders
- Office hours: Wed 9:00-11:00 AM, FEC345B.
- Homework due at start of next class (Thurs).
- Late policy (from front material): 1 day late => 50% off. I will relax this for this assignment.
- Policy for HW1: 15% per day late (incl. weekend days).
- You may "checkpoint" by handing in what you have on the due date -- penalty points are applied only to the "delta". E.g., turn in 60% of the project on Thurs and 40% on Sat before 4:00 (electronic): total score is 60% + 0.7*40% = 88%.
- Read the cheating policy -- don't do it!!!

Reminder: 1-NN alg Nearest neighbor: find the nearest instance to the query point in feature space and return the class of that instance. Simplest possible distance-based classifier. With more notation: $\hat{y}(x_q) = y_{i^*}$, where $i^* = \arg\min_i d(x_q, x_i)$. The distance function $d$ is anything appropriate to your data.

Reminder: k-NN Slight generalization: k-nearest neighbors (k-NN). Find the k training instances closest to the query point and vote among them for the label. Q: How does this affect the system? Q: Why does it work? (A minimal sketch follows below.)
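
A minimal k-NN sketch under the usual Euclidean-distance assumption (the function and variable names are mine, not from the slides):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=1):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)       # Euclidean distances to all training points
    nearest = np.argsort(dists)[:k]                         # indices of the k closest instances
    return Counter(y_train[nearest]).most_common(1)[0][0]   # majority vote (k=1 is plain 1-NN)

# Toy usage
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y_train = np.array([0, 0, 0, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))   # -> 0
```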

Exercise Show that k-NN does something reasonable. Assume binary data. Let X be the query point and X' be any k-neighbor of X. Let p = Pr[Class(X') == Class(X)] (p > 1/2). What is Pr[X receives the correct label]? What happens as k grows? But there are tradeoffs... Let V(k,N) = volume of the sphere enclosing the k neighbors of X, assuming N points in the data set. For fixed N, what happens to V(k,N) as k grows? For fixed k, what happens to V(k,N) as N grows? What about the radius of V(k,N)?
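
One way to formalize the first question (my sketch: treat the k neighbor labels as independent, each agreeing with X's class with probability p, and take k odd so the vote cannot tie):

```latex
\Pr[\text{X receives the correct label}]
  \;=\; \sum_{j=\lceil k/2 \rceil}^{k} \binom{k}{j}\, p^{\,j} (1-p)^{\,k-j}
```

Since p > 1/2, this probability tends to 1 as k grows (the law of large numbers applied to the majority vote), which is the sense in which larger k helps -- until the volume tradeoff below kicks in.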

The volume question Let V(k,N) = volume of the sphere enclosing the k neighbors of X, assuming N points in the data set. Assume a uniform point distribution. Total volume of the data is 1 (w.l.o.g.). So, on average, $E[V(k,N)] \approx k/N$.
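
A gloss on the radius question (my addition, assuming a d-dimensional feature space): the volume of a d-ball scales as $r^d$, so the expected radius needed to enclose k neighbors scales as

```latex
r(k, N) \;\propto\; \left(\frac{k}{N}\right)^{1/d}
```

In high dimensions this radius shrinks very slowly with N: keeping it below a fixed $\varepsilon$ requires N on the order of $k/\varepsilon^{d}$, i.e. exponentially many points in d.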

1-NN in practice Most common use of 1-nearest neighbor in daily life?

1-NN & speech proc. 1-NN is closely related to the technique of Vector Quantization (VQ), used to compress speech data for cell phones. CD-quality sound: 16 bits/sample at 44.1 kHz ⇒ 88.2 kB/sec ⇒ ~705 kbps. Telephone (land-line) quality: ~10 bits/sample at 10 kHz ⇒ ~12.5 kB/sec ⇒ 100 kbps. Cell phones run at ~9600 bps...

Speech compression via VQ Speech source Raw audio signal

Speech compression via VQ Raw audio “Framed” audio

Speech compression via VQ Framed audio Cepstral (~ smoothed frequency) representation

Speech compression via VQ Cepstrum Downsampled cepstrum

Speech compression via VQ Downsampled (D.S.) cepstrum ⇒ vector representation ⇒ vector quantize (1-NN) ⇒ transmitted exemplar (cell centroid)
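
A minimal vector-quantization sketch of the encode/decode steps (the codebook and frame values here are toy assumptions; a real codec would train the codebook, e.g. with k-means, on cepstral vectors):

```python
import numpy as np

def vq_encode(frames, codebook):
    """Map each frame vector to the index of its nearest codebook centroid (1-NN)."""
    # Pairwise distances: frames (T, d) vs. codebook (K, d) -> (T, K)
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=2)
    return np.argmin(dists, axis=1)           # one small integer index per frame

def vq_decode(indices, codebook):
    """Reconstruct each frame as its transmitted centroid (lossy)."""
    return codebook[indices]

# Toy usage: 3 frames of 2-D "cepstral" vectors, 4-entry codebook
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
frames = np.array([[0.1, 0.2], [0.9, 0.1], [0.8, 0.95]])
codes = vq_encode(frames, codebook)           # [0, 1, 3] -- only 2 bits per frame here
reconstructed = vq_decode(codes, codebook)    # not identical to frames: lossy
```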

Compression ratio Original: 10 bits/sample at 10 kHz; 250 samples/"frame" (25 ms/frame) ⇒ 100 kbps; 2500 bits/frame. VQ compressed: 40 frames/sec, 1 VQ centroid/frame, ~1M centroids ⇒ ~20 bits/centroid ⇒ ~800 bits/sec!

Signal reconstruction Transmitted cell centroid Look up cepstral coefficients Reconstruct cepstrum Convert to audio

Not lossless, though!

Linear Methods

Linear methods Both methods we've seen so far are classification methods and use intersections of linear boundaries. What about regression? What can we do with a single linear surface? Not as much as you'd like, but still surprisingly a lot. Linear regression is the proto-learning method.

Linear regression prelims Basic idea: assume y is a linear function of X: $\hat{y} = w_1 x_1 + \dots + w_d x_d + w_0$. Our job: find the best weights $\mathbf{w}$ to fit y "as well as possible". By "as well as possible", we mean here minimum squared error: $L(\mathbf{w}) = \sum_i \left(y_i - \hat{y}(\mathbf{x}_i)\right)^2$.

Useful definitions Definition: a trick is a clever mathematical hack. Definition: a method is a trick you use more than once.

A helpful "method" Recall $\hat{y} = w_1 x_1 + \dots + w_d x_d + w_0$. Want to be able to easily write $\hat{y} = \mathbf{w}^T \mathbf{x}$. Introduce a "pseudo-feature" of X, $x_0 = 1$ for every instance. Now have: $\mathbf{x} = (1, x_1, \dots, x_d)^T$. And: $\mathbf{w} = (w_0, w_1, \dots, w_d)^T$. So: $\hat{y} = \mathbf{w}^T \mathbf{x}$. And our "loss function" becomes: $L(\mathbf{w}) = \sum_i \left(y_i - \mathbf{w}^T \mathbf{x}_i\right)^2$.
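
In code, the pseudo-feature is just a column of ones prepended to the data matrix (a small sketch; the variable names are mine):

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [1.0, 0.5],
              [4.0, 1.0]])          # 3 instances, 2 real features

# Prepend the pseudo-feature x_0 = 1 so the bias w_0 folds into the weight vector
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (3, 3)

w = np.array([0.5, 2.0, -1.0])      # (w_0, w_1, w_2)
y_hat = X_aug @ w                   # predictions w^T x for every instance
```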

Minimizing loss Finally, we can write: $L(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 = (\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w})$, where X is the (augmented) data matrix with one instance per row. Want the "best" set of w: the weights that minimize the above. Q: how do you find the minimum of a function w.r.t. some parameter?

Minimizing loss Back up to the 1-d case: suppose you had a scalar function l(w) and wanted to find the w that minimizes l(). Std answer: take the derivative, set it equal to 0, and solve $\frac{dl}{dw} = 0$. To be sure of a min, check the 2nd derivative too ($\frac{d^2 l}{dw^2} > 0$)...
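
For concreteness, a worked 1-d instance of this recipe (my example, not the one on the slide): one feature, no bias, squared-error loss.

```latex
l(w) = \sum_i (y_i - w x_i)^2,
\qquad
\frac{dl}{dw} = -2\sum_i x_i (y_i - w x_i) = 0
\;\Rightarrow\;
w^{*} = \frac{\sum_i x_i y_i}{\sum_i x_i^2},
\qquad
\frac{d^2 l}{dw^2} = 2\sum_i x_i^2 > 0 \;\text{(a minimum).}
```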

5 minutes of math... Some useful linear algebra identities: if A and B are matrices, $(AB)^T = B^T A^T$ and $(A^T)^T = A$; also $(AB)^{-1} = B^{-1} A^{-1}$ (for invertible square matrices).

5 minutes of math... What about derivatives of vectors/matrices? There's more than one kind... For the moment, we'll need the derivative of a vector function with respect to a vector. If $\mathbf{x}$ is a vector of variables, $\mathbf{y}$ is a vector of constants, and $A$ is a matrix of constants, then: $\frac{\partial}{\partial \mathbf{x}}\left(\mathbf{y}^T \mathbf{x}\right) = \mathbf{y}$ and $\frac{\partial}{\partial \mathbf{x}}\left(\mathbf{x}^T A\, \mathbf{x}\right) = (A + A^T)\,\mathbf{x}$.

Exercise Derive the vector derivative expressions above. Then find an expression for the minimum-squared-error weight vector $\mathbf{w}$ in the loss function $L(\mathbf{w}) = (\mathbf{y} - X\mathbf{w})^T(\mathbf{y} - X\mathbf{w})$.
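
For reference, one standard route to the answer (my derivation, expanding the loss and applying the identities above):

```latex
L(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\,\mathbf{y}^T X \mathbf{w} + \mathbf{w}^T X^T X\, \mathbf{w}
\;\Rightarrow\;
\nabla_{\mathbf{w}} L = -2\,X^T \mathbf{y} + 2\,X^T X\, \mathbf{w} = 0
\;\Rightarrow\;
\mathbf{w}^{*} = \left(X^T X\right)^{-1} X^T \mathbf{y}
```

assuming $X^T X$ is invertible; in practice one solves the linear system (e.g. with np.linalg.lstsq) rather than forming the inverse explicitly.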