CSE 4705 Artificial Intelligence


CSE 4705 Artificial Intelligence Jinbo Bi Department of Computer Science & Engineering http://www.engr.uconn.edu/~jinbo

Machine learning (1) Supervised learning algorithms

Topics in machine learning
- Supervised learning, such as classification and regression
- Unsupervised learning, such as cluster analysis and outlier/novelty detection
- Dimension reduction
- Semi-supervised learning
- Active learning
- Online learning

Common techniques: supervised learning
- Regularized least squares
- Least absolute shrinkage and selection operator (LASSO)
- Neural networks
- Logistic regression
- Decision trees
- Fisher's discriminant analysis
- Support vector machines
- Graphical models

Common techniques: unsupervised learning
- K-means
- Gaussian mixture models
- Hierarchical clustering
- Graph-based clustering (e.g., spectral clustering)

Common techniques: dimension reduction
- Principal component analysis
- Independent component analysis
- Canonical correlation analysis
- Feature selection
- Sparse modeling
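A minimal sketch of principal component analysis from the list above, computed via the eigen-decomposition of the sample covariance matrix (the toy 2-D data lying along y = x are invented for illustration):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (a minimal sketch)."""
    Xc = X - X.mean(axis=0)                    # center the data
    cov = Xc.T @ Xc / (len(X) - 1)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = eigvecs[:, ::-1][:, :n_components]   # directions of largest variance
    return Xc @ top                            # projected coordinates

# Toy data lying mostly along the line y = x
rng = np.random.default_rng(0)
t = rng.normal(size=(100, 1))
X = np.hstack([t, t + 0.01 * rng.normal(size=(100, 1))])
Z = pca(X, n_components=1)
```

With the data nearly one-dimensional, the single retained component captures almost all of the variance.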

Machine learning / Data mining
- Data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information. ACM SIGKDD conference: http://www.kdd.org/kdd2016/
- The ultimate goal of machine learning is the creation and understanding of machine intelligence. Heavily related to statistical learning theory. ICML conference: http://icml.cc/2016/
- Artificial intelligence is the intelligence exhibited by machines or software; it studies how to create computers and computer software that are capable of intelligent behavior. AAAI conference: http://www.aaai.org/Conferences/AAAI/aaai16.php

Supervised learning: definition
- Given a collection of examples (a training set). Each example contains a set of attributes (independent variables); one of the attributes is the target (dependent variable).
- Find a model to predict the target as a function of the values of the other attributes.
- Goal: previously unseen examples should be predicted as accurately as possible. A test set is used to determine the accuracy of the model.
- Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
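A minimal sketch of the train/test split described above, in pure Python (the 80/20 ratio and the toy dataset are illustrative choices, not from the slides):

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Randomly partition a dataset into training and test sets."""
    rng = random.Random(seed)
    shuffled = examples[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]   # (train, test)

# 100 toy examples: (feature, target) pairs
data = [(x, 2 * x + 1) for x in range(100)]
train, test = train_test_split(data, test_fraction=0.2)
```

The model would be fit on `train` only, and its accuracy reported on `test`.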


Supervised learning: classification When the dependent variable is categorical, we have a classification problem.

Classification example: face recognition
- Goal: predict the identity of a face image.
- Approach: align all the images to derive the features, then model the class (identity) based on these features.

Supervised learning: regression When the dependent variable is continuous, we have a regression problem.

Regression example: risk prediction for patients
- Goal: predict the likelihood that a patient will suffer a major complication after a surgical procedure.
- Approach: use patients' vital signs before and after the surgical operation (heart rate, respiratory rate, etc.). Have expert medical professionals monitor the patients and rate the likelihood of a complication. Learn a model that maps patient vital signs to the risk ratings. Use this model to detect potential high-risk patients for a particular surgical procedure.

Unsupervised learning: clustering Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
- data points in one cluster are more similar to one another;
- data points in separate clusters are less similar to one another.
Similarity measures: Euclidean distance if the attributes are continuous; otherwise, problem-specific measures.
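A minimal k-means sketch using the Euclidean distance above (k-means appears in the earlier list of common unsupervised techniques; the two well-separated toy blobs are invented for illustration):

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # assign each point to the nearest center (Euclidean distance)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

# Two well-separated toy blobs, 20 points each
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
labels, centers = kmeans(X, k=2)
```

Points within each blob end up in the same cluster, matching the similarity criterion on the slide.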

Clustering example: high-risk patient detection
- Goal: detect whether a patient will suffer a major complication after a surgical procedure.
- Approach: use patients' vital signs before and after the surgical operation (heart rate, respiratory rate, etc.), and find patients whose symptoms are dissimilar from those of most other patients.

Practice Judge what kind of problem each of the following scenarios poses:
- A student collected a number of online documents about movies, and tries to identify which movie each document discusses.
- In a cognitive test, a person is asked whether he can recognize the color red on a screen. The person presses a button if he thinks he sees red, and otherwise does not. An EEG recording is made during the test. A researcher wants to use the EEG recordings to predict whether the red color was recognized.
- A researcher observed and recorded the weather conditions (temperature, wind speed, snow, etc.) over the past month, and wants to use the data to predict the temperature on the next day.


Review of probability and linear algebra

Basics of probability An experiment (random variable) is a well-defined process with observable outcomes. The set or collection of all outcomes of an experiment is called the sample space, S. An event E is any subset of outcomes from S. For equally likely outcomes, the probability of an event is P(E) = (number of outcomes in E) / (number of outcomes in S).
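A small counting example of the definition above (the two-dice experiment is invented for illustration):

```python
from fractions import Fraction

# Sample space S: all outcomes of rolling two fair dice
S = [(i, j) for i in range(1, 7) for j in range(1, 7)]

# Event E: the two dice sum to 7
E = [outcome for outcome in S if sum(outcome) == 7]

# P(E) = |E| / |S| for equally likely outcomes
p_E = Fraction(len(E), len(S))
```

Here |E| = 6 and |S| = 36, so P(E) = 1/6.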

Probability theory

Probability theory: joint probability, marginal probability, and conditional probability.

Probability theory
- Sum rule: the marginal probability of X equals the sum of the joint probability of X and Y over y: p(X) = Σ_y p(X, Y=y)
- Product rule: the joint probability of X and Y equals the product of the conditional probability of Y given X and the probability of X: p(X, Y) = p(Y|X) p(X)
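The sum and product rules can be checked on a small joint table (the probability values are invented for illustration):

```python
# Toy joint distribution p(X, Y) over X in {0, 1}, Y in {0, 1}
p_xy = {(0, 0): 0.1, (0, 1): 0.3,
        (1, 0): 0.2, (1, 1): 0.4}

# Sum rule: p(X=x) = sum over y of p(X=x, Y=y)
p_x = {x: sum(p for (xi, y), p in p_xy.items() if xi == x) for x in (0, 1)}

# Product rule: p(Y=y | X=x) = p(X=x, Y=y) / p(X=x),
# so multiplying back should recover the joint table
p_y_given_x = {(y, x): p_xy[(x, y)] / p_x[x] for x in (0, 1) for y in (0, 1)}
reconstructed = {(x, y): p_y_given_x[(y, x)] * p_x[x] for x in (0, 1) for y in (0, 1)}
```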

Illustration: a figure on the slide shows a joint distribution p(X, Y) for Y=1 and Y=2, together with the marginals p(X) and p(Y) and the conditional p(X|Y=1).

The rules of probability
- Sum rule: p(X) = Σ_Y p(X, Y)
- Product rule: p(X, Y) = p(Y|X) p(X)
- Bayes' rule: p(Y|X) = p(X|Y) p(Y) / p(X), i.e., posterior ∝ likelihood × prior

Application of probability rules Assume P(Y=r) = 40%, P(Y=b) = 60%, and P(X=a|Y=r) = 2/8 = 25%, P(X=o|Y=r) = 6/8 = 75%, P(X=a|Y=b) = 3/4 = 75%, P(X=o|Y=b) = 1/4 = 25%. Then p(X=a) = p(X=a,Y=r) + p(X=a,Y=b) = p(X=a|Y=r)p(Y=r) + p(X=a|Y=b)p(Y=b) = 0.25×0.4 + 0.75×0.6 = 11/20, so P(X=o) = 1 − P(X=a) = 9/20. By Bayes' rule, p(Y=r|X=o) = p(Y=r,X=o)/p(X=o) = p(X=o|Y=r)p(Y=r)/p(X=o) = 0.75×0.4 / (9/20) = 2/3.
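The same computation carried out with exact fractions, using the numbers from the slide:

```python
from fractions import Fraction

# Priors and likelihoods from the slide
p_y = {'r': Fraction(2, 5), 'b': Fraction(3, 5)}   # P(Y=r)=40%, P(Y=b)=60%
p_x_given_y = {('a', 'r'): Fraction(1, 4), ('o', 'r'): Fraction(3, 4),
               ('a', 'b'): Fraction(3, 4), ('o', 'b'): Fraction(1, 4)}

# Sum + product rules give the marginal P(X=x)
p_x = {x: sum(p_x_given_y[(x, y)] * p_y[y] for y in p_y) for x in ('a', 'o')}

# Bayes' rule: P(Y=r | X=o) = P(X=o | Y=r) P(Y=r) / P(X=o)
p_r_given_o = p_x_given_y[('o', 'r')] * p_y['r'] / p_x['o']
```

This reproduces P(X=a) = 11/20, P(X=o) = 9/20, and P(Y=r|X=o) = 2/3.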


Mean and variance The mean of a random variable X is the average value X takes. The variance of X is a measure of how dispersed the values that X takes are. The standard deviation is simply the square root of the variance.

Simple example X = {1, 2} with P(X=1) = 0.8 and P(X=2) = 0.2. Mean: 0.8 × 1 + 0.2 × 2 = 1.2. Variance: 0.8 × (1 − 1.2)² + 0.2 × (2 − 1.2)² = 0.16.
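The mean and variance from this example, computed directly from the definitions:

```python
# X takes value 1 with probability 0.8 and value 2 with probability 0.2
dist = {1: 0.8, 2: 0.2}

mean = sum(p * x for x, p in dist.items())                    # E[X]
variance = sum(p * (x - mean) ** 2 for x, p in dist.items())  # E[(X - E[X])^2]
std_dev = variance ** 0.5                                     # sqrt of the variance
```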

Gaussian distribution The univariate Gaussian (normal) density with mean μ and variance σ² is N(x | μ, σ²) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)).
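A direct implementation of this density, with a crude numerical check that it integrates to 1 (the grid over [−10, 10] is an illustrative choice):

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Univariate Gaussian density N(x | mu, sigma^2)."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The density integrates to 1; check with a simple Riemann sum over [-10, 10]
dx = 0.001
total = sum(gaussian_pdf(-10 + i * dx) * dx for i in range(20000))
```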


Multivariate Gaussian For a d-dimensional random vector x with mean vector μ and covariance matrix Σ, N(x | μ, Σ) = (2π)^(−d/2) |Σ|^(−1/2) exp(−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)).

Basics of linear algebra

Matrix multiplication The product C = AB of an m×n matrix A and an n×p matrix B is the m×p matrix with entries C_ij = Σ_k A_ik B_kj. Special cases: the vector–vector (inner) product and the matrix–vector product.
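The product and its special cases in NumPy (the small matrices are invented for illustration):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])        # 3x2
B = np.array([[1., 0., 2.],
              [0., 1., 3.]])    # 2x3

C = A @ B                       # 3x3: C[i, j] = sum_k A[i, k] * B[k, j]

# Special cases
x = np.array([1., 2.])
inner = x @ x                   # vector-vector (inner) product: 1*1 + 2*2 = 5
Ax = A @ x                      # matrix-vector product, shape (3,)
```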


Rules of matrix multiplication Matrix multiplication is associative, (AB)C = A(BC), and distributive over addition, A(B + C) = AB + AC, but not commutative in general: AB ≠ BA.

Vector norms Common examples: the ℓ1 norm ||x||₁ = Σᵢ |xᵢ|, the ℓ2 (Euclidean) norm ||x||₂ = √(Σᵢ xᵢ²), and the ℓ∞ norm ||x||∞ = maxᵢ |xᵢ|.
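The three norms evaluated with NumPy on a small example vector:

```python
import numpy as np

x = np.array([3., -4.])

l1 = np.linalg.norm(x, 1)         # |3| + |-4| = 7
l2 = np.linalg.norm(x)            # sqrt(9 + 16) = 5 (Euclidean, the default)
linf = np.linalg.norm(x, np.inf)  # max(|3|, |-4|) = 4
```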

Matrix norms and trace The Frobenius norm is ||A||_F = √(Σᵢⱼ Aᵢⱼ²); the trace of a square matrix is the sum of its diagonal entries, tr(A) = Σᵢ Aᵢᵢ.

A bit more on matrices

Orthogonal matrix A square matrix Q is orthogonal if QᵀQ = QQᵀ = I, i.e., Q⁻¹ = Qᵀ; its columns are orthonormal vectors.

Square matrix – eigenvalue, eigenvector For a square matrix A, a nonzero vector v is an eigenvector with eigenvalue λ if Av = λv.

Symmetric matrix A symmetric matrix A has the eigen-decomposition A = QΛQᵀ, where Q is an orthogonal matrix whose columns are eigenvectors of A and Λ is a diagonal matrix of the corresponding (real) eigenvalues.
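The eigen-decomposition of a symmetric matrix, checked with NumPy (the 2×2 example matrix is invented for illustration):

```python
import numpy as np

# A symmetric matrix
A = np.array([[2., 1.],
              [1., 2.]])

# eigh is specialized for symmetric matrices: eigenvalues are real and
# the eigenvectors form an orthogonal matrix Q
eigvals, Q = np.linalg.eigh(A)
Lambda = np.diag(eigvals)

reconstructed = Q @ Lambda @ Q.T   # should recover A
```

For this matrix the eigenvalues are 1 and 3.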

Singular value decomposition Any m×n matrix A factors as A = UΣVᵀ, where U (m×m) and V (n×n) are orthogonal and Σ is an m×n diagonal matrix of nonnegative singular values.
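The SVD factorization verified with NumPy (the 3×2 example matrix is invented for illustration):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])

# Full SVD: U is 3x3 orthogonal, Vt is 2x2 orthogonal,
# and s holds the nonnegative singular values
U, s, Vt = np.linalg.svd(A)

# Rebuild A by placing the singular values on the diagonal of a 3x2 matrix
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
reconstructed = U @ Sigma @ Vt
```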

Supervised learning – practical issues Underfitting Overfitting Before introducing these important concepts, let us study a simple regression algorithm – linear regression.
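As a preview of the linear regression algorithm mentioned above, a minimal least-squares fit (the toy data, generated from y = 2x + 1 plus small noise, are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=50)   # targets near y = 2x + 1

# Design matrix with a bias column; solve min_w ||Xw - y||^2
X = np.column_stack([x, np.ones_like(x)])
w, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)

slope, intercept = w   # should be close to the true values 2 and 1
```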

Questions?