1
Artificial Intelligence, Lecture 2
Dr. Bo Yuan, Professor
Department of Computer Science and Engineering
Shanghai Jiaotong University
boyuan@sjtu.edu.cn
2
Review of Lecture One
Overview of AI
– Knowledge-based rules in logic (expert systems, automata, …): symbolism in logic
– Kernel-based heuristics (neural networks, SVM, …): connectionism for nonlinearity
– Learning and inference (Bayesian, Markovian, …): sparse sampling for convergence
– Interactive and stochastic computing (uncertainty, heterogeneity): to overcome the limits of the Turing Machine
Course Content
– Focus mainly on learning and inference
– Discuss current problems and research efforts
– Perception and behavior (vision, robotics, NLP, bionics, …) not included
Exam
– Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
– Course materials
3
Today’s Content
Overview of machine learning
Linear regression
– Gradient descent
– Least-squares fit
– Stochastic gradient descent
– The normal equation
Applications
5
Basic Terminologies
x = input variables/features
y = output variables/target variables
(x, y) = a training example; the i-th training example is $(x^{(i)}, y^{(i)})$
m = number of training examples ($i = 1, \ldots, m$)
n = number of input variables/features ($j = 0, \ldots, n$)
h(x) = hypothesis/function/model that outputs the predicted value for a given input x
$\theta$ = parameters/weights, which parameterize the mapping from x to its predicted value
We define $x_0 = 1$ (the intercept term), so the hypothesis can be written in matrix form: $h_\theta(x) = \sum_{j=0}^{n} \theta_j x_j = \theta^T x$
8
Gradient Descent
The cost function is defined as $J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)^2$, using the matrix of training samples together with the parameters $\theta$.
Gradient descent is based on the partial derivatives of $J(\theta)$ with respect to $\theta$: $\frac{\partial J(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}$
The (batch) algorithm is therefore:
Loop {
  $\theta_j := \theta_j - \alpha \sum_{i=1}^{m} \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}$   (for every j)
}
There is an alternative way to iterate, called stochastic gradient descent, which updates $\theta$ after each training example: $\theta_j := \theta_j - \alpha \bigl(h_\theta(x^{(i)}) - y^{(i)}\bigr)\, x_j^{(i)}$   (for every j)
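A minimal NumPy sketch of the two update rules above (an illustration, not the lecture's own code); it assumes a design matrix X whose first column is the intercept term $x_0 = 1$, and the names alpha, n_iters, and n_epochs are hypothetical:

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_iters=1000):
    """Batch LMS update: theta_j := theta_j - alpha * sum_i (h(x_i) - y_i) * x_ij."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        errors = X @ theta - y           # h_theta(x^(i)) - y^(i) for all i
        theta -= alpha * (X.T @ errors)  # gradient of J(theta)
    return theta

def stochastic_gradient_descent(X, y, alpha=0.01, n_epochs=10):
    """Update theta after each individual training example."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in range(m):
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
    return theta
```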
9
Normal Equation
An explicit, closed-form way to directly obtain the optimal parameters $\theta$ without iteration.
10
The Optimization Problem by the Normal Equation
We set the derivatives of $J(\theta)$ to zero, and obtain the normal equations: $X^T X \theta = X^T y$, hence $\theta = (X^T X)^{-1} X^T y$.
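A small NumPy sketch of this closed-form solution (illustrative, not from the lecture); solving the linear system is numerically preferable to forming the inverse explicitly:

```python
import numpy as np

def normal_equation(X, y):
    """Solve X^T X theta = X^T y for theta (closed-form least squares)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example: fit y ~ 3 + 2x on noisy synthetic data
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])   # x_0 = 1 is the intercept term
y = 3 + 2 * x + rng.normal(0, 0.5, size=50)
theta = normal_equation(X, y)               # approximately [3, 2]
```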
11
Today’s Content
Linear Regression
– Locally weighted regression (an adaptive method)
Probabilistic Interpretation
– Maximum likelihood estimation vs. least squares (Gaussian distribution)
Classification by Logistic Regression
– LMS updating
– A perceptron-based learning algorithm
12
Linear Regression
1. Number of features
2. Over-fitting and under-fitting issues
3. Feature selection problem (to be covered later)
4. Adaptivity issue
Some definitions:
Parametric learning: a fixed set of parameters $\theta$, with n being constant.
Non-parametric learning: the number of parameters grows (here, linearly) with m.
Locally weighted regression (Loess/Lowess regression) is non-parametric:
– A bell-shaped weighting of the training points around the query point (not a Gaussian density)
– Each prediction requires the entire training set to fit the local model for the given input (computational cost); a sketch follows below.
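A minimal sketch of locally weighted linear regression in NumPy, assuming the common bell-shaped weight $w^{(i)} = \exp\bigl(-(x^{(i)} - x)^2 / (2\tau^2)\bigr)$ around the query point; the bandwidth name tau and the one-dimensional feature are illustrative assumptions:

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Fit a weighted least-squares line around x_query and predict its output.

    X is the m x 2 design matrix [1, x]; every prediction re-uses the full
    training set, which is what makes the method non-parametric.
    """
    w = np.exp(-((X[:, 1] - x_query) ** 2) / (2 * tau ** 2))  # bell-shaped weights
    W = np.diag(w)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)          # weighted normal equations
    return np.array([1.0, x_query]) @ theta
```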
13
Extension of Linear Regression
Linear additive (straight line): $x_1 = 1$, $x_2 = x$
Polynomial: $x_1 = 1$, $x_2 = x$, …, $x_n = x^{n-1}$
Chebyshev orthogonal polynomial: $x_1 = 1$, $x_2 = x$, …, $x_n = 2x\,x_{n-1} - x_{n-2}$
Fourier trigonometric polynomial: $x_1 = 0.5$, followed by sin and cos terms of different frequencies of x
Pairwise interaction: linear terms + products of pairs of features $x_{k_1} x_{k_2}$ ($k = 1, \ldots, N$)
…
The central problem underlying these representations is whether or not the optimization process for $\theta$ is convex.
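A short sketch showing why these basis expansions are still linear regression in $\theta$: the nonlinearity lives entirely in the feature map, so the least-squares fit itself is unchanged. The function names below are illustrative, not from the lecture:

```python
import numpy as np

def polynomial_features(x, degree):
    """Columns 1, x, x^2, ..., x^(degree-1); the model stays linear in theta."""
    return np.column_stack([x ** d for d in range(degree)])

def fourier_features(x, n_freqs):
    """Constant term plus sin/cos pairs of increasing frequency."""
    cols = [0.5 * np.ones_like(x)]
    for k in range(1, n_freqs + 1):
        cols += [np.sin(k * x), np.cos(k * x)]
    return np.column_stack(cols)

# Either design matrix can be fed to the normal equation or gradient descent above.
```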
14
Probabilistic Interpretation
Why ordinary least squares (OLS)? Why not other power terms?
– Assume $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$
– The PDF for a Gaussian is $p(\epsilon) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\bigl(-\frac{\epsilon^2}{2\sigma^2}\bigr)$
– This implies that $p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\bigl(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\bigr)$
– Or, $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$, where $\epsilon^{(i)}$ is random noise, $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$
Why Gaussian for the random noise? The central limit theorem.
15
Maximum Likelihood (updated)
Consider the training data as stochastic, and assume the $\epsilon^{(i)}$ are i.i.d. (independently and identically distributed).
– Likelihood of the parameters: $L(\theta)$ = the probability of the y's given the x's, parameterized by $\theta$ (see the block below).
What is Maximum Likelihood Estimation (MLE)?
– Choose the parameters $\theta$ that maximize $L(\theta)$, so as to make the training data set as probable as possible;
– We speak of the likelihood $L(\theta)$ of the parameters, and of the probability of the data.
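A reconstruction of the standard likelihood expressions this slide refers to, under the Gaussian noise assumption of the previous slide (the original equations were images):

```latex
\begin{align}
L(\theta) &= \prod_{i=1}^{m} p\!\left(y^{(i)} \mid x^{(i)}; \theta\right)
           = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
             \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right) \\
\ell(\theta) &= \log L(\theta)
\end{align}
```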
16
The Equivalence of MLE and OLS
Maximizing $\ell(\theta)$ is the same as minimizing $\frac{1}{2}\sum_{i=1}^{m}\bigl(y^{(i)} - \theta^T x^{(i)}\bigr)^2 = J(\theta)$ !?
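A sketch of the derivation the slide alludes to, expanding the log-likelihood term by term (standard algebra, not taken from the original slide image):

```latex
\begin{align}
\ell(\theta) &= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma}
                \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right) \\
             &= m \log \frac{1}{\sqrt{2\pi}\,\sigma}
                - \frac{1}{\sigma^{2}} \cdot
                  \underbrace{\frac{1}{2}\sum_{i=1}^{m} \bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)^{2}}_{J(\theta)}
\end{align}
% Only J(\theta) depends on \theta, so maximizing \ell(\theta) is the same as
% minimizing J(\theta); \sigma plays no role in the choice of \theta.
```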
17
Sigmoid (Logistic) Function
$g(z) = \frac{1}{1 + e^{-z}}$
Other functions that increase smoothly from 0 to 1 can also be found, but for a couple of good reasons (which we will see next time with Generalized Linear Models), the choice of the logistic function is a natural one.
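For the classification setting this leads into, the hypothesis and its probabilistic reading are (a standard reconstruction, since the slide's equations were images):

```latex
\begin{align}
h_\theta(x) &= g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}} \\
P(y = 1 \mid x; \theta) &= h_\theta(x), \qquad
P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \\
p(y \mid x; \theta) &= h_\theta(x)^{y}\,\bigl(1 - h_\theta(x)\bigr)^{1-y}, \qquad y \in \{0, 1\}
\end{align}
```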
18
Recall the gradient ascent update $\theta := \theta + \alpha \nabla_\theta \ell(\theta)$ (note the positive sign rather than negative, since we are maximizing the likelihood). Let's work with just one training example (x, y) and derive the gradient ascent rule:
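For a single example, the log-likelihood being ascended and its partial derivative, up to the point where the derivative of $g$ is needed (the next slide supplies that property), can be reconstructed as:

```latex
\begin{align}
\ell(\theta) &= y \log h_\theta(x) + (1 - y)\log\bigl(1 - h_\theta(x)\bigr) \\
\frac{\partial}{\partial \theta_j}\ell(\theta)
  &= \left(\frac{y}{g(\theta^{T}x)} - \frac{1 - y}{1 - g(\theta^{T}x)}\right)
     \frac{\partial}{\partial \theta_j}\, g(\theta^{T} x)
\end{align}
```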
19
One Useful Property of the Logistic Function
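The property the slide title refers to is, in the standard treatment, the following identity for the logistic function; a short derivation (reconstructed, not from the original slide image):

```latex
\begin{align}
g(z) &= \frac{1}{1 + e^{-z}} \\
g'(z) &= \frac{e^{-z}}{\bigl(1 + e^{-z}\bigr)^{2}}
       = \frac{1}{1 + e^{-z}} \cdot \left(1 - \frac{1}{1 + e^{-z}}\right)
       = g(z)\,\bigl(1 - g(z)\bigr)
\end{align}
```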
20
Identical to Least Squares Again?
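Spelling out the point the title makes (a hedged reconstruction): substituting $g'(z) = g(z)(1-g(z))$ into the derivative above yields an update with exactly the same form as the LMS rule for linear regression, even though $h_\theta$ is now a nonlinear function of $\theta^{T} x$:

```latex
\begin{align}
\frac{\partial}{\partial \theta_j}\ell(\theta) &= \bigl(y - h_\theta(x)\bigr)\,x_j \\
\text{Logistic regression:}\quad
  \theta_j &:= \theta_j + \alpha\,\bigl(y^{(i)} - h_\theta(x^{(i)})\bigr)\,x_j^{(i)},
  \qquad h_\theta(x) = g(\theta^{T} x) \\
\text{LMS (linear regression):}\quad
  \theta_j &:= \theta_j + \alpha\,\bigl(y^{(i)} - \theta^{T} x^{(i)}\bigr)\,x_j^{(i)}
\end{align}
```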