Computational Intelligence: Methods and Applications. Lecture 3: Histograms and probabilities. Włodzisław Duch, Dept. of Informatics, UMK. Google: W Duch.

Features AI uses complex knowledge representation methods; pattern recognition is based mostly on simple feature spaces. Set of objects: physical entities (images, patients, clients, molecules, cars, signal samples, software pieces), or states of physical entities (board states, patient states, etc.). Features: measurements or evaluations of some object properties. Example: are pixel intensities good features? No: they are not invariant to translation, scaling or rotation. Better: type of connections, type of lines, number of lines... Selecting good features, i.e. transforming the raw measurements that are collected, is very important.

Feature space representation Representation: mapping objects/states into vectors, {O_i} => X(O_i), with X_j(O_i) being the j-th attribute of object O_i. "Attribute" and "feature" are used as synonyms, although strictly speaking "age" is an attribute and "young" is its feature value. Types of features. Categorical: symbolic or discrete – may be nominal (unordered), like "sweet, salty, sour", or ordinal (can be ordered), like colors or small < medium < large (drink). Continuous: numerical values. [Figure: a point x(O) in a 3-dimensional feature space with axes x_1, x_2, x_3.] Vector X = (x_1, x_2, x_3, ..., x_d), a d-dimensional point in the feature space.

Fishy example. Chapter 1.2, Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000. Singapore Fisheries automates the process of sorting salmon and sea bass coming on a conveyor belt. An optical sensor and processing software evaluate a number of features: length, lightness, width, number of fins. Step 1: look at the histograms. Select the number of bins, e.g. n = 20 (discretize the data); calculate the bin size Δ = (x_max − x_min)/n; calculate N(C, r_i) = number of samples from class C ∈ {salmon, bass} in each bin r_i = [x_min + (i−1)Δ, x_min + iΔ], i = 1...n; normalize P(C, r_i) = N(C, r_i)/N, where N is the number of all samples. This gives the joint probability P(C, r_i) = P(r_i|C) P(C).
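A minimal sketch of this discretization step in Python with NumPy; the fish lengths, class labels and bin count below are hypothetical placeholders, not the data used in the lecture.

```python
import numpy as np

def joint_histogram(x, labels, classes, n_bins=20):
    """Discretize feature x into n_bins equal-width bins and count
    how many samples of each class fall into each bin.
    Returns the bin edges, the count matrix N(C, r_i) and the
    joint probability matrix P(C, r_i) = N(C, r_i) / N."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)   # bin boundaries of r_i
    counts = np.zeros((len(classes), n_bins))
    for row, c in enumerate(classes):
        # np.histogram counts the samples of class c falling into each bin r_i
        counts[row], _ = np.histogram(x[labels == c], bins=edges)
    joint = counts / len(x)                              # P(C, r_i)
    return edges, counts, joint

# Hypothetical example: fish lengths (cm) and their class labels.
length = np.array([18.2, 25.1, 30.5, 22.3, 27.8, 19.9, 31.2, 24.4])
species = np.array(['salmon', 'bass', 'bass', 'salmon',
                    'bass', 'salmon', 'bass', 'salmon'])
edges, N_Cr, P_Cr = joint_histogram(length, species, ['salmon', 'bass'], n_bins=4)
print(P_Cr.sum())  # all joint probabilities sum to 1
```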

Fishy histograms. Example of histograms for two features, length (left) and skin lightness (right); optimal thresholds are marked. P(r_i|C) is an approximation to the class probability distribution P(x|C). How to calculate it? Discretization, replacing continuous values by discrete bins and integration by summation, is very useful when dealing with real data. Alternative: fit some simple functions to these histograms, for example a sum of several Gaussians, optimizing their widths and centers.
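A hedged sketch of the alternative approach, fitting a sum of Gaussians to one feature with scikit-learn's GaussianMixture; the number of components and the sample values are assumptions for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 1-D feature values for one class (e.g. salmon lengths in cm).
x = np.array([18.2, 19.9, 22.3, 24.4, 25.0, 20.7, 23.1]).reshape(-1, 1)

# Fit a sum of two Gaussians; EM optimizes their centers and widths.
gmm = GaussianMixture(n_components=2, random_state=0).fit(x)

# score_samples returns log densities, so exp gives a smooth estimate
# of the class-conditional density P(x|C) at any point.
grid = np.linspace(15, 30, 5).reshape(-1, 1)
density = np.exp(gmm.score_samples(grid))
print(gmm.means_.ravel(), density)
```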

Fishy example. Exploratory data analysis (EDA): visualize relations in the data. How are histograms created? Select the number of bins, e.g. n = 20 (discretize the data into 20 bins); calculate the bin size Δ = (x_max − x_min)/n; calculate N(C, r_i) = number of samples from class C in each bin r_i = [x_min + (i−1)Δ, x_min + iΔ], i = 1...n. This may be converted to a joint probability P(C, r_i) that a fish of type C will be found with length in bin r_i: P(C, r_i) = N(C, r_i)/N, where N is the number of all samples. Histograms show the joint probability P(C, r_i) rescaled by N.

Basic probability concepts. Normalized co-occurrence (contingency) table: P(C, r_i) = N(C, r_i)/N. P(C, r_i) is a matrix with columns = bins r_i and rows = classes. P(C, r_i) is the joint probability distribution, the probability of finding class C and x ∈ r_i. P(C), the a priori class probability, before making any measurements or learning that x falls into some bin r_i, is obtained by summing a row: P(C) = Σ_i P(C, r_i). P(x ∈ r_i), the probability that an object from any class is found in bin r_i, is obtained by summing a column: P(x ∈ r_i) = Σ_C P(C, r_i).
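A small illustration of these marginalizations in Python, assuming a hypothetical 2x3 joint probability table (classes as rows, bins as columns):

```python
import numpy as np

# Hypothetical joint probability table P(C, r_i): rows = classes, columns = bins.
P_joint = np.array([[0.10, 0.25, 0.15],    # class C1
                    [0.20, 0.20, 0.10]])   # class C2

P_class = P_joint.sum(axis=1)  # a priori class probabilities P(C), row sums
P_bin = P_joint.sum(axis=0)    # bin probabilities P(x in r_i), column sums

print(P_class)        # [0.5 0.5]
print(P_bin)          # [0.3 0.45 0.25]
print(P_joint.sum())  # the whole table sums to 1
```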

PDF and conditional probability What if x is a continuous variable and there are no natural bins? Then P(C, x) is a probability density function (pdf), and for small dx: P(C, [x, x+dx]) = P(C, x) dx. Suppose now that class C is known; what is the probability of finding x ∈ r_i, or, for continuous features, of finding it in the [x, x+dx] interval? P(x ∈ r_i|C) denotes the conditional probability distribution, knowing C: P(x ∈ r_i|C) = P(C, r_i)/P(C). The sum over all bins gives Σ_i P(x ∈ r_i|C) = 1, because x must fall into one of the bins. P_C(x) = P(x|C), the class probability distribution, is simply a rescaled joint probability: divide a single row of the P(C, x) matrix by P(C). Therefore the formula for the joint probability is P(C, x) = P(x|C) P(C).
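Continuing the sketch above, the class-conditional distribution is obtained by rescaling a row of the joint table; the numbers are again hypothetical:

```python
import numpy as np

P_joint = np.array([[0.10, 0.25, 0.15],    # P(C1, r_i)
                    [0.20, 0.20, 0.10]])   # P(C2, r_i)
P_class = P_joint.sum(axis=1)              # P(C), row sums

# Divide each row of the joint table by the corresponding P(C)
# to get the conditional distributions P(x in r_i | C).
P_cond = P_joint / P_class[:, np.newaxis]

print(P_cond)
print(P_cond.sum(axis=1))  # each row sums to 1
```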

Probability formulas Most probability formulas are simple summation rules! Matrix of joint probability distributions: elements P(C, x) for discrete x, or P(C, x ∈ r_i) after discretization. Just count how many times each combination occurs: P(C, x) = N(C, x)/N. A row of P(C, x) sums to the class probability, Σ_x P(C, x) = P(C); therefore P(x|C) = P(C, x)/P(C) sums to Σ_x P(x|C) = 1. For continuous x, P(C, x) is a probability density distribution, integrating to P(C): ∫ P(C, x) dx = P(C).

Bayes formula The Bayes formula allows for the calculation of the conditional probability distribution P(x|C) and the posterior probability distribution P(C|x). These probabilities are renormalized elements of the joint probability P(C, r_i): P(x ∈ r_i|C) = P(C, r_i)/P(C) and P(C|x ∈ r_i) = P(C, r_i)/P(x ∈ r_i). They sum to 1: summing over bins gives 1 since we know that an object from class C must fall into one of the bins, and summing over classes gives 1 since an object with x ∈ r_i must belong to one of the C classes. Therefore the Bayes formula P(C|x) = P(x|C) P(C) / P(x) is quite obvious!
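A minimal sketch of the Bayes formula applied to a discretized joint probability table (hypothetical numbers, continuing the earlier examples):

```python
import numpy as np

P_joint = np.array([[0.10, 0.25, 0.15],    # P(C1, r_i)
                    [0.20, 0.20, 0.10]])   # P(C2, r_i)

P_class = P_joint.sum(axis=1)              # prior P(C)
P_bin = P_joint.sum(axis=0)                # evidence P(x in r_i)
P_x_given_C = P_joint / P_class[:, None]   # likelihood P(x|C)

# Bayes formula: posterior P(C|x) = P(x|C) P(C) / P(x),
# i.e. the joint table renormalized column-wise.
P_C_given_x = P_x_given_C * P_class[:, None] / P_bin
print(P_C_given_x)
print(P_C_given_x.sum(axis=0))  # each column sums to 1
```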

Example Two types of Iris flowers: Iris Setosa and Iris Virginica. Measuring petal lengths in cm in two intervals, r_1 = [0, 3] cm and r_2 = [3, 6] cm, for 100 samples we may get, for example, the count distribution N(C, r_i) shown on the slide, and from it the joint probabilities P(C, r_i) = N(C, r_i)/100 for finding the different types of Iris flowers. Calculate the conditional probabilities and verify all formulas given on the previous pages! Do more examples.
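The original count table is not reproduced in this transcript; a worked sketch with hypothetical counts (50 flowers of each species, Setosa mostly with short petals) shows how the exercise goes:

```python
import numpy as np

# Hypothetical counts N(C, r_i): rows = [Setosa, Virginica],
# columns = petal-length bins r_1 = [0,3] cm and r_2 = [3,6] cm.
N = np.array([[45, 5],
              [2, 48]])
N_total = N.sum()                         # 100 samples

P_joint = N / N_total                     # P(C, r_i)
P_class = P_joint.sum(axis=1)             # priors P(Setosa), P(Virginica)
P_bin = P_joint.sum(axis=0)               # P(petal length in r_1), P(in r_2)
P_x_given_C = P_joint / P_class[:, None]  # P(r_i | C)
P_C_given_x = P_joint / P_bin             # P(C | r_i), Bayes formula

print(P_class)       # [0.5 0.5]
print(P_C_given_x)   # e.g. P(Setosa | short petal) is about 0.96
```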

Going to 2D histograms 2D histograms displayed in 3D are still useful, although sometimes hard to analyze: for a single class it is still easy, but for more than 2 classes it becomes rather difficult. Many visualization software packages create such 3D plots. The joint probability P(C, x, y) is shown here, for each class on a separate drawing; for small N it may be quite different from the real distribution.
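A short sketch of computing and plotting such a 2D histogram for one class with NumPy and Matplotlib; the two features and the random data are placeholders, not the lecture's fish data:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Hypothetical 2-D feature samples (e.g. length and lightness) for one class.
x = rng.normal(25.0, 3.0, 200)
y = rng.normal(0.5, 0.1, 200)

# 2-D histogram: counts on a grid of bins, normalized to an
# estimate of the joint probability P(C, x, y).
H, xedges, yedges = np.histogram2d(x, y, bins=10)
P = H / H.sum()

plt.imshow(P.T, origin='lower', aspect='auto',
           extent=[xedges[0], xedges[-1], yedges[0], yedges[-1]])
plt.xlabel('length')
plt.ylabel('lightness')
plt.colorbar(label='P(C, x, y)')
plt.show()
```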

Examples of histograms Histograms in 2D are still useful, although sometimes hard to analyze; they can be made with SigmaPlot, Origin, or statistical packages like SPSS. Illustration of the influence of discretization (bin width). Various chart applets are available online. Image histograms are popular in electronic cameras. How to improve the information content of histograms? How to analyze high-dimensional data?

2D histograms in bioinformatics Very popular: two nominal variables (genes, samples) vs. a continuous variable (activity) normalized to [-1, +1]. Example: gene expression data in blood B-cells of 16 types. Use color instead of height: intensity = -1 => bright green (inhibition), intensity = 0 => black (normal), intensity = +1 => bright red (high expression). Gene activity(gene name, cell type).
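A hedged sketch of such a heat map in Matplotlib, mapping [-1, +1] activities onto a green-black-red scale; the matrix size and the expression values are made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Hypothetical gene activity(gene name, cell type) matrix in [-1, +1].
rng = np.random.default_rng(1)
activity = rng.uniform(-1, 1, size=(6, 16))   # 6 genes x 16 B-cell types

# Green (inhibition) -> black (normal) -> red (high expression).
cmap = LinearSegmentedColormap.from_list('gbr', ['green', 'black', 'red'])

plt.imshow(activity, cmap=cmap, vmin=-1, vmax=1, aspect='auto')
plt.colorbar(label='expression')
plt.xlabel('cell type')
plt.ylabel('gene')
plt.show()
```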