

Data Basics

Data Matrix
Many datasets can be represented as a data matrix, in which rows correspond to entities and columns represent attributes.
– n: size of the data (number of rows)
– d: dimensionality of the data (number of columns)
– Univariate analysis: the analysis of a single attribute.
– Bivariate analysis: simultaneous analysis of two attributes.
– Multivariate analysis: simultaneous analysis of multiple attributes.

Example for Data Matrix

Attributes
Categorical attributes
– Composed of a set of symbols; has a set-valued domain.
– E.g., Sex with domain(Sex) = {M, F}; Education with domain(Education) = {High School, BS, MS, PhD}.
Two types of categorical attributes
– Nominal: values in the domain are unordered, so only equality comparisons are allowed. E.g., Sex.
– Ordinal: values are ordered, so both equality and inequality comparisons are allowed. E.g., Education.

Attributes Cont.
Numeric attributes
– Have a real-valued or integer-valued domain.
– E.g., Age with domain(Age) = N, where N denotes the set of natural numbers (non-negative integers).
Two types of numeric attributes
– Discrete: values take on a finite or countably infinite set.
– Continuous: values can take on any real value.
Another classification
– Interval-scaled: only differences make sense. E.g., temperature.
– Ratio-scaled: both differences and ratios are meaningful. E.g., Age.

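Algebraic View of Data
If the d attributes in the data matrix D are all numeric, each row can be considered a d-dimensional point $x_i = (x_{i1}, x_{i2}, \dots, x_{id}) \in \mathbb{R}^d$ or, equivalently, a d-dimensional column vector $x_i = (x_{i1}, x_{i2}, \dots, x_{id})^T$.
Each such vector is a linear combination of the standard basis vectors $e_1, \dots, e_d$: $x_i = \sum_{j=1}^{d} x_{ij} e_j$.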

Example of Algebraic View of Data

Geometric View of Data

Distance and Angle

Example of Distance and Angle
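The worked example on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch that assumes the usual definitions: Euclidean distance is the norm of a − b, and the angle comes from the cosine a·b / (‖a‖ ‖b‖). The two vectors and the helper names (dot, norm, distance, cosine) are illustrative choices, not taken from the slides.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def distance(a, b):
    """Euclidean distance between two points."""
    return norm([x - y for x, y in zip(a, b)])

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return dot(a, b) / (norm(a) * norm(b))

# Illustrative vectors (an assumption, not data from the slides).
a = [5.0, 3.0]
b = [1.0, 4.0]

dist = distance(a, b)                          # length of a - b = (4, -1)
cos_theta = cosine(a, b)                       # 17 / (sqrt(34) * sqrt(17))
angle_deg = math.degrees(math.acos(cos_theta))
print(dist, cos_theta, angle_deg)
```

For these two vectors the cosine works out to 1/√2, so the angle is 45 degrees.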

Mean and Total Variance

Centered Data Matrix
The centered data matrix is obtained by subtracting the mean from every point: $z_i = x_i - \hat{\mu}$ for $i = 1, \dots, n$.
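Centering can be sketched in a few lines, assuming the mean is the column-wise sample mean of the data matrix. The small matrix D and the helper names column_means and center are made up for illustration.

```python
def column_means(rows):
    """Sample mean of each attribute (column) of a data matrix given as a list of rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def center(rows):
    """Subtract the column means from every row."""
    mu = column_means(rows)
    return [[x - m for x, m in zip(row, mu)] for row in rows]

# Illustrative 3 x 2 data matrix (an assumption, not data from the slides).
D = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 9.0]]

mu = column_means(D)   # mean point
Z = center(D)          # centered data matrix
print(mu, Z, column_means(Z))
```

After centering, every column of Z has mean zero, which is a quick sanity check on the result.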

Orthogonality
Two vectors a and b are said to be orthogonal if and only if $a^T b = 0$. This implies that the angle between them is 90° or π/2 radians.

Orthogonal Projection
p: the orthogonal projection of b onto the vector a, $p = \frac{a^T b}{a^T a}\, a$.
r: the error vector between the points b and p, $r = b - p$.
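A small sketch of the projection, assuming the standard formula p = (aᵀb / aᵀa) a and the error vector r = b − p. The vectors and the helper names dot and project are illustrative only.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def project(b, a):
    """Orthogonal projection of b onto the (non-zero) vector a."""
    scale = dot(a, b) / dot(a, a)
    return [scale * x for x in a]

# Illustrative vectors (an assumption, not data from the slides).
a = [1.0, 2.0]
b = [3.0, 1.0]

p = project(b, a)                        # component of b along a
r = [bi - pi for bi, pi in zip(b, p)]    # error vector, b - p
print(p, r, dot(a, r))
```

The error vector r is orthogonal to a, so dot(a, r) is zero up to rounding.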

Example of Projection

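Linear Independence and Dimensionality
$\mathrm{span}(v_1, \dots, v_k)$: the set of all possible linear combinations of the vectors $v_1, \dots, v_k$.
If $\mathrm{span}(v_1, \dots, v_k) = \mathbb{R}^m$, then we say that $v_1, \dots, v_k$ is a spanning set for $\mathbb{R}^m$.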

Row and Column Space
The column space of D, denoted col(D), is the set of all linear combinations of the d column vectors (attributes).
The row space of D, denoted row(D), is the set of all linear combinations of the n row vectors (points).
Note also that the row space of D is the column space of $D^T$: row(D) = col($D^T$).

Linear Independence

Dimension and Rank
Let S be a subspace of $\mathbb{R}^m$.
– Basis for S: a set of linearly independent vectors $v_1, \dots, v_k$ such that $\mathrm{span}(v_1, \dots, v_k) = S$.
– Orthogonal basis for S: a basis whose vectors are pairwise orthogonal.
– Orthonormal basis for S: an orthogonal basis whose vectors are also normalized to be unit vectors.
For instance, the standard basis for $\mathbb{R}^m$ is an orthonormal basis consisting of the vectors $e_1 = (1, 0, \dots, 0)^T,\; e_2 = (0, 1, \dots, 0)^T,\; \dots,\; e_m = (0, 0, \dots, 1)^T$.

Any two bases for S must have the same number of vectors.
Dimension: the number of vectors in a basis for S, denoted dim(S).
For any matrix, the dimensions of its row space and column space are the same; this common dimension is called the rank of the matrix.

Data: Probabilistic View
Assume that each numeric attribute $X_j$ is a random variable, defined as a function that assigns a real number to each outcome of an experiment: $X_j : \mathcal{O} \to \mathbb{R}$, where $\mathcal{O}$, the domain of $X_j$, is called the sample space, and $\mathbb{R}$, the range of $X_j$, is the set of real numbers.
If the outcomes are numeric and represent the observed values of the random variable, then $X_j : \mathcal{O} \to \mathcal{O}$ is simply the identity function: $X_j(v) = v$ for all $v \in \mathcal{O}$.

Data: Probabilistic View
A random variable X is called a discrete random variable if it takes on only a finite or countably infinite number of values in its range.
X is called a continuous random variable if it can take on any value in its range.

Example
By default, consider the attribute X1 to be a continuous random variable, given as the identity function X1(v) = v, since the outcomes are all numeric.
On the other hand, if we want to distinguish between iris flowers with short and long sepal lengths, with long meaning at least 7 cm, we can define a discrete random variable A as follows: A(v) = 0 if v < 7, and A(v) = 1 if v ≥ 7.
In this case the domain of A is [4.3, 7.9]. The range of A is {0, 1}, and thus A assumes non-zero probability only at the discrete values 0 and 1.
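The discrete variable A can be sketched as an indicator function. The 7 cm threshold is taken from the Bernoulli example in this deck; the sepal lengths below are made-up values for illustration, not the Iris data, and p_hat is a hypothetical name for the observed fraction of long sepals.

```python
def A(v, threshold=7.0):
    """Indicator random variable: 1 for a long sepal (v >= threshold), 0 otherwise."""
    return 1 if v >= threshold else 0

# Made-up sepal lengths in cm (an assumption, not the real dataset).
sepal_lengths = [5.1, 7.0, 6.3, 7.7, 4.9]

values = [A(v) for v in sepal_lengths]   # maps each continuous outcome to {0, 1}
p_hat = sum(values) / len(values)        # fraction of long sepals in this toy sample
print(values, p_hat)
```

Whatever the continuous input, A only ever produces 0 or 1, which is why its probability mass sits on those two values.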

Example: Bernoulli and Binomial Distribution
Only 13 irises have a sepal length of at least 7 cm.
In this case we say that A has a Bernoulli distribution with parameter p ∈ [0, 1]: $P(A = 1) = p$ and $P(A = 0) = 1 - p$, where p denotes the probability of a success and 1 − p the probability of a failure.

Example: Bernoulli and Binomial Distribution
Let us consider another discrete random variable B, denoting the number of irises with long sepal lengths in m independent Bernoulli trials with probability of success p.
B takes on the discrete values 0, 1, ..., m, and its probability mass function is given by the Binomial distribution: $f(k) = P(B = k) = \binom{m}{k} p^k (1 - p)^{m - k}$.
For example, taking p from above, the probability of observing exactly k = 2 irises with long sepal lengths in m = 10 trials is given as $f(2) = \binom{10}{2} p^2 (1 - p)^8$.
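The numeric value of p is missing from the transcript. The sketch below assumes p = 13/150, that is, the 13 long-sepal irises mentioned earlier out of the 150 flowers in the standard Iris dataset; treat that denominator as an assumption. The function name binomial_pmf is illustrative.

```python
import math

def binomial_pmf(k, m, p):
    """P(B = k) for a Binomial(m, p) variable: C(m, k) * p^k * (1 - p)^(m - k)."""
    return math.comb(m, k) * p**k * (1 - p)**(m - k)

# Assumption: 13 successes among the 150 flowers of the standard Iris data.
p = 13 / 150
m = 10

pmf = [binomial_pmf(k, m, p) for k in range(m + 1)]   # full PMF over k = 0..m
print(pmf[2])      # probability of exactly 2 long-sepal irises in 10 trials
print(sum(pmf))    # the probabilities sum to 1 (up to rounding)
```

Under this assumed p, the probability for k = 2 comes out at roughly 0.16.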

Figure: full probability mass function for different values of k.

Probability Density Function
If X is continuous, its range is the entire set of real numbers R.
The probability density function f specifies the probability that the variable X takes on values in any interval [a, b] ⊂ R: $P(X \in [a, b]) = \int_a^b f(x)\,dx$.

Cumulative Distribution Function
For any random variable X, whether discrete or continuous, we can define the cumulative distribution function (CDF) F : R → [0, 1], which gives the probability of observing a value of at most some given x: $F(x) = P(X \le x)$ for all $-\infty < x < \infty$.

The following examples are from Andrew Moore

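Probability Density Function f(x)
What is P(X = x) when x lies in a real-valued domain? For a continuous X, the probability of any single value is zero, so probabilities are obtained by integrating the density over an interval.
$f(x) \ge 0$ and $\int_{-\infty}^{\infty} f(x)\,dx = 1$.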

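Normal Distribution
Let us assume that these values follow a Gaussian or normal density function, given as $f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}$.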
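The Gaussian density can be evaluated directly. This sketch assumes the standard univariate form with mean mu and standard deviation sigma; the function name normal_pdf and the crude Riemann-sum check are illustrative additions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma > 0."""
    var = sigma ** 2
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

peak = normal_pdf(0.0, 0.0, 1.0)   # density at the mean of a standard normal

# Crude Riemann sum over [-8, 8] as a sanity check that the density integrates to about 1.
step = 0.01
area = sum(normal_pdf(-8.0 + i * step, 0.0, 1.0) * step for i in range(1601))
print(peak, area)
```

The density is symmetric around mu and peaks there at 1 / sqrt(2πσ²).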

Bivariate Random Variables
Instead of considering each attribute separately, we can consider a pair of attributes, X1 and X2, jointly as a bivariate random variable $X = (X_1, X_2)^T$.

In 2-Dimensions

Multivariate Random Variable

Numeric Attribute Analysis
– Sample and Statistics
– Univariate Analysis
– Bivariate Analysis
– Multivariate Analysis
– Normal Distribution

Random Sample and Statistics
Population: the set or universe of all entities under study.
Looking at the entire population may not be feasible, or may be too expensive. Instead, we draw a random sample from the population and compute appropriate statistics from the sample, which give estimates of the corresponding population parameters of interest.

Univariate Sample
Let X be a random variable, and let $x_i$ (1 ≤ i ≤ n) denote the observed values of attribute X in the given data, where n is the data size.
A random sample of size n from X is defined as a set of n independent and identically distributed (IID) random variables $S_1, S_2, \dots, S_n$.
Since the variables $S_i$ are all independent, their joint probability function is given as $f(x_1, \dots, x_n) = \prod_{i=1}^{n} f_X(x_i)$.

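Multivariate Sample
$x_i$: the value of a d-dimensional vector random variable $S_i = (X_1, X_2, \dots, X_d)$.
The $S_i$ are independent and identically distributed, and thus their joint distribution is given as
$f(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} f_X(x_i)$  (1.43)
If the d attributes $X_1, X_2, \dots, X_d$ are also assumed to be independent, (1.43) can be rewritten as
$f(x_1, x_2, \dots, x_n) = \prod_{i=1}^{n} \prod_{j=1}^{d} f_{X_j}(x_{ij})$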

Statistic
Let $S_i$ denote the random variable corresponding to data point $x_i$. A statistic $\hat{\theta}$ is then a function $\hat{\theta} : (S_1, S_2, \dots, S_n) \to \mathbb{R}$.
If we use the value of a statistic to estimate a population parameter, this value is called a point estimate of the parameter, and the statistic is called an estimator of the parameter.

Univariate Analysis
Univariate analysis focuses on a single attribute at a time; thus the data matrix D can be thought of as an n × 1 matrix, or simply a column vector.

Univariate Analysis
X is assumed to be a random variable, and each point $x_i$ (1 ≤ i ≤ n) is assumed to be the value of a random variable $S_i$, where the variables $S_i$ are all independent and identically distributed as X; that is, they constitute a random sample drawn from X.
In the vector view, we treat the sample as an n-dimensional vector and write $X \in \mathbb{R}^n$.

What can sample analysis do?
The density f(X), the distribution F(X), and the parameters (μ, σ) are unknown; sample analysis estimates them from the observed data.

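Empirical Cumulative Distribution Function
$\hat{F}(x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i \le x)$
where $I(x_i \le x) = 1$ if $x_i \le x$, and 0 otherwise.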

Inverse Cumulative Distribution Function
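The formulas for the empirical CDF and its inverse are not in the transcript, so this sketch assumes the usual definitions: the empirical CDF at x is the fraction of sample points that are at most x, and the inverse at q is the smallest sample value whose empirical CDF reaches q. The sample and the names ecdf and inverse_ecdf are illustrative.

```python
def ecdf(sample, x):
    """Empirical CDF: fraction of sample points less than or equal to x."""
    return sum(1 for xi in sample if xi <= x) / len(sample)

def inverse_ecdf(sample, q):
    """Empirical quantile: smallest sample value x with ecdf(sample, x) >= q, for 0 < q <= 1."""
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    for x in sorted(sample):
        if ecdf(sample, x) >= q:
            return x
    raise ValueError("empty sample")

# Made-up sample (an assumption, not data from the slides).
sample = [2, 4, 4, 5, 7]

print(ecdf(sample, 4))            # 3 of the 5 points are <= 4
print(inverse_ecdf(sample, 0.5))  # the sample median under this definition
```

The scan in inverse_ecdf is quadratic in the sample size; it is written for clarity, not speed.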

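Empirical Probability Mass Function
$\hat{f}(x) = P(X = x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i = x)$
where $I(x_i = x) = 1$ if $x_i = x$, and 0 otherwise.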

Measures of Central Tendency (Mean)
Population mean: $\mu = E[X]$
Sample mean (unbiased, not robust): $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$

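Measures of Central Tendency (Median)
Population median: a value m such that $P(X \le m) \ge \frac{1}{2}$ and $P(X \ge m) \ge \frac{1}{2}$, or $F(m) = 0.5$
Sample median: $\hat{F}(m) = 0.5$, or $m = \hat{F}^{-1}(0.5)$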

Measures of Central Tendency (Mode)
Sample mode: the value with the highest empirical probability mass.
– It may not be very useful as a measure of central tendency,
– but it is not affected much by outliers.
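The three measures of central tendency can be compared on a toy sample using Python's standard statistics module. The data are made up, with one deliberate outlier to show why the mean is called not robust.

```python
import statistics

# Made-up sample with one extreme value (an assumption, not data from the slides).
sample = [1, 2, 2, 3, 4, 100]
without_outlier = [1, 2, 2, 3, 4]

mean = statistics.mean(sample)        # pulled far upward by the outlier
median = statistics.median(sample)    # average of the two middle values
mode = statistics.mode(sample)        # most frequent value

mean_clean = statistics.mean(without_outlier)
median_clean = statistics.median(without_outlier)
print(mean, median, mode)
print(mean_clean, median_clean)
```

Dropping the single outlier changes the mean drastically, while the median and the mode barely move.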

Example

Measures of Dispersion (Range)
Range: $r = \max\{X\} - \min\{X\}$
– Not robust; sensitive to extreme values.
Sample range: $\hat{r} = \max_i \{x_i\} - \min_i \{x_i\}$

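Measures of Dispersion (Inter-Quartile Range)
Inter-quartile range (IQR): $IQR = F^{-1}(0.75) - F^{-1}(0.25)$
– More robust than the range.
Sample IQR: $\widehat{IQR} = \hat{F}^{-1}(0.75) - \hat{F}^{-1}(0.25)$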
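To see the robustness claim in action, this sketch computes the sample range and IQR with and without an extreme value. It assumes the quartiles are empirical quantiles (the smallest sample value whose empirical CDF reaches q); libraries often use interpolating conventions that give slightly different numbers. The names quantile, sample_range and iqr, and the data, are illustrative.

```python
import math

def quantile(sample, q):
    """Smallest sample value whose empirical CDF is at least q, for 0 < q <= 1."""
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    ordered = sorted(sample)
    k = math.ceil(q * len(ordered))   # how many ordered points are needed to reach mass q
    return ordered[k - 1]

def sample_range(sample):
    """Largest value minus smallest value."""
    return max(sample) - min(sample)

def iqr(sample):
    """Third quartile minus first quartile."""
    return quantile(sample, 0.75) - quantile(sample, 0.25)

# Made-up samples (an assumption): identical except for the last value.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

print(sample_range(clean), iqr(clean))
print(sample_range(with_outlier), iqr(with_outlier))
```

The single extreme value inflates the range but leaves the IQR unchanged.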

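Measures of Dispersion (Variance and Standard Deviation)
Variance: $\sigma^2 = \mathrm{var}(X) = E[(X - \mu)^2]$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample variance and standard deviation: $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$, $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$ (dividing by $n - 1$ instead of $n$ gives the unbiased estimator of the variance).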
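The sample formulas on this slide are not in the transcript, so it is unclear whether the deck divides by n or by n − 1. The sketch below exposes both through a hypothetical ddof parameter (0 divides by n, 1 divides by n − 1); the data are made up.

```python
import math

def sample_variance(sample, ddof=0):
    """Average squared deviation from the sample mean, dividing by n - ddof."""
    n = len(sample)
    mu = sum(sample) / n
    return sum((x - mu) ** 2 for x in sample) / (n - ddof)

# Made-up sample (an assumption, not data from the slides).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

var_n = sample_variance(sample)               # divides by n
std_n = math.sqrt(var_n)                      # standard deviation is the square root
var_unbiased = sample_variance(sample, ddof=1)  # divides by n - 1
print(var_n, std_n, var_unbiased)
```

The two conventions differ by a factor of n / (n − 1), which matters mainly for small samples.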

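Normalization
Z-score: $x_i' = \frac{x_i - \hat{\mu}}{\hat{\sigma}}$
Linear normalization: $x_i' = \frac{x_i - \min_i \{x_i\}}{\max_i \{x_i\} - \min_i \{x_i\}}$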
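Both normalizations can be sketched on a toy sample. This assumes that the z-score uses the sample mean and a standard deviation computed with a 1/n divisor, and that linear normalization means min-max scaling to [0, 1]; neither formula is visible in the transcript. The function names and data are illustrative.

```python
import math

def z_scores(sample):
    """Subtract the sample mean and divide by the standard deviation (1/n convention)."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in sample) / n)
    return [(x - mu) / sigma for x in sample]

def min_max(sample):
    """Linearly rescale values so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(sample), max(sample)
    return [(x - lo) / (hi - lo) for x in sample]

# Made-up sample (an assumption); both functions need at least two distinct values.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

z = z_scores(sample)
scaled = min_max(sample)
print(z)
print(scaled)
```

Z-scores have mean 0 and unit standard deviation, while min-max values always lie in [0, 1]. A constant sample would cause a division by zero in either function.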

Normalization Example

Bivariate Analysis
Bivariate analysis focuses on two attributes at a time; thus the data matrix D can be thought of as an n × 2 matrix, or as two column vectors.

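Empirical Joint Probability Mass Function
$\hat{f}(x) = P(X = x) = \frac{1}{n} \sum_{i=1}^{n} I(x_i = x)$
or
$\hat{f}(x_1, x_2) = P(X_1 = x_1, X_2 = x_2) = \frac{1}{n} \sum_{i=1}^{n} I(x_{i1} = x_1, x_{i2} = x_2)$
where $x = (x_1, x_2)^T$, $x_i = (x_{i1}, x_{i2})^T$, and $I$ is an indicator that equals 1 when its argument holds, and 0 otherwise.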

Measures of Central Tendency (Mean)
Population mean: $\mu = E[X] = (E[X_1], E[X_2])^T = (\mu_1, \mu_2)^T$
Sample mean: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$

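Measures of Association (Covariance)
Covariance: $\sigma_{12} = E[(X_1 - \mu_1)(X_2 - \mu_2)]$
Sample covariance: $\hat{\sigma}_{12} = \frac{1}{n} \sum_{i=1}^{n} (x_{i1} - \hat{\mu}_1)(x_{i2} - \hat{\mu}_2)$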

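Measures of Association (Correlation)
Correlation: $\rho_{12} = \frac{\sigma_{12}}{\sigma_1 \sigma_2}$
Sample correlation: $\hat{\rho}_{12} = \frac{\hat{\sigma}_{12}}{\hat{\sigma}_1 \hat{\sigma}_2}$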
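Covariance and correlation for two attributes can be sketched as follows, assuming the usual definitions with a 1/n divisor (the divisor cancels in the correlation). The attribute values and function names are made up for illustration.

```python
import math

def mean(xs):
    """Sample mean of one attribute."""
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with a 1/n divisor (an assumption about the convention)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Made-up attributes (an assumption): x2 rises with x1, x3 falls with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]
x3 = [5.0, 4.0, 3.0, 2.0, 1.0]

print(covariance(x1, x2), correlation(x1, x2))
print(covariance(x1, x3), correlation(x1, x3))
```

Correlation is the covariance rescaled to [-1, 1], so exact linear relationships give +1 or -1 regardless of the units.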

Measures of Association (Correlation)

Correlation Example

Multivariate Analysis
Multivariate analysis focuses on multiple attributes at a time; thus the data matrix D can be thought of as an n × d matrix, or as d column vectors.

Measures of Central Tendency (Mean)
Population mean: $\mu = E[X] = (\mu_1, \mu_2, \dots, \mu_d)^T$
Sample mean: $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$

Measures of Association (Covariance Matrix)
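The slide content is missing here. As a sketch, the sample covariance matrix can be built from the centered data matrix Z as (1/n) ZᵀZ, so that entry (i, j) is the covariance between attributes i and j. The 1/n divisor, the toy data, and the function name covariance_matrix are assumptions.

```python
def covariance_matrix(rows):
    """d x d sample covariance matrix of an n x d data matrix (1/n convention)."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]                    # column means
    Z = [[x - m for x, m in zip(row, mu)] for row in rows]       # centered rows
    d = len(mu)
    return [[sum(z[i] * z[j] for z in Z) / n for j in range(d)]
            for i in range(d)]

# Illustrative 3 x 2 data matrix (an assumption, not data from the slides).
D = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 9.0]]

S = covariance_matrix(D)
total_variance = sum(S[i][i] for i in range(len(S)))   # trace of the covariance matrix
print(S, total_variance)
```

The diagonal holds the per-attribute variances, the matrix is symmetric, and its trace is the total variance.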

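Measures of Association (Correlation)
Correlation: $\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}$ for each pair of attributes $X_i$, $X_j$
Sample correlation: $\hat{\rho}_{ij} = \frac{\hat{\sigma}_{ij}}{\hat{\sigma}_i \hat{\sigma}_j}$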

Univariate Normal Distribution

Multivariate Normal Distribution

Thank You!