Random Variables and Distributions COMP5318 Knowledge Discovery and Data Mining.

Examples

We have all heard statements like “Height is Normally distributed.” [Figure: Normal density curve with its mean and standard deviation marked]

Why distributions are important A distribution captures the essence of the data associated with a particular variable or set of variables (e.g., height). If we know that height is Normally distributed, then a small random sample is enough to give a very good idea of the general population. We can then answer questions like: what is the probability of finding a 2-metre-tall Australian? To do this we need to understand the concept of a random variable.

Random Variable Let S be the sample space. A random variable X is a function $X: S \to \mathbb{R}$. Suppose we toss a coin twice. Let X be the random variable "number of heads".

Random Variable (Number of Heads in two coin tosses)
S     X
TT    0
TH    1
HT    1
HH    2
We also associate a probability with X attaining each value.

Random Variable (Number of Heads in two coin tosses)
S     Prob   X
TT    1/4    0
TH    1/4    1
HT    1/4    1
HH    1/4    2

X     P(X=x)
0     1/4
1     1/2
2     1/4
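The table above can be rebuilt by enumerating the sample space. This is a minimal Python sketch, assuming a fair coin and independent tosses (which is what the 1/4 probabilities imply). The names `sample_space` and `pmf` are illustrative and not from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space S for two tosses: TT, TH, HT, HH.
sample_space = ["".join(tosses) for tosses in product("TH", repeat=2)]

# Assumption: fair coin, independent tosses, so every outcome has probability 1/4.
p_outcome = Fraction(1, len(sample_space))

# X maps each outcome to its number of heads; add up P(X = x) over outcomes.
pmf = Counter()
for outcome in sample_space:
    x = outcome.count("H")
    pmf[x] += p_outcome

for x in sorted(pmf):
    print(x, pmf[x])
```

Using `Fraction` keeps the probabilities exact, so they can be compared directly with the table.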

Random Variables follow a Distribution The height of Australian soldiers is a random variable which follows a Normal distribution with mean 180 cm and standard deviation 15 cm. The frequency of words in a text is a random variable which follows a Zipf distribution. The speed of a hurricane is a random variable which follows a Cauchy distribution. The number of car accidents in a fixed time duration is a random variable which follows a Poisson distribution. The number of heads in a sequence of coin tosses is a random variable which follows a Binomial distribution. The number of web hits in a given time period is a r.v. which follows a Pareto distribution. Many times we don’t know what named distribution a r.v. follows or whether it follows any named distribution at all!

Distribution Definitions Discrete Probability Distribution Continuous Probability Distribution Cumulative Distribution Function

Discrete Distribution A r.v. X is discrete if it takes countably many values $\{x_1, x_2, \ldots\}$. The probability function or probability mass function for X is given by $f_X(x) = P(X = x)$. From the previous example: $f_X(0) = 1/4$, $f_X(1) = 1/2$, $f_X(2) = 1/4$, and $f_X(x) = 0$ otherwise.

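Continuous Distributions A r.v. X is continuous if there exists a function $f_X$ such that $f_X(x) \ge 0$ for all x, $\int_{-\infty}^{\infty} f_X(x)\,dx = 1$, and, for every $a \le b$, $P(a < X < b) = \int_a^b f_X(x)\,dx$. The function $f_X$ is called the probability density function (pdf).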

Example: Continuous Distribution Suppose X has the pdf $f_X(x) = 1$ for $0 \le x \le 1$ and $f_X(x) = 0$ otherwise. This is the Uniform(0,1) distribution.

Binomial Distribution A coin lands Heads with probability p. Flip it n times and let X be the number of Heads. Assume the flips are independent. Let f(x) = P(X = x); then $f(x) = \binom{n}{x} p^x (1-p)^{n-x}$ for $x = 0, 1, \ldots, n$, and $f(x) = 0$ otherwise.

Binomial Example Let p = 0.5 and n = 5; then $P(X = 4) = \binom{5}{4}(0.5)^4(0.5)^1 = 5/32 \approx 0.1563$. In Matlab: >> binopdf(4,5,0.5)
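For readers without the Statistics toolbox, the same probability can be computed from the Binomial formula directly. This is a minimal Python sketch. `binomial_pmf` is a hypothetical helper written for this example, not a library function.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ Binomial(n, p); zero outside 0..n."""
    if x < 0 or x > n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Same arguments as the Matlab call binopdf(4,5,0.5): x = 4, n = 5, p = 0.5.
prob = binomial_pmf(4, 5, 0.5)
print(prob)  # 0.15625
```

Summing `binomial_pmf(x, 5, 0.5)` over x = 0, ..., 5 gives 1, which is a quick sanity check on the formula.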

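Normal Distribution X has a Normal (Gaussian) distribution with parameters μ and σ if $f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. X is standard Normal if μ = 0 and σ = 1; a standard Normal r.v. is denoted Z. If $X \sim N(\mu, \sigma^2)$ then $Z = (X - \mu)/\sigma \sim N(0, 1)$.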

Normal Example The number of spam emails received by an email server in a day follows a Normal distribution with mean 1000 and standard deviation 500, i.e. $N(1000, 500^2)$. What is the probability of receiving 2000 spam emails in a day? Let X be the number of spam emails received in a day. Do we want P(X = 2000)? Because X is treated as continuous, P(X = 2000) = 0. It is more meaningful to ask for P(X >= 2000).

Normal Example This is $P(X \ge 2000) = 1 - P(X < 2000) = 1 - F(2000)$. In Matlab: >> 1 - normcdf(2000,1000,500) The answer is 1 - 0.9772 = 0.0228, or 2.28%. This type of analysis is so common that the function involved has a special name: the cumulative distribution function F, where $F(x) = P(X \le x)$.
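The tail probability for the spam example can be reproduced with Python's standard library. This sketch assumes, as the Matlab call does, that 1000 is the mean and 500 is the standard deviation. `NormalDist.cdf` plays the role of `normcdf`.

```python
from statistics import NormalDist

# Assumption: mean 1000 and standard deviation 500, as in normcdf(2000,1000,500).
spam = NormalDist(mu=1000, sigma=500)

# P(X >= 2000) = 1 - F(2000), where F is the cumulative distribution function.
tail = 1 - spam.cdf(2000)
print(round(tail, 4))  # 0.0228
```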

Outliers In data mining we are often interested in outliers, especially in high-dimensional data, which we cannot easily visualize. Knowledge of distributions can be very useful in this context. Let's see how.

Outliers in Normal Distribution Conventionally, something is considered an outlier if it is at least three standard deviations away from the mean. Let's assume we have a standard Normal distribution, N(0,1). We want P(|X| > 3) = P(X < -3) + P(X > 3) = normcdf(-3,0,1) + 1 - normcdf(3,0,1) = 0.0027
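The three-standard-deviation figure can be checked numerically, and the same calculation shows how quickly the two-sided tail shrinks as the cutoff grows. This is a minimal Python sketch. The helper name `two_sided_tail` is illustrative only.

```python
from statistics import NormalDist

z = NormalDist()  # standard Normal, N(0,1)

def two_sided_tail(k: float) -> float:
    """P(|Z| > k) = P(Z < -k) + P(Z > k) for a standard Normal Z."""
    return z.cdf(-k) + (1 - z.cdf(k))

for k in (1, 2, 3):
    print(k, round(two_sided_tail(k), 4))
```

For k = 3 this gives about 0.0027, so roughly 27 points in 10,000 fall outside three standard deviations even when the data are exactly Normal.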

Outliers using Univariate Normal Distribution Typically we are given data and we want to find outliers in the data, if any. Here are the steps:
1. Make the assumption that the data come from a Normal distribution.
2. Estimate the parameters of the Normal distribution.
3. Find all data points which are more than three standard deviations away from the mean.
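The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation. The parameters are estimated with the sample mean and the sample standard deviation, the function name `three_sigma_outliers` is hypothetical, and the height values are made up for the example.

```python
from statistics import mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k estimated standard deviations from the mean."""
    mu = mean(values)      # step 2: estimate the mean
    sigma = stdev(values)  # step 2: estimate the standard deviation (N - 1 denominator)
    # Step 3: keep the points that are far from the mean in units of sigma.
    return [v for v in values if abs(v - mu) > k * sigma]

# Made-up heights in cm (step 1 assumes they come from a Normal distribution);
# the last value is deliberately extreme.
heights = [172, 175, 168, 180, 177, 169, 174, 181, 176, 173,
           170, 178, 179, 171, 175, 174, 177, 172, 176, 250]

print(three_sigma_outliers(heights))  # [250]
```

Note that the extreme value itself inflates the estimated standard deviation, so in small samples a genuine outlier can fail to exceed the three-sigma cutoff.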

Outliers in Multidimensional Data Recall that in the Iris data we have four attributes and one class label. This is an example of a multidimensional data set. Look at the exponent of the Normal distribution: $-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2$. The term $\left(\frac{x-\mu}{\sigma}\right)^2$ is the square of the distance from a point x to the mean μ in units of the standard deviation σ.

Outliers in Multidimensional Data In multidimensional data this can be generalized to $(x-\mu)^T \Sigma^{-1} (x-\mu)$. This is called the (squared) Mahalanobis distance. Σ is a d x d matrix called the variance-covariance matrix.

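Variance-Covariance Matrix If the data set is an N x d matrix with rows $x_1, \ldots, x_N$ (each written as a d x 1 column vector) and sample mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$, then $\Sigma = \frac{1}{N-1}\sum_{i=1}^{N}(x_i-\bar{x})(x_i-\bar{x})^T$.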

In Matlab Suppose we generate a random 100 x 5 data set:
>> data = rand(100,5);
The covariance matrix is
>> cv = cov(data)
    0.0998   -0.0022    0.0006   -0.0080   -0.0025
   -0.0022    0.0933   -0.0051   -0.0100   -0.0010
    0.0006   -0.0051    0.0810   -0.0085    0.0083
   -0.0080   -0.0100   -0.0085    0.0820    0.0071
   -0.0025   -0.0010    0.0083    0.0071    0.0859
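To see what the covariance call computes, here is a minimal pure-Python sketch of the sample variance-covariance matrix. It assumes the N - 1 denominator, which is the default normalization of Matlab's `cov`. The function name `covariance_matrix` and the seed are choices made for this example, and the numbers will differ from the Matlab output above because the random data differ.

```python
import random

def covariance_matrix(data):
    """Sample variance-covariance matrix (N - 1 denominator) of an N x d list of rows."""
    n = len(data)
    d = len(data[0])
    # Column means: one per attribute.
    means = [sum(row[j] for row in data) / n for j in range(d)]
    # Entry (j, k) averages the product of deviations of attributes j and k.
    return [
        [
            sum((row[j] - means[j]) * (row[k] - means[k]) for row in data) / (n - 1)
            for k in range(d)
        ]
        for j in range(d)
    ]

random.seed(0)  # arbitrary seed so the run is repeatable
data = [[random.random() for _ in range(5)] for _ in range(100)]
cv = covariance_matrix(data)

for row in cv:
    print(" ".join(f"{v:8.4f}" for v in row))
```

For Uniform(0,1) columns the diagonal entries should be close to 1/12 ≈ 0.083 and the off-diagonal entries close to 0, which matches the pattern in the Matlab output.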

Intuitive: Mahalanobis Distance

Distribution of Mahalanobis Distance It turns out that if an N x d data set A is drawn from a multivariate Normal distribution, then the squared Mahalanobis distance of each point from the mean follows a Chi-Square distribution with d degrees of freedom.

Chi-Square Distribution Curse of dimensionality

Algorithm for Finding Outliers >>chi2inv(.975,d)
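The slide shows only the threshold call, so the following Python sketch is one assumed reading of the algorithm: estimate the mean and covariance, compute each point's squared Mahalanobis distance, and flag points whose distance exceeds the 0.975 Chi-Square quantile. To stay within the standard library it is restricted to d = 2. With 2 degrees of freedom the Chi-Square CDF is $1 - e^{-x/2}$, so the quantile has the closed form $-2\ln(1-p)$. All function names and the synthetic data are hypothetical.

```python
import math
import random

def chi2_quantile_2dof(p):
    """Quantile of the Chi-Square distribution with 2 degrees of freedom (closed form)."""
    return -2.0 * math.log(1.0 - p)

def mahalanobis_outliers_2d(points, p=0.975):
    """Flag 2-D points whose squared Mahalanobis distance exceeds the Chi-Square quantile."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2 x 2 sample covariance matrix (N - 1 denominator).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    threshold = chi2_quantile_2dof(p)
    flagged = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # (dx, dy) Sigma^{-1} (dx, dy)^T using the closed-form inverse of a 2 x 2 matrix.
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > threshold:
            flagged.append((x, y))
    return flagged

random.seed(1)  # arbitrary seed so the run is repeatable
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]
points.append((8.0, -8.0))  # deliberately extreme point

outliers = mahalanobis_outliers_2d(points)
print(round(chi2_quantile_2dof(0.975), 4))  # 7.3778
print((8.0, -8.0) in outliers)  # True
```

With a 0.975 cutoff, about 2.5% of perfectly ordinary Normal points are expected to be flagged as well, so the flagged list usually contains more than the injected point. For general d, the threshold would come from a Chi-Square quantile function such as the `chi2inv(.975,d)` call on the slide.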

Homework Define the first, second and third quartiles in terms of the cumulative distribution function. Use that to understand the previous algorithm. Start looking up the Matlab help files for the Statistics toolbox. Also, work out what “estimating the parameters of a distribution from data” means.
