Essential CS & Statistics (Lecture for CS498-CXZ Algorithms in Bioinformatics) Aug. 30, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Essential CS Concepts
Programming languages: languages we use to communicate with a computer
– Machine language ( …)
– Assembly language (move a, b; add c, b; …)
– High-level language (x = a + 2*b, …), e.g., C++, Perl, Java
– Different languages are designed for different applications
System software: software "assistants" that help a computer
– Understand high-level programming languages (compilers)
– Manage all kinds of devices (operating systems)
– Communicate with users (GUI or command line)
Application software: software for various kinds of applications
– Stand-alone (running on a local computer, e.g., Excel, Word)
– Client-server (running over a network, e.g., a web browser)

Intelligence/Capacity of a Computer
The intelligence of a computer is determined by the intelligence of the software it can run. The capacity of a computer for running software is mainly determined by its
– Speed
– Memory
– Disk space
Given a particular computer, we would like to write software that is highly intelligent, runs fast, and doesn't need much memory (contradictory goals).

Algorithms vs. Software
An algorithm is a procedure for solving a problem:
– Input: description of a problem
– Output: solution(s)
– Step 1: we first do this
– Step 2: …
– …
– Step n: here's the solution!
Software implements an algorithm (in a particular programming language).

Example: Change Problem
Input:
– M (total amount of money)
– c1 > c2 > … > cd (coin denominations)
Output:
– i1, i2, …, id (number of coins of each denomination), such that i1*c1 + i2*c2 + … + id*cd = M and i1 + i2 + … + id is as small as possible

Algorithm Example: BetterChange
BetterChange(M, c, d)            (M: input amount; c1..cd: denominations)
  r = M
  for k = 1 to d {
    ik = floor(r / ck)           (take only the integer part, i.e., the floor)
    r = r - ik*ck
  }
  return (i1, i2, …, id)         (output: number of coins of each denomination)
Properties of an algorithm:
– Correct vs. incorrect algorithms (is BetterChange correct?)
– Fast vs. slow algorithms (how do we quantify this?)
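A minimal runnable sketch of this greedy procedure in Python (the function name and the example denominations below are illustrative, not from the slides):

  def better_change(M, denominations):
      """Greedy change-making; denominations must be sorted in decreasing order."""
      counts = []
      r = M
      for c in denominations:
          counts.append(r // c)   # take only the integer part (floor)
          r = r - (r // c) * c
      return counts

  # Example with US coin values in cents: 67 = 2*25 + 1*10 + 1*5 + 2*1
  print(better_change(67, [25, 10, 5, 1]))   # -> [2, 1, 1, 2]

Note that the greedy strategy is not correct for every denomination set: with denominations (25, 20, 10, 5, 1) and M = 40, it returns 25 + 10 + 5 (three coins) even though 20 + 20 (two coins) is better, which is exactly the correctness question raised above.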

Big-O Notation
How can we compare the running time of two algorithms in a computer-independent way? Observations:
– In general, as the problem size grows, the running time increases (sorting 500 numbers takes more time than sorting 5)
– Running time is more critical for large problem sizes (think of sorting 5 numbers vs. sorting a very large set of numbers)
How about measuring the growth rate of the running time?

Big-O Notation (cont.)
– Define the problem size (e.g., the length of a sequence, n)
– Define "basic steps" (e.g., addition, division, …)
– Express the running time as a function of the problem size (e.g., 3*n*log(n) + n)
– As the problem size approaches positive infinity, only the highest-order term "counts"
– Big-O indicates the highest-order term; e.g., this algorithm has O(n*log(n)) time complexity
Polynomial (O(n^2)) vs. exponential (O(2^n)) running time; NP-complete problems
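As a small worked example (not from the original slides) of why only the highest-order term counts, take T(n) = 3n·log n + n:

  T(n) = 3n\log n + n \le 3n\log n + n\log n = 4n\log n \quad \text{for } n \ge 2 \ (\text{since } \log n \ge 1)

so T(n) \le c\, n\log n with c = 4 for all n \ge 2, i.e., T(n) = O(n \log n).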

Basic Probability & Statistics

Purpose of Prob. & Statistics
Deductive vs. plausible reasoning; incomplete knowledge → uncertainty. How do we quantify inference under uncertainty?
– Probability: models of random processes/experiments (how data are generated)
– Statistics: draw conclusions about the whole population based on samples (inference from data)

Basic Concepts in Probability
Sample space: all possible outcomes, e.g.,
– tossing 2 coins, S = {HH, HT, TH, TT}
Event: E ⊆ S; E happens iff the outcome is in E, e.g.,
– E = {HH} (all heads)
– E = {HH, TT} (same face)
Probability of an event: 0 ≤ P(E) ≤ 1, such that
– P(S) = 1 (the outcome is always in S)
– P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅ (disjoint events)
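A tiny sketch (not part of the original slides) that enumerates this sample space and checks the two event probabilities for fair coins:

  from itertools import product
  from fractions import Fraction

  # Sample space for tossing 2 fair coins; each outcome is equally likely
  S = ["".join(o) for o in product("HT", repeat=2)]   # ['HH', 'HT', 'TH', 'TT']

  def prob(event):
      """P(E) = |E| / |S| for equally likely outcomes."""
      return Fraction(len(event), len(S))

  print(prob({"HH"}))         # all heads -> 1/4
  print(prob({"HH", "TT"}))   # same face -> 1/2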

Basic Concepts of Prob. (cont.)
Conditional probability: P(B|A) = P(A ∩ B)/P(A)
– P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B)
– So, P(A|B) = P(B|A)P(A)/P(B)
– For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)
Total probability: if A1, …, An form a partition of S, then
– P(B) = P(B ∩ S) = P(B ∩ A1) + … + P(B ∩ An)
– So, P(Ai|B) = P(B|Ai)P(Ai)/P(B) (Bayes' Rule)

Interpretation of Bayes' Rule
Hypothesis space: H = {H1, …, Hn}; evidence: E
P(Hi|E) = P(E|Hi)P(Hi)/P(E)
– P(Hi|E): posterior probability of Hi
– P(Hi): prior probability of Hi
– P(E|Hi): likelihood of the data/evidence if Hi is true
If we want to pick the most likely hypothesis H*, we can drop P(E) and compare P(E|Hi)P(Hi) directly.
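A small numeric sketch of this idea; the priors and likelihoods below are made-up values, purely for illustration:

  # Hypothetical hypothesis space with made-up priors and likelihoods
  priors = {"H1": 0.7, "H2": 0.2, "H3": 0.1}          # P(Hi)
  likelihoods = {"H1": 0.01, "H2": 0.20, "H3": 0.50}  # P(E|Hi)

  # Unnormalized posteriors P(E|Hi)*P(Hi); P(E) is the same for every Hi,
  # so it can be dropped when we only need the argmax
  scores = {h: likelihoods[h] * priors[h] for h in priors}
  best = max(scores, key=scores.get)
  print(best, scores)   # H3 wins: 0.1*0.5 = 0.05 > 0.2*0.2 = 0.04 > 0.7*0.01 = 0.007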

Random Variable
X: S → ℝ (a "measure" of the outcome)
Events can be defined according to X
– E(X=a) = {si | X(si) = a}
– E(X≤a) = {si | X(si) ≤ a}
So, probabilities can be defined on X
– P(X=a) = P(E(X=a))
– P(X≤a) = P(E(X≤a)) (F(a) = P(X≤a): the cumulative distribution function)
Discrete vs. continuous random variables (think of "partitioning the sample space")

An Example
Think of a DNA sequence as the result of tossing a 4-sided die many times independently:
P(AATGC) = p(A)p(A)p(T)p(G)p(C)
A model specifies {p(A), p(C), p(G), p(T)}, e.g., all 0.25 (random model M0):
P(AATGC|M0) = 0.25*0.25*0.25*0.25*0.25
Comparing 2 models (see the sketch below):
– M1: coding regions
– M2: non-coding regions
– Decide whether AATGC is more likely to be a coding region
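A minimal sketch of this model comparison in Python; the parameter values for M1 and M2 below are hypothetical, chosen only to illustrate the computation:

  import math

  def log_prob(seq, model):
      """log P(seq | model) under an i.i.d. (4-sided die) model."""
      return sum(math.log(model[base]) for base in seq)

  M0 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # random model
  M1 = {"A": 0.30, "C": 0.25, "G": 0.25, "T": 0.20}   # hypothetical "coding" model
  M2 = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}   # hypothetical "non-coding" model

  seq = "AATGC"
  print(math.exp(log_prob(seq, M0)))              # 0.25**5 ≈ 0.000977
  print(log_prob(seq, M1) > log_prob(seq, M2))    # True -> more likely under M1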

Probability Distributions
– Binomial: number of successes out of N trials
– Gaussian: sum of N independent R.V.'s
– Multinomial: getting ni occurrences of outcome i
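For reference, the standard forms of these three distributions are (in LaTeX notation; θ is the success probability, μ and σ² are the mean and variance, and p1, …, pk are the outcome probabilities with n1 + … + nk = N):

  \text{Binomial:}\quad P(X = n) = \binom{N}{n}\,\theta^{n}(1-\theta)^{N-n}
  \text{Gaussian:}\quad p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
  \text{Multinomial:}\quad P(n_1,\ldots,n_k) = \frac{N!}{n_1!\cdots n_k!}\,p_1^{n_1}\cdots p_k^{n_k}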

Parameter Estimation
General setting:
– Given a (hypothesized & probabilistic) model that governs the random experiment
– The model gives a probability p(D|θ) for any data D, depending on the parameter θ
– Now, given actual sample data X = {x1, …, xn}, what can we say about the value of θ?
Intuitively, take your best guess of θ -- "best" means "best explaining/fitting the data". Generally an optimization problem.

Maximum Likelihood Estimator
Data: a sequence d with counts c(w1), …, c(wN), and length |d|
Model: multinomial M with parameters {p(wi)}
Likelihood: p(d|M)
Maximum likelihood estimator: M = argmax_M p(d|M)
We tune p(wi) to maximize the log-likelihood l(d|M), using the Lagrange multiplier approach and setting partial derivatives to zero; the resulting ML estimate is p(wi) = c(wi)/|d| (see the derivation below).
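For completeness, here is the standard Lagrange-multiplier derivation that the slide outlines (in LaTeX notation):

  l(d|M) = \log p(d|M) = \sum_{i=1}^{N} c(w_i)\log p(w_i), \qquad \text{subject to } \sum_{i=1}^{N} p(w_i) = 1
  L = \sum_{i} c(w_i)\log p(w_i) + \lambda\left(\sum_{i} p(w_i) - 1\right)
  \frac{\partial L}{\partial p(w_i)} = \frac{c(w_i)}{p(w_i)} + \lambda = 0 \;\Rightarrow\; p(w_i) = -\frac{c(w_i)}{\lambda}
  \sum_{i} p(w_i) = 1 \;\Rightarrow\; \lambda = -\sum_{i} c(w_i) = -|d| \;\Rightarrow\; \hat{p}(w_i) = \frac{c(w_i)}{|d|}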

Maximum Likelihood vs. Bayesian
Maximum likelihood estimation
– "Best" means "the data likelihood reaches its maximum"
– Problem: unreliable with a small sample
Bayesian estimation
– "Best" means being consistent with our "prior" knowledge and explaining the data well
– Problem: how to define the prior?

Bayesian Estimator
ML estimator: M = argmax_M p(d|M)
Bayesian estimator:
– First consider the posterior: p(M|d) = p(d|M)p(M)/p(d)
– Then consider the mean or mode of the posterior distribution
p(d|M): sampling distribution (of the data)
p(M) = p(θ1, …, θN): our prior on the model parameters
Conjugate prior: the prior can be interpreted as "extra"/"pseudo" data; the Dirichlet distribution is the conjugate prior for the multinomial sampling distribution, with "extra"/"pseudo" counts αi, e.g., αi = μ p(wi|REF)

Dirichlet Prior Smoothing (cont.)
Posterior distribution of the parameters: p(θ|d) ∝ p(d|θ)p(θ), which is again Dirichlet, with counts c(wi) + αi
The predictive distribution is the same as the posterior mean: p(wi|d) = (c(wi) + αi) / (|d| + Σj αj) (the Bayesian estimate)
What happens as |d| → ∞?
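A small sketch (with made-up counts and pseudo-counts) comparing the ML estimate with the Dirichlet-smoothed Bayesian estimate:

  counts = {"A": 5, "C": 1, "G": 3, "T": 0}          # hypothetical observed counts c(wi)
  alpha  = {"A": 1.0, "C": 1.0, "G": 1.0, "T": 1.0}  # hypothetical pseudo-counts (uniform prior)

  d_len = sum(counts.values())
  alpha_sum = sum(alpha.values())

  ml    = {w: counts[w] / d_len for w in counts}                           # c(wi)/|d|
  bayes = {w: (counts[w] + alpha[w]) / (d_len + alpha_sum) for w in counts}

  print(ml)     # note p(T) = 0 under ML
  print(bayes)  # p(T) > 0 after smoothing with pseudo-counts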

Illustration of Bayesian Estimation
Prior: p(θ); likelihood: p(D|θ), with D = (c1, …, cN); posterior: p(θ|D) ∝ p(D|θ)p(θ)
[Figure: curves of the prior, likelihood, and posterior over θ, marking the prior mode, the ML estimate θml, and the posterior mode]

Basic Concepts in Information Theory
– Entropy: measuring the uncertainty of a random variable
– Kullback-Leibler divergence: comparing two distributions
– Mutual information: measuring the correlation of two random variables

Entropy
Entropy H(X) measures the average uncertainty of a random variable X:
H(X) = -Σx p(x) log p(x)
Example: for a fair coin, H(X) = -(1/2 log 1/2 + 1/2 log 1/2) = 1 bit
Properties: H(X) ≥ 0; min = 0 (a deterministic X); max = log M, where M is the total number of possible values

Interpretations of H(X)
Measures the "amount of information" in X
– Think of each value of X as a "message"
– Think of X as a random experiment (20 questions)
Minimum average number of bits needed to compress values of X
– The more random X is, the harder it is to compress
– A fair coin has the maximum information, and is hardest to compress
– A biased coin has some information, and can be compressed to <1 bit on average
– A completely biased coin has no information, and needs 0 bits
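A quick sketch (not from the slides) computing H(X), in bits, for the three coins just described:

  import math

  def entropy(probs, base=2):
      """H(X) = -sum p(x) log p(x); 0*log(0) is taken as 0."""
      return -sum(p * math.log(p, base) for p in probs if p > 0)

  print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit (maximum information)
  print(entropy([0.9, 0.1]))    # biased coin: ~0.469 bits (< 1 bit on average)
  print(entropy([1.0, 0.0]))    # completely biased coin: 0.0 bits (no information)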

Cross Entropy H(p,q)
What if we encode X with a code optimized for a wrong distribution q? Expected # of bits: H(p,q) = -Σx p(x) log q(x)
Intuitively, H(p,q) ≥ H(p); mathematically, H(p,q) = H(p) + D(p||q) ≥ H(p).

Kullback-Leibler Divergence D(p||q)
If we encode X with a code optimized for the wrong distribution q, how many bits do we waste? D(p||q) = Σx p(x) log (p(x)/q(x)) = H(p,q) - H(p) (also called relative entropy)
Properties:
– D(p||q) ≥ 0
– D(p||q) ≠ D(q||p) (not symmetric)
– D(p||q) = 0 iff p = q
KL-divergence is often used to measure the "distance" between two distributions.
Interpretation:
– Fixing p, D(p||q) and H(p,q) vary in the same way
– If p is an empirical distribution, minimizing D(p||q) or H(p,q) is equivalent to maximizing the likelihood
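A short sketch (with made-up distributions p and q) that checks the relationship D(p||q) = H(p,q) - H(p) numerically:

  import math

  p = [0.5, 0.3, 0.2]   # hypothetical "true" distribution
  q = [0.4, 0.4, 0.2]   # hypothetical "wrong" coding distribution

  H_p  = -sum(pi * math.log2(pi) for pi in p)                    # entropy H(p)
  H_pq = -sum(pi * math.log2(qi) for pi, qi in zip(p, q))        # cross entropy H(p,q)
  D_pq =  sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))   # KL divergence D(p||q)

  print(H_p, H_pq, D_pq)
  print(abs(D_pq - (H_pq - H_p)) < 1e-12)   # True: D(p||q) = H(p,q) - H(p)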

Cross Entropy, KL-Div, and Likelihood
Likelihood: p(d|M) = Πi p(wi|M)^c(wi)
Log-likelihood: log p(d|M) = Σi c(wi) log p(wi|M) = |d| Σi p̂(wi) log p(wi|M) = -|d| H(p̂, pM), where p̂(wi) = c(wi)/|d| is the empirical distribution
Criterion for estimating a good model: maximizing the likelihood is equivalent to minimizing the cross entropy H(p̂, pM), and hence to minimizing D(p̂||pM).

Mutual Information I(X;Y)
Comparing two distributions: p(x,y) vs. p(x)p(y)
I(X;Y) = Σx Σy p(x,y) log [p(x,y) / (p(x)p(y))] = D(p(x,y) || p(x)p(y))
Conditional entropy: H(Y|X); I(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)
Properties: I(X;Y) ≥ 0; I(X;Y) = I(Y;X); I(X;Y) = 0 iff X and Y are independent
Interpretations:
– Measures how much the uncertainty of X is reduced given information about Y
– Measures the correlation between X and Y
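A small sketch (with a made-up joint distribution) computing I(X;Y) as the KL divergence between p(x,y) and p(x)p(y):

  import math

  # Hypothetical joint distribution p(x,y) over two binary variables
  joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

  px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
  py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

  # I(X;Y) = sum_{x,y} p(x,y) log [ p(x,y) / (p(x)p(y)) ]
  I = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())
  print(I)   # ~0.278 bits; it would be 0 if X and Y were independent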

What You Should Know
– Computational complexity, big-O notation
– Probability concepts: sample space, event, random variable, conditional probability, multinomial distribution, etc.
– Bayes' formula and its interpretation
– Statistics: know how to compute the maximum likelihood estimate
– Information theory concepts: entropy, cross entropy, relative entropy, conditional entropy, KL-divergence, mutual information, and their relationships