1
. Algorithms in Computational Biology – 236522 (http://webcourse.cs.technion.ac.il/236522)
Tutorial: Mondays 14:30–15:20, Taub 6
Ilan Gronau. Office: Taub 700. Tel: 829(4894). Email: ilangr@cs.technion.ac.il
Tutorial and lecture slides: available for download from the course website the day before each tutorial or lecture.
Lectures and tutorials from the previous semester (Winter 2004/5): available at http://webcourse.cs.technion.ac.il/236522/Winter2004-2005/en/ho.html
The slides are supplementary material only!
Homework: 5 assignments; submission is mandatory, to the teaching assistant's mailbox (#117 on the 5th floor).
Late submissions and other problems: notify the teaching assistant well before the submission deadline.
2
. Algorithms in Computational Biology – The full story. Best biological explanation; biological data; hypotheses space. Problems: the size of the space, complex scoring functions.
3
. Parameter Estimation Using Likelihood Functions Tutorial #1
4
. The Problem
Given a data set and a probabilistic model with parameters $\Theta = \theta_1, \theta_2, \theta_3, \ldots$, find the best explanation for the observed data.
This helps predict the behavior of similar data sets.
5
. An example: Binomial experiment
Each toss yields Heads with probability P(H) and Tails with probability 1 - P(H).
Each experiment is independent of the others.
The data: a series of experiment results, e.g. D = H H T H T T T H H …
The unknown parameter: $\theta = P(H)$
6
. An example: Binomial experiment
Maximum Likelihood Estimation (MLE):
The likelihood of a given value for $\theta$: $L_D(\theta) = P(D \mid \theta)$
We wish to find a value for $\theta$ which maximizes the likelihood.
Example: the likelihood of 'HTTHH' is:
$L_{HTTHH}(\theta) = P(HTTHH \mid \theta) = \theta (1-\theta) (1-\theta) \theta \theta = \theta^3 (1-\theta)^2$
We only need to know N(H) and N(T) (the number of Heads and Tails). These are sufficient statistics:
$L_D(\theta) = \theta^{N(H)} (1-\theta)^{N(T)}$
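The product form of the likelihood can be checked numerically. The following is a minimal Python sketch, assuming the data is given as a string of 'H' and 'T' characters; the function name `sequence_likelihood` is hypothetical and not part of the tutorial.

```python
def sequence_likelihood(data: str, theta: float) -> float:
    """Multiply the per-toss probabilities: theta for 'H', 1 - theta for 'T'."""
    likelihood = 1.0
    for toss in data:
        likelihood *= theta if toss == "H" else (1.0 - theta)
    return likelihood


if __name__ == "__main__":
    theta = 0.5
    # Toss-by-toss product versus the closed form theta^3 * (1 - theta)^2.
    print(sequence_likelihood("HTTHH", theta))   # 0.03125
    print(theta ** 3 * (1.0 - theta) ** 2)       # 0.03125
```

Both lines print the same value because the order of the tosses does not affect the product.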
7
. Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(D) is a sufficient statistic if for any two data sets D and D':
$s(D) = s(D') \Rightarrow L_D(\theta) = L_{D'}(\theta)$
The likelihood may therefore be calculated from the statistics alone.
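As an illustration of this definition, the sketch below computes the statistic s(D) = (N(H), N(T)) for the binomial experiment and evaluates the likelihood from the counts alone. The helper names `sufficient_stats` and `likelihood_from_stats` are assumptions made for this example.

```python
from typing import Tuple


def sufficient_stats(data: str) -> Tuple[int, int]:
    """Return s(D) = (N(H), N(T)) for a string of 'H'/'T' tosses."""
    return data.count("H"), data.count("T")


def likelihood_from_stats(stats: Tuple[int, int], theta: float) -> float:
    """Evaluate L_D(theta) = theta^N(H) * (1 - theta)^N(T) from the counts only."""
    n_heads, n_tails = stats
    return theta ** n_heads * (1.0 - theta) ** n_tails


if __name__ == "__main__":
    d1, d2 = "HTTHH", "HHHTT"  # different orderings, same counts
    print(sufficient_stats(d1) == sufficient_stats(d2))  # True
    print(likelihood_from_stats(sufficient_stats(d1), 0.6)
          == likelihood_from_stats(sufficient_stats(d2), 0.6))  # True
```

Once the counts are stored, the original toss sequence is no longer needed to evaluate the likelihood for any value of θ.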
8
. An example: Binomial experiment
Maximum Likelihood Estimation (MLE):
Reminder: $L_D(\theta) = \theta^{N(H)} (1-\theta)^{N(T)}$
We wish to maximize the likelihood (or, equivalently, the log-likelihood):
$l_D(\theta) = \log L_D(\theta) = N(H) \log\theta + N(T) \log(1-\theta)$
Maximization: $l_D'(\theta) = 0$, which gives $\hat\theta = \frac{N(H)}{N(H) + N(T)}$
[Figure: plot of $L(\theta)$ for $\theta$ between 0 and 1, with D = H T T H H]
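The maximization step can be spelled out as follows. This is a standard derivation written out for convenience, using the same symbols as the slide; the final line applies it to the example data D = H T T H H.

```latex
\begin{align*}
l_D(\theta) &= N(H)\log\theta + N(T)\log(1-\theta) \\
l_D'(\theta) &= \frac{N(H)}{\theta} - \frac{N(T)}{1-\theta} = 0 \\
N(H)\,(1-\theta) &= N(T)\,\theta \\
\hat\theta &= \frac{N(H)}{N(H)+N(T)} \\
\text{For } D = HTTHH:\quad \hat\theta &= \frac{3}{3+2} = 0.6
\end{align*}
```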
9
. Multinomial experiments
Still a series of independent experiments, but each experiment has K possible results.
Examples:
- die toss (K=6)
- DNA sequence (K=4)
- protein sequence (K=20)
We want to learn the parameters $\theta_1, \theta_2, \ldots, \theta_K$.
Sufficient statistics: $N_1, N_2, \ldots, N_K$, the number of times each outcome is observed.
Likelihood: $L_D(\Theta) = \prod_{k=1}^{K} \theta_k^{N_k}$
MLE: $\hat\theta_k = \frac{N_k}{\sum_{j=1}^{K} N_j}$
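For the DNA case (K=4), the multinomial MLE reduces to counting symbols and normalizing. The sketch below assumes the data is a plain string over a known alphabet; the function name `multinomial_mle` and the example sequence are made up for illustration.

```python
from collections import Counter
from typing import Dict


def multinomial_mle(sequence: str, alphabet: str) -> Dict[str, float]:
    """Estimate theta_k = N_k / sum_j N_j for every symbol in the alphabet."""
    counts = Counter(sequence)
    total = sum(counts[symbol] for symbol in alphabet)
    if total == 0:
        raise ValueError("sequence contains no symbols from the alphabet")
    return {symbol: counts[symbol] / total for symbol in alphabet}


if __name__ == "__main__":
    # Hypothetical toy sequence: N_A = 4, N_C = 2, N_G = 3, N_T = 1.
    estimates = multinomial_mle("ACGTACGGAA", "ACGT")
    for symbol, theta in estimates.items():
        print(symbol, theta)
```

Symbols that never occur receive an estimate of exactly 0, which is one practical motivation for the priors discussed on the following slides.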
10
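. Is ML all we need?
Likelihood: $P(D \mid \Theta)$. What we would like to maximize is $P(\Theta \mid D)$.
Bayes' rule: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Due to: $P(A, B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
In our case: $P(\Theta \mid D) = \frac{P(D \mid \Theta)\,P(\Theta)}{P(D)}$, where $P(\Theta \mid D)$ is the posterior probability, $P(D \mid \Theta)$ is the likelihood, and $P(\Theta)$ is the prior probability.
The prior probability captures our subjective prior knowledge (prejudice) of the model parameters.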
11
. Bayesian inference - prior probabilities
The prior probability captures our subjective prior knowledge (prejudice) of the model parameters.
Example: after a set of tosses 'HTTHH', what would you bet on for the next toss? We a priori assume the coin is balanced with high probability.
Possible assumptions in molecular biology:
- some amino acids have similar frequencies
- some amino acids have low frequencies
- …
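One way to make the coin example concrete is to put a prior on a small set of candidate values of θ. The sketch below is only an illustration: the two candidate values (0.5 and 0.8) and their prior weights (0.95 and 0.05) are invented for this example and do not come from the tutorial, and `predictive_heads` is a hypothetical helper name.

```python
from typing import Dict


def predictive_heads(data: str, prior: Dict[float, float]) -> float:
    """P(next toss is H | data), averaging theta over its posterior.

    `prior` maps each candidate value of theta to its prior probability.
    """
    n_heads, n_tails = data.count("H"), data.count("T")
    # Unnormalized posterior: prior times likelihood for each candidate theta.
    weighted = {
        theta: p * theta ** n_heads * (1.0 - theta) ** n_tails
        for theta, p in prior.items()
    }
    evidence = sum(weighted.values())
    return sum(theta * w / evidence for theta, w in weighted.items())


if __name__ == "__main__":
    # ASSUMED prior: the coin is fair with probability 0.95,
    # otherwise biased towards heads with theta = 0.8.
    assumed_prior = {0.5: 0.95, 0.8: 0.05}
    print(predictive_heads("HTTHH", assumed_prior))
```

Under this assumed prior the predicted probability of heads stays close to 0.5, whereas the maximum likelihood estimate from the same five tosses is 0.6.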
12
. Bayesian inference - Example: dishonest casino
A casino uses two kinds of dice: 99% are fair and 1% are loaded, so that 6 comes up 50% of the time.
We pick a die at random and roll it 3 times.
We get 3 consecutive sixes.
What is the probability that the die is loaded?
13
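. Bayesian inference - Example: dishonest casino (cont.)
$P(\text{loaded} \mid \text{3 sixes}) = ?$ Use Bayes' rule!
$P(\text{3 sixes} \mid \text{loaded}) = 0.5^3$ (likelihood of 'loaded')
$P(\text{loaded}) = 0.01$ (prior)
$P(\text{3 sixes}) = P(\text{3 sixes} \mid \text{loaded})\,P(\text{loaded}) + P(\text{3 sixes} \mid \text{unloaded})\,P(\text{unloaded}) = 0.5^3 \cdot 0.01 + (1/6)^3 \cdot 0.99 = 0.00125 + 0.00458333 = 0.00583333$
$P(\text{loaded} \mid \text{3 sixes}) = \frac{0.00125}{0.00583333} \approx 0.21$
Unlikely loaded!!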
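The casino calculation can be reproduced with a few lines of Python. This is a minimal sketch using the numbers from the example (1% loaded dice, a loaded die shows six half of the time); the function name `posterior_loaded` and its parameter names are hypothetical.

```python
def posterior_loaded(
    n_sixes: int,
    prior_loaded: float = 0.01,
    p_six_loaded: float = 0.5,
    p_six_fair: float = 1.0 / 6.0,
) -> float:
    """P(loaded | n consecutive sixes) by Bayes' rule."""
    joint_loaded = p_six_loaded ** n_sixes * prior_loaded
    joint_fair = p_six_fair ** n_sixes * (1.0 - prior_loaded)
    evidence = joint_loaded + joint_fair  # P(n sixes)
    return joint_loaded / evidence


if __name__ == "__main__":
    for n in range(0, 7):
        print(n, round(posterior_loaded(n), 4))
```

With no rolls the function returns the prior, 0.01. After three sixes the posterior is 3/14, about 0.21, and it keeps growing with every further six.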
14
. Bayesian inference - Biological Example: Proteins
Extracellular proteins have a slightly different amino acid composition than intracellular proteins. From a large enough protein database (e.g. SwissProt) we can get the following:
- $P(int)$ - the probability that an amino-acid sequence is intracellular
- $P(ext)$ - the probability that an amino-acid sequence is extracellular
- $P(a_i \mid int)$ - the frequency of amino acid $a_i$ in intracellular proteins
- $P(a_i \mid ext)$ - the frequency of amino acid $a_i$ in extracellular proteins
What is the probability that a given new protein sequence $X = x_1 x_2 \ldots x_n$ is extracellular?
15
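. Bayesian inference - Biological Example: Proteins (cont.)
What is the probability that a given new protein sequence $X = x_1 x_2 \ldots x_n$ is extracellular?
Assuming that every sequence is either extracellular or intracellular (but not both), we have:
$P(ext) = 1 - P(int)$
$P(X) = P(X \mid ext)\,P(ext) + P(X \mid int)\,P(int)$
Furthermore, assuming the positions of the sequence are independent: $P(X \mid ext) = \prod_{i=1}^{n} P(x_i \mid ext)$ and $P(X \mid int) = \prod_{i=1}^{n} P(x_i \mid int)$
By Bayes' theorem: $P(ext \mid X) = \frac{P(ext) \prod_{i=1}^{n} P(x_i \mid ext)}{P(ext) \prod_{i=1}^{n} P(x_i \mid ext) + P(int) \prod_{i=1}^{n} P(x_i \mid int)}$, where $P(ext)$ is the prior and $P(ext \mid X)$ is the posterior.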
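A possible implementation of this posterior is sketched below. It assumes the residues of the sequence are independent and works with logarithms so that long sequences do not underflow. The function name, the three-letter toy alphabet, and all frequencies in the usage example are invented for illustration; real values would have to be estimated from a database such as SwissProt.

```python
import math
from typing import Dict


def posterior_extracellular(
    sequence: str,
    freq_ext: Dict[str, float],
    freq_int: Dict[str, float],
    prior_ext: float,
) -> float:
    """P(ext | X) for X = x_1 ... x_n, assuming independent positions."""
    # log of P(ext) * prod_i P(x_i | ext), and the same for int.
    log_ext = math.log(prior_ext) + sum(math.log(freq_ext[a]) for a in sequence)
    log_int = math.log(1.0 - prior_ext) + sum(math.log(freq_int[a]) for a in sequence)
    # P(ext | X) = 1 / (1 + exp(log_int - log_ext)), computed without overflow.
    diff = log_int - log_ext
    if diff > 0:
        e = math.exp(-diff)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(diff))


if __name__ == "__main__":
    # HYPOTHETICAL frequencies over a toy three-letter alphabet.
    toy_ext = {"C": 0.4, "K": 0.2, "L": 0.4}
    toy_int = {"C": 0.1, "K": 0.4, "L": 0.5}
    print(posterior_extracellular("CCL", toy_ext, toy_int, prior_ext=0.4))
```

For an empty sequence the function returns the prior P(ext), as expected from the formula.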
16
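. Summary
Data set, model, and parameters $\Theta = \theta_1, \theta_2, \theta_3, \ldots$
The basic paradigm: estimate the model parameters by maximizing either:
- Likelihood - $P(D \mid \Theta)$
- Posterior probability - $P(\Theta \mid D) \propto P(D \mid \Theta)\,P(\Theta)$
In the absence of a significant prior (that is, when the prior is close to uniform) the two are equivalent.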