Lecture 6 CS566: Pairwise Sequence Analysis-V

Presentation transcript:

Lecture 6 CS566, Slide 1: Pairwise Sequence Analysis-V
Relatedness
– “Not just important, but everything”
Modeling Alignment Scores
– Coin Tosses
– Unit Distributions
– Extreme Value Distribution
– Lambda and K revealed
– Loose Ends

Slide 2: Modeling Expectation
Reduced model: coin tosses
– Given: $N$ coin tosses, each with probability of heads $p$
– Problem: What is the expected length of the longest run of heads?
– Solution:
   Experimental: perform several repetitions and count
   Theoretical: $E(\mathrm{Run}_{\max}) = \log_{1/p} N$
– For example, for a fair coin and 64 tosses, $E(\mathrm{Run}_{\max}) = \log_2 64 = 6$
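The experimental and theoretical solutions on this slide can be compared with a short simulation. This is a minimal sketch, not part of the original lecture: the function names, the number of repetitions, and the fixed random seed are all assumptions chosen for illustration.

```python
import math
import random


def longest_run(tosses):
    """Return the length of the longest run of True values (heads)."""
    best = current = 0
    for heads in tosses:
        current = current + 1 if heads else 0
        best = max(best, current)
    return best


def expected_longest_run(n_tosses, p_heads):
    """Theoretical estimate from the slide: log base 1/p of N."""
    return math.log(n_tosses) / math.log(1.0 / p_heads)


def simulated_longest_run(n_tosses, p_heads, repetitions, seed=0):
    """Average the longest run of heads over repeated random experiments.

    The seed is fixed only so that the sketch is reproducible.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(repetitions):
        tosses = [rng.random() < p_heads for _ in range(n_tosses)]
        total += longest_run(tosses)
    return total / repetitions


if __name__ == "__main__":
    print("theoretical:", expected_longest_run(64, 0.5))
    print("simulated:  ", simulated_longest_run(64, 0.5, repetitions=2000))
```

The logarithmic formula is a leading-order estimate, so the simulated average should land in the same neighborhood as the theoretical value rather than match it exactly.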

Slide 3: Random alignment as coin tosses
Head = Match
Assume
– Score = Run of matches
– Maximum score = Longest run of matches
Therefore
– Same model of expectation
– For example, for DNA sequences of length $N$: $E(\mathrm{matchlength}_{\max})$ = expected longest run of matches = $\log_{1/p} N$

Slide 4: Local alignment as coin tosses
Assume
– Score in local alignment = Run of matches
– Maximum score = Longest run of matches
Therefore
– Similar model of expectation
– For DNA sequences of lengths $n$ and $m$:
   $E(\mathrm{matchlength}_{\max}) \sim \log_{1/p}(nm)$ (Why not just $n$ or $m$?)
   $E(\mathrm{matchlength}_{\max}) \sim \log_{1/p}(K'nm)$
   $\mathrm{Var}(\mathrm{matchlength}_{\max}) = C$ (i.e., a constant, independent of the sample space)
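One way to see why the product $nm$ appears rather than $n$ or $m$ alone: a local run of matches may start at any of the $n \times m$ pairs of positions in the two sequences. The sketch below checks this on random DNA, assuming uniform base frequencies (so $p = 1/4$) and ignoring the constant $K'$. The function names, sequence lengths, and seed are illustrative assumptions.

```python
import math
import random


def longest_common_run(a, b):
    """Longest run of consecutive matches over all ungapped offsets of a
    against b (equivalently, the length of the longest common substring)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous diagonal cell.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def expected_local_run(n, m, p_match):
    """Leading-order estimate from the slide: log base 1/p of (n * m)."""
    return math.log(n * m) / math.log(1.0 / p_match)


def simulated_local_run(n, m, repetitions, seed=0):
    """Average longest common run between pairs of random DNA sequences."""
    rng = random.Random(seed)
    total = 0
    for _ in range(repetitions):
        a = "".join(rng.choice("ACGT") for _ in range(n))
        b = "".join(rng.choice("ACGT") for _ in range(m))
        total += longest_common_run(a, b)
    return total / repetitions


if __name__ == "__main__":
    print("theoretical:", expected_local_run(100, 100, 0.25))
    print("simulated:  ", simulated_local_run(100, 100, repetitions=50))
```

The dynamic-programming table keeps only two rows, since each cell depends only on its upper-left neighbor.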

Slide 5: Refining Model
$S$ = AS matrix based score between unrelated sequences
$E(S) \sim \log_{1/p}(K'nm) \sim \ln(Knm)/\lambda$ (where $\lambda = \ln(1/p)$)
Holy Grail: need $P(S > x)$, the probability of a score between unrelated sequences exceeding $x$
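The step from the base-$1/p$ logarithm to the natural logarithm is a change of base. It is written out here for the constant $K'$; the slide then renames the constant to $K$. The numerical value assumes uniform DNA base frequencies ($p = 1/4$), which is an illustrative choice and not stated on the slide.

```latex
\[
\log_{1/p}(K'nm) \;=\; \frac{\ln(K'nm)}{\ln(1/p)} \;=\; \frac{\ln(K'nm)}{\lambda},
\qquad \lambda = \ln\frac{1}{p}.
\]
\[
\text{For } p = \tfrac{1}{4}: \quad \lambda = \ln 4 \approx 1.386 .
\]
```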

Slide 6: Poisson distribution estimate of $P(S > x)$
Consider the coin toss example
– Given $x \gg E(\mathrm{Run}_{\max})$
– Define Success = $(\mathrm{Run}_{\max} \ge x)$
– Define $P_n$ = probability of $n$ successes
– Define $y = E[\mathrm{Success}]$, i.e., the average number of successes
Then the probability of $n$ successes follows a Poisson distribution: $P_n = e^{-y} y^n / n!$
Probability of 0 successes (no score exceeding $x$): $P_0 = e^{-y}$
Then the probability of at least one score exceeding $x$: $P(S > x) = \sum_{i \ne 0} P_i = 1 - P_0 = 1 - e^{-y}$
For the Poisson distribution, $y = Kmn\,e^{-\lambda x}$. Therefore, $P(S > x) = 1 - \exp(-Kmn\,e^{-\lambda x})$
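The formulas on this slide translate directly into code. The following is a minimal sketch; the function names are hypothetical, and the values of $K$, $\lambda$, $m$, $n$, and $x$ in the demo are placeholders chosen for illustration, not values from the lecture.

```python
import math


def expected_hits(x, m, n, k, lam):
    """y = K * m * n * exp(-lambda * x): average number of scores >= x."""
    return k * m * n * math.exp(-lam * x)


def poisson_pmf(count, y):
    """P_n = exp(-y) * y**n / n! for a Poisson distribution with mean y."""
    return math.exp(-y) * y ** count / math.factorial(count)


def p_value(x, m, n, k, lam):
    """P(S > x) = 1 - exp(-y), the chance of at least one such score.

    expm1 computes exp(v) - 1 accurately when v is close to zero.
    """
    return -math.expm1(-expected_hits(x, m, n, k, lam))


if __name__ == "__main__":
    # Placeholder parameters, for illustration only.
    m, n, k, lam, x = 1000, 1000, 0.1, 0.3, 50
    print("y        =", expected_hits(x, m, n, k, lam))
    print("P(S > x) =", p_value(x, m, n, k, lam))
```

When $y$ is small, $1 - e^{-y} \approx y$, so the probability and the expected number of hits are nearly equal in that regime.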

Slide 7: Unit Distributions
Normalize Gaussian and EVD
– Area under curve = 1
– Curve maximum at 0
Then
– For Gaussian: Mean = 0; SD = 1
– For EVD: Mean = $\gamma$ (Euler's constant, approximately 0.5772); Variance = $\pi^2/6 \approx 1.645$
   Unit EVD: $P(S > x) = 1 - \exp(-e^{-x})$
   With scale $\lambda$ and location $u$: $P(S > x) = 1 - \exp(-e^{-\lambda(x-u)})$
– Z-score representation in terms of SDs: $P(Z > z) = 1 - \exp(-e^{-(\pi/\sqrt{6})\,z - \gamma})$
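The tail probabilities for the unit EVD and its Z-score form can be sketched as below. The Z-score version follows from substituting $x = (\pi/\sqrt{6})\,z + \gamma$ into the unit EVD tail. The function names are hypothetical, and the Gaussian tail is included only for comparison.

```python
import math

EULER_GAMMA = 0.5772156649015329      # mean of the unit EVD
UNIT_EVD_SD = math.pi / math.sqrt(6)  # standard deviation of the unit EVD


def unit_evd_tail(x):
    """P(S > x) = 1 - exp(-exp(-x)) for the unit EVD (mode at 0)."""
    return -math.expm1(-math.exp(-x))


def evd_tail(x, lam, u):
    """P(S > x) for an EVD with scale lambda and location u."""
    return unit_evd_tail(lam * (x - u))


def evd_z_tail(z):
    """P(Z > z), where z counts standard deviations above the EVD mean."""
    return unit_evd_tail(UNIT_EVD_SD * z + EULER_GAMMA)


def gaussian_tail(z):
    """P(Z > z) for a standard normal, for comparison."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    for z in (1, 3, 5):
        print(z, evd_z_tail(z), gaussian_tail(z))
```

At large $z$ the EVD tail is much heavier than the Gaussian tail, which is why treating alignment scores as Gaussian tends to overstate significance.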

Slide 8: Lambda and K
$\lambda$ = Scale factor for scoring system
– Effectively converts AS matrix values to actual natural log likelihoods
$K$ = Scale factor that reduces search space to compensate for non-independence of local alignments
Estimated by fitting to the Poisson approximation or to the equation for $E(S)$
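As one illustration of estimating $\lambda$ and $K$ from scores of unrelated sequences, the sketch below uses a method-of-moments fit to the EVD. This is a swapped-in technique chosen for brevity, not necessarily the fitting procedure the lecture has in mind. It assumes the EVD location is $u = \ln(Kmn)/\lambda$, with standard deviation $\pi/(\lambda\sqrt{6})$ and mean $u + \gamma/\lambda$. The function names and the synthetic parameters are hypothetical.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_lambda_k(scores, m, n):
    """Method-of-moments estimate of (lambda, K) from null scores.

    Assumes scores follow an EVD with location u = ln(K*m*n) / lambda.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    u = mean - EULER_GAMMA / lam
    k = math.exp(lam * u) / (m * n)
    return lam, k


def sample_evd_scores(count, m, n, lam, k, seed=0):
    """Draw synthetic EVD scores by inverting the EVD cumulative function."""
    rng = random.Random(seed)
    u = math.log(k * m * n) / lam
    scores = []
    for _ in range(count):
        r = max(rng.random(), 1e-300)  # guard against log(0)
        scores.append(u - math.log(-math.log(r)) / lam)
    return scores


if __name__ == "__main__":
    # Synthetic check with placeholder parameters, for illustration only.
    scores = sample_evd_scores(20000, 1000, 1000, lam=0.3, k=0.1)
    print(fit_lambda_k(scores, 1000, 1000))
```

Because $K$ depends exponentially on $\lambda u$, a small error in $\lambda$ becomes a much larger relative error in $K$, so estimates of $K$ from a fit like this should be read as rough.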

Slide 9: Treasure Trove of Probabilities
Probability distribution of scores between unrelated sequences, $P(S_{\mathrm{unrel}})$
Probability distribution of the number of scores from $P(S_{\mathrm{unrel}})$ exceeding some cut-off; its mean is the number of scores exceeding the cut-off observed on average
Probability of observing a score of at least $x$ between unrelated sequences, $P(S \ge x)$

Slide 10: Loose Ends
What about gap parameters?
– Short answer: No formal theory
– Long answer: Found empirically
Choice of parameters can be used to convert a local alignment algorithm into a global alignment
What about gapped alignment?
– Not formally proven, but simulations show statistical behavior similar to ungapped alignment
Effective sequence length: $n' = n - E(\mathrm{matchLength}_{\max})$