Statistics in Bioinformatics May 12, 2005 Quiz 3-on May 12 Learning objectives-Understand equally likely outcomes, counting techniques (Example, genetic.

Presentation transcript:

Statistics in Bioinformatics. May 12, 2005. Quiz 3 on May 12. Learning objectives: understand equally likely outcomes, counting techniques (examples: genetic code, PCR primer), and the E value. Homework 10 due May 14, 2005.

Statistical analysis of results underlies bioinformatics. When you run a program, the computer will always give an answer. The bioinformaticist analyzes the data from two points of view: (1) statistical and (2) biological. Assessment through these two filters determines whether the result is reasonable.

Two big questions you need to ask yourself: 1. Does the result fit with what is currently known about biology (protein structure, evolution, function, etc.)? 2. Could the results have been obtained by random chance? Part of the answer comes from scientific intuition, and part comes from statistics.

Types of statistics typically used in bioinformatics. Yes: likelihood methods. No: ANOVA, regression analysis, hypothesis testing. When one performs a sequence comparison search, one must ask how likely it is that a match would be obtained by random chance. This depends on the sequence being searched for and on the amount of data in the database being mined.

Equally likely outcomes. The sample space S is the set of all possible outcomes. Assumption: all outcomes are equally likely. Then, for any event A (a set of outcomes), $P(A) = \frac{\text{number of elements in } A}{\text{number of elements in } S} = \frac{|A|}{|S|}$. For an experiment consisting of k parts, the i-th of which can have $n_i$ outcomes, $|S| = n_1 n_2 \cdots n_k$.

Multiplication rule. Familiar example: the genetic code. Given that there are 4 nucleotides (A, T, G, C), how many different triplet codons are possible? This is the same as taking 4 items 3 at a time with repetition. The number of ways to take n things k at a time with repetition is $n^k$. Each of the 3 codon positions has 4 choices. Answer: $4^3 = 64$.
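The codon count can be checked by enumerating the sample space directly. The following is a minimal sketch, not part of the original slides; the event "codon starts with C" is an illustrative choice used only to show P(A) = |A| / |S| under the equally-likely assumption.

```python
from fractions import Fraction
from itertools import product

NUCLEOTIDES = "ATGC"

# Sample space S: every ordered triplet of nucleotides, repetition allowed.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

# Multiplication rule: |S| = n^k with n = 4 nucleotides and k = 3 positions.
sample_space_size = len(codons)

# Illustrative event A (an assumption for this sketch): codons starting with C.
event = [codon for codon in codons if codon.startswith("C")]

# Equally likely outcomes: P(A) = |A| / |S|.
p_event = Fraction(len(event), sample_space_size)

print(sample_space_size)  # 64
print(p_event)            # 1/4
```

Using `Fraction` keeps the probability exact, which makes it easy to compare against the hand calculation.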

Multiplication rule. Second example: PCR primer design. How many different PCR primers 16 nucleotides in length are possible? This is the same as taking 4 items 16 at a time with repetition. The number of ways to take n things k at a time with repetition is $n^k$. Each of the 16 positions has 4 choices. Answer: $4^{16} \approx 4.29 \times 10^9$. Because the human genome contains about $3 \times 10^9$ bases, any 16-mer pattern can be expected to appear approximately once in the human genome by chance alone.
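The "approximately once" estimate can be reproduced with a few lines of arithmetic. This is a rough sketch under simplifying assumptions that the slides do not spell out: bases are independent and equally likely, only one strand is counted, and the genome length is taken as 3 x 10^9. The function names are hypothetical.

```python
def kmer_count(k, alphabet_size=4):
    """Number of distinct sequences of length k (n things taken k at a time)."""
    return alphabet_size ** k


def expected_occurrences(k, genome_length, alphabet_size=4):
    """Expected chance occurrences of one fixed k-mer on a single strand.

    Assumes independent, equally likely bases (a simplification).
    """
    start_positions = genome_length - k + 1
    return start_positions / kmer_count(k, alphabet_size)


GENOME_LENGTH = 3_000_000_000  # approximate human genome size from the slide

print(kmer_count(16))                               # 4294967296
print(expected_occurrences(16, GENOME_LENGTH))      # roughly 0.7
```

Under these assumptions the expectation comes out a little below one occurrence, which is consistent with the slide's "approximately once".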

One may convert the previous calculations to probabilities. What is the probability that the codon CCC will occur, assuming all codons are represented equally? $1/64 \approx 0.0156$

What is the probability that the sequence ATAGCGTACTGCATCA will occur, given equal probability of nucleotides at each position? $(1/4)^{16} \approx 2.33 \times 10^{-10}$

Restriction enzymes. What is the probability of finding an EcoRI site in a six-nucleotide sequence, assuming equal representation of all nucleotides? The recognition sequence is GAATTC: $(1/4)^6 \approx 2.44 \times 10^{-4}$
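The three probability questions above (CCC, the 16-mer, and the EcoRI site) are the same calculation: multiply the per-position probabilities. A minimal sketch follows; the function name and the optional `base_probs` parameter are hypothetical additions, and the default assumes equal, independent nucleotide frequencies as in the slides.

```python
def sequence_probability(seq, base_probs=None):
    """Probability of observing seq at a fixed position.

    Assumes positions are independent. By default every base has
    probability 1/4, matching the equal-representation assumption.
    """
    if base_probs is None:
        base_probs = {base: 0.25 for base in "ACGT"}
    probability = 1.0
    for base in seq.upper():
        probability *= base_probs[base]
    return probability


for seq in ("CCC", "GAATTC", "ATAGCGTACTGCATCA"):
    print(f"{seq}: {sequence_probability(seq):.3g}")
```

Passing a different `base_probs` dictionary shows how quickly these numbers change for genomes with skewed base composition, where the equal-frequency assumption does not hold.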

The E value (false positive expectation value). The Expect value (E) is a parameter that describes the number of "hits" one can expect to see just by chance when searching a database of a particular size. It decreases exponentially as the similarity score (S) increases: the higher the similarity score, the lower the E value. Essentially, the E value describes the random background noise that exists for matches between two sequences. The E value is used as a convenient way to create a significance threshold for reporting results. When the E value threshold is raised above the default value of 10 before a sequence search, a longer list that includes more low-scoring hits can be reported. An E value of 1 assigned to a hit means that, in a database of the current size, you might expect to see 1 match with a similar score simply by chance.

E value: $E = K m n e^{-\lambda S}$, where K is a constant, m is the length of the query sequence, n is the length of the database sequence, λ is the decay constant, and S is the similarity score. If S increases, E decreases exponentially. If the decay constant increases, E also decreases exponentially. If mn increases, the "search space" grows and there is a greater chance of a random "hit", so E increases. A larger database therefore increases E.
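The behaviour of the E value formula can be explored numerically. The sketch below is illustrative only: the values of K and λ are placeholder assumptions (in practice they depend on the scoring matrix and gap penalties), and the query and database lengths are made up for the example.

```python
import math


def expect_value(score, query_len, db_len, K=0.13, lam=0.32):
    """E = K * m * n * exp(-lambda * S).

    K and lam are illustrative placeholder values, not taken from the slides.
    """
    return K * query_len * db_len * math.exp(-lam * score)


QUERY_LEN = 250            # hypothetical query length (m)
DB_LEN = 1_000_000_000     # hypothetical database length (n)

# E falls exponentially as the similarity score S rises.
for score in (40, 60, 80):
    print(score, expect_value(score, QUERY_LEN, DB_LEN))

# Doubling the database doubles E for the same score.
print(expect_value(60, QUERY_LEN, 2 * DB_LEN) / expect_value(60, QUERY_LEN, DB_LEN))
```

Because E scales linearly with mn, the same alignment score becomes less significant as the database grows, which is why E values from searches against different databases are not directly comparable.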