

Searching for applications of EVT in biology Adam Butler, Biomathematics & Statistics Scotland UK extremes, April 2007 Acknowledgements: Len Thomas, Clive Anderson, Dirk Husmeier

Overview

Biologists are frequently interested in the properties of extreme or rare events (e.g. extinction, long-range dispersal, genetic mutation), but EVT is not widely known or used in many branches of biology.

Some possible reasons:
- Biological sciences have tended to be data-poor, relative to e.g. hydrology
- Focus on testing of scientific hypotheses rather than on risk assessment
- Difficulty in deriving a meaningful quantitative definition of an extreme event

Opportunities arise from the large datasets now available in modern biology (e.g. genetics, ecological modelling), and from an increasing focus on quantitative risk assessment.

Genetics

“…a sequence alignment is a way of arranging the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences…” (Wikipedia)

EVT has been used to assess the significance of sequence alignments since the early 1990s (Karlin et al., 1990; Mott, 1992; Mott & Tribe, 1999), and is now embedded within standard software (BLAST, FASTA).

The basic idea is to compare the target sequence with a (very) large database of known sequences, by:
1) defining a similarity score
2) using a fast algorithm to search for the best match(es) within the database
3) using EVT to evaluate the statistical significance of this match

Theoretical arguments are used to justify the use of a Gumbel model for the best score.

Current interest is in the alignment of multiple sequences (Frommlet & Futschik, 2004; Wang & Sen, 2006), and this requires the use of multivariate extreme value methods.
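As a rough illustration of step 3, the sketch below evaluates the significance of a best match using the Karlin-Altschul form of the Gumbel approximation, E = K·m·n·exp(-λS) and P = 1 - exp(-E). The function names and the values of K and λ are hypothetical, chosen only for illustration; in practice these parameters depend on the scoring scheme and are estimated by the search software.

```python
import math


def karlin_altschul_evalue(score, query_len, db_len, K, lam):
    """Expected number of chance matches scoring at least `score`."""
    return K * query_len * db_len * math.exp(-lam * score)


def best_score_pvalue(score, query_len, db_len, K, lam):
    """P(best score >= score) under the Gumbel approximation, 1 - exp(-E)."""
    e_value = karlin_altschul_evalue(score, query_len, db_len, K, lam)
    # expm1 keeps precision when the E-value is very small
    return -math.expm1(-e_value)


# Hypothetical parameter values, for illustration only
K, LAM = 0.1, 0.3
QUERY_LEN, DB_LEN = 300, 1_000_000

for s in (40, 60, 80):
    print(s, best_score_pvalue(s, QUERY_LEN, DB_LEN, K, LAM))
```

For small E-values the p-value and the E-value are nearly equal, which is why search tools often report only the E-value.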

Ecology

Review papers: Gaines & Denny (1993), Katz et al. (2005)

Disturbance: Study the extremes of environmental processes that are known to lead to ecological disturbance: sediment rates, fire sizes, frost days

Longevity & survival: Study the maximum lifespan or size of an individual

Population dynamics: Evaluate the probability of extinction or explosion of a population

Dispersal & spread: Spatial spread (of diseases, pollen, invasive species, native species responding to climate change) is known to be influenced by long-range dispersal events: use EVT to analyse dispersal data? Issues: spatial structure; censoring and/or non-reporting; mixtures

Ecological modelling: Study the properties of extreme events simulated by complex process-based ecological models (e.g. mass extinction events)
- Deterministic models: find the region of the parameter space associated with the process exceeding a particular level
- Stochastic models: calculate the probability of the process exceeding a threshold for a given parameter set
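For the stochastic case, a minimal sketch of the calculation is a plain Monte Carlo estimate of the exceedance probability for one parameter set. The noisy Ricker population model used here is a hypothetical stand-in for a process-based ecological model, and all parameter names and values are assumptions made for illustration.

```python
import math
import random


def simulate_peak(growth_rate, noise_sd, n_steps, rng, n0=10.0, capacity=100.0):
    """Peak population size along one trajectory of a noisy Ricker model."""
    n = n0
    peak = n
    for _ in range(n_steps):
        # Density-dependent growth with lognormal environmental noise
        n = n * math.exp(growth_rate * (1.0 - n / capacity) + rng.gauss(0.0, noise_sd))
        peak = max(peak, n)
    return peak


def exceedance_probability(threshold, growth_rate, noise_sd,
                           n_runs=2000, n_steps=50, seed=0):
    """Monte Carlo estimate of P(peak > threshold) for one parameter set."""
    rng = random.Random(seed)
    hits = sum(
        simulate_peak(growth_rate, noise_sd, n_steps, rng) > threshold
        for _ in range(n_runs)
    )
    return hits / n_runs


print(exceedance_probability(150.0, growth_rate=0.5, noise_sd=0.2))
```

A direct estimate like this becomes unreliable once the threshold is so high that few or no runs exceed it, which is the situation where fitting a tail model to the simulated exceedances is more useful.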

EVT for complex stochastic models: some vague ideas

Y(θ) ~ CSM(θ), likelihood of CSM intractable, θ high dimensional

Possible approach if simulation is quick & we have real data x…?

EVT + ABC:
1. generate a value from the prior, θ ~ π
2. use the model to simulate a dataset y(θ) ~ CSM(θ)
3. fit y(θ) | {y(θ) > u} ~ GPD to estimate P(Y(θ) > v), for v >> u
4. accept θ if P(Y(θ) > v) lies within a 95% confidence interval about P(X > v), else reject

Or perhaps could use ABC-MCMC on (θ, v) with pseudo-prior on v
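The four steps can be sketched as a simple rejection sampler. This is only one possible reading of the idea: the GPD is fitted by the method of moments (the slide does not say how to fit it), the interval for P(X > v) is supplied by the caller rather than computed from real data, and the exponential "model" and uniform prior are hypothetical stand-ins for a CSM and its prior.

```python
import math
import random
from statistics import mean, variance


def gpd_tail_prob(sample, u, v):
    """Estimate P(Y > v), v > u, from a GPD fitted to the excesses over u.

    Shape and scale come from method-of-moments estimates (an assumption).
    """
    excesses = [y - u for y in sample if y > u]
    if len(excesses) < 2:
        return 0.0
    m, s2 = mean(excesses), variance(excesses)
    if s2 == 0:
        return 0.0  # degenerate fit: no spread in the excesses
    ratio = m * m / s2
    xi = 0.5 * (1.0 - ratio)           # GPD shape
    sigma = 0.5 * m * (ratio + 1.0)    # GPD scale
    p_u = len(excesses) / len(sample)  # empirical P(Y > u)
    z = (v - u) / sigma
    if abs(xi) < 1e-9:
        tail = math.exp(-z)
    else:
        base = 1.0 + xi * z
        tail = base ** (-1.0 / xi) if base > 0 else 0.0
    return p_u * tail


def abc_evt(prior_sampler, simulator, u, v, interval, n_draws, rng):
    """Keep prior draws whose fitted tail probability falls inside `interval`."""
    lo, hi = interval
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)        # step 1: draw from the prior
        y = simulator(theta, rng)         # step 2: simulate a dataset
        p = gpd_tail_prob(y, u, v)        # step 3: GPD-based tail estimate
        if lo <= p <= hi:                 # step 4: accept or reject
            accepted.append(theta)
    return accepted


# Hypothetical stand-ins for the prior and the complex stochastic model
def toy_prior(rng):
    return rng.uniform(0.5, 3.0)


def toy_simulator(theta, rng, n=2000):
    return [rng.expovariate(theta) for _ in range(n)]


rng = random.Random(1)
posterior_draws = abc_evt(toy_prior, toy_simulator, u=1.0, v=4.0,
                          interval=(0.01, 0.03), n_draws=300, rng=rng)
print(len(posterior_draws), min(posterior_draws), max(posterior_draws))
```

In this toy setting the true tail probability is exp(-4θ), so the accepted values should concentrate around θ ≈ 1, where that probability falls inside the supplied interval.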

Y(θ) ~ CSM(θ), likelihood of CSM intractable, θ high dimensional

Possible approach if simulation is slow & we do not have data…?

EVT + GP:
- Run CSM for a relatively small set of parameter values θ
- Assume y(θ) | {y(θ) > u} ~ GPD(φ(θ))
- Assume φ(θ) ~ N(μ, Σ)
- Impose structure on Σ & fit by hierarchical Bayes

Could be used to draw inferences about P(Y(θ) > v) for v >> u, even if we have not simulated from CSM(θ)
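A much simplified, plug-in version of this idea can be sketched as follows. Instead of a hierarchical Bayes fit, it takes GPD parameter estimates already obtained at a few design points and interpolates one of them (the log scale) with the posterior mean of a Gaussian process, where a squared-exponential kernel supplies the structure on Σ. The kernel choice, its hyperparameters, the design points and the "fitted" values are all assumptions made for illustration.

```python
import math


def sq_exp_kernel(a, b, length_scale, variance):
    """Squared-exponential covariance between two scalar parameter values."""
    return variance * math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def gp_predict(thetas, values, theta_new,
               length_scale=1.0, variance=1.0, nugget=1e-6):
    """GP posterior mean at theta_new, with a constant prior mean."""
    mu = sum(values) / len(values)
    K = [[sq_exp_kernel(a, b, length_scale, variance) + (nugget if i == j else 0.0)
          for j, b in enumerate(thetas)]
         for i, a in enumerate(thetas)]
    alpha = solve(K, [val - mu for val in values])
    k_star = [sq_exp_kernel(theta_new, a, length_scale, variance) for a in thetas]
    return mu + sum(k * a for k, a in zip(k_star, alpha))


# Hypothetical design points and fitted log GPD scales (here log(1/theta))
design_thetas = [0.5, 1.0, 1.5, 2.0, 2.5]
log_scales = [-math.log(t) for t in design_thetas]

# Predict the GPD scale at a parameter value that was never simulated
theta_new = 1.25
predicted_scale = math.exp(gp_predict(design_thetas, log_scales, theta_new))
print(predicted_scale)
```

The predicted parameters could then be plugged into the GPD tail formula to approximate P(Y(θ) > v) at the new θ, although this ignores the uncertainty that a full hierarchical Bayes fit would propagate.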

Some references

Karlin, S., Dembo, A. and Kawabata, T. (1990) Statistical composition of high-scoring segments from molecular sequences. The Annals of Statistics, 18.
Mott, R. (1992) Maximum-likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bulletin of Mathematical Biology, 54.
Gaines, S.D. and Denny, M.W. (1993) The largest, smallest, highest, lowest, longest, and shortest: extremes in ecology. Ecology, 74.
Mott, R. and Tribe, R. (1999) Approximate statistics of gapped alignments. Journal of Computational Biology, 6(1).
Frommlet, F. and Futschik, A. (2004) On the dependence structure of sequence alignment scores calculated with multiple scoring matrices. Statistical Applications in Genetics and Molecular Biology, 3, article 24.
Katz, R.W., Brush, G.S. and Parlange, M.B. (2005) Statistics of extremes: modeling ecological disturbances. Ecology, 86.
Wang, L. and Sen, P.K. (2006) Extreme value theory in some statistical analysis of genomic sequences. Extremes, 8.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. Journal of Molecular Biology, 215.
Karlin, S. and Altschul, S.F. (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences USA, 87(6).
Karlin, S. and Altschul, S.F. (1993) Applications and statistics for multiple high-scoring segments in molecular sequences. Proceedings of the National Academy of Sciences USA, 90(12).
Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25.
Pearson, W.R. (1998) Empirical statistical estimates for sequence similarity searches. Journal of Molecular Biology, 276.
Mott, R. (2000) Accurate formula for P-values of gapped local sequence and profile alignments. Journal of Molecular Biology, 300(3).
Altschul, S.F., Bundschuh, R., Olsen, R. and Hwa, T. (2001) The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Research, 29(2).
Ewens, W.J. and Grant, G.R. (2001) Statistical Methods in Bioinformatics: An Introduction. Springer-Verlag (Statistics for Biology and Health).
Park, Y. and Spouge, J.L. (2002) The correlation error and finite-size correction in an ungapped sequence alignment. Bioinformatics, 18(9).