Hidden Markov Modeling, Multiple Alignments and Structure Bioinformatic Modeling Techniques Student: Patricia Pearl.



The basic notion of a hidden Markov model was covered during the class lectures and in our midterm. Tonight we'll discuss further issues about its history, its development, and its future.

There was a time when scientists started to think about using hidden Markov models for multiple protein alignments. When was that? Which professional field was using it already?

This is the bibliographic reference for the article that protein scientists used when they got started: Rabiner, L. R. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77 (2).

This work was sophisticated, and a group of scientists at the University of California at Santa Cruz was able to draw an analogy between computer speech recognition and protein multiple alignments.

How did they make the analogy between speech recognition and multiple protein and DNA alignments?

              Speech Recognition               Multiple Alignments
Alphabet      phonemes                         amino acids
Observation   words or strings of phonemes     primary sequence
Good model    assigns high probability to      assigns high probability to
              sounds that are real words       sequences in the set
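The "good model" row of the analogy can be made concrete with a small sketch. The forward algorithm below computes the probability a hidden Markov model assigns to an observed sequence by summing over all hidden state paths. The two states ("core", "loop"), the two-letter alphabet, and every probability value are hypothetical toy numbers chosen for illustration; they do not come from SAM, HMMER, or the Krogh et al. paper.

```python
def forward_probability(sequence, states, start_p, trans_p, emit_p):
    """Return P(sequence | model), summed over all hidden state paths."""
    # alpha[s] = probability of the symbols seen so far, ending in state s
    alpha = {s: start_p[s] * emit_p[s][sequence[0]] for s in states}
    for symbol in sequence[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[r] * trans_p[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: "core" prefers leucine (L), "loop" prefers glycine (G).
STATES = ("core", "loop")
START = {"core": 0.5, "loop": 0.5}
TRANS = {
    "core": {"core": 0.9, "loop": 0.1},
    "loop": {"core": 0.1, "loop": 0.9},
}
EMIT = {
    "core": {"L": 0.9, "G": 0.1},
    "loop": {"L": 0.1, "G": 0.9},
}

if __name__ == "__main__":
    for seq in ("LL", "LG"):
        print(seq, forward_probability(seq, STATES, START, TRANS, EMIT))
```

Under these toy numbers the model scores "LL" higher than "LG", because both states tend to persist and each strongly prefers one letter. In the same way, a model trained on a protein family should score family members higher than unrelated sequences.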

The paper they published is: Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. “Hidden Markov Models in Computational Biology: Applications to Protein Modeling.” Journal of Molecular Biology, 1994, 235.

Sean Eddy was a student at UCSC then. In one of his articles (1996), he describes the paper referenced above as: “The paper that introduced the use of HMM methods for protein and DNA sequence profiles.”

The software was then developed separately by two groups of scientists and graduate students: one at the University of California at Santa Cruz, and one at Washington University in St. Louis, Missouri, led by UCSC’s former student Sean Eddy. Many other researchers in the subject work outside these two labs.

Two suites of software have been developed, and their differences are non-trivial:
SAM (Sequence Alignment and Modeling System) at UCSC.
HMMER at Washington University.

Both suites can be downloaded. SAM needs UNIX. HMMER runs on many systems.

As has been emphasized in lecture, the advantage of the HMM approach is that it does not guess about gap penalties, amino acid preferences, or state transitions. It bases those values on actual data, as Bayesian probabilities estimated from observed sequences.

SAM (Sequence Alignment and Modeling System) at UCSC is based on HMMs. It also uses a mathematical approach called Dirichlet mixtures to improve detection of weak homologies and to derive hidden Markov models for protein families.
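To illustrate what "basing the values on actual data" can look like, here is a minimal sketch that estimates emission probabilities for one alignment column from observed residue counts. It assumes a single uniform pseudocount per amino acid, which is a much simpler prior than the Dirichlet mixtures SAM uses. The function name `column_emission_probs` and the example column are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_emission_probs(column, pseudocount=1.0):
    """Estimate emission probabilities for one alignment column.

    Gap characters and any other non-amino-acid symbols are skipped.
    The uniform pseudocount keeps unseen residues from getting probability 0.
    """
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Hypothetical column: three leucines, one valine, one gap.
    probs = column_emission_probs("LLLV-")
    print(round(probs["L"], 4), round(probs["V"], 4), round(probs["A"], 4))
```

With few sequences the pseudocounts dominate, and with many sequences the observed counts dominate. A Dirichlet mixture refines this by replacing the single uniform prior with several priors, each reflecting a typical kind of column (for example hydrophobic or charged).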

HMMER at Washington University in St. Louis: Sean Eddy’s lab home page. This page and related pages have many articles that are available to download, along with the User’s Guide in HTML. If we had HMMER installed for us at Brandeis, we could all use it with the help of this manual.

HMMER: One of the approaches that Sean Eddy has taken to improve HMMER is simulated annealing, a technique borrowed from computational physical chemistry and x-ray protein crystallography. The probability values of the fundamental recursive HMM algorithm are varied by an exponential factor taken from the Boltzmann formula for physical entropy, $S = k_B \ln \Omega$. The Boltzmann constant, $k_B$, is multiplied by $T$, for temperature. The run starts at a high temperature, which is then decreased. The product $kT$ is used in an exponent: $P^{1/kT}$. Eddy reports that it improves accuracy (Eddy, S., 1995).
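The effect of the exponent $1/kT$ can be seen in a small sketch. It raises each probability in a discrete distribution to the power $1/kT$ and renormalizes. This illustrates the tempering idea only and is not HMMER's actual implementation; the function name `temper` and the example numbers are hypothetical.

```python
def temper(probs, kT):
    """Raise each probability to the power 1/kT, then renormalize to sum to 1."""
    powered = [p ** (1.0 / kT) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]


if __name__ == "__main__":
    choices = [0.8, 0.2]  # hypothetical probabilities of two alternatives
    for kT in (2.0, 1.0, 0.5):  # hot, neutral, cold
        print(kT, [round(p, 4) for p in temper(choices, kT)])
```

At high $kT$ the distribution flattens, so unlikely alternatives are sampled more often and the search can escape poor local optima. At $kT = 1$ the distribution is unchanged, and as $kT$ drops below 1 the most probable choice increasingly dominates.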

Many people are extending the HMM approach to RNA sequences. It is worth briefly describing a recent paper that makes extensive use of RNA alignments done primarily by hand, using both primary sequence and RNA secondary structure. It produces evidence toward resolving a problem in systematic (evolutionary) biology. With HMMER, or similar software, applied to RNA alignments, much of this work may become easier in the future and yield measurable probabilistic statistics.

“However, accurate alignment is only possible for proteins of known structure – at least for an identifiable core of residues that comprises the secondary structure elements and active site of the molecule.” S. Eddy(1995) quoting Chothia and Lesk(1986)

[Figure: two alternative trees relating crocodile, bird, and mammal to a common ancestor. Panel labels: "Anatomical evidence and more" OR "rRNA multiple alignments without secondary structure".]

     |----|----|----|----|----|----|----|
Seq1 A-CC-----GC GA--CUUG--GA-CC-CG--G
Seq2 A-CC-----GU GA--CUUG--GA-CC-CG--G
Seq3 AACCCCGGUGUAGGGGGAAGAACCUUGAUGAACCUCGAUG
Seq4 AACCCCGGUGCAGGGGGAAGAACCUUCAUGAACCUCGAUG

Figure 1. The problem of aligning short and long sequences. Sequences 1 and 2 are like the reptilian and bird 18S ribosomal RNA. Sequences 3 and 4 are like mammals.

Reference: Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod phylogeny.” Systematic Biology. Washington: June 2003, Vol. 52, Iss. 3; pg 283.

Phylogenetic tree. From: Xia et al., 2003

They produced several phylogenetic trees, using different methods, with the careful manual alignments that took secondary structure into account. In all of them, the birds are closer to the crocodiles than to the mammals.

“Our research indicates that the previous discrepancy of phylogenetic results between the 18S rRNA gene and other genes is caused mainly by:
1.) misalignment of sequences
2.) the inappropriate use of the frequency parameters
3.) poor sequence quality.
When the sequences are aligned with the aid of the secondary structure of the 18S rRNA molecule and when the frequency parameters are estimated either from all sites or from the variable domains where substitutions have occurred, the 18S rRNA sequences no longer support the grouping of the avian species with the mammalian species.” Xia, X., et al., 2003

If there were more time, this presentation would also include discussions of Psi Blast and of SuperFam. Psi Blast is a BLAST program at NCBI that builds position-specific scoring matrices (profiles) from multiple alignments, an approach closely related to profile HMMs. A tutorial: http:// tml. The site: http://

SuperFam is a relatively new website. It applies the HMM approach to 59 genomes, together with all the publicly available solved structures from those genomes. The head scientist of SuperFam, Prof. Cyrus Chothia, also supervised a web site called SCOP, or Structural Classification of Proteins. You might find it interesting that all of the protein structures that are “solved” are actually organized and classified.

Bibliography

Eddy, S.R. “Multiple alignment using hidden Markov models.” Proc. Int. Conf. Intell. Syst. Mol. Biol., 1995; 3.

Eddy, S.R. “Hidden Markov models.” Curr. Opin. Struct. Biol., 1996 Jun; 6(3). Review.

Eddy, S.R. “Profile hidden Markov models.” Bioinformatics, 1998; 14(9). Review.

Gough, J., and Chothia, C. “SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments.” Nucleic Acids Research, 2002, Vol. 30:1.

Krogh, A., Brown, M., Mian, I.S., Sjölander, K., and Haussler, D. “Hidden Markov models in computational biology: Applications to protein modeling.” Journal of Molecular Biology, 235, February 1994.

Rabiner, L. R. “A tutorial on hidden Markov models and selected applications in speech recognition.” Proceedings of the IEEE, 77 (2).

Xia, X., Xie, Z., Kjer, K.M. “18S ribosomal RNA and tetrapod phylogeny.” Systematic Biology. Washington: June 2003, Vol. 52, Iss. 3; pg 283.