

Bio 101: Genomics & Computational Biology
Tue Sep 18  Intro 1: Computing, statistics, Perl, Mathematica
Tue Sep 25  Intro 2: Biology, comparative genomics, models & evidence, applications
Tue Oct 02  DNA 1: Polymorphisms, populations, statistics, pharmacogenomics, databases
Tue Oct 09  DNA 2: Dynamic programming, Blast, multi-alignment, HiddenMarkovModels
Tue Oct 16  RNA 1: Microarrays, library sequencing & quantitation concepts
Tue Oct 23  RNA 2: Clustering by gene or condition, DNA/RNA motifs
Tue Oct 30  Protein 1: 3D structural genomics, homology, dynamics, function & drug design
Tue Nov 06  Protein 2: Mass spectrometry, modifications, quantitation of interactions
Tue Nov 13  Network 1: Metabolic kinetic & flux balance optimization methods
Tue Nov 20  Network 2: Molecular computing, self-assembly, genetic algorithms, neural-nets
Tue Nov 27  Network 3: Cellular, developmental, social, ecological & commercial models
Tue Dec 04  Project presentations
Tue Dec 11  Project presentations
Tue Jan 08  Project presentations
Tue Jan 15  Project presentations

DNA1: Last week's take-home lessons Types of mutants Mutation, drift, selection Binomial & exponential dx/dt = kx Association studies χ² statistic Linked & causative alleles Alleles, Haplotypes, genotypes Computing the first genome, the second... New technologies Random and systematic errors

DNA2: Today's story and goals
- Motivation and connection to DNA1
- Comparing types of alignments & algorithms
- Dynamic programming
- Multi-sequence alignment
- Space-time-accuracy tradeoffs
- Finding genes -- motif profiles
- Hidden Markov Model for CpG Islands

DNA 2 [Figure. Labels: Intro2: Common & simple; DNA1: the last 5000 generations]

Applications of Dynamic Programming
- To sequence analysis
  - Shotgun sequence assembly
  - Multiple alignments
  - Dispersed & tandem repeats
  - Bird song alignments
  - Gene Expression time-warping
- Through HMMs
  - RNA gene search & structure prediction
  - Distant protein homologies
  - Speech recognition

Alignments & Scores
Global (e.g. haplotype)
  ACCACACA
  ::xx::x:
  ACACCATA
  Score = 5(+1) + 3(-1) = 2
Suffix (shotgun assembly)
  ACCACACA
       :::
       ACACCATA
  Score = 3(+1) = 3
Local (motif)
    ACCACACA
    ::::
  ACACCATA
  Score = 4(+1) = 4
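The three scores on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the slide's +1 match / -1 mismatch scoring and no gaps; the helper name `gapless_score` and the slice positions are illustrative choices, not part of the lecture.

```python
def gapless_score(a, b, match=1, mismatch=-1):
    """Score two equal-length strings column by column, with no gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(match if x == y else mismatch for x, y in zip(a, b))

s1, s2 = "ACCACACA", "ACACCATA"

global_score = gapless_score(s1, s2)            # all 8 columns
suffix_score = gapless_score(s1[-3:], s2[:3])   # suffix ACA of s1 vs prefix ACA of s2
local_score = gapless_score(s1[0:4], s2[2:6])   # shared motif ACCA

print(global_score, suffix_score, local_score)  # 2 3 4
```

The same pair of sequences gives a different score depending on which region is required to align, which is the point of the global / suffix / local distinction.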

Increasingly complex (accurate) searches
Exact (StringSearch)                 CGCG
Regular expression (PrositeSearch)   CGN{0-9}CG = CGAACG
Substitution matrix (BlastN)         CGCG ~= CACG
Profile matrix (PSI-blast)           CGc(g/a) ~= CACG
Gaps (Gap-Blast)                     CGCG ~= CGAACG
Dynamic Programming (NW, SM)         CGCG ~= CAGACG
Hidden Markov Models (HMMER)
WU
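The first two rows of this ladder (exact search and regular expression) can be illustrated with Python's standard `re` module. This is only a sketch: the pattern below is an assumed translation of the slide's `CGN{0-9}CG` into Python regex syntax, with `N` read as any of A, C, G, T.

```python
import re

# CG, then 0 to 9 arbitrary bases, then CG
pattern = re.compile(r"CG[ACGT]{0,9}CG")

exact_hit = "CGCG" in "TTCGCGAA"                      # exact string search
regex_hit = pattern.fullmatch("CGAACG") is not None   # spacer of length 2 is allowed
regex_miss = pattern.search("CACG") is None           # a substitution is not tolerated

print(exact_hit, regex_hit, regex_miss)  # True True True
```

The regular expression tolerates a variable spacer but not a substitution such as CGCG vs CACG, which is what the substitution-matrix and later rows add.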

"Hardness" of (multi-) sequence alignment
Align 2 sequences of length N allowing gaps.
  ACCAC-ACA    ACCACACA
  ::x::x:x:    :xxxxxx:
  AC-ACCATA,   A-----CACCATA, etc.
2N gap positions, gap lengths of 0 to N each: a naïve algorithm might scale by O(N^(2N)). For N = 3x10^9 this is rather large. Now, what about k>2 sequences? Or rearrangements other than gaps?

Testing search & classification algorithms
Separate training set and testing sets.
Need databases of non-redundant sets.
Need evaluation criteria (programs):
  Sensitivity and specificity (false negatives & positives)
  sensitivity = true_predicted / true
  specificity = true_predicted / all_predicted
Where do training sets come from? More expensive experiments: crystallography, genetics, biochemistry.
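The two ratios can be computed directly from a set of predictions and a set of known true items. A minimal sketch using the slide's definitions (specificity here is true_predicted / all_predicted, as on the slide); the function name and the gene labels are made up for illustration.

```python
def sensitivity_specificity(predicted, true):
    """Return (sensitivity, specificity) as defined on the slide.

    Assumes both collections are non-empty.
    """
    predicted, true = set(predicted), set(true)
    true_predicted = len(predicted & true)
    sensitivity = true_predicted / len(true)        # fraction of true items found
    specificity = true_predicted / len(predicted)   # fraction of predictions that are true
    return sensitivity, specificity

sn, sp = sensitivity_specificity(
    predicted={"g1", "g2", "g3", "g7"},
    true={"g1", "g2", "g3", "g4", "g5"},
)
print(sn, sp)  # 0.6 0.75
```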

Comparisons of homology scores
Pearson WR
  Protein Sci 1995 Jun;4(6): Comparison of methods for searching protein sequence databases.
  Methods Enzymol 1996;266: Effective protein sequence comparison.
Algorithm: FASTA, Blastp, Blitz
Substitution matrix: PAM120, PAM250, BLOSUM50, BLOSUM62
Database: PIR, SWISS-PROT, GenPept

Switch to protein searches when possible
[Figure: adjacent mRNA codons aug and uuu (amino acids M and F) with anticodons 3' uac 5' and 3' aag]

A Multiple Alignment of Immunoglobulins

Scoring matrix based on large set of distantly related blocks: Blosum62

Scoring Functions and Alignments
- Scoring function (substitution matrix):
  - score(match) = +1
  - score(mismatch) = -1
  - score(indel) = -2
  - score(other) = 0
- Alignment score: sum of columns.
- Optimal alignment: maximum score.
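A minimal sketch of "alignment score = sum of columns" for an alignment that is already written out with gap characters. The reading of "other" as a column of two gaps is an assumption, as are the function names.

```python
def column_score(x, y, match=1, mismatch=-1, indel=-2):
    """Score one alignment column; '-' marks a gap."""
    if x == "-" and y == "-":
        return 0          # assumed meaning of "other"
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch

def alignment_score(row_a, row_b):
    """Sum the column scores of two equal-length alignment rows."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    return sum(column_score(x, y) for x, y in zip(row_a, row_b))

gapped = alignment_score("ACCAC-ACA", "AC-ACCATA")   # 6 matches, 1 mismatch, 2 indels
ungapped = alignment_score("ACCACACA", "ACACCATA")   # 5 matches, 3 mismatches
print(gapped, ungapped)  # 1 2
```

With an indel cost of -2 the gapped alignment of these two sequences scores lower than the ungapped one, even though it has more matching columns.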

Calculating Alignment Scores


What is dynamic programming? A dynamic programming algorithm solves every subsubproblem just once and then saves its answer in a table, avoiding the work of recomputing the answer every time the subsubproblem is encountered. -- Cormen et al. "Introduction to Algorithms", The MIT Press.

Recursion of Optimal Global Alignments
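The recursion shown on this slide is not preserved in the transcript. As a stand-in, here is a minimal sketch of the standard Needleman-Wunsch global recursion, assuming the +1 / -1 / -2 scoring function from the earlier slide and a linear gap cost; the function name is illustrative.

```python
def global_alignment_table(a, b, match=1, mismatch=-1, indel=-2):
    """Fill the full (len(a)+1) x (len(b)+1) table of optimal global scores."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * indel            # prefix of a aligned to nothing
    for j in range(1, m + 1):
        F[0][j] = j * indel            # prefix of b aligned to nothing
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          F[i - 1][j] + indel,     # gap in b
                          F[i][j - 1] + indel)     # gap in a
    return F

table = global_alignment_table("ACCACACA", "ACACCATA")
print(table[-1][-1])  # 2
```

Each cell is computed once from three already-filled neighbours, which is the table-filling idea in the Cormen et al. quote.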

Recursion of Optimal Local Alignments
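The local recursion is likewise missing from the transcript. A minimal sketch of the standard Smith-Waterman variant, under the same assumed scoring: every cell is floored at 0 (an alignment may start anywhere) and the answer is the best cell in the table (it may end anywhere).

```python
def local_score(a, b, match=1, mismatch=-1, indel=-2):
    """Best local alignment score between any substring of a and any substring of b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            sub = match if x == y else mismatch
            cell = max(0,                      # start a new local alignment here
                       prev[j - 1] + sub,
                       prev[j] + indel,
                       cur[j - 1] + indel)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best

print(local_score("ACCACACA", "ACACCATA"))  # 4
```

For the pair used on the "Alignments & Scores" slide this gives 4, the same value the slide reports for the local case.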

Computing Row-by-Row
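If only the optimal score is needed, each row depends only on the row above it, so two rows of storage are enough. A minimal sketch under the same assumed +1 / -1 / -2 scoring (the function name is illustrative):

```python
def global_score_two_rows(a, b, match=1, mismatch=-1, indel=-2):
    """Optimal global score using O(len(b)) memory instead of a full table."""
    prev = [j * indel for j in range(len(b) + 1)]   # row 0
    for i, x in enumerate(a, start=1):
        cur = [i * indel]                           # column 0 of row i
        for j, y in enumerate(b, start=1):
            sub = match if x == y else mismatch
            cur.append(max(prev[j - 1] + sub,
                           prev[j] + indel,
                           cur[j - 1] + indel))
        prev = cur                                  # the older row is discarded
    return prev[-1]

print(global_score_two_rows("ACCACACA", "ACACCATA"))  # 2
```

The trade-off is that the discarded rows are no longer available for a traceback, so this form returns the score but not the alignment.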

Traceback Optimal Global Alignment
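A traceback walks from the bottom-right cell back to the origin, at each step picking a neighbour that explains the current score. The sketch below assumes the same scoring as before and breaks ties in the order diagonal, up, left; other tie-breaking rules give other, equally optimal alignments.

```python
def global_align(a, b, match=1, mismatch=-1, indel=-2):
    """Return (score, row_a, row_b) for one optimal global alignment."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * indel
    for j in range(1, m + 1):
        F[0][j] = j * indel
    sub = lambda x, y: match if x == y else mismatch
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),
                          F[i - 1][j] + indel,
                          F[i][j - 1] + indel)
    # Traceback from (n, m) to (0, 0)
    i, j, row_a, row_b = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sub(a[i - 1], b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + indel:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))

score, top, bottom = global_align("ACGT", "AGT")
print(score)   # 1
print(top)     # ACGT
print(bottom)  # A-GT
```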

Local and Global Alignments

Time and Space Complexity of Computing Alignments

Space & Time Considerations
- Comparing two one-megabase genomes.
- Space:
  - An entry: 4 bytes
  - Table: 4 * 10^6 * 10^6 = 4 Terabytes memory (unless computed one row at a time)
- Time:
  - 1000 MHz CPU: 1M entries/second
  - 10^12 entries: 1M seconds, about 11.6 days (roughly 10 days)
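The arithmetic on this slide is easy to reproduce. The sketch below uses only the slide's own figures (4-byte entries, 10^6 entries per second); the variable names are illustrative.

```python
n = 10**6                     # length of each genome, in bases
bytes_per_entry = 4
entries_per_second = 10**6

entries = n * n                               # 10^12 table cells
table_bytes = entries * bytes_per_entry       # full table
one_row_bytes = (n + 1) * bytes_per_entry     # a single row of the table
seconds = entries / entries_per_second
days = seconds / 86_400

print(table_bytes / 1e12, "TB for the full table")
print(one_row_bytes / 1e6, "MB for a single row")
print(round(days, 1), "days")
```

Keeping a single row cuts memory from terabytes to megabytes, but the running time is unchanged because every cell is still computed.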

Time & Space Improvement for w-band Global Alignments
- Two sequences differ by at most w bps (w<<n).
- w-band algorithm: O(wn) time and space.
- Example: w=3.
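A minimal sketch of a banded global score, assuming the band is defined as |i - j| <= w around the main diagonal and the same +1 / -1 / -2 scoring as before. Cells outside the band are treated as unreachable. The function name and the dictionary-per-row layout are illustrative choices.

```python
def banded_global_score(a, b, w, match=1, mismatch=-1, indel=-2):
    """Best global score among alignments that stay within |i - j| <= w."""
    n, m = len(a), len(b)
    if abs(n - m) > w:
        raise ValueError("length difference exceeds the band width")
    NEG = float("-inf")
    prev = {j: j * indel for j in range(0, min(m, w) + 1)}    # row 0
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - w), min(m, i + w) + 1):     # at most 2w+1 cells
            if j == 0:
                cur[j] = i * indel
                continue
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev.get(j - 1, NEG) + sub,
                         prev.get(j, NEG) + indel,
                         cur.get(j - 1, NEG) + indel)
        prev = cur
    return prev[m]

print(banded_global_score("ACCACACA", "ACACCATA", w=3))  # 2
```

Each row touches at most 2w+1 cells, so the work is O(wn). The result equals the unrestricted optimum only when an optimal alignment actually lies inside the band.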

Summary
- Dynamic programming
- Statistical interpretation of alignments
- Computing optimal global alignment
- Computing optimal local alignment
- Time and space complexity
- Improvement of time and space
- Scoring functions


A Multiple Alignment of Immunoglobulins

A multiple alignment Dynamic programming on a hyperlattice From G. Fullen, 1996.

Multiple Alignment vs Pairwise Alignment
Optimal Multiple Alignment; Non-Optimal Pairwise Alignment

Computing a Node on Hyperlattice
[Figure: a node for three sequences, residues V, S, A]
k = 3, 2^k - 1 = 7

Challenges of Optimal Multiple Alignments
- Space complexity (hyperlattice size): O(n^k) for k sequences each n long.
- Computing a hyperlattice node: O(2^k).
- Time complexity: O(2^k n^k).
- Finding the optimal solution is exponential in k (non-polynomial, NP-hard).
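The 2^k - 1 predecessors of a hyperlattice node correspond to every way of letting each sequence either advance by one residue or take a gap, excluding the all-gap column. A small sketch that enumerates them and estimates the table size; the function names are illustrative and the node count uses (n+1)^k, which is O(n^k).

```python
from itertools import product

def predecessor_moves(k):
    """All non-zero 0/1 step vectors: 1 = consume a residue, 0 = gap."""
    return [move for move in product((0, 1), repeat=k) if any(move)]

def lattice_cost(n, k):
    """Return (number of nodes, number of predecessor look-ups)."""
    nodes = (n + 1) ** k
    work = nodes * (2 ** k - 1)
    return nodes, work

moves = predecessor_moves(3)
nodes, work = lattice_cost(n=100, k=3)
print(len(moves))    # 7
print(nodes, work)   # 1030301 7212107
```

Even for three short sequences the table has about a million nodes, which is why the next slide turns to pruning and heuristics.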

Methods and Heuristics for Optimal Multiple Alignments
- Optimal: dynamic programming
  - Pruning the hyperlattice (MSA)
- Heuristics:
  - tree alignments (ClustalW)
  - star alignments
  - sampling (Gibbs)
  - local profiling with iteration (PSI-Blast, ...)

ClustalW: Progressive Multiple Alignment All Pairwise Alignments Cluster Analysis Similarity Matrix Dendrogram From Higgins(1991) and Thompson(1994).

Star Alignments Pairwise Alignment Find the Central Sequence s 1 Pairwise Alignment Multiple Alignment Combine into Multiple Alignment


Accurately finding genes & their edges
What is distinctive?             Failure to find edges?
0. Promoters & CG islands        Variety & combinations
1. Preferred codons              Tiny proteins (& RNAs)
2. RNA splice signals            Alternatives & weak motifs
3. Frame across splices          Alternatives
4. Inter-species conservation    Gene too close or distant
5. cDNA for splice edges         Rare transcript

Annotated "Protein" Sizes in Yeast & Mycoplasma
[Plot: x = "Protein" size in #aa; y = % of proteins at length x; series label: Yeast]

Predicting small proteins (ORFs)
[Plot labels: min, max, Yeast]

Small coding regions
Mutations in domain II of 23 S rRNA facilitate translation of a 23 S rRNA-encoded pentapeptide conferring erythromycin resistance. Dam et al J Mol Biol 259:1-6
Trp (W) leader peptide, 14 codons: MKAIFVLKGWWRTS
Phe (F) leader peptide, 15 codons: MKHIPFFFAFFFTFP
His (H) leader peptide, 16 codons: MTRVQFKHHHHHHHPD
STOP

Motif Matrices
  a a t g
  c a t g
  g a t g
  t g t g
[Matrix rows: a c g t]
Align and calculate frequencies.
Note: Higher order correlations lost.

Protein starts GeneMark

Motif Matrices
  a a t g = 12
  c a t g = 12
  g a t g = 12
  t g t g = 10
[Matrix rows: a c g t]
Align and calculate frequencies.
Note: Higher order correlations lost.
Score test sets: a c c c = 1
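The scores on this slide are consistent with summing, for each position, the number of training sequences that carry the same base at that position. A minimal sketch under that reading (the function names are illustrative):

```python
from collections import Counter

training = ["aatg", "catg", "gatg", "tgtg"]

def count_matrix(seqs):
    """One Counter of base counts per alignment column."""
    return [Counter(column) for column in zip(*seqs)]

def motif_score(seq, matrix):
    """Sum of per-position counts; bases never seen at a position count 0."""
    return sum(column[base] for column, base in zip(matrix, seq))

matrix = count_matrix(training)
scores = {s: motif_score(s, matrix) for s in training + ["accc"]}
print(scores)  # {'aatg': 12, 'catg': 12, 'gatg': 12, 'tgtg': 10, 'accc': 1}
```

Because each column is scored independently, the matrix cannot express that a base at one position depends on a base at another, which is the "higher order correlations lost" caveat.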


Why probabilistic models in sequence analysis?
- Recognition - Is this sequence a protein start?
- Discrimination - Is this protein more like a hemoglobin or a myoglobin?
- Database search - What are all the sequences in Swiss-Prot that look like a serine protease?

A Basic idea
Assign a number to every possible sequence such that Σ_s P(s|M) = 1.
P(s|M) is the probability of sequence s given a model M.

Sequence recognition
Recognition question - What is the probability that the sequence s is from the start site model M?
P(M|s) = P(M) * P(s|M) / P(s) (Bayes' theorem)
P(M) and P(s) are prior probabilities and P(M|s) is the posterior probability.

Database search
- N = null model (random bases or AAs)
- Report all sequences with logP(s|M) - logP(s|N) > logP(N) - logP(M)
- Example: say the α/β hydrolase fold is rare in the database, about 10 in 10,000,000. The threshold is 20 bits. If 0.05 is taken as the significance level, then the threshold is 24.4 bits.
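The 20-bit figure follows from the prior odds of null model versus model. The sketch below works it out in base-2 logs; treating the significance level as an extra log2(1/0.05) bits on top of that threshold is an assumption about how the slide's second number was obtained, and it gives a value close to, but not exactly, the 24.4 bits quoted.

```python
import math

prior_M = 10 / 10_000_000      # fraction of database sequences with the fold
prior_N = 1 - prior_M          # everything else is the null model

# Threshold on the log-odds score log P(s|M) - log P(s|N), in bits
threshold_bits = math.log2(prior_N / prior_M)

# Assumed correction for a 0.05 significance level
extra_bits = math.log2(1 / 0.05)

print(round(threshold_bits, 2))               # 19.93
print(round(extra_bits, 2))                   # 4.32
print(round(threshold_bits + extra_bits, 2))
```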

Plausible sources of mono, di, tri, & tetra- nucleotide biases
C rare due to lack of uracil glycosylase (cytidine deamination)
TT rare due to lack of UV repair enzymes.
CG rare due to 5methylCG to TG transitions (cytidine deamination)
AGG rare due to low abundance of the corresponding Arg-tRNA.
CTAG rare in bacteria due to error-prone "repair" of CTAGG to C*CAGG.
AAAA excess due to polyA pseudogenes and/or polymerase slippage.
[Codon usage table; values not recovered. Columns: AmAcid, Codon, Number, /1000, Fraction. Rows: Arg AGG, Arg AGA, Arg CGG, Arg CGA, Arg CGT, Arg CGC]
ftp://sanger.otago.ac.nz/pub/Transterm/Data/codons/bct/Esccol.cod

CpG Island + in an ocean of -
First order Markov Model (adjacent bp): MM = 16, HMM = 64 transition probabilities
Hidden states: C+ A+ G+ T+ and C- A- G- T-
P(G+|C+) > P(A+|A+)
P(C-|A+) >

Estimate transition probabilities -- an example
Training set
P(G|C) = #(CG) / Σ_N #(CN)
Laplace pseudocount: Add +1 count to each observed. (p.9,108,321 Dirichlet)
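A minimal sketch of this estimate: count adjacent base pairs in a training set, add a pseudocount of 1 to every possible pair, and normalise each row. The two training strings are made up for illustration, and the code assumes the sequences contain only A, C, G, T.

```python
def transition_probs(seqs, alphabet="ACGT", pseudocount=1):
    """Estimate P(next | current) from adjacent pairs, with Laplace pseudocounts."""
    counts = {x: {y: pseudocount for y in alphabet} for x in alphabet}
    for s in seqs:
        for x, y in zip(s, s[1:]):          # every adjacent pair
            counts[x][y] += 1
    probs = {}
    for x in alphabet:
        total = sum(counts[x].values())     # sum over N of #(xN), plus pseudocounts
        probs[x] = {y: counts[x][y] / total for y in alphabet}
    return probs

p = transition_probs(["CGCGCG", "ACGT"])    # hypothetical training set
print(p["C"]["G"])  # 0.625, i.e. (4 + 1) / (4 + 4)
```

The pseudocount keeps transitions that never occur in the training set from getting probability exactly zero.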

Estimated transition probabilities from 48 "known" islands
Training set
P(G|C) = #(CG) / Σ_N #(CN)

Viterbi: dynamic programming for HMM
Recursion: v_l(i+1) = e_l(x_{i+1}) max_k( v_k(i) a_kl )
s_i = Most probable path
k = 2 states
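A minimal sketch of the Viterbi recursion for a two-state model (island "+" and ocean "-"), computed in log space to avoid underflow. All probabilities below are made-up toy values chosen only to make the example readable; they are not the estimates from the 48 islands. The input sequence is assumed to be non-empty.

```python
import math

def viterbi(x, states, start, trans, emit):
    """Most probable state path for sequence x, returned as a string of state labels."""
    v = [{s: math.log(start[s]) + math.log(emit[s][x[0]]) for s in states}]
    back = []
    for sym in x[1:]:
        row, ptr = {}, {}
        for l in states:
            # max over previous states k of v_k(i) * a_kl
            k_best = max(states, key=lambda k: v[-1][k] + math.log(trans[k][l]))
            row[l] = math.log(emit[l][sym]) + v[-1][k_best] + math.log(trans[k_best][l])
            ptr[l] = k_best
        v.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: v[-1][s])]
    for ptr in reversed(back):               # follow the back-pointers
        path.append(ptr[path[-1]])
    return "".join(reversed(path))

# Toy parameters (assumptions for illustration only)
states = "+-"
start = {"+": 0.5, "-": 0.5}
trans = {"+": {"+": 0.9, "-": 0.1}, "-": {"+": 0.1, "-": 0.9}}
emit = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
        "-": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}}

print(viterbi("AAAACGCGCGAAAA", states, start, trans, emit))  # ----++++++----
```

The table `v` plays the same role as the alignment table earlier in the lecture: each entry is computed once from the previous column, and the path is recovered by a traceback.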
