Motif discovery EM algorithm Gibbs Sampler Enumeration Regression methods Phylogenetic trees Purpose Construction Finding significance Not directly related to today’s topic: EM algorithm tutorial

Finding regulatory motifs - Background From a known family of sequences that are related only by very small motifs (that contribute to the function), find the real motif from vast amounts of junk. Sequence length ~2000; motif size ~10. Science, 262,

The aligned segment can still be quite diverse. Probability ratios by position: Science, 262,

Finding regulatory motifs - Background [Figure: gene, regulatory motifs, transcription factor, transcription apparatus]

ChIP-chip and ChIP-Seq Ren et al. 1999; Iyer et al Thanks to Dr. Steve Qin for the slide.

ChIP-chip and ChIP-Seq Brief Bioinform (2013) 14 (2):

Finding regulatory motifs - Background  The key biological question: Knowing a number of TF binding sites from ChIP-seq, or genes regulated by a certain TF by functional studies, how to find the binding motif?

Finding regulatory motifs - Background Difficulties: (1) Motif locations vary with regard to the gene. (2) Motifs are small (~10 characters to be found in strings of ~2000 characters). (3) Motif sequences vary. (4) Motifs may be multi-block (not considered here). Vavouri et al. Genome Biology :R15

Finding regulatory motifs - Background Summarizing a motif from an aligned sequence block: Brief Bioinform (2013) 14 (2):

Finding regulatory motifs - Conservation One common measure of conservation of a motif: entropy for position i: Or, relative entropy compared to background frequency:
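The two conservation measures named on this slide can be sketched in a few lines. The formulas used here are the standard Shannon entropy and relative entropy (Kullback-Leibler divergence) in bits; the slide's own equations did not survive, so treating them as these standard forms is an assumption, and the function names and the example sites are illustrative only.

```python
import math

ALPHABET = "ACGT"

def column_frequencies(sites):
    """Per-position base frequencies from equal-length aligned sites."""
    width = len(sites[0])
    freqs = []
    for i in range(width):
        column = [s[i] for s in sites]
        freqs.append({b: column.count(b) / len(column) for b in ALPHABET})
    return freqs

def entropy(f):
    """Shannon entropy (bits) of one position: -sum_b f_b log2 f_b."""
    return -sum(p * math.log2(p) for p in f.values() if p > 0)

def relative_entropy(f, background):
    """Relative entropy (bits) of one position against background frequencies."""
    return sum(p * math.log2(p / background[b]) for b, p in f.items() if p > 0)

# Hypothetical aligned sites, used only to exercise the functions.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
uniform = {b: 0.25 for b in ALPHABET}
freqs = column_frequencies(sites)
```

A fully conserved position has entropy 0 and, against a uniform background, relative entropy 2 bits; more variable positions score higher entropy and lower relative entropy.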

Finding regulatory motifs - quality How to judge the quality of an identified motif? The null distribution is not as simple as we hope. As we discussed before, the genome is not random. No simple profile model summarizes the null --- the non-motif background. What to consider: > Position bias: cluster close to the “start codon” ATG > Orientation bias: tend to be up-stream of a gene > Functional specificity: occurring mostly in the sequences under study, and/or mostly near a functional subcategory of genes.

Finding regulatory motifs - formulation The formulation of the computational question: Align tens/hundreds of sequences; only a tiny fraction of the aligned sequences actually support the alignment (the motif). What's needed: (1) A scoring system (likelihood function, loss function, etc.) (2) A very efficient algorithm to search for the optimum in a space of astronomical size. Example: aligning 100 sequences, each 2000 characters long, gives roughly 4000^99 possibilities.

Finding regulatory motifs – scoring system The motif is generated from a position weight matrix; All other parts of the sequences are generated randomly from a background model. Example: Third order Markov background model: CTTATGTA
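As a rough illustration of a third-order Markov background, the sketch below estimates P(next base | previous three bases) from counts and scores a segment by summing the log conditional probabilities. The pseudocount, the function names, and the choice to score only positions that have a full three-letter context are assumptions made for this example, not details from the slide.

```python
import math
from collections import Counter, defaultdict

ALPHABET = "ACGT"

def count_contexts(seqs, order=3):
    """Count how often each base follows each `order`-letter context."""
    counts = defaultdict(Counter)
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1
    return counts

def cond_prob(counts, context, base, pseudocount=1.0):
    """P(base | context) with an additive pseudocount (an assumption here)."""
    seen = counts.get(context, Counter())
    total = sum(seen[b] for b in ALPHABET) + pseudocount * len(ALPHABET)
    return (seen[base] + pseudocount) / total

def segment_log_prob(seq, start, width, counts, order=3):
    """Log background probability of seq[start:start+width] given its context."""
    if start < order:
        raise ValueError("need `order` letters of context before the segment")
    return sum(math.log(cond_prob(counts, seq[i - order:i], seq[i]))
               for i in range(start, start + width))

# Toy training sequence, for illustration only.
background_counts = count_contexts(["ACGTACGTACGTACGT"])
```

For a segment such as CTTATGTA, each letter after the first three is scored as P(A | CTT), P(T | TTA), P(G | TAT), and so on; the first three letters need a lower-order model or the preceding sequence as context.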

Finding regulatory motifs - MEME We are simultaneously seeking two things: (1)A motif profile (θ) (2)The location of the motif in each sequence (π) Sounds familiar? If we treat π as missing data, the problem falls into the realm of the EM algorithm. MEME -- Multiple EM for Motif Elicitation Original idea: Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp , AAAI Press, Menlo Park, California, 1994.
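The idea of treating π as missing data can be sketched with a much simplified EM. This is not MEME itself: it assumes exactly one motif occurrence per sequence, a fixed uniform zero-order background, and a starting θ built from a single word. All names, constants, and the toy sequences are illustrative assumptions.

```python
ALPHABET = "ACGT"

def em_motif(seqs, init_word, n_iter=30, pseudocount=0.1):
    """Simplified EM: one motif occurrence per sequence, uniform background."""
    width = len(init_word)
    bg = {b: 0.25 for b in ALPHABET}
    # Starting theta: favour the letters of the initial word.
    theta = [{b: (0.7 if b == init_word[j] else 0.1) for b in ALPHABET}
             for j in range(width)]
    posteriors = []
    for _ in range(n_iter):
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
        posteriors = []
        for s in seqs:
            # E-step: posterior over start positions, proportional to the
            # motif/background likelihood ratio of each window.
            weights = []
            for p in range(len(s) - width + 1):
                w = 1.0
                for j in range(width):
                    w *= theta[j][s[p + j]] / bg[s[p + j]]
                weights.append(w)
            total = sum(weights)
            post = [w / total for w in weights]
            posteriors.append(post)
            # Accumulate expected letter counts for the M-step.
            for p, q in enumerate(post):
                for j in range(width):
                    counts[j][s[p + j]] += q
        # M-step: re-estimate theta from the expected counts.
        theta = [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]
    return theta, posteriors

# Toy sequences with the word TGACTCA planted at positions 4, 0 and 7.
toy_seqs = ["ACGTTGACTCAGGCTA", "TGACTCAACCGGTTAC", "CCATAGGTGACTCATC"]
theta_hat, post_hat = em_motif(toy_seqs, "TGACTCA")
```

The posterior over start positions plays the role of the missing π; the M-step turns expected letter counts into the next θ.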

Finding regulatory motifs - MEME D'haeseleer P. Nat Biotechnol Aug;24(8):

Finding regulatory motifs - MEME The EM algorithm may be trapped at a local optimum. To avoid local optima, MEME tries multiple starting points: (1) every n-character word present in the sequences is tried as a starting point; (2) the best result is selected; (3) the best motif is masked and the search is repeated for the second best, and so on. It allows repeated occurrences of a motif in every sequence.

Finding regulatory motifs – Gibbs Sampler Can be considered a stochastic version of the EM algorithm. Probabilistic updates avoid local optima to some extent. Lawrence et al. Science 262, Liu, JS et al. J. Am. Stat. Assoc. 90, 1156–1170 S: observed sequences θ: motif parameters θ0: background parameters (not of interest, but important) π: motif locations Goal: sample from P(π, θ | θ0, S) by alternating two steps: sample π from P(π | θ, θ0, S); sample θ from P(θ | π, θ0, S)

Finding regulatory motifs – Gibbs Sampler Randomly initialize θ (a probability matrix). Randomly select one sequence i and score all of its positions using θ. Update π(i) based on the pattern probability. Iterate until θ converges. Weight for each segment x: A_x = Q_x / P_x, where Q_x is the likelihood of x under the position weight matrix and P_x is its likelihood under the background model.
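A minimal site-sampler sketch of these steps follows. It makes several assumptions not stated on the slide: positions (rather than θ) are initialized at random, θ is rebuilt from all sequences except the selected one, the background is uniform, and a fixed number of updates replaces a convergence check. Function and parameter names are illustrative.

```python
import random

ALPHABET = "ACGT"

def build_theta(seqs, positions, width, skip=None, pseudocount=0.5):
    """Position weight matrix from the current sites, optionally leaving one out."""
    counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
    for k, (s, p) in enumerate(zip(seqs, positions)):
        if k == skip:
            continue
        for j in range(width):
            counts[j][s[p + j]] += 1
    return [{b: c[b] / sum(c.values()) for b in ALPHABET} for c in counts]

def segment_weights(seq, theta, bg, width):
    """A_x = Q_x / P_x for every window x of the sequence."""
    weights = []
    for p in range(len(seq) - width + 1):
        a = 1.0
        for j in range(width):
            a *= theta[j][seq[p + j]] / bg[seq[p + j]]
        weights.append(a)
    return weights

def gibbs_motif(seqs, width, n_sweeps=200, seed=0):
    rng = random.Random(seed)
    bg = {b: 0.25 for b in ALPHABET}  # uniform background (assumption)
    positions = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(n_sweeps * len(seqs)):
        i = rng.randrange(len(seqs))                       # pick one sequence
        theta = build_theta(seqs, positions, width, skip=i)
        weights = segment_weights(seqs[i], theta, bg, width)
        # Sample the new site with probability proportional to A_x.
        positions[i] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return positions, build_theta(seqs, positions, width)

toy_seqs = ["ACGTTGACTCAGGCTA", "TGACTCAACCGGTTAC", "CCATAGGTGACTCATC"]
sampled_positions, sampled_theta = gibbs_motif(toy_seqs, width=7, n_sweeps=20, seed=1)
```

Because the new site is sampled rather than chosen as the maximum, a run can leave a poor configuration that a deterministic update would stay in; the result still depends on the seed and the number of updates.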

Finding regulatory motifs – Gibbs Sampler Science 262, Quote:

Finding regulatory motifs – Enumeration Enumeration ??!!! Yes. It works very well compared to EM, Gibbs sampler, and other methods. Tompa, M. et al. Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol. 23, 137–144 (2005) The basic idea: Simply count all possible m-mers in all the sequences of interest. Select those with exceptionally high frequencies (by testing). Merge similar ones into a position weight matrix. Using efficient programming (hash), one scan through the sequences gives us all frequencies. The real methods are much more complicated than above.
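The counting step of the basic idea can be sketched with a hash-based counter: one scan gives every m-mer frequency. The enrichment ratio against a uniform i.i.d. background is a simple stand-in for the testing step, chosen for this example; the real methods use more careful statistics.

```python
from collections import Counter

def count_mers(seqs, m):
    """Count every m-mer in one pass over the sequences (hash-based)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - m + 1):
            counts[s[i:i + m]] += 1
    return counts

def enrichment(seqs, m, background=None):
    """Observed count divided by the count expected under an i.i.d. background."""
    bg = background or {b: 0.25 for b in "ACGT"}
    counts = count_mers(seqs, m)
    n_windows = sum(max(len(s) - m + 1, 0) for s in seqs)
    ratios = {}
    for word, c in counts.items():
        p = 1.0
        for b in word:
            p *= bg[b]
        ratios[word] = c / (n_windows * p)
    return ratios

toy_counts = count_mers(["ACGACG"], 3)
toy_ratios = enrichment(["ACGACG"], 3)
```

Words with high ratios would then be tested for significance and merged with similar words into a position weight matrix.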

Finding regulatory motifs – MDscan To address the problem: some input sequences may not contain the motif. Liu XS et al. Nature Biotechnology 20, (2002) (1) Find all non-redundant w-mers ("seeds") from the most reliable subset of t sequences; (2) Find all matches (≥m identity) in the t sequences; (3) Establish the position weight matrix using these matches; (4) Evaluate the motif matrix (approximate maximum a posteriori (MAP) score):
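Steps (1) to (3) can be sketched as follows. The MAP score of step (4) is omitted because its formula is not given in the text. The function names and the toy sequences are assumptions for illustration, not MDscan's actual interface.

```python
ALPHABET = "ACGT"

def identity(a, b):
    """Number of positions at which two equal-length words agree."""
    return sum(x == y for x, y in zip(a, b))

def enumerate_seeds(top_seqs, w):
    """Step (1): all distinct w-mers in the most reliable sequences."""
    return sorted({s[i:i + w] for s in top_seqs for i in range(len(s) - w + 1)})

def seed_matches(top_seqs, seed, min_identity):
    """Step (2): windows agreeing with the seed at >= min_identity positions."""
    w = len(seed)
    return [s[i:i + w] for s in top_seqs for i in range(len(s) - w + 1)
            if identity(seed, s[i:i + w]) >= min_identity]

def count_matrix(words):
    """Step (3): per-position letter counts, the basis of the weight matrix."""
    counts = [{b: 0 for b in ALPHABET} for _ in range(len(words[0]))]
    for word in words:
        for j, b in enumerate(word):
            counts[j][b] += 1
    return counts

top = ["TTGACTCATT", "AATGAGTCAA"]          # toy "top" sequences
matches = seed_matches(top, "TGACTCA", min_identity=6)
matrix = count_matrix(matches)
```

In the toy input, the seed TGACTCA collects itself and the one-mismatch word TGAGTCA, so the fourth column of the count matrix holds one C and one G.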

Finding regulatory motifs – MDscan From the top 10~50 seeds from the previous step, iterate. Forward step: Using the non-top sequences, search for the w-mer seed; if the addition of a "match" increases the score of the w-mer, the sequence is added. Backward step: Re-examine all matches of each seed; if the removal of a match increases the score, remove the match. Report the highest scoring motifs.

Finding regulatory motifs – Motif Regressor Conlon et al. PNAS 6:3339. (2003) Goal: link gene expression with motif discovery.

Finding regulatory motifs – Motif Regressor How well a sequence agrees with a motif: all w-mers in this sequence motif model background HMM model For every motif: Stepwise model selection using all “good” motifs from last step:

Modeling inter-dependent positions Zhou and Liu Bioinformatics 2005 Barash et al. RECOMB 2003 Thanks to Dr. Steve Qin for the slide

Detect intra-dependent position pairs Hu et al. Nucleic Acids Research, 2009, 1–14 Build intra-dependency into the motif model.

Phylogenetic tree

Phylogenetic tree  It can be seen as a model beyond multiple sequence alignment: sequence similarity → evolutionary relationship.  The biological relevance of phylogenetic trees: finding evolutionary relationships among organisms; helping predict gene functions; finding important mutations in rapidly changing organisms, e.g. HIV. In terms of computation: sifting out real mutations from all possible pairs at a given location of an alignment.

Building phylogenetic trees  A binary tree structure is assumed: at every split there are two descendant edges. Biologically: at any time point, only one new species can be generated. [Figure labels: edge, node]  The tree can be built for species, or for just a certain protein.  The edge length reflects evolutionary distance. Different organisms evolve at different speeds. Different molecules evolve at different speeds.

Rooted and unrooted trees  The real (hidden) evolutionary path is a rooted tree.  Phylogenetic trees are estimates of the real tree based on some distance information.  Some methods estimate where the root is, while some do not. [Figure: rooted and unrooted trees over the leaves a, b, c, d] Example sequences: a = AAAGGGGTTTT, b = AAAGGGGCCCC, c = AAATTTTAAAA, d = AAACCCCAAAA

Reconstruction of trees  Maximum parsimony: find the tree that requires the fewest mutations to derive the observed sequences.  Distance-based reconstruction: similar to clustering methods.  Likelihood-based reconstruction: needs a model for sequence evolution. Find the tree that gives the highest likelihood of the observed data. (Not discussed here.) The first two methods can be seen as likelihood-based under very simple sequence evolution models and other assumptions.

Maximum parsimony Four sequences: AAG AAA GGA AGA Three possible trees: #changes: 3 4 4

Maximum parsimony  Find the tree that can explain the observed data with the minimal number of substitutions. This is analogous to model selection: if two models explain the data equally well, the simpler is preferred. Assumptions: equal evolutionary speed; independence between positions. Weights can be applied to account for different probabilities of mutations.

Maximum parsimony Tasks: (1) With a given topology and leaf assignments, find the minimal #substitutions. Note: Non-leaf nodes can vary. Needs minimization. Using the independence assumption, can compute one position at a time. (2) Search through all possible topologies with leaf assignments.

Maximum parsimony Task (1) example: the simplest: Fitch's algorithm. Bottom to top: [Figure: tree with the candidate state sets at each node] Min(# substitutions) = #union operations = 2
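The bottom-up pass of Fitch's algorithm can be sketched as below. Trees are written as nested pairs of leaf names and each alignment column is scored independently, which is a representation chosen for this example. The four sequences are the ones from the earlier "Four sequences: AAG AAA GGA AGA" slide; the leaf names s1 to s4 are made up.

```python
def fitch_site(tree, states):
    """Return (candidate state set, #substitutions) for one alignment column."""
    if isinstance(tree, str):                 # leaf: its observed state, no cost
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    common = left_set & right_set
    if common:                                # intersection: no new substitution
        return common, left_cost + right_cost
    # Empty intersection: take the union and count one substitution.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, seqs):
    """Sum the per-column Fitch costs (positions assumed independent)."""
    length = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, {name: s[i] for name, s in seqs.items()})[1]
               for i in range(length))

seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}
trees = [
    (("s1", "s2"), ("s3", "s4")),
    (("s1", "s3"), ("s2", "s4")),
    (("s1", "s4"), ("s2", "s3")),
]
scores = [parsimony_score(t, seqs) for t in trees]  # [3, 4, 4]
```

The scores 3, 4 and 4 agree with the #changes listed for the three possible trees of these four sequences. The placement of the root does not change the Fitch cost, so rooting each unrooted tree on its middle edge is harmless here.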

Maximum parsimony Task (2) is a very hard problem. There are ~34 million possible trees when n=10. A number of heuristic methods were proposed to deal with the problem. A few names here: Nearest Neighbour Interchanges, Branch and Bound, Tree Bisection and Reconnection, … Some of these methods use a good topology found by distance-based methods (below) as a starting point.
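The size of the search space can be checked with the standard counting formulas for binary trees with n labelled leaves: (2n-3)!! rooted topologies and (2n-5)!! unrooted ones. The ~34 million figure matches the rooted count for n=10. The function names are illustrative.

```python
def n_rooted_trees(n):
    """Number of rooted binary topologies with n labelled leaves: (2n-3)!!"""
    count = 1
    for k in range(3, 2 * n - 2, 2):   # 3 * 5 * ... * (2n - 3)
        count *= k
    return count

def n_unrooted_trees(n):
    """Number of unrooted binary topologies with n >= 3 labelled leaves: (2n-5)!!"""
    return n_rooted_trees(n - 1)

print(n_rooted_trees(10))    # 34459425
print(n_unrooted_trees(10))  # 2027025
```

For four leaves the unrooted count is 3, which is why the four-sequence example has exactly three possible trees.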

Distance-based methods UPGMA Same assumptions as parsimony

Likelihood-based methods Data (aligned sequences): D Evolutionary model: M Tree structure: T Maximize Prob(D | M, T) M: The probability of changing from character b to character a in edge length (time) t. With independence assumption, between two sequences:
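To make the role of M concrete, the sketch below uses the Jukes-Cantor (JC69) model as the evolutionary model. The slide does not name a model, so JC69 is an assumption chosen because it is the simplest one. Under the independence assumption, the likelihood between two aligned sequences is the product over positions of the substitution probabilities; the log-likelihood is their sum.

```python
import math

def jc69_prob(a, b, t):
    """P(a | b, t) under Jukes-Cantor; t is in expected substitutions per site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def pair_log_likelihood(x, y, t):
    """Log-likelihood of x given y at distance t > 0 (positions independent)."""
    return sum(math.log(jc69_prob(a, b, t)) for a, b in zip(x, y))

def jc69_distance(x, y):
    """Closed-form maximum-likelihood distance under JC69."""
    p = sum(a != b for a, b in zip(x, y)) / len(x)
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

x, y = "AAAGGGGTTTT", "AAAGGGGCCCC"
t_hat = jc69_distance(x, y)
```

The closed-form distance is the value of t that maximizes `pair_log_likelihood`, which gives a quick sanity check on the two functions.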

Likelihood-based methods The likelihood of a tree is built from the substitution probabilities along each of its branches, each of which depends on the product of the branch's substitution rate and the branch length. This allows mutation rates to vary across branches. Computationally costly.

Assessing significance  A simple non-parametric bootstrap strategy: Resample columns (positions) from the aligned sequences. The confidence of a branching event is assessed by the fraction of its occurrence in the trees from the resampled sequences.
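The column-resampling bootstrap can be sketched as follows. To keep the example short, a four-taxon distance comparison (the four-point condition on Hamming distances) stands in for a full tree-building method; that substitution, and the function names, are assumptions for illustration.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def resample_columns(seqs, rng):
    """Draw alignment columns with replacement, keeping rows aligned."""
    length = len(next(iter(seqs.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(s[i] for i in cols) for name, s in seqs.items()}

def supports_split(seqs, left, right):
    """Four-taxon check: do pairwise distances favour the split left | right?"""
    (a, b), (c, d) = left, right

    def dist(u, v):
        return hamming(seqs[u], seqs[v])

    within = dist(a, b) + dist(c, d)
    return (within < dist(a, c) + dist(b, d)
            and within < dist(a, d) + dist(b, c))

def bootstrap_support(seqs, left, right, n_boot=200, seed=0):
    """Fraction of resampled alignments in which the split is recovered."""
    rng = random.Random(seed)
    hits = sum(supports_split(resample_columns(seqs, rng), left, right)
               for _ in range(n_boot))
    return hits / n_boot

example = {"a": "AAAGGGGTTTT", "b": "AAAGGGGCCCC",
           "c": "AAATTTTAAAA", "d": "AAACCCCAAAA"}
support = bootstrap_support(example, ("a", "b"), ("c", "d"))
```

With a real tree-building method, `supports_split` would be replaced by building a tree from each resampled alignment and checking whether the branching event of interest appears in it.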

Assessing significance Derdeyn et al. Science. 303(5666):