Tue Sep 18 Intro 1: Computing, statistics, Perl, Mathematica Tue Sep 25 Intro 2: Biology, comparative genomics, models & evidence, applications Tue Oct.

Presentation transcript:

Bio 101: Genomics & Computational Biology
Tue Sep 18 Intro 1: Computing, statistics, Perl, Mathematica
Tue Sep 25 Intro 2: Biology, comparative genomics, models & evidence, applications
Tue Oct 02 DNA 1: Polymorphisms, populations, statistics, pharmacogenomics, databases
Tue Oct 09 DNA 2: Dynamic programming, Blast, multi-alignment, HiddenMarkovModels
Tue Oct 16 RNA 1: 3D-structure, microarrays, library sequencing & quantitation concepts
Tue Oct 23 RNA 2: Clustering by gene or condition, DNA/RNA motifs
Tue Oct 30 Protein 1: 3D structural genomics, homology, dynamics, function & drug design
Tue Nov 06 Protein 2: Mass spectrometry, modifications, quantitation of interactions
Tue Nov 13 Network 1: Metabolic kinetic & flux balance optimization methods
Tue Nov 20 Network 2: Molecular computing, self-assembly, genetic algorithms, neural-nets
Tue Nov 27 Network 3: Cellular, developmental, social, ecological & commercial models
Tue Dec 04 Project Presentations
Tue Dec 11 Project Presentations
Tue Jan 08 Project Presentations
Tue Jan 15 Project Presentations

RNA1: Last week's take home lessons
- Integration with previous topics (HMM for RNA structure)
- Goals of molecular quantitation (maximal fold-changes, clustering & classification of genes & conditions/cell types, causality)
- Genomics-grade measures of RNA and protein and how we choose (SAGE, oligo-arrays, gene-arrays)
- Sources of random and systematic errors (reproducibility of RNA source(s), biases in labeling, non-polyA RNAs, effects of array geometry, cross-talk)
- Interpretation issues (splicing, 5' & 3' ends, editing, gene families, small RNAs, antisense, apparent absence of RNA)
- Time series data: causality, mRNA decay, time-warping

RNA2: Today's story & goals
- Clustering by gene and/or condition
- Distance and similarity measures
- Clustering & classification
- Applications
- DNA & RNA motif discovery & search

Gene Expression Clustering Decision Tree
Data Normalization | Distance Metric | Linkage | Clustering Method
Data Normalization:
- Data: Ratios, Log Ratios, Absolute Measurement
- How to normalize: Variance normalize, Mean center normalize, Median center normalize
- What to normalize: genes, conditions
Distance Metric:
- Euclidean Dist., Manhattan Dist., Sup. Dist., Correlation Coeff.
Linkage:
- Single, Complete, Average, Centroid
Clustering Method:
- Unsupervised | Supervised (SVM, Relevance Networks)
- Hierarchical (Minimal Spanning Tree) | Non-hierarchical (K-means, SOM)

(Whole genome) RNA quantitation objectives
- RNAs showing maximum change; minimum change detectable/meaningful
- RNA absolute levels (compare protein levels); minimum amount detectable/meaningful
- Classification: drugs & cancers
- Network: direct causality, motifs

Clustering vs. supervised learning
- K-means clustering
- SOM = Self Organizing Maps
- SVD = Singular Value Decomposition
- PCA = Principal Component Analysis
- SVM = Support Vector Machine classification, and Relevance networks
Brown et al. PNAS 97:262; Butte et al. PNAS 97:12182

Cluster analysis of mRNA expression data
- By gene (rat spinal cord development, yeast cell cycle): Wen et al., 1998; Tavazoie et al., 1999; Eisen et al., 1998; Tamayo et al., 1999
- By condition or cell-type or by gene & cell-type (human cancer): Golub et al., 1999; Alon et al., 1999; Perou et al., 1999; Weinstein et al., 1997; Cheng, ISMB

Cluster Analysis Protein/protein complex Genes DNA regulatory elements

Clustering: hierarchical & non-hierarchical
Hierarchical: a series of successive fusions of the data until a final number of clusters is obtained. E.g. Minimal Spanning Tree: each component of the population starts as its own cluster. Next, the two clusters with the minimum distance between them are fused to form a single cluster. This is repeated until all components are grouped.
Non-hierarchical: e.g. K-means: K initial cluster centers are chosen such that the points are mutually farthest apart. Each component in the population is assigned to one cluster by minimum distance. The centroids' positions are recalculated, and the assignment step is repeated until no component changes cluster. The criterion minimized is the within-cluster sum of variances.
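The K-means loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `kmeans` and `within_cluster_ss` are hypothetical, and the initial centers are drawn at random (an assumption made for brevity) rather than chosen to be mutually farthest apart.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on a list of equal-length tuples (squared Euclidean distance)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random initial centers
    assignment = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster: converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

def within_cluster_ss(points, assignment, centroids):
    """The criterion K-means minimizes: within-cluster sum of squared distances."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[c]))
        for p, c in zip(points, assignment)
    )

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(points, k=2)
print(labels, centers, within_cluster_ss(points, labels, centers))
```

Because the result depends on the starting centers, it is common to rerun with several seeds and keep the run with the smallest within-cluster sum of squares.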

Clusters of Two-Dimensional Data

Key Terms in Cluster Analysis
- Distance measures
- Similarity measures
- Hierarchical and non-hierarchical
- Single/complete/average linkage
- Dendrogram

Distance Measures: Minkowski Metric

Most Common Minkowski Metrics
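The formulas on these two slides did not survive as text. For reference, the standard definition of the Minkowski metric and its three most common special cases can be written as follows; the symbols p (number of features) and r (order of the metric) are notation chosen here, not necessarily the slide's.

```latex
d_r(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^{r} \right)^{1/r}

r = 1:        d_1(x, y) = \sum_{i=1}^{p} |x_i - y_i|                      % Manhattan distance
r = 2:        d_2(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}              % Euclidean distance
r \to \infty: d_\infty(x, y) = \max_{1 \le i \le p} |x_i - y_i|            % sup distance
```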

An Example [Figure: two points x and y whose coordinates differ by 4 and 3]
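A small numeric check of the three metrics, assuming the example figure shows two points whose coordinates differ by 4 along one axis and 3 along the other (that reading of the figure is an assumption). The helper names `minkowski` and `sup_distance` are hypothetical.

```python
def minkowski(x, y, r):
    """Minkowski distance of order r between two equal-length vectors."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1.0 / r)

def sup_distance(x, y):
    """Limit of the Minkowski metric as r grows: the largest coordinate difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (0, 0), (4, 3)
print(minkowski(x, y, 2))   # Euclidean distance: 5.0
print(minkowski(x, y, 1))   # Manhattan distance: 7.0
print(sup_distance(x, y))   # sup distance: 4
```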

Manhattan distance is called Hamming distance when all features are binary. Gene Expression Levels Under 17 Conditions (1-High,0-Low)

Similarity Measures: Correlation Coefficient

What kind of x and y give linear CC ?

Similarity Measures: Correlation Coefficient Time Gene A Gene B Gene A Time Gene B Expression Level Time Gene A Gene B
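To make the contrast with distance measures concrete, here is a minimal sketch of the Pearson correlation coefficient applied to made-up expression profiles. The profiles and the function name `pearson` are illustrative assumptions, not data from the lecture.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape as gene_a, 10x the magnitude
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # mirror image of gene_a

print(pearson(gene_a, gene_b))  # close to +1: same shape despite different scale
print(pearson(gene_a, gene_c))  # close to -1: opposite behaviour
print(math.dist(gene_a, gene_b), math.dist(gene_a, gene_c))
```

By Euclidean distance gene_a is far closer to gene_c than to gene_b, while the correlation coefficient groups gene_a with gene_b. Which behaviour is wanted depends on whether the shape or the magnitude of expression matters for the question at hand.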

Hierarchical Clustering Dendrograms
Alon et al.: clustering tree for the tissue samples, tumors (T) and normal tissue (n).

Hierarchical Clustering Techniques

The distance between two clusters is defined as the distance between:
- Single-Link Method / Nearest Neighbor: their closest members.
- Complete-Link Method / Furthest Neighbor: their furthest members.
- Centroid: their centroids.
- Average: average of all cross-cluster pairs.
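The four linkage definitions translate directly into code. The sketch below assumes Euclidean distance between members and uses a hypothetical helper name, `cluster_distance`; it covers only the cluster-to-cluster distance, not the full agglomerative loop.

```python
import math
from itertools import product

def cluster_distance(A, B, method="single"):
    """Distance between clusters A and B (lists of coordinate tuples)."""
    pair_dists = [math.dist(a, b) for a, b in product(A, B)]
    if method == "single":      # nearest neighbor: closest members
        return min(pair_dists)
    if method == "complete":    # furthest neighbor: furthest members
        return max(pair_dists)
    if method == "average":     # mean over all cross-cluster pairs
        return sum(pair_dists) / len(pair_dists)
    if method == "centroid":    # distance between the two cluster means
        ca = [sum(col) / len(A) for col in zip(*A)]
        cb = [sum(col) / len(B) for col in zip(*B)]
        return math.dist(ca, cb)
    raise ValueError(f"unknown linkage method: {method}")

A = [(0, 0), (0, 2)]
B = [(3, 0), (3, 2)]
for m in ("single", "complete", "average", "centroid"):
    print(m, cluster_distance(A, B, m))
```

Single linkage tends to follow chains of nearby points, while complete linkage favours compact clusters, so the same data can yield different dendrograms under the two rules.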

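Single-Link Method [Figure: points a, b, c, d with their Euclidean distance matrix. Merge steps: (1) {a,b}, c, d; (2) {a,b,c}, d; (3) {a,b,c,d}]

Complete-Link Method [Figure: points a, b, c, d with their Euclidean distance matrix. Merge steps: (1) {a,b}, c, d; (2) {a,b}, {c,d}; (3) {a,b,c,d}]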

Dendrograms: Single-Link vs. Complete-Link

Which clustering methods do you suggest for the following two-dimensional data?

Nadler and Smith, Pattern Recognition Engineering, 1993

Normalized Expression Data Tavazoie et al (

Representation of expression data [Figure: normalized expression data from microarrays as a matrix d_ij of genes (Gene 1 ... Gene N) by time-points (T1, T2, T3); each gene is plotted as a point on axes Time-point 1, Time-point 2, Time-point 3]

Identifying prevalent expression patterns (gene clusters) [Figure: genes plotted on axes Time-point 1, Time-point 2, Time-point 3; per-cluster plots of normalized expression vs. time-point]

Cluster contents [Table: Genes | MIPS functional category. Categories shown: Glycolysis, Nuclear Organization, Ribosome, Translation, Unknown]

RNA2: Today's story & goals
- Clustering by gene and/or condition
- Distance and similarity measures
- Clustering & classification
- Applications
- DNA & RNA motif discovery & search

Motif-finding algorithms
- oligonucleotide frequencies
- Gibbs sampling (e.g. AlignACE)
- MEME
- ClustalW
- MACAW

Feasibility of a whole-genome motif search?
Transcription control sites carry ~7 bases of information. Genome: 12 Mb.
7 bases of information (14 bits) correspond to about 1 match every 2^14 (~16,000) sites, i.e. ~1500 such matches in a 12 Mb genome (24 × 10^6 sites, counting both strands).
The distribution of numbers of sites for different motifs is Poisson with mean 1500, which can be approximated as normal with a mean of 1500 and a standard deviation of ~40 sites.
Therefore, ~100 sites are needed to achieve a detectable signal above background.
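The arithmetic behind this estimate can be reproduced in a few lines. The sketch assumes equal base frequencies (2 bits per fully specified base) and counts both strands, which is how 12 Mb becomes 24 × 10^6 candidate sites; the variable names are illustrative.

```python
import math

genome_bp = 12e6                  # yeast genome size used on the slide
candidate_sites = 2 * genome_bp   # both strands: 24e6 start positions
info_bits = 14                    # 7 fully specified bases x 2 bits per base

p_match = 2.0 ** -info_bits       # chance that a random site matches: 1 in 16384
expected_matches = candidate_sites * p_match
std_dev = math.sqrt(expected_matches)  # Poisson: variance equals the mean

print(round(1 / p_match))         # 16384 sites per chance match
print(round(expected_matches))    # about 1465, i.e. roughly 1500
print(round(std_dev, 1))          # about 38.3, i.e. roughly 40
print(round(100 / std_dev, 1))    # 100 extra sites is about 2.6 standard deviations
```

So a motif needs on the order of 100 real sites before it stands out from chance matches, which is the motivation for shrinking the sequence search space first.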

Sequence Search Space Reduction
- Whole-genome mRNA expression data: two-way comparisons between different conditions or mutants, clustering/grouping over many conditions/timepoints.
- Shared phenotype (functional category).
- Conservation among different species.
- Details of the sequence selection: eliminate protein-coding regions, repetitive regions, and any other sequences not likely to contain control sites.

Motif Finding: AlignACE (Aligns nucleic Acid Conserved Elements)
Modification of Gibbs Motif Sampling (GMS), a routine for motif finding in protein sequences (Lawrence, et al. Science 262, 1993).
Advantages of GMS:
- stochastic sampling
- variable number of sites per input sequence
- distributed information content per motif
AlignACE modifications:
- considers both strands of DNA simultaneously
- efficiently returns multiple distinct motifs
- various other tweaks
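As a rough illustration of the stochastic sampling idea, here is a deliberately simplified Gibbs site sampler. It is not AlignACE and not the Lawrence et al. routine. It assumes exactly one site per sequence, a single strand, a fixed motif width, a uniform background of 0.25 per base, and input limited to the letters A, C, G, T. The function name `gibbs_site_sampler` and all parameters are hypothetical.

```python
import random

def gibbs_site_sampler(seqs, width, iters=200, seed=0, pseudo=1.0):
    """Return one sampled site start per sequence (simplified Gibbs sampling)."""
    rng = random.Random(seed)
    bases = "ACGT"
    # Initial seeding: a random site position in every sequence.
    pos = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iters):
        for i, s in enumerate(seqs):
            # Count matrix built from the current sites of all OTHER sequences.
            counts = [{b: pseudo for b in bases} for _ in range(width)]
            for j, t in enumerate(seqs):
                if j != i:
                    for k, b in enumerate(t[pos[j]:pos[j] + width]):
                        counts[k][b] += 1
            total = len(seqs) - 1 + 4 * pseudo
            # Weight of each candidate start: motif model vs. uniform background.
            weights = []
            for start in range(len(s) - width + 1):
                w = 1.0
                for k, b in enumerate(s[start:start + width]):
                    w *= (counts[k][b] / total) / 0.25
                weights.append(w)
            # Stochastic step: sample the new site in proportion to its weight.
            pos[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = [
    "ACGTACGTTGACTCATTGCA",
    "GGCATGACTCAACGTTAGCA",
    "TTGCAATGCATGACTCAGGA",
    "CATGACTCATTTTGCAGGCA",
]
starts = gibbs_site_sampler(seqs, width=7)
print([s[p:p + 7] for s, p in zip(seqs, starts)])
```

Because the sampler is stochastic, different seeds can settle on different alignments, which is why motif finders of this kind are run from many random seedings and the best-scoring alignment is kept.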

AlignACE Example: Input Data Set
5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT …HIS7
5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG …ARO4
5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT …ILV6
5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC …THR4
5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA …ARO1
5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA …HOM2
5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA …PRO
bp of upstream sequence per gene are searched in Saccharomyces cerevisiae.

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA ********** MAP score = (maximum) …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 AlignACE Example The Target Motif

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA ********** TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC 5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA ********** TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC MAP score = …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 AlignACE Example Initial Seeding

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA ********** TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC Add? ********** TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC TCTCTCTCCA How much better is the alignment with this site as opposed to without? …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 AlignACE Example Sampling

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA ********** TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC Add? ********** TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC How much better is the alignment with this site as opposed to without? Remove. ATGAAAAAAT …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 AlignACE Example Continued Sampling

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA ********** GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC Add? ********** TGAAAAATTC GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC How much better is the alignment with this site as opposed to without? …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 AlignACE Example Continued Sampling

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA ********** GACATCGAAA GCACTTCGGC GAGTCATTAC GTAAATTGTC CCACAGTCCG TGTGAAGCAC ********* * GACATCGAAAC GCACTTCGGCG GAGTCATTACA GTAAATTGTCA CCACAGTCCGC TGTGAAGCACA How much better is the alignment with this new column structure? …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 AlignACE Example Column Sampling

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA ********** MAP score = …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 AlignACE Example The Best Motif

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAA X AGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAAT X ACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACT X ACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAA X AGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACT X ATCCCGAACATGAAA 5’- ATTGATTGACT X ATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACT X ATTCTGACT X TTTTTTGGAAAGTGTGGCATGTGCTTCACACA AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA ********** …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 Take the best motif found after a prescribed number of random seedings. Select the strongest position of the motif. Mark these sites in the input sequence, and do not allow future motifs to sample those sites. Continue sampling. AlignACE Example Masking (old way)

5’- TCTCTCTCCACGGCTAATTAGGTGATCATGAAAAAATGAAAAATTCATGAGAAAAGAGTCAGACATCGAAACATACAT 5’- ATGGCAGAATCACTTTAAAACGTGGCCCCACCCGCTGCACCCTGTGCATTTTGTACGTTACTGCGAAATGACTCAACG 5’- CACATCCAACGAATCACCTCACCGTTATCGTGACTCACTTTCTTTCGCATCGCCGAAGTGCCATAAAAAATATTTTTT 5’- TGCGAACAAAAGAGTCATTACAACGAGGAAATAGAAGAAAATGAAAAATTTTCGACAAAATGTATAGTCATTTCTATC 5’- ACAAAGGTACCTTCCTGGCCAATCTCACAGATTTAATATAGTAAATTGTCATGCATATGACTCATCCCGAACATGAAA 5’- ATTGATTGACTCATTTTCCTCTGACTACTACCAGTTCAAAATGTTAGAGAAAAATAGAAAAGCAGAAAAAATAAATAA 5’- GGCGCCACAGTCCGCGTTTGGTTATCCGGCTGACTCATTCTGACTCTTTTTTGGAAAGTGTGGCATGTGCTTCACACA AAAAGAGTCA AAATGACTCA AAGTGAGTCA AAAAGAGTCA GGATGAGTCA AAATGAGTCA GAATGAGTCA AAAAGAGTCA ********** …HIS7 …ARO4 …ILV6 …THR4 …ARO1 …HOM2 …PRO3 Maintain a list of all distinct motifs found. Use CompareACE to compare subsequent motifs to those already found. Quickly reject weaker, but similar motifs. AlignACE Example Masking (new way)

MAP Score
- B, Γ = standard Beta & Gamma functions
- N = number of aligned sites; T = number of total possible sites
- F_jb = number of occurrences of base b at position j (F = sum)
- G_b = background genomic frequency for base b
- β_b = n × G_b for n pseudocounts (β = sum)
- W = width of motif; C = number of columns in motif (W >= C)

MAP Score
- N = number of aligned sites
- R = overrepresentation of those sites
- MAP ≈ N log R

AlignACE Example: Final Results (alignment of upstream regions from 116 amino acid biosynthetic genes in S. cerevisiae) [Table columns: Motif, MAP score]

Indices used to evaluate motif significance
- Group specificity
- Functional enrichment
- Positional bias
- Palindromicity
- Known motifs (CompareACE)

Searching for additional motif instances in the entire genome sequence
Searches over the entire genome for additional high-scoring instances of the motif are done using the ScanACE program, which uses the Berg & von Hippel weight matrix (1987).
- M = length of binding site motif
- B = base at position l within the motif
- n_lB = number of occurrences of base B at position l in the input alignment
- n_lO = number of occurrences of the most common base at position l in the input alignment
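The scoring formula itself did not survive as text. The sketch below assumes the commonly quoted Berg & von Hippel form, score = sum over l = 1..M of ln[(n_lB + 0.5) / (n_lO + 0.5)]. The 0.5 pseudocount and the function name `berg_von_hippel_score` are assumptions, and this is not ScanACE itself. The aligned sites are the ones shown in the AlignACE example slides.

```python
import math

def berg_von_hippel_score(site, aligned_sites):
    """Score a candidate site against an alignment of known sites.

    Each position contributes ln((n_lB + 0.5) / (n_lO + 0.5)); the 0.5
    pseudocount is an assumed convention, not confirmed by the slide.
    """
    M = len(aligned_sites[0])
    score = 0.0
    for l in range(M):
        column = [s[l] for s in aligned_sites]
        n_lB = column.count(site[l])                   # count of the candidate's base
        n_lO = max(column.count(b) for b in "ACGT")    # count of the most common base
        score += math.log((n_lB + 0.5) / (n_lO + 0.5))
    return score

aligned = [
    "AAAAGAGTCA", "AAATGACTCA", "AAGTGAGTCA", "AAAAGAGTCA",
    "GGATGAGTCA", "AAATGAGTCA", "GAATGAGTCA", "AAAAGAGTCA",
]
print(berg_von_hippel_score("AAATGAGTCA", aligned))  # consensus site: the maximum, 0.0
print(berg_von_hippel_score("AAATGACTCA", aligned))  # one mismatch: slightly negative
print(berg_von_hippel_score("CCCCCCCCCC", aligned))  # unrelated site: strongly negative
```

Under this form the consensus scores exactly zero and every departure from it lowers the score, so a genome scan amounts to sliding the function along both strands and keeping sites above a chosen cutoff.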

[Figure: MCB and SCB motifs vs. their cluster (N = 186). Panels: number of sites vs. distance from ATG (b.p.); number of ORFs]

[Figure: M14a and M14b motifs vs. their cluster (N = 74). Panels: number of sites vs. distance from ATG (b.p.); number of ORFs]

[Figure: Rap1 and M1a motifs vs. their cluster (N = 164). Panels: number of sites vs. distance from ATG (b.p.); number of ORFs]

Metrics of motif significance
[Flow diagram: Separate, Tag, Quantitate RNAs or interactions; Clustering; Previous Functional Assignments; Periodicity; Interaction; Motifs; Interaction partners. Metrics: Group specificity, Positional bias, Palindromicity, CompareACE]

Functional category enrichment odds (ref) (Wrong!)
N genes total; s1 = # genes in a cluster; s2 = # genes in a particular functional category (“success”); p = s2/N; N = s1 + s2 - x.
What are the odds of exactly x in that category in s1 trials?
- Binomial: sampling with replacement.
- Hypergeometric: sampling without replacement. Odds of getting exactly x = intersection of sets s1 & s2:

Functional category enrichment (hypergeometric probability distribution)
- N = total # of genes (or ORFs) in the genome; N = 6226 (S. cerevisiae)
- s1 = # genes in the cluster
- s2 = # genes found in a functional category
- x = # ORFs in the intersection of these groups
[Figure: Venn diagram of s1 and s2 with intersection x]
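With the symbols defined above, the hypergeometric probability of exactly x shared genes is C(s2, x) · C(N − s2, s1 − x) / C(N, s1), and an enrichment p-value sums this over x or more shared genes. A minimal sketch follows; the function names and the example counts (a 100-gene cluster, a 200-gene category, 20 shared) are made up for illustration.

```python
from math import comb

def hypergeom_pmf(x, N, s1, s2):
    """P(exactly x of the s1 cluster genes fall in a category of size s2)."""
    return comb(s2, x) * comb(N - s2, s1 - x) / comb(N, s1)

def enrichment_pvalue(x, N, s1, s2):
    """P(at least x shared genes) when sampling s1 genes without replacement."""
    return sum(hypergeom_pmf(i, N, s1, s2) for i in range(x, min(s1, s2) + 1))

N = 6226  # S. cerevisiae ORFs, as on the slide
print(hypergeom_pmf(20, N, s1=100, s2=200))
print(enrichment_pvalue(20, N, s1=100, s2=200))
```

The same calculation applies whenever two gene lists drawn from one genome are intersected; only the meaning of s1 and s2 changes.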

Group Specificity Score (S_group)
- N = total # of genes (ORFs) in the genome; N = 6226 (S. cerevisiae)
- s1 = # genes whose upstream sequences were used to align the motif (cluster)
- s2 = # genes in the target list (~100 genes in the genome with the best sites for the motif near their translational starts)
- x = # genes in the intersection of these groups
[Figure: Venn diagram of s1 and s2 with intersection x]

Positional Bias
- t = number of sites within 600 bp of translational start from among the best 200 being considered
- m = number of sites in the most enriched 50-bp window
- s = 600 bp
- w = 50 bp
[Figure: 600 bp upstream region (-600 bp to Start) with a 50 bp window]
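The formula for this score did not survive as text. One natural reading of the definitions is a binomial tail: if the t sites fell uniformly over the s = 600 bp region, each would land in a given w = 50 bp window with probability w/s, and the score is the chance of seeing m or more of them there. The sketch below implements that reading as an assumption; it is not a confirmed transcription of the slide, and the function name is hypothetical.

```python
from math import comb

def positional_bias(t, m, s=600, w=50):
    """Binomial tail P(X >= m) for X ~ Binomial(t, w/s), under the stated assumption."""
    p = w / s
    return sum(comb(t, i) * p ** i * (1 - p) ** (t - i) for i in range(m, t + 1))

print(positional_bias(t=40, m=15))  # made-up counts: 15 of 40 sites in one 50-bp window
```

This simple form does not correct for the window having been chosen as the most enriched one, so it is best used to rank motifs against each other rather than as an exact p-value.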

Comparisons of motifs
The CompareACE program finds the best alignment between two motifs and calculates the correlation between the two position-specific scoring matrices.
Similar motifs: CompareACE score > 0.7

Clustering motifs by similarity
Cluster motifs using a similarity matrix consisting of all pairwise CompareACE scores.
[Figure: motifs A, B, C, D -> CompareACE -> 4 × 4 similarity matrix (diagonal 1.0) -> hierarchical clustering -> cluster 1: A, B; cluster 2: C, D]

Palindromicity CompareACE score of a motif versus its reverse complement Palindromes: CompareACE > 0.7 Selected palindromicity values: Crp PurR ArgR CpxR
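To illustrate the idea, the sketch below correlates a motif's base-count matrix with that of its reverse complement. It is a simplification and not CompareACE: it compares the two matrices in a single fixed register instead of searching for the best alignment, and it uses raw counts rather than a scoring matrix. All function names and the example sites are assumptions.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(site):
    return "".join(COMPLEMENT[b] for b in reversed(site))

def count_matrix(sites):
    """Per-position counts of A, C, G, T over a list of equal-length sites."""
    return [[sum(s[j] == b for s in sites) for b in "ACGT"]
            for j in range(len(sites[0]))]

def matrix_correlation(m1, m2):
    """Pearson correlation between two count matrices of the same shape."""
    x = [v for col in m1 for v in col]
    y = [v for col in m2 for v in col]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def palindromicity(sites):
    """Correlation of a motif with its own reverse complement (no shift search)."""
    rc_sites = [reverse_complement(s) for s in sites]
    return matrix_correlation(count_matrix(sites), count_matrix(rc_sites))

print(palindromicity(["GAATTC", "GAATTC", "GTATAC"]))  # palindromic sites: close to 1
print(palindromicity(["AAAAAA", "AAAAAA", "AAAAAA"]))  # non-palindromic: well below 0.7
```

The same `matrix_correlation` function, applied to two different motifs, gives a rough stand-in for the pairwise similarity score used to cluster motifs.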

S. cerevisiae AlignACE test set

Most specific motifs (ranked by S group )

Most positionally biased motifs

Negative Controls
250 AlignACE runs on 50 groups each of 20, 40, 60, 80, and 100 ORFs, resulting in 3692 motifs.
Allows calibration of an expected false positive rate for a set of hypotheses resulting from any chosen cutoffs.
Example (MAP > 10.0, Spec. < 1e-5):
- Functional Categories: 82 motifs (24 known)
- Random Runs: 41 motifs
Computational identification of cis-regulatory elements associated with groups of functionally related genes in S. cerevisiae. Hughes, et al. JMB, 1999.

Positive Controls 29 transcription factors listed on the CSH web site have five or more known binding sites. AlignACE was run on the upstream regions of the corresponding genes. An appropriate motif was found in 21/29 cases. 5/8 false negatives were found in appropriate functional category AlignACE runs. False negative rate = ~ %

Establishing regulatory connections
Generalizing & reducing assumptions:
- Motif interactions (Pilpel et al. 2001, Nat Gen)
- Which protein(s): in vivo crosslinking
- Interdependence of columns in weight matrices: array binding (Bulyk et al. 2001, PNAS 98: 7158)

RNA2: Today's story & goals
- Clustering by gene and/or condition
- Distance and similarity measures
- Clustering & classification
- Applications
- DNA & RNA motif discovery & search