CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.


Homework problem 3.1: Count separately the number of character comparisons and the number of steps needed to find the next matching character using the bad character rule. Question: can you give an example?

Extended bad character rule

char   Positions in P
a      6, 3
b      7, 4
p      2
t      1
x      5

T: xpbctbxabpqqaabpqz
P:         tpabxab
               *^^
P:           tpabxab

On a mismatch at position i of P against text character T(k), find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align that occurrence with T(k).
In this iteration: # of comparisons = 3, table lookups = 2.
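The lookup on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference implementation: the function names `build_position_lists` and `extended_bad_char_shift` are made up here, positions are 1-based to match the table, and a "lookup" is counted as one entry read from a character's position list.

```python
from collections import defaultdict


def build_position_lists(pattern):
    """Map each character to its 1-based positions in pattern, rightmost first."""
    positions = defaultdict(list)
    for idx in range(len(pattern), 0, -1):
        positions[pattern[idx - 1]].append(idx)
    return positions


def extended_bad_char_shift(positions, mismatch_char, i):
    """Return (shift, lookups) for a mismatch at 1-based pattern position i.

    mismatch_char is the text character that failed to match P[i].
    """
    lookups = 0
    for pos in positions.get(mismatch_char, []):
        lookups += 1
        if pos < i:
            # Closest occurrence to the left of i: align it with the text character.
            return i - pos, lookups
    # No occurrence to the left of i: move the pattern past the mismatch.
    return i, lookups


positions = build_position_lists("tpabxab")
# Slide example: text character 'a' mismatches at position 5 of P.
shift, lookups = extended_bad_char_shift(positions, "a", 5)
print(shift, lookups)
```

For the slide's pattern, the list for `a` is read twice (6, then 3) before a position left of 5 is found, which is where the two table lookups come from.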

Results for some real genes
–llr = 394, E-value = 2.0e-023
–llr = 347, E-value = 9.8e-002
–llr = 110, E-value = 1.4e+004

Strategies to improve results
Combining results from different algorithms is usually helpful
–Motifs that appear multiple times are probably more interesting
–Except simple repeats like AAAAA or ATATATATA
–Cluster motifs into groups according to their similarities

Strategies to improve results Compare with known motifs in database –TRANSFAC –JASPAR Issues: –How to determine similarity between motifs? Alignment between matrices –How similar is similar? Empirically determine some threshold

Strategies to improve results
Statistical test of significance
–Enrichment in target sequences vs background sequences
–Target set T: assumed to contain a common motif, P
–Background set B: assumed not to contain P, or to contain it at very low frequency
–Ideal case: every sequence in T has P, no sequence in B has P

Statistical test for significance
–Target set T: size = N; P appears in n sequences
–Background set + target set (B + T): size = M; P appears in m sequences
Intuitively: if n / N >> m / M
–P is enriched (over-represented) in T
–Statistical significance?

Hypergeometric distribution
A box with M balls, of which N are red and the rest are blue.
–Red ball: target sequences
–Blue ball: background sequences
If we randomly draw m balls from the box, what’s the probability we’ll see n red balls?
–If the probability is very small, we are probably not drawing randomly
Total # of choices: (M choose m)
# of choices to have n red balls: (N choose n) x (M-N choose m-n)

Cumulative hypergeometric test for motif significance We are interested in: if we randomly pick m balls, how likely that we’ll see at least n red balls? This can be interpreted as the p-value for the null hypothesis that we are randomly picking. Alternative hypothesis: our selection favors red balls. Equivalent: the target set T is enriched with motif P. Or: P is over-represented in T.
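Written out with the symbols of the ball-drawing setup (M balls, N of them red, m drawn, n red among the draws), the point probability and the cumulative p-value take the following form. X is a label introduced here for the number of red balls drawn; it does not appear on the slides.

```latex
P(X = n) = \frac{\binom{N}{n}\binom{M-N}{m-n}}{\binom{M}{m}},
\qquad
P(X \ge n) = \sum_{k=n}^{\min(N,\,m)} \frac{\binom{N}{k}\binom{M-N}{m-k}}{\binom{M}{m}}
```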

Examples
Yeast genome has 6000 genes
Select 50 genes believed to be co-regulated by a common TF
Found a motif for these 50 genes
It appeared in 20 out of these 50 genes
In the whole genome, 100 genes have this motif
M = 6000, N = 50, m = 120, n = 20
Intuition:
–m/M = 120/6000. In the genome, 1 out of 50 genes has the motif
–N = 50, so we would expect only 1 gene in the target set to have the motif
–20-fold enrichment
P-value = 6 x
n = 5. 5-fold enrichment. P-value =
Normally a very low p-value is needed, e.g
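The p-values for the yeast example above can be computed directly from the counting argument. The sketch below is one possible way to do it with exact integer binomials from the standard library; the function name `hypergeom_tail` is an assumption made for this illustration.

```python
from math import comb


def hypergeom_tail(M, N, m, n):
    """P(at least n red) when m balls are drawn from M balls, N of which are red."""
    total = comb(M, m)
    upper = min(N, m)
    # Sum the counts for exactly k red balls, k = n .. upper.
    favorable = sum(comb(N, k) * comb(M - N, m - k) for k in range(n, upper + 1))
    return favorable / total


# Numbers from the yeast example: M = 6000, N = 50, m = 120.
print(hypergeom_tail(6000, 50, 120, 20))  # n = 20, 20-fold enrichment
print(hypergeom_tail(6000, 50, 120, 5))   # n = 5, 5-fold enrichment
```

Working with exact integers and dividing once at the end avoids the underflow that multiplying many small floating-point probabilities could cause.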

ROC curve for motif significance Motif is usually a PWM Any word will have a score –Typical scoring function: Log P(W | M) / P(W | B) –W: a word. –M: a PWM. –B: background model To determine whether a sequence contains a motif, a cutoff has to be decided –With different cutoffs, you get different number of genes with the motif –Hyper-geometric test first assumes a cutoff –It may be better to look at a range of cutoffs
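The scoring function log P(W | M) / P(W | B) can be sketched as follows. The PWM, the uniform background and the function names are toy assumptions for illustration only; the sketch also assumes every PWM entry is nonzero (for example after adding pseudocounts) and that the sequence is at least as long as the motif.

```python
import math


def log_odds_score(word, pwm, background):
    """Log P(W | M) / P(W | B) for a word with the same length as the PWM."""
    score = 0.0
    for column, base in zip(pwm, word):
        score += math.log(column[base] / background[base])
    return score


def best_score(sequence, pwm, background):
    """Highest log-odds score over all windows of the sequence."""
    width = len(pwm)
    return max(
        log_odds_score(sequence[i:i + width], pwm, background)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical 3-column PWM (one dict of base probabilities per position).
toy_pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
uniform_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

print(best_score("TTACGTT", toy_pwm, uniform_background))
```

A sequence is then said to contain the motif when its best window score passes the chosen cutoff, which is why the choice of cutoff changes the gene counts.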

ROC curve for motif significance
Given a score cutoff:
–Target set T: size = N; P appeared in n sequences
–Background set + target set (B + T): size = M; P appeared in m sequences
With a different score cutoff, m and n will differ
Assume you want to use P to classify T and B
–Sensitivity: n / N
–Specificity: (M-N-m+n) / (M-N)
–False Positive Rate = 1 – specificity: (m – n) / (M-N)
With decreasing cutoff, both sensitivity and FPR increase
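As a rough sketch of how these quantities follow from a cutoff, the function below counts n and m from per-sequence motif scores and applies the two formulas. The score lists are made-up numbers and `sensitivity_and_fpr` is a name chosen for this illustration; each score is assumed to be the best motif score found in one sequence.

```python
def sensitivity_and_fpr(target_scores, background_scores, cutoff):
    """Return (sensitivity, false positive rate) at a given score cutoff."""
    N = len(target_scores)                 # size of target set T
    M = N + len(background_scores)         # size of B + T
    n = sum(score >= cutoff for score in target_scores)
    m = n + sum(score >= cutoff for score in background_scores)
    sensitivity = n / N
    false_positive_rate = (m - n) / (M - N)
    return sensitivity, false_positive_rate


# Hypothetical best-window scores, one per sequence.
toy_target = [5.0, 4.0, 1.0, 3.5]
toy_background = [0.5, 3.8, 1.2, 0.1, 2.0]

print(sensitivity_and_fpr(toy_target, toy_background, 3.0))
print(sensitivity_and_fpr(toy_target, toy_background, 1.0))
```

Lowering the cutoff from 3.0 to 1.0 lets more sequences from both sets pass, so both returned values go up.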

ROC curve for motif significance
ROC-AUC: area under the curve. 1: perfect separation. 0.5: random.
[Figure: ROC curves plotting sensitivity against 1-specificity for Motif 1, Motif 2 and Random, with a good cutoff marked.]
Motif 1 is better than motif 2.
Highest cutoff: no motif can pass the cutoff. Sensitivity = 0, specificity = 1.
Lowest cutoff: every sequence has the motif. Sensitivity = 1, specificity = 0.
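One simple way to obtain the curve and its area is to sweep the cutoff over every observed score and integrate with the trapezoid rule. The sketch below assumes one score per sequence; the function names `roc_points` and `auc` and the toy scores are illustrative choices, not part of the lecture.

```python
def roc_points(target_scores, background_scores):
    """(FPR, sensitivity) pairs from the highest cutoff down to the lowest."""
    cutoffs = sorted(set(target_scores) | set(background_scores), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every score: nothing passes
    for cutoff in cutoffs:
        sensitivity = sum(s >= cutoff for s in target_scores) / len(target_scores)
        fpr = sum(s >= cutoff for s in background_scores) / len(background_scores)
        points.append((fpr, sensitivity))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores: targets all score above backgrounds, so separation is perfect.
print(auc(roc_points([3.0, 4.0], [1.0, 2.0])))
```

Comparing two motifs then amounts to computing this area for each on the same target and background sets.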

Other strategies
Cross-validation
–Randomly divide sequences into 10 sets, hold 1 set for test.
–Do motif finding on 9 sets. Does the motif also appear in the testing set?
Phylogenetic conservation information
–Does a motif also appear in the homologous genes of another species?
–Strongest evidence
–However, will not be able to find species-specific ones

Other strategies
Finding motif modules
–Will two motifs always appear in the same gene?
Location preference
–Some motifs appear to be in certain locations
–E.g., within bp upstream to transcription start
–If a detected motif has strong positional bias, may be a sign of its function
Evidence from other types of data sources
–Do the genes having the motif always have similar activities (gene expression levels) across different conditions?
–Interact with the same set of proteins?
–Similar functions?
–etc.

To search for new instances
Usually many false positives
Score cutoff is critical
Can estimate a score cutoff from the “true” binding sites
–Motif finding -> scoring function -> a set of scores for the “true” sites
–Take mean - std as a cutoff (or a cutoff such that the majority of “true” sites can be predicted)
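The mean - std rule is a one-liner once the scores of the known sites are available. The sketch below assumes the population standard deviation; the slide does not say which variant is meant, and both the function name and the scores are hypothetical.

```python
from statistics import mean, pstdev


def cutoff_from_true_sites(true_site_scores):
    """Score cutoff set one standard deviation below the mean of known sites."""
    # pstdev is the population standard deviation (an assumption here);
    # statistics.stdev would give the sample version instead.
    return mean(true_site_scores) - pstdev(true_site_scores)


# Hypothetical log-odds scores of experimentally confirmed sites.
known_site_scores = [6.2, 5.8, 7.1, 4.9, 6.5]
cutoff = cutoff_from_true_sites(known_site_scores)
passing = [s for s in known_site_scores if s >= cutoff]
print(cutoff, len(passing))
```

Checking how many known sites pass the resulting cutoff is a quick sanity test of the alternative criterion on the slide, that the majority of "true" sites should still be predicted.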

To search for new instances Use other information, such as positional biases of motifs to restrict the regions that a motif may appear Use gene expression data to help: the genes having the true motif should have similar activities Phylogenetic conservation is the key

Final project
Write a review paper on a topic that we didn’t cover in lectures
Or
–Implement an algorithm and do some experiments
–Compare several algorithms (existing implementation ok)
–Combine several algorithms to form a pipeline (e.g. gene expression + motif analysis)
Final:
–5-10 page report (single space, single column, 12pt) + 15-minute presentation

Possible topics for term paper
–Haplotype inference
–Computational challenges associated with new microarray technologies
–Phylogenetic footprinting
–Small RNA gene / target prediction (siRNA, miRNA, …)
–Biomedical text mining
–Protein structure prediction
–Topology of biological networks

An example project
Given gene expression data (say cell cycle)
–Cluster genes using k-means
–Find motifs using several algorithms
–(Cluster and combine similar motifs)
–Rank motifs according to their specificity to the target sequences compared with the other clusters
–Get their logos
–Use the sequences to search the whole genome for more genes with the motif
–Do they have any functional significance?