Guiding motif discovery by iterative pattern refinement
Zhiping Wang
Advisors: Sun Kim, Mehmet Dalkilic
School of Informatics, Indiana University



Outline
Introduction and motivation
Our framework for motif discovery
  1. Pattern finding
  2. Build seed motif
  3. Extract subsequences
  4. Find motif
  5. Iterative refinement
Performance of our framework
Discussion and future work

Introduction – motifs and their applications
Protein motifs are short patterns conserved across proteins. They are generally important for a protein's function or for maintaining its structure:
1. Enzyme catalytic sites.
2. Regions involved in binding a molecule (ADP/ATP, DNA, …) or another protein.
3. A fold important for the overall 3D structure.
Such patterns can be used to distinguish protein groups and to classify a newly sequenced protein into a specific protein family.

Introduction - motif discovery
PROSITE: patterns are found manually.
Deterministic algorithms, based on expectation maximization:
1. MEME (time-consuming)
Stochastic algorithms (Gibbs sampling), which make random jumps in the search space:
1. Gibbs Sampler
2. AlignACE

Motivation
Motif discovery is, in a sense, the search for signal against noise.
The noise model depends largely on the input sequences (see previous capstones).
Our goal is to use “subsequences” to guide motif discovery.
We use an iterative pattern refinement procedure to improve the performance of motif discovery.


Test Data Preparation
1. Download the PROSITE pattern and sequence databases.
2. Parse all positive sequences for each PROSITE ID and store them as a PROSITE family.
3. All sequences of one family contain the same PROSITE pattern.
4. We used PROSITE families for motif discovery.

Framework Overview
1. Find patterns in a PROSITE family.
2. Build seed motifs from the patterns.
3. Select subsequences based on the seed motifs.
4. Run a motif-finding program (MEME) on the subsequences.
5. Search for the motifs over the entire family using MAST.
6. Select subsequences around the motif regions.
7. Go to step 4 until the final motif is stable.
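The loop formed by steps 4 to 7 can be sketched as follows. This is only an illustration of the control flow, not the authors' program: `refine_motif` and its three callables (`find_motif`, `search_family`, `extract_regions`) are hypothetical names that stand in for a MEME run, a MAST search, and the subsequence selection, and "stable" is assumed to mean that two consecutive rounds return the same motif.

```python
def refine_motif(family, subsequences, find_motif, search_family,
                 extract_regions, max_rounds=20):
    """Iterate motif finding until the motif stops changing.

    All three callables are hypothetical stand-ins:
      find_motif(subsequences)        -> motif        (step 4, e.g. a MEME run)
      search_family(motif, family)    -> hits         (step 5, e.g. a MAST search)
      extract_regions(family, hits)   -> subsequences (step 6)
    """
    previous = None
    for _ in range(max_rounds):
        motif = find_motif(subsequences)               # step 4
        if motif == previous:                          # step 7: assumed stability test
            return motif
        hits = search_family(motif, family)            # step 5
        subsequences = extract_regions(family, hits)   # step 6
        previous = motif
    # Assumption: give up after max_rounds and return the last motif seen.
    return previous
```

Passing the tools in as callables keeps the sketch independent of how MEME and MAST are actually invoked; the `max_rounds` cap is an added safeguard against a motif that never stabilizes.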


Pattern Finding - thresholds
For each PROSITE family, we first find conserved patterns. A pattern qualifies if it passes three thresholds:
1. Pattern length.
2. Log-odds score of a first-order Markov model against a random model.
3. Support: the number of different sequences in which the pattern occurs.
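One possible reading of the second and third thresholds is sketched below. The slides do not define the two models, so this assumes that both are estimated from the family itself: the random model uses independent residue frequencies, and the first-order Markov model uses residue-to-residue transition frequencies. The function names `support` and `log_odds` are illustrative.

```python
import math
from collections import Counter


def support(pattern, sequences):
    """Number of different sequences that contain the pattern."""
    return sum(1 for seq in sequences if pattern in seq)


def log_odds(pattern, sequences):
    """Log-odds of the pattern: first-order Markov model vs. random model.

    Assumption: both models are estimated from `sequences`, and the pattern
    occurs in them, so every residue and residue pair has a nonzero count.
    """
    singles = Counter()   # residue counts
    pairs = Counter()     # adjacent residue-pair counts
    for seq in sequences:
        singles.update(seq)
        pairs.update(zip(seq, seq[1:]))
    total = sum(singles.values())

    # How often each residue is followed by any other residue.
    followed = Counter()
    for (first, _second), count in pairs.items():
        followed[first] += count

    # Random model: residues are independent.
    random_lp = sum(math.log(singles[r] / total) for r in pattern)

    # Markov model: first residue by frequency, the rest by transition.
    markov_lp = math.log(singles[pattern[0]] / total)
    for a, b in zip(pattern, pattern[1:]):
        markov_lp += math.log(pairs[(a, b)] / followed[a])

    return markov_lp - random_lp
```

A positive score means the residues of the pattern follow each other more often than their individual frequencies would suggest.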

Pattern Finding - algorithm
1. Use the thresholds to scan the sequences in one family and find the qualified patterns in each sequence.
2. Rank the sequences by the number of qualified patterns each one contains.
3. Output the qualified patterns found in the top half of the sequences.
4. Repeat the algorithm (go to step 1) on the remaining half of the sequences until no more patterns can be found.
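The four steps above can be sketched as a loop that halves the set of sequences in each round. This is a sketch under assumptions: `qualified_patterns` is a hypothetical callable that applies the three thresholds to one sequence given the current pool, and an odd-sized pool is assumed to round the top half up.

```python
def find_patterns(sequences, qualified_patterns):
    """Collect qualified patterns by repeatedly processing the top half.

    qualified_patterns(seq, pool) is a hypothetical callable returning the
    set of patterns in `seq` that pass the thresholds within `pool`.
    """
    found = set()
    remaining = list(sequences)
    while remaining:
        # Step 1: qualified patterns of every remaining sequence.
        scanned = [(qualified_patterns(seq, remaining), seq) for seq in remaining]
        if not any(patterns for patterns, _ in scanned):
            break  # no more patterns can be found
        # Step 2: rank sequences by their number of qualified patterns.
        scanned.sort(key=lambda item: len(item[0]), reverse=True)
        # Step 3: output the patterns of the top half (rounded up).
        half = (len(scanned) + 1) // 2
        for patterns, _ in scanned[:half]:
            found |= patterns
        # Step 4: repeat on the remaining half.
        remaining = [seq for _, seq in scanned[half:]]
    return found
```

Because at least one sequence leaves the pool in every round, the loop always terminates.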

Pattern Finding - example Qualified Patterns (p1, p2, p3)


Build Seed Motif
1. Start from the pattern with maximal support and use it as the seed motif.
2. Calculate the score of each candidate pattern (in sequences not covered by the seed motif) against the seed motif:
   $S_i = \sum_{j=1}^{n} S_{i\text{-}j} W_j$
   $S_i$: score of candidate pattern $i$ against the seed motif
   $S_{i\text{-}j}$: score of candidate pattern $i$ against the $j$-th pattern in the seed motif
   $W_j$: the weight (support ratio) of the $j$-th pattern in the seed motif
3. Add the pattern with the highest score (which must also exceed a score threshold) to the seed motif.
4. Go to step 2 until no more patterns can be added to the seed motif.

Build Seed Motif - example
Calculate pattern scores (threshold = 5); suppose the patterns share no sequences.

Pattern  Sequence  Support  Weight     Score to motif
P1       CLG       4        $W_1 = 1$  -
P2       CLN       2        -          13
P3       ALG       2        -          10
P4       ALN       2        -          4

P1  C L G
P2  C L N
$S_{2\text{-}1} = 13$; $S_2 = S_{2\text{-}1} W_1 = 13$

Build Seed Motif - example
Calculate pattern scores (threshold = 5); suppose the patterns share no sequences.

Pattern  Sequence  Support  Weight               Score to motif
P1       CLG       4        $W_1 = 4 / (4 + 2)$  -
P2       CLN       2        $W_2 = 2 / (4 + 2)$  -
P3       ALG       2        -                    8
P4       ALN       2        -                    6

$S_{3\text{-}1} = 10$, $S_{3\text{-}2} = 4$: $S_3 = S_{3\text{-}1} W_1 + S_{3\text{-}2} W_2 = 8 > 5$
$S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$: $S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 = 6 > 5$

Build Seed Motif - example
Calculate pattern scores (threshold = 5); suppose the patterns share no sequences.

Pattern  Sequence  Support  Weight         Score to motif
P1       CLG       4        $W_1 = 4 / 8$  -
P2       CLN       2        $W_2 = 2 / 8$  -
P3       ALG       2        $W_3 = 2 / 8$  -
P4       ALN       2        -              6.5

$S_{4\text{-}1} = 4$, $S_{4\text{-}2} = 10$, $S_{4\text{-}3} = 8$
$S_4 = S_{4\text{-}1} W_1 + S_{4\text{-}2} W_2 + S_{4\text{-}3} W_3 = 2 + 2.5 + 2 = 6.5 > 5$
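The greedy seed-motif construction can be sketched in a few lines. The sketch assumes, as the example does, that the patterns share no sequences, so every pattern outside the seed is a candidate and a weight is simply a pattern's support divided by the total support of the seed. The function name `build_seed_motif` and the lookup table `pair_score` are illustrative; the pairwise scores and supports are the ones given in the example.

```python
def build_seed_motif(support, pair_score, threshold):
    """Greedily grow a seed motif from the pattern with maximal support.

    support:    dict pattern name -> support value
    pair_score: dict frozenset({a, b}) -> score between patterns a and b
    Assumption: patterns share no sequences, so all remaining patterns
    are candidates in every round.
    """
    seed = [max(support, key=support.get)]            # step 1
    candidates = set(support) - set(seed)
    while candidates:
        total = sum(support[p] for p in seed)

        def score(candidate):
            # S_i = sum over seed patterns j of S_{i-j} * W_j  (step 2)
            return sum(pair_score[frozenset((candidate, p))] * support[p] / total
                       for p in seed)

        best = max(sorted(candidates), key=score)
        if score(best) <= threshold:                  # step 3: must exceed threshold
            break
        seed.append(best)
        candidates.remove(best)                       # step 4: repeat
    return seed


# Values from the example: supports and pairwise pattern scores.
example_support = {"P1": 4, "P2": 2, "P3": 2, "P4": 2}
example_scores = {
    frozenset(("P1", "P2")): 13, frozenset(("P1", "P3")): 10,
    frozenset(("P1", "P4")): 4,  frozenset(("P2", "P3")): 4,
    frozenset(("P2", "P4")): 10, frozenset(("P3", "P4")): 8,
}
```

Calling `build_seed_motif(example_support, example_scores, 5)` adds the patterns in the order P1, P2, P3, P4, which is the order followed in the example.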

Build Seed Motif


Extract Subsequences

Find Motif MEME

Iterative refinement
[Flowchart: motif1, motif2, motif3 -> MAST over the entire PROSITE family -> sub1, sub2, sub3 -> MEME -> motif1’, motif2’, motif3’ -> Stable? no: repeat; yes: choose the best motif]


Experiment
1. We randomly chose 17 PROSITE families as the test data set.
2. We ran MEME directly on these families and took the best motif for each.
3. We ran our framework and took the best motif for each.
4. We compared the results.

PROSITE Patterns
PS00010  C-x-[DN]-x(4)-[FY]-x-C-x-C.
PS00011  x(12)-E-x(3)-E-x-C-x(6)-[DEN]-x-[LIVMFY]-x(9)-[FYW].
PS00014  [KRHQSA]-[DENQ]-E-L>.
PS00018  D-x-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW].
PS00020  [LIVM]-x-[SGN]-[LIVM]-[DAGHE]-[SAG]-x-[DNEAG]-[LIVM]-x-[DEAG]-x(4)-[LIVM]-x-[LM]-[SAG]-[LIVM]-[LIVMT]-W-x-[LIVM](2).
PS00099  [AG]-[LIVMA]-[STAGCLIVM]-[STAG]-[LIVMA]-C-x-[AG]-x-[AG]-x-[AG]-x-[SAG].
PS00342  [STAGCN]-[RKH]-[LIVMAFY]>.
PS00343  L-P-x-T-G-[STGAVDE].
PS00409  [KRHEQSTAG]-G-[FYLIVM]-[ST]-[LT]-[LIVP]-E-[LIVMFWSTAG](14).
PS00881  [DNEG]-x-[LIVFA]-[LIVMY]-[LVAST]-H-N-[STC].
PS01286  P-x(8,10)-[LM]-R-x-[GE]-[LIVP]-x-G-C.
PS00012  [DEQGSTALMKRH]-[LIVMFYSTAC]-[GNQ]-[LIVMFYAG]-[DNEKHS]-S-[LIVMST]-{PCFY}-[STAGCPQLIVMF]-[LIVMATN]-[DENQGTAKRHLM]-[LIVMWSTA]-[LIVGSTACR]-x(2)-[LIVMFA].
PS00019  [EQ]-x(2)-[ATV]-[FY]-x(2)-W-x-N.
PS00660  W-[LIV]-x(3)-[KRQ]-x-[LIVM]-x(2)-[QH]-x(0,2)-[LIVMF]-x(6,8)-[LIVMF]-x(3,5)-F-[FY]-x(2)-[DENS].
PS00661  [HYW]-x(9)-[DENQSTV]-[SA]-x(3)-[FY]-[LIVM]-x(2)-[ACV]-x(2)-[LM]-x(2)-[FY]-G-x-[DENQST]-[LIVMFYS].
PS00889  [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
PS01177  [CSH]-C-x(2)-[GAP]-x(7,8)-[GASTDEQR]-C-[GASTDEQL]-x(3,9)-[GASTDEQN]-x(2)-[CE]-x(6,7)-C-C.
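To check which sequences contain one of these patterns, PROSITE syntax can be translated into a regular expression. The helper below, `prosite_to_regex`, is an illustrative sketch and not part of the presented framework. It covers only the syntax used in the patterns listed here: `x` for any residue, `[...]` for allowed residues, `{...}` for forbidden residues, `(n)` and `(n,m)` for repetition, and `<` or `>` as anchors outside brackets.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern string into a Python regular expression.

    Sketch only: anchors inside brackets (e.g. [G>]) are not handled.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.strip()
        prefix = "^" if element.startswith("<") else ""   # N-terminal anchor
        suffix = "$" if element.endswith(">") else ""     # C-terminal anchor
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                                    # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"                # forbidden residues
        if repeat:
            core += "{" + repeat + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)
```

For example, `prosite_to_regex("L-P-x-T-G-[STGAVDE].")` returns `LP.TG[STGAVDE]`, which can be passed to `re.search` together with a protein sequence.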

Performance
The result of the comparison.
Families: PS00010 PS00011 PS00018 PS00020 PS00409 PS00881 PS00012 PS01286 PS00099 PS00019 PS00660 PS00014 PS00342 PS00343 PS00661 PS00889 PS01177
MEME: ××
Framework: ××


Discussion
One flaw: local optima.
PS01286 is the only family on which our framework performs worse.
  PROSITE pattern: P-x(8,10)-[LM]-R-x-[GE]-[LIVP]-x-G-C
  MEME: [TNS] W [HE] [GN] [RG] I [AGS] [LM] R [LV] E [LV] [YLF] G C
  Our framework:
    1. [EP] W x(4) L G x L [KM] x [VI] T [GA] [VI] [IA] T Q G
    2. x(4)-P-x(8)-[LM]-R-x-E-[LV]-x-G-C

Future Work
Design our own motif discovery algorithm.
Convert the framework into a complete program.
Test the performance of our program on more PROSITE patterns.

Acknowledgement
Prof. Sun Kim
Prof. Mehmet Dalkilic (Memo)
Arvind Gopu
Scott Martin