I.U. School of Informatics Motif Discovery from Large Number of Sequences: A Case Study with Disease Resistance Genes in Arabidopsis thaliana by Irfan.

Presentation transcript:

I.U. School of Informatics Motif Discovery from Large Number of Sequences: A Case Study with Disease Resistance Genes in Arabidopsis thaliana by Irfan Gunduz 04/25/04 Capstone Presentation

INTRODUCTION: Motifs
Motifs are highly conserved regions across a subset of proteins that share the same function.
Motifs can be used to predict:
- A molecule's function
- A structural feature
- Family membership
Example: conserved segments from four sequences (Seq A, Seq B, Seq C, Seq D): YNEDSKH, YDDDSNH, YDNDSNH, YENDSKH
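As a rough illustration of how the four example segments on this slide can be summarized as a motif, the sketch below collapses each aligned column into a PROSITE-style pattern element. The function name `column_pattern` is hypothetical, and the sketch assumes the segments are already aligned, gap-free, and of equal length; it is not the method used in the presentation.

```python
def column_pattern(segments):
    """Build a PROSITE-style pattern from equal-length, gap-free aligned segments.

    Illustrative helper only (hypothetical name, not from the slides).
    """
    if not segments:
        raise ValueError("need at least one segment")
    length = len(segments[0])
    if any(len(s) != length for s in segments):
        raise ValueError("segments must have equal length")

    parts = []
    for column in zip(*segments):
        residues = sorted(set(column))
        if len(residues) == 1:
            # Fully conserved column: emit the residue itself.
            parts.append(residues[0])
        else:
            # Variable column: emit the set of observed residues.
            parts.append("[" + "".join(residues) + "]")
    return "-".join(parts)


# The four example segments shown on the slide.
segments = ["YNEDSKH", "YDDDSNH", "YDNDSNH", "YENDSKH"]
pattern = column_pattern(segments)
print(pattern)  # Y-[DEN]-[DEN]-D-S-[KN]-H
```

Four of the seven columns are fully conserved (Y, D, S, H), which is what makes the segment recognizable as a motif.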

INTRODUCTION
Current motif-finding software: MEME, PROSITE, PRATT, etc.
Do they work with a large number of sequences?
- Pattern discovery relies on statistical or combinatorial techniques that look for signals.
- The signal becomes harder to separate from the noise as the number of sequences increases.
What to do?
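One simple way to see why noise grows with the size of the input: the expected number of purely chance occurrences of a fixed short pattern scales linearly with the number of sequences. The sketch below assumes a uniform, independent background over 20 amino acids and a hypothetical sequence length of 500 residues; neither assumption comes from the slides, and the function name `expected_chance_hits` is illustrative.

```python
def expected_chance_hits(num_sequences, seq_length, pattern_length, alphabet_size=20):
    """Expected chance occurrences of one fixed pattern in random sequences.

    Assumes residues are independent and uniformly distributed
    (a simplifying assumption, not a model from the presentation).
    """
    # Number of start positions where the pattern could occur in one sequence.
    positions = max(seq_length - pattern_length + 1, 0)
    # Probability of a match at one position is (1 / alphabet_size) ** pattern_length.
    return num_sequences * positions / alphabet_size ** pattern_length


# Hypothetical setting: 4-residue pattern, sequences of 500 residues.
for n in (10, 100, 1000):
    print(n, expected_chance_hits(n, seq_length=500, pattern_length=4))
```

Under these assumptions a 4-residue pattern is expected well under once by chance in 10 sequences but several times in 1000, so short true motifs are increasingly hard to distinguish from background.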

Objective
- Develop a computational procedure to find functional motifs from a large number of sequences.

COMPUTATIONAL PROCEDURE: Tools
- BLAST (sequence alignment tool)
- BAG (sequence clustering package)
- CLUSTAL W (multiple sequence alignment)
- HMMERII (HMM-based software)
- BLOCK MAKER (block/motif finder)
- LAMA (block comparison tool)
- PERL

COMPUTATIONAL PROCEDURE
1 - Collecting and Clustering Sequences
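The transcript gives no detail on how the clustering step works internally. As a generic illustration only, the sketch below groups sequences into clusters by treating pairwise similarity scores above a threshold as edges and taking connected components (single-linkage). This is a simple stand-in, not BAG's algorithm; the function name, sequence IDs, scores, and threshold are all hypothetical.

```python
def cluster_by_similarity(ids, scored_pairs, threshold):
    """Group ids into connected components of the 'score >= threshold' graph.

    Single-linkage stand-in for a sequence clustering step (hypothetical,
    not the BAG package's method).
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, score in scored_pairs:
        if score >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise scores, e.g. from an all-against-all alignment run.
ids = ["s1", "s2", "s3", "s4", "s5"]
pairs = [("s1", "s2", 250.0), ("s2", "s3", 180.0),
         ("s3", "s4", 40.0), ("s4", "s5", 300.0)]
clusters = cluster_by_similarity(ids, pairs, threshold=100.0)
print(clusters)  # [['s1', 's2', 's3'], ['s4', 's5']]
```

Raising the threshold splits clusters further and lowering it merges them, which is the kind of knob a later refinement step would need to tune.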

COMPUTATIONAL PROCEDURE
2 - Enrichment

COMPUTATIONAL PROCEDURE
3 - Refinement
4 - Motif Finding

A Case Study with Disease Resistance Genes in Arabidopsis thaliana

Why Disease Resistance Genes?

Background, Disease Resistance Genes
Domain | Probable Function
TIR |
CC |
KIN |
LRR | Recognition of specificity
NB | ATP and GTP binding

Case Study, Arabidopsis thaliana
- 116 sequences annotated as disease resistance protein or disease resistance protein-like were extracted from the Arabidopsis thaliana genome.
- The sequences were clustered into 32 groups.
- After refinement, four clusters were formed for further analysis.
# of Sequences Cluster 196 Cluster 245 Cluster 3641 Cluster
to 640 sequences were added in each cluster after HMM iterations

Case Study, Arabidopsis thaliana: PFAM Search
Cluster | Domains
Cluster 1 | NB-ARC, TIR, Kin, LRR
Cluster 2 | NB-ARC, Kin, LRR
Cluster 3 | Ser/Thr Kin
Cluster 4 | Kin

Case Study, Arabidopsis thaliana: Results, Block Maker
[Figure: Block Maker results for Cluster1 and Cluster2]
Example block segments:
YDVFLSFRGVDTRQTIVSHL
YDVFLSFRGEDTRKNIVSHL
YDVFLSFRGEDTRKTIVSHL

Case Study, Arabidopsis thaliana: Results, LAMA and BAG
[Figure: clusters at the whole gene level and clusters at the block level; labels: Cluster1, Cluster2, Cluster3]

Case Study, Arabidopsis thaliana
[Figure: clusters at the whole gene level and clusters at the block level; cluster labels: Cluster1, Cluster2, Cluster3; block labels: TIR-I, TIR-II, Kin1a, Kin2, NBS-A, NBS-B, NBS-C, GLPL, LRR; gene labels: RPP8, RPM1, RPS4, RPP1, RPP5]

Case Study, Arabidopsis thaliana: Number of Disease Resistance Gene Candidates on Each Chromosome
[Table: candidate counts per cluster; columns: CHR-1, CHR-II, CHR-III, CHR-IV, CHR-V]

Case Study, Arabidopsis thaliana: New Disease Resistance Gene Candidates
Cluster 1: GI GI GI
Cluster 2: GI GI GI GI

Case Study, Arabidopsis thaliana
To test the effectiveness of the computational procedure:
- 792 unique sequences were merged and submitted to MEME and PRATT to detect functional motifs.
- Time: more than 9000 minutes (about 150 hours) on a 1.7 GHz Pentium IV machine running Linux.
- Result: no known disease resistance gene motifs were detected.

Case Study, Arabidopsis thaliana
CONCLUSIONS:
- A sensible combination of tools provides an excellent mechanism for motif detection.
- Clustering helps improve the performance of other well-known tools.

ACKNOWLEDGEMENT
"Motif Discovery from Large Number of Sequences: A Case Study with Disease Resistance Genes in Arabidopsis thaliana" by Irfan Gunduz, Sihui Zhao, Mehmet Dalkilic and Sun Kim will be presented at The 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences.

Case Study, Arabidopsis thaliana

Disease Resistance Mechanism

COMPUTATIONAL PROCEDURE: Refinement
[Figure: refinement diagram with elements labeled A, B, C, D]