Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki.


Finding regulatory modules from local alignment - Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki Erice 30 Nov 2005

Pairwise alignment of strings
A: S T O C K H O L M
B: T U K H O L M
Distance = the minimum number of 'mutation' steps: a -> b (substitution), a -> ε (deletion), ε -> b (insertion), …

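Dynamic programming
$$d_{i,j} = \min\bigl(\text{if } a_i = b_j \text{ then } d_{i-1,j-1} \text{ else } \infty,\; d_{i-1,j} + 1,\; d_{i,j-1} + 1\bigr)$$
= distance between the i-prefix of A and the j-prefix of B (without substitutions)
[Figure: m×n table d indexed by A and B; cell d_{i,j} is computed from d_{i-1,j-1}, d_{i-1,j} and d_{i,j-1}, with cost +1 for the horizontal and vertical steps; the last cell is d_{m,n}]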

$$d_{i,j} = \min\bigl(\text{if } a_i = b_j \text{ then } d_{i-1,j-1} \text{ else } \infty,\; d_{i-1,j} + 1,\; d_{i,j-1} + 1\bigr)$$
$d_{ID}(A,B) = d_{m,n}$; optimal alignment by trace-back
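The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `indel_distance_table` and `traceback` are made up here, and the diagonal step is allowed only on a match, so the result is the insertion/deletion distance (no substitutions).

```python
import math


def indel_distance_table(a, b):
    """Fill the (m+1) x (n+1) table d for strings a and b."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete the whole i-prefix of a
    for j in range(1, n + 1):
        d[0][j] = j  # insert the whole j-prefix of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Diagonal step is free on a match and forbidden otherwise.
            diag = d[i - 1][j - 1] if a[i - 1] == b[j - 1] else math.inf
            d[i][j] = min(diag, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d


def traceback(a, b, d):
    """Recover one optimal alignment by walking back from d[m][n]."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and d[i][j] == d[i - 1][j - 1]:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(a[i - 1])  # deletion from a
            bottom.append("-")
            i -= 1
        else:
            top.append("-")  # insertion from b
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


table = indel_distance_table("STOCKHOLM", "TUKHOLM")
print(table[-1][-1])  # 4: delete S, O, C and insert U
print(*traceback("STOCKHOLM", "TUKHOLM", table), sep="\n")
```

The table costs O(mn) time and space; the trace-back adds only O(m + n).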

Homology searches
find homologous sequences: new sequence versus all old ones in the database
– the most popular computational task in present-day molecular biology
= approximate string matching
BLAST: big success
good homology => same biological function
[Figure: a NEW SEQUENCE queried against the DATABASE]

Multiple alignment
multiple alignment of sequence families to find interesting conserved motifs: NP-hard => heuristics, Hidden Markov models, MCMC
comparison of entire genomes

Gene enhancer module prediction

Problem
Gene expression regulation in multicellular organisms is controlled in a combinatorial fashion by so-called transcription factors (TFs). Transcription factors bind to DNA cis-elements (TF binding sites) on enhancer modules (promoters), and multiple factors need to bind to activate the module. In mammals, the modules are few and far between.
The problem: locate functional regulatory modules, that is, find interesting patterns.

Gene enhancer modules
[Figure: genes 1 to 4 on the DNA; transcription to RNA and translation to proteins; transcription factors binding an enhancer module]

Model of cell type specific regulation of target gene expression
[Figure labels: GLI, X, Y (tissue specific TFs); GLI; ubiquitously expressed TF; transcription; common targets (e.g. Patched); cell type specific targets (e.g. N-myc)]

Binding affinity matrices
The TF binding sites are represented by affinity matrices.
– A column per position
– A row per nucleotide
Discovered:
– Computationally
– Traditional wet lab
– Microarrays
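To make the matrix layout concrete, here is a small sketch that scores candidate sites against such a matrix. Everything specific in it is an assumption for illustration: the counts in `COUNTS` are invented toy numbers, not a real TF profile, and log-odds scoring against a uniform background with a pseudocount is one common convention rather than necessarily the one used in this work.

```python
import math

NUCLEOTIDES = "ACGT"

# Hypothetical toy count matrix: one row per nucleotide, one column per position.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 10, 0, 0],
    "T": [1, 0, 1, 8],
}


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn counts into log2(frequency / background) scores per cell."""
    width = len(counts["A"])
    matrix = {nt: [] for nt in NUCLEOTIDES}
    for pos in range(width):
        total = sum(counts[nt][pos] for nt in NUCLEOTIDES) + 4 * pseudocount
        for nt in NUCLEOTIDES:
            freq = (counts[nt][pos] + pseudocount) / total
            matrix[nt].append(math.log2(freq / background))
    return matrix


def score_site(matrix, site):
    """Sum the per-position scores of one candidate site."""
    return sum(matrix[nt][pos] for pos, nt in enumerate(site))


def best_site(matrix, sequence):
    """Slide the matrix along the sequence; return (score, start) of the best window."""
    width = len(matrix["A"])
    windows = [
        (score_site(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)


matrix = log_odds_matrix(COUNTS)
score, start = best_site(matrix, "TTAGCTGG")
print(start, round(score, 2))  # the AGCT window starts at index 2
```

In practice every window scoring above a chosen threshold would be reported as a putative binding site, not only the best one.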

Binding affinity matrices

Determined TF binding profiles (+ JASPAR)

Finding conserved motifs of binding sites
looking at one (human) genome gives too many false positives
comparative genomics approach:
– take the 200 kb regions surrounding the same genes (paralogs and orthologs) of different vertebrates: human, mouse, chicken, …
– find conserved clusters (= motifs) of binding sites
cluster = group of binding sites with good local alignment => Smith-Waterman type algorithm with a novel scoring function

Smith-Waterman
find the best local alignment of strings A and B: a substring X of A and a substring Y of B such that X and Y have the best-scoring pairwise alignment
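For reference, a compact sketch of the textbook Smith-Waterman recurrence on plain strings. The scores `match=2`, `mismatch=-1` and `gap=-2` are arbitrary illustrative defaults with a linear gap penalty; the scoring function used for binding sites in this work is different and is not reproduced here.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned substring X of a, aligned substring Y of b)."""
    m, n = len(a), len(b)
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 lets a local alignment start anywhere.
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score drops to 0.
    i, j = best_pos
    x, y = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            x.append(a[i - 1])
            y.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            x.append(a[i - 1])
            y.append("-")
            i -= 1
        else:
            x.append("-")
            y.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(x)), "".join(reversed(y))


print(smith_waterman("STOCKHOLM", "TUKHOLM"))  # (10, 'KHOLM', 'KHOLM')
```

With these scores, extending the alignment leftwards to include the shared T costs more than it gains, so the local optimum is the common suffix.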

Computational identification of enhancer elements
Preserved in evolution:
– Affinities of functional cis-elements.
– Spatial arrangement of elements within a module.
[Figure: human and mouse sequences compared]
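One way to picture a scoring function that rewards both preserved properties is a local alignment over binding sites instead of nucleotides. The sketch below is a deliberately simplified stand-in and not the EEL scoring function: the function `align_sites`, the parameter `distance_penalty`, and all site names, positions and affinities in the example are hypothetical. Matched sites contribute their affinity scores, and a chain is penalised when the spacing between consecutive sites differs between the two sequences.

```python
def align_sites(sites_a, sites_b, distance_penalty=0.01):
    """Best local chain of same-TF site pairs.

    Each site is (tf_name, position, affinity_score); both lists must be
    sorted by position. Returns (score, [(index_in_a, index_in_b), ...]).
    """
    best, back = {}, {}
    for i, (tf_a, pos_a, sc_a) in enumerate(sites_a):
        for j, (tf_b, pos_b, sc_b) in enumerate(sites_b):
            if tf_a != tf_b:
                continue  # only sites of the same factor can be paired
            score, prev = sc_a + sc_b, None  # start a new chain here
            for (k, l), prev_score in best.items():
                if k < i and l < j:
                    # Penalise a change in spacing between the two species.
                    shift = abs((pos_a - sites_a[k][1]) - (pos_b - sites_b[l][1]))
                    cand = prev_score + sc_a + sc_b - distance_penalty * shift
                    if cand > score:
                        score, prev = cand, (k, l)
            best[(i, j)] = score
            back[(i, j)] = prev
    if not best:
        return 0.0, []
    node = max(best, key=best.get)
    top_score, chain = best[node], []
    while node is not None:
        chain.append(node)
        node = back[node]
    return top_score, chain[::-1]


# Hypothetical sites: (TF name, position in bp, affinity score).
human = [("GLI", 100, 5.0), ("TF_X", 150, 4.0), ("TF_Y", 400, 3.0)]
mouse = [("GLI", 1000, 5.0), ("TF_X", 1060, 4.0), ("TF_Z", 1200, 3.0)]
print(align_sites(human, mouse))
```

This brute-force version compares every pair of site pairs, which is fine for a sketch but too slow for 200 kb regions with many predicted sites.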

Parameter optimization
The scoring function has 3 free parameters. Find good parameters by greedy hill climbing using training data.
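Greedy hill climbing over a handful of parameters can be sketched as follows. The objective here is a toy quadratic used only so the example runs; in the setting of the talk the objective would instead measure how well the predictions agree with the training data, and the names `hill_climb` and `toy_objective`, the step schedule, and the target values are all assumptions of this sketch.

```python
def hill_climb(objective, start, step=1.0, min_step=1e-3):
    """Greedy coordinate-wise hill climbing (maximisation).

    Try +/- step on each parameter, keep any move that improves the
    objective, and halve the step once no single move helps.
    """
    params = list(start)
    best = objective(params)
    while step >= min_step:
        improved = False
        for k in range(len(params)):
            for delta in (step, -step):
                candidate = params.copy()
                candidate[k] += delta
                value = objective(candidate)
                if value > best:
                    params, best, improved = candidate, value, True
        if not improved:
            step /= 2
    return params, best


def toy_objective(p):
    """Stand-in objective with its maximum at (1, -2, 0.5)."""
    x, y, z = p
    return -((x - 1) ** 2) - ((y + 2) ** 2) - ((z - 0.5) ** 2)


params, value = hill_climb(toy_objective, [0.0, 0.0, 0.0])
print(params, value)
```

Hill climbing finds only a local optimum, so on a real objective the result can depend on the starting point.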

Whole genome comparisons
Whole genomes can be analyzed with our implementation, EEL (Enhancer Element Locator).
We compared human genes to orthologs in mouse, rat, chicken, fugu, tetraodon and zebrafish:
– 100 kbp flanking regions on both sides of the gene.
– Coding regions masked out.
– About comparisons for each pair of species.

Annotating the Human genome with mammalian enhancer-elements

EEL output ● Output from EEL program. ● Previously known functional sites are highlighted ● DNA between the sites is aligned just for the output

Enhancer prediction for N-myc
[Figure: 200 kb mouse N-Myc genomic region compared with the 200 kb human N-Myc genomic region; conserved GLI binding sites in two predicted enhancer elements, CM5 and CM7; coding region of N-Myc marked]

Wet-lab verification
● Selected some predicted enhancer modules for wet-lab verification.
● Fused a 1 kb DNA segment containing the predicted enhancer to a marker gene (LacZ) with a minimal promoter, and generated transgenic embryos.

Summary
input: kb flanking sequences of DNA of orthologous pairs of genes from human and mouse
– find all good enough TF binding sites from the sequences
– find the best local alignments of the binding sites using the EEL scoring function
output: the sequences in good local alignments; these are the putative enhancers
postprocessing: an expert biologist selects the most promising predictions for wet lab verification; hopefully he/she has good luck!

Acknowledgements
Kimmo Palin, Outi Hallikas (Biom), Jussi Taipale (Biom)
The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health," contract number LHSG-CT