Statistical Analysis for Word counting in Drosophila Core Promoters Yogita Mantri April 27 2005 Bioinformatics Capstone presentation.



- Introduction & Motivation
- Dataset used
- Part I – Unbiased word counting
- Part II – TCAGT-centric word counting
- Conclusions and Future work

Introduction
- Regulatory elements are short DNA sequences that control gene expression.
- They are often found around the Transcription Start Site (TSS), sometimes further upstream.
- Identification of promoters and regulatory elements is a major challenge in bioinformatics:
  - Regulatory elements are not well conserved.
  - Computational discovery of the TSS is not straightforward.
  - Promoter sequences do not have distinguishable statistical properties.
  - Transcription is a highly cooperative process, including competitive or cooperative binding, which is not completely determined from the rest of the genome's DNA sequence.

Drosophila Core Promoters
[Figure: image edited from “Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–]

Motivation for project
- A database of core promoters with experimentally determined TSSs is a huge advantage over other approaches that use only gene upstream regions.
- Word-counting method to determine significant patterns, inspired by Dr. Peter Cherbas' earlier work: “The arthropod initiator: the capsite consensus plays an important role in transcription”, Cherbas L, Cherbas P., Insect Biochem Mol Biol Jan;23(1):81-90


The Database of Drosophila Core Promoters
- Compiled by Sumit Middha. It consists of Drosophila core promoters from three experimental sources.
- Ohler, Rubin et al.: 1941 promoters
  - Stringent criteria for identifying TSSs, requiring the 5' ends of multiple cDNAs to lie in close proximity.
- Kadonaga et al.: 205 promoters
  - The TSS had been changed to coincide with the A of the Inr consensus TCAGT even if experimental results reported the TSS in the vicinity. The discrepancy was fixed by taking the experimentally reported TSS.
- Eukaryotic Promoter Database: 1926 promoters
  - Assigned TSS based on experimental data with a precision of +/- 5 bp or better.
- 3458 sequences after removing redundant entries in the dataset.


Word Analysis – Part I: Unbiased search
- Used various statistical measures, such as the Z-score, on all possible n-mers in the entire dataset and in specific windows.
- The goal was to see whether known patterns of interest were significantly more enriched in promoter sequences than other patterns.

Basic Statistics of the dataset
- 3458 promoter sequences in the database.
- First step was a word-frequency analysis (pentamers used for initial analysis).
- Performed analysis on the following sets:
  - Entire dataset (DS-1)
  - Subset of the above dataset, with only the -20 to +20 region (DS-2)
- 2 types of analyses, differing in the “Random” sequences used:
  - 1st-order Markov chains based on base and transition probabilities of the respective dataset
  - “Non-coding” regions
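The word-frequency step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the function names, the `tss_index` parameter (0-based position of the TSS in each sequence), and the simplified coordinate handling (offsets are added directly to `tss_index`, ignoring the convention that promoter coordinates skip position 0) are all assumptions.

```python
from collections import Counter

def count_words(sequences, k=5):
    """Count every overlapping k-mer (pentamers by default) across all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.lower()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def extract_window(sequences, tss_index, start=-20, end=20):
    """Cut the region [start, end] around the TSS out of each sequence.

    tss_index is a hypothetical parameter: the 0-based index of the TSS.
    """
    lo = max(tss_index + start, 0)
    hi = tss_index + end + 1
    return [seq[lo:hi] for seq in sequences]

# Counts on the full sequences (DS-1 style) and on a window subset (DS-2 style).
promoters = ["aaaaatcagtaaaaa", "ccccctcagtccccc"]
full_counts = count_words(promoters)
window_counts = count_words(extract_window(promoters, tss_index=7, start=-2, end=2))
```

Counting overlapping occurrences is one possible convention; the slides do not say whether overlaps were counted.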

Random set
- Generated 100 sets of 1st-order Markov chains.
- Each set contained the same number of sequences as the original dataset (3458), each of the same length (350).
- Computed the occurrence of each pentamer in actual and random sequences.
- For random sequences, calculated the average and S.D. over all sets.
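A first-order Markov background of this kind could be generated along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, probabilities are estimated from raw counts over a, c, g, t only, and a fixed seed is used purely for reproducibility.

```python
import random
from collections import defaultdict

ALPHABET = "acgt"

def fit_first_order(sequences):
    """Estimate base counts and first-order transition counts from the data."""
    base = defaultdict(int)
    trans = {a: defaultdict(int) for a in ALPHABET}
    for seq in sequences:
        seq = seq.lower()
        for ch in seq:
            if ch in trans:
                base[ch] += 1
        for a, b in zip(seq, seq[1:]):
            if a in trans and b in trans:
                trans[a][b] += 1
    return base, trans

def sample_sequence(base, trans, length, rng):
    """Draw one sequence: first base from base counts, the rest from transitions."""
    bases = list(ALPHABET)
    seq = [rng.choices(bases, weights=[base[b] for b in bases])[0]]
    while len(seq) < length:
        weights = [trans[seq[-1]][b] for b in bases]
        if sum(weights) == 0:  # previous base never seen with a successor
            weights = [base[b] for b in bases]
        seq.append(rng.choices(bases, weights=weights)[0])
    return "".join(seq)

def random_sets(sequences, n_sets=100, seed=0):
    """Build n_sets random datasets matching the original in size and lengths."""
    rng = random.Random(seed)
    base, trans = fit_first_order(sequences)
    return [[sample_sequence(base, trans, len(s), rng) for s in sequences]
            for _ in range(n_sets)]
```

Each random set mirrors the number and lengths of the input sequences, so word counts in real and random data are directly comparable.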

Z-score
- A test of significance
- Mean and S.D. calculated over 100 sets
- Calculated Z-scores for all pentamers
- Looking for pentamers with very high or very low Z-scores
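The Z-score compares the observed count of a word with the mean and standard deviation of its counts over the random sets, z = (observed - mean) / S.D. A minimal sketch follows; the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the slides do not say which estimator was used.

```python
from statistics import mean, stdev

def z_scores(observed, random_counts):
    """Z-score per word.

    observed: dict mapping word -> count in the real dataset.
    random_counts: list of dicts, one per random set, mapping word -> count.
    Words whose random counts have zero spread are skipped.
    """
    scores = {}
    for word, obs in observed.items():
        values = [rc.get(word, 0) for rc in random_counts]
        mu, sd = mean(values), stdev(values)
        if sd > 0:
            scores[word] = (obs - mu) / sd
    return scores

observed = {"tcagt": 30, "aaaaa": 10}
random_counts = [
    {"tcagt": 8, "aaaaa": 10},
    {"tcagt": 10, "aaaaa": 10},
    {"tcagt": 12, "aaaaa": 10},
]
scores = z_scores(observed, random_counts)
```

Sorting `scores` by value, descending, would give a ranking of the kind shown in the tables that follow.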

Rank of TCAGT and variants in entire dataset

Rank  Pattern  Z-Score
1     aaaaa
      ttttt
      ttttg    88.1
4     gaaaa
      aaaac
      atttt
      gtttt
      ttttc
      aaaat
      gcagc
      gcagt
      tcagt
      acagt
      tcatt
      tataa

Summary of known pentamers in different windows
[Three tables with columns Pattern, Z-score, Rank, each listing tcagt, tcatt, gcagt, acagt, tataa: one unlabelled, one for Sliding Windows, one for Non-overlapping windows.]

Z-score Plots of tcagt and variants using sliding windows of 10 bp
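Position-specific counts of the kind behind such plots could be collected as sketched below. The function name and the choice to index windows from the start of the sequence (rather than in TSS-relative coordinates) are assumptions; `step=1` gives sliding windows and `step=width` gives non-overlapping windows.

```python
def windowed_counts(sequences, word, width=10, step=1):
    """Count occurrences of word lying fully inside each window.

    Returns a list of (window_start, count) pairs, summed over all sequences.
    """
    word = word.lower()
    length = min(len(s) for s in sequences)
    results = []
    for start in range(0, length - width + 1, step):
        n = 0
        for seq in sequences:
            window = seq[start:start + width].lower()
            # Check every offset at which the word could begin inside the window.
            n += sum(window.startswith(word, i)
                     for i in range(width - len(word) + 1))
        results.append((start, n))
    return results

seqs = ["aaaaatcagtaaaaaaaaaa"]
non_overlapping = windowed_counts(seqs, "tcagt", width=10, step=10)
sliding = windowed_counts(seqs, "tcagt", width=10, step=1)
```

Running the same counts on random sequences and converting each window to a Z-score would yield one curve per word.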

Lesson
- Cannot ignore position preference of regulatory motifs!


Word Analysis – Part II: Guided search, starting with known INR element TCAGT
- Identification of INR-enriched regions
- Identification of synonyms
- Correlation analysis of INR synonyms
- Guided search

TCAGT-centric word analysis

Window    Z-score
(-3,3)
(-4,2)
(-2,4)
(-5,1)
(-6,-1)
(-7,-2)
(-1,5)
(1,6)
(2,7)
(3,8)     28.79

INR Synonyms
- Group 1: CTCAG---, ATCAG---, TTCAG---, GTCAG---, -TCAGT--, ---AGTTG, ---AGTCG, --CAGTT-, --CAGTC-
- Group 2: TTAGT
- Group 3: ACACT---, -CACTCTG
- Group 4: -TCACA-, GTCAC--, --CACAC
- Group 5: TCACTCT
- Group 6: -CATTC, TCATT-

“Computational analysis of core promoters in the Drosophila Genome”, Ohler, Rubin et al., Genome Biology 2002, 3(12):research0087.1–

Binary Tree Representation of Dataset
[Figure: binary tree that splits the TOTAL set into INR+ and INR-, each of those into TATA+ and TATA-, and each of those into DPE+ and DPE-.]
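A presence/absence partition like this tree can be sketched as below. Everything here is illustrative: the function names, the toy sequences, and the word and index ranges in `features` are hypothetical placeholders, not the motif definitions or windows used in the project.

```python
from collections import Counter

def has_word(seq, word, start, end):
    """True if word occurs within seq[start:end] (0-based, end exclusive)."""
    return word in seq[start:end].lower()

def partition(sequences, features):
    """Assign each sequence a leaf label such as ('INR+', 'TATA-').

    features: ordered list of (name, word, start, end); the order defines
    the levels of the binary tree. Returns leaf label -> sequence count.
    """
    leaves = Counter()
    for seq in sequences:
        label = tuple(
            f"{name}{'+' if has_word(seq, word, start, end) else '-'}"
            for name, word, start, end in features
        )
        leaves[label] += 1
    return leaves

toy = ["tataaaccccctcagt", "ggggggccccctcagt", "gggggggggggggggg"]
features = [("INR", "tcagt", 10, 16), ("TATA", "tataa", 0, 6)]
leaves = partition(toy, features)
```

Adding a third feature tuple for DPE would extend the tree by one more level.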

3 Clusters in INR-positive set
- Words: ggtcacact, ggtcacac, cggtcacac, ttcagtcg, cggacgtg, tataaaag, tcagt
- Regions: TATA (-40, -35), DPE (+20, -30), INR (-10, +2)

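Contingency Matrices for INR, TATA, DPE
[Three 2x2 contingency matrices, each with a Log Likelihood value.]
- INR (rows) against TATA+ / TATA- (columns); Log Likelihood for INR+, TATA+
- INR (rows) against DPE+ / DPE- (columns); Log Likelihood for INR+, DPE+
- TATA (rows) against DPE+ / DPE- (columns); Log Likelihood labelled INR+, DPE+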
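One common way to score association in a 2x2 contingency matrix is the log-likelihood ratio (G) statistic, G = 2 Σ O ln(O/E), where O is the observed count and E the count expected under independence. Whether this is the "Log Likelihood" reported on the slide is an assumption; the sketch below only illustrates the general technique, and the function name and example counts are made up.

```python
import math

def g_statistic(table):
    """Log-likelihood ratio statistic for a 2x2 table [[a, b], [c, d]]."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    total = sum(row)
    g = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = row[i] * col[j] / total
            if observed > 0:  # a zero cell contributes nothing to the sum
                g += observed * math.log(observed / expected)
    return 2 * g

independent = g_statistic([[10, 20], [20, 40]])   # rows proportional: no association
associated = g_statistic([[30, 10], [10, 30]])    # strong diagonal: association
```

A value near zero indicates that the two motifs co-occur about as often as expected by chance; larger values indicate dependence in either direction.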

Possible Alternative TATA and INR Synonyms?
- Words: tctttcttt, ggtcacac, ctcgaggg, ctatcgat, cggtcacac, ttctttccg, gtcacact
- TATA – 2? INR – 2?

Enrichment further upstream – New Binding Sites?
- Words: actatcgat, ctatcgat, tatcgataaactatcgat

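Next Level of Binary Tree analysis
[Figure: binary tree that splits the TOTAL set into INR+ and INR-. The INR+ branch splits into TATA+ and TATA-, then DPE+ and DPE-. The INR- branch splits into INR_2+ and INR_2-, then TATA_2+ and TATA_2-, then DPE+ and DPE-, with some branches marked "?".]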

Conclusions & Future steps
- The main goal of this project was to try to identify significant words based only on statistical over-representation.
- The first part of the analysis, using an unbiased search method, was successful only in a very narrow range of positions around the TSS.
- However, the biased search starting with the Inr consensus revealed the 3 known regulatory elements in that region.
- An analysis of the Inr-negative set showed over-representation of patterns at the positions where the Inr, TATA and DPE would be expected; these could be possible synonyms.
- Thus the word-counting strategy has the potential to reveal:
  - Regulatory motifs and interrelationships that other motif discovery programs cannot
  - Synonyms for regulatory motifs
  - Dependencies among regulatory motifs

Acknowledgements
- Dr. Haixu Tang
- Dr. Sun Kim
- Dr. Peter Cherbas
- Sumit Middha
- Bioinformatics Research Group