Statistical Detection of Co-occurring Transcription Factor Binding Sites Armand Halbert 1.

Presentation transcript:

Background We examined whether transcription factor binding sites (TFBS) occupy a fixed position relative to transcription start sites (TSS). We created protein clusters with TFBS within 300 nucleotides of the TSS. A window of 300 nucleotides was chosen because this is where most cis-regulatory factors are concentrated. 2

False discovery rate of protein clusters We used the DAVID web service on the protein clusters to get functional annotations of the proteins in the clusters. The false discovery rate was used to estimate the probability that clusters were "interesting" by chance. We confirmed consequences of our fundamental hypothesis, that TFs occupy fixed positions relative to one another in CRMs (cis-regulatory modules) that co-regulate genes. 3

Jaccard distance Jaccard distance is a measure of the dissimilarity of sets. The Jaccard distance between pairs of clusters was then compared to the distance between the positions of those clusters. It was expected that as the positional distance increased, the functions of the proteins would diverge. 4
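As a minimal sketch of the measure named on this slide: the Jaccard distance between two sets A and B is 1 - |A ∩ B| / |A ∪ B|. The slides do not say which sets were compared for each cluster, so the annotation terms below are purely hypothetical placeholders, and `jaccard_distance` is an illustrative helper name rather than code from the presentation.

```python
def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Convention chosen here: two empty sets are treated as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical annotation sets for two clusters (not data from the slides).
cluster_1 = {"term_a", "term_b", "term_c"}
cluster_2 = {"term_a", "term_c", "term_d"}

# Two shared terms out of four distinct terms in total.
print(jaccard_distance(cluster_1, cluster_2))
```

A distance of 0 means the two sets are identical and a distance of 1 means they share nothing.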

Results The distance between cluster starting sites was weakly correlated with the Jaccard distance of pairs of clusters. 5

Non-pathogenicity in natural SIV hosts 6

Background Natural hosts (example: AGM, African green monkeys) rarely get sick from SIV, despite a high prevalence rate and high viral loads. This contrasts with non-natural hosts (example: humans), who do develop AIDS. Natural hosts are believed to have co-evolved with SIV. 7

Searching human proteins vs AGM proteins PSIBLAST was used to compare each protein of one species with the proteome of another species and to gather the top hits under an E-value threshold. CD-HIT was used to put similar proteins into clusters for finding reciprocal best hit proteins. 8
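The "top hits under an E-value threshold" step can be sketched as a simple filter and sort. This is only an illustration: the slides do not give the threshold or the number of hits kept, so `max_evalue=1e-5` and `limit=5` are assumed values, and hits are represented here as hypothetical `(query, subject, evalue)` tuples rather than parsed PSIBLAST output.

```python
def top_hits(hits, max_evalue=1e-5, limit=5):
    """Keep hits at or below the E-value threshold, best (lowest) first.

    hits: iterable of (query, subject, evalue) tuples (assumed layout).
    max_evalue and limit are illustrative defaults, not values from the slides.
    """
    kept = [hit for hit in hits if hit[2] <= max_evalue]
    kept.sort(key=lambda hit: hit[2])
    return kept[:limit]


# Hypothetical hits for one query protein.
example_hits = [
    ("H1", "M2", 0.0),
    ("H1", "M3", 2e-2),
    ("H1", "M7", 4e-6),
]
print(top_hits(example_hits))
```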

PSIBLAST Reports Creation Process 9

10

PSIBLAST Reports Creation Process 11

Searching PSIBLAST reports 12

Searching PSIBLAST reports 13

Searching PSIBLAST reports: Reciprocal Best Hit 14
H1vsAgm.br: H1 M2 0.0; H1 M3 2e-2; …
M2vsHuman.br: M2 H1 0.0; M2 H6 4e-6; …
reciprocalBestHits.out: H1 M2 0.0; …

Searching PSIBLAST reports: Best Hit 15
H2vsAgm.br: H2 M2 0.0; H2 M3 2e-2; …
M2vsHuman.br: M2 H3 0.0; M2 H6 4e-6; …
bestHits.out: H2 M2 0.0; …

Searching PSIBLAST reports: No Hit 16
H2vsAgm.br: (no hits)
noHits.out: H2
Goal: to find proteins that have no homologue in other species.
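The three outcomes on these slides (reciprocal best hit, best hit, no hit) can be sketched as one small classification routine. This is an illustration under stated assumptions, not the presenter's code: report rows are modelled as `(query, subject, evalue)` tuples, the best hit is taken to be the row with the lowest E-value, and the function names `best_hit` and `classify` are hypothetical.

```python
def best_hit(hits, query):
    """Return the subject with the lowest E-value for `query`, or None."""
    candidates = [hit for hit in hits if hit[0] == query]
    if not candidates:
        return None
    return min(candidates, key=lambda hit: hit[2])[1]


def classify(query, forward_hits, reverse_hits):
    """Label a query as a reciprocal best hit, a one-way best hit, or no hit."""
    subject = best_hit(forward_hits, query)
    if subject is None:
        return ("no_hit", None)
    if best_hit(reverse_hits, subject) == query:
        return ("reciprocal_best_hit", subject)
    return ("best_hit", subject)


# Rows mirroring the toy examples on the slides.
h1_vs_agm = [("H1", "M2", 0.0), ("H1", "M3", 2e-2)]
m2_vs_human_a = [("M2", "H1", 0.0), ("M2", "H6", 4e-6)]
h2_vs_agm = [("H2", "M2", 0.0), ("H2", "M3", 2e-2)]
m2_vs_human_b = [("M2", "H3", 0.0), ("M2", "H6", 4e-6)]

print(classify("H1", h1_vs_agm, m2_vs_human_a))  # H1 and M2 pick each other
print(classify("H2", h2_vs_agm, m2_vs_human_b))  # M2 prefers H3, so one-way only
print(classify("H2", [], m2_vs_human_b))         # empty report: no hit
```

Queries labelled `no_hit` correspond to the proteins the last slide is looking for: those with no homologue in the other species.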

Results Searches: AGM: 61,804 proteins; human: 71,340 proteins. 17

Expansion Eventually, this process will be adapted to multiple species. The main challenge is the performance of a large number of PSIBLAST searches. For example, running each human protein against the African green monkey database took 4 days. Running it as a farm job will allow the application to scale. 18