Download presentation
Published bySharleen Craig Modified over 9 years ago
1
Using vertebrate genome comparisons to find gene regulatory regions
Ross Hardison and James Taylor Cold Spring Harbor course on Computational Genomics Nov. 10, 2007
2
Major goals of comparative genomics
Identify all DNA sequences in a genome that are functional Selection to preserve function Adaptive selection Determine the biological role of each functional sequence Elucidate the evolutionary history of each type of sequence Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research
3
Types of sequences in mammalian genomes
About 1.5-2% codes for protein Almost all shows a sign for purifying selection since the primate-rodent divergence Does not preclude positive selection acting on smaller regions or in specific lineages About 45% is interspersed repeats 22% in ancestral repeats Good model for neutral DNA 23% in lineage-specific repeats About 53% is noncoding, nonrepetitive Minimum of 4% of genome is under purifying selection for a function common to mammals, but does NOT code for protein Regulatory sequences Non-protein coding genes Other important sequences About 49% under no obvious selection: no conserved function?
4
Impact of whole-genome alignments
Guide to functional sequences in the human genome. Conserved sequences Sequences under purifying selection Better gene predictions Sequences that look like elements that regulate gene expression
5
Three modes of evolution
6
Negative and positive selection observed at different phylogenetic distances
:
7
Genome-wide local alignment chains
Human: 2.9 Gb assembly. Mask interspersed repeats, break into 300 segments of 10 Mb. Human blastZ: Each segment of human is given the opportunity to align with all mouse sequences. Mouse Run blastZ in parallel for all human segments. Collect all local alignments above threshold. Organize local alignments into a set of chains based on position in assembly and orientation. Level 1 chain Net Level 2 chain
8
Comparative genomics to find functional sequences
Genome size Find common sequences blastZ, multiZ Human Mouse Rat All mammals 1000 Mbp 2,900 Identify functional sequences: ~ 145 Mbp 2,400 Also birds: 72Mb 2,500 1,200 million base pairs (Mbp) Papers in Nature from mouse and rat and chicken genome consortia, 2002, 2004
9
Regional variation in divergence rates
10
Implications of co-variation in divergence
Large regions (megabase sized) are changing relatively fast or slow for (almost) all types of divergence Neutral substitution, insertions (except SINEs), deletion, recombination This is a consistent property of each region of genomic DNA See similar patterns for orthologous regions on independent lineages to mouse, rat and human An aligned segment with a given similarity score in a fast-changing region is MORE significant than an aligned segments with the same similarity score in a slow-changing region. Must take the differential rate into account in searching for functional DNA = DNA under selection.
11
Use measures of alignment quality to discriminate functional from nonfunctional DNA
Compute a conservation score adjusted for the local neutral rate Score S for a 50 bp region R is the normalized fraction of aligned bases that are identical Subtract mean for aligned ancestral repeats in the surrounding region Divide by standard deviation p = fraction of aligned sites in R that are identical between human and mouse m = average fraction of aligned sites that are identical in aligned ancestral repeats in the surrounding region n = number of aligned sites in R Waterston et al., Nature
12
Decomposition of conservation score into neutral and likely-selected portions
Neutral DNA (ARs) All DNA Likely selected DNA At least 5-6% S is the conservation score adjusted for variation in the local substitution rate. The frequency of the S score for all 50bp windows in the human genome is shown. From the distribution of S scores in ancestral repeats (mostly neutral DNA), can compute a probability that a given alignment could result from locally adjusted neutral rate. Waterston et al., Nature
13
Conservation score S in different types of regions
Red: Ancestral repeats (mostly neutral) Blue: First class in label Green: Second class in label
14
phastCons: Likelihood of being constrained
Phylogenetic Hidden Markov Model Posterior probability that a site is among the 10% most highly conserved sites Allows for variation in rates along lineages c is “conserved” (constrained) n is “nonconserved” (aligns but is not clearly subject to purifying selection) Siepel et al. (2005) Genome Research 15:
15
Larger genomes have more of the constrained DNA in noncoding regions
Expected value if coverage by conserved elements is uniform Siepel et al. 2005, Genome Research
16
Some constrained introns are editing complementary regions:GRIA2
Siepel et al. 2005, Genome Research
17
3’UTRs can be highly constrained over large distances
3’ UTRs contain RNA processing signals, miRNA targets, other regions subject to constraints Siepel et al. 2005, Genome Research
18
Ultraconserved elements = UCEs
At least 200 bp with no interspecies differences Bejerano et al. (2004) Science 304: 481 UCEs with no changes among human, mouse and rat Also conserved between out to dog and chicken More highly conserved than vast majority of coding regions Most do not code for protein Only 111 out of 481overlap with protein-coding exons Some are developmental enhancers. Nonexonic UCEs tend to cluster in introns or in vicinity of genes encoding transcription factors regulating development 88 are more than 100 kb away from an annotated gene; may be distal enhancers
19
GO category analysis of UCE-associated genes
Genes in which a coding exon overlaps a UCE 91 Type I genes RNA binding and modification Transcriptional regulation Genes in the vicinity of a UCE (no overlap of coding exons) 211 Type II genes Developmental regulators Bejerano et al. (2004) Science
20
Intronic UCE in SOX6 enhances expression in melanocytes in transgenic mice
UCEs Pennacchio et al., Tested UCEs
21
The most stringently conserved sequences in eukaryotes are mysteries
Yeast MATa2 locus Most conserved region in 4 species of yeast 100% identity over 357 bp Role is not clear Vertebrate UCEs More constrained than exons in vertebrates Noncoding UCEs are not detectable outside chordates, whereas coding regions are Were they fast-evolving prior to vertebrate/invertebrate divergence? Are they chordate innovations? Where did they come from? Role of many is not clear; need for 100% identity over 200 bp is not obvious for any What molecular process requires strict invariance for at least 200 nucleotides? One possibility: Multiple, overlapping functions
22
Going beyond stringent selection in noncoding sequence to find cis-regulatory modules
23
Constraint in noncoding sequences
Used to predict gene regulatory regions with some success Some sequences conserved between humans and mouse show no apparent function Is constraint revealing many false positives? Sequences regulating gene expression in restricted lineages are not constrained across mammals Is pan-mammalian constraint missing many functional sequences? Tree from Margulies et al. (2007) Genome Res.
24
phastCons can find some but not all gene regulatory regions
LCR HS HS HS HS HS5 phastCons Locus control region, or LCR, is the major distal enhancer fo HBB and related, linked genes. It has 5 DNase hypersensitive sites covering about 20 kb.
25
Two extremes of constraint in CRMs
CRMs= cis-regulatory modules. DNA sequences needed in cis for regulation of expression, usually transcription E.g. promoters, enhancers, silencers
26
Coverage of human by alignments with other vertebrates ranges from 1% to 91%
5.4 5% Millions of years 91 92 173 220 310 360 450
27
Distinctive divergence rates for different types of functional DNA sequences
pTRRs: putative transcriptional regulatory region; likely CRMs Sites identified as occupied by sequence-specific transcription factors based on high-throughput chromatin immunoprecipitation assayed by hybridization to high density tiling arrays of genomic DNA= ChIP-chip
28
cis-Regulatory modules conserved beyond mammals
Human-chicken alignment capture about 6% of pTRRs (likely CRMs) Human-fish alignments capture about 3% of pTRRs. The pan-vertebrate CRMs tend to regulate genes whose products control transcription and development Millions of years 91 173 310 450
29
cis-Regulatory modules conserved in eutherian mammals and marsupials
Human-marsupial alignments capture about 32% of CRMs (pTRRs) Tend to occur close to genes involved in aminoglycan synthesis, organelle biosynthesis Human-mouse alignments capture about 75% of CRMs (pTRRs) Tend to occur close to genes involved in apoptosis, steroid hormone receptors, etc. Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA. Millions of years 91 173 310 450
30
Interferon beta Enhancer-Promoter
31
Expected properties of gene regulatory regions
Can be almost anywhere 5’ or 3’ to gene Within introns Close or far away Conserved between species (sometimes) Examine interspecies alignments, noncoding regions Evaluate likelihood of being under purifying selection, e.g. phastCons score Some regulatory regions are deeply conserved, others are lineage-specific Enhancers and promoters: clusters of binding sites for transcription factors (TFBSs) Resources and servers for finding TFBSs TRANSFAC JASPAR TESS MOTIF (GenomeNet) MatInspector
32
Finding known motifs in a query sequence
MatInspector at K. Cartharius et al. (2006) MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics 21: Genomatix Software GmbH, Munchen, Germany Query: a UCE in SOX6 1356 bp About 1 in 4 bp is the start of a TFBS match!
33
Conservation of TFBSs between species
Servers to find conserved matches to factor binding sites Comparative genomics at Lawrence Livermore zPicture and rVista Mulan and multiTF ECR browser Consite Conserved TFBSs are available for some assemblies of human genome at UCSC Genome Browser Binding site for GATA-1
34
Clusters of conserved TFBSs: PReMods
Blanchette et al. (2006) Genome Research
35
ESPERR Evolutionary and Sequence Pattern Extraction through Reduced Representation
36
ESPERR: a different approach
Don’t assume a database of known binding motifs Don’t assume strict conservation of the important sequence signals Instead, use alignments of validated examples to learn sequence and evolutionary patterns that characterize a class of elements
37
Objective of ESPERR
38
ESPERR overview
39
Represent columns with ancestral distributions
40
Group columns using evolutionary similarity and frequency distribution
41
An agglomerative algorithm
42
Searching for encodings
43
Evaluate “merit” of candidate mappings
44
Iterate until convergence
45
Search convergence behavior
46
Regulatory potential (RP) to distinguish functional classes
47
Variable order Markov models for discrimination
48
Use ESPERR to compute Regulatory Potential
49
Good performance of ESPERR for gene regulatory regions (RP)
-1
50
Experimental tests of predicted cis-regulatory modules
51
GATA-1 is required for erythroid maturation
Common myeloid progenitor Hematopoietic stem cell MEP G1E cells GATA-1 Common lymphoid progenitor Myeloblast G1E-ER4 cells Eosinophil Basophil Monocyte, macrophage Neutrophil Aria Rad,
52
Genes Co-expressed in Late Erythroid Maturation
G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. Can rescue by expressing an estrogen-responsive form of GATA-1 Rylski et al., Mol Cell Biol. 2003
53
Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes
54
preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids
55
preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome
56
Examples of validated preCRMs
57
Correlation of Enhancer Activity with RP Score
58
Validation status for 99 tested fragments
59
preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated
60
Conclusions Particular types of functional DNA sequences are conserved over distinctive evolutionary distances. Multispecies alignments can be used to predict whether a sequence is functional (signature of purifying selection). Patterns in alignments and conservation of some TFBSs can be used to predict some cis-regulatory elements. The predictions of cis-regulatory elements for erythroid genes are validated at a good rate. Databases and servers such as the UCSC Table Browser, Galaxy, and others provide access to these data.
61
Many thanks … PSU Database crew: Belinda Giardine, Cathy Riemer, Yi Zhang, Anton Nekrutenko Wet Lab: Yuepin Zhou, Hao Wang, Ying Zhang, Yong Cheng, David King RP scores and other bioinformatic input: Francesca Chiaromonte, James Taylor, Shan Yang, Diana Kolbe, Laura Elnitski Alignments, chains, nets, browsers, ideas, … Webb Miller, Jim Kent, David Haussler Funding from NIDDK, NHGRI, Huck Institutes of Life Sciences at PSU
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.