Published by Jesse Shelton. Modified over 9 years ago.
1
GENOME EVOLUTION AND GENE DUPLICATIONS IN EUKARYOTES Shin-Han Shiu Plant Biology / QBMI Michigan State University
2
Genomes and gene contents
[Figure: gene counts labeled on the six genomes shown: 30,000; 25,000; 10,000; 6,000; 45,000; 17,000]
3
Duplicate genes in the genome: Arabidopsis gene families*
*: Families are clusters from Markov clustering, using all-against-all BLAST E values as distance measures.
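The family definition in the footnote can be illustrated with a minimal pure-Python sketch of Markov clustering (alternating expansion and inflation). This is not the pipeline behind the slide: the E value transform (-log10), the self-loop rule, the inflation value of 2.0, and the toy hits are all assumptions made for illustration.

```python
import math

def evalue_to_weight(evalue, floor=1e-200):
    """Convert a BLAST E value to a similarity weight (assumed transform)."""
    return max(0.0, -math.log10(max(evalue, floor)))

def weights_from_hits(n_genes, hits):
    """Build a symmetric weight matrix from {(i, j): evalue} hits."""
    w = [[0.0] * n_genes for _ in range(n_genes)]
    for (i, j), evalue in hits.items():
        w[i][j] = w[j][i] = evalue_to_weight(evalue)
    return w

def normalize_columns(m):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    out = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = m[i][j] / total
    return out

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def markov_cluster(weights, inflation=2.0, iterations=30, eps=1e-6):
    n = len(weights)
    m = [row[:] for row in weights]
    for i in range(n):
        # Self-loop as strong as the best hit of gene i (1.0 if it has no hits).
        m[i][i] = max([weights[k][i] for k in range(n) if k != i] + [0.0]) or 1.0
    m = normalize_columns(m)
    for _ in range(iterations):
        m = matmul(m, m)                                  # expansion
        m = [[x ** inflation for x in row] for row in m]  # inflation
        m = normalize_columns(m)
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > eps) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)

# Toy all-against-all hits: genes 0-2 hit each other, genes 3-4 hit each other.
toy_hits = {(0, 1): 1e-50, (0, 2): 1e-50, (1, 2): 1e-50, (3, 4): 1e-30}
families = markov_cluster(weights_from_hits(5, toy_hits))
print(families)
```

With real BLAST output the weights are uneven and the inflation parameter controls how finely families are split, so the value used here should not be read as a recommendation.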
4
Gene function and duplication What’s the consequence?
5
Gene function and duplication What’s the consequence?
6
Focus I: Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, consequences, preferential retention
7
Duplication mechanisms: whole genome duplication, tandem duplication, segmental duplication, replicative transposition
8
Lineage-specific gains in plants and animals

Organism     Lineage-specific gains  Normalized gain*  # of genes in families analyzed  % total
Rice         10115                   6743              28467                            35.5 (23.7)**
Arabidopsis  5984                    3990              21936                            27.3 (18.2)**
Human        811                                       21954                            3.7
Mouse        1265                                      24041                            5.3

*: The gain counts are normalized against the ratio between the Arabidopsis-rice and human-mouse divergence times (150 and 100 Mya, respectively).
**: Numbers in parentheses refer to percentage of total based on normalized gains.

Substantially more recent duplicates in plants than in animals.
Mostly due to frequent whole genome duplications in plants.
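The normalization in the first footnote can be sketched as below. The function names are hypothetical, and scaling by exactly 100/150 is an assumption, so a normalized count may differ from the slide by a gene or two because of rounding; the percentages agree to one decimal place.

```python
def normalized_gain(gain, reference_mya=100.0, lineage_mya=150.0):
    """Scale a gain count to the shorter (human-mouse) divergence time."""
    return gain * reference_mya / lineage_mya

def percent_of_total(count, total_genes):
    return 100.0 * count / total_genes

# (lineage-specific gains, genes in families analyzed), taken from the table.
plants = {"Rice": (10115, 28467), "Arabidopsis": (5984, 21936)}

for name, (gain, total) in plants.items():
    raw_pct = round(percent_of_total(gain, total), 1)
    norm_pct = round(percent_of_total(normalized_gain(gain), total), 1)
    print(f"{name}: {raw_pct}% raw, {norm_pct}% normalized")
```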
9
Gain vs. Loss
3 rounds of whole-genome duplication in the Arabidopsis lineage.
~82% of duplicates from the last round were lost in the past 40 million years.
Gene content by doubling alone: 15,000* -> 30,000 -> 60,000 -> 120,000
Genome duplications + tandem duplications - gene losses = Arabidopsis gene content: 21,000**
*: Number of orthologous groups in shared families between Arabidopsis and rice.
**: Number of genes in shared families.
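The doubling arithmetic on this slide can be checked with a short sketch. It only compares the no-loss expectation after three doublings with the observed gene content; reading the resulting shortfall as the slide's "~82%" is an assumption about how that figure was derived, and the function names are hypothetical.

```python
def expected_genes(ancestral_count, wgd_rounds):
    """Gene count after repeated whole-genome duplication with no loss."""
    return ancestral_count * 2 ** wgd_rounds

def fraction_missing(expected, observed):
    """Share of the no-loss expectation that is absent today."""
    return 1.0 - observed / expected

expected = expected_genes(15_000, 3)          # 15,000 doubled three times
missing = fraction_missing(expected, 21_000)  # observed Arabidopsis content
print(expected, f"{missing:.1%}")
```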
10
“Age” distribution of animal duplicates
Steady decay in the number of duplicates.
Frequent TD, SD, and RT (tandem duplication, segmental duplication, replicative transposition).
Ks: rate of synonymous substitutions, i.e. nucleotide substitutions at codon sites that do not change the amino acid.
Shiu et al., 2006
11
Plant duplicate “age” distribution
Apparent peak at Ks ~0.18 instead of at zero.
Frequent WGD (whole genome duplication), TD, SD (maybe), and RT (in some plants).
Shiu et al., 2004
12
Genome remodeling in polyploids: natural and synthetic polyploids
[Figure labels: genome sizes ~348 Mb, ~203 Mb, ~314 Mb, ~257 Mb; 20,000 yr]
13
Experimental approaches
Genome-wide polymorphism monitored by tiling array, ~6 million features.
[Figure labels: genome, tiled probes, gap, resolution, array; 20,000 yr]
14
Genome-wide Single Feature Polymorphism (SFP)
Mid-parent (MP) vs. Arabidopsis suecica (As)

Polyploid  SFP
Natural    58,517
Synthetic  503
15
Genome-wide Single Feature Polymorphism
Genome-wide polymorphism monitored by tiling array.
[Figure labels: gene, pseudogene, transposon]
16
Genome-wide Single Feature Polymorphism Duplication or deletion MP duplication or As deletion
17
Genome Survey Sequencing
Sequence ~40-60 Mb of the Arabidopsis suecica genome: 0.15-0.2X coverage, will be done next week!
Ultra-high throughput sequencer (GS20) funded by the Strategic Partnership Grant.
Ultra-high throughput: 20-30 Mb per run, each run 5 hours. Will be 100 Mb per run early 2007.
Cost efficient: ~$0.3/kb.
Read length rather limited: ~100 bp per read now. Will be ~200 bp early 2007.
For more information contact: Andreas Weber (aweber@msu.edu), David DeWitt (dewittd@msu.edu), or Shin-Han Shiu (shius@msu.edu).
Seminar on instrumentation: 9/29, Friday, 1pm, 1415 BPS
18
Summary: Gene duplication and polyploidy
Gene duplication occurred frequently in eukaryotes, but most duplicates are lost.
In plants, whole genome duplication is common, but gene loss also occurred frequently.
After 4 generations, a very small number of SFPs are identified in synthetic polyploids.
After 20,000 generations, most coding genes do not have the clustered sequence polymorphism that is indicative of deletion.
Clustered polymorphisms are mostly located in pseudogenes and transposons.
Survey sequencing is necessary to determine if some coding genes have become pseudogenes without being deleted.
19
Focus II: Differential Retention of Duplicates
Gene duplications: mechanisms, consequences, preferential retention
20
Duplicate genes in the genome: Arabidopsis gene families*
*: Families are clusters from Markov clustering, using all-against-all BLAST E values as distance measures.
21
Large gene families in plants One of the largest gene families
22
Normalized gain: % expanded OGs (orthologous groups)
Large family sizes do not necessarily indicate higher expansion rates.
23
Ancestral family sizes and gene gains
Large ancestral families tend to have more lineage-specific gains, but with many exceptions.
24
Differential expansion of functional categories (GO: Gene Ontology)
Protein ubiquitination, polysaccharide biosynthesis, cell wall modification, transcriptional regulation, biotic stress response, secondary metabolism
25
Differences in Duplicability
Duplicability: the propensity for the retention of a duplicate gene.
Computational analysis of genome-wide trend.
Categories compared between Arabidopsis and human: defense response, proteolysis, transport, ion channel activity, metabolism, development, protein kinase activity, transcription factor activity
26
Kinase superfamily sizes among eukaryotes

Organism                          Number of genes  Kinase superfamily  Percent total gene
Arabidopsis thaliana              25,814           1041                4.0
Oryza sativa subsp. indica        ~35,000          1607                3.6
Chlamydomonas reinhardtii         ~12,200          414                 3.4
Plasmodium falciparum             5,334            94                  1.8
Plasmodium yoelii                 7,681            70                  0.9
Caenorhabditis elegans            19,484           417                 2.1
Drosophila melanogaster           13,808           262                 1.9
Anopheles gambiae                 15,088           216                 1.4
Ciona intestinalis                15,852           316                 2.0
Fugu rubripes                     33,609           632                 1.9
Mus musculus                      22,444           495                 2.2
Homo sapiens                      22,980           472                 2.1
Saccharomyces cerevisiae          6,449            113                 1.8
Candida albicans                  6,164            95                  1.5
Neurospora crassa                 10,082           104                 1.9
Schizosaccharomyces pombe         4,945            109                 2.2

Shiu & Bleecker, 2003
27
Kinase families in rice and Arabidopsis Gene count differences among families indicate differential expansion Shiu et al., 2004
28
Estimation of ancestral RLK family size
[Figure, panels A and B: kinase phylogeny of Arabidopsis and rice RLKs; 440 speciation points; subfamilies labeled WAK and LRR VIII, X, XII]
Shiu et al., 2004
29
Development vs. resistance/defense RLKs Shiu et al., 2004
30
Contradiction
Plant genes involved in development tend to have high duplicability.
Developmental RLKs: low duplicability.
Resistance/defense RLKs: high duplicability.
Animal tyrosine kinases: low duplicability.
Transcription factors: high duplicability.
31
Selection for expansion
Depends on the level of variation in the signals.
32
Summary: differential retention
Longevity and duplicability of plant genes
[Figure: grid of duplicability (high/low) against longevity (high/low), with examples: transcription factors, resistance genes, enzymes in central metabolic pathways, ??]
33
Focus III: Functional Consequences
Gene duplications: mechanisms, consequences, preferential retention
34
Functional Consequences of Duplication
Functional divergence and conservation: is it because of changes in cis-regulatory elements or in coding sequences?
How are duplicates retained: subfunctionalization or neofunctionalization?
35
Divergence in gene expression
Develop pipelines for cis-element prediction.
[Workflow components: expression data; clusters of genes with similar expression profiles; over-represented sequence motifs in 5’ regions; machine learning; motif functional prediction; cis-regulatory logic; experimental validations]
36
Divergence in post-translational modification
Conservation of phosphorylation sites across species
SACE: budding yeast; CAGL: Candida glabrata; CAAL: Candida albicans; CATR: Candida tropicalis; NECR: Neurospora crassa; DEHA: Debaryomyces hansenii
37
Detailed Functional Studies of Duplicate Genes
Functional analyses of the DDF1 and DDF2 transcription factors.
Derived from a recent whole genome duplication in Arabidopsis.
Related to the well-known CBF factors involved in cold and drought stress.
Approaches for the DDFs in Arabidopsis thaliana and Arabidopsis lyrata: promoter GFP, knockouts, over-expression studies, interacting proteins, binding targets
38
Focus IV: Protein space
Gene duplications: mechanisms, consequences, preferential retention
39
Tiling array analysis of transcriptome Human Chr 21, 22 Kapranov et al., 2002
40
Posterior probability p(F|coding)
41
Performance of the CI measure
Known Arabidopsis exons and introns, 90-300 bp.
Arabidopsis small proteins that are not annotated: correctly predicts 19 out of 20 (95%).
Yeast sORFs with translation evidence: correctly predicts 98 out of 114 (86%).
In “intergenic” sequences of the Arabidopsis genome: 3,274 sORFs identified.
42
Coupling with tiling array expression Hybridization intensities for feature types
43
Summary: Novel coding genes
Many unannotated regions in the genomes are expressed.
Using the CI measure, many proteins from yeast and Arabidopsis that were not annotated but have evidence of expression are identified correctly.
Using the CI measure, we estimated that ~3000 novel coding regions are present in the unannotated regions of the Arabidopsis thaliana genome.
Using tiling array data, we found that many of these novel coding regions are expressed.
44
Acknowledgement
Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou, Emily Eckenrode
University of Chicago: Justin Borevitz, Xu Zhang
University of Wisconsin: Sara Patterson, Rick Vierstra
University of Missouri: Scott Peck
Michigan State University: Many… Rong Jin (Comp Sci & Eng), Yue-Hua Cui (Stat & Prob)
Startup fund
45
Recent completion …
46
Genome remodeling in polyploids
Genome duplications occur frequently in plants.
What is the fate of duplicates?
How fast do gene losses occur?
Is there any preference in genes retained?
[Figure: genes A to E are duplicated into copies A1 to E1 and A2 to E2, then followed over times t1 and t2; gene number Ng = 5, 10, 8, 5]
47
Comparing degrees of expansion
Combined set: Arabidopsis, ~25,000 proteins; rice prediction, ~66,000 genes.
Gene/domain families (shared, unique) -> pairwise distance -> putative orthologous groups.
For a GO category i (e.g. GO:0001): u_i = 1 unexpanded and e_i = 4 expanded orthologous groups.
Over all orthologous groups: total unexpanded = Σ u_i, total expanded = Σ e_i.
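The u_i and e_i tallies can be sketched as follows. The sketch assumes that each orthologous group carries a single GO category and a yes/no "expanded" flag; the second category (GO:0002) and its counts are made-up illustration data, and only the GO:0001 counts come from the slide.

```python
from collections import defaultdict

def tally_expansion(groups):
    """Count unexpanded (u_i) and expanded (e_i) groups per GO category.

    groups: iterable of (go_category, is_expanded) pairs, one per
    orthologous group (hypothetical input format).
    """
    per_category = defaultdict(lambda: {"u": 0, "e": 0})
    for go_category, is_expanded in groups:
        per_category[go_category]["e" if is_expanded else "u"] += 1
    total_u = sum(c["u"] for c in per_category.values())  # sum of u_i
    total_e = sum(c["e"] for c in per_category.values())  # sum of e_i
    return dict(per_category), total_u, total_e

toy_groups = (
    [("GO:0001", False)] + [("GO:0001", True)] * 4       # u = 1, e = 4
    + [("GO:0002", False)] * 3 + [("GO:0002", True)]     # made-up category
)
per_category, total_u, total_e = tally_expansion(toy_groups)
print(per_category["GO:0001"], total_u, total_e)
```

Comparing each category's u_i and e_i against the totals is what allows a statement about differential expansion; the choice of statistical test is not given on the slide and is left out here.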
48
Major questions on gene duplication When: timing of gene duplications, e.g. N = 10
49
Domain gains in rice and Arabidopsis Gain in one lineage does not necessarily predict gain in the other
50
Identify novel small coding genes
Determine base composition probabilities from coding sequences (CDS parameters) and non-coding sequences (NCDS parameters):
Pc(AAA) = (# of AAA) / (# of all NNN)
Pc(T|AAA) = Pc(AAAT) / Pc(AAA)
Calculate posterior probability.
[Figure labels: codon positions c1c2c3 c4c5c6; feature tables; n]
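The two probabilities on this slide can be sketched with simple k-mer counts. The sketch follows the slide's formulas literally and assumes overlapping windows over every training sequence; whether the original tables were built per codon position is not stated, and the two training sequences are toy data.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every overlapping k-mer in a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def kmer_probability(sequences, kmer):
    """P(kmer) = (# of kmer) / (# of all k-mers of the same length)."""
    counts = kmer_counts(sequences, len(kmer))
    total = sum(counts.values())
    return counts[kmer] / total if total else 0.0

def conditional_probability(sequences, context, base):
    """P(base | context) = P(context + base) / P(context)."""
    p_context = kmer_probability(sequences, context)
    if p_context == 0.0:
        return 0.0
    return kmer_probability(sequences, context + base) / p_context

toy_coding = ["AAAT", "AAAA"]  # made-up training sequences
p_aaa = kmer_probability(toy_coding, "AAA")
p_t_given_aaa = conditional_probability(toy_coding, "AAA", "T")
print(p_aaa, p_t_given_aaa)
```

The same functions applied to a non-coding training set would give the NCDS parameters.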
51
Setting up the Bayes’ Priors S = ATG TTC TAC TTT G… …
52
Coding Likelihood (CL) Sliding windows of a sequence Simulation based on NCDS (introns) 1 2 3 4 … n
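A sliding-window score of this general kind can be sketched as a log-odds ratio between a coding and a non-coding model. This is not the CL or CI measure from the talk: the in-frame trinucleotide scoring, the window size and step, the pseudo-probability, and both probability tables are assumptions chosen only to show the windowing.

```python
import math

def window_log_odds(window, p_coding, p_noncoding, pseudo=1e-6):
    """Sum log(P_coding / P_noncoding) over in-frame trinucleotides."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        tri = window[i:i + 3]
        score += math.log(p_coding.get(tri, pseudo) / p_noncoding.get(tri, pseudo))
    return score

def sliding_scores(seq, p_coding, p_noncoding, size=9, step=3):
    """Score windows 1, 2, 3, ... n along a sequence; returns (start, score)."""
    return [
        (start, window_log_odds(seq[start:start + size], p_coding, p_noncoding))
        for start in range(0, len(seq) - size + 1, step)
    ]

# Made-up probability tables for illustration only.
toy_coding = {"ATG": 0.5, "TTC": 0.5}
toy_noncoding = {"ATG": 0.25, "TTC": 0.25}
scores = sliding_scores("ATGTTCATGTTC", toy_coding, toy_noncoding)
print(scores)
```

Scores from real windows would then be compared against a null distribution, which the slide obtains by simulation based on NCDS (introns); that step is not shown.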
53
Divergence in post-translational modification
Conservation of phosphorylation sites across species