Published by Samson Freeman. Modified over 9 years ago.
1
Dark matters in the genomes
Shin-Han Shiu
Plant Biology / Genetics / EEBB / QBMI
2
About myself
6
Cell, nucleus, and chromosomes
7
DNA AGGCGTAGAGAGATCCTTGAT TCCGCAACTCTCAAGGAACAA
8
DNA and Genome
The genome is all the DNA in a cell, made up of A, T, G, C...
How many A's, T's, G's, and C's are there in the human genome? 3,200,000,000 letters
A sizable book, say, The Lord of the Rings: The Fellowship of the Ring: 764,470 characters in 410 pages, ~2,000 characters per page
The book of our life: 1,600,000 pages, or 4,186 copies of The Fellowship of the Ring
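The book comparison on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide; the variable names are illustrative, and the page count uses the slide's rounded value of about 2,000 characters per page.

```python
# Figures quoted on the slide
genome_letters = 3_200_000_000   # letters in the human genome
book_chars = 764_470             # characters in the book
book_pages = 410                 # pages in the book

# Actual characters per page (the slide rounds this up to ~2,000)
chars_per_page = book_chars / book_pages

# Pages needed for the genome at the slide's rounded 2,000 characters per page
pages_needed = genome_letters / 2_000

# Number of copies of the book needed to hold the genome
copies_needed = genome_letters / book_chars

print(f"{chars_per_page:,.0f} characters per page")
print(f"{pages_needed:,.0f} pages")
print(f"{copies_needed:,.0f} copies of the book")
```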
9
Genome sequencing chronology
Year  Organism                   Significance                Genome size (bp)  Number of genes
1977  Bacteriophage φX174        First genome ever!          5,386             11
1981  Human mitochondria         First organelle             16,500            37
1995  Haemophilus influenzae Rd  First free-living organism  1,830,137         ~3,500
1996  Saccharomyces cerevisiae   First eukaryote             12,086,000        ~6,000
10
Genome sequencing chronology
Year  Organism                 Significance                  Genome size (bp)  Number of genes
1998  Caenorhabditis elegans   First multicellular organism  97,000,000        ~19,000
1999  Human chromosome 22      First human chromosome        49,000,000        673
2000  Drosophila melanogaster  First insect                  150,000,000       ~14,000
2000  Arabidopsis thaliana     First plant genome            150,000,000       ~25,000
11
Now, 1366 genomes are sequenced or being sequenced
12
Between human and other animals
How much do the human and chimp genomes differ? 0.1% / 1% / 10% / 50% / 90%
How many genes do you think we share with the worm? 1% / 10% / 50% / 75% / 99%
13
Genome and better food
14
Basic understanding of science & environment
15
Our research interest TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAA TAGTAACTTATTCACAGTTTACTGTAACAATGAATATTTCAAAAAACATATAAAATCTAAGTATGGAAATCTTATTTTATCAACAATCCAAGAGTGTCATGGTAATGATTTAATTATTGAATATTCTAATAAAAAATTCT CTGGCGAAAAAATTACTGAGGTTATCACAGCTGGACCACAAGCTAATTTTTTTAGCACAACAAGTGTTGAGATAAAAGATGAATCAGAAGATACAAAAGTAGTACAAGAACCTAAAATATCAAAGAAGTCTAATAGTAAA GACTTTTCTTCATCACAAGAGTTATTCGGTTTTGACGAAGCTATGCTAATTACAGCAAAAGAAGATGAGGAATACTCTTTTGGTTTACCGTTAAAAGAAAAATATGTTTTTGATAGTTTTGTTGTTGGAGATGCTAACAA AATTGCTAGAGCAGCGGCTATGCAGGTATCGATAAATCCAGGTAAATTACATAACCCTTTATTCATTTATGGTGGTAGTGGTTTAGGTAAAACTCACTTAATGCAAGCAATAGGTAATCATGCAAGAGAAGTTAATCCTA ATGCCAAAATTATTTATACAAATTCAGAACAATTTATTAAAGATTATGTAAATTCTATTCGTTTACAAGATCAAGATGAGTTTCAAAGAGTTTATAGATCTGCGGATATACTTTTGATTGATGATATTCAATTTATCGCT GGTAAAGAGGGTACTGCTCAGGAGTTTTTCCATACTTTTAATGCATTGTATGAAAATGGTAAACAGATAATTCTAACTAGTGATAAGTATCCAAATGAAATAGAAGGGCTTGAAGAAAGACTAGTTTCGCGTTTTGGTTA TGGTTTAACAGTTTCTGTTGATATGCCAGATTTAGAAACCAGAATTGCTATCTTGCTCAAAAAAGCTCATGATTTAGGTCAGAAATTACCTAACGAAACAGCAGCTTTTATTGCTGAGAATGTACGTACTAATGTCAGAG AACTAGAAGGTGCTCTAAATAGGGTTCTTACTACCTCTAAATTTAATCATAAAGATCCTACTATCGAAGTAGCACAAGCTTGCTTAAGAGATGTTATAAAAATACAAGAAAAGAAAGTAAAAATAGATAATATCCAAAAG GTTGTTGCTGATTTTTATAGAATCAGGGTAAAAGATTTAACTTCTAATCAAAGAAGTAGAAATATAGCTAGACCAAGACAGATAGCAATGAGTTTAGCACGTGAACTAACATCACATAGTTTGCCAGAAATAGGCAATGC TTTTGGTGGTAGAGACCATACGACAGTTATGCATGCTGTCAAAGCTATAACTAAATTAAGACAAAGCAATACTTCAATATCGGATGATTATGAGTTGCTTTTAAATAAAATTTCTCGTTAAATAAAATTAGTAACTTTAT CAAAGGGGTTTTAAAAAATGAATTTTGTACTAAATAGAGATGACTTACTAAAGCCTTTGCAATCTATGCTCTCAGTTGCAAATAGTAAGAGTACAATGCCTTTATTATCATGTATCTTATTTGATATTGATAATAATAAT CTCAAAATTACGGCTTCGGATCTTGATACAGAGATATCATGCAATATAGCAGTTAGTTGTAACACAACTATTAAGTTAGCATTAAATGCTGACAAAATTTATAACATTGTCAGAAGCTTAAATGAAAATTCAATGATTGA TTTTAGAATTAATGAAAATAAGGTAACTATTGTTTCTAATAATAGTACTTTTAACCTTATATCACTAAATGCTGACAACTATCCTCTTATTGATAGTAATATCAATGAGCAAGCAAGTTTTGATCTTTCTCAACAAGATT 
TTCATCATATTATTTCAAAAGTAGATTTCTCAATGGCTAATGATGATACTCGATATTTCTTAAATGGGATGTTTTGGGAAATCAACGCAAATCTACTAAGAGCAGTATCTACAGATGGTCATAGAATGTCTATCACAGAG GCTATAATTGATAGTAAAGTGTTAGATAGTGCTTCTCAGTCGATAATTCCAAAAAAAGCGATTTTAGAGCTTAAAAAGATAGTTGGCAAAACAGAAGAAAATATCAAAATTTGTCTTGGCAAAAATTATCTAAAAGCGAT TTTTGGTAATTATGCTTTTATATCAAAGCTTATAGATGGTCGCTATCCTGATTACCAAAAAGTAATCCCTAAAAATAATACAAAACTATTAGCAGTTGATAAGCAGTTTTTCAAAAATTCATTATTAAGAACATCAATAC TTGCTAATGATAAATATAAAGGTGTTCGTCTTAACATATCTCAAAATCAATTACTTCTATCAGCTAATAACCCTGATAATGAAAAGGCTGAAGATAAAATCGAAGTTCAATATAATGATCAACCAATGGAAATTTGTTTT AATTACAAATATCTTTTGGATATTATAAATGTACTTAGTGAAGAAACTATGTCTATCTACCTTGATAATCCAAATATGAGTGCTTTAGTTAAAGATGAGAAAGATAATAGTTTGTTTATTATTATGCCAATGAAAATTTA AGTAATAAGTAGTTTTAGGAAATAACTATTTTTATAAGCCTTTTGGAATGAATAATAAAGCAATAAAAAAAGGTATGCATAAAAACATTATATAGAAAGCTGGGATTAGATAATTTCCAGTAGTAGTAATTAATAAAGTC ATAAGAAAGGCAACAGTACCTCCAAATAAAGAAACGCTTATATTAAAACTTATTCCAAAACCAGTATTTCTATTTCTAACAGGAAATAGCCCTGCCGTATTTGCAAATATAGGTCCTATTACAGCACCACTTATAATAGC AAGAGAAAAAATAGCTATAGATACTAATTGATGATTTTTTATAATAATTTGGTATATTGGTAAAACAGCTATAAATAAGACTATACAAGAATACATCAGAACTTTTTTACCACCAATTCTATCAGCAATATATCCAAATA TAATTGAAGAAAGCATTAATACTATAGTTAATCCGAGAGTATTTTGTTAAACACATAAGAGATAGCATCACAAAATTTTTCAAAACTATTATTCACTTTTCTAAATATTTTTTTAAAGTTAGCCCAAACCTTTTCAATAG GATTTAAATCTGGAGAATACGGAGGTAGATATAATATTTGTACATCAAATTTATTGGCTATTTCAATCAGCTTAGAGGATTTATGGAAACTAGCATTATCCATTACTATAGTAGTTTTAGGTTTTAATGATGGGCATAAG TGTTCCTCAAACCATTGATTAAAAATTTCAGTATTGGTATATCCACTGTACTCTAATGGAGCTATAATCTTTTTATCTGCATAATTATATCCAGCAACAATACTTCTTCTTTGTGTTTGATATGCTAAAACCTCACCATA ACTAGGCTCACCAATTAGTGACCATCCTCTTAGGATAGAAAGCTTATTGTCACACCCCATCTCATCTATATAAAATAACAAGTTTTGAGCTATTTCTTTTAGTTTTTCTATATACTCCAACCTTTCATGTTCTTTTCTTT GCTTATATTTTGGAGTCTTTTTTTAAAACTAAAACCAAGTCTATTAAGACAATCATAAAATGTACTTCTTGGAATATCAGGGGCTAATGCTTCTTTTATATCTAATGCACTTGCATCTGGATGATCTATCAAATACTGTT CAATCAATGTTTTATCGGTAAAGCTAGCGACTCTGCCACAACCAACTCCTTGCTTTGAACTATAATCTCCGGTTCTTTTATAAAACTCTATCCATGAAACAACTGTACGCTTATCTATGTTAAAAAACTTACTCAGCTCG 
AACTCCGTCATACCTTCTTCATATTTATTAATTACGATGTCTCTAAAATATTGGCTATATGATGGCATTTTTATTAGACATTATAACATTTCTACAAATATCTTTTTCTACAAATATCTTTCGGATTAACTATATAAGTA GAGTCAACAACCATCCAAATCACCCAATTATCTATAATTTTCTGCTTGCTAAAAAAACGCATACCAATGATGCTACACTTGTAAAACCATCCATATATGGCGTTGTTGAATCAGTATAAAATATAAGTAGTTGCGAAACT AGTAACCCAAATACTACAATGCTTACTAGAACTTTTAACCAACCAATGATTTTAAGTCTATGAACAACTATCTTTTTGTGACTAAAGTTGGGTTGCCAACTATACCAACCGTATCCAAAGCTAAATAAAAGAATCATCTG CAATATAGCATCGGCATATAGTCCACTAACAGAAAATAAACCCGCACTCATGATCAAACCAACTATCTCCACAGGCCAACCAATGACATAAAGCCTTGCCAGCAAAAAGGTACACAAAAGATTAACAATCATTGTACAAA AATCAAAAATATGCAGCATATTTATTTTACTAATCAAAGTATTATAAATATTATAATAACTTTGAAGTTGGCGTATTAAAGCCATAAACTTTAGTAGGTTAGTGTTTATACCAATATTTTGAGATGCTTTCTGCAAGCTA ATAACATTTAGCTATCTAGCCTAAATAATTAATATACAAAACTTTCAAGCTTATTGAATTTTTCAACAGATACAGCGCGTTATAACAAATAAGTAATTGACTAAATTAAAAAGCAAGTATAATATCGATTGTGTTTATTA CATAATATAAAACGAGGATAAAAAAAATATGAAATTAAGAAAAGTATTAATCGCGACATTATTAGGAGCTTCTGCTTTATCTTTAAGTAGTTGTTGGTTACTTGTTGGTGCAGCTGTTGGTGGTGGAACTGCTGCGTATA TTTCTGGTGAGTATTCAATGAATATGAGTGGCAGTGTAAAAGATATTTACAATGCTACTTTAAAAGCTGTTCAAAGCAATGATGATTTTGTAATTACTAAAAAATCTATTACTTCTGTTGATGCAGTTGTTGATGGTAGT ACTAAGGTAGACTCAACAAGTTTCTATGTTAAAATAGAAAAACTTACTGATAATGCTTCAAAAGTTACAATTAAGTTTGGTACTTTTGGTGACCAAGCAATGTCAGCAACATTAATGGATCAAATCCAAAAGAATCTTTA ATTAAATAGGTAATTACTATAATGACTTTTCTAAAGAAAGCTTTTATTGCAACTATAGTTTCTATTTCAGCATTAGTTCTAAATAGTTGTATTGTTGCAGCAATAGCTGTTGGTGGTGGAACAGTTGCCTATATTGATGG AAATTATTTTATGAATATAGAAGGCAACTATAAAGCTGTCTATAAAGCTACTCTTAAAGCTATTAATGATAATAATGACTTTGTTCTAGTATCAAAAGATCTTGATCAAACAAAGCAAAATGCCGACATTGAAGGTGCTA CTAAAATTGATAGTACGAGTTTTAGTGTCAAAATTGAAAGACTGACAGATCAGGCTACTAAAGTGACAATCAAATTTGGTACTTTTGGCGATCAAGCAATGTCATCAACATTAATGGATCAGATCCAGGCAGCTGTACAT AAAGCTTCTTAGAAATGTACAAAAAACTCTACTTAATTATATTATCCACAATAATCGCAATCTCTCTTAATAGTTGTGTTGTTGCCGCTGTTTTAATTGGTACAGCAGTTGTTGCTGGAGGTACAGTATATTACATCAAT GGTAACTATATAATCGAAGTCCCTAAAGATATTAGAAGTGTATACAATGCTACAATCAAGACTATACAGATGGATAGTCAAAATAAACTAATAAGTCAAACCTATAATACTAAATCTGCTATAATTAAAGCTTTACAAAA 
AGGTGAAAAAATTAGTATAGATTTAAGCAATATTGATAGTCGTTCAACAGAGATAAAAATTCGTATAGGTGTACTTGGCGATGAGAAAAAATCTGCTGATTTAGCAAACTCAATAACAAAAAATATCACCTAAGCAATAT TTCTCGAACTTTGGTTAACTTTTTCTTTTTAAAAACTTTCAAAAATGTATAATTTGTGTTAGTTTGCAAACTACCCTTATATCCATAATGAGTAATAAGGTATTAGATACATATTATAAAAACAATCGACATATTTGGGT GCTAGTACTATCTGGTGCTGTTATAGGCACAATGATTGGTCTTCTAGCAACAGCATTTCAGCTACTCCTAGACTTTATTTTTAAAATTAAGCTGGCTCTTTTTTCTTTCAGTGGTGGTAATCTTTTTATCGAAATCGCTA TGTCAATATCATTAAGTATTGTGATGGTATTAATTTCGATTTTTATTGTTAAAAAATTTGCGAAAGAGGCTGGTGGTAGCGGTATCCAAGAGGTTGAGGGTGCTTTAAAAGGCTGCCGCAAAATACGTAAAAGAGTTATG CCCGTGAAGTTTATAAGTGGACTTTTTTCGTTAGGCTCAGGTTTAAGTTTAGGTAAAGAGGGACCATCAATTCATATGGCTGCTGCATTAGCGCAGTTTTTTGTTGATAAATTTAAACTTACTACAAAATATGCTAATGC GGTTATCTCTGCTGGGGCTGGAGCTGGACTAGCAGCTGCTTTTAATACCCCACTTTCTGGGATTATCTTTGTTATTGAAGAGATGAATAGAAAGTTTAGATTTAGTGTTTCGGCAATAAAGTGTGTGCTAGTAGCATGTA TCATGAGTACAGTTATCTCTAGAGCTATTATGGGTAATCCTCCAGCAATACGCGTAGAAACTTTCAGCTCAGTACCACAAAATACTCTTTGGTTATTTATGGTATTAGGGATTATATTTGGTTATTTTGGTTTACTATTT AACAAATCCTTAATCAAAGTGGCAAACTTTTTCTCAGAAGGCTCCAAGAAGAGGTATTGGACTTTAGTTATAATTGTTTGCATAATTTTTGGTATTGGTGTTGTTCTATCTCCAAATGCTGTTGGCGGTGGCTATATTGT CATAGCAAATACTCTTGATTATAACTTATCAATCAAGATGCTTTTAGTGCTTTTTGTACTTCGTTTTGCTGGAGTTATTTTCTCATATGGCACCGGCGTTACTGGTGGGATATTCGCACCAATGATTGCGCTTGGTACTG
16
Evolution of genome sizes
C-value: 1 pg ≈ 0.98 Gb
Thale cress (Arabidopsis thaliana): 0.16 pg
Fruit fly (Drosophila melanogaster): 0.18 pg
Pufferfish (Takifugu rubripes): 0.4 pg
Human (Homo sapiens): 3.5 pg
Onion (Allium cepa): 16.75 pg
Tiger salamander (Ambystoma tigrinum): 32 pg
Marbled lungfish (Protopterus aethiopicus): 132 pg
http://www.rbgkew.org.uk/
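To put the C-values on this slide on a common base-pair scale, a small conversion helper is enough. This is a minimal sketch: the function and dictionary names are illustrative. The conversion factor of about 0.978 Gb per pg is an assumption (the commonly cited value) and is exposed as a parameter so a different factor can be substituted.

```python
# C-values (pg) as listed on the slide
C_VALUES_PG = {
    "Arabidopsis thaliana": 0.16,
    "Drosophila melanogaster": 0.18,
    "Takifugu rubripes": 0.4,
    "Homo sapiens": 3.5,
    "Allium cepa": 16.75,
    "Ambystoma tigrinum": 32,
    "Protopterus aethiopicus": 132,
}

def pg_to_gb(c_value_pg, gb_per_pg=0.978):
    """Convert a C-value in picograms to gigabases (gb_per_pg is an assumed factor)."""
    return c_value_pg * gb_per_pg

for species, pg in C_VALUES_PG.items():
    print(f"{species}: {pg} pg = {pg_to_gb(pg):.2f} Gb")

# The fold range between the largest and smallest genome does not depend on the factor
fold_range = max(C_VALUES_PG.values()) / min(C_VALUES_PG.values())
print(f"Largest / smallest genome: {fold_range:.0f}-fold")
```

The fold range is the point of the slide: genome size varies by nearly three orders of magnitude among these species.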
17
Genic region and genome size Dan Graur
18
What's in the genome
Genome:
  Annotated genes: exon, UTR, intron
  Cis-regulatory elements
  Selfish elements
  Novel genes
  Dead genes (pseudogenes)
19
"Non-genic": repetitive elements
E.g. Human genome
Question                       A    B    C
Exons take up?                 1%   24%  25%
Introns account for?           24%  1%   25%
Repetitive elements occupy?    35%  60%  45%
Unknown?                       40%  15%  5%
Venter et al. (2001) Science 291:1304
20
What is in the unknown regions?
Investigate with tiling arrays (figure: cDNA array vs. tiling array)
Tiling array: gap size 10 bp, probe size 25 bp
Number of features:
  Arabidopsis, 135 Mb: 1 chip, ~6 x 10^6 features
  Human, 3 Gb: 7 chips, ~4.2 x 10^7 features
21
"Non-genic": unannotated genes Kapranov et al., 2002. Science Tiling array analysis of human Chr 21, 22
22
Tiling array analysis of human transcriptome
Kapranov et al., 2002. Science
Human Chr 21, 22
What do you think these expressed regions represent?
23
Difficulties for coding gene prediction
Training data: you need to know something...
"Biased" toward the properties of the majority.
Real genes that are shorter tend to be much harder to predict.
Table 3. Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)
Gene finder  Cor   Sn    Sn fk (%)  Sp
GISMO        0.64  63.0  86.4       69.0
Glimmer      0.54  72.0  83.7       44.0
CRITICA      0.60  46.0  67.4       84.0
Sn fk denotes the sensitivity in detecting function-known genes.
Krause et al., 2006. Nucleic Acids Res. 35:540
24
Novel coding sequence identification
Arabidopsis thaliana as an example: 135 Mb, ~50% occupied by annotated genes.
Focus on coding sequences 90-300 bp long: 133,090 sORFs
What would you do next to eliminate ORFs that are likely false predictions?
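The slide does not say how the sORFs were enumerated, so the following is only an illustrative sketch of one common approach: scan all six reading frames for ATG-to-stop open reading frames and keep those 90-300 bp long. The function names, the choice of the first ATG after the previous stop, and counting the stop codon in the length are all assumptions, not the authors' pipeline.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_sorfs(seq, min_len=90, max_len=300):
    """Find ATG..stop ORFs whose length (stop codon included) is within bounds.

    Returns (strand, start, end, orf) tuples; start/end are 0-based,
    end-exclusive coordinates on the strand that was scanned.
    """
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if min_len <= i + 3 - start <= max_len:
                        hits.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return hits

# Toy example: a 96 bp ORF (ATG + 30 alanine codons + stop)
toy_orf = "ATG" + "GCT" * 30 + "TAA"
toy_hits = find_sorfs(toy_orf)
print(len(toy_hits), "sORF found")
```

A scan like this is deliberately permissive, which is why the next slides apply further criteria to remove likely false predictions.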
25
Criterion 1: Codon usage bias Some codons are used more frequently than others http://www.cbs.dtu.dk/services/GenomeAtlas/
26
Criterion 1: Codon usage bias
For example: codons for proline
Codon  NCDS  CDS
CCT    0.25  0.12
CCC    0.25  0.49
CCA    0.25  0.06
CCG    0.25  0.33
Suppose the following two sequences both code for poly-proline. Which one is more likely to be a real coding sequence?
Seq1: CCT CCA CCT
Seq2: CCC CCG CCC
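One way to answer the question on this slide is to compare how likely each sequence is under the coding (CDS) and non-coding (NCDS) codon frequencies. The sketch below reads the slide's numbers as NCDS followed by CDS for each proline codon and treats codons as independent; the function and variable names are illustrative.

```python
from math import prod

# Proline codon frequencies from the slide
CDS = {"CCT": 0.12, "CCC": 0.49, "CCA": 0.06, "CCG": 0.33}
NCDS = {"CCT": 0.25, "CCC": 0.25, "CCA": 0.25, "CCG": 0.25}

def likelihood(codons, freqs):
    """Probability of a codon string, assuming codons are independent."""
    return prod(freqs[c] for c in codons)

seq1 = ["CCT", "CCA", "CCT"]
seq2 = ["CCC", "CCG", "CCC"]

# Likelihood ratio > 1 favours coding; < 1 favours non-coding
ratio1 = likelihood(seq1, CDS) / likelihood(seq1, NCDS)
ratio2 = likelihood(seq2, CDS) / likelihood(seq2, NCDS)

print("Seq1 CDS/NCDS likelihood ratio:", ratio1)
print("Seq2 CDS/NCDS likelihood ratio:", ratio2)
```

Seq2 uses the codons that coding sequences prefer, so its ratio is above 1, while Seq1's ratio is well below 1.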
27
Novel CDS identification
Posterior probability calculation
Bayes' theorem
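The slide names Bayes' theorem without spelling it out. As a minimal sketch, the posterior probability that a sequence is coding can be written as P(CDS | seq) = P(seq | CDS) P(CDS) / [P(seq | CDS) P(CDS) + P(seq | NCDS) P(NCDS)]. The function name and the default prior of 0.5 are assumptions for illustration; the slide does not state which prior was used.

```python
def posterior_cds(lik_cds, lik_ncds, prior_cds=0.5):
    """Posterior probability of 'coding' given the two likelihoods.

    lik_cds   -- P(sequence | coding model)
    lik_ncds  -- P(sequence | non-coding model)
    prior_cds -- P(coding) before seeing the sequence (assumed, not from the slide)
    """
    numerator = lik_cds * prior_cds
    denominator = numerator + lik_ncds * (1.0 - prior_cds)
    return numerator / denominator

# Example: likelihoods of CCC CCG CCC under the proline codon frequencies
# shown on the codon usage slide (0.49 * 0.33 * 0.49 and 0.25 ** 3)
p = posterior_cds(0.49 * 0.33 * 0.49, 0.25 ** 3)
print(f"P(CDS | sequence) = {p:.3f}")
```

For realistic sequence lengths the likelihoods become very small, so an implementation would normally work with log-likelihoods instead of raw products.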
28
Novel CDS identification
Determine base composition probabilities
Coding sequences -> CDS parameters (feature tables c1 c2 c3 c4 c5 c6)
Non-coding sequences -> NCDS parameters (feature table n)
Feature table excerpt (columns A, T, G, C; values as they appear in the transcript):
AAA  0.01 0.31 0.03 0.02
AAT  0.03 0.17 0.01 0.15
AAG  0.02 0.00 0.02
AAC  0.15 0.02 0.05
ATA  0.04 0.03 0.05 0.00
ATT  0.06 0.01 0.07 0.02
ATC  0.01 0.05 0.29
...
29
Posterior probability of coding sequence Compare known non-coding and coding sequences Hanada et al., 2007. Genome Res.
30
Posterior probability of coding sequence Scanning Arabidopsis genome Hanada et al., 2007. Genome Res.
31
After applying the first criterion 7,442 coding sORFs
32
How good is the CDS-finding measure?
For the training data
For 18 Arabidopsis small protein genes: all 18 are predicted as CDS.
For 84 yeast small protein genes: all 84 are predicted as CDS.
33
So what does this mean?
If a sequence is a true coding sequence, our approach can predict it with high accuracy.
So, the sensitivity is very good.
Is this good enough? What about specificity? Namely, how good are the criteria at excluding false positives?
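The distinction this slide draws can be made concrete with the two definitions. The sketch below uses the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the function names are illustrative. Only sensitivity is evaluated, because the slides give counts of known true genes (18 Arabidopsis, 84 yeast) but no counts of known non-coding sequences.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real coding sequences that are predicted as coding."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of non-coding sequences that are correctly rejected."""
    return true_neg / (true_neg + false_pos)

# Counts from the previous slide: every known small protein gene was predicted as CDS
arabidopsis_sn = sensitivity(true_pos=18, false_neg=0)
yeast_sn = sensitivity(true_pos=84, false_neg=0)

print("Arabidopsis sensitivity:", arabidopsis_sn)
print("Yeast sensitivity:", yeast_sn)
```

Perfect sensitivity says nothing about false positives: a predictor that calls every ORF coding would score the same, which is why the following slides add expression and selection criteria.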
34
Criterion 2: Expression
What expression level would you expect for true CDS compared to false CDS?
Tiling array: gap size 10 bp, probe size 25 bp
(Figure axes: expression level vs. frequency)
35
Comparison of expression levels
A: Exon
B: Intron
C: Predicted novel CDS
D: tRNA
E: rRNA
36
Applying the second criterion
Predictions are significantly enriched in expressed sequences
2,996 transcribed sORFs
37
Criterion 3: Purifying selection Compare known coding and non-coding sequences
38
Criterion 3: Purifying selection Compare known coding and non-coding sequences
39
Our research interests 30,000 25,000 10,000 6,000 45,000 17,000
40
Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, preferential retention, consequences
41
Duplication mechanisms
Whole genome duplication
Tandem duplication
Segmental duplication
Duplicative transposition
42
Differences in Duplicability
Duplicability: the propensity for the retention of a duplicate gene
Computational analysis of genome-wide trend
Categories compared (Arabidopsis vs. human):
Defense response
Proteolysis
Transport
Ion channel activity
Metabolism
Development
Protein kinase activity
Transcription factor activity
43
Functional Consequences of Duplication
Functional divergence and conservation
Is it because of changes in cis-regulatory elements or in coding sequences?
How are duplicates retained: subfunctionalization or neofunctionalization?
44
Acknowledgement
Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou
TIGR: Chris Town, Hank Wu
University of Chicago: Wen-Hsiung Li, Justin O. Borevitz, Xu Zhang
Funding