Dark matters in the genomes Shin-Han Shiu Plant Biology / Genetics / EEBB / QBMI.

1 Dark matters in the genomes Shin-Han Shiu Plant Biology / Genetics / EEBB / QBMI

2 About myself

3

4

5

6 Cell, nucleus, and chromosomes

7 DNA AGGCGTAGAGAGATCCTTGAT TCCGCAACTCTCAAGGAACAA

8 DNA and Genome
• Genome is all the DNA in a cell, made up of A, T, G, C...
• How many A's, T's, G's, and C's are there in the human genome? 3,200,000,000 letters
• A sizable book, say, The Lord of the Rings: The Fellowship of the Ring: 764,470 characters in 410 pages, ~2,000 characters per page
• The book of our life: 1,600,000 pages, or 4,186 copies of The Fellowship of the Ring
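The book comparison on this slide is plain division, and a short sketch makes the arithmetic easy to re-check. The constant and function names below are made up for illustration; the numbers are the ones quoted on the slide.

```python
# Figures quoted on the slide
GENOME_LETTERS = 3_200_000_000   # letters (bases) in the human genome
BOOK_CHARS = 764_470             # characters in the book
BOOK_PAGES = 410                 # pages in the book
CHARS_PER_PAGE = 2_000           # the slide's rounded page density


def genome_as_book(letters=GENOME_LETTERS):
    """Return (pages, copies) needed to print `letters` characters."""
    pages = letters // CHARS_PER_PAGE
    copies = round(letters / BOOK_CHARS)
    return pages, copies


pages, copies = genome_as_book()
print(f"{pages:,} pages, about {copies:,} copies of the book")
print(f"unrounded page density: {BOOK_CHARS / BOOK_PAGES:.0f} characters per page")
```

The unrounded density is a little under the slide's ~2,000 characters per page, so the page count is a rounded estimate rather than an exact figure.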

9 Genome sequencing chronology
Year | Organism | Significance | Genome size (bp) | Number of genes
1977 | Bacteriophage φX174 | First genome ever! | 5,386 | 11
1981 | Human mitochondria | First organelle | 16,500 | 37
1995 | Haemophilus influenzae Rd | First free-living organism | 1,830,137 | ~3,500
1996 | Saccharomyces cerevisiae | First eukaryote | 12,086,000 | ~6,000

10 Genome sequencing chronology
Year | Organism | Significance | Genome size (bp) | Number of genes
1998 | Caenorhabditis elegans | First multicellular organism | 97,000,000 | ~19,000
1999 | Human chromosome 22 | First human chromosome | 49,000,000 | 673
2000 | Drosophila melanogaster | First insect | 150,000,000 | ~14,000
2000 | Arabidopsis thaliana | First plant genome | 150,000,000 | ~25,000

11 Now, 1,366 genomes have been sequenced or are being sequenced

12 Between human and other animals
• How much do the human and chimp genomes differ?
  Options: 0.1%, 1%, 10%, 50%, 90%
• How many genes do you think we share with the worm?
  Options: 1%, 10%, 50%, 75%, 99%

13 Genome and better food

14 Basic understanding of science & environment

15 Our research interest TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAA TAGTAACTTATTCACAGTTTACTGTAACAATGAATATTTCAAAAAACATATAAAATCTAAGTATGGAAATCTTATTTTATCAACAATCCAAGAGTGTCATGGTAATGATTTAATTATTGAATATTCTAATAAAAAATTCT CTGGCGAAAAAATTACTGAGGTTATCACAGCTGGACCACAAGCTAATTTTTTTAGCACAACAAGTGTTGAGATAAAAGATGAATCAGAAGATACAAAAGTAGTACAAGAACCTAAAATATCAAAGAAGTCTAATAGTAAA GACTTTTCTTCATCACAAGAGTTATTCGGTTTTGACGAAGCTATGCTAATTACAGCAAAAGAAGATGAGGAATACTCTTTTGGTTTACCGTTAAAAGAAAAATATGTTTTTGATAGTTTTGTTGTTGGAGATGCTAACAA AATTGCTAGAGCAGCGGCTATGCAGGTATCGATAAATCCAGGTAAATTACATAACCCTTTATTCATTTATGGTGGTAGTGGTTTAGGTAAAACTCACTTAATGCAAGCAATAGGTAATCATGCAAGAGAAGTTAATCCTA ATGCCAAAATTATTTATACAAATTCAGAACAATTTATTAAAGATTATGTAAATTCTATTCGTTTACAAGATCAAGATGAGTTTCAAAGAGTTTATAGATCTGCGGATATACTTTTGATTGATGATATTCAATTTATCGCT GGTAAAGAGGGTACTGCTCAGGAGTTTTTCCATACTTTTAATGCATTGTATGAAAATGGTAAACAGATAATTCTAACTAGTGATAAGTATCCAAATGAAATAGAAGGGCTTGAAGAAAGACTAGTTTCGCGTTTTGGTTA TGGTTTAACAGTTTCTGTTGATATGCCAGATTTAGAAACCAGAATTGCTATCTTGCTCAAAAAAGCTCATGATTTAGGTCAGAAATTACCTAACGAAACAGCAGCTTTTATTGCTGAGAATGTACGTACTAATGTCAGAG AACTAGAAGGTGCTCTAAATAGGGTTCTTACTACCTCTAAATTTAATCATAAAGATCCTACTATCGAAGTAGCACAAGCTTGCTTAAGAGATGTTATAAAAATACAAGAAAAGAAAGTAAAAATAGATAATATCCAAAAG GTTGTTGCTGATTTTTATAGAATCAGGGTAAAAGATTTAACTTCTAATCAAAGAAGTAGAAATATAGCTAGACCAAGACAGATAGCAATGAGTTTAGCACGTGAACTAACATCACATAGTTTGCCAGAAATAGGCAATGC TTTTGGTGGTAGAGACCATACGACAGTTATGCATGCTGTCAAAGCTATAACTAAATTAAGACAAAGCAATACTTCAATATCGGATGATTATGAGTTGCTTTTAAATAAAATTTCTCGTTAAATAAAATTAGTAACTTTAT CAAAGGGGTTTTAAAAAATGAATTTTGTACTAAATAGAGATGACTTACTAAAGCCTTTGCAATCTATGCTCTCAGTTGCAAATAGTAAGAGTACAATGCCTTTATTATCATGTATCTTATTTGATATTGATAATAATAAT CTCAAAATTACGGCTTCGGATCTTGATACAGAGATATCATGCAATATAGCAGTTAGTTGTAACACAACTATTAAGTTAGCATTAAATGCTGACAAAATTTATAACATTGTCAGAAGCTTAAATGAAAATTCAATGATTGA TTTTAGAATTAATGAAAATAAGGTAACTATTGTTTCTAATAATAGTACTTTTAACCTTATATCACTAAATGCTGACAACTATCCTCTTATTGATAGTAATATCAATGAGCAAGCAAGTTTTGATCTTTCTCAACAAGATT 
TTCATCATATTATTTCAAAAGTAGATTTCTCAATGGCTAATGATGATACTCGATATTTCTTAAATGGGATGTTTTGGGAAATCAACGCAAATCTACTAAGAGCAGTATCTACAGATGGTCATAGAATGTCTATCACAGAG GCTATAATTGATAGTAAAGTGTTAGATAGTGCTTCTCAGTCGATAATTCCAAAAAAAGCGATTTTAGAGCTTAAAAAGATAGTTGGCAAAACAGAAGAAAATATCAAAATTTGTCTTGGCAAAAATTATCTAAAAGCGAT TTTTGGTAATTATGCTTTTATATCAAAGCTTATAGATGGTCGCTATCCTGATTACCAAAAAGTAATCCCTAAAAATAATACAAAACTATTAGCAGTTGATAAGCAGTTTTTCAAAAATTCATTATTAAGAACATCAATAC TTGCTAATGATAAATATAAAGGTGTTCGTCTTAACATATCTCAAAATCAATTACTTCTATCAGCTAATAACCCTGATAATGAAAAGGCTGAAGATAAAATCGAAGTTCAATATAATGATCAACCAATGGAAATTTGTTTT AATTACAAATATCTTTTGGATATTATAAATGTACTTAGTGAAGAAACTATGTCTATCTACCTTGATAATCCAAATATGAGTGCTTTAGTTAAAGATGAGAAAGATAATAGTTTGTTTATTATTATGCCAATGAAAATTTA AGTAATAAGTAGTTTTAGGAAATAACTATTTTTATAAGCCTTTTGGAATGAATAATAAAGCAATAAAAAAAGGTATGCATAAAAACATTATATAGAAAGCTGGGATTAGATAATTTCCAGTAGTAGTAATTAATAAAGTC ATAAGAAAGGCAACAGTACCTCCAAATAAAGAAACGCTTATATTAAAACTTATTCCAAAACCAGTATTTCTATTTCTAACAGGAAATAGCCCTGCCGTATTTGCAAATATAGGTCCTATTACAGCACCACTTATAATAGC AAGAGAAAAAATAGCTATAGATACTAATTGATGATTTTTTATAATAATTTGGTATATTGGTAAAACAGCTATAAATAAGACTATACAAGAATACATCAGAACTTTTTTACCACCAATTCTATCAGCAATATATCCAAATA TAATTGAAGAAAGCATTAATACTATAGTTAATCCGAGAGTATTTTGTTAAACACATAAGAGATAGCATCACAAAATTTTTCAAAACTATTATTCACTTTTCTAAATATTTTTTTAAAGTTAGCCCAAACCTTTTCAATAG GATTTAAATCTGGAGAATACGGAGGTAGATATAATATTTGTACATCAAATTTATTGGCTATTTCAATCAGCTTAGAGGATTTATGGAAACTAGCATTATCCATTACTATAGTAGTTTTAGGTTTTAATGATGGGCATAAG TGTTCCTCAAACCATTGATTAAAAATTTCAGTATTGGTATATCCACTGTACTCTAATGGAGCTATAATCTTTTTATCTGCATAATTATATCCAGCAACAATACTTCTTCTTTGTGTTTGATATGCTAAAACCTCACCATA ACTAGGCTCACCAATTAGTGACCATCCTCTTAGGATAGAAAGCTTATTGTCACACCCCATCTCATCTATATAAAATAACAAGTTTTGAGCTATTTCTTTTAGTTTTTCTATATACTCCAACCTTTCATGTTCTTTTCTTT GCTTATATTTTGGAGTCTTTTTTTAAAACTAAAACCAAGTCTATTAAGACAATCATAAAATGTACTTCTTGGAATATCAGGGGCTAATGCTTCTTTTATATCTAATGCACTTGCATCTGGATGATCTATCAAATACTGTT CAATCAATGTTTTATCGGTAAAGCTAGCGACTCTGCCACAACCAACTCCTTGCTTTGAACTATAATCTCCGGTTCTTTTATAAAACTCTATCCATGAAACAACTGTACGCTTATCTATGTTAAAAAACTTACTCAGCTCG 
AACTCCGTCATACCTTCTTCATATTTATTAATTACGATGTCTCTAAAATATTGGCTATATGATGGCATTTTTATTAGACATTATAACATTTCTACAAATATCTTTTTCTACAAATATCTTTCGGATTAACTATATAAGTA GAGTCAACAACCATCCAAATCACCCAATTATCTATAATTTTCTGCTTGCTAAAAAAACGCATACCAATGATGCTACACTTGTAAAACCATCCATATATGGCGTTGTTGAATCAGTATAAAATATAAGTAGTTGCGAAACT AGTAACCCAAATACTACAATGCTTACTAGAACTTTTAACCAACCAATGATTTTAAGTCTATGAACAACTATCTTTTTGTGACTAAAGTTGGGTTGCCAACTATACCAACCGTATCCAAAGCTAAATAAAAGAATCATCTG CAATATAGCATCGGCATATAGTCCACTAACAGAAAATAAACCCGCACTCATGATCAAACCAACTATCTCCACAGGCCAACCAATGACATAAAGCCTTGCCAGCAAAAAGGTACACAAAAGATTAACAATCATTGTACAAA AATCAAAAATATGCAGCATATTTATTTTACTAATCAAAGTATTATAAATATTATAATAACTTTGAAGTTGGCGTATTAAAGCCATAAACTTTAGTAGGTTAGTGTTTATACCAATATTTTGAGATGCTTTCTGCAAGCTA ATAACATTTAGCTATCTAGCCTAAATAATTAATATACAAAACTTTCAAGCTTATTGAATTTTTCAACAGATACAGCGCGTTATAACAAATAAGTAATTGACTAAATTAAAAAGCAAGTATAATATCGATTGTGTTTATTA CATAATATAAAACGAGGATAAAAAAAATATGAAATTAAGAAAAGTATTAATCGCGACATTATTAGGAGCTTCTGCTTTATCTTTAAGTAGTTGTTGGTTACTTGTTGGTGCAGCTGTTGGTGGTGGAACTGCTGCGTATA TTTCTGGTGAGTATTCAATGAATATGAGTGGCAGTGTAAAAGATATTTACAATGCTACTTTAAAAGCTGTTCAAAGCAATGATGATTTTGTAATTACTAAAAAATCTATTACTTCTGTTGATGCAGTTGTTGATGGTAGT ACTAAGGTAGACTCAACAAGTTTCTATGTTAAAATAGAAAAACTTACTGATAATGCTTCAAAAGTTACAATTAAGTTTGGTACTTTTGGTGACCAAGCAATGTCAGCAACATTAATGGATCAAATCCAAAAGAATCTTTA ATTAAATAGGTAATTACTATAATGACTTTTCTAAAGAAAGCTTTTATTGCAACTATAGTTTCTATTTCAGCATTAGTTCTAAATAGTTGTATTGTTGCAGCAATAGCTGTTGGTGGTGGAACAGTTGCCTATATTGATGG AAATTATTTTATGAATATAGAAGGCAACTATAAAGCTGTCTATAAAGCTACTCTTAAAGCTATTAATGATAATAATGACTTTGTTCTAGTATCAAAAGATCTTGATCAAACAAAGCAAAATGCCGACATTGAAGGTGCTA CTAAAATTGATAGTACGAGTTTTAGTGTCAAAATTGAAAGACTGACAGATCAGGCTACTAAAGTGACAATCAAATTTGGTACTTTTGGCGATCAAGCAATGTCATCAACATTAATGGATCAGATCCAGGCAGCTGTACAT AAAGCTTCTTAGAAATGTACAAAAAACTCTACTTAATTATATTATCCACAATAATCGCAATCTCTCTTAATAGTTGTGTTGTTGCCGCTGTTTTAATTGGTACAGCAGTTGTTGCTGGAGGTACAGTATATTACATCAAT GGTAACTATATAATCGAAGTCCCTAAAGATATTAGAAGTGTATACAATGCTACAATCAAGACTATACAGATGGATAGTCAAAATAAACTAATAAGTCAAACCTATAATACTAAATCTGCTATAATTAAAGCTTTACAAAA 
AGGTGAAAAAATTAGTATAGATTTAAGCAATATTGATAGTCGTTCAACAGAGATAAAAATTCGTATAGGTGTACTTGGCGATGAGAAAAAATCTGCTGATTTAGCAAACTCAATAACAAAAAATATCACCTAAGCAATAT TTCTCGAACTTTGGTTAACTTTTTCTTTTTAAAAACTTTCAAAAATGTATAATTTGTGTTAGTTTGCAAACTACCCTTATATCCATAATGAGTAATAAGGTATTAGATACATATTATAAAAACAATCGACATATTTGGGT GCTAGTACTATCTGGTGCTGTTATAGGCACAATGATTGGTCTTCTAGCAACAGCATTTCAGCTACTCCTAGACTTTATTTTTAAAATTAAGCTGGCTCTTTTTTCTTTCAGTGGTGGTAATCTTTTTATCGAAATCGCTA TGTCAATATCATTAAGTATTGTGATGGTATTAATTTCGATTTTTATTGTTAAAAAATTTGCGAAAGAGGCTGGTGGTAGCGGTATCCAAGAGGTTGAGGGTGCTTTAAAAGGCTGCCGCAAAATACGTAAAAGAGTTATG CCCGTGAAGTTTATAAGTGGACTTTTTTCGTTAGGCTCAGGTTTAAGTTTAGGTAAAGAGGGACCATCAATTCATATGGCTGCTGCATTAGCGCAGTTTTTTGTTGATAAATTTAAACTTACTACAAAATATGCTAATGC GGTTATCTCTGCTGGGGCTGGAGCTGGACTAGCAGCTGCTTTTAATACCCCACTTTCTGGGATTATCTTTGTTATTGAAGAGATGAATAGAAAGTTTAGATTTAGTGTTTCGGCAATAAAGTGTGTGCTAGTAGCATGTA TCATGAGTACAGTTATCTCTAGAGCTATTATGGGTAATCCTCCAGCAATACGCGTAGAAACTTTCAGCTCAGTACCACAAAATACTCTTTGGTTATTTATGGTATTAGGGATTATATTTGGTTATTTTGGTTTACTATTT AACAAATCCTTAATCAAAGTGGCAAACTTTTTCTCAGAAGGCTCCAAGAAGAGGTATTGGACTTTAGTTATAATTGTTTGCATAATTTTTGGTATTGGTGTTGTTCTATCTCCAAATGCTGTTGGCGGTGGCTATATTGT CATAGCAAATACTCTTGATTATAACTTATCAATCAAGATGCTTTTAGTGCTTTTTGTACTTCGTTTTGCTGGAGTTATTTTCTCATATGGCACCGGCGTTACTGGTGGGATATTCGCACCAATGATTGCGCTTGGTACTG

16 Evolution of genome sizes
• C-value: 1 pg ≈ 0.98 Gb (equivalently, 1 Gb ≈ 1.02 pg)
• Thale cress (Arabidopsis thaliana): 0.16 pg
• Fruit fly (Drosophila melanogaster): 0.18 pg
• Pufferfish (Takifugu rubripes): 0.4 pg
• Human (Homo sapiens): 3.5 pg
• Onion (Allium cepa): 16.75 pg
• Tiger salamander (Ambystoma tigrinum): 32 pg
• Marbled lungfish (Protopterus aethiopicus): 132 pg
http://www.rbgkew.org.uk/
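To compare these C-values with genome sizes quoted in base pairs, the masses can be converted. The sketch below assumes the commonly cited factor of about 978 Mb per picogram of DNA; that constant and the helper name `pg_to_gb` are assumptions for illustration, while the C-values are the ones listed on the slide.

```python
# Assumed conversion factor (not stated in this form on the slide):
# roughly 978 Mb of double-stranded DNA per picogram.
MB_PER_PG = 978.0

# C-values in picograms, as listed on the slide
C_VALUES_PG = {
    "Arabidopsis thaliana": 0.16,
    "Drosophila melanogaster": 0.18,
    "Takifugu rubripes": 0.4,
    "Homo sapiens": 3.5,
    "Allium cepa": 16.75,
    "Ambystoma tigrinum": 32.0,
    "Protopterus aethiopicus": 132.0,
}


def pg_to_gb(picograms):
    """Convert a DNA mass in picograms to gigabases."""
    return picograms * MB_PER_PG / 1000.0


for species, pg in sorted(C_VALUES_PG.items(), key=lambda item: item[1]):
    print(f"{species:26s} {pg:7.2f} pg  ~ {pg_to_gb(pg):7.2f} Gb")

# How many human-sized genomes fit in the lungfish genome?
fold = C_VALUES_PG["Protopterus aethiopicus"] / C_VALUES_PG["Homo sapiens"]
print(f"lungfish / human: about {fold:.0f}-fold")
```

The point of the slide survives any reasonable choice of conversion factor: genome size spans nearly three orders of magnitude across these species and does not track organismal complexity.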

17 Genic region and genome size Dan Graur

18 What's in the genome
Genome: annotated genes (exon, UTR, intron), cis-regulatory elements, selfish elements, novel genes, dead genes (pseudogenes)

19 "Non-genic": repetitive elements
• E.g. Human genome (Venter et al. (2001) Science 291:1304)
Question | A | B | C
Exons take up? | 1% | 24% | 25%
Introns account for? | 24% | 1% | 25%
Repetitive elements occupy? | 35% | 60% | 45%
Unknown? | 40% | 15% | 5%

20 What is in the unknown regions?
• Investigate with tiling array (cDNA array vs. tiling array; gap size: 10 bp, probe size: 25 bp)
• Number of features:
  • Arabidopsis, 135 Mb, 1 chip, ~6×10^6 features
  • Human, 3 Gb, 7 chips, ~4.2×10^7 features

21 "Non-genic": unannotated genes Kapranov et al., 2002. Science  Tiling array analysis of human Chr 21, 22

22 Tiling array analysis of human transcriptome Kapranov et al., 2002. Science
• Human Chr 21, 22
• What do you think these expressed regions represent?

23 Difficulties for coding gene prediction
• Training data
• You need to know something...
• “Biased” toward the properties of the majority.
• Real genes that are shorter tend to be much harder to predict.
Table 3. Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)
Gene finder | Cor | Sn | Sn fk (%) | Sp
GISMO | 0.64 | 63.0 | 86.4 | 69.0
Glimmer | 0.54 | 72.0 | 83.7 | 44.0
CRITICA | 0.60 | 46.0 | 67.4 | 84.0
Sn fk denotes the sensitivity in detecting function-known genes.
Krause et al., 2006. Nucleic Acids Res. 35:540

24 Novel coding sequence identification
• Arabidopsis thaliana as an example
• 135 Mb, ~50% occupied by annotated genes.
• Focus on coding sequences 90-300 bp long: 133,090 sORFs
• What would you do next to eliminate ORFs that are likely false predictions?
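The starting set of candidate sORFs can be pictured as every short ATG-to-stop stretch in the six reading frames. The following is a minimal sketch of such a scan, not the procedure used in the study: the function names are hypothetical, and it assumes that the 90-300 bp length includes the stop codon and that each ORF starts at the first ATG after the previous in-frame stop.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sorfs(seq, min_len=90, max_len=300):
    """Yield (strand, start, end, orf) for short ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and relative to the strand
    that was scanned. Length is counted from ATG through the stop codon
    (an assumption; the slide does not say how length is measured).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                    # open a new ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if min_len <= length <= max_len:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None                 # close it, keep scanning


# Toy example: ATG + 30 alanine codons + stop = 96 bp
toy = "ATG" + "GCT" * 30 + "TAA"
for hit in find_sorfs(toy):
    print(hit[0], hit[1], hit[2], len(hit[3]))
```

A scan like this is deliberately permissive, which is why the later criteria are needed to remove ORFs that occur by chance.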

25 Criterion 1: Codon usage bias  Some codons are used more frequently than others http://www.cbs.dtu.dk/services/GenomeAtlas/

26 Criterion 1: Codon usage bias
• For example: codons for proline
Codon | NCDS | CDS
CCT | 0.25 | 0.12
CCC | 0.25 | 0.49
CCA | 0.25 | 0.06
CCG | 0.25 | 0.33
• Suppose the following two sequences both code for poly-proline. Which one is more likely to be a real coding sequence?
Seq1: CCT CCA CCT
Seq2: CCC CCG CCC
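The question on this slide can be answered by multiplying codon frequencies. The sketch below uses the proline frequencies from the slide's table and assumes, for simplicity, that codons are independent of one another; the function names are illustrative only.

```python
import math

# Proline codon frequencies from the slide's table
NCDS = {"CCT": 0.25, "CCC": 0.25, "CCA": 0.25, "CCG": 0.25}  # non-coding
CDS = {"CCT": 0.12, "CCC": 0.49, "CCA": 0.06, "CCG": 0.33}   # coding


def likelihood(codons, table):
    """Probability of the codon series under one table (codons independent)."""
    return math.prod(table[codon] for codon in codons)


def log_odds(codons):
    """Natural-log ratio of the coding to the non-coding likelihood."""
    return math.log(likelihood(codons, CDS) / likelihood(codons, NCDS))


seq1 = ["CCT", "CCA", "CCT"]
seq2 = ["CCC", "CCG", "CCC"]

for name, codons in (("Seq1", seq1), ("Seq2", seq2)):
    print(f"{name}: P(seq|CDS)={likelihood(codons, CDS):.6f} "
          f"P(seq|NCDS)={likelihood(codons, NCDS):.6f} "
          f"log-odds={log_odds(codons):+.2f}")
```

Seq2 uses the codons favored in coding sequences, so its log-odds come out positive, while Seq1's are negative: under these frequencies Seq2 looks more like a real coding sequence.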

27 Novel CDS identification
• Posterior probability calculation
• Bayes' theorem
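The slide names Bayes' theorem without writing it out. For a two-class problem, one standard way to write the posterior is given below. The symbols are assumptions introduced here: S is the candidate sequence, CDS and NCDS are the coding and non-coding classes, and the priors P(CDS) and P(NCDS) are not specified in the transcript.

```latex
P(\mathrm{CDS} \mid S)
  = \frac{P(S \mid \mathrm{CDS})\, P(\mathrm{CDS})}
         {P(S \mid \mathrm{CDS})\, P(\mathrm{CDS})
          + P(S \mid \mathrm{NCDS})\, P(\mathrm{NCDS})}
```

The two likelihood terms are what the coding and non-coding composition tables supply; a sequence is called coding when this posterior exceeds a chosen cutoff.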

28 Novel CDS identification
• Determine base composition probabilities
• Feature tables: CDS parameters (from coding sequences; c1, c2, c3, c4, c5, c6) and NCDS parameters (from non-coding sequences; n)
      A     T     G     C
AAA   0.01  0.31  0.03  0.02
AAT   0.03  0.17  0.01  0.15
AAG   0.02  0.00  0.02
AAC   0.15  0.02        0.05
ATA   0.04  0.03  0.05  0.00
ATT   0.06  0.01  0.07  0.02
ATC   0.01        0.05  0.29
...

29 Posterior probability of coding sequence  Compare known non-coding and coding sequences Hanada et al., 2007. Genome Res.

30 Posterior probability of coding sequence  Scanning Arabidopsis genome Hanada et al., 2007. Genome Res.

31 After applying the first criterion: 7,442 coding sORFs

32 How good is the CDS finding measure?
• For the training data
• For 18 Arabidopsis small protein genes: all 18 are predicted as CDS.
• For 84 yeast small protein genes: all 84 are predicted as CDS.

33 So what does this mean?
• If a sequence is a true coding sequence, our approach can predict it with high accuracy.
• So, the sensitivity is very good.
• Is this good enough?
• What about specificity? Namely, how good are the criteria at excluding false positives?
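The distinction drawn here can be made concrete with the usual definitions. The sketch below assumes the textbook formulas, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP); the function names are illustrative. Only the sensitivity can be computed from the counts in this transcript, because no set of known negatives is reported.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real coding sequences that are predicted as coding."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of non-coding sequences that are correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Counts from the previous slide: (predicted as CDS, total known genes)
known_sets = {
    "Arabidopsis small protein genes": (18, 18),
    "yeast small protein genes": (84, 84),
}

for name, (predicted, total) in known_sets.items():
    sn = sensitivity(true_pos=predicted, false_neg=total - predicted)
    print(f"{name}: sensitivity = {sn:.2f}")

# Specificity needs sequences known NOT to be coding; no such counts are
# given in the transcript, so it is left uncomputed here.
```

A method that called every sORF coding would also score a sensitivity of 1.0 on these sets, which is why the additional criteria in the following slides matter.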

34 Criterion 2: Expression
• What expression level would you expect for true CDS compared to false CDS?
[Figure: tiling array (gap size: 10 bp, probe size: 25 bp); plot axes: expression level, frequency]

35 Comparison of expression levels
• Exon, intron, tRNA, rRNA, our predictions
A: Exon B: Intron C: Predicted novel CDS D: tRNA E: rRNA

36 Applying the second criterion
• Predictions are significantly enriched in expressed sequences: 2,996 transcribed sORFs
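The counts reported so far form a filtering funnel, and it helps to see how much each criterion removes. The sketch below uses only the three counts given in the slides; the step labels and the function name are illustrative.

```python
# Candidate counts reported on the slides, in filtering order
FUNNEL = [
    ("all sORFs, 90-300 bp", 133_090),
    ("criterion 1: coding potential", 7_442),
    ("criterion 2: transcribed", 2_996),
]


def retention(funnel):
    """Return rows of (label, count, fraction of previous, fraction of first)."""
    first = funnel[0][1]
    previous = first
    rows = []
    for label, count in funnel:
        rows.append((label, count, count / previous, count / first))
        previous = count
    return rows


for label, count, of_prev, of_first in retention(FUNNEL):
    print(f"{label:32s} {count:8,d}  {of_prev:6.1%} of previous  "
          f"{of_first:6.1%} of all")
```

Roughly 6% of the candidates pass the first criterion, and about 40% of those show evidence of transcription.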

37 Criterion 3: Purifying selection  Compare known coding and non-coding sequences

38 Criterion 3: Purifying selection  Compare known coding and non-coding sequences

39 Our research interests 30,000 25,000 10,000 6,000 45,000 17,000

40 Duplication Mechanism and Loss Rate
Gene duplications: mechanisms, preferential retention, consequences

41 Duplication mechanisms
• Whole genome duplication
• Tandem duplication
• Segmental duplication
• Duplicative transposition

42 Differences in Duplicability
• Duplicability: the propensity for the retention of a duplicate gene
• Computational analysis of genome-wide trend
Categories compared between Arabidopsis and human: defense response, proteolysis, transport, ion channel activity, metabolism, development, protein kinase activity, transcription factor activity

43 Functional Consequences of Duplication
• Functional divergence and conservation
• Is it because of changes in cis-regulatory elements or in coding sequences?
• How are duplicates retained: subfunctionalization or neofunctionalization?

44 Acknowledgement
• Lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou
• TIGR: Chris Town, Hank Wu
• University of Chicago: Wen-Hsiung Li, Justin O. Borevitz, Xu Zhang
• Funding

