Integrating Genomes
D. R. Zerbino, B. Paten, D. Haussler, Science 336, 179 (2012)
Teacher: Professor Chao, Kun-Mao
Speaker: Ho, Bin-Shenq
June 4, 2012
Outline
- Overview
- Obtaining Genomic Sequences
- Modeling Evolution of Genotype
- From Genotype to Phenotype
- Looking Ahead to Applications
- Conclusion
Overview
- Specialization in computational genomics
- Integration of genetic, molecular, and phenotypic information
- Impact on diverse fields of science
- A new window into the story of life
- Draws on population genetics, phylogenetics, and human disease genetics, plus graph theory, signal processing, statistics, and computer science
Milestones
- First genome sequences (1970s)
- Bacteriophage MS2 RNA, 3,569 nucleotides long (1976)
- Computational genomics (early 1980s): Smith and Waterman; Stormo et al.
- In the past 8 years, sequencing performance improved 10,000-fold, against a 16-fold improvement in computational power under Moore's law
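To make the gap between the two growth figures concrete, the short sketch below converts each fold improvement into an implied doubling time. It assumes steady exponential growth over the 8-year window, which is a simplification not stated on the slide; the function name is illustrative.

```python
import math

def doubling_time_years(fold_change, years):
    """Years per doubling implied by a fold improvement over a period,
    assuming constant exponential growth (a simplifying assumption)."""
    return years / math.log2(fold_change)

moore = doubling_time_years(16, 8)            # computational power
sequencing = doubling_time_years(10_000, 8)   # sequencing performance

print(f"computing power doubles every {moore:.2f} years")
print(f"sequencing performance doubles every {sequencing:.2f} years")
```

Under that assumption, computing power doubles every 2 years while sequencing performance doubles roughly every 7 months.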
Computational Genomics
- Genomic data linked to evolution, molecular phenotype, and organismal phenotype
- Evolution (history): a DNA sequence evolving in time
- Molecular phenotype (mechanism): a piece of chromatin interacting with other molecules
- Organismal phenotype (function): a gene product acting in cellular pathways that affect the organism
Obtaining Genomic Sequences
- Genome assembly is possible given sufficient read redundancy
- Large redundant regions (repeats) produce complex networks of read-to-read overlaps, not all of which reflect actual overlaps in the genome
- Determining which overlaps are legitimate and which are spurious is an NP-hard problem
- Result: undetermined, error-prone, costly-to-finish regions
- Newer sequencing technologies with longer reads
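The effect of repeats on the overlap network can be seen in a toy example. The sketch below is not a real assembler: the genome, the reads, and the minimum overlap length are all made up for illustration. It builds every suffix-to-prefix overlap between reads and then separates the overlaps that match the true read order from those created only by the repeated segment TCCGA.

```python
from itertools import permutations

def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b,
    or 0 if no such match of at least min_len exists."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """Map each ordered read pair with a sufficient overlap to its length."""
    graph = {}
    for a, b in permutations(reads, 2):
        k = overlap(a, b, min_len)
        if k:
            graph[(a, b)] = k
    return graph

# Hypothetical genome containing the repeat TCCGA twice.
genome = "AAGTCCGACTTTCCGAGG"
# Error-free reads listed in their true left-to-right order.
reads = [genome[0:8], genome[3:11], genome[8:16], genome[11:18]]

graph = overlap_graph(reads)
true_edges = set(zip(reads, reads[1:]))   # overlaps between neighbouring reads
spurious = set(graph) - true_edges        # overlaps induced only by the repeat

for (a, b), k in sorted(graph.items()):
    label = "spurious" if (a, b) in spurious else "true"
    print(f"{a} -> {b}  overlap={k}  ({label})")
```

With only four reads, two of the five detected overlaps are already spurious, and nothing in the reads alone distinguishes them from the legitimate ones.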
Obtaining Genomic Sequences
- Reference-based assembly
- Tends to be biased toward the reference genome
- Newer sequencing technologies with longer reads
Modeling Evolution of Genotype
- Diversity of Genomes
- Alignment
- Phylogenetic analysis
Diversity of Genomes
- Every genome is the result of a 3.8-billion-year evolutionary journey from the origin of life
- Genomes are mostly shared and partly unique
- Single-base change: substitution, SNP
- Indel: insertion, deletion
- Tandem duplication
- Recombination
- Transposition
- Rearrangement: inversion, segmental deletion, segmental duplication, fusion, fission, translocation
- Whole genome duplication
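Several of the event types listed above can be written as simple string operations. The sketch below is only an illustration on a made-up ancestral sequence; the function names are hypothetical, and inversion is modelled as replacing a segment with its reverse complement.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def substitute(seq, pos, base):
    """Single-base change at position pos."""
    return seq[:pos] + base + seq[pos + 1:]

def insert(seq, pos, segment):
    """Insertion of segment before position pos."""
    return seq[:pos] + segment + seq[pos:]

def delete(seq, start, end):
    """Deletion of seq[start:end]."""
    return seq[:start] + seq[end:]

def tandem_duplicate(seq, start, end):
    """Copy seq[start:end] and place the copy right after the original."""
    return seq[:end] + seq[start:end] + seq[end:]

def invert(seq, start, end):
    """Replace seq[start:end] with its reverse complement."""
    return seq[:start] + seq[start:end].translate(COMPLEMENT)[::-1] + seq[end:]

ancestor = "ACGTTGCA"  # made-up sequence
print(substitute(ancestor, 2, "T"))       # ACTTTGCA
print(insert(ancestor, 4, "GG"))          # ACGTGGTGCA
print(delete(ancestor, 1, 3))             # ATTGCA
print(tandem_duplicate(ancestor, 2, 5))   # ACGTTGTTGCA
print(invert(ancestor, 1, 4))             # AACGTGCA
```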
Diversity of Genomes
- Germline selection → evolution
- Somatic selection → cancer / immunity
Assembly and Alignment Fig. 1. Assembly and alignment.
Alignment
- Assumes the aligned sequences derive from a suitably recent common ancestor
- Shows what was conserved or changed during evolution from that ancestor: substitutions, indels, segment order, copy number
- Local alignment: for conserved functional regions of more distantly related genomes
- Global (genome) alignment: for genomes of closely related species
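The difference between local and global alignment can be sketched with a single dynamic-programming routine. With `local=True` it follows the Smith-Waterman recurrence (scores are floored at zero and the best cell anywhere is returned); otherwise it follows the Needleman-Wunsch recurrence (end gaps are charged and the bottom-right cell is returned). The scoring values and the example sequences are assumptions chosen for illustration, and only the score is computed, not the alignment itself.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b under a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            best = max(best, cell)
    return best if local else score[-1][-1]

# A conserved motif (GATTACA) embedded in otherwise unrelated flanks.
x = "TTTTGATTACATTTT"
y = "CCGATTACACC"
print("local :", align_score(x, y, local=True))
print("global:", align_score(x, y))
```

The local score recovers the shared motif in full, while the global score is dragged down by the unrelated flanking sequence, which is why local alignment suits conserved regions in distantly related genomes.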
Phylogenetic Analysis
- A single tree provides an explicit order of gene descent through shared ancestry
- Finding the optimal phylogeny under probabilistic or parsimony models of substitutions and indels is NP-hard
- Homologous recombination complicates the analysis
- Goal: a tractable unified theory of genome evolution, with stochastic processes jointly describing the diversification events of the genome
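Scoring one candidate tree under parsimony is the easy part; the NP-hard part is searching over all tree shapes. The sketch below uses Fitch's algorithm, named here as an illustrative choice rather than the method of the paper, to count the minimum number of substitutions at a single site on a fixed binary tree. The species names, the observed bases, and the nested-tuple tree encoding are assumptions.

```python
def fitch(tree, leaf_states):
    """Return (candidate ancestral states, minimum substitutions) for one site.

    tree is either a leaf name (str) or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch(left, leaf_states)
    right_set, right_cost = fitch(right, leaf_states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

# Hypothetical bases observed at one aligned site.
site = {"human": "A", "chimp": "A", "mouse": "G", "rat": "G"}

tree_a = (("human", "chimp"), ("mouse", "rat"))
tree_b = (("human", "mouse"), ("chimp", "rat"))

print("tree_a substitutions:", fitch(tree_a, site)[1])
print("tree_b substitutions:", fitch(tree_b, site)[1])
```

For this site, the tree that groups human with chimp needs one substitution and the alternative needs two, so parsimony prefers the first. A full analysis would sum such scores over many sites and compare many trees.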
From Genotype to Phenotype Fig. 2. The dynamic processes that affect and are affected by the genome.
Genomes, Mechanisms, Functions
- Active molecules of the cell, including proteins, messenger RNAs, and other functional RNAs
- Epigenetic mechanisms regulating RNA and protein production and function
- Gene regulatory networks
- Protein signaling cascades
- Metabolic pathways
- Regulatory network motifs
From Genotype to Phenotype
- Exploring the unfolding history and diversity of life
- Deriving experimental data from an expanding set of cell culture resources for diverse species and tissues, and from newer single-cell assay methodologies
- Correlating specific segregating variants with phenotypic traits or diseases
- Identifying causal variants through complete genome analysis of related and unrelated cases and controls, combined with better prediction of the possible effects of genome variants
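Correlating a variant with a disease can be sketched, in its simplest form, as a test on a 2x2 table of allele counts in cases and controls. The counts below are invented for illustration, and the odds ratio with a Pearson chi-square statistic is one common choice, not necessarily the analysis the paper has in mind.

```python
def odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Odds of carrying the alternate allele in cases relative to controls."""
    return (case_alt * ctrl_ref) / (case_ref * ctrl_alt)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (1 degree of freedom) for the table
    [[a, b], [c, d]], without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical allele counts: (alternate, reference).
cases = (60, 40)
controls = (40, 60)

ratio = odds_ratio(*cases, *controls)
stat = chi_square_2x2(*cases, *controls)
print(f"odds ratio = {ratio:.2f}, chi-square = {stat:.2f}")
# A statistic above 3.84 corresponds to p < 0.05 at 1 degree of freedom.
```

An association of this kind only flags a correlated variant; separating causal variants from linked ones needs the additional evidence listed in the last bullet.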
From Genotype to Phenotype
- Constructing models of molecular phenotypes (epigenetic state, RNA expression, and inferred protein levels) with hidden Markov models, factor graphs, Bayesian networks, and Markov random fields
- Incorporating biological knowledge into classification and regression methods (e.g., general linear models, neural networks, and support vector machines)
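As a small illustration of the hidden Markov model approach, the sketch below decodes a hidden chromatin state ("open" or "closed") from a sequence of binned assay signals ("high" or "low") with the Viterbi algorithm. The two states, the observation alphabet, and every probability are assumptions invented for this example, not values from the paper.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden state path for obs, computed in log space."""
    score = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in states:
            # Best predecessor state for landing in s at this position.
            best_prev = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            score[s] = (prev[best_prev] + math.log(trans_p[best_prev][s])
                        + math.log(emit_p[s][o]))
            pointers[s] = best_prev
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Hypothetical model parameters.
states = ("open", "closed")
start_p = {"open": 0.5, "closed": 0.5}
trans_p = {"open": {"open": 0.9, "closed": 0.1},
           "closed": {"open": 0.1, "closed": 0.9}}
emit_p = {"open": {"high": 0.8, "low": 0.2},
          "closed": {"high": 0.2, "low": 0.8}}

signal = ["high", "high", "low", "high", "high", "low", "low", "low"]
print(viterbi(signal, states, start_p, trans_p, emit_p))
```

Because state changes are penalized, the single "low" bin inside the run of "high" bins is treated as noise, while the sustained run of "low" bins at the end is decoded as a switch to the closed state.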
Looking Ahead to Applications
- Genome data growing collectively from petabytes (10^15 bytes) today to exabytes (10^18 bytes) tomorrow
- Cancer diagnosis and treatment
- Immunology
- Stem cell therapy
- Agriculture
- Human prehistory study
Conclusion
- The challenge: obtaining maximum information from every sequencing experiment
- Borrowing and tying together advances from a spectrum of research fields into foundational mathematical models
- Balancing model comprehensiveness against computational efficiency
- Models will be shaped by increasing knowledge of biology
Thank You For Your Attention