A high-resolution map of human

Presentation transcript:

A high-resolution map of human evolutionary constraint using 29 mammals
Kerstin Lindblad-Toh, Manuel Garber, Or Zuk, Michael F. Lin, Brian J. Parker, Stefan Washietl, Pouya Kheradpour, Jason Ernst, Gregory Jordan, Evan Mauceli, Lucas D. Ward, Craig B. Lowe, Alisha K. Holloway, Michele Clamp, Sante Gnerre, Jessica Alföldi, Kathryn Beal, Jean Chang, Hiram Clawson, James Cuff, Federica Di Palma, Stephen Fitzgerald, Paul Flicek, Mitchell Guttman, Melissa J. Hubisz, David B. Jaffe, Irwin Jungreis, W. James Kent, Dennis Kostka, Marcia Lara, Andre L. Martins, Tim Massingham, Ida Moltke, Brian J. Raney, Matthew D. Rasmussen, Jim Robinson, Alexander Stark, Albert J. Vilella, Jiayu Wen, Xiaohui Xie, Michael C. Zody, Broad Institute Sequencing Platform and Whole Genome Assembly Team, Kim C. Worley, Christie L. Kovar, Donna M. Muzny, Richard A. Gibbs, Baylor College of Medicine Human Genome Sequencing Center Sequencing Team, Wesley C. Warren, Elaine R. Mardis, George M. Weinstock, Richard K. Wilson, Genome Institute at Washington University, Ewan Birney, Elliott H. Margulies, Javier Herrero, Eric D. Green, David Haussler, Adam Siepel, Nick Goldman, Katherine S. Pollard, Jakob S. Pedersen, Eric S. Lander & Manolis Kellis
Presentation by: Tu Nguyen & Yazmin Rodriguez

Goal with human genome
Discover and interpret all functional elements within it (for studies in human biology, health and disease).
Functional elements are found in exons, introns and intergenic regions (with protein-coding, RNA, regulatory and chromatin roles).
HMRD: comparative analysis of the human, mouse, rat and dog genome sequences.
The comparison revealed similarities: it showed that at least 5% of the human genome is under purifying selection and most likely functional, much of it consisting of non-coding elements with regulatory roles. Evolution has selected for them.

Coverage: the average depth of sequencing over a nucleotide, i.e. how many times that position is sequenced.
coverage = (number of reads * read length) / length of genome
Why is coverage important? The higher the coverage, the more accurate the sequencing, and the easier it is to decide whether a deviation in the sequence is a sequencing error or a SNP.
Branch Length and Genetics: branch length is measured in nucleotide substitutions per site, so a branch length of 1 equals one substitution per site (in the figures it is usually given per 100 bp).
False Discovery Rate (FDR): designed to control the expected proportion of incorrectly rejected null hypotheses (‘false discoveries’). For example, an FDR of 10% means that no more than 10% of the discoveries you made are expected to be false positives.
Picture credit: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC430923/figure/f3/
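The coverage formula above can be worked through with a short sketch. The genome size, read length and read counts below are made-up round numbers chosen for illustration, not values from the paper, and the function names are hypothetical.

```python
import math


def coverage(num_reads: int, read_length: int, genome_length: int) -> float:
    """Average depth: (number of reads * read length) / length of genome."""
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return num_reads * read_length / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Invert the formula: how many reads give the target average depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(target_coverage * genome_length / read_length)


# Hypothetical example: a 3 Gb genome sequenced with 500 bp reads.
GENOME = 3_000_000_000
READ_LEN = 500

low_pass = coverage(12_000_000, READ_LEN, GENOME)      # about 2x
deep_reads = reads_needed(7, READ_LEN, GENOME)         # reads for about 7x
print(f"{low_pass:.1f}x coverage; {deep_reads:,} reads needed for 7x")
```

At the same sequencing budget, several genomes can be covered at ~2x for the cost of one at ~7x, which is the trade-off behind sequencing more species at low coverage.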

Genetic Constraints
Portions of DNA that remain the same across species, suggesting selection (exons, introns, intergenic elements).
Why is this important according to the authors? These sequences must be important to have been selected for and conserved through evolution. Even something as simple and effective as DNA polymerase would be prone to mutations; however, it has remained more or less unchanged, so there must be selection acting on DNA polymerase (species that had mutations in DNA polymerase were not as ‘fit’ as the others).
Why is this important in terms of this article? The evolutionarily constrained sequences were not only exons, which eventually code for proteins. They also included introns (spliced out of the pre-mRNA after transcription) and intergenic regions (regions between genes, often referred to as ‘junk DNA’). These regions must have an important function (protein-coding or regulatory) to have been conserved and selected for.
SCE (synonymous constraint elements): short stretches within protein-coding ORFs that also encode additional, overlapping functional elements. They show a reduced apparent rate of synonymous substitutions in cross-species alignments.
Yaz’s notes: The initial mammalian comparisons only estimated the overall proportion of the genome under evolutionary constraint (sequence that has not changed over the years, suggesting evolution has selected for it). They gave a general view of where the constraint lies, but could not detect the individual constrained elements, especially the smaller ones. The authors wanted to identify the constrained elements and interpret their function.
Picture credit: http://www.nature.com/nrg/journal/v15/n3/fig_tab/nrg3644_F3.html

Process - Sequencing, assembly, alignment
29 mammalian genomes were obtained by shotgun sequencing (20 at ~2-fold coverage and the rest at ~7-fold coverage, to maximize the number of species sequenced).
MultiZ was used to combine many local alignments into a multiple local alignment.
The power to detect constrained elements depends on the total branch length of the phylogenetic tree connecting the species.
The largest fraction of constraint was found in exons, introns and intergenic regions; about 40% of what they found was intronic.
Unique counts: 5’ UTRs, 3’ UTRs, promoters, pseudogenes, non-coding RNAs, introns, intergenic.
Image: Species with finished genome sequences are in blue, high-quality drafts are in green (7x coverage) and drafts are in black (2x coverage). Red branches indicate more than 10 substitutions per 100 bp, while blue branches have fewer than 10 substitutions per 100 bp.
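Since detection power is tied to the total branch length of the tree, it may help to see how that total is computed. The sketch below sums branch lengths over a toy tree; the tree shape and every branch length in it are invented for illustration and are not the values from the paper's phylogeny.

```python
# A node is (name, branch_length_to_parent, list_of_child_nodes).
# Branch lengths are in substitutions per site. All numbers are hypothetical.
toy_tree = (
    "root", 0.0, [
        ("human", 0.10, []),
        ("rodents", 0.05, [
            ("mouse", 0.30, []),
            ("rat", 0.30, []),
        ]),
        ("dog", 0.20, []),
    ],
)


def total_branch_length(node) -> float:
    """Sum the branch above this node and all branches below it."""
    _name, length, children = node
    return length + sum(total_branch_length(child) for child in children)


def add_leaf(node, name: str, length: float):
    """Return a copy of the tree with one extra leaf attached to this node."""
    node_name, node_length, children = node
    return (node_name, node_length, children + [(name, length, [])])


before = total_branch_length(toy_tree)
after = total_branch_length(add_leaf(toy_tree, "extra_species", 0.25))
print(f"total branch length: {before:.2f} -> {after:.2f} substitutions per site")
```

Every added species contributes its own branch to the total, which is why adding many species, even at low coverage, increases the power to detect constraint.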

Image: At 10% FDR, 3.6 million constrained elements can be detected, encompassing 4.2% of the genome. The areas shaded in blue are the fraction of bases newly detected with the 29 mammals, compared with the previously known elements (union of the HMRD 50-bp and Siepel vertebrate elements). The largest fraction of constraint can be seen in coding exons, introns and intergenic regions.
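To make the idea of calling elements "at 10% FDR" concrete, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, one standard way to control the FDR. This is an assumption for illustration: the slides do not say which FDR procedure the authors used, and the p-values below are invented.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Return the indices of the hypotheses rejected at the given FDR.

    Step-up rule: sort the p-values, find the largest rank k such that
    p_(k) <= (k / m) * fdr, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical p-values for five candidate elements.
candidate_p = [0.9, 0.001, 0.5, 0.03, 0.02]
called = benjamini_hochberg(candidate_p, fdr=0.10)
print("candidates called as constrained:", called)
```

Raising the FDR threshold calls more elements at the cost of a larger expected share of false positives among them.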

AR - ancestral repeats (neutrally evolving)
HMRD - human, mouse, rat, dog
Masked genomic - the whole genome
The number of aligned species increases with the functional importance of each feature, suggesting that the power is highest over functional elements.

Process - Detection of constrained sequence
4.2% of the human genome (3.6 million elements) was pinpointed as constrained, at a resolution of 12 bp.
This is fine enough to detect individual NRSF binding sites in a promoter.
SiPhy scores both the substitution rate and the substitution pattern (SiPhy-π).
Constrained regions show a decrease in SNP count; when polymorphisms do occur in these constrained regions, they tend to match alleles of non-human mammals.

The neurological gene NPAS4 has many constrained elements overlapping its introns and the upstream intergenic region. The constrained elements found overlapping the introns of this transcription factor gene are, on closer inspection, known to be regulatory.

Biased nucleotide substitution patterns identify positions where two bases appear equally constrained, and these correlate with SNPs in the human population. An example of an intergenic SiPhy-π element (HG18 chr12:1,916,342-1,916,380), detected based on the presence of three 2-fold degenerate constrained bases. Note how these bases (in grey boxes) alternate between two bases across the evolutionary tree. One of the degenerate bases matches a SNP present in several human populations: European CEPH (CEU), Yoruban Africans (YRI) and Japanese (JPT).

Process - Functional annotation of constraint

You don’t need to look at the figure...
The top figure points out that a new protein-coding gene (exon) was predicted using the 29-mammal comparison. It is then supported by two independent multi-exon transcripts predicted by Scripture based on the Illumina HiSeq Body Map 2.
The bottom figure points out that there might be stop codon readthrough. The region between the first stop codon and the second stop codon is highly conserved. This conservation breaks down shortly after the second stop codon, making the authors believe that the second stop codon is the ‘true’ stop codon. Note that TGA, normally a stop codon, can also code for selenocysteine (Sec), an amino acid that is often found in the active sites of enzymes.

Evolutionary signatures characteristic of conserved RNA secondary structures were used to reveal 37,381 candidate structural elements covering ~1% of constrained regions.
This technique helped predict a new structure at the 3’ end of the XIST large intergenic non-coding RNA. This genomic region is constrained throughout the 29 species, making the authors believe that it is crucial for forming the 12-18 bp stem and 14 bp loop of the hairpins.
Image:
(b) The RNA structure (green) is predicted on the XIST strand (purple) and overlaps short RNAs (blue) observed at high abundance in the chromatin cellular compartment.
(d) The human sequences of all six hairpins were aligned using hairpin D as the reference. Insertions relative to D are shown with orange bars and numbers. Fully conserved positions (*) between the human sequences reveal the same loop region motif.
(e) Multiple alignment across vertebrates for hairpin D.
(f) Secondary structure drawing of the XIST structure with color-coding of substitution evidence.

As different types of conservation in promoters may imply distinct biological functions, we classified the patterns of conservation within core promoters into three categories: (1) those with uniformly ‘high’ constraint, (2) those with uniformly ‘low’ constraint, and (3) those with ‘intermittent’ constraint, consisting of alternating peaks and troughs of conservation.
High and intermittent constraint are associated with CpG islands, while low-constraint promoters show little overlap with them. All three classes overlap at TATA boxes.
Image: Analysis of promoters for 47,945 transcripts identified three patterns of high (red), intermittent (blue) and low (green) constraint. The genes with intermittent constraint had between 1 and 9 peaks of constraint within the 200 bp core promoter. This means that the promoter region is generally conserved.
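The three promoter categories can be illustrated with a toy classifier. This is only a sketch under stated assumptions: the per-base scores, the 0.5 cut-off and the 80%/20% fractions are hypothetical choices made for illustration, and the slides do not describe the rule the authors actually used.

```python
def count_peaks(scores, threshold=0.5):
    """Count runs of consecutive positions scoring at or above the threshold."""
    peaks = 0
    in_peak = False
    for score in scores:
        if score >= threshold:
            if not in_peak:
                peaks += 1
            in_peak = True
        else:
            in_peak = False
    return peaks


def classify_promoter(scores, threshold=0.5, high_frac=0.8, low_frac=0.2):
    """Label a per-base constraint profile as high, low or intermittent.

    All thresholds here are hypothetical and chosen only for illustration.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    constrained = sum(1 for score in scores if score >= threshold)
    fraction = constrained / len(scores)
    if fraction >= high_frac:
        return "high"
    if fraction <= low_frac:
        return "low"
    return "intermittent"


# Invented 10-position profile with alternating peaks and troughs.
profile = [0.9, 0.9, 0.1, 0.1, 0.9, 0.9, 0.1, 0.1, 0.9, 0.1]
print(classify_promoter(profile), "with", count_peaks(profile), "peaks")
```

A real 200 bp core promoter would be scored the same way, just with a longer profile.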

There was enough data from four species (HMRD) to discover known and novel motifs with many conserved instances across the genome. However, that data does not provide enough power to detect individual motif instances. Using the 29 mammalian genomes improves this, allowing us to detect individual motif instances and predict specific target sites for 688 regulatory motifs corresponding to 345 transcription factors.
An FDR of 60% was used, representing a reasonable compromise between specificity and sensitivity given the available discovery power, and matching the experimental specificity of chromatin immunoprecipitation (ChIP).
Image: (a) Enrichment of motifs in published experimental data sets. Known motifs for each factor show an enrichment in experimental data sets, which increases with conservation. (b) Enrichment further increases for regions that are bound both in human and in the orthologous positions in mouse.

Scaling of motif instances using different species subsets. Comparison of high- and low-coverage species demonstrates the value of having low-coverage species.

Examination of evolutionary signatures identified synonymous constraint elements (SCEs) and evidence of positive selection for certain sequences. Here we see two SCE regions within the HOXA2 reading frame (HOXA2 is a protein present in embryonic development that regulates gene expression). These two regions have been characterized as enhancers within exons that drive expression of HOXA2.

As mentioned in the previous slide, examination of evolutionary signatures has brought up evidence of positive selection throughout lineages. Blue sites are under purifying selection (the selective removal of alleles that cause damage), gray sites are evolving neutrally (changes in the gene pool that neither hurt nor benefit the species), and red sites are under positive selection (selection for an allele that increases fitness).
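One common convention for expressing these three regimes in coding sequence is the ratio ω = dN/dS (nonsynonymous over synonymous substitution rate): ω below 1 suggests purifying selection, ω near 1 neutral evolution, and ω above 1 positive selection. The sketch below applies that convention with a hypothetical tolerance band around 1. It is an illustration only: the slides do not say how the sites in the figure were classified, and real analyses use statistical tests, not a fixed cut-off.

```python
def classify_selection(omega: float, tolerance: float = 0.1) -> str:
    """Map a dN/dS ratio to a selection regime.

    The tolerance band around 1 is a hypothetical choice for illustration.
    """
    if omega < 0:
        raise ValueError("omega (dN/dS) cannot be negative")
    if omega < 1 - tolerance:
        return "purifying"   # blue in the figure
    if omega > 1 + tolerance:
        return "positive"    # red in the figure
    return "neutral"         # gray in the figure


# Invented per-site ratios, not data from the paper.
for omega in (0.2, 1.0, 2.5):
    print(omega, "->", classify_selection(omega))
```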

Why is this important
Detecting and interpreting these elements is relevant to medicine (as loci identified in genome-wide studies frequently lie in non-coding regions).
Gives us more of an understanding of a gene.
Epigenetics?

Take Home Message
Found many constrained sequences in the 29 mammalian genomes (specifically in non-protein-coding sequences).
Potential functional classes for ~60% of constrained bases (similar constraints were found within the genomes, suggesting further studies of the actual function and a better understanding of these ‘non-coding genes’).

Where do we go from here?
Functional elements relevant to this clade, including recent eutherian innovations.
Discovering regulatory elements.
Enable discovery of lineage-specific elements within mammalian clades, and increased resolution for shared mammalian constraint (single-nucleotide resolution). The Laurasiatherian and Euarchontoglire branches contain multiple model organisms.
Human-specific selection should be detectable by combining data across genomic regions and by comparing thousands of humans.
Experimental studies require prior knowledge of the biochemical activity sought, and reveal regions active in specific cell types and conditions. Comparative approaches provide an unbiased catalogue of shared functional regions, independent of biochemical activity or condition. With increasing branch length, they can provide information on ancestral and recent selective pressures across clades and within the human population.
The combination of disease genetics, comparative and population genomics and biochemical studies has important implications for understanding human biology, health and disease.

Critiques
Overview of what they found
Statistical findings, no process
Straightforward, understandable

Questions?