Computational analysis on genomic variation: Detecting and characterizing structural variants in the human genome Hugo Y. K. Lam Computational Biology.

Slides:



Advertisements
Similar presentations
Genomics – The Language of DNA Honors Genetics 2006.
Advertisements

Discovery of Structural Variation with Next-Generation Sequencing Alexandre Gillet-Markowska Gilles Fischer Team – Biology.
We processed six samples in triplicate using 11 different array platforms at one or two laboratories. we obtained measures of array signal variability.
What has variation data taught us about the biology of recombination? Rory Bowden, Afidalina Tumian, Ronald Bontrop, Colin Freeman, Tammie MacFie, Gil.
1000 Genomes SV detection Boston College Chip Stewart 24 November 2008.
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display. CHAPTER 18 LECTURE SLIDES.
02_13.jpg Human chromosome 4 02_15.jpg 02_15_2.jpg.
Informatics challenges and computer tools for sequencing 1000s of human genomes Gabor T. Marth Boston College Biology Department Cold Spring Harbor Laboratory.
Structural Variation in the Human Genome Michael Snyder March 2, 2010.
TGCAAACTCAAACTCTTTTGTTGTTCTTACTGTATCATTGCCCAGAATAT TCTGCCTGTCTTTAGAGGCTAATACATTGATTAGTGAATTCCAATGGGCA GAATCGTGATGCATTAAAGAGATGCTAATATTTTCACTGCTCCTCAATTT.
Introduction Basic Genetic Mechanisms Eukaryotic Gene Regulation The Human Genome Project Test 1 Genome I - Genes Genome II – Repetitive DNA Genome III.
Activities in SVs, focusing on breakpoint characterization Mark Gerstein, Yale.
Arabidopsis Genome Annotation TAIR7 Release. Arabidopsis Genome Annotation  Overview of releases  Current release (TAIR7)  Where to find TAIR7 release.
Fig Chapter 12: Genomics. Genomics: the study of whole-genome structure, organization, and function Structural genomics: the physical genome; whole.
Genomes and Their Evolution. GenomicsThe study of whole sets of genes and their interactions. Bioinformatics The use of computer modeling and computational.
Large Scale Variation Among Human and Great Ape Genomes Determined by Array Comparative Genomic Hybridization Devin P. Locke, Richard Segraves, Lucia Carbone,
20.1 Structural Genomics Determines the DNA Sequences of Entire Genomes The ultimate goal of genomic research: determining the ordered nucleotide sequences.
By Zemin Ning & Adam Spargo Informatics Division The Wellcome Trust Sanger Institute The SSAHA2 Application Pack.
Biology 101 DNA: elegant simplicity A molecule consisting of two strands that wrap around each other to form a “twisted ladder” shape, with the.
CS177 Lecture 10 SNPs and Human Genetic Variation
1 A Robust Framework for Detecting Structural Variations February 6, 2008 Seunghak Lee 1, Elango Cheran 1, and Michael Brudno 1 1 University of Toronto,
A Genome-wide association study of Copy number variation in schizophrenia Andrés Ingason CNS Division, deCODE Genetics. Research Institute of Biological.
Ch. 21 Genomes and their Evolution. New approaches have accelerated the pace of genome sequencing The human genome project began in 1990, using a three-stage.
1 Commentary 1.Do not get too worried about "methods" and details. I fully expect there to be concepts and techniques that you simply are not going to.
Chapter 21 Eukaryotic Genome Sequences
Anatomy of a Genome Project A.Sequencing 1. De novo vs. ‘resequencing’ 2.Sanger WGS versus ‘next generation’ sequencing 3.High versus low sequence coverage.
Web Databases for Drosophila Introduction to FlyBase and Ensembl Database Wilson Leung6/06.
Genomics Method Seminar - BreakDancer January 21, 2015 Sora Kim Researcher Yonsei Biomedical Science Institute Yonsei University College.
Doug Brutlag 2011 Genomics & Medicine Doug Brutlag Professor Emeritus of Biochemistry &
Complementarity of network and sequence information in homologous proteins March, Department of Computing, Imperial College London, London, UK 2.
Bioinformatic Tools for Comparative Genomics of Vectors Comparative Genomics.
The BreakSeq Project Nucleotide-resolution analysis of structural variants using BreakSeq and a breakpoint library Mark Gerstein.
Identification of Copy Number Variants using Genome Graphs
____ __ __ _______Birol et al :: AGBT :: 7 February 2008 A NOVEL APPROACH TO IMPROVE THE NOISE IN DETECTING COPY NUMBER VARIATIONS USING OLIGONUCLEOTIDE.
Cancer genomics Yao Fu March 4, Cancer is a genetic disease In the early 1970’s, Janet Rowley’s microscopy studies of leukemia cell chromosomes.
Lecture 6. Functional Genomics: DNA microarrays and re-sequencing individual genomes by hybridization.
PREETI MISRA Advisor: Dr. HAIXU TANG SCHOOL OF INFORMATICS - INDIANA UNIVERSITY Computational method to analyze tandem repeats in eukaryote genomes.
High density array comparative genomic hybridisation (aCGH) for dosage analysis and rapid breakpoint mapping in Duchenne Muscular Dystrophy (DMD) Victoria.
Table 8.3 & Alberts Fig.1.38 EVOLUTION OF GENOMES C-value paradox: - in certain cases, lack of correlation between morphological complexity and genome.
Do not reproduce without permission 1 Gerstein.info/talks (c) (c) Mark Gerstein, 2002, Yale, bioinfo.mbb.yale.edu Gerstein Lab Aims in ModENCODE.
Copyright © 2008 Pearson Education, Inc., publishing as Pearson Benjamin Cummings PowerPoint ® Lecture Presentations for Biology Eighth Edition Neil Campbell.
CHROMOSOMAL INVERSIONS IN HUMAN POPULATIONS Andrea González Morales.
Genomics Chapter 18.
Chromosome inversions in human populations Marta Ruiz Fernández Master in Advanced Genetics 17 December 2014.
Ke Lin 23 rd Feb, 2012 Structural Variation Detection Using NGS technology.
Computational Biology and Genomics at Boston College Biology Gabor T. Marth Department of Biology, Boston College
A high-resolution map of human evolutionary constraints using 29 mammals Kerstin Lindblad-Toh et al Presentation by Robert Lewis and Kaylee Wells.
LECTURE PRESENTATIONS For CAMPBELL BIOLOGY, NINTH EDITION Jane B. Reece, Lisa A. Urry, Michael L. Cain, Steven A. Wasserman, Peter V. Minorsky, Robert.
Global Variation in Copy Number in the Human Genome Speaker: Yao-Ting Huang Nature, Genome Research, Genome Research, 2006.
Canadian Bioinformatics Workshops
The evolution of selfishness HW#5: FASTA sequence Function.
Canadian Bioinformatics Workshops
Precise Identification of Structural Variations in the Human Genome by Splitting Shotgun Reads Zemin Ning1, Anthony Cox1, David Adams1, Paul Flicek2, Charles.
Global Variation in Copy Number in the Human Genome
SVs and CNVs They are often confused…
Genomes and Their Evolution
Genomes and Their Evolution
SGN23 The Organization of the Human Genome
Frequency of Nonallelic Homologous Recombination Is Correlated with Length of Homology: Evidence that Ectopic Synapsis Precedes Ectopic Crossing-Over 
Jin Zhang, Jiayin Wang and Yufeng Wu
Volume 153, Issue 4, Pages (May 2013)
GENOMICS OF STRUCTURAL VARIATION
Fig Figure 21.1 What genomic information makes a human or chimpanzee?
Gene Density and Noncoding DNA
Figure 1. Biased distribution of different SV classes
Complete Haplotype Sequence of the Human Immunoglobulin Heavy-Chain Variable, Diversity, and Joining Genes and Characterization of Allelic and Copy-Number.
Next-Generation Sequencing of Duplication CNVs Reveals that Most Are Tandem and Some Create Fusion Genes at Breakpoints  Scott Newman, Karen E. Hermetz,
Presentation transcript:

Computational analysis on genomic variation: Detecting and characterizing structural variants in the human genome Hugo Y. K. Lam Computational Biology & Bioinformatics, Yale University May 20, 2010 Advisor: Prof. Mark B. Gerstein Committee Members: Prof. Hongyu Zhao & Prof. Kei-Hoi Cheung

Overview Aim – enhance the understanding of the mechanism and impact of genomic variation – develop an efficient computational approach to identify and characterize SVs Introduction – SV, event type, formation mechanism, and more Main Chapters – Clustering pseudogenes into families – Correlating CNV, SD and other genomic features – Nucleotide-resolution analysis of SVs Conclusion and future directions 5/20/20102Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

INTRODUCTION Chapter One

Genomic Variation Gene Dup. Gene Segmental Duplication (SD) Dup. GeneGene Dup. Gene SD/Copy Number Variation (CNV) Paralog Ortholog Non-allelic homologous recombination (NAHR) Alu Gene Alu family Dup. ψgene Pssd. ψgene duplicate Retro-transpose 5/20/20104Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB Active L1 Pssd. ψgene Structural Variation (SV) Insertion Deletion Insertion Inversion Retro-elements Deletion VNTR Ancestral State

Structural Variation Variants (usually >= 1kb) include – Copy number variants Deletions Insertions/duplications – Inversions – Translocations Important contribution to human diversity and disease susceptibility Feuk et al. Nature Reviews Genetics 7, /20/20105Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

SV Formation Mechanism NAHR (Non-allelic homologous recombination) Flanking repeat (e.g. Alu, LINE…) NHEJ (Non-homologous-end- joining) No (flanking) repeats. In some cases <4bp microhomologies TEI (Transposable element insertion) L1, SVA, Alus VNTR (Variable Number Tandem Repeats) Number of repeats varies between different people 5/20/20106Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Read-depth & CGH Approaches 5/20/2010Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB7 Image: Snyder et al. Genes Dev Mar 1;24(5): Ref: Yoon et al. Genome Res September; 19(9): 1586–1592. Image: Ref: Urban et al. Proc Natl Acad Sci USA 103, 4534{9 (2006).

Paired-End Sequencing & Mapping Insertion Deletion Reference Target Breakpoint Inversion Breakpoint Inversion Breakpoint Reference Paired-End Sequencing and Mapping Span << expected Span >> expectedAltered end orientation 5/20/20108Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB Korbel et al. Science 19 October 2007: Vol no. 5849, pp

Split-read Analysis Deletion Reference Read Reference Read Breakpoint Insertion 5/20/20109Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB Target Genome Zhang et al. Submitted Alt: BreakSeqMore: Breakpoint Assembly

CLUSTERING PSEUDOGENES INTO FAMILIES Chapter Two Lam HY, Khurana E, Fang G, Cayting P, Carriero N, Cheung KH, Gerstein MB. Pseudofam: the pseudogene families database. Nucleic Acids Res (Database issue):D738-D743.

Motivation No study had analyzed eukaryotic pseudogenes in a family approach No resource was annotating pseudogene families Clustering pseudogenes in families may lead to a better understanding of the evolutionary constraints on pseudogenes, and reveal the formational bias of pseudogenes 5/20/201011Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Methodology 5/20/201012Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Results Protein family size has an obviously stronger correlation among species than pseudogene family size Correlation of pseudogene family size decreases when the evolutionary distance increases between the species Housekeeping families, such as GAPDH and ribosomal, tend to be enriched with a large number of pseudogenes Developed a large-scale database of eukaryotic pseudogene families PSEUDOFAM 5/20/201013Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

CORRELATING CNV, SD AND OTHER GENOMIC FEATURES Chapter Three Kim PM*, Lam HY*, Urban AE, Korbel JO, Affourtit J, Grubert F, Chen X, Weissman S, Snyder M, Gerstein MB. Analysis of copy number variants and segmental duplications in the human genome: Evidence for a change in the process of formation in recent evolutionary history. Genome Res Dec;18(12):

Motivation NAHR during meiosis was thought to lead to the formation of larger deletions and duplications However, not much was known about how other mechanisms are involved, and not many sequences of breakpoints are available Calculate enrichment of specific genomic features around the breakpoints may reveal different formation signatures for SDs and CNVs 5/20/201015Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Methodology Survey a range of genomic features Count the number of features in each genomic bin (100kb) Calculate correlations using robust stats …ATCAAGG CCGGAA… Repeat Exact match Local environment If exact CNV breakpoints are known, we can calculate the enrichment of repeat elements relative to the genome or relative to the local environment 5/20/201016Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Old Low seq-ID (%) CNVs/Young High seq-ID (%) SDs Results Fixation/Aging NAHR NHEJ Alu SD LINE Microsatellite Subtelomeres Fragile sites Alu Burst (40 MYA) AFTER THE ALU BURST, THE IMPORTANCE OF ALU ELEMENTS FOR GENOME REARRANGEMENT DECLINED RAPIDLY About 40 million years ago there was a burst in retrotransposon activity The majority of Alu elements stem from that time This, in turn, led to rapid genome rearrangement via NAHR The resulting SDs, could create more SDs, but with Alu activity decaying, their creation slowed 5/20/201017Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

NUCLEOTIDE-RESOLUTION ANALYSIS OF SV Chapter Four Lam HY*, Mu XJ*, Stütz AM, Tanzer A, Cayting PD, Snyder M, Kim PM, Korbel JO*, Gerstein MB. “Nucleotide- resolution analysis of structural variants using BreakSeq and a breakpoint library”. Nature Biotechnology 2010 Jan;28(1):47-55.

Aims Leverage the nucleotide-level information of the available SV Breakpoints – Enable rapid detection of SVs short-read sequences personal genomes – Analyze the breakpoint junctions Deduce the formation mechanism Infer ancestral states of events Examine physical features for formation biases 5/20/201019Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Methodology Breakpoints collection and standardizationJunction Library Creation SV Detection Short-read alignment SV Characterization Formation Mechanism Ancestral State Physical Feature 5/20/201020Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

SVs with sequenced breakpoints 1KG Project >20,000 1KG Project >20,000 Published BreakSeq Library 5/20/201021Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

SV Breakpoint Library Library of SV breakpoint junctions 5/20/201022Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Alternative Junctions of an Insertion Alternative Junction of a Deletion SV Detection and Genotyping “BreakSeq” leverages the junction library to detect previously known SVs at nucleotide- level from short-read sequenced genome, which can hardly be achieved by methods such as split-readsplit-read Library of SV breakpoint junctions Map reads onto Junctions can be put on a chip 5/20/201023Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Validation for Detected SVs 48 positive outcomes out of 49 PCRs that were scored in NA12891: 98% PCR validation rate (for low and high-support events) 12 amplicons sequenced in NA12891: all breakpoints confirmed Personal genome (ID)Ancestry High support hits (>4 supporting hits) Total hits (incl. low support) NA18507* Yoruba YH* East Asian81158 NA12891 [1000 Genomes Project, CEU trio] European /20/201024Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

SV Ancestral State Analysis Junction AJunction B Syntenic Primate Region inferring Deletion StateSyntenic Primate Region inferring Insertion State Junction C Region in Reference Genome inferring Insertion State Junction C 1000 bp Junction AJunction B 1000 bp Inferring Insertion according to Ancestral State Inferring Deletion according to Ancestral State Region in Reference Genome inferring Deletion State SV Junction Library 5/20/201025Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB Ref: Genome Remodeling Process

SV Ancestral State Analysis Insertion Deletion 5/20/201026Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

SV Mechanism Classification Highly similar with minor offset SV Non-Allelic Homologous Recombination (NAHR) SV Single Transposable Element Insertion (STEI) Repeat Element RE1 RE2 Multiple Transposable Element Insertion (MTEI) Breakpoint 5/20/201027Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

SV Mechanism Classification 5/20/201028Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

SV Insertion Traces NAHR-based insertions involve nearby sequences NHR- and RT-based insertions are mostly inter-chromosomal 5/20/201029Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Example of a Rectified SV 147 L1s, are still implicated in recent retrotransposition activity (Mill et al., 2007) Our results suggest the possible recent activity in the human population of at least 84 complete L1 elements 38 of these putative active mobile elements not overlap with the 147 Rectified SV (Deletion -> Insertion) Rectified SV (Deletion -> Insertion) Ancestral Genomes L1 Repeat Element Mech: STEI Putative Active L1 5/20/201030Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Breakpoint Features Analysis Breakpoints SVs vs. Telomeres 5/20/201031Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

An Automated Pipeline The Annotation PipelineThe Identification Pipeline The BreakSeq Pipeline Data Conversion SV Dataset Annotated and Standardized SVs Junction Library Sequence Reads Standardized SVs Rapid SV identification for short-read genomes SV Calls Annotating SVs with different features 5/20/201032Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Computational Performance ModuleSVsProcessing TimeCPU* Annotation1,8894.1hrs1 Standardization<1 min Mechanism1.7 hrs Ancestral State2.4 hrs Feature Analysis~2 mins Annotation20K1.7 days1 Annotation20K<1 day2 Annotation3M~3 days80 (20 nodes) SV Detection1,8898 hrs (40x)1 SV Detection1,889~1 week (1000 × 40x) 192 (48 nodes) * 3GHz CPU, 16GB ram (1 node = 4CPUs) 5/20/201033Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

CONCLUSION AND FUTURE DIRECTIONS Chapter Five

Conclusion Performed analysis on a large-scale study of nucleotide-resolution SVs to pinpoint the effects of SV and to characterize them Developed a novel approach to rapidly scan for and characterize SVs in a newly sequenced personal genome – revealed the formational biases underlying SV formation NAHR is driven by recombination between repeat sequences NHR is likely driven by DNA repair and replication errors TEI mostly involves transposons or alike – by applying on a large scale, we envisage that it could be used for genotyping and determining SV allele frequencies 5/20/201035Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Future Directions Look into transposons that contribute a substantial part of the SVs in the human genome – ~45% of the human genome are occupied by transposons or alike – ~ 0.05% are still active About subfamilies of Alu, L1, and SVA elements remain actively mobile – produce genetic diversity – cause diseases by gene disruption 5/20/201036Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Future Directions No systematic and efficient way to identify active mobile elements and their variation among individuals Aim to identify active mobile elements in a genome-wide fashion – For example, a microarray with probes that represent both ends of known L1 elements can be used to capture candidates for paired-end sequencing and subsequent alignment analyses to deduce which are still active. 5/20/201037Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Acknowledgement Yale University – Mark Gerstein – Hongyu Zhao – Kei-Hoi Cheung – Joel Rozowsky – Nitin Bhardwaj – Tara Gianoulis – Jasmine Mu (Breakseq) – Zhengdong (Split-read) – Jiang Du (Split-read) – Ekta Khurana (Pseudogene) – Suganthi (Pseudogene) – Zhi Lu (Worm TF) – Chong Shou (Network rewiring) – Alex Urban (Transposons) – Annote-assemates (Alex, Roger, Gang, Becky, Lucas…) – Gerstein Lab Yale University – Nicholas Carriero (ITS) – Robert Bjornson (ITS) – Mihali Felipe (Sys admin) – Anne Nicotra (Admin) University of Toronto – Philip Kim (Breakseq, SDCNV) EMBL – Jan Korbel (Breakseq) – Adrian Stuetz (Breakseq) University of Vienna – Andrea Tanzer (Breakseq) Stanford University – Philip Cayting (Breakseq, Pseudogene) – Maeve O'Huallachain (Transposons) – Michael Snyder 5/20/201038Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB

Acknowledgement 5/20/2010Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB39

Breakpoint Assembly 5/20/2010Hugo Lam's Defense - Computational Analysis on Genomic Variation - Yale CBB40 Quinlan et al. Genome Res May;20(5): Goto: Split-read