Precise Identification of Structural Variations in the Human Genome by Splitting Shotgun Reads Zemin Ning1, Anthony Cox1, David Adams1, Paul Flicek2, Charles.

Presentation transcript:

Precise Identification of Structural Variations in the Human Genome by Splitting Shotgun Reads

Zemin Ning1, Anthony Cox1, David Adams1, Paul Flicek2, Charles Shaw-Smith1, Mark Griffiths1, Adam Spargo1, Jane Rogers1 and Richard Durbin1

1 The Wellcome Trust Sanger Institute
2 EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK

[Figure panel: Target Site Duplications and Length Distribution]

INTRODUCTION

Extensive structural variation exists between individuals in the human genome [1,2]. Disease and disease susceptibility may be associated with this type of genetic variation. Current experimental and computational methods for studying human diversity include detection of copy-number variation by array CGH [3,4], identification of insertions/deletions from sequencing traces [5], and fine-scale mapping with fosmid paired-end sequences [6], from which hundreds of submicroscopic copy-number variants and inversions have been identified. The sequences involved were reported to sometimes contain entire genes and their regulatory regions, and to reach millions of DNA bases in size. However, the comparative microarray studies reported in the literature lack sequence-level precision at breakpoints, and the surveys covered only a small fraction of the sequence. The in silico strategy [6] using fosmid ends achieved higher resolution, but in most cases it still cannot provide exact breakpoint loci, nor can it detect variants smaller than 5 kb. Short indels (<50 bp) can be identified by aligning shotgun reads against the genome assembly. Much progress is still needed before all types of structural variation can be detected accurately across the different size ranges. We have developed a computational method for the precise identification of structural variants across the genome by aligning shotgun reads against the reference sequence.
As individual reads covering the boundaries of variant regions are split, we can pinpoint the exact breakpoint loci and, where applicable, extract the sequence lying between the boundaries. The DNA samples used in this analysis came from 10 human individuals and one male chimpanzee, for a total of 74 million shotgun reads, a rich and diverse resource for studying structural variation in the human genome.

Figure 1. Length distribution of structural variants with Chimp ancestral data included.
Figure 2. Length distribution of target site duplications.

Detection of Structural Variants

[Diagram: sample reads aligned against the reference. Panels: (a) Deletion; (b) Insertion; (c) Insertion with sequence; (d) VNTRs]

Experimental validation: PCR tests

1. Insertion Chr1:237001745
2. Deletion Chr1:56954646-56954968
3. Deletion Chr6:39030177-39030481
4. Insertion Chr13:30790030

Results and Conclusion

A total of 7,293 structural variants were identified: 2,500 deletions, 2,358 insertions and 2,435 VNTRs, using 44 million shotgun reads from 10 human individuals. To assess the ancestral state of each variant against the chimpanzee genome, we also used 30 million chimpanzee reads. Compared with dbRIP [7], an existing database of structural variations, there are 545 exact matches among 2,095 retrotransposon insertion polymorphisms (L1, Alu and SVA).

- 66% of the structural variant sequences can be masked as retrotransposons;
- 28% of human variants share the same location with the chimpanzee, i.e. they are ancestral states;
- 89% of ancestral deletions are retrotransposons, as are 66% of VNTRs;
- 38% of variants are located in exon/intron regions.

Conclusion: Mobile transposons are significantly more active in intragenic regions, and this might lead to phenotypic differences among human individuals.

Exonic, Intronic and Noncoding Mapping

Type of Variation   Coding_Exonic   Coding_Intronic   Noncoding   Total
SV_deletion         17              892               1591        2500
SV_insertion        2               897               1459        2358
SV_VNTRs            8               966               1461        2435

DNA Sources and Reads

Species      Cell lines      Number of reads
Human        HAPMAP 17109    1,841,054
Human        HAPMAP 17119    5,977,374
Human        HAPMAP 11321    4,488,765
Human        HAPMAP 07340    3,728,821
Human        HAPMAP 10470    557,845
Human        Celera HuAA     2,788,046
Human        Celera HuBB     19,397,599
Human        Celera HuCC     1,745,337
Human        Celera HuDD     2,011,152
Human        Celera HuFF     1,507,522
Total Human                  44,043,515
Chimpanzee   Clint           30,838,333
Total Reads                  74,881,848

Genes Affected by Detected Variants

Type of Variation   Chr   Name of the gene    offset_start   offset_end
SV_deletion         1     SEC22L1             142581992      142586126
                    10    ENSG00000080218     114102173      114106649
                          SFMBT2              7491228        7492455
                    11    ENK17_HUMAN         101071004      101080469
                          FAM55A              113930121      113937136
                    16    UBN1                4837686        4839024
                    18    DHFR                22001811       22005321
                    21    BAGE4               10118653       10130352
                    22    ARVCF               18378162       18379189
                          GSTT2               22598698       22635852
                          Q8N7Q6_HUMAN        36068891       36074529
                          XP_372900.2
                    3     PFN2                151170677      151171962
                          Q96EG4_HUMAN
                    4     Q9UN78_HUMAN        18755734       18761820
                    6     ENSG00000197659     81682701       81687383
                          KIAA1949            30760573       30764396
SV_insertion        12    KRT4                51493851       51493869
                    2     ENSG00000177083     31960130       31960137
SV_VNTRs            13    ENSG00000182751     36266859       36266939
                          Q8WYY0_HUMAN        97987112       97987144
                          NP_660337.2         655266         655362
                    17    KRTAP4-10           36594293       36594448
                          ENOSF1              702428         702517
                          CU025_HUMAN         42245600       42249864
                          KIF25               168248733      168248958
                    X     FMR1NB              146768339      146768419

Figure 3. Data overlaps among three datasets: A: Devine lab [5]; B: this study; C: dbRIP [7]. (a) Deletion; (b) Insertion.
[Venn diagram region counts as extracted, in order: 2549, 145, 236, 285, 1831, 1281 and 230, 331, 903, 658, 296, 1171]
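The split-read idea behind the detection panels (a) to (c) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a read has already been aligned as two pieces on the same chromosome and strand, with 1-based inclusive coordinates, and the names `Segment`, `Variant` and `classify_split_read` are hypothetical. A gap on the reference with no gap in the read is called a deletion; a gap in the read with no gap on the reference is called an insertion, and the unaligned middle of the read gives the inserted sequence. The coordinates in the usage lines are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """One aligned piece of a split read (1-based, inclusive coordinates)."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


@dataclass
class Variant:
    kind: str               # "deletion" or "insertion"
    ref_left: int           # last reference base matched before the breakpoint
    ref_right: int          # first reference base matched after the breakpoint
    length: int             # number of bases deleted or inserted
    inserted_seq: str = ""  # only filled for insertions


def classify_split_read(read: str, left: Segment, right: Segment,
                        min_len: int = 1) -> Optional[Variant]:
    """Classify a read split into two segments (hypothetical helper).

    Assumes `left` precedes `right` in both read and reference order.
    """
    # Bases of the read that lie between the two aligned pieces.
    read_gap = right.read_start - left.read_end - 1
    # Reference bases that lie between the two aligned pieces.
    ref_gap = right.ref_start - left.ref_end - 1

    if read_gap == 0 and ref_gap >= min_len:
        # Read is contiguous, reference is not: sample lacks ref_gap bases.
        return Variant("deletion", left.ref_end, right.ref_start, ref_gap)
    if ref_gap == 0 and read_gap >= min_len:
        # Reference is contiguous, read is not: sample carries extra bases.
        seq = read[left.read_end:right.read_start - 1]
        return Variant("insertion", left.ref_end, right.ref_start, read_gap, seq)
    return None  # anything else (overlaps, complex events) is left unclassified


# Usage with made-up coordinates.
read = "ACGTACGTAC" + "TTTT" + "GGCCGGCCGG"
ins = classify_split_read(read, Segment(1, 10, 1000, 1009), Segment(15, 24, 1010, 1019))
dele = classify_split_read(read, Segment(1, 12, 5000, 5011), Segment(13, 24, 5312, 5323))
```

Because both breakpoints come directly from the ends of the aligned pieces, the event is located to the base, which is the property the poster contrasts with array CGH and fosmid-end mapping. Tandem repeats (panel d) would need additional handling that this sketch does not attempt.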
Chromosomes, Reads and Structural Variants

Columns: Chr No, Reads, Chr_Size (Mb), Deletion, Deletion/Size, Insertion, Insertion/Size, VNTRs, VNTRs/Size, Total, Total/Size. Values are listed per chromosome in the order they appear in the transcript.

1  3371754  245.522847  194  0.790150499  191  0.777931677  182  0.74127521  567  2.309357385
2  2171482  243.018229  147  0.604892895  146  0.600777977  127  0.522594542  420  1.728265413
3  1935162  199.50574  128  0.64158555  114  0.571412131  123  0.616523615  365  1.829521296
4  1523375  191.411218  139  0.726185233  104  0.543332836  115  0.600800733  358  1.870318802
5  1476265  180.857866  110  0.608212418  80  0.442336304  86  0.475511527  276  1.526060249
6  2768970  170.975699  224  1.310127704  219  1.280883782  662  3.871895269
7  1696097  158.628139  154  0.970823972  0.718661901  132  0.832134833  400  2.521620707
8  1354111  146.274826  112  0.765681991  83  0.567425047  74  0.50589703  269  1.839004068
9  1709487  138.429268  119  0.859644797  0.823525268  100  0.722390586  333  2.405560651
10  2100332  135.413628  149  1.100332383  158  1.166795413  456  3.367460179
11  2046883  134.452384  152  1.130511751  168  1.249512987  130  0.96688505  450  3.346909788
12  2066086  132.449811  142  1.07210421  1.049454121  1.124954418  430  3.246512749
13  1561043  114.14298  1.121400545  1.287858439  141  1.235292788  416  3.644551772
14  886057  106.368585  58  0.545273776  48  0.451261056  55  0.51706996  161  1.513604792
15  959438  100.338915  64  0.63783827  72  0.717568054  71  0.707601831  207  2.063008156
16  858369  88.827254  70  0.788046425  51  0.57414811  85  0.956913517  206  2.319108052
17  848567  78.774742  61  0.774359883  50  0.634721216  78  0.990165096  189  2.399246195
18  744095  76.117153  57  0.748845664  0.630606875  0.656882162  155  2.036334701
19  636170  63.811651  41  0.642515894  35  0.548489178  1.096978356  2.287983428
20  1540941  62.435964  82  1.313345622  81  1.297329212  245  3.924020457
21  399977  46.944323  0.745564059  37  0.788167719  1.086393343  2.620125121
22  811301  49.55471  54  1.089704692  62  1.251142424  88  1.775815054  204  4.11666217
X  1834825  154.824264  0.52317381  105  0.678188272  69  0.445666578  255  1.64702866
Y  96878  57.701691  0.017330515  0.034661029
Total  35397665  3076.78  2500  0.813  2358  0.767  2435  0.791  7293  2.371

References

1. Inoue, K. & Lupski, J. R. Molecular mechanisms for genomic disorders. Annu. Rev. Genomics Hum. Genet. 3, 199–242 (2002).
2. Botstein, D. & Risch, N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat. Genet. 33 Suppl, 228–237 (2003).
3. Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525–528 (2004).
4. Iafrate, A. J. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951 (2004).
5. Bennett, E. A. et al. Natural genetic variation caused by transposable elements in humans. Genetics 168, 933–951 (2004).
6. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).
7. Wang, J. et al. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329 (2006).

Acknowledgement: The project is funded by the Wellcome Trust.
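The "/Size" columns of the chromosome table appear to be variant counts divided by chromosome size in megabases. The short sketch below recomputes them for two complete rows as a consistency check. The dictionary layout and the function name `per_mb` are illustrative assumptions, not part of the poster; the numbers are copied from the rows for chromosomes 1 and 2.

```python
# Counts and sizes copied from the chromosome table; sizes are taken to be in Mb.
rows = {
    "1": {"size_mb": 245.522847, "deletion": 194, "insertion": 191, "vntr": 182},
    "2": {"size_mb": 243.018229, "deletion": 147, "insertion": 146, "vntr": 127},
}


def per_mb(row: dict) -> dict:
    """Return variants per megabase for each type and for the row total."""
    size = row["size_mb"]
    counts = {k: v for k, v in row.items() if k != "size_mb"}
    densities = {k: v / size for k, v in counts.items()}
    densities["total"] = sum(counts.values()) / size
    return densities


densities = {chrom: per_mb(row) for chrom, row in rows.items()}

for chrom, d in densities.items():
    print(chrom, {k: round(v, 9) for k, v in d.items()})
```

For chromosome 1 this gives about 0.790 deletions, 0.778 insertions and 0.741 VNTRs per Mb, in agreement with the tabulated values, so the same division could be used to sanity-check any other row.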