Precise Identification of Structural Variations in the Human Genome by Splitting Shotgun Reads

Zemin Ning¹, Anthony Cox¹, David Adams¹, Paul Flicek², Charles Shaw-Smith¹, Mark Griffiths¹, Adam Spargo¹, Jane Rogers¹ and Richard Durbin¹

¹The Wellcome Trust Sanger Institute
²EMBL-EBI
Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK

INTRODUCTION

Extensive structural variation exists between the genomes of human individuals¹,². Disease and disease susceptibility may be associated with this type of genetic variation. Current experimental and computational methods for studying it include detection of copy-number variation using array CGH³,⁴, identification of insertions/deletions using sequencing traces⁵, and fine-scale mapping using paired fosmid ends⁶, from which hundreds of submicroscopic copy-number variants and inversions have been identified. The sequences involved were reported to sometimes contain entire genes and their regulatory regions, and to be up to millions of DNA bases in size. However, the comparative microarray studies reported in the literature lack sequence-level precision on breakpoints, and they surveyed only a small fraction of the sequence. The in silico strategy⁶ using fosmid ends achieved higher resolution, but in most cases it still cannot provide exact breakpoint loci, nor can it detect variants shorter than 5 kb. Short indels (<50 bp) can be identified by aligning shotgun reads against the genome assembly. Much progress is still needed, however, before all types of structural variation across the different size ranges can be detected accurately.

We have developed a computational method for the precise identification of structural variants across the genome by aligning shotgun reads against the reference sequence. Because individual reads that cover the boundaries of a variant region are split in the alignment, we can pinpoint the exact breakpoint loci and, where applicable, extract the sequence lying between the boundaries. The DNA samples used in this analysis came from 10 different human individuals and one male chimpanzee, with a total of 74 million shotgun reads, providing a rich and diverse resource for studying structural variation in the human genome.

Detection of Structural Variants

[Schematic of sample reads aligned against the reference. Panels: (a) Deletion; (b) Insertion; (c) Insertion with sequence; (d) VNTRs.]

Target Site Duplications and Length Distribution

Figure 1. Length distribution of structural variants with Chimp ancestral data included.
Figure 2. Length distribution of target site duplications.

Experimental validation – PCR Tests

1. Insertion Chr1:237001745
2. Deletion Chr1:56954646-56954968
3. Deletion Chr6:39030177-39030481
4. Insertion Chr13:30790030

Results and Conclusion

A total of 7,293 structural variants were identified (2,500 deletions, 2,358 insertions and 2,435 VNTRs) using 44 million shotgun reads from 10 different human individuals. To assess the ancestral state of each variant against the chimpanzee genome, we also used 30 million chimp reads. Compared with one existing database of structural variations, dbRIP⁷, there are 545 exact matches among 2,095 retrotransposon insertion polymorphisms (L1, Alu and SVA).

66% of the structural variant sequences can be masked as retrotransposons;
28% of human variants share the same location with the chimp, i.e. they represent ancestral states;
89% of ancestral deletions are retrotransposons (66% for VNTRs);
38% of variants are located in exonic or intronic regions.

Conclusion: Mobile transposons are significantly more active in intragenic regions, and this might lead to phenotypic differences among human individuals.
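The split-read idea described in the introduction can be illustrated with a minimal Python sketch. It is not the method used in this study. It assumes a read has already been aligned to the reference as two co-linear segments on the same strand, and the names `Segment`, `Variant` and `classify_split_read`, the 0-based half-open coordinates and the `min_size` parameter are all hypothetical. Two segments that are adjacent in the read but separated on the reference suggest a deletion in the sample. Two segments that are adjacent on the reference but separated in the read suggest an insertion, and the unaligned read bases between them are the inserted sequence.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    """One aligned piece of a read (hypothetical type; 0-based, half-open)."""
    read_start: int
    read_end: int
    ref_start: int
    ref_end: int


@dataclass(frozen=True)
class Variant:
    """A candidate variant with base-pair breakpoints on the reference."""
    kind: str        # "deletion" or "insertion"
    ref_start: int   # left breakpoint on the reference
    ref_end: int     # right breakpoint; equals ref_start for an insertion
    sequence: str    # inserted bases taken from the read; empty for a deletion


def classify_split_read(read: str, left: Segment, right: Segment,
                        min_size: int = 1) -> Optional[Variant]:
    """Classify a read split into two segments (assumed same strand, in order)."""
    read_gap = right.read_start - left.read_end   # read bases between the segments
    ref_gap = right.ref_start - left.ref_end      # reference bases between them

    if read_gap == 0 and ref_gap >= min_size:
        # The read is contiguous but skips reference bases: deletion in the sample.
        return Variant("deletion", left.ref_end, right.ref_start, "")
    if ref_gap == 0 and read_gap >= min_size:
        # The reference is contiguous but the read carries extra bases: insertion.
        inserted = read[left.read_end:right.read_start]
        return Variant("insertion", left.ref_end, left.ref_end, inserted)
    return None  # anything else (e.g. tandem repeats, overlaps) is not handled here


if __name__ == "__main__":
    # Toy reference AAAACCCCGGGGTTTT; the sample read lacks CCCCGGGG.
    print(classify_split_read("AAAATTTT",
                              Segment(0, 4, 0, 4), Segment(4, 8, 12, 16)))
    # Toy reference AAAACCCC; the sample read carries two extra bases, GG.
    print(classify_split_read("AAAAGGCCCC",
                              Segment(0, 4, 0, 4), Segment(6, 10, 4, 8)))
```

The sketch deliberately ignores VNTRs, strand changes and overlapping segments, each of which would need its own rule. In practice several independent reads supporting the same breakpoint would normally be required before a variant is reported.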
Exonic, Intronic and Noncoding Mapping

Type of Variation   Coding_Exonic   Coding_Intronic   Noncoding   Total
SV_deletion         17              892               1591        2500
SV_insertion        2               897               1459        2358
SV_VNTRs            8               966               1461        2435

DNA Sources and Reads

Species       Cell lines      Number of reads
Human         HAPMAP 17109    1,841,054
              HAPMAP 17119    5,977,374
              HAPMAP 11321    4,488,765
              HAPMAP 07340    3,728,821
              HAPMAP 10470    557,845
              Celera HuAA     2,788,046
              Celera HuBB     19,397,599
              Celera HuCC     1,745,337
              Celera HuDD     2,011,152
              Celera HuFF     1,507,522
              Total Human     44,043,515
Chimpanzee    Clint           30,838,333
Total Reads                   74,881,848

Genes Affected by Detected Variants

Type of Variation   Chr   Name of the gene    offset_start   offset_end
SV_deletion         1     SEC22L1             142581992      142586126
                    10    ENSG00000080218     114102173      114106649
                          SFMBT2              7491228        7492455
                    11    ENK17_HUMAN         101071004      101080469
                          FAM55A              113930121      113937136
                    16    UBN1                4837686        4839024
                    18    DHFR                22001811       22005321
                    21    BAGE4               10118653       10130352
                    22    ARVCF               18378162       18379189
                          GSTT2               22598698       22635852
                          Q8N7Q6_HUMAN        36068891       36074529
                          XP_372900.2
                    3     PFN2                151170677      151171962
                          Q96EG4_HUMAN
                    4     Q9UN78_HUMAN        18755734       18761820
                    6     ENSG00000197659     81682701       81687383
                          KIAA1949            30760573       30764396
SV_insertion        12    KRT4                51493851       51493869
                    2     ENSG00000177083     31960130       31960137
SV_VNTRs            13    ENSG00000182751     36266859       36266939
                          Q8WYY0_HUMAN        97987112       97987144
                          NP_660337.2         655266         655362
                    17    KRTAP4-10           36594293       36594448
                          ENOSF1              702428         702517
                          CU025_HUMAN         42245600       42249864
                          KIF25               168248733      168248958
                    X     FMR1NB              146768339      146768419

Figure 3. Data overlaps among three datasets: A – Devine lab⁵; B – This Study; C – dbRIP⁷. (a) Deletion; (b) Insertion.

Chromosomes, Reads and Structural Variants
Chr No   Reads      Chr_Size (Mb)   Deletion   Deletion/Size   Insertion   Insertion/Size   VNTRs   VNTRs/Size    Total   Total/Size
1        3371754    245.522847      194        0.790150499     191         0.777931677      182     0.74127521    567     2.309357385
2        2171482    243.018229      147        0.604892895     146         0.600777977      127     0.522594542   420     1.728265413
3        1935162    199.50574       128        0.64158555      114         0.571412131      123     0.616523615   365     1.829521296
4        1523375    191.411218      139        0.726185233     104         0.543332836      115     0.600800733   358     1.870318802
5        1476265    180.857866      110        0.608212418     80          0.442336304      86      0.475511527   276     1.526060249
6        2768970    170.975699      224        1.310127704     219         1.280883782      219     1.280883782   662     3.871895269
7        1696097    158.628139      154        0.970823972     114         0.718661901      132     0.832134833   400     2.521620707
8        1354111    146.274826      112        0.765681991     83          0.567425047      74      0.50589703    269     1.839004068
9        1709487    138.429268      119        0.859644797     114         0.823525268      100     0.722390586   333     2.405560651
10       2100332    135.413628      149        1.100332383     149         1.100332383      158     1.166795413   456     3.367460179
11       2046883    134.452384      152        1.130511751     168         1.249512987      130     0.96688505    450     3.346909788
12       2066086    132.449811      142        1.07210421      139         1.049454121      149     1.124954418   430     3.246512749
13       1561043    114.14298       128        1.121400545     147         1.287858439      141     1.235292788   416     3.644551772
14       886057     106.368585      58         0.545273776     48          0.451261056      55      0.51706996    161     1.513604792
15       959438     100.338915      64         0.63783827      72          0.717568054      71      0.707601831   207     2.063008156
16       858369     88.827254       70         0.788046425     51          0.57414811       85      0.956913517   206     2.319108052
17       848567     78.774742       61         0.774359883     50          0.634721216      78      0.990165096   189     2.399246195
18       744095     76.117153       57         0.748845664     48          0.630606875      50      0.656882162   155     2.036334701
19       636170     63.811651       41         0.642515894     35          0.548489178      70      1.096978356   146     2.287983428
20       1540941    62.435964       82         1.313345622     82          1.313345622      81      1.297329212   245     3.924020457
21       399977     46.944323       35         0.745564059     37          0.788167719      51      1.086393343   123     2.620125121
22       811301     49.55471        54         1.089704692     62          1.251142424      88      1.775815054   204     4.11666217
X        1834825    154.824264      81         0.52317381      105         0.678188272      69      0.445666578   255     1.64702866
Y        96878      57.701691       (column assignment not recoverable; surviving values 0.017330515 and 0.034661029)
Total    35397665   3076.78         2500       0.813           2358        0.767            2435    0.791         7293    2.371

References

1. Inoue, K. & Lupski, J.R. Molecular mechanisms for genomic disorders. Annu. Rev. Genomics Hum. Genet. 3, 199–242 (2002).
2. Botstein, D. & Risch, N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nat. Genet. 33 Suppl, 228–237 (2003).
3. Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525–528 (2004).
4. Iafrate, A.J. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951 (2004).
5. Bennett, E.A. et al. Natural genetic variation caused by transposable elements in humans. Genetics 168, 933–951 (2004).
6. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727–732 (2005).
7. Wang, J., Song, L., Grover, D., Azrak, S., Batzer, M.A. & Liang, P. dbRIP: a highly integrated database of retrotransposon insertion polymorphisms in humans. Hum. Mutat. 27, 323–329 (2006).

Acknowledgement: This project is funded by the Wellcome Trust.