The SSAHA2 Application Pack
By Zemin Ning & Adam Spargo
Informatics Division, The Wellcome Trust Sanger Institute
SSAHA2
- ssahaEST: cDNA/EST alignment
- cross_genome: genome alignment
- ssaha2: sequence alignment
- TraceSearch: trace alignment
- ssahaSNP: SNP/indel detection
- ssahaSV: structural variation
Exon/Intron Splice Sites
mRNA:        5’-XXXXX XXXXXXXXX-3’
genomic DNA: 5’-XXXXXGTXXXXXXXXXAXXXXXXXXXXAGXXXXXXXXX-3’ (donor GT, branch point A, acceptor AG)
- Introns have conserved splice sites (donor, acceptor, branch point), so an intron can be defined as a gap with splice signals.
- Initially, GT-AG introns were found to be spliced by a spliceosome containing the U1, U2, U4/U6 and U5 snRNPs.
- However, real donors vary significantly.
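The idea of treating an intron as an alignment gap flanked by splice signals can be illustrated with a minimal Python sketch. It is not the ssahaEST implementation: the function name, the half-open coordinates and the toy sequence (with N standing in for the X placeholders of the slide) are assumptions made for illustration, and only the canonical GT-AG pair is checked.

```python
def has_canonical_splice_signals(genome, intron_start, intron_end,
                                 donor="GT", acceptor="AG"):
    """Return True if genome[intron_start:intron_end] starts with the donor
    signal and ends with the acceptor signal (half-open coordinates)."""
    intron = genome[intron_start:intron_end]
    if len(intron) < len(donor) + len(acceptor):
        return False  # too short to carry both signals
    return intron.startswith(donor) and intron.endswith(acceptor)


# Toy genomic sequence: exon, GT, spacer, branch A, spacer, AG, exon.
exon1, exon2 = "N" * 5, "N" * 9
intron = "GT" + "N" * 9 + "A" + "N" * 10 + "AG"
genome = exon1 + intron + exon2

gap_start = len(exon1)
gap_end = gap_start + len(intron)
print(has_canonical_splice_signals(genome, gap_start, gap_end))
```

A gap whose boundaries are shifted by one base no longer carries both signals, which is the kind of situation an aligner that adjusts splice sites would try to correct.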
Site Modelling
Weight Matrix Model (WMM): Staden R. (1984) Nucleic Acids Res. 12,
[Table: donor weight matrix with rows A, C, G, T]
- WMMs are constructed for donor, acceptor and branch sites based on EnsEMBL annotation.
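A weight matrix model of this kind can be sketched in a few lines of Python. The sketch below assumes a log-odds formulation against a uniform background with a pseudocount, which is one common way to build a WMM and not necessarily the one used here; the training sites are invented toy donors, not EnsEMBL annotation, and the function names are hypothetical.

```python
import math


def build_wmm(sites, pseudocount=1.0, alphabet="ACGT"):
    """Build a log-odds weight matrix from equal-length aligned sites.
    Assumes a uniform background frequency for every base."""
    background = 1.0 / len(alphabet)
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in alphabet}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({base: math.log2((counts[base] / total) / background)
                       for base in alphabet})
    return matrix


def score_site(matrix, site):
    """Sum the per-position log-odds weights for a candidate site."""
    return sum(column[base] for column, base in zip(matrix, site))


# Invented toy donor sites, for illustration only.
toy_donors = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
wmm = build_wmm(toy_donors)
print(score_site(wmm, "GTAAGT"), score_site(wmm, "CCCCCC"))
```

A candidate that resembles the training sites scores higher than an unrelated one, so the score can be used to rank alternative splice positions near an alignment gap.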
U2 and U12 Donors
[Figure: U2 donor logo; U12 donor logo]
U2 and U12 Branch
[Figure: U2 branch signal logo; U12 branch logo]
U2 and U12 Acceptors
[Figure: U2 acceptor logo; U12 acceptor logo]
1. Improvement of SSAHA
SSAHA2 - “Unaware” of Splice Sites
[Figure: Query/Subject alignment coordinates for transcript >tr:ENST from SSAHA2 and from EnsEMBL, with the differences between them]
ssahaEST – Adjusted Splice Sites
[Figure: Query/Subject alignment coordinates for transcript >tr:ENST from ssahaEST and from EnsEMBL, with the differences between them]
ssahaSNP – Detecting SNPs/indels by Genomic Alignment
[Figure: SSAHA2 with three clients; reads Read_1, Read_i and Read_m aligned to a reference, with a SNP/indel locus marked]
Current packages: Gap4, POLYBAYES, POLYPHRED, PTA, TGICL, autoSNP, miraEST, SeqDoC, etc.
A multiple read alignment can be reconstructed from the individual alignments, because the aligned position of every base in every read is expressed relative to a common reference (consensus).
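Reconstructing a multiple alignment from individual read-to-reference alignments can be sketched as a simple pileup. This is a minimal illustration under stated assumptions, not the ssahaSNP code: reads are taken to be gap-free, each alignment is described only by its start position on the reference, and the function names and toy sequences are hypothetical.

```python
from collections import defaultdict


def ungapped_alignment(ref_start, read_seq):
    """Describe a gap-free alignment as (reference position, read base) pairs."""
    return [(ref_start + offset, base) for offset, base in enumerate(read_seq)]


def build_pileup(alignments):
    """Group the bases of all reads by the reference position they align to."""
    pileup = defaultdict(list)
    for read_id, aligned_bases in alignments.items():
        for ref_pos, base in aligned_bases:
            pileup[ref_pos].append((read_id, base))
    return pileup


def candidate_snps(reference, pileup):
    """Return reference positions where at least one read base differs."""
    return sorted(pos for pos, column in pileup.items()
                  if any(base != reference[pos] for _, base in column))


reference = "ACGTACGT"
alignments = {
    "Read_1": ungapped_alignment(0, "ACGTAC"),
    "Read_i": ungapped_alignment(2, "GTTCGT"),  # differs from the reference at position 4
    "Read_m": ungapped_alignment(1, "CGTACG"),
}
pileup = build_pileup(alignments)
print(candidate_snps(reference, pileup))
```

Because every read is placed on the same coordinate system, no read-to-read alignment is needed to see which reads cover a candidate locus.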
Neighbourhood Quality Standard (NQS)
(1) The quality value (Q) of the SNP base is at least 23, and the Q value of each of the 5 bases on either side of the SNP is at least 15.
(2) At least nine of the ten flanking bases match between reads.
(3) The cluster depth is no greater than e.g. 8 reads, on the basis that deeper clusters might comprise a low-copy repeat.
(4) The number of candidate SNPs in a cluster is at most 4, on the basis that clusters with more divergent sequences might be composed of low-copy repeats (recently diverged paralogous sequences that are accumulating sequence differences between them).
Mullikin et al. Nature 407, 516 (2000)
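The four NQS conditions can be expressed as a small filter. The sketch below is illustrative only: it reads the quality thresholds as minimums and the cluster limits as maximums (an assumption), it takes the read and reference as gap-free aligned strings of equal length, and the function names and toy data are hypothetical.

```python
def passes_nqs(read, quals, reference, idx,
               snp_q=23, flank_q=15, flank=5, min_flank_matches=9):
    """Check conditions (1) and (2) for a mismatch at index idx of a
    gap-free alignment between read and reference."""
    if idx - flank < 0 or idx + flank >= len(read):
        return False  # not enough flanking bases on one side
    if quals[idx] < snp_q:
        return False  # SNP base quality too low
    flank_positions = [i for i in range(idx - flank, idx + flank + 1) if i != idx]
    if any(quals[i] < flank_q for i in flank_positions):
        return False  # a flanking base has low quality
    matches = sum(read[i] == reference[i] for i in flank_positions)
    return matches >= min_flank_matches


def cluster_ok(depth, n_candidate_snps, max_depth=8, max_snps=4):
    """Check conditions (3) and (4) for a cluster of aligned reads."""
    return depth <= max_depth and n_candidate_snps <= max_snps


reference = "ACGTACGTACG"
read = "ACGTATGTACG"  # differs from the reference at index 5
good_quals = [30] * len(read)
print(passes_nqs(read, good_quals, reference, 5), cluster_ok(6, 2))
```

Lowering the quality of the SNP base or of any flanking base below its threshold makes the candidate fail, as does a cluster that is too deep or too divergent.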
Output Format of ssahaSNP
Output Format of Parsed SNPs
Output Format of Parsed Indels
ssahaSV - A Computational Method to Detect Structural Variations
Detection of Structural Variations
[Figure: sample reads aligned to the reference sequence, illustrating a deletion, an insertion and a VNTR]
DNA Sources and Reads
Species       Cell line      Number of reads
Human         HAPMAP         ,841,054
Human         HAPMAP         ,977,374
Human         HAPMAP         ,488,765
Human         HAPMAP         ,728,821
Human         HAPMAP         ,845
Human         Celera HuAA    2,788,046
Human         Celera HuBB    19,397,599
Human         Celera HuCC    1,745,337
Human         Celera HuDD    2,011,152
Human         Celera HuFF    1,507,522
Total Human                  44,043,515
Chimpanzee    Clint          30,838,333
Total Reads                  74,881,848
Length distribution of structural variants with Chimp ancestral data included.
Target Site Duplications - Retrotransposons
[Figure: sample reads aligned to the reference, showing target site duplications for a VNTR and for a deletion]
Distribution of Target Site Duplication
Computational Validation - NOD (Non-Obese Diabetic) Mouse Clone vs Reference Sequence
[Figure: NOD sequence aligned to the reference sequence, showing a deletion and an insertion]
Experimental validation – PCR Tests
[Figure: four PCR test panels: Insertion Chr13:, Deletion Chr6:, Insertion Chr1:, Deletion Chr1:]
Mapping Variants to Ensembl
[Table: type of variation (SV_deletion, SV_insertion, SV_VNTRs) by location (exonic, intronic, non-coding, total)]
- A total of 7,293 structural variants were identified (2,500 deletions, 2,358 insertions and 2,435 VNTRs) using 44 million shotgun reads from 10 different human individuals.
- 66% of the structural variant sequences can be masked as retrotransposons.
- 28% of human variants share the same location with the chimp, i.e. they are ancestral states.
- 89% of ancestral deletions are retrotransposons (66% for VNTRs).
- 38% of variants are located in exon/intron regions.
Conclusion: mobile transposons are not more active in intragenic regions, as gene coverage of the human genome is also ~38%.
Acknowledgements: Jim Mullikin, two “Tony Cox”es, Nikolar Ivanov, Richard Durbin.
The project is funded by the Wellcome Trust.