The SSAHA2 Application Pack
By Zemin Ning & Adam Spargo
Informatics Division, The Wellcome Trust Sanger Institute



SSAHA2
- ssahaEST: cDNA/EST alignment
- cross_genome: genome alignment
- ssaha2: sequence alignment
- TraceSearch: trace alignment
- ssahaSNP: SNP/indel detection
- ssahaSV: structural variation

Exon/Intron Splice Sites
mRNA:        5’-XXXXX XXXXXXXXX-3’
genomic DNA: 5’-XXXXXGTXXXXXXXXXAXXXXXXXXXXAGXXXXXXXXX-3’
(labelled on the genomic sequence: Donor, Branch point, Acceptor)
- Introns have conserved splice sites (donor, acceptor, branch point) => define an intron as a gap with splice signals.
- Initially, it was discovered that GT-AG introns are spliced by a spliceosome containing the U1, U2, U4/U6 and U5 snRNPs.
- However, real donors vary significantly.
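The idea of defining an intron as a gap with splice signals can be sketched as follows. This is a minimal illustration, not the ssahaEST implementation: the function name `has_gt_ag_signals` and the toy sequence (patterned on the slide, with N standing for any base) are assumptions, and only the canonical GT donor and AG acceptor are checked, although real donors vary.

```python
def has_gt_ag_signals(genomic: str, gap_start: int, gap_end: int) -> bool:
    """Return True if the gap genomic[gap_start:gap_end] begins with a
    GT donor and ends with an AG acceptor (canonical signals only)."""
    intron = genomic[gap_start:gap_end].upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Toy genomic sequence: exon, GT donor, intron body with a branch-point A,
# AG acceptor, exon. N is a placeholder for any base.
genomic = "N" * 5 + "GT" + "N" * 9 + "A" + "N" * 10 + "AG" + "N" * 9

# The mRNA aligns to genomic[0:5] and genomic[29:38]; the gap is [5, 29).
gap_ok = has_gt_ag_signals(genomic, 5, 29)
# Shifting the gap by one base loses the donor signal.
gap_shifted = has_gt_ag_signals(genomic, 4, 29)
print(gap_ok, gap_shifted)
```

An aligner that sees a gap without these signals could try shifting the gap boundaries by a few bases until both signals line up.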

Site Modelling
Weight Matrix Model (WMM):
[Table: donor weight matrix over the bases A, C, G, T]
Staden R. (1984) Nucleic Acids Res. 12,
- WMMs are constructed for donor, acceptor and branch sites based on EnsEMBL annotation.
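A weight matrix model can be sketched in a few lines. The example below is an assumption-laden illustration: the four training sites are invented toy data (not taken from EnsEMBL), the helper names `build_wmm` and `score_site` are hypothetical, and the log-odds scoring against a uniform background with a pseudocount is one common formulation rather than necessarily the one used in the SSAHA2 pack.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_wmm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites."""
    wmm = []
    total = len(sites) + pseudocount * len(BASES)
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        column = {
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in BASES
        }
        wmm.append(column)
    return wmm


def score_site(wmm, seq):
    """Sum the per-position weights for a candidate site."""
    return sum(column[base] for column, base in zip(wmm, seq))


# Invented donor-like training sites (toy data, for illustration only).
toy_donors = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
wmm = build_wmm(toy_donors)

good = score_site(wmm, "GTAAGT")
bad = score_site(wmm, "CCCCCC")
print(f"consensus-like site: {good:.2f}, unrelated site: {bad:.2f}")
```

Candidate sites scoring above a chosen threshold would be treated as plausible donors; the same construction applies to acceptor and branch sites.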

U2 and U12 Donors
[Figure: sequence logos for the U2 donor and the U12 donor]

U2 and U12 Branch
[Figure: sequence logos for the U2 branch signal and the U12 branch signal]

U2 and U12 Acceptors
[Figure: sequence logos for the U2 acceptor and the U12 acceptor]

1. Improvement of SSAHA
SSAHA2 - “Unaware” of Splice Sites
[Table: Query and Subject columns for SSAHA2 and for EnsEMBL, plus Differences, for transcript >tr:ENST]

ssahaEST – Adjusted Splice Sites
[Table: Query and Subject columns for ssahaEST and for EnsEMBL, plus Differences, for transcript >tr:ENST]

ssahaSNP – Detecting SNPs/indels by Genomic Alignment
[Figure: SSAHA2 with three clients; reads Read_1 ... Read_i ... Read_m aligned to the Reference, with a SNP/indel locus marked]
Current packages: Gap4, POLYBASES, POLYPHRED, PTA, TGICL, autoSNP, miraEST, SeqDoC, etc.
A multiple read alignment can be reconstructed from the individual alignments, because the aligned position of each base in each read is given relative to a common reference (consensus).
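Reconstructing a multiple alignment from individual read-to-reference alignments can be sketched as a pileup keyed by reference position. This is a simplified illustration under stated assumptions: alignments are gapless (no indels), the tuple layout `(read_name, ref_start, aligned_bases)` and the function names are hypothetical, and the reference and reads are toy data.

```python
from collections import defaultdict


def build_pileup(alignments):
    """Map each reference position to the (read, base) pairs aligned there."""
    pileup = defaultdict(list)
    for read_name, ref_start, bases in alignments:
        for offset, base in enumerate(bases):
            pileup[ref_start + offset].append((read_name, base))
    return pileup


def candidate_mismatches(reference, pileup):
    """Return {position: [non-reference bases]} for columns with a mismatch."""
    candidates = {}
    for pos in sorted(pileup):
        alt = [base for _, base in pileup[pos] if base != reference[pos]]
        if alt:
            candidates[pos] = alt
    return candidates


reference = "ACGTACGTAC"
alignments = [
    ("Read_1", 0, "ACGTAC"),  # matches the reference throughout
    ("Read_i", 2, "GTATGT"),  # T instead of C at reference position 5
    ("Read_m", 4, "ATGTAC"),  # T instead of C at reference position 5
]

pileup = build_pileup(alignments)
candidates = candidate_mismatches(reference, pileup)
print(candidates)
```

Because every read is placed on the same reference coordinates, no read-to-read alignment step is needed to compare bases at a locus.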

Neighbourhood Quality Standard (NQS)
(1) The quality value (Q) of the SNP base is ≥ 23, and the Q value of the 5 bases on either side of the SNP is ≥ 15.
(2) At least nine of the ten flanking bases match between reads.
(3) The cluster depth is no greater than e.g. 8 reads, on the basis that deeper clusters might comprise a low-copy repeat.
(4) The number of candidate SNPs in a cluster is ≤ 4, on the basis that clusters with more divergent sequences might be composed of low-copy repeats (recently diverged paralogous sequences that are accumulating sequence differences between them).
Mullikin et al. Nature 407, 516 (2000)
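The four NQS criteria can be expressed as a single filter. The sketch below is illustrative only: the function name `passes_nqs` and its parameters are hypothetical, the quality values are toy data, and the thresholds 23, 15 and 4 are read here as "at least", "at least" and "at most" respectively, which is an assumption about the comparison direction.

```python
def passes_nqs(quals, snp_index, flank_matches, cluster_depth, snps_in_cluster,
               min_snp_q=23, min_flank_q=15, flank=5,
               min_flank_matches=9, max_depth=8, max_snps=4):
    """Apply the four NQS criteria to one candidate SNP in one read.

    quals            per-base quality values of the read
    snp_index        index of the candidate SNP base in the read
    flank_matches    how many of the 2 * flank neighbouring bases match
    cluster_depth    number of reads in the cluster
    snps_in_cluster  number of candidate SNPs in the cluster
    """
    # The SNP needs a full flank on both sides.
    if snp_index < flank or snp_index + flank >= len(quals):
        return False
    # (1) quality of the SNP base and of the flanking bases
    if quals[snp_index] < min_snp_q:
        return False
    flanking = quals[snp_index - flank:snp_index] + quals[snp_index + 1:snp_index + flank + 1]
    if min(flanking) < min_flank_q:
        return False
    # (2) flanking bases must mostly match
    if flank_matches < min_flank_matches:
        return False
    # (3) deep clusters may be low-copy repeats
    if cluster_depth > max_depth:
        return False
    # (4) too many candidate SNPs suggests paralogous sequence
    return snps_in_cluster <= max_snps


good_quals = [30] * 11
low_snp_quals = [30] * 5 + [20] + [30] * 5

accepted = passes_nqs(good_quals, 5, flank_matches=10, cluster_depth=6, snps_in_cluster=1)
rejected_quality = passes_nqs(low_snp_quals, 5, flank_matches=10, cluster_depth=6, snps_in_cluster=1)
rejected_depth = passes_nqs(good_quals, 5, flank_matches=10, cluster_depth=9, snps_in_cluster=1)
print(accepted, rejected_quality, rejected_depth)
```

Keeping the thresholds as keyword arguments makes it easy to adjust them, for example the cluster depth limit that the slide gives only as an example value.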

Output Format of ssahaSNP

Output Format of Parsed SNPs

Output Format of Parsed Indels

ssahaSV - A Computational Method to Detect Structural Variations

Detection of Structural Variations
[Figure: sample reads aligned to the reference sequence, illustrating a deletion, an insertion and a VNTR]

DNA Sources and Reads

Species      Cell lines     Number of reads
Human        HAPMAP         ,841,054
Human        HAPMAP         ,977,374
Human        HAPMAP         ,488,765
Human        HAPMAP         ,728,821
Human        HAPMAP         ,845
Human        Celera HuAA    2,788,046
Human        Celera HuBB    19,397,599
Human        Celera HuCC    1,745,337
Human        Celera HuDD    2,011,152
Human        Celera HuFF    1,507,522
Total Human                 44,043,515
Chimpanzee   Clint          30,838,333
Total Reads                 74,881,848

Length distribution of structural variants with Chimp ancestral data included.

Target Site Duplications - Retrotransposons
[Figure: sample reads against the reference for a VNTR and for a deletion, showing the target site duplications]

Distribution of Target Site Duplication

Computational Validation - NOD (Non-Obese Diabetic) Mouse clone vs Reference Sequence
[Figure: NOD sequence aligned against the reference sequence, showing a deletion and an insertion]

Experimental validation – PCR Tests
- Insertion Chr13:
- Deletion Chr6:
- Insertion Chr1:
- Deletion Chr1:

Mapping Variants to Ensembl
[Table: SV_deletion, SV_insertion and SV_VNTRs by location: Exonic, Intronic, Non-coding, Total]
- A total of 7,293 structural variants have been identified (2,500 deletions, 2,358 insertions and 2,435 VNTRs) using 44 million shotgun reads from 10 different human individuals.
- 66% of structural variant sequences can be masked as retrotransposons.
- 28% of human variants share the same location with the chimp, i.e. they are ancestral states.
- 89% of ancestral deletions are retrotransposons; for VNTRs the figure is 66%.
- 38% of variants are located in exon/intron regions.
Conclusion: mobile transposons are not more active in intragenic regions, as gene coverage of the human genome is also ~38%.
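The arithmetic behind the conclusion can be checked in a few lines. The counts and percentages are those quoted on the slide; the variable names and the simple observed/expected ratio are illustrative assumptions, not a statistical test used by the authors.

```python
# Variant counts as quoted on the slide.
counts = {"deletions": 2500, "insertions": 2358, "VNTRs": 2435}
total_variants = sum(counts.values())

# Fraction of variants in exon/intron regions vs. approximate gene coverage.
fraction_in_genes = 0.38
gene_coverage = 0.38  # quoted as "~38%", so this is approximate

# A ratio near 1 means variants fall in genes about as often as expected
# if they were spread uniformly over the genome.
enrichment = fraction_in_genes / gene_coverage
print(total_variants, round(enrichment, 2))
```

A ratio well above 1 would have suggested that variants concentrate in genes; a ratio near 1 is what the slide's conclusion rests on.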

Acknowledgements:
- Jim Mullikin
- Two “Tony Cox”es
- Nikolar Ivanov
- Richard Durbin
The project is funded by the Wellcome Trust.