Copyright © 2004 Synamatix sdn bhd (538481-U) For audio portion of webcast please dial: +44 (0)870 22 333 65 (please omit zero if calling from outside.


Personal Introductions
- Robert Hercus, MD and Inventor, Synamatix: over 30 years of IT experience; pioneered many large-scale IT projects; "Language of Biology" is the basis of Synamatix; interests: linguistics, genomics, artificial intelligence
- Ali Zamli, Bioinformatician, Research Scientist: Synamatix applications development
- Dr. Arif Anwar, VP, Synamatix: 10+ years post-Ph.D.; US and EU genomics background; formerly at Agilent, CLONTECH and Axon Instruments

Questions to answer today
1. What is a SynaBASE?
2. What are the advantages of using SynaBASE?
3. In which situations has SynaBASE been applied?
4. Does the use of SynaBASE offer any advantages for phylogenetics?

Core IP: the SynaBASE™ platform
- Main partners and users in the US and EU
- 50+ staff split across the group
- Open approach to development: an engine, not software
- Focused on efficient HPC for genomics and life sciences

[Diagram: SynaBASE platform components]
- Access: API calls, graphical interface, command-line interface
- Applications: SXSequenceRefs, SXLRESearch, SXFuzzyPatternSearch, SXAlign, Sxpet, SXParse
- CORE database platform; data analysis; develop tools
- SynaRex Bulk, SynaProbe Bulk, SynaSearch Bulk

Software policy
- More than 40 existing applications
- All open source to licensees of SynaBASE
- Users can also develop, modify and share all applications

What do we know about data?
- Similarity and association
- Common PATTERNS and functionality

Pattern Trie
[Diagram: pattern trie over the example strings ACT, AAACCTTC, AACACTCTC, AACTACTC, AACTC]
- Going to a leaf node finds all sources and positions
- More memory efficient than variable-length data structures
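
The idea that a leaf node yields every source and position can be illustrated with a toy trie. This is only a sketch and not the SynaBASE data structure: it assumes fixed-length patterns, treats the diagram's four longer strings as source sequences, and the names `PatternTrie`, `add_sequence` and `lookup` are hypothetical.

```python
END = ""  # sentinel key for leaf data; a single character can never equal it


class PatternTrie:
    """Toy trie over fixed-length patterns (an assumption, not SynaBASE itself)."""

    def __init__(self, pattern_length):
        self.pattern_length = pattern_length
        self.root = {}

    def add_sequence(self, source_id, sequence):
        # Index every window of pattern_length characters in the sequence.
        k = self.pattern_length
        for pos in range(len(sequence) - k + 1):
            node = self.root
            for ch in sequence[pos:pos + k]:
                node = node.setdefault(ch, {})
            # The leaf records every (source, position) where the pattern occurs.
            node.setdefault(END, []).append((source_id, pos))

    def lookup(self, pattern):
        # Walk down to the leaf and return all occurrences stored there.
        node = self.root
        for ch in pattern:
            node = node.get(ch)
            if node is None:
                return []
        return node.get(END, [])


trie = PatternTrie(3)
for source_id, seq in enumerate(["AAACCTTC", "AACACTCTC", "AACTACTC", "AACTC"]):
    trie.add_sequence(source_id, seq)
print(trie.lookup("ACT"))
```

A single walk of three characters returns every occurrence of ACT across all four sequences, which is the property the slide highlights.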

Pattern Trie
[Diagram: the same pattern trie, with nodes annotated f=20 and f=100 and the pattern AAA]
- Low-complexity repeats are filtered
- High-frequency patterns are removed from alignment seeding
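
Removing high-frequency patterns before seeding can be sketched with a simple counting pass. The function name `seed_patterns` and the `max_frequency` parameter are assumptions for illustration; the slide does not say how SynaBASE sets or applies its threshold.

```python
from collections import Counter


def seed_patterns(sequences, k, max_frequency):
    """Return the k-mers whose total count is at most max_frequency."""
    counts = Counter(
        seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)
    )
    # Patterns above the threshold (e.g. low-complexity repeats such as AAA)
    # are dropped so they never seed an alignment.
    return {pattern: n for pattern, n in counts.items() if n <= max_frequency}


print(seed_patterns(["AAAAAAAA", "AACTC"], k=3, max_frequency=2))
```

Here AAA occurs six times and is filtered out, while AAC, ACT and CTC remain available as seeds.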

Building a SynaBASE: easy and fast

- The build takes 8 minutes for Swiss-Prot
- The fields in the build form are equivalent to the command-line XML configuration
- Field data is converted into XML format and added to the existing entry in the SynaBASE XML configuration file

Pattern Trie
[Diagram: pattern trie over the example strings ACT, AAACCTTC, AACACTCTC, AACTACTC, AACTC]
- Trie boundary: frequency is greater than the build limit

Flexibility to use the command line

Single-server IT architecture
SynaBASE & SynaSuite server:
- HP Integrity rx4640 server
- Dual Intel Itanium2 1.5GHz CPU
- 64 GB DDR memory
- 146GB Ultra320 SCSI hard disk x 2
- Red Hat Enterprise Linux AS 3 for IA64

SynaBASE scales efficiently

SynaBASE enables very fast access
- The number of levels is small
- For a query: match the first longest pattern, then follow an Eulerian path through the network, picking up the longest matching pattern for each position in the query
- Processing time is proportional to query size to obtain all unique subpatterns
[Diagram: pattern network over ACT, AAACCTTC, AACACTCTC, AACTACTC, AACTC, with the patterns ACTCG, CTCG, CTCGA, TCGA]
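
Picking up the longest matching pattern at each query position can be sketched as follows. This is a naive illustration only: it restarts from the trie root at every position, so it does not reproduce the network traversal or the query-proportional running time claimed on the slide. The helper names `build_trie` and `longest_matches` are hypothetical.

```python
END = ""  # sentinel key marking the end of a stored pattern


def build_trie(patterns):
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_matches(trie, query):
    """For each query position, report the longest stored pattern starting there."""
    result = []
    for start in range(len(query)):
        node, best = trie, None
        for end in range(start, len(query)):
            node = node.get(query[end])
            if node is None:
                break
            if END in node:
                # Remember the longest complete pattern seen so far.
                best = query[start:end + 1]
        if best is not None:
            result.append((start, best))
    return result


trie = build_trie(["ACTCG", "CTCG", "CTCGA", "TCGA"])
print(longest_matches(trie, "ACTCGA"))
```

With the four patterns from the diagram, the query ACTCGA matches ACTCG at position 0, CTCGA at position 1 and TCGA at position 2.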

Efficiency leads to high performance
- Only 15 million nodes are needed to represent 56 million residues
- The storage of the shorter nodes has little effect

SynaBASE is very fast: Q * log_A(N)
[Chart: speed (milliseconds) against size of database (mega bp), comparing Conventional and SynaBASE]

BLASTN vs. SynaSearch-Bulk
[Chart: cumulative number of hits, with a region labelled "Novel hits"]
- The cumulative number of hits shows that SynaSearch Bulk found extra hits at low-to-mid identities
- SynaBASE and BLAST databases of bacterial ORFs were queried with 100 1 kb sequences

4. Novel annotation using SynaBASE
Example sentence: "The elephant and the giraffe walked up the mountain"
- A graph showing the frequency of "string (word)" patterns in a sentence does not reflect meaning
- A graph showing the probabilities of predicting predecessor and successor characters/events (string significance) reflects meaning

SIGNIFICANCE: actual frequency / expected frequency
[Diagram: elements a1, a2, a3 and the overlapping patterns a1a2, a2a3 and a1a2a3]
Expected frequency:
$$Ef(a_1 a_2 a_3) = \frac{F(a_1 a_2) \cdot F(a_2 a_3)}{F(a_2)}$$
Significance:
$$Sig(a_1 a_2 a_3) = \frac{F(a_1 a_2 a_3)}{Ef(a_1 a_2 a_3)} = \frac{F(a_1 a_2 a_3) \cdot F(a_2)}{F(a_1 a_2) \cdot F(a_2 a_3)}$$
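
The significance ratio can be worked through on a small string. This sketch assumes that each element a1, a2, a3 is a single character and that F counts overlapping occurrences in one text; the function names `count` and `significance` are hypothetical.

```python
def count(text, pattern):
    """Number of (possibly overlapping) occurrences of pattern in text."""
    k = len(pattern)
    return sum(1 for i in range(len(text) - k + 1) if text[i:i + k] == pattern)


def significance(text, triplet):
    """Sig = F(a1a2a3) / Ef(a1a2a3), with Ef = F(a1a2) * F(a2a3) / F(a2)."""
    a1a2, a2, a2a3 = triplet[:2], triplet[1], triplet[1:]
    observed = count(text, triplet)
    numerator = count(text, a1a2) * count(text, a2a3)
    if numerator == 0:
        # The expected frequency is zero, so no ratio can be formed.
        return 0.0
    expected = numerator / count(text, a2)
    return observed / expected


print(significance("ACTACTGCG", "ACT"))
```

In ACTACTGCG, F(ACT) = 2, F(AC) = 2, F(CT) = 2 and F(C) = 3, so the expected frequency is 4/3 and the significance is 1.5: ACT occurs more often than its overlapping pairs alone would predict.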

Gene models correlate with "SIGNIFICANCE"
[Figure: PIM1 oncogene, with tracks labelled F2, F3 and Ensembl Gene]

Example assembly result
- 400,000 reads assembled into 11 contigs in 11 minutes, with 2 minutes for error correction
- Genome coverage: 99.89%

FragBASE: using the SynaBASE structure
- Select patterns of high coverage
- Use corrected FragBASE
- Use FragBASE network* to extend patterns
- Increase pattern size to overcome shorter repeat sections

Example 2: Microarrays
- Probe design: mer probes, 8 per gene, in 8 h, compared to the previous 3-month+ process
- Probe evaluation and mapping: mapping of 600,000 Affymetrix 25mer probes to the human genome in 17 s, compared to over 2 weeks with BLAST

Example 3: Comparative Genomics
[Chart: run times for SynaBASE, BLAST and PatternHunter; values shown: 3 yrs, 6 h, 22 days]

Example 4: Genome mapping
Aims:
- Map whole-genome shotgun reads from a mammalian genome to the human genome, to facilitate genome assembly using Synamatix and public tools
- Compare the sensitivity, specificity and performance advantages of Synamatix technologies
Results: in comparison to BLASTz, SynaSearch:
- Is 219-fold faster
- Finds 11% more true positives
- Finds 17% more unique hits to queries
- Has a higher specificity: 113% fewer false positives, and fewer multiple placements per read (2.7 v 5.3)
Benefits:
- Enables significant enhancements in workflow throughput: a 219-fold compute time improvement
- SynaSearch requires only 1 search process, whereas BLASTz requires the genome to be separated into 5 MB chunks and apportioned across multiple processors
- Results in better assemblies of new genomes
- Reduces the current reliance on outsourcing of BLASTz analysis

"Inference of a phylogenetic network of whole prokaryotic genomes using SynaBASE"
A further example of the use of the SynaBASE engine: applying SynaBASE to phylogenetics

Outline of study
- Primary data set 1: 101 bacterial and archaeal genomes
- Used "SynaTree": exhaustive comparison between "Sequences" in the SynaBASE structure; generates a phylogenetic tree
- Used a prototype Synamatix application, "SXComparePattern": exhaustive pattern-based similarity matching
- Evaluation of methods using the C-score method* and group visualisation and clustering analysis
- Tested the "SXComparePattern" method with a larger data set of 488 bacterial genomes
*Henz S.R., Huson D.H., Auch A.F., Nieselt-Struwe K. and Schuster S.C. (2005) Whole-genome prokaryotic phylogeny. Bioinformatics. 21(10):

Phylogenetics using SynaTree
- For each query genome, SynaBASE can be searched for all alignments with all other genome sequences {srefs, posn, length}
- The alignment scores can then be used to calculate a distance matrix, where A = alignment score and L = length of the respective genomes
- The distance matrix is used to generate a phylogenetic tree

SynaTree Interface

SynaTree uses the SXAlign API for comparing alignments
It can be seen from the chart that the resulting triplets in a sliding window include significant alignments and also spurious short matches that are not significant. The SynaBASE align function, SXAlign, includes a filter to remove the random short alignments, or 'noise', from the alignment data. The alignment scores are then used to calculate a distance matrix.

Example of filtering
[Chart: the effect of using the diagonal alignment filter on the alignment of 2 serine kinase amino-acid sequences]
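
The transcript does not describe how the SXAlign filter works internally. As a rough illustration of what a diagonal filter can do, the sketch below groups matches by diagonal (target position minus query position) and keeps only those on diagonals with enough total matched length. The function `diagonal_filter`, the `(query_pos, target_pos, length)` tuple layout and the threshold are all assumptions, not the SXAlign API.

```python
from collections import defaultdict


def diagonal_filter(matches, min_diagonal_length):
    """Keep matches that lie on diagonals with enough total matched length."""
    totals = defaultdict(int)
    for query_pos, target_pos, length in matches:
        totals[target_pos - query_pos] += length
    # Isolated short matches sit on poorly supported diagonals and are dropped.
    return [
        m for m in matches if totals[m[1] - m[0]] >= min_diagonal_length
    ]


matches = [(0, 10, 8), (12, 22, 9), (30, 3, 4), (50, 70, 3)]
print(diagonal_filter(matches, min_diagonal_length=10))
```

The two matches on diagonal 10 support each other and survive, while the two isolated short matches are treated as noise.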

SynaTree for 101 bacterial & archaeal genomes
minutes! Compared to 7 days with BLAST

SynaTree for 101 bacterial & archaeal genomes
[Tree figure with labelled groups: Cyanobacteria, Firmicute, Chlamydiae]

2nd method: SXComparePattern
- Frequency of each pattern
- Raw score for patterns
- Calculation of the distance matrix from the raw score by the distance formula

SXComparePattern Approach
- The distance matrix is calculated as before, with some exceptions
- Here, the calculation is based on shared patterns between each pair of genomic sequences, where A = shared patterns between genomes i and j, and L = number of patterns for the respective genomes
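
The distance formula itself did not survive in this transcript, so the sketch below uses a stand-in normalisation, d = 1 - A / min(L_i, L_j), purely to show how shared-pattern counts can fill a distance matrix. It is not the formula used by SXComparePattern; fixed-length k-mers as "patterns" and the function names are further assumptions.

```python
def patterns(sequence, k):
    """Set of distinct k-mers in a sequence (stand-in for SynaBASE patterns)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def distance_matrix(genomes, k):
    sets = [patterns(g, k) for g in genomes]
    n = len(sets)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(sets[i] & sets[j])            # A: shared patterns
            smaller = min(len(sets[i]), len(sets[j]))  # from L_i and L_j
            # Assumed normalisation, NOT the slide's (lost) formula.
            d = 1.0 - shared / smaller if smaller else 1.0
            matrix[i][j] = matrix[j][i] = d
    return matrix


for row in distance_matrix(["AACTC", "AACTG", "GGGGG"], k=3):
    print(row)
```

The resulting symmetric matrix could then be passed to any standard tree-building method.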

SXComparePattern tree for 101 bacterial and archaeal genomes
seconds! Compared to 7 days with BLAST

SXComparePattern tree for 101 bacterial and archaeal genomes
[Tree figure with labelled groups: Chlamydiae, Cyanobacteria, Firmicute]

Performance based on grouping

Evaluation of phylogenetic networks
- Evaluation is based on the C-score proposed by Henz et al. (2005), which is essentially the sum of compatible non-trivial splits (Tc) divided by the sum of all non-trivial splits in the test tree
- The assumption is that the compatibility of non-trivial splits is compared against a reference tree which is deemed 'correct'
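
The ratio described above can be sketched in a few lines. This follows the slide's wording rather than the published definition in detail: each split is represented by one side as a `frozenset` of taxa, each test split carries a weight (use 1.0 to count splits instead), and only non-trivial splits are assumed to be passed in. The function names are hypothetical.

```python
def compatible(split_a, split_b, taxa):
    """Two splits are compatible if at least one pair of sides is disjoint."""
    a_sides = (split_a, taxa - split_a)
    b_sides = (split_b, taxa - split_b)
    return any(not (x & y) for x in a_sides for y in b_sides)


def c_score(test_splits, reference_splits, taxa):
    """Weight of test splits compatible with the reference tree / total weight."""
    total = sum(test_splits.values())
    if total == 0:
        return 0.0
    compatible_weight = sum(
        weight
        for split, weight in test_splits.items()
        if all(compatible(split, ref, taxa) for ref in reference_splits)
    )
    return compatible_weight / total


taxa = frozenset("ABCDE")
reference = [frozenset("AB"), frozenset("DE")]  # reference tree ((A,B),C,(D,E))
test = {frozenset("AB"): 2.0, frozenset("AC"): 1.0, frozenset("DE"): 1.0}
print(c_score(test, reference, taxa))
```

The split {A, C} conflicts with the reference split {A, B}, so only 3.0 of the total weight of 4.0 is compatible and the score is 0.75.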

NCBI Reference Tree


Zoomed tree of 488 Bacterial Genomes

Performance comparison
- Rapid method for inferring phylogenetic networks
- The SXComparePattern entry highlighted above and marked with * is for 488 bacterial sequences

Summary
- The SynaBASE platform is extensible to phylogenetics
- The pattern-based approach provides a very rapid and scalable means of clustering genomes into phylogenetic networks
- Enables multi-supercomputer performance from a single server
- The same approach can be used to cluster and analyse previously improbable data sets, e.g. all primate genomes, all genes, and iterative analysis of evolutionary phylogenetics

END OF WEBCAST
Thank you for your participation!
- The next webcast will be on April 30: "Use of SynaBASE for assembly of reads from 454 Life Sciences sequencing platform"
- A full paper of the work presented will be sent to you on Monday next week
- Please if you have any questions or would like a free trial