© Copyright Tucana Technologies, Inc All rights reserved. T004v03
Massive Scalability for RDF Storage and Analysis
Presented by:
  David Wood, CTO
  Tom Adams, Sales Engineer
  Andrew Newman, Software Engineer
Tucana Technologies, Inc.
Reston, Virginia USA
May 2004

Agenda
The Tucana Knowledge Server and Kowari
Where we fit
Performance metrics & scaling
Real-world deployment examples
Where are we headed?

Tucana and Kowari
The Tucana Knowledge Server is a secure, distributed, scalable, transaction-safe, native RDF database.
  – Stores, manages and analyzes RDF data
  – iTQL/RDQL query language support
  – Single instance scales to 1B triples
  – Federated query capability available
  – JRDF & Jena API support
  – Pluggable data models (full text, RDBMSs, etc.)
  – Commercial (academic licenses available)
  – 100% Java

Tucana and Kowari
Kowari is the Open Source basis of the Tucana Knowledge Server
  – MPL v1.1
  – No security, limited APIs/documentation, no pluggable data models
  – Limited data types (string, URI, date, datetime, number)
  – Limited scaling (>10M triples on 32-bit, >50M on 64-bit)
  – No graph-based analysis algorithm support (graph segment matching)
Colophon: Kowari is a small Australian marsupial and Tucana is a constellation in the Southern sky.

Tucana Knowledge Server (TKS) in Enterprise Architecture

Tucana Knowledge Server (TKS) Data Flow & Federation

Tucana System Interfaces
Data Sources
  – RDF native
  – Structured data sources (e.g. RDBMS) via importation
  – Metadata from unstructured data sources via entity extractors
  – XML or other tagged formats via XSLT
  – Rich Site Summary (RSS) feeds
Access
  – Web services (SOAP, WSDL)
  – COM (ASP, etc.)
  – JavaBean
  – Java APIs
  – JRDF & Jena
  – JSP tag library
  – XSLT Descriptors
  – Query language
  – Command line
  – Web UI
  – RDF/OWL editors/viewers via evolving industry APIs

Tucana Supported Platforms
Runs 64- or 32-bit (requires Java 1.4.2)
GNU/Linux on Intel or Opteron
Sun Solaris on SPARC or Intel (Opteron coming Dec '04)
Windows on Intel
  – NT4
  – 2000
  – XP
Note: AIX, HP-UX, Mac OS X operational
  – Future support on roadmap based upon customer demand

Performance Metrics
Read/Write comparisons to RDBMSs when storing RDF
Load performance
Query execution performance
Go triple crazy!

Read/Write Comparison

Load Performance

Query Execution Performance

Go triple crazy!
32-bit: about 100 million statements (using explicit I/O, which is now the default on 32-bit platforms)
64-bit: about a billion statements (using mapped I/O)
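
The difference between explicit and mapped I/O can be illustrated with a minimal Python sketch. It is not Tucana's storage code: the fixed-width record layout (three 64-bit integers per statement) and all function names are assumptions made for illustration. Explicit I/O seeks and copies bytes into a buffer on every access, while mapped I/O places the whole file in the process's virtual address space, which is why a 32-bit process (at most 4 GiB of address space) runs out of room sooner.

```python
import mmap
import struct

# Hypothetical fixed-width record: (subject, predicate, object) as 64-bit ints.
RECORD = struct.Struct("<qqq")


def write_records(path, triples):
    """Write each triple as one fixed-width record."""
    with open(path, "wb") as f:
        for triple in triples:
            f.write(RECORD.pack(*triple))


def read_explicit(path, index):
    """Explicit I/O: seek to the record, then copy it out with read()."""
    with open(path, "rb") as f:
        f.seek(index * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))


def read_mapped(path, index):
    """Mapped I/O: map the file into memory and slice it like a byte array."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = index * RECORD.size
            return RECORD.unpack(mapped[start:start + RECORD.size])
```

Both readers return the same record; they differ only in how the bytes reach the process.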

Why do we scale?
Designed from the ground up to be scalable
Optimized for reads, with very fast writing
Deals directly with low-level aspects of the file system
Lots of room for further speedups
  – Drop indices, increase triple block size, flatten tree
Bottlenecks
  – Virtual memory limits of OS
  – Thread stacks
  – Sharing same area of VM
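
To make the read/write trade-off behind "drop indices" concrete, here is a minimal in-memory sketch of a triple store that keeps the same statements sorted in several orderings. The class name, the three orderings, and the sorted-list representation are assumptions for illustration; the slides do not describe Tucana's actual index layout. Every bound pattern becomes a prefix scan on one ordering (fast reads), but each insert has to update every ordering (slower writes).

```python
from bisect import bisect_left

# Hypothetical orderings: each maps index key positions to (s, p, o) positions.
ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0), "osp": (2, 0, 1)}


class TripleIndex:
    def __init__(self):
        self.indexes = {name: [] for name in ORDERS}

    def add(self, s, p, o):
        """Insert one statement into every ordering, skipping duplicates."""
        triple = (s, p, o)
        for name, order in ORDERS.items():
            key = tuple(triple[i] for i in order)
            index = self.indexes[name]
            pos = bisect_left(index, key)
            if pos == len(index) or index[pos] != key:
                index.insert(pos, key)

    def match(self, s=None, p=None, o=None):
        """Return statements matching the pattern; None is a wildcard."""
        pattern = (s, p, o)

        def bound_prefix(order):
            n = 0
            for i in order:
                if pattern[i] is None:
                    break
                n += 1
            return n

        # Use the ordering whose leading columns are bound by the pattern.
        name = max(ORDERS, key=lambda k: bound_prefix(ORDERS[k]))
        order = ORDERS[name]
        n = bound_prefix(order)
        prefix = tuple(pattern[i] for i in order[:n])
        index = self.indexes[name]
        results = []
        for key in index[bisect_left(index, prefix):]:
            if key[:n] != prefix:
                break  # left the contiguous run of keys sharing the prefix
            triple = [None, None, None]
            for slot, i in enumerate(order):
                triple[i] = key[slot]
            if all(w is None or w == g for w, g in zip(pattern, triple)):
                results.append(tuple(triple))
        return results
```

For example, `store.match(p="knows")` scans only the run of keys starting with `"knows"` in the `pos` ordering rather than every statement.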

Real-World Deployment Examples
Business Needs Satisfied
Enterprise Software Company
Automobile Manufacturer
Genomics Research
Defense Integrator

Business Needs Satisfied
Get answers to questions
  – Inferencing and discovery
  – Change impact and dependency analysis
  – Variable views of data elements and their relationships
Unify disparate information sources
  – Metadata repositories
  – Unstructured information (MS Office and PDF documents, content management, web pages, RSS sources/news feeds)
  – Other complex data sources
Share and re-use knowledge
  – Within and between enterprises

Enterprise Software Company
Critical Need: Provide automated document routing based on a business-specific ontology.
Solution: Classify documents against the ontology, store classifications and ontology in the Tucana Knowledge Server and build multiple business applications on top.
Result: Standards-based metadata management unifies and delivers change impact analysis across a multi-application, distributed, staged software environment.

Auto Manufacturer
Critical Need: Analyze quality test and measurement data over time for trends
  – Relying on entrenched vendor - a Tucana OEM
  – OEM tried RDBMS – not an option
Solution: Embed Tucana Knowledge Server into OEM's existing product for test and measurement.
Result: Enables high-value trend analyses that were not previously possible for the customer.

Genomics Research
Critical Need: Collaborative project with big pharma
  – Concerned with Oracle flexibility & "schema hell"
  – Need scalability & secure collaborative environment
Solution: Rapidly analyze data in Tucana Knowledge Server using an application they co-develop with an integration partner.
Result: Deliver collaborative research system for use with strategic customer to accelerate joint discovery and competitive advantage for both companies.

Defense Integrator
Critical Need: Intelligence agency overwhelmed with data
  – Automated analysis to improve decision speed & accuracy
  – Prototype software does not scale / agency requires COTS
Solution: Deploy Tucana Knowledge Server with metadata extraction incumbent (SRA NetOwl) and scale to billions of records.
Result: More automated analysis and faster, accurate decisions against large data volumes on a scalable COTS platform.

Analyze Disparate Data - Now
[Diagram labels: Query Engine; RDF; Full Text (Lucene); RSS Feeds; RDF API]

Analyze Disparate Data - Soon
[Diagram labels: XML DB; XPath; SQL; RDBMS; Query Engine]

Analyze Disparate Data - Soon
[Diagram labels: Single Query Representation; Other Data Sources; RDBMS]
Note: Distributed queries already supported.
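
The idea of one query running against several data sources can be sketched as follows. This is an illustrative assumption, not the product's federation mechanism: `make_source` stands in for a remote server or a pluggable data source, and `federated_query` is a hypothetical helper that sends the same triple pattern to every source in parallel and merges the answers without duplicates.

```python
from concurrent.futures import ThreadPoolExecutor


def make_source(triples):
    """Wrap a list of (s, p, o) tuples as a queryable source (hypothetical)."""
    def query(s=None, p=None, o=None):
        pattern = (s, p, o)
        return [
            t for t in triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        ]
    return query


def federated_query(sources, s=None, p=None, o=None):
    """Send one pattern to every source and merge results, first seen first."""
    merged, seen = [], set()
    with ThreadPoolExecutor(max_workers=len(sources) or 1) as pool:
        futures = [pool.submit(source, s, p, o) for source in sources]
        for future in futures:
            for triple in future.result():
                if triple not in seen:
                    seen.add(triple)
                    merged.append(triple)
    return merged
```

The caller writes the pattern once; which source holds which statement stays hidden behind the merge step.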

Thank You
David Wood
Tom Adams
Andrew Newman
Tucana Technologies, Inc.