Native XML Databases for Information Systems
Chris Wallace, UWE, Bristol
XQuery workshop, April 2006

Exploring the design space
  Native XML database (NXD)
    –Storing, querying and updating XML documents without mapping into relations
    –Schema-free
    –Trees are to NXD what tables are to RDBMS
    –Tables are trees
  Information Systems
    –Focus on semi-structured data (mixture of simple data items, text and complex nested structures)
    –Searching, derived data, visualisation
    –Process support
    –Large problem space variously supported by spreadsheets, Word documents, ad-hoc databases, increasingly web-integrated data
  “design as a conversation with the materials in the situation” (Schön)

Solution: eXist Native XML Database
  eXist
    –Open source Java
    –European team of developers led by Wolfgang Meier
    –Under development for several years, mature except for documentation
  Supports
    –XQuery
    –XUpdate
    –XSLT
    –Free-text searching
    –XQuery extensions to allow complete applications to be developed
  Documents (files) are organised in collections (folders) in a file store
    –XML documents stored in an efficient B+ tree structure with indexes
    –Non-XML resources (XQuery, CSS, JPEG, etc.) can be stored as binary
  Deployable in different ways
    –Embedded in a Java application
    –Part of a Cocoon pipeline
    –As web application in Apache/Tomcat
    –With embedded Jetty HTTP server
  Multiple interfaces
    –REST – to Java servlet
    –SOAP
    –XML-RPC

Sample Implementations
  Family photos and history
    –Integration of meta-data on family photos with family history (births, deaths and marriages) and Google Earth
  FOLD
    –modules, programmes, scheme operations, staff, organisational structures, events
  Other demos on the eXist demo site

FOLD – Faculty OnLine Data
  Operations at student level (2000 in CEMS) supported by central systems (student records, finance)
  FOLD Scope – teaching and assessment management and organisational knowledge
    –Modules [450] and their specification
    –Programmes (Courses) [100] and their structures
    –Operations – Runs, Coursework, exams
    –Staff (300+)
    –Organisational structure (100)
    –Events
  Information currently distributed over Word documents, spreadsheets, Access databases, SQL database, flat text files, LDAP
  Aims
    –To support distributed data ownership
    –To provide a web of data within and between systems
    –To support organisational processes
    –To improve data veracity

FOLD Entity Types
  Entity Type | Identifier | No of instances | Document Map | No of documents / year | Document Type
  Module Specification | ModuleCode/Version | 450 | one each | 450 (40) | complex structure
  Module Run | ModuleCode/Year/Runno | 460 | one per field/year | 6 | table
  Module assessments | ModuleCode/Year/Runno/ElementNo | 800 | one per field/year | 6 | table
  Examination | ModuleCode/Year/Exam | 420 | one per year | 1 |
  Student numbers | Date/ModuleCode | 450 * 4 | one per date | 5 | table
  Award types | PrimaryAward | 8 | one only | 1 | simple structure
  Programmes | ProgrammeCode/Year | 100 | one per year | 1 | table
  Programme Structure | ProgrammeCode/Pathway/Version | 110 | one each | 110 (20) | complex structure
  Organisational structure | GroupName | 100 | several per major group | 60 | simple structure
  Events | EventGroup/EventID | 300 | all events in a group | 50 | simple structure
  Staff | Name | 400 | per responsibility | 5 | table with reps
  Training | Name/Course | 200 | one only | 1 | table
  Training Courses | Course | 40 | one only | 1 | table
  ucasKeywords | UCASCode/Keyword | 4000 | one only | 1 | table
  UWE calendar | Date | 365 | one only | 1 | table
  SuggestedHours | Level | 5 | one only | 1 | simple structure
  Entity Type metadata | DatasetName | 20 | one only | 1 | table
  System Configuration | Faculty | 1 | one only | 1 | table

FOLD current stats
  Code
    –XQuery
    –XSLT
    –XSD (one schema)
    –CSS
    –PHP - 10 (vcal)
  Pages
    –about 25 user
    –Only 1 admin as yet
  Information System development
    –CW (4 months)
    –Placement Student (8 months)
    –Phase allocation:
      Project (20%)
      Code (20%)
      Data – gathering, conversion, cleaning (60%)

The FOLD

Areas for attention
  Conceptual Modelling
    –Identifiers
    –Relationships and links
    –Versioning
  Logical Modelling (in XML)
    –Element/attribute
    –Views
    –Validation
  Physical layer (in NXD)
    –Structuring documents and collections
    –Mapping to editors
    –Responsibilities
  Programming
    –Functional allocation between tiers
    –Views and constructed elements
    –Integrity
    –XQuery programming
  User interface
    –Editing
    –Long transactions
  Development Process
    –CASE tool requirements
  Scope of application of NXD

Conceptual Modelling
  Conventional normalised data model
    –EAR ++
      Entity (not XML entities like &amp;)
      Attribute (multi-valued)
      Relationships
        –Association
        –Composition
    –Object Orientation?
      methods are mainly getters (of derived values)
      Inheritance only useful in the schema domain
      Instance inheritance more useful in IS
    –Expressivity Problems
      Identifiers
      Order of parts
      Verbosity ?
  Conceptual Scope
    –Edit trails, versioning, activity tracking
  Generality problem
    –Roles as Attributes
      Stewart Green
    –Roles as Entities
      Module Leader Stewart Green

Identifiers
  Principle adopted – use naturally occurring identifiers wherever possible
    –Persons : “Chris Wallace”
    –Rooms : “3P14”
  Yes
    –Reduces gap between Real World (RW) domain and system
    –Names in minutes of meetings, on spreadsheets are readable
  No
    –Duplicates
      Duplicates not tolerable in the RW either, resolved through RW negotiation within a RW namespace e.g. the Faculty
      Mergers generate duplicates
    –Aliases
    –Not all entities have unique domain identifiers
      Gives rise to confusion in the problem domain and should be resolved there
  Po
    –All names need namespace – “Chris Wallace” at CEMS at UWE
    –Need to replace multiple naming conventions with a single naming scheme (e.g. initials)
    –URNs and semantic web

Conceptual to Logical
  Attributes v elements
  Relationships
  Integrity
  Views

Attributes v elements
  E.g.
    – …
    – UFIEKG
  What criteria to use?
    –Attributes as ‘meta’ is vague
    –FOLD uses only elements

Relationships
  Implementing Relationships
    –One – Many
      RDBMS – primary key on the One side becomes foreign key on the Many side
      NXD – choose which side on the basis of complexity and responsibility
        –Sequence (modules in a stage)
        –Complex (pre-requisite expression)
    –Many-Many
      RDBMS – intersection table
      NXD – as for one-many or either side as appropriate – e.g. Groups and subgroups

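A minimal sketch of the "sequence" case, assuming the references are held on the One side (a programme stage listing its modules in order). The element names (Programme, Stage, ModuleRef, Module, Code, Title) and the module codes are hypothetical sample data, not FOLD's schema. Document order carries the sequence, and the join is resolved at query time.

```xquery
xquery version "1.0";

(: Hypothetical sample data; element names are illustrative only. :)
let $modules :=
  <Modules>
    <Module><Code>M101</Code><Title>Databases</Title></Module>
    <Module><Code>M102</Code><Title>Web Programming</Title></Module>
    <Module><Code>M103</Code><Title>Project</Title></Module>
  </Modules>
let $programme :=
  <Programme>
    <Code>P1</Code>
    <Stage>
      <ModuleRef>M102</ModuleRef>
      <ModuleRef>M101</ModuleRef>
    </Stage>
  </Programme>
(: Resolve each reference in stage order, much like following a foreign key. :)
for $ref in $programme/Stage/ModuleRef
let $module := $modules/Module[Code = $ref]
return <StageModule code="{$ref}">{string($module/Title)}</StageModule>
```

The result lists Web Programming before Databases, because the order of the ModuleRef elements is itself data. In a relational design that order would need an extra sequence column.
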
Integrity
  Structural integrity
    –Schema validation too weak and too restrictive
    –NXD stores any well-formed XML
  Referential Integrity
    –RDBMS – ‘eager’
      data not allowed in unless valid, updates maintain integrity
      integrity failures transient, repair outside database
    –NXD – ‘lazy’
      store the data and provide on-demand or on-trigger validation
      Integrity failures can be persisted (XLinkit) and repair is inside database
  Identifier Uniqueness
    –XML ids only checked within a document
    –NXD stores all XML nodes with internal identifiers
  For Information Systems, veracity of the model is what’s important

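One way the ‘lazy’ approach could look in XQuery: the data is stored as-is and a query run on demand reports the failures as XML, which could itself be stored. This is a sketch under assumptions. The Staff and ModuleRuns structures and the report element names are hypothetical, and inline elements stand in for stored documents.

```xquery
xquery version "1.0";

(: Hypothetical sample data standing in for two stored documents. :)
let $staff :=
  <Staff>
    <Person><Name>Chris Wallace</Name></Person>
    <Person><Name>Stewart Green</Name></Person>
    <Person><Name>Stewart Green</Name></Person>
  </Staff>
let $runs :=
  <ModuleRuns>
    <ModuleRun><ModuleCode>M101</ModuleCode><ModuleLeader>Chris Wallace</ModuleLeader></ModuleRun>
    <ModuleRun><ModuleCode>M102</ModuleCode><ModuleLeader>A N Other</ModuleLeader></ModuleRun>
  </ModuleRuns>
return
  <IntegrityReport>
    {
      (: Referential check: every ModuleLeader should name a known person. :)
      for $run in $runs/ModuleRun
      where not($run/ModuleLeader = $staff/Person/Name)
      return
        <DanglingReference module="{$run/ModuleCode}">{string($run/ModuleLeader)}</DanglingReference>
    }
    {
      (: Uniqueness check across the whole set, not just within one document. :)
      for $name in distinct-values($staff/Person/Name)
      where count($staff/Person[Name = $name]) > 1
      return <DuplicateIdentifier>{string($name)}</DuplicateIdentifier>
    }
  </IntegrityReport>
```

With this sample data the report holds one DanglingReference (module M102) and one DuplicateIdentifier (Stewart Green). Neither problem prevented the data from being stored.
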
Logical to Physical layers
  What criteria to use in allocation of logical units to the physical layer:
    –Documents – a physical aggregation of entity instances
    –Collections – a physical aggregation of documents
  Examples
    –Module Specification [moduleCode]
      Module Spec is an Entity
      Each Module Spec is a Document
    –Module Run [moduleCode/year/runNo]
      Module Run is an Entity
      Set of Module Runs for a Field is a Document
  Issues
    –Schemas needed per entity, not per document
    –Principle: No concepts modelled in the physical layer
    –Use Physical layer for responsibility, access rights ?

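The principle "no concepts modelled in the physical layer" suggests that lookups should depend only on the entity's identifier, not on which document holds it. A sketch of such an accessor follows. The function local:moduleRun, its parameters and the sample element names are assumptions for illustration, not FOLD's actual code. In a stored database the $docs argument would come from something like fn:collection() rather than an inline element.

```xquery
xquery version "1.0";

(: Find a Module Run by its identifier, wherever it sits in the supplied documents. :)
declare function local:moduleRun($docs as node()*, $moduleCode as xs:string,
                                 $year as xs:string, $runNo as xs:string)
    as element(ModuleRun)* {
  $docs/descendant-or-self::ModuleRun[ModuleCode = $moduleCode][Year = $year][RunNo = $runNo]
};

(: Hypothetical document aggregating the runs for one field and year. :)
let $runsForField :=
  <ModuleRuns>
    <ModuleRun>
      <ModuleCode>M101</ModuleCode><Year>2005</Year><RunNo>1</RunNo>
      <ModuleLeader>Chris Wallace</ModuleLeader>
    </ModuleRun>
    <ModuleRun>
      <ModuleCode>M102</ModuleCode><Year>2005</Year><RunNo>1</RunNo>
      <ModuleLeader>Stewart Green</ModuleLeader>
    </ModuleRun>
  </ModuleRuns>
return local:moduleRun($runsForField, "M101", "2005", "1")/ModuleLeader
```

Because the function searches by identifier, moving to one document per run, or one per year, would not change any calling code.
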
Programming issues
  Tier design
  Views and constructed elements
  XQuery programming

Tier design
  Allocation of functionality to tiers
    –Initially nearly all XQuery generating HTML
    –As work matured, code moved into function libraries and XSLT
    –XQuery for request input, sessions, selection of nodes, computation of views for
      –XSLT to generate interface for
        –CSS to style

Views
  Views arise from the need for de-normalisation for presentation
    –Coursework Element
      As a simple element
        –Key : moduleCode/Year/runNo/elementNo
        –Data: due date
      As an extended de-normalised element
        –SuggestedHours (computed from Hours table)
        –Late date (computed from UWE calendar)
        –Weightings (extracted from relevant specification)
        –Module Leader (extracted from Module Run)
  Views as intermediate structures
    –From low level functions
    –For output to XSL
    –Constructed elements in XQuery use copy (losing reference, so can't update through a constructed element)
  View caching for efficiency
    –Triggers can invoke cache renewal

declare function fold:courseworkElement($moduleCode, $year, $runNo, $elementNo) {
  let $mod := fold:moduleSpecification($moduleCode, $year),
      $run := fold:moduleRun($moduleCode, $year, $runNo),
      $elementRun := fold:elementRun($moduleCode, $year, $runNo, 'B', $elementNo),
      $elementSpec := $mod/Assessment/FirstAttempt/Components/ComponentB/Element[position() = $elementNo],
      $dueDate := $elementRun/DueDate,
      $returnDate := fold:workingDays($dueDate, 20),
      $componentWeight := $mod/Assessment/Weighting/ComponentWeightB,
      $weightInComponent := data($elementSpec/Weight),
      (: element's share of the whole module, as a percentage :)
      $weightInModule := round($weightInComponent * $componentWeight div 100),
      $load := fold:load($mod/Level),
      (: suggested hours, scaled by module rating and the element's share of the module :)
      $hrs := round(data($mod/UWERating) div data($load/Credits) * $weightInModule div 100 * data($load/Hours))
  return
    (: element constructors not shown; only the enclosed expressions :)
    {$moduleCode} {$mod/Title} {$runNo}
    {$run/ModuleLeader} {$run/InternalModerator} {$run/ExternalExaminer}
    CW {$elementNo} {$elementSpec/Description}
    {$hrs} {$weightInComponent} {$weightInModule}
    {data($dueDate)} {data($returnDate)}
};

Integrity
  Unlike RDBMS, integrity checks not inherent in Database
    –Structural (schema validation)
    –Referential integrity
    –Business rules
  Policies
    –Restrictive – allow in only data which has satisfied integrity constraints
      Unitary view of data – model must be consistent at all times
    –Permissive – allow in un-validated data with on-demand validation and reconciliation
      Pluralist view – model will probably never be consistent but have to work with this
  On-demand validation
    –Structure via eXist validation
    –Referential (via explicit coding)
    –Extensive Business rules

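A business rule under the permissive policy can be written as an ordinary function that returns violations instead of rejecting data. The rule below (element weights within a component total 100) and the ModuleSpecification layout are hypothetical, chosen only to illustrate the pattern.

```xquery
xquery version "1.0";

(: Hypothetical business rule: element weights within a component sum to 100. :)
declare function local:checkWeights($spec as element()) as element(Violation)* {
  for $component in $spec//Component
  let $total := sum(for $w in $component/Element/Weight return xs:decimal($w))
  where $total ne 100
  return
    <Violation module="{$spec/Code}" component="{$component/Name}" total="{$total}"/>
};

(: Hypothetical sample data: the second specification breaks the rule. :)
let $specs :=
  (
    <ModuleSpecification>
      <Code>M101</Code>
      <Component>
        <Name>B</Name>
        <Element><Weight>40</Weight></Element>
        <Element><Weight>60</Weight></Element>
      </Component>
    </ModuleSpecification>,
    <ModuleSpecification>
      <Code>M102</Code>
      <Component>
        <Name>B</Name>
        <Element><Weight>50</Weight></Element>
        <Element><Weight>30</Weight></Element>
      </Component>
    </ModuleSpecification>
  )
return
  <ValidationReport>{ for $s in $specs return local:checkWeights($s) }</ValidationReport>
```

Only M102 is reported, with a total of 80. Both specifications remain stored and usable while the owner of M102 corrects the data.
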
XQuery programming
  Functional style yields good clean code
  But it's not OO!
  Need to rethink some algorithms
  Strict data typing needs explicit conversion
  Schema not missed
  XPath 2.0 in XQuery, XPath 1.0 in XSLT (Xalan) causes confusion
  Fast and responsive

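Two places where schema-less data typically needs an explicit conversion are sort keys and date arithmetic. The sketch below uses hypothetical sample data. Without a schema every value is xs:untypedAtomic, which order by treats as a string and arithmetic treats as a number.

```xquery
xquery version "1.0";

(: Hypothetical, schema-less sample data: every value is xs:untypedAtomic. :)
let $modules :=
  <Modules>
    <Module><Code>M101</Code><Credits>9</Credits><DueDate>2006-04-03</DueDate></Module>
    <Module><Code>M102</Code><Credits>10</Credits><DueDate>2006-04-28</DueDate></Module>
  </Modules>
return
  <Conversions>
    <SortedAsText>{
      (: No cast: keys are compared as strings, so "10" sorts before "9". :)
      for $m in $modules/Module order by $m/Credits return string($m/Code)
    }</SortedAsText>
    <SortedAsNumber>{
      (: Explicit cast: keys are compared as integers. :)
      for $m in $modules/Module order by xs:integer($m/Credits) return string($m/Code)
    }</SortedAsNumber>
    <OneWeekLater>{
      (: Date arithmetic needs xs:date; the untyped value alone would raise a type error. :)
      xs:date($modules/Module[Code = "M101"]/DueDate) + xs:dayTimeDuration("P7D")
    }</OneWeekLater>
  </Conversions>
```

SortedAsText yields M102 then M101, SortedAsNumber yields M101 then M102, and OneWeekLater is 2006-04-10.
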
User Interface
  Table structured Document editing
    –Allows maintenance using familiar Spreadsheet tools (Excel Add-in)
    –Schema is induced by Excel
    –Accommodations
      Multi-valued fields as concatenated values
        –XPath string-join and tokenize functions
        –Embedded separator problem (a name with ‘,’ as a legitimate character)
        –Defeats conventional indexing but eXist supports full text indexing
      Optional elements increase table width
      Formatting choices not maintained (e.g. column widths, freeze-window location)
    –WebDAV to provide Web Folder access (still not functioning)
  Structured Document editing
    –Allows maintenance with Word without a schema
      With difficulty
        –no schema awareness
    –Use InfoPath to create desktop form based on schema
      Need to redo if schema changes
    –Document editors (Arbortext, XMetaL..) - expensive
  In-situ updates
    –With XQuery-generated forms and update
    –With XForms using Orbeon (open-source XForms server)

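The multi-valued accommodation can be sketched with the standard tokenize and string-join functions. The Tutors cell and the choice of ‘;’ as separator are assumptions made for this example. A separator that cannot occur inside a name is one way round the embedded-separator problem.

```xquery
xquery version "1.0";

(: Hypothetical row as it might come back from a spreadsheet-edited table. :)
let $row := <Module><Code>M101</Code><Tutors>Chris Wallace; Stewart Green</Tutors></Module>
(: Split the concatenated cell into separate, trimmed values. :)
let $tutors := (for $t in tokenize($row/Tutors, ";") return normalize-space($t))
return
  <Module>
    {$row/Code}
    { for $t in $tutors return <Tutor>{$t}</Tutor> }
    <TutorsCell>{ string-join($tutors, "; ") }</TutorsCell>
  </Module>
```

The Tutor elements show the expanded form a query would work with, and TutorsCell shows the values joined back into the single-cell form that the spreadsheet edits.
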
Development Tools
  eXist Java Client provides basic tools
    –Syntax-aware editor
    –Query execution
    –User and database management
  XMLSpy
  Any text editor
  Model-driven development
    –Conceptual Model -> logical Model -> physical Model
    –Rose, QSEE ?

Development Process
  Co-development of Information system structure (code and schemas) and content (documents)
  Support schema migration and refactoring (using XQuery/XSLT)
  Slide from prototype to production
  Pluses and Minuses of user enthusiasm
  Go for ‘low-hanging fruit’
  Pay attention to the learning process
    –XQuery, XSLT are non-trivial languages because deeply unlike Java/PHP
  Project management via steering group, discussion boards but needs forceful lead developer
  Reflection forced by presentations and workshops
  Is Agile IS development different to Agile Software development?

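Schema migration in XQuery usually takes the form of a recursive copy that changes only what needs changing. The sketch below renames one element. The function local:rename and the Leader to ModuleLeader rename are hypothetical, and the code assumes the documents use no namespaces.

```xquery
xquery version "1.0";

(: Copy a tree, renaming every element called $old to $new.
   Assumes elements are in no namespace. :)
declare function local:rename($node as node(), $old as xs:string, $new as xs:string)
    as node() {
  typeswitch ($node)
    case element() return
      element { if (local-name($node) = $old) then $new else local-name($node) } {
        $node/@*,
        for $child in $node/node() return local:rename($child, $old, $new)
      }
    default return $node
};

(: Hypothetical document in the old structure. :)
let $before :=
  <ModuleRun><ModuleCode>M101</ModuleCode><Leader>Chris Wallace</Leader></ModuleRun>
return local:rename($before, "Leader", "ModuleLeader")
```

The result is a new tree with ModuleLeader in place of Leader. Writing it back to the database would need a store or update step, which is outside standard XQuery 1.0 and not shown here.
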
Characteristics of good fit ?
  FOLD
    –Low update rate / medium access rate
    –High document complexity
    –Document-centric ownership
    –Navigational interface
    –Integration with central systems – (via XML interfaces?)