19 January 2007 Data Quality Meeting Alex Poulovassilis.



Some current & recent research projects
AutoMed (EPSRC, BBSRC, MoD)
– has developed tools for semi-automatic transformation and integration of heterogeneous information sources
– provides a single framework for data cleansing/transformation/integration
– can handle both structured and semi-structured (RDF/S, XML) data; virtual, materialised and hybrid integration scenarios; bottom-up, top-down and P2P data integration
ISPIDER (BBSRC)
– is developing an integrated platform of proteomic data sources
– in collaboration with groups at EBI, Manchester, UCL
– is using AutoMed, in conjunction with OGSA-DAI, DQP, Taverna
– to support biological data integration and web service interoperability

Some current & recent research projects
SeLeNe (EU)
– technologies for syndication and personalisation of learning resources:
  – semantic reconciliation and integration of heterogeneous educational metadata
  – structured and unstructured querying of learning object descriptions, including through virtual views (RQL/RVL)
  – automatic propagation and notification of changes in the descriptions of learning objects
– our XML and RDF ECA rule processing languages and systems were developed in this context
L4All and MyPlan (JISC)
– new techniques to support personalised planning of lifelong learning
– developing a system that allows users to record and share learning pathways through courses and modules in the London area
– in collaboration with IoE, Community College Hackney, UCAS, LearnDirect, Linking London Lifelong Learning Network

The AutoMed Project
– Partners: Birkbeck and Imperial Colleges
– Data integration based on schema equivalence
– Low-level metamodel, the Hypergraph Data Model (HDM), in terms of which higher-level modelling languages are defined; it is therefore extensible with new modelling languages
– Automatically provides a set of primitive equivalence-preserving schema transformations for higher-level modelling languages:
  – addT(c,q,C)
  – deleteT(c,q,C)
  – renameT(c,n,n',C)
– There are also two more primitive transformations for capturing imprecise knowledge:
  – extendT(c,Range q1 q2,C)
  – contractT(c,Range q1 q2,C)

AutoMed Features
– Schema transformations are automatically reversible:
  – addT/deleteT(c,q,C) by deleteT/addT(c,q,C)
  – extendT(c,Range q1 q2,C) by contractT(c,Range q1 q2,C)
  – renameT(c,n,n',C) by renameT(c,n',n,C)
– Hence bi-directional transformation pathways (more generally, networks) are defined between schemas, i.e. both-as-view (BAV) transformation/integration
– The queries within transformations allow automatic data translation, query translation and data lineage tracing
– Schemas may or may not have a data source associated with them; thus virtual, materialised or hybrid integration can be supported
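
The reversibility rule on this slide can be illustrated with a minimal Python sketch. It is not the AutoMed API: the `Step` class, its field names and the string form of the queries are all assumptions made for illustration, and the optional constraint argument C is left out. Each step is replaced by its opposite (add by delete, extend by contract, a rename by the opposite rename) and the order of the steps is reversed.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Step:
    """One primitive transformation (hypothetical representation)."""
    kind: str                        # "add", "delete", "extend", "contract" or "rename"
    construct: str                   # the schema construct c the step operates on
    query: Optional[str] = None      # q, or the Range expression for extend/contract
    new_name: Optional[str] = None   # target name, used only by rename

# Each non-rename primitive is reversed by its counterpart with the same arguments.
_INVERSE = {"add": "delete", "delete": "add",
            "extend": "contract", "contract": "extend"}

def reverse_step(step: Step) -> Step:
    if step.kind == "rename":
        # Renaming c to a new name is undone by renaming the new name back to c.
        return Step("rename", step.new_name, new_name=step.construct)
    return Step(_INVERSE[step.kind], step.construct, step.query)

def reverse_pathway(pathway: List[Step]) -> List[Step]:
    """Reverse a pathway: invert every step and apply them in opposite order."""
    return [reverse_step(s) for s in reversed(pathway)]
```

Reversing a pathway twice returns the original pathway, which is the property that makes the pathways bi-directional.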

Schema Transformation/Integration Networks
[Figure: local schemas LS1, LS2, …, LSi, …, LSn; union-compatible schemas US1, US2, …, USi, …, USn, linked by id transformations; global schema GS]

AutoMed Architecture
[Figure: architecture diagram with the components Global Query Processor, Global Query Optimiser, Schema Evolution Tool, Schema Transformation and Integration Tools, Model Definition Tool, Schema and Transformation Repository, Model Definitions Repository, Wrapper]

Data warehousing scenario

ISPIDER Project
Partners: Birkbeck, EBI, Manchester, UCL
Aims:
– Vast, heterogeneous biological data
– Need for interoperability
– Need for efficient processing
– Development of a Proteomics Grid Infrastructure: use existing proteomics resources and develop new ones, and develop new proteomics clients for querying, visualisation, workflow etc.

Project Aims

myGrid / DQP / AutoMed
– myGrid: collection of services/components allowing high-level integration of data/applications for in-silico biology experiments
– DQP: distributed query processing over resources enabled with OGSA-DAI (Open Grid Services Architecture Data Access and Integration)
– AutoMed + DQP: interoperation for integration and query processing over heterogeneous data resources
– AutoMed + myGrid: interoperation for processing workflows incorporating heterogeneous services and resources

Recent/current AutoMed research
– Using AutoMed for virtual data integration:
  – BAV query processing: integrates GAV and LAV techniques
  – supporting source or target schema evolution
– Using AutoMed for materialised data integration:
  – incremental view maintenance
  – data lineage tracing
– Lucas Zamboulis has been working on techniques for automatically transforming and integrating XML data
– Has also investigated using correspondences to ontologies (RDFS schemas) to enhance these techniques

Other recent/ongoing AutoMed research
– Dean Williams has been working on extracting structure from unstructured text sources
– The aim here is to integrate information extracted from unstructured text with structured information available from other sources, using IE techniques in conjunction with AutoMed
– Dean has used existing IE technology (the GATE tool from Sheffield) for the text annotation and IE part of this work
– P2P query and update processing over AutoMed pathways
– Extension with ECA rules and a P2P ECA rule execution engine (Sandeep Mittal) will allow automatic propagation of updates, e.g. for view and constraint maintenance
– Planning to undertake further investigation of constraints and conditional data transformation/integration

Some possible synergies with the proposed data quality project
– AutoMed & BAV provide a single framework to support data cleansing, transformation and integration
– Applicable in a broad range of integration scenarios (top-down, bottom-up, P2P; virtual, materialised, hybrid)
– Schema transformations can, optionally, be accompanied by a constraint, giving the possibility of investigating conditional data transformation and integration
– Schema transformations can be used to propagate data forwards (view maintenance) and backwards (lineage tracing); it would be interesting to see what other information could be propagated, e.g. accuracy and timeliness of data
– Flexible global query processing could be used to support imprecise/incomplete data integration

Extra slides

Schema Transformation/Integration Networks (contd)
On the previous slide:
– GS is a global schema
– LS1, …, LSn are local schemas
– US1, …, USn are union-compatible schemas
– the transformation pathway between each pair LSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
– the transformation pathway between USi and GS is similar
– the transformation pathway between each pair of union-compatible schemas consists of id transformation steps

Comparison with GAV & LAV Data Integration
– Global-As-View (GAV) approach: specify GS constructs by view definitions over LSi constructs
– Local-As-View (LAV) approach: specify LS constructs by view definitions over GS constructs

GAV Example
student(id,name,left,degree) = [ x,y,z,w | x,y,z,w,_ ug x,_,_,_,_ phd x,y,z,w,_ phd w = phd]
monitors(sno,id) = [ x,y | x,_,_,_,y ug x,_,_,_,_ phd x,y supervises]
staff(sno,sname,dept) = [ x,y,z | x,y,z,w,_ tutor x,_,_ supervisor x,y,z supervisor]

LAV Example
tutor(sno,sname) = [ x,y | x,y,_ staff x,z monitors z,_,_,w student w phd]
ug(id,name,left,degree,sno) = [ x,y,z,w,v | x,y,z,w student v,x monitors w phd]
phd, supervises, supervisor are defined similarly

Evolution problems of GAV and LAV
– GAV does not readily support evolution of local schemas, e.g. adding an age attribute to phd invalidates some of the global view definitions
– In LAV, changes to a local schema impact only the derivation rules defined for that schema, e.g. adding an age attribute to phd affects only the rule defining phd
– But LAV has problems if one wants to evolve the global schema, since all the rules defining local schema constructs in terms of the global schema would need to be reviewed
– These problems are exacerbated in P2P data integration scenarios, where there is no distinction between local and global schemas

AutoMed approach, Growing Phase
assuming initially a schema U = S1 + S2
addRel( >, [x | x > x >])
addAtt( >, [ | ( > x >) >])
addAtt( >, [ | ( > x >) >])
…

AutoMed approach, Shrinking Phase
contrAtt( >, Range [ | > >] Any)
contrRel( >, Range [x | x > >] Any)
Similarly contractions for the ug attributes and relation

AutoMed approach, Shrinking Phase (contd)
contrAtt( >, Range Void Any)
delAtt( >, [ | > x >])
delAtt( >, [ | > x >])
delRel( >, [x | x > >])
Similarly deletions for supervises and supervisor

Schema Evolution in BAV
– Unlike GAV/LAV/GLAV, the BAV framework readily supports the evolution of both local and global schemas
– The evolution of the global or local schema is specified by a schema transformation pathway from the old to the new schema
– For example, the figure on the right shows transformation pathways T from an old to a new global or local schema
[Figure: pathway T from Global Schema S to New Global Schema S'; pathway T from Local Schema S to New Local Schema S']

Global Schema Evolution
Each transformation step t in T: S → S' is considered in turn:
– if t is an add, delete or rename, then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway); the extended pathway can be used to regenerate the necessary GAV or LAV views
– if t is a contract, then there will be information present in S that is no longer available in S'; again there is nothing further to do
– if t is an extend, then domain knowledge is required to determine whether the new construct in S' can in fact be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
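
The case analysis on this slide can be sketched as a single pass over the evolution pathway. This is an illustrative Python sketch under stated assumptions, not AutoMed code: steps are simplified to `(kind, construct, query)` tuples, and the `derive` callback is a hypothetical stand-in for the domain knowledge that says whether a newly extended construct can be derived from existing ones.

```python
from typing import Callable, List, Optional, Tuple

# Simplified step: (kind, construct, query); an assumption for illustration.
Step = Tuple[str, str, str]

def evolve_global_schema(
    evolution: List[Step],
    derive: Callable[[str], Optional[str]],
) -> List[Step]:
    """Process each step of the evolution pathway in turn.

    add/delete/rename/contract steps are kept unchanged. For an extend
    step, `derive` is asked for a defining query; if it supplies one, the
    extend is replaced by an add, otherwise the extend is kept.
    """
    result: List[Step] = []
    for kind, construct, query in evolution:
        if kind == "extend":
            defining_query = derive(construct)
            if defining_query is not None:
                result.append(("add", construct, defining_query))
                continue
        result.append((kind, construct, query))
    return result
```

The returned pathway can then be appended to the existing pathways into the old schema, which is what the slide refers to as the extended transformation pathway.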

Local Schema Evolution
– This is a bit more complicated, as it may require changes to be propagated also to the global schema(s)
– Again, each transformation step t in T: S → S' is considered in turn
– In the case that t is an add, delete, rename or contract step, the evolution can be carried out automatically
– If it is an extend, then domain knowledge is required
– See our CAiSE02, ICDE03 and ER04 papers for more details
– The last of these discusses a materialised data integration scenario where the old/new global/local schemas have an extent

Global Query Processing
– We handle query language heterogeneity by translation into/from a functional intermediate query language, IQL
– A query Q expressed in a high-level query language on a schema S is first translated into IQL (this functionality is not yet supported in the AutoMed toolkit)
– View definitions are derived from the transformation pathways between S and the requested data source schemas
– These view definitions are substituted into Q, reformulating it into an IQL query over source schema constructs

Global Query Processing (contd)
– Query optimisation (currently algebraic) and query evaluation then occur
– During query evaluation, the evaluator submits to wrappers sub-queries that they are able to translate into the local query language
– Currently, AutoMed supports wrappers for SQL, OQL, XPath, XQuery and flat-file data sources
– The wrappers translate sub-query results back into the IQL type system
– Further query post-processing then occurs in the IQL evaluator
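
The reformulation step, substituting view definitions into a query and then evaluating it over the sources, can be sketched in a few lines of Python. This is a deliberately tiny stand-in for IQL, not the AutoMed query processor: the query representation (scheme names as strings, plus `("union", q1, q2)` and `("filter", predicate, q)` tuples), the `views` dictionary and the in-memory `sources` are all assumptions, and the view definitions are assumed to be acyclic.

```python
from typing import Callable, Dict, List, Tuple, Union

# A query is a scheme name, ("union", q1, q2) or ("filter", predicate, q).
Query = Union[str, tuple]

def unfold(query: Query, views: Dict[str, Query]) -> Query:
    """Substitute view definitions until only source schemes remain."""
    if isinstance(query, str):
        return unfold(views[query], views) if query in views else query
    if query[0] == "union":
        return ("union", unfold(query[1], views), unfold(query[2], views))
    if query[0] == "filter":
        return ("filter", query[1], unfold(query[2], views))
    raise ValueError(f"unknown operator: {query[0]!r}")

def evaluate(query: Query, sources: Dict[str, List[Tuple]]) -> List[Tuple]:
    """Evaluate an unfolded query; the dict lookup plays the wrapper's role."""
    if isinstance(query, str):
        return list(sources[query])
    if query[0] == "union":
        return evaluate(query[1], sources) + evaluate(query[2], sources)
    if query[0] == "filter":
        predicate: Callable[[Tuple], bool] = query[1]
        return [row for row in evaluate(query[2], sources) if predicate(row)]
    raise ValueError(f"unknown operator: {query[0]!r}")
```

For example, with `views = {"student": ("union", "ug", "phd")}`, a filter over `student` unfolds into a filter over the union of the `ug` and `phd` source schemes before any source is touched.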

Other AutoMed research at Imperial
– Automatic generation of equivalences between different data models
– A graphical schema & transformations editor
– Data mining techniques for extracting schema equivalences
– Optimising schema transformation pathways

DQP – AutoMed Interoperability
– Data sources are wrapped with OGSA-DAI
– AutoMed OGSA-DAI wrappers extract the data sources' metadata
– Semantic integration of data sources using AutoMed transformation pathways into an integrated AutoMed schema
– IQL queries submitted to this integrated schema are:
  – reformulated to IQL queries on the data sources, using the AutoMed transformation pathways
  – submitted to DQP for evaluation

Data source schema extraction
– The AutoMed wrapper requests the schema of the data source using an OGSA-DAI service
– The service replies with the source schema encoded in XML
– The AutoMed wrapper creates the corresponding schema in the AutoMed repository
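
As a rough illustration of the second and third steps, the Python sketch below parses an XML description of a relational source into a table-to-columns mapping that a wrapper could then register. The XML layout (`table` and `column` elements with a `name` attribute) is invented for this example and is not the format actually returned by OGSA-DAI; the function name is likewise hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

def parse_source_schema(xml_text: str) -> Dict[str, List[str]]:
    """Turn an XML schema description into {table name: [column names]}.

    Assumes a hypothetical layout:
      <schema><table name="..."><column name="..."/>...</table>...</schema>
    """
    root = ET.fromstring(xml_text)
    schema: Dict[str, List[str]] = {}
    for table in root.findall("table"):
        columns = [col.get("name") for col in table.findall("column")]
        schema[table.get("name")] = columns
    return schema
```

In a real wrapper, each table and column found this way would be turned into the corresponding construct of the relational modelling language held in the repository.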

Using AutoMed in the BioMap Project
– Relational/XML data sources containing protein sequence, structure, function and pathway data; gene expression data; other experimental data
– Wrapping of data sources
– Translation of source and global schemas into AutoMed's XML schema
– Domain expert provides matchings between constructs in source and global schemas
– Automatic schema restructuring, with automatic generation of schema transformation pathways
– See DILS05 paper for more details
[Figure: data sources (RDB, XML File, RDB) with their wrappers (RDB Wrapper, XML Wrapper, RDB Wrapper) and AutoMed schemas (AutoMed Relational Schema, AutoMed XMLDSS Schema, AutoMed Relational Schema), each connected by a transformation pathway to the AutoMed Integrated Schema; Integrated Database Wrapper; Integrated Database]

The London Knowledge Lab
– Purpose-designed building
– Science Research Infrastructure Fund: £6m
– Research staff and students: 50
– Location: Bloomsbury
– Open: June 2004
Institute of Education, University of London
– Social scientists: experts in education, sociology, culture and media, semiotics, philosophy, knowledge management ...
Birkbeck College, University of London
– Computer scientists: experts in information systems, information management, web technologies, personalisation, ubiquitous technologies …

LKL Research Themes
Research at the London Knowledge Lab consists mainly of projects funded externally by the EU, EPSRC, ESRC, AHRB, BBSRC, JISC and the Wellcome Trust; there are currently about 25 projects.
Four broad themes guide our work and inform our research strategy:
– new forms of knowledge
– turning information into knowledge
– the changing cultures of new media
– creating empowering technologies for formal and informal learning

Turning Information Into Knowledge
– The need to cope with ubiquitous, complex, incomplete and inconsistent information is pervasive in our societies
– How can people benefit from this information in their learning, working and social lives?
– What new techniques are necessary for managing, accessing, integrating and personalising such information?
– How to design and build tools that help people to understand such information and generate new knowledge from it?