WP3 – Information Platform Mário J. Silva Universidade de Lisboa, Faculdade de Ciências, Departamento de Informática mjs@di.fc.ul.pt
Epiwork
24 Mar 2010 - Epiwork Review Brussels

The EPIWORK project proposes a multidisciplinary research effort aimed at developing the framework of tools and knowledge needed for the design of epidemic forecast infrastructures to be used by epidemiologists and public health scientists. The project is a truly interdisciplinary effort, anchored to the research questions and needs of epidemiology research by the participation in the consortium of leading epidemiologists, public health specialists and mathematical biologists. Epidemic researchers, along with leading scientists in informatics, computer science, complex systems and physics, will tackle most of the much-needed developments in modelling, computational and ICT tools for epidemic forecasting, such as: i) the foundation and development of the mathematical and computational methods needed to achieve prediction and predictability of disease spreading in complex techno-social systems; ii) the development of large-scale, data-driven computational models endowed with a high level of realism and aimed at epidemic scenario forecast; iii) the design and implementation of original data-collection schemes motivated by identified modelling needs, such as the collection of real-time disease incidence through innovative web and ICT applications; iv) the set-up of a computational platform for epidemic research and data sharing that will generate important synergies between research communities and countries.
Data in Epiwork
- [National Bureau of Statistics] demographics, transportation data, ...
- [Public Health authorities] surveillance data (maybe?)
- [Internet Social Networks] behavioural data
To be shared by epidemic modellers in a digital library, dubbed the Epidemic Marketplace.
What will be necessary to predict epidemics precisely?
- Data of many different types, from many unrelated sources.
- Improving accuracy makes the need for data a never-ending story.
- We all want to see realistic and timely plots of epidemic propagation.
- The data is available, but hard to find, collect and maintain!
http://www.gripenet.pt/
Other Internet Monitoring Sources
Linked Data http://linkeddata.org/
Data.gov
Data.gov.uk http://data.gov.uk/data/list?keyword=epidemiology
Epidemic Marketplace (EM)
- Catalogue of data sources containing the metadata describing existing databases.
- Forum to publish information about data, seek modellers to collaborate with, and seek sources of data that could be of interest to epidemiological modelling efforts.
- Mediating software to automatically process queries to epidemiological data, harvest data, assemble datasets...

The overall objective of an epidemic forecast infrastructure is ultimately to support modelling approaches that meet the complexity and realism requirements for explaining and predicting the spatio-temporal dynamics of infectious disease propagation at the global scale (with a focus on Europe). Here we develop an information platform to mediate access to distributed collections of public health data, offering an easy and safe way to share data for those data providers who want to collaborate with epidemiological modellers. Researchers will use this platform in multiple ways: as a catalogue of data sources containing the metadata describing existing databases; as a forum to publish information about their own data, seeking modellers to collaborate with, and/or to seek sources of data that could be of interest to their epidemiological modelling efforts; and as the host of mediating software that can automatically process queries for epidemiological data available from the information sources connected to the platform.
Outline
- The need for an Epidemic Marketplace
- Metadata and Ontologies for Epidemic Modelling (Deliverable D3.1)
- Epidemic Marketplace Architecture & Implementation (Deliverable D3.2)
- Where we stand and forecasts
Steps for Creating the EM
- Elaborate a meta-model for describing datasets used by epidemic modellers.
- Provide query services over the metadata to discover resources.
- Select ontologies for characterizing data and develop an ontology of epidemic concepts.
- Ingest, harmonize and cross-link data.
- Provide query services to select epidemic data using the EM metadata and ontologies.
Common Reference Model
- Open domain: a detailed description of the datasets used in the models of all sorts of epidemics would require describing virtually every kind of information, given the diversity of factors and the interdisciplinarity of epidemiologic studies.

In the study of a specific disease it is possible to have datasets describing the disease, how it spreads, clinical data about a population and so on. Data may be geo-referenced, and geospatial data may be necessary for modelling the disease transmission. Other data can be important for the study of diseases, such as genetic, socio-economic, demographic, environmental and behavioural data. The need to encompass so many areas of study will reflect on the contents of the datasets and ultimately on their metadata, calling for a data organisation supporting interlinked data (Bodenreider and Stevens 2006; Bizer in press).

Given the high diversity and heterogeneity of the epidemic data involved, a common reference model based on metadata is needed. Metadata terms are being defined based on controlled vocabularies and ontology terms, and ontologies will also be used to characterize the entities in the datasets and the relationships among them. As a result, the information model of the Epidemic Marketplace is directly defined through metadata and ontologies. Together, they will be essential in the development of epidemic modelling digital libraries, as they make documents and other data sources accessible in a more sophisticated, structured and meaningful manner.
For example, using a specific ontology to describe a specific disease makes everyone who refers to that disease use the same term, making information discovery simpler and more complete. It also keeps the metadata text simpler, since the ontology itself contains other data that does not need to be inserted as metadata. For example, through an ontology of places (a geographic ontology), if we have a specific location code, we can obtain other information about that location, such as country, coordinates, altitude, city and so on.

- The data model needs to support interlinked data.
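The location-code example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PLACES` table, its `geo:` codes and the approximate coordinates are hypothetical placeholders, not entries from any real geographic ontology.

```python
# Hypothetical mini geographic ontology. Codes, structure and approximate
# coordinates are illustrative placeholders, not real ontology entries.
PLACES = {
    "geo:pt": {"name": "Portugal", "kind": "country", "part_of": None},
    "geo:pt-lisboa": {"name": "Lisboa", "kind": "city", "part_of": "geo:pt",
                      "lat": 38.72, "lon": -9.14},
}


def ancestors(code):
    """Yield the codes of the places that contain `code`, nearest first."""
    parent = PLACES[code]["part_of"]
    while parent is not None:
        yield parent
        parent = PLACES[parent]["part_of"]


def describe(code):
    """Expand a single location code into the facts the ontology holds."""
    facts = {k: v for k, v in PLACES[code].items() if k != "part_of"}
    for parent in ancestors(code):
        # Record each enclosing place under its kind, e.g. "country".
        facts[PLACES[parent]["kind"]] = PLACES[parent]["name"]
    return facts


print(describe("geo:pt-lisboa"))
```

The metadata record only needs to store `geo:pt-lisboa`; the country and coordinates are derived from the ontology at query time.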
Common Reference Model
- Information to be described as metadata
- Property lists describing the epidemic datasets stored in the marketplace
- Level of detail is a key design point
- Must be/become machine-readable, discoverable, searchable, accessible

To manage the information in the Epidemic Marketplace, mainly catalogues of datasets and the datasets themselves, it is necessary to adopt a common reference model and provide its description as metadata. Metadata is information about data. It provides a context for the data, helping to understand, manage and search it. The level of detail of metadata can change according to the end use of the described data. Metadata enables more correct and accurate data exchange and retrieval. The use of metadata standards makes the datasets' information models easier to understand and use by different users and applications. As automatic tools for the manipulation, editing and exchange of data become more common and data needs to be machine-readable, the implementation of standard metadata becomes more and more important. For example, in the Epidemic Marketplace, the existence of metadata and a catalogue allows searching for specific information without having to download and open a document to see its contents. If a researcher is looking for datasets related to a specific disease or a specific geographic location, it is possible to obtain that information by searching the catalogue of metadata in the repository.
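A catalogue search of this kind can be sketched as follows. This is a minimal illustration, not the Epidemic Marketplace implementation: the `CatalogueEntry` class, the `search` function, the `em:` identifiers and the sample records are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    """One catalogue record: a property list keyed by metadata element name."""
    identifier: str
    properties: dict[str, list[str]] = field(default_factory=dict)


def search(catalogue, **criteria):
    """Return entries whose metadata matches every criterion.

    Matching is a case-insensitive substring test on the element's values,
    so no dataset has to be downloaded or opened.
    """
    hits = []
    for entry in catalogue:
        if all(
            any(wanted.lower() in value.lower()
                for value in entry.properties.get(element, []))
            for element, wanted in criteria.items()
        ):
            hits.append(entry)
    return hits


# Illustrative records only; identifiers and values are made up.
catalogue = [
    CatalogueEntry("em:0001", {"title": ["Tweets mentioning influenza"],
                               "subject": ["influenza"],
                               "coverage": ["Portugal"]}),
    CatalogueEntry("em:0002", {"title": ["US airport network"],
                               "subject": ["transportation"],
                               "coverage": ["United States"]}),
]

matches = search(catalogue, subject="influenza", coverage="Portugal")
print([entry.identifier for entry in matches])
```

A researcher looking for datasets about a disease in a given location would combine the two criteria in one call, as in the last lines.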
Metadata and Ontologies
The information model of the EM is directly defined as metadata and ontologies.
Advantages of using a specific ontology to describe a specific disease:
- everyone who refers to that disease uses the same term, making information discovery simpler and more complete;
- the metadata text stays simpler, since the ontology itself contains other data that does not need to be inserted as metadata.
Metadata Standards
- ISO/IEC 11179: standard for representing metadata for an organization in a Metadata Registry (MDR); it has been implemented by organizations in the Health domain.
- DCES (Dublin Core Metadata Element Set): preferred for describing web resources (that is what the EM is); a vocabulary of properties to be used to describe document-like files on the web.

There are several standards for the collection and management of metadata. ISO/IEC 11179 is the international standard for representing metadata for an organization in a Metadata Registry (MDR). Several health organizations have created implementations of this MDR, such as METeOR (Australian Institute of Health and Welfare 2009). However, the DCES (Dublin Core Metadata Element Set) is the most relevant standard to our epidemic modelling e-science infrastructure, because it was conceived for describing web resources, and that is the way the Epidemic Marketplace will be primarily available (DCMI 2008). The DCES is a vocabulary of fifteen properties to be used to describe document-like files on the web. These fifteen elements are part of a larger set of metadata vocabularies and technical specifications maintained by the DCMI (Dublin Core Metadata Initiative), the DCMI Metadata Terms (DCMI 2008b). DCMI includes formal domains and ranges in the definitions of its properties. This means that each property may be related to one or more classes by a "has domain" relationship, indicating the class of resources that the property should be used to describe, and to one or more classes by a "has range" relationship, indicating the class of resources that should be used as values for that property (Powell et al. 2008). The DCMI recommends the use of controlled languages whenever possible for the description of each element.
However, the development of an ontology that is accepted by the whole community is a complex, lengthy and costly endeavour, so it is important to reuse existing ontologies as much as possible to reduce costs and implementation time. In addition, adopting ontologies that are already in use simplifies the access to interlinked datasets. OBO (Open Biomedical Ontologies) is a repository of openly available ontologies relevant to our problem domain (Smith 2007). We will adopt relevant ontologies from this realm and also controlled languages, such as the ones recommended by the DCMI. One example is the UMLS Metathesaurus (Bodenreider 2004), which is commonly used in metadata descriptions based on the Dublin Core standards in the biomedical domain. The DCMI suggests the use of the TGN - Thesaurus of Geographic Names (Harpring 1997) for location references. However, the use of ontologies such as Geo-Net-PT, which we developed for Portugal (Pellicer et al. 2010), or Yahoo! GeoPlanet (Yahoo 2009) can make the annotation more exact. We are also tracking novel services provided by INSPIRE, an initiative of the European Commission for establishing an infrastructure for spatial information in Europe (European Commission 2007).
Metadata standards
- ISO/IEC 11179 Metadata Registry (MDR)
- Dublin Core (DC): metadata for the Web, 15 properties
- Endorsed as ISO Standard 15836-2003 of February 2003, ANSI/NISO Standard Z39.85-2007 of May 2007 and IETF RFC 5013 of August 2007.
- DCMI namespace: since 2008, DCMI includes formal domains and ranges in the definitions of its properties.

The fifteen element descriptions have been formally endorsed in ISO Standard 15836-2003 of February 2003, ANSI/NISO Standard Z39.85-2007 of May 2007 and IETF RFC 5013 of August 2007. Since January 2008, DCMI includes formal domains and ranges in the definitions of its properties. This means that each property may be related to one or more classes by a "has domain" relationship, indicating the class of resources that the property should be used to describe, and to one or more classes by a "has range" relationship, indicating the class of resources that should be used as values for that property (Powell et al. 2008). In order not to affect the conformance of existing implementations in RDF, domains and ranges have not been specified for the fifteen properties of the dc: namespace (http://purl.org/dc/elements/1.1/). Rather, fifteen new properties with "names" identical to those of the Dublin Core Metadata Element Set Version 1.1 have been created in the dcterms: namespace (http://purl.org/dc/terms/). These fifteen new properties have been defined as sub-properties of the corresponding properties of DCMES Version 1.1. The use of the new and semantically more precise dcterms properties is recommended in order to best support machine-processable metadata.
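The relation between the two namespaces can be illustrated with a short Python sketch. The namespace URIs and the fifteen element names are those of the Dublin Core Metadata Element Set; the helper `to_dcterms` is a hypothetical name introduced here for illustration only.

```python
# Namespace URIs as given in the Dublin Core specifications.
DC = "http://purl.org/dc/elements/1.1/"
DCTERMS = "http://purl.org/dc/terms/"

# The fifteen elements of the Dublin Core Metadata Element Set, Version 1.1.
DC_ELEMENTS = frozenset({
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
})


def to_dcterms(uri):
    """Map a dc: property URI to the identically named dcterms: property.

    Hypothetical helper: it relies only on the fact that each of the
    fifteen dcterms: properties shares its local name with the dc:
    property it refines.
    """
    if not uri.startswith(DC):
        raise ValueError(f"not a dc: property: {uri}")
    name = uri[len(DC):]
    if name not in DC_ELEMENTS:
        raise ValueError(f"unknown Dublin Core element: {name}")
    return DCTERMS + name


print(to_dcterms(DC + "subject"))
```

Because the dcterms: properties are sub-properties of the dc: ones, a record upgraded this way remains valid for consumers that only understand the original fifteen elements.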
Ontology Standards?
- UMLS http://www.nlm.nih.gov/research/umls/ Too complex?
- OBO http://www.obofoundry.org/
- Inspire http://www.inspire-geoportal.eu/
- Geonames, etc.
Epidemiology is an open domain (will never be bounded).
First, we have to see the data.
Strategies for Creating an Epidemic Data Metadata Model
- Start with a catalogue of epidemic datasets...
- Focus on collecting extensive metadata.
- Leverage ontologies and their technologies to establish the common terminology and interlink heterogeneous metadata classifications.
- Connect with the OBO (Open Biomedical Ontologies) initiative.

In a first stage, the Epidemic Marketplace aims at creating a catalogue of epidemic datasets with extensive metadata describing their main characteristics. Ontologies will play an important role in establishing the common terminology to be used in this process and in interlinking heterogeneous metadata classifications. The Epidemic Marketplace will explore a comprehensive set of relevant ontologies that, besides being used to characterise datasets, will also become important datasets for epidemic modellers. Some of these are already being organised in collections. At a later stage, the marketplace will provide a unified and integrated approach for the management of epidemic data sources. Ontologies will have an important role in integrating these heterogeneous data sources by providing semantic relationships among the described objects. Further on, the marketplace will include methods and services for aligning the ontologies. The aligned ontologies and annotated datasets will eventually serve as the basis for a distributed information reference for epidemic modellers, which will further help integration and communication among the community of epidemiologists.

To describe the epidemic datasets, it is first necessary to describe the datasets as web resources. This will be done using the DCMI terms and conventions. It will also be necessary to describe the information contained in the datasets. These descriptions constitute what health professionals and researchers will ultimately be looking for.
The level of detail of the metadata is another aspect that must be carefully designed: a low level of detail may not describe the datasets sufficiently, making the right information harder to find, but an overly detailed metadata scheme can turn the annotation of a specific dataset into a daunting task, hindering the acceptance of the model by the user community. In view of this, we intend to start modelling the datasets with a low level of detail, annotating the 15 standard DC elements as character data. Further down the line, we will support the extension of the DC element annotations with semantically richer descriptions. That will initially be done through the analysis of datasets to be provided by Epiwork partners. The collaboration with these partners will enable the assessment of which level of detail is most adequate for the epidemic modellers community.

To be useful, metadata annotation criteria have to follow a common standard, so that data can be compared and searched using similar queries. To standardize the metadata annotation it is fundamental to use controlled languages, and languages for describing data structures, as much as possible, progressively limiting the use of free text.

To understand the metadata to be added to annotate epidemic datasets, and what properties should be extended in the future for a better data representation, we have analysed a selected sample of datasets:

EM Twitter Datasets: Twitter data harvested by an initial prototype of the Data Collector module of the Epidemic Marketplace (Lopes et al. 2009). Each dataset contains tweets (messages) with disease-specific and geographic-specific keywords. It also contains, for each message, information about the author name (nickname), the source (in this case the Twitter.com service), the keywords searched, the date and a possible score (assigned according to the confidence in the specific message).
US Airports Dataset: Data about the airport network of the United States. This dataset provides information about the US transportation network, containing data about the 500 US airports with most traffic. The file contains an anonymised list of connected pairs of nodes and the weight associated with each edge, expressed as the number of available seats on the given connection on a yearly basis.

In addition, to add more diversity to these initial datasets and start with a larger study base, we surveyed published articles in epidemiology journals and inferred the attributes of the datasets reported in those papers. Most of the studies do not provide information on how to access all the datasets used, or fully describe them for the purpose of cataloguing with the detail we are envisioning. Nevertheless, this kind of survey provided insights on the metadata modelling aspects that have to be accounted for. We characterized datasets used in studies like:

Cohen et al. (2008): Analyses the relation between levels of household malaria risk and topography-related humidity.
East et al. (2008): Analyses the patterns of bird migration in order to identify areas in Australia where the risk of avian influenza transmission from migrating birds is higher.
Starr et al. (2009): Introduces a model for predicting the spread of Clostridium difficile in a hospital context.

Using this approach, we have annotated datasets to which we did not actually have access, but devised what their metadata description as DC elements would be, based on the information provided.
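To make the structure of the US Airports Dataset concrete, the sketch below reads a weighted edge list and sums the yearly seats per airport. The file layout (three whitespace-separated columns: node, node, seats) is an assumption, since the source only says the file lists connected pairs and a weight; the sample rows are invented.

```python
from collections import defaultdict

# Made-up sample in the assumed layout: node_a node_b yearly_seats
SAMPLE = """\
0 1 120000
0 2 45000
1 2 80000
"""


def parse_edges(text):
    """Parse 'node_a node_b weight' lines into (int, int, float) tuples."""
    edges = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        a, b, weight = line.split()
        edges.append((int(a), int(b), float(weight)))
    return edges


def node_strength(edges):
    """Sum the seat capacity of all connections touching each airport."""
    strength = defaultdict(float)
    for a, b, weight in edges:
        strength[a] += weight
        strength[b] += weight
    return dict(strength)


print(node_strength(parse_edges(SAMPLE)))
```

Each edge is counted once for both endpoints, which treats the network as undirected; a directed reading of the file would need separate in and out totals.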
Strategies for Creating a Metadata Model for Epidemic Data (II)
- Ontologies will serve to integrate heterogeneous data sources as they provide semantic relationships among the described objects.
- Further on, the EM will include methods and services for aligning the ontologies.
- We expect that this can spawn a virtuous cycle, stimulating the cataloguing and linking by the epidemic modellers community.
Strategies for Creating a Metadata Model for Epidemic Data (III)
- With DCMI terms and conventions + Linked Data conventions, turn datasets into web resources.
- Describe the data structures in the datasets using ontologies.
- Descriptions will be used by people and information discovery tools.
Strategies for Creating a Metadata Model for Epidemic Data (IV)
Define policies establishing the level of detail of the metadata:
- a low level of detail may not describe the datasets sufficiently, making the right information harder to find;
- an overly detailed metadata scheme can turn the annotation of a specific dataset into a daunting task, hindering the acceptance of the model by the user community.
Strategies for Creating a Metadata Model for Epidemic Data (V)
- Metadata annotation criteria have to follow a common standard, so that data can be compared and searched using similar queries.
- Use controlled languages, and languages for describing data structures, as much as possible, progressively limiting the use of free text.
- Started modelling the datasets with low detail, annotating the 15 standard DC elements as character data.
- Further down the line, we will initiate the annotation of DC elements with semantically richer descriptions.
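One way to limit free text progressively is to check annotations against controlled vocabularies only for the elements that already have one. The sketch below assumes a record is a mapping from element name to a list of strings; the `CONTROLLED` table and its terms are hypothetical, not vocabularies adopted by the project.

```python
# Hypothetical controlled vocabularies, keyed by metadata element.
# Elements without an entry are still annotated as free character data.
CONTROLLED = {
    "subject": {"influenza", "malaria", "clostridium difficile"},
    "language": {"en", "pt"},
}


def free_text_values(record):
    """Return (element, value) pairs that fall outside the vocabulary."""
    problems = []
    for element, values in record.items():
        allowed = CONTROLLED.get(element)
        if allowed is None:
            continue  # no vocabulary yet for this element
        for value in values:
            if value.lower() not in allowed:
                problems.append((element, value))
    return problems


record = {
    "title": ["Household malaria risk survey"],
    "subject": ["Malaria", "humid areas"],
    "language": ["en"],
}
print(free_text_values(record))
```

Adding a vocabulary for one more element tightens the check without invalidating records that were annotated earlier at the lower level of detail.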
Strategies for Creating a Metadata Model for Epidemic Data (VI)
- Surveyed published articles in epidemiology journals and inferred the attributes of the datasets used.
- Analysed a selected sample of datasets:
  - EM Twitter Datasets: harvested with a software prototype of the EM.
  - US Airports Dataset: data about the airport network of the United States.
- We annotated datasets to which we did not actually have access, but devised what their metadata description as DC elements would be.
Sample datasets (simulated)
- Cohen et al. (2008): relationship between levels of household malaria risk and topography-related humidity.
- East et al. (2008): bird migration patterns; risk areas in Australia for avian influenza transmission from migrating birds.
- Starr et al. (2009): model for predicting the spread of Clostridium difficile in a hospital context.
Outline
- The need for an Epidemic Marketplace
- Metadata and Ontologies for Epidemic Modelling (Deliverable D3.1)
- Epidemic Marketplace Architecture & Implementation (Deliverable D3.2)
- Where we stand and forecasts
The EM as a Virtual Repository
The Epidemic Marketplace is composed of a set of geographically distributed, interconnected data management nodes, sharing:
- common data models,
- an authorization infrastructure,
- access interfaces.
At each node, a set of software components implements a set of functional and non-functional requirements that characterize their performance and interfaces.

The Epidemic Marketplace is a distributed virtual repository, a platform supporting transparent, seamless access to distributed, heterogeneous and redundant resources (Kuliberda et al. 2006, Ohno-Machado et al. 1997). It is a virtual repository because data can be stored in systems that are external to the Epidemic Marketplace, and it provides transparent access because several heterogeneities are hidden from its users. Data can be either stored in one or more repositories or retrieved from external data sources using authorization credentials provided by clients. Data can also be replicated among repositories to improve access time, availability and fault tolerance. However, data replication is not mandatory; in several cases data must be stored at a single site due to, for instance, security constraints. It is worth noting, though, that any individual repository that composes the Marketplace will enable virtualized access to these data, once a user provides adequate security credentials.
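The idea of transparent access to local, replicated and external copies can be sketched as a lookup that tries each known location of a dataset in turn. Everything here is a simplifying assumption for illustration: the function names, the token-based credential check, and the fact that local copies skip authorization are not taken from the Epidemic Marketplace design.

```python
class AccessDenied(Exception):
    """Raised by a source that rejects the supplied credentials."""


def make_local_copy(payload):
    """A copy held at a marketplace node (authorization omitted here)."""
    def fetch(credentials):
        return payload
    return fetch


def make_external_source(payload, required_token):
    """A copy kept outside the marketplace, guarded by a token."""
    def fetch(credentials):
        if credentials != required_token:
            raise AccessDenied("credentials rejected by external source")
        return payload
    return fetch


def retrieve(dataset_id, locations, credentials=None):
    """Return the first accessible copy; the caller never sees where it was."""
    for fetch in locations.get(dataset_id, []):
        try:
            return fetch(credentials)
        except AccessDenied:
            continue  # try the next replica or source
    raise LookupError(f"no accessible copy of {dataset_id}")


# Illustrative registry: one replicated dataset, one external-only dataset.
locations = {
    "airports": [make_local_copy(b"edge list"), make_local_copy(b"edge list")],
    "surveillance": [make_external_source(b"case counts", "secret-token")],
}

print(retrieve("airports", locations))
print(retrieve("surveillance", locations, credentials="secret-token"))
```

A dataset that must stay at a single site for security reasons simply has one external entry in the registry, and clients still call the same `retrieve` function.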
EM: Main Components
- Repository: Stores epidemic data sets and ontologies to characterise the semantic information of the data sets.
- Mediator: A collection of web services that will provide access to internal data and external sources, based on a catalogue describing existing epidemic databases through their metadata, using state-of-the-art semantic-web/grid technologies.
- Collector: Retrieves information about real-time disease incidences from publicly available data sources, such as social networks; after retrieval, the collector groups the incidences by subject and creates data sets to store in the repository.
- Forum: Allows users to organize discussions centred on the datasets managed by the Epidemic Marketplace, fostering collaboration among modellers.
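The Collector's grouping step can be pictured with a small Python sketch. It assumes each harvested message is a dictionary with a `text` field and that subjects are identified by plain keyword matching; the messages are invented, and the actual Collector may use a different method.

```python
from collections import defaultdict


def group_by_subject(messages, keywords):
    """Map each keyword to the messages whose text mentions it.

    Naive case-insensitive substring matching: a message can land in
    several groups, and 'flu' also matches 'influenza'.
    """
    groups = defaultdict(list)
    for message in messages:
        text = message["text"].lower()
        for keyword in keywords:
            if keyword.lower() in text:
                groups[keyword].append(message)
    return dict(groups)


# Invented messages standing in for harvested posts.
messages = [
    {"author": "user_a", "date": "2010-03-01", "text": "Home with the flu today"},
    {"author": "user_b", "date": "2010-03-01", "text": "Malaria talk at the conference"},
    {"author": "user_c", "date": "2010-03-02", "text": "Flu season is not over"},
]

datasets = group_by_subject(messages, ["flu", "malaria"])
print({subject: len(items) for subject, items in datasets.items()})
```

Each resulting group would then be stored in the repository as one dataset, annotated with the keyword as its subject.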
EM: Main System Requirements
EM needs to define policies and provide services for:
- Sharing and management of epidemiological data sets.
- Seamless integration of heterogeneous data sources.
- Creation of a virtual community for epidemic research.
- Distributed architecture.
- Secure access to data.
- Support for data analysis and simulation in grid environments.
- Workflows.

A number of projects retrieve epidemic data and make them available to users, such as Healthmap (Brownstein et al. 2006), MedISys (Mawudeku et al. 2005) and GIDEON (Gideon 2010). However, the set of requirements of the Epidemic Marketplace makes this platform quite different from previous projects. The main system requirements identified for the Epidemic Marketplace are listed below:

Support the sharing and management of epidemiological data sets: Registered users should be able to upload annotated data sets, and a data set rating assessment mechanism should be available. The annotated data sets will then compose a catalogue that will be available to users.

Support the seamless integration of multiple heterogeneous data sources: Users should be able to have a unified view of related data sources. Data should be available from streaming, static and dynamic sources. All data retrieved by users or other services should be available through a common interface.

Support the creation of a virtual community for epidemic research: The platform will serve as a forum for discussion that will guide the community in uncovering the needs of sharing data between providers and modellers. Users will become active participants, generating information and providing data for sharing and collaborating online.

Distributed Architecture: The Epidemic Marketplace should implement a geographically distributed architecture deployed at several sites. The distributed architecture should provide improved data access performance, availability and fault tolerance.
Support secure access to data: Access to data should be controlled. The marketplace should provide single sign-on, distributed federated authorization and multiple access policies, customizable by users.

Support data analysis and simulation in grid environments: The Epidemic Marketplace will provide data analysis and simulation services in a grid environment. Therefore, the Epidemic Marketplace should operate seamlessly with grid-specific services, such as grid security services, information services and resource allocation services.

Workflow: The platform should provide workflow support for data processing and external service interaction. This requirement is particularly important for those services that retrieve data from the Epidemic Marketplace, process it, and store the processed data back in the marketplace, such as grid-enabled data analysis and simulation services.
EM: Main non-functional requirements
- Interoperability with other software.
- Open-source.
- Standards-based.

The main non-functional requirements that have been identified for the Epidemic Marketplace are listed below:

Interoperability: The Epidemic Marketplace must interoperate with other software. Its design must take into account that, in the future, systems developed by other researchers across the world may need to query the Epidemic Marketplace catalogue for access to its datasets.

Open-source: All software packages to be used in the implementation and deployment of the Epidemic Marketplace should be open source, as should the new modules developed specifically for the Epidemic Marketplace. An open-source based solution reduces development cost, improves software trustworthiness and reliability, and simplifies support.

Standards-based: To guarantee software interoperability and the seamless integration of all geographically dispersed sites of the Epidemic Marketplace, the system will be entirely built over standards defining web services, authentication and metadata.
EM Repository Requirements
- Separation of data and metadata: metadata may contain information not directly accessible.
- Support for metadata standards: Dublin Core, because that’s what everyone seems to be using.
- Ontology support for describing and characterising the data.

The objective of the Epidemic Marketplace repository is to organize the information about existing datasets. While it is expected that the datasets will be deposited in the repository, it is possible to have information about specific datasets even if they are not stored at the repository. This may happen, for example, for security reasons. For these special datasets, the metadata services to be provided by the content repository will become the only alternative. The metadata will describe the datasets in detail, including their contents, providing information about the authors, where the dataset is available and who has access to it. The main requirements of the Repository are:

Separation of data and metadata: An important architectural feature for scientific repositories in general, and also for the Epidemic Marketplace, is a clear separation between data and metadata (Stolte et al. 2003). For instance, there should be a clear separation between metadata and actual data schemes, since metadata may contain information not directly available in data schemes.

Support for Metadata standards: Extensive support for metadata standards for web resources management and processing (e.g. searching) is required. This means the adoption of Dublin Core. It is possible that only the metadata of some data sources is available through the Epidemic Marketplace, due to privacy constraints. In those cases the client should retrieve the data directly from the site hosting the data source, following directives described in the Epidemic Marketplace.
Ontology support: One step further in the deployment of the Epidemic Marketplace is to have a semantically enabled repository using ontologies for describing and characterising the data. The Epidemic Marketplace will provide a framework for the creation and development of epidemiological ontologies, openly addressing the needs of this community and fostering its active involvement (Goni et al. 1997, Fox et al. 2006).
EM Mediator Requirements
Responsible for data exchanges with clients, IMS and other data providers (RSS, ProMED Mail, ..):
- Query and search capabilities on heterogeneous datasets: in epidemic modelling, diversity is unlimited.
- Access to “plug-in-able” resources.
- RESTful interfaces.

Mediator Requirements

The Mediator is responsible for communicating with: 1) clients, which retrieve the data collections of the Epidemic Marketplace and produce dynamic trend graphs or geographical maps according to user interaction; 2) Epiwork applications, such as Internet-based Monitoring Systems (IMS) or computational platforms (CP) for simulating the propagation of diseases; 3) other data providers, such as online news wires, RSS feeds, ProMED Mail, validated official alerts (WHO) and other event generators.

The main requirements of the Mediator are:

Heterogeneous datasets query and search capabilities: The Mediator has to manage access to data from many different sources, pertaining to different diseases and in different formats, using data query or search interfaces. Besides medical information, epidemic simulations need other types of information, such as geographic and sociological data and data about transportation networks. The data needed for an epidemic study can change significantly from disease to disease, and even between studies on the same subject, depending on experimental conditions and data collection methods.

Access to “plug-in-able” resources: One important feature to be supported by the Epidemic Marketplace, in particular for external data resources, is access to “plug-in-able” resources (Kuliberda et al. 2006). These external resources provide data not stored in an internal repository and may require virtualized access. Some resources can appear and disappear unexpectedly, due, for instance, to web site unavailability. “Plug-in-able” resources enable the dynamic addition of data sources through wrappers that ensure the physical connection to a source and convert the gathered data to one or more of the canonical data models supported by the repository.

RESTful interface: Clients should be able to search and query datasets and the corresponding metadata through RESTful interfaces.
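The wrapper idea in these notes can be sketched as follows. This is an illustration under assumptions, not the Mediator's actual design: the `Mediator` class, the `SourceUnavailable` exception, the `csv_wrapper` helper and the canonical record keys (`disease`, `country`, `cases`) are all hypothetical.

```python
import csv
import io
from typing import Callable, Iterable


class SourceUnavailable(Exception):
    """Raised by a wrapper when its source cannot be reached."""


class Mediator:
    def __init__(self) -> None:
        self._wrappers: dict[str, Callable[[], Iterable[dict]]] = {}

    def plug_in(self, name: str, wrapper: Callable[[], Iterable[dict]]) -> None:
        # Sources can be added at run time.
        self._wrappers[name] = wrapper

    def unplug(self, name: str) -> None:
        self._wrappers.pop(name, None)

    def query(self, predicate: Callable[[dict], bool]) -> tuple[list[dict], list[str]]:
        # Ask every wrapper; a vanished source is reported, not fatal.
        results, failed = [], []
        for name, wrapper in self._wrappers.items():
            try:
                records = list(wrapper())
            except SourceUnavailable:
                failed.append(name)
                continue
            results.extend(r for r in records if predicate(r))
        return results, failed


def csv_wrapper(text: str) -> Callable[[], Iterable[dict]]:
    """Wrap CSV text and convert each row to the assumed canonical model."""
    def fetch() -> Iterable[dict]:
        for row in csv.DictReader(io.StringIO(text)):
            yield {
                "disease": row["disease"].lower(),
                "country": row["country"].upper(),
                "cases": int(row["cases"]),
            }
    return fetch


def offline_wrapper() -> Iterable[dict]:
    """Stand-in for a source whose web site is currently down."""
    raise SourceUnavailable("site is down")
```

Returning the list of failed sources alongside the results lets a client tell an empty answer apart from an answer that is incomplete because a source disappeared.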
Collector Requirements
- Active data harvesting: focused web crawler, subscription to newsfeeds and email services.
- Passive data collection: EM preserves and distributes deposited datasets originating from IMS.
- Local storage capability: all collected data in at least one EM site.
- Meaningful data partitioning policies: meaningful to epidemic modellers and accounting for legal/administrative barriers.

Collector Requirements

Recent epidemiological surveillance projects are collecting data from the Internet to identify disease propagation. These systems mainly collect data from pre-selected data sources related to the subject. However, other sources, like social networks and search engine query data, may present early evidence of an infection event and its propagation (Ginsberg et al. 2008). Given the increasing popularity of social networks, a large amount of personal information can be found in real time, which can help detect the onset or the propagation of an epidemic event earlier.

The main requirements of the Collector are:

Active data harvesting: The Collector should actively harvest data about putative infections by automatically retrieving infection alerts from the Web using a focused web crawler (Chakrabarti et al. 1999) and subscriptions to newsfeeds and email services.

Passive data collection: The Collector should also be able to receive information directly from online users accessing the Epidemic Marketplace through data upload forms, or deposited from Internet Monitoring Systems (van Noort et al. 2007).

Local storage capability: All collected data should be physically stored in at least one site of the Epidemic Marketplace. This is important because the data may no longer be available from its source after some time. Data should be organized as datasets following partitioning criteria meaningful to epidemic modellers.
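As a rough illustration of the harvesting and partitioning requirements, the sketch below keeps only messages that mention a disease term and groups them into datasets. Everything in it is an assumption made for the example: the term list, the item fields (`text`, `country`, `date`) and the partitioning key of disease, country and month. A real focused crawler would use a trained relevance classifier rather than keyword matching.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical vocabulary; not taken from the Epiwork project.
DISEASE_TERMS = {
    "influenza": ("flu", "influenza", "h1n1"),
    "measles": ("measles",),
}


def classify(text: str) -> list[str]:
    """Return the diseases whose terms occur as whole words in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return sorted(d for d, terms in DISEASE_TERMS.items() if words & set(terms))


def partition(items: list[dict]) -> dict[tuple[str, str, str], list[dict]]:
    """Group relevant items into datasets keyed by (disease, country, month)."""
    datasets = defaultdict(list)
    for item in items:
        for disease in classify(item["text"]):
            key = (disease, item["country"], item["date"].strftime("%Y-%m"))
            datasets[key].append(item)
    return dict(datasets)
```

Items that match no term are dropped, and an item that mentions two diseases lands in both datasets. Partitioning by country is one way the legal and administrative barriers mentioned on the slide could be reflected in the dataset boundaries.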
EM Forum Requirements
- Group-oriented discussions with access control restrictions.
- Discussions centred on EM datasets and collections.
- Support for distributed authentication.

Forum Requirements

The Epidemic Marketplace will serve as an exchange platform connecting modellers, who search for input data for calibrating and evaluating their models, and providers, who seek the help of modellers in analysing and interpreting their data. Its user community therefore requires an online meeting point for discussing the data collections and for uncovering the data sharing requirements among providers and modellers. This will promote collaborations, through direct, trusted sharing of data within the communities and the establishment of consensus agreements between modellers and data providers on sharing data for epidemic modelling. The results will be reported to EU agencies, such as the ECDC and the EMCDDA, as a contribution to setting European standards for sharing epidemic data.

The main requirements of the Epidemic Marketplace Forum are:

Group-oriented discussions with access restrictions: Every discussion should be associated with a group of users. A user uploading a new dataset into the EM Repository defines membership and access restrictions for the corresponding online discussion group.

Support for distributed authentication: As is the case with the Mediator, clients must authenticate to at least one site of the Epidemic Marketplace to access the forum. The same set of credentials for a given client should be accepted by any instance of the Epidemic Marketplace. After authentication, the client is redirected to the Epidemic Marketplace site hosting the discussion, provided the user is included in the associated group access list.
Outline
- The need for an Epidemic Marketplace
- Metadata and Ontologies for Epidemic Modelling (Deliverable D3.1)
- Epidemic Marketplace Architecture & Implementation (Deliverable D3.2)
- Where we stand and forecasts
EM Catalogue – 1st cut

To overcome this issue, the Epiwork information platform will include a metadata catalogue that supports accurate searches for epidemic datasets. The catalogue will be implemented in phases. At first, a simple Dublin Core (DC) scheme with the 15 legacy DCMI elements will be used to annotate the datasets (Dublin Core 2009). Later, a metadata schema for epidemiological and related datasets will be developed, based on the current DCMI terms (Dublin Core 2009) and Epiwork extensions.
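To make the first phase concrete, the following sketch serialises a dataset annotation using only the 15 legacy Dublin Core elements. The element names and the namespace URI are those of the Dublin Core Metadata Element Set; the `metadata` wrapper element, the `to_dc_xml` function and the sample values are assumptions for illustration and do not reproduce the catalogue's actual format.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def to_dc_xml(record: dict) -> str:
    """Serialise a record whose keys are DC element names.

    Values may be a string or a list of strings, since every
    Dublin Core element is optional and repeatable.
    """
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # hypothetical wrapper element
    for name in DC_ELEMENTS:
        values = record.get(name, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

For example, `to_dc_xml({"title": "Weekly ILI incidence", "coverage": "Portugal", "subject": ["influenza", "surveillance"]})` yields one `dc:title`, one `dc:coverage` and two `dc:subject` elements. Rejecting unknown keys marks the boundary that the later Epiwork extensions would have to cross.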
Software Components
- Fedora Commons for the implementation of the main features of the repository.
- Access control in the platform: XACML (OASIS 2010), LDAP (Tuttle et al. 2004), Shibboleth (identity management).
- Front-end based on Muradora, now being replaced by the Drupal CMS.
Repository – Initial Deployment
- Web services interface to Fedora Commons
- LDAP user registry
- Policy enforcement point (PEP) and XACML role-based access control
- Muradora front-end likely to go
APIs and Machine access
- OAI-PMH: standard protocol/API for DC metadata exchange
- OAI-ORE: Semantic Web style data constellations
- The Semantic Web stack
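Machine access to Dublin Core metadata through a harvesting protocol can be sketched as below, assuming the slide's "PMH" refers to OAI-PMH. The verb and argument names (`ListRecords`, `metadataPrefix`, `set`, `from`, `until`, `resumptionToken`) and the `oai_dc` prefix come from the OAI-PMH specification; the function name and the base URL in the usage note are hypothetical, and nothing here contacts a server.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
    from_date: Optional[str] = None,
    until_date: Optional[str] = None,
    resumption_token: Optional[str] = None,
) -> str:
    """Build the URL of an OAI-PMH ListRecords request."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
        if from_date:
            params["from"] = from_date
        if until_date:
            params["until"] = until_date
    return f"{base_url}?{urlencode(params)}"
```

A harvester would call `list_records_url("https://em.example.org/oai", from_date="2010-01-01")` for the first page and then repeat the call with each `resumption_token` returned by the server until none is left.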
Architecture
EM
Current Focus
- Refining, populating and enriching the catalogue of epidemic resources using the initial prototype. The method of scanning published epidemic modelling studies and then inferring the metadata descriptions has proven very useful.
- Designing the user interface for the second version. It must be useful to both the expert and the occasional user.
Forthcoming Developments
- Identifying ontologies (and ontology terms) to use.
- Linking to ontology definition initiatives.
- Linking ontologies and web data using linked data conventions and ontology alignment methods.
Outline
- The need for an Epidemic Marketplace
- Metadata and Ontologies for Epidemic Modelling (Deliverable D3.1)
- Epidemic Marketplace Architecture & Implementation (Deliverable D3.2)
- Where we stand and forecasts
WP3: status
- Deliverable D3.1 (meta-model) released
- Deliverable D3.2 (prototype) released
- Hardware and base software deployed
- Initial prototype of EM with initial set of characterized datasets
- Overcoming the initial difficulties in hiring the planned resources
Publications in the 1st year

Mário J. Silva, Fabrício A.B. Silva, Luís Filipe Lopes, Francisco M. Couto, Building a Digital Library for Epidemic Modelling. Proceedings of ICDL 2010 - The International Conference on Digital Libraries 1, p. 447--459, New Delhi, India, 23--27 February, 2010. TERI Press -- New Delhi, India. Invited Paper.

Luis Filipe Lopes, João Zamite, Bruno Tavares, Francisco Couto, Fabrício A.B. Silva, Mário J. Silva, Automated Social Network Epidemic Data Collector. INForum - Simpósio de Informática, September 2009.
Current Challenges
- Motivate the community to populate the Epidemic Marketplace. Chicken-and-egg situation.
- Data anonymization is a major concern.
- Rights management to the sentence level! Anyone giving away curated UGC?
- Access control policies.
- Dataset selection and generation policies.

We are currently developing and implementing the Catalogue of the Epidemic Marketplace. This process is tightly connected with populating the Repository with different kinds of epidemic datasets, and with discussions on how a metadata description can be made as exact and complete as needed while remaining useful and acceptable to the occasional visitor who deposits a dataset or wants to annotate it. We have started using existing ontologies, such as the UMLS. Our goal is to contribute to making ontologies widely accepted by the epidemiological community and to ensure their sustainable evolution, by replicating the success of similar initiatives, such as the Gene Ontology in molecular biology (Ashburner et al. 2000).

The method of scanning published epidemic modelling studies, extracting references (explicit and implicit) to the described datasets, and then inferring the metadata descriptions they should have, is proving very useful. It makes it possible to understand the variety of data used in epidemic studies and how they are related, and to understand the difficulties that this community would experience in providing them. Moreover, these inferred annotations can also serve as examples for new Epidemic Marketplace users with no previous experience in defining metadata. We believe that this may spawn the development of increasingly rich and accurate metadata characterisations of epidemic datasets.

The biggest challenge that lies ahead is how to motivate the community to populate the Epidemic Marketplace. We will soon be facing an instance of the classic “chicken-and-egg problem,” where the prototype is not perceived as an attractive resource because it lacks a rich collection of datasets, and it lacks datasets because the community does not perceive its potential. Thus, our Epiwork partners who have been active in creating models using real-world data that they have collected over the years will have a key role.

Another strategy involves the active collection of data and updates to datasets from the web, for automatic annotation or archival into the Epidemic Marketplace. Our current prototype has been collecting data from Twitter on a daily basis. It is worth noting that the EM not only collects the data, but also stores them. This is important because messages in Twitter are only available for one month. As we periodically assemble these messages into semantically annotated data collections in the Repository, they could become a useful resource for researchers modelling the spreading of diseases. In the future we could correlate the predictions made from the data in these collections with official statistics and assess their accuracy. Previous work with web search log data, which are private, has shown how effective these short texts can be for predicting epidemic outbreaks when the date and location of their authors can be traced (Ginsberg et al. 2008).

We will welcome any other organizations willing to participate in the Epidemic Marketplace once the stress tests now underway are complete and the software reaches beta-level quality. We will also make the full source code available as Open Source and encourage the development of extensions. Later on, we expect to publish the first models providing integrated views of both internally and externally stored data, together with a catalogue of available epidemiological data.
KDnuggets, March 2010
WP3 SWOT Analysis

Strengths
- Epiwork-driven EM
- Standards-based
- Open Source modules
- Supported (until 2012)

Weaknesses
- Unpopulated EM
- Looking for the right policies
- What are the incentives?
- Interfaces to WP4 and WP5?
WP3 SWOT Analysis

Opportunities
- Epiwork testbed
- Creation of a baseline for epidemic modelling
- Showcase for partners’ outputs

Threats
- Consortium enters “everyone for himself” mode.
- “Somebody will take care of that” attitude
- EM perceived as a very expensive, complex and useless cache
Todo list and planning
- Populate Repository
- Linked Epidemic Data
- Ethics, Privacy and Anonymization
- Access control policies
- Dataset selection and generation
- Distributed Authentication
- Replicate EM node
Scheduled Deliverables
http://www.epiwork.eu
Scheduled Deliverables WP3
The fallacies of free-text
- The initial “proof-of-concept” prototype showed the limitations stemming from annotating the datasets using free text in the metadata description fields.
- A much simpler model, inspired by Web 2.0 “tags.”
- EM users will be able to freely annotate their datasets using their own terminologies (also dubbed “folksonomies”).
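The tag-based model can be illustrated with a minimal sketch. It assumes, for the sake of the example, that tags are normalised by case and whitespace so that variants of the same term meet; the `TagIndex` class and its methods are hypothetical and not part of the prototype.

```python
from collections import Counter, defaultdict


def normalise(tag: str) -> str:
    """Lower-case a tag and collapse runs of whitespace."""
    return " ".join(tag.strip().lower().split())


class TagIndex:
    """Free user-chosen tags mapped to the datasets they annotate."""

    def __init__(self) -> None:
        self._by_tag: dict[str, set[str]] = defaultdict(set)

    def annotate(self, dataset_id: str, tags: list[str]) -> None:
        for tag in tags:
            key = normalise(tag)
            if key:  # ignore empty tags
                self._by_tag[key].add(dataset_id)

    def find(self, tag: str) -> list[str]:
        # Look up without creating an entry for unknown tags.
        return sorted(self._by_tag.get(normalise(tag), ()))

    def popular(self, n: int = 5) -> list[tuple[str, int]]:
        # Tags ranked by the number of datasets that carry them.
        counts = Counter({t: len(ids) for t, ids in self._by_tag.items()})
        return counts.most_common(n)
```

Ranking tags by usage is one way a folksonomy shows which terms the community converges on, which could later inform the choice of ontology terms.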