WHD Colloquium, March 27, 2012. Historical Data Integration based on Collective Intelligence. Vladimir Zadorozhny, Graduate Information Science and Technology.

Presentation transcript:

WHD Colloquium, March 27, 2012. Historical Data Integration based on Collective Intelligence. Vladimir Zadorozhny, Graduate Information Science and Technology Program, School of Information Sciences, University of Pittsburgh. NADM Group.

Challenge: diverse, heterogeneous, semi-structured data sources → WHD data integration infrastructure → consolidated structured information.

Web of Data?
Linked Data: using the Web to create typed links between data from different sources.
Linked Data uses RDF (Resource Description Framework) to make typed statements (triples).
Expected result: a Web of Data that extends the Web with a global data space connecting diverse domains (people, companies, publications, etc.).
In general, the Web of Data has the potential (still questionable) to support loose data coupling, which may facilitate more efficient data utilization.
→ While WHD can utilize Linked Data (LD) and related Web mashup technologies to some extent, it would be premature to rely upon the Linked Data infrastructure.

Dataverse Network?
An open source application to publish, share, reference, extract and analyze research data; it facilitates making data available to others.
"Dataverse owners can upload any file type and format (excel, txt, pdf, doc, etc.), and the files will be stored and made available in the original format."
Information consumers must then integrate the data sources themselves to perform analysis across multiple "dataverses".
→ While WHD aims to be a part of the Dataverse Network, it would not encourage users to contribute data in ANY format. Instead, users integrate their data into the WHD repository while submitting the data.
→ To summarize, the WHD infrastructure crowdsources the data integration task, not just the data contribution task.

General WHD Architecture (diagram). Components shown: Information Providers; heterogeneous historical data sources; Wrapper Generation; Wrapper Registration; Wrappers; Data Submission System; structured homogeneous historical data; Internal Data Reliability Assessment; External Data Reliability Assessment; annotated historical data; Data Fusion; fused historical data; Information Consumers.

Simple Scenario
Data Source: s1 (xl)
Data Source: s2 (doc): "According to the 2006 revision of the World Population Prospects, the total population in the region of Liberia in 1950 was 824,000. The average population growth percent per year for the following ten years was 2.5. For Ivory Coast those numbers are 2,505,000 and 3.6, correspondingly."
Wrapper mapping: Territories -> Location; Population -> Population; Data Aggregation -> Total; Year -> From, To
Wrapper mapping: region -> Location; Population -> Population; Data Aggregation -> Total; Year -> From, To
WHD Infrastructure: Materialize Data or Keep Data Remotely
Extendable Target Schema (relational is not mandatory):
Source | Location | From | To | Population
s1 | Mauritania | 01/01/1950 | 12/31/1950 | 692,000
s1 | Mauritania | 01/01/1960 | 12/31/1960 | 892,000
s1 | Senegal | 01/01/1950 | 12/31/1950 | 2,543,000
s1 | Senegal | 01/01/1960 | 12/31/1960 | 3,277,000
s2 | Liberia | 01/01/1950 | 12/31/1950 | 824,000
s2 | Liberia | 01/01/1960 | 12/31/1960 | 1,052,000
s2 | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000
s2 | Ivory Coast | 01/01/1960 | 12/31/1960 | 3,692,000
Query: select * from Population
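The wrapper mappings in this scenario can be pictured as small functions that rename source fields and expand a year into a From/To interval. The following is only a minimal sketch, not the WHD implementation: the helper `make_wrapper`, the raw record layout, and the pairing of the `Territories` mapping with s1 and the `region` mapping with s2 are all assumptions made for illustration.

```python
from datetime import date

# Target schema fields as listed on the slide.
TARGET_FIELDS = ("Source", "Location", "From", "To", "Population")


def make_wrapper(source_id, location_field, population_field, year_field):
    """Return a function mapping one raw record onto the target schema.

    Hypothetical helper: the field names are supplied by the data provider,
    mirroring the 'X -> Location', 'Year -> From, To' mappings on the slide.
    """
    def wrap(record):
        year = int(record[year_field])
        population = int(str(record[population_field]).replace(",", ""))
        return {
            "Source": source_id,
            "Location": record[location_field],
            "From": date(year, 1, 1),    # Year -> From
            "To": date(year, 12, 31),    # Year -> To
            "Population": population,
        }
    return wrap


# Assumed pairing of mappings with sources (not stated in the original).
s1_wrapper = make_wrapper("s1", "Territories", "Population", "Year")
s2_wrapper = make_wrapper("s2", "region", "Population", "Year")

# Hypothetical raw records carrying values that appear in the target table.
rows = [
    s1_wrapper({"Territories": "Senegal", "Population": "2,543,000", "Year": 1950}),
    s2_wrapper({"region": "Liberia", "Population": "1,052,000", "Year": 1960}),
]

for row in rows:
    print(" | ".join(str(row[field]) for field in TARGET_FIELDS))
```

Both sources end up in the same five-column shape, which is what lets a single query over Population span them.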

Big Picture: continuously growing infrastructure (a la Wikipedia). Data Curation, Data Collection, Data Utilization.

WHD Prototype
Group of graduate IS students: special project in the Advanced Data Management class (INFSCI2711).
Content Management → Pligg (open source content management system; Apache, PHP, and MySQL based).
Data Integration Engine → Pentaho Kettle (open source data integration engine; Java-based GUI and command line tools; XML-based data transformation file).
Data providers:
→ download Wrapper Generating Software
→ configure wrappers on their workstation (using preconfigured templates)
→ register wrappers on WHD Server

XML Wrapper (figure): Data Source, Data Transformation, Transformed Data.

Data Reliability Assessment and Data Fusion
Systems based on crowdsourcing require mechanisms to ensure data quality.
The WHD infrastructure will support efficient data curation strategies based on advanced data reliability assessment and data fusion methods.
As the system continuously receives new historical reports, WHD estimates the reliability of this data, which evolves with respect to new evidence.
WHD uses a measure of the inconsistency caused by a report to assess its internal reliability.
WHD also allows users to submit subjective feedback on the reliability of data, which is used to assess external reliability.
WHD utilizes subjective logic to combine the internal and external reliability assessments.
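To make the last point concrete, here is a minimal sketch of how two reliability assessments could be combined in subjective logic. The slides do not say which operator WHD uses; this example assumes the standard cumulative fusion operator over binomial opinions (belief, disbelief, uncertainty, base rate). The class name `Opinion`, the equal-weight handling of two fully certain opinions, and the numeric values are illustrative assumptions only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial subjective-logic opinion about 'this report is reliable'."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def __post_init__(self):
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def expected(self) -> float:
        """Projected probability: belief plus the base-rate share of uncertainty."""
        return self.belief + self.base_rate * self.uncertainty


def cumulative_fusion(a: Opinion, b: Opinion) -> Opinion:
    """Cumulative fusion of two opinions (assumes both share a's base rate)."""
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        # Both opinions are fully certain; weight them equally (a simplification).
        return Opinion((a.belief + b.belief) / 2,
                       (a.disbelief + b.disbelief) / 2,
                       0.0, a.base_rate)
    return Opinion(
        (a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        (a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        (a.uncertainty * b.uncertainty) / k,
        a.base_rate,
    )


# Hypothetical assessments of one report.
internal = Opinion(belief=0.6, disbelief=0.1, uncertainty=0.3)  # from inconsistency
external = Opinion(belief=0.7, disbelief=0.2, uncertainty=0.1)  # from user feedback

fused = cumulative_fusion(internal, external)
print(round(fused.expected(), 3))
```

Under this operator the fused uncertainty is lower than either input's, which matches the idea that reliability estimates firm up as evidence accumulates.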

Historical Data: Redundancy
Temporal Overlaps
t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300
Total number of Measles cases in New York City from 1900 to 1930 = 1000 ???
Temporal overlap between t1 and t2.
Spatial Overlaps
t3 | source_ref1 | Smallpox | NY | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600
Smallpox reports: 500 (NY), 600 (NYC)
Total number of Smallpox cases in New York State from 1900 to 1930 = 1100 ???
Spatial overlap between t3 and t4.
Naming Overlaps
t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B | NY | 10/20/1910 | 10/30/1930 | 300
Total number of Hepatitis cases in New York State from 1920 to 1930 = 1700 ???
Naming overlap between t5, t6 and t7.
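The temporal case above can be checked mechanically: two reports about the same disease and location are redundant candidates when their date intervals intersect. The sketch below illustrates that check only. The `Report` type and function names are hypothetical, and matching disease and location by exact string equality is a simplifying assumption, so spatial overlaps (NY vs. NYC) and naming overlaps (Hepatitis vs. Hepatitis B) would need extra place and terminology knowledge that is not modeled here.

```python
from datetime import date
from itertools import combinations
from typing import NamedTuple


class Report(NamedTuple):
    tuple_id: str
    source: str
    disease: str
    location: str
    start: date
    end: date
    cases: int


def temporally_overlapping(a: Report, b: Report) -> bool:
    """True if both reports cover the same disease and location and their
    closed date intervals share at least one day."""
    same_subject = a.disease == b.disease and a.location == b.location
    return same_subject and a.start <= b.end and b.start <= a.end


def overlapping_pairs(reports):
    """Return the ids of every pair of reports that overlap in time."""
    return [(a.tuple_id, b.tuple_id)
            for a, b in combinations(reports, 2)
            if temporally_overlapping(a, b)]


# Tuples t1, t2 and t4 as given on the slide.
t1 = Report("t1", "source_ref1", "Measles", "NYC",
            date(1900, 10, 10), date(1920, 10, 10), 700)
t2 = Report("t2", "source_ref2", "Measles", "NYC",
            date(1910, 10, 20), date(1930, 10, 30), 300)
t4 = Report("t4", "source_ref1", "Smallpox", "NYC",
            date(1920, 10, 30), date(1930, 10, 30), 600)

pairs = overlapping_pairs([t1, t2, t4])
print(pairs)
```

A flagged pair such as (t1, t2) signals that simply adding the two case counts may double count the shared period, which is the concern the "???" on the slide raises.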

Historical Data: Inconsistency (figure): timeline of Measles reports in NYC, with reports R1 and R2 marked as redundant and inconsistent.

Information Consumer Toolset: Data Visualization Dashboard

ICTS: Map Exhibits and Timeline Widgets

ICTS: Motion Chart Animation

Conclusion
We explore a novel approach to reliable, large-scale historical data integration based on collective intelligence.
We implement this approach in the WHD infrastructure for consolidating heterogeneous historical data.
Major challenge: how to engage a large community of researchers to share their data and collectively resolve the data heterogeneities in a continuously growing, large-scale distributed historical repository?
– contributions from CHAI members (only a small fraction of Wikipedia users contribute information to ensure its growth)
– as the infrastructure evolves, users may become interested in "embedding" their data in a larger context to perform global analysis and to utilize WHD tools
– open development platform (extendable data transformation library and toolsets)

Acknowledgements
Graduate IS Students (WHD system development team): Andrew Barnett (team leader), Andrew Entin, Thomas Junker, Jidapa Kraisangka, Han Liao, Eric Miller, Ye Peng, Evan Pulgino, Henry Quattrone, Mark Swartz, Miao Tan, Liu Yuchen, Lihong Zhang.
Doctoral Students: Ying-Feng Hsu, Julian Lee.