Published by Lorin Rice. Modified over 9 years ago.
1
WHD Colloquium, March 27, 2012
Historical Data Integration based on Collective Intelligence
Vladimir Zadorozhny
Graduate Information Science and Technology Program
School of Information Sciences
University of Pittsburgh
NADM Group
2
Challenge
Diverse, Heterogeneous, Semi-structured Data Sources → WHD Data Integration Infrastructure → Consolidated Structured Information
3
Web of Data?
Linked Data: using the Web to create typed links between data from different sources.
Linked Data uses RDF (Resource Description Framework) to make typed statements (triples).
Expected result: a Web of Data that extends the Web with a global data space connecting diverse domains (people, companies, publications, etc.).
In general, the Web of Data has the potential (still questionable) to support loose data coupling, which may facilitate more efficient data utilization.
While WHD can utilize LD and related Web mashup technologies to some extent, it would be premature to rely upon the Linked Data infrastructure.
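To make the idea of typed statements concrete, here is a minimal sketch that models RDF-style triples as plain Python tuples and follows one typed link. It does not use an RDF library, and every URI, the `Triple` type, and the `objects_of` helper are hypothetical names chosen for illustration only.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One typed statement: subject, predicate (the link type), object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical URIs; none of these identifiers come from the presentation.
EX = "http://example.org/"
triples = [
    Triple(EX + "report/1", EX + "vocab/location", EX + "place/Liberia"),
    Triple(EX + "report/1", EX + "vocab/year", "1950"),
    Triple(EX + "place/Liberia", EX + "vocab/partOf", EX + "place/WestAfrica"),
]


def objects_of(data: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked from `subject` by the typed link `predicate`."""
    return [t.obj for t in data if t.subject == subject and t.predicate == predicate]


location = objects_of(triples, EX + "report/1", EX + "vocab/location")
```

Because the link type is itself part of each statement, data from different sources can be connected by adding triples rather than by agreeing on a shared table layout up front.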
4
Dataverse Network?
An open source application to publish, share, reference, extract and analyze research data, which facilitates making data available to others.
"Dataverse owners can upload any file type and format (excel, txt, pdf, doc, etc.), and the files will be stored and made available in the original format" (http://thedata.org/files/dataversehandout.pdf)
Information consumers should further integrate data sources to perform analysis using multiple "dataverses".
While WHD aims to be a part of the Dataverse Network, it would not encourage users to contribute data in ANY format. Instead, users integrate their data into the WHD repository while submitting the data.
To summarize, the WHD infrastructure crowdsources the data integration task, not just the data contribution task.
5
General WHD Architecture
[Diagram components: Information Providers; Heterogeneous historical data sources; Wrapper Generation; Wrapper Registration; Wrapper; Data Submission System; Structured homogeneous historical data; Internal Data Reliability Assessment; External Data Reliability Assessment; Annotated historical data; Data Fusion; Fused historical data; Information Consumers]
6
Simple Scenario

Data Source: s1 (xl)
Wrapper mapping:
  Territories -> Location
  Population -> Population
  Data Aggregation -> Total
  Year -> From, To

Data Source: s2 (doc)
According to the 2006 revision of the World Population Prospects the total population in the region of Liberia in 1950 was 824,000. The average population growth percent per year for the following ten years was 2.5. For Ivory Coast those numbers are 2,505,000 and 3.6 correspondingly.
Wrapper mapping:
  region -> Location
  Population -> Population
  Data Aggregation -> Total
  Year -> From, To

WHD Infrastructure
Extendable Target Schema (relational is not mandatory):

Source | Location    | From       | To         | Population
s2     | Liberia     | 01/01/1950 | 12/31/1950 | 824,000
s2     | Liberia     | 01/01/1960 | 12/31/1960 | 1,052,000
s2     | Ivory Coast | 01/01/1950 | 12/31/1950 | 2,505,000
s2     | Ivory Coast | 01/01/1960 | 12/31/1960 | 3,692,000
s1     | Mauritania  | 01/01/1950 | 12/31/1950 | 692,000
s1     | Mauritania  | 01/01/1960 | 12/31/1960 | 892,000
s1     | Senegal     | 01/01/1950 | 12/31/1950 | 2,543,000
s1     | Senegal     | 01/01/1960 | 12/31/1960 | 3,277,000

Storage options: Materialize Data, or Keep Data Remotely
Query: select * from Population
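The wrapper mappings in this scenario can be sketched as a small field-renaming function. This is only an illustration under stated assumptions, not the WHD wrapper implementation: the function name `wrap_record`, the dictionary-based record format, and the choice to expand a year into a January 1 to December 31 interval (which matches the target rows shown) are all hypothetical, and the "Data Aggregation -> Total" mapping is left out because the slide does not say how it is represented.

```python
from datetime import date
from typing import Any, Dict

# Field mappings as listed for the two sources (source field -> target field).
S1_MAPPING = {"Territories": "Location", "Population": "Population"}
S2_MAPPING = {"region": "Location", "Population": "Population"}


def wrap_record(source_id: str, record: Dict[str, Any],
                mapping: Dict[str, str]) -> Dict[str, Any]:
    """Map one source record to the target schema
    (Source, Location, From, To, Population)."""
    target: Dict[str, Any] = {"Source": source_id}
    for source_field, target_field in mapping.items():
        target[target_field] = record[source_field]
    # Year -> From, To: assume a year covers the whole calendar year.
    year = int(record["Year"])
    target["From"] = date(year, 1, 1)
    target["To"] = date(year, 12, 31)
    return target


row_s1 = wrap_record(
    "s1", {"Territories": "Senegal", "Year": 1950, "Population": 2543000}, S1_MAPPING
)
row_s2 = wrap_record(
    "s2", {"region": "Liberia", "Year": 1950, "Population": 824000}, S2_MAPPING
)
```

Both sources end up with identical keys, so rows from s1 and s2 can be stored or queried together regardless of their original format.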
7
Big Picture: continuously growing infrastructure (a la Wikipedia)
Data Curation, Data Collection, Data Utilization
8
WHD Prototype
Group of graduate IS students: special project in the Advanced Data Management class (INFSCI2711)
Content Management → Pligg (Open Source Content Management System; Apache, PHP, and MySQL based)
Data Integration Engine → Pentaho Kettle (Open Source Data Integration Engine; Java-based GUI and command line tools; XML-based data transformation file)
Data providers:
– download the Wrapper Generating Software
– configure wrappers on their workstation (using preconfigured templates)
– register wrappers on the WHD Server
10
XML Wrapper
[Diagram labels: Data Source, Data Transformation, Transformed Data]
12
Data Reliability Assessment and Data Fusion
Systems based on crowdsourcing require mechanisms to ensure data quality.
The WHD infrastructure will support efficient data curation strategies based on advanced data reliability assessment and data fusion methods.
As the system continuously receives new historical reports, WHD estimates the reliability of this data, which evolves with respect to new evidence.
WHD uses a measure of the inconsistency caused by a report to assess its internal reliability.
WHD also allows users to submit subjective feedback on the reliability of data to assess external reliability.
WHD utilizes subjective logic to combine the internal and external reliability assessments.
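The slide does not say which subjective-logic operator WHD applies. As one possible illustration, the sketch below combines two opinions with the cumulative fusion (consensus) operator from subjective logic, where an opinion is a (belief, disbelief, uncertainty) triple summing to 1. The `Opinion` type, the function name, and the example numbers are assumptions made for this sketch, and base rates are omitted for brevity.

```python
from typing import NamedTuple


class Opinion(NamedTuple):
    """Subjective-logic opinion; belief + disbelief + uncertainty == 1."""
    belief: float
    disbelief: float
    uncertainty: float


def cumulative_fusion(a: Opinion, b: Opinion) -> Opinion:
    """Fuse two opinions about the same statement (consensus operator)."""
    k = a.uncertainty + b.uncertainty - a.uncertainty * b.uncertainty
    if k == 0:
        # Both opinions have zero uncertainty; this simple form is undefined.
        raise ValueError("cannot fuse two opinions with zero uncertainty")
    return Opinion(
        belief=(a.belief * b.uncertainty + b.belief * a.uncertainty) / k,
        disbelief=(a.disbelief * b.uncertainty + b.disbelief * a.uncertainty) / k,
        uncertainty=(a.uncertainty * b.uncertainty) / k,
    )


# Hypothetical assessments of a single report.
internal = Opinion(belief=0.6, disbelief=0.2, uncertainty=0.2)  # from inconsistency
external = Opinion(belief=0.5, disbelief=0.1, uncertainty=0.4)  # from user feedback
combined = cumulative_fusion(internal, external)
```

With this operator the fused uncertainty is lower than either input uncertainty, which fits the idea that reliability estimates firm up as new evidence arrives.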
13
Historical Data: Redundancy

Temporal Overlaps
t1 | source_ref1 | Measles | NYC | 10/10/1900 | 10/10/1920 | 700
t2 | source_ref2 | Measles | NYC | 10/20/1910 | 10/30/1930 | 300
Total number of Measles cases in New York City from 1900 to 1930: 700 + 300 = 1000 ???
Temporal overlap between t1 and t2
[Timeline 1900 to 1930: Measles reports of 700 and 300]

Spatial Overlaps
t3 | source_ref1 | Smallpox | NY | 10/20/1900 | 10/20/1920 | 500
t4 | source_ref1 | Smallpox | NYC | 10/30/1920 | 10/30/1930 | 600
Total number of Smallpox cases in New York State from 1900 to 1930: 500 + 600 = 1100 ???
Spatial overlap between t3 and t4
[Timeline 1900 to 1930: Smallpox reports of 500 (NY) and 600 (NYC)]

Naming Overlaps
t5 | source_ref1 | Yellow fever | NY | 10/10/1900 | 10/10/1920 | 700
t6 | source_ref2 | Hepatitis | NY | 10/10/1900 | 10/10/1920 | 700
t7 | source_ref4 | Hepatitis B | NY | 10/20/1910 | 10/30/1930 | 300
Total number of Hepatitis cases in New York State from 1920 to 1930: 700 + 700 + 300 = 1700 ???
Naming overlap between t5, t6 and t7
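The temporal overlap between t1 and t2 is why their counts cannot simply be added. A minimal sketch of detecting such an overlap follows; the `Report` type and `temporal_overlap` function are hypothetical names, and the check deliberately requires identical disease and location strings, so it would not catch the spatial or naming overlaps listed on this slide.

```python
from datetime import date
from typing import NamedTuple, Optional, Tuple


class Report(NamedTuple):
    tuple_id: str
    source: str
    disease: str
    location: str
    start: date
    end: date
    cases: int


def temporal_overlap(a: Report, b: Report) -> Optional[Tuple[date, date]]:
    """Return the shared date interval of two reports on the same disease
    and location, or None if their intervals do not intersect."""
    if a.disease != b.disease or a.location != b.location:
        return None
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    return (start, end) if start <= end else None


t1 = Report("t1", "source_ref1", "Measles", "NYC", date(1900, 10, 10), date(1920, 10, 10), 700)
t2 = Report("t2", "source_ref2", "Measles", "NYC", date(1910, 10, 20), date(1930, 10, 30), 300)
t4 = Report("t4", "source_ref1", "Smallpox", "NYC", date(1920, 10, 30), date(1930, 10, 30), 600)

overlap = temporal_overlap(t1, t2)
```

Here `overlap` is the interval from 10/20/1910 to 10/10/1920, during which cases may be counted by both reports.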
14
Historical Data: Inconsistency
[Figure: timeline of Measles reports in NYC from two sources, R1 and R2, with counts 200, 500, 300, 400, 700, ...; the reports are redundant and inconsistent]
15
Information Consumer Toolset: Data Visualization Dashboard
16
ICTS: Map Exhibits and Timeline Widgets
17
ICTS: Motion Chart Animation
18
Conclusion
We explore a novel approach to reliable, large-scale historical data integration based on collective intelligence.
We implement this approach in the WHD infrastructure for consolidating heterogeneous historical data.
Major challenge: how to engage a large community of researchers to share their data and collectively resolve the data heterogeneities in a continuously growing large-scale distributed historical repository?
– contributions from CHAI members (only a small fraction of Wikipedia users contributes information to ensure its growth)
– as the infrastructure evolves, users may become interested in “embedding” their data in a larger context to perform global analysis and to utilize WHD tools
– open development platform (extendable data transformation library and toolsets)
19
Acknowledgements
Graduate IS Students (WHD system development team): Andrew Barnett (team leader), Andrew Entin, Thomas Junker, Jidapa Kraisangka, Han Liao, Eric Miller, Ye Peng, Evan Pulgino, Henry Quattrone, Mark Swartz, Miao Tan, Liu Yuchen, Lihong Zhang
Doctoral Students: Ying-Feng Hsu, Julian Lee