Automatically Generating Linked Data from Tables
Varish Mulwad, University of Maryland, Baltimore County
November 15, 2011

Presentation transcript:

What?

Example table:

State | FIPS | County | FIPS | Group | Label | Value
Alabama | 1 | Macon | 87 | Farms with Black or African American operators | Value of sales of grains, oil seeds, dry beans, and dry peas (farms) | 5
Arizona | … | Navajo | …
Arkansas | 5 | Union | 139 | Farms with women principal operators | Total value of agricultural products sold (farms) | 56
California | 6 | Humboldt | 23 | … | … | 19

Slide annotations: AdministrativeRegion; dbpedia-owl:state; map literals as values of properties.

Introduction  Related Work  Baseline  Results  Joint Inference  Conclusion

Contribution: all this in a completely automated way!

Input table:

State | FIPS | County | FIPS | Group | Label | Value
Alabama | 1 | Macon | 87 | Farms with Black or African American operators | Value of sales of grains, oil seeds, dry beans, and dry peas (farms) | 5
Arizona | … | Navajo | …
Arkansas | 5 | Union | 139 | Farms with women principal operators | Total value of agricultural products sold (farms) | 56

Generated output:

dgtwc:. is rdfs:label of dbpedia-owl:AdministrativeRegion.
[ a dgtwc:DataEntry;
  dbpedia-owl:state dbpedia:Alabama;
  dbpedia:FIPS county code 000;
  dbpedia:Federal Information Processing Standard state code 001;
  dbpedia-owl:ethnicGroup “Farm with women principal
  dbpedia-owl:number 6444].

Why?

Tables are everywhere! … yet …
The web: 154 million high-quality relational tables [1]

Evidence-based medicine

The idea behind evidence-based medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables. However, the rate at which meta-analyses are published remains very low, which hampers effective health care treatment.

Figure: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010. Chart labels: # of clinical trials published in 2008; # of meta-analyses published in …

More than 400,000 raw and geospatial datasets, of which roughly less than 1% are in RDF

Current Systems
- Require users to have knowledge of the Semantic Web
- Do not automatically link to existing classes and entities on the Semantic Web / Linked Data cloud
- RDF data in some cases is as useless as raw data
- Majority of the work focused on relational data where a schema is available
- Web table systems use ‘semantically poor knowledge bases’

How?

Building a table interpretation framework
- Preliminary work / baseline system
- Analysis and evaluation of the baseline
- “Domain Independent” framework grounded in graphical models and probabilistic reasoning

The System’s Brain (Knowledgebase)
- Yago
- Wikitology: a hybrid knowledgebase where structured data meets unstructured data

Note: Wikitology was created as part of Zareen Syed’s Ph.D. dissertation. Syed, Z., and Finin, T. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.

The Baseline System

T2LD Framework: Predict class for columns → Linking the table cells → Identify and discover relations

Predicting class labels for a column

Column “State” (class) with cell values (instances): Alabama, Arizona, Arkansas, California.

Ranked candidate entities for the cell “Alabama”:
1. Alabama
2. Alabama_(band)
3. Alabama_(people)

Classes of the candidate entities, one set per cell:
- {dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, …}
- {dbpedia-owl:Place, yago:StatesOfTheUnitedStates, dbpedia-owl:Film, …}
- {…}

Combined candidate classes for the column: dbpedia-owl:Place, dbpedia-owl:AdministrativeRegion, yago:StatesOfTheUnitedStates, dbpedia-owl:Band, yago:NativeAmericanTribes, dbpedia-owl:Film, ...
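The class prediction step can be sketched as a vote over the classes of each cell's candidate entities. This is only an illustration, not the T2LD implementation: the candidate lists are hand-written stand-ins for knowledge-base query results, `Arizona_(film)` is a made-up entity name, and the reciprocal-rank weighting is an assumption.

```python
from collections import defaultdict

# Hypothetical KB results: cell value -> ranked candidate entities with classes.
# A real system would obtain these by querying the knowledge base.
CANDIDATES = {
    "Alabama": [
        ("Alabama", ["dbpedia-owl:Place", "dbpedia-owl:AdministrativeRegion",
                     "yago:StatesOfTheUnitedStates"]),
        ("Alabama_(band)", ["dbpedia-owl:Band"]),
        ("Alabama_(people)", ["yago:NativeAmericanTribes"]),
    ],
    "Arizona": [
        ("Arizona", ["dbpedia-owl:Place", "yago:StatesOfTheUnitedStates"]),
        ("Arizona_(film)", ["dbpedia-owl:Film"]),  # made-up entity name
    ],
}


def rank_column_classes(cells, candidates):
    """Score each class by summing 1/rank over the candidates that carry it."""
    scores = defaultdict(float)
    for cell in cells:
        for rank, (_entity, classes) in enumerate(candidates.get(cell, []), start=1):
            for cls in classes:
                scores[cls] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


ranked = rank_column_classes(["Alabama", "Arizona"], CANDIDATES)
for cls, score in ranked:
    print(f"{score:.2f}  {cls}")
```

With this toy data the general class dbpedia-owl:Place ties with the more specific yago:StatesOfTheUnitedStates, which shows how a simple vote can end up preferring general labels.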

Linking table cells to entities

Query: Macon + County + Alabama; Farms with Black or African American operators; dbpedia-owl:AdministrativeRegion

Candidate entities:
1. Macon County, Alabama
2. Macon County, Illinois

Classifier 1: SVM Rank (ranks the set of entities)
Classifier 2: SVM (computes confidence)

Outcome: link to the top-ranked entity, or don’t link.
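The two-classifier decision can be sketched as follows. Hand-set linear scores stand in for the SVM Rank and SVM models here, so the feature names, weights, and bias are all hypothetical; only the control flow (rank first, then decide whether to link) follows the slide.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    entity: str
    features: dict  # hypothetical feature name -> value in [0, 1]


# Hypothetical weights standing in for a trained ranking model.
RANK_WEIGHTS = {"string_similarity": 1.0, "class_match": 0.5, "context_match": 1.0}


def rank_score(candidate):
    """Stage 1 stand-in: linear score used to order the candidates."""
    return sum(RANK_WEIGHTS.get(name, 0.0) * value
               for name, value in candidate.features.items())


def should_link(top_score, runner_up_score):
    """Stage 2 stand-in: link only if the top candidate is strong and clearly ahead."""
    margin = top_score - runner_up_score
    decision_value = 0.5 * top_score + 1.0 * margin - 1.0  # hypothetical weights and bias
    return decision_value >= 0.0


def link_cell(candidates):
    """Return the entity to link to, or None for 'don't link'."""
    if not candidates:
        return None
    ranked = sorted(candidates, key=rank_score, reverse=True)
    top_score = rank_score(ranked[0])
    runner_up_score = rank_score(ranked[1]) if len(ranked) > 1 else 0.0
    return ranked[0].entity if should_link(top_score, runner_up_score) else None


macon = [
    Candidate("Macon County, Alabama",
              {"string_similarity": 0.6, "class_match": 1.0, "context_match": 1.0}),
    Candidate("Macon County, Illinois",
              {"string_similarity": 0.6, "class_match": 1.0, "context_match": 0.0}),
]
print(link_cell(macon))
```

Keeping the link decision separate from the ranking lets the system decline to link a cell when no candidate stands out.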

Identify Relations

State | County | Candidate relations
Alabama | Macon | Rel ‘A’
Arizona | Navajo | Rel ‘A’, ‘C’
Arkansas | Union | Rel ‘A’, ‘B’, ‘C’
California | Humboldt | Rel ‘A’, ‘B’
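One plausible way to pick a relation for a column pair from per-row candidates like these is a simple vote. The slide does not say how the candidates are combined, so the majority vote below is an assumption.

```python
from collections import Counter

# Candidate relations found for each (State, County) row pair, as on the slide.
ROW_RELATIONS = [
    {"A"},            # Alabama - Macon
    {"A", "C"},       # Arizona - Navajo
    {"A", "B", "C"},  # Arkansas - Union
    {"A", "B"},       # California - Humboldt
]


def vote_relations(row_relations):
    """Count how many rows support each relation; most supported first."""
    counts = Counter(rel for rels in row_relations for rel in rels)
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


votes = vote_relations(ROW_RELATIONS)
print(votes)
```

Here relation 'A' is the only one supported by every row, so it would be chosen for the State and County columns.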

Generating a linked RDF

dgtwc:. is rdfs:label of dbpedia-owl:AdministrativeRegion.
[ a dgtwc:DataEntry;
  dbpedia-owl:state dbpedia:Alabama;
  dbpedia:FIPS county code 000;
  dbpedia:Federal Information Processing Standard state code 001;
  dbpedia-owl:ethnicGroup “Farm with women principal
  dbpedia-owl:number 6444].
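As a rough illustration of this last step, one table row could be written out as a Turtle blank node like this. The `dgtwc` namespace URI is a placeholder (the real one is not given in the transcript), the helper names are hypothetical, and the property choices simply mirror the fragment above.

```python
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dgtwc": "http://example.org/dgtwc#",  # placeholder URI, an assumption
}


def turtle_literal(value):
    """Render a Python value as a Turtle literal (integers bare, the rest quoted)."""
    if isinstance(value, int):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def row_to_turtle(row_class, links, literals):
    """Serialize one row: entity links as prefixed names, other cells as literals."""
    body = [f"    a {row_class}"]
    body += [f"    {prop} {entity}" for prop, entity in links]
    body += [f"    {prop} {turtle_literal(value)}" for prop, value in literals]
    header = "\n".join(f"@prefix {p}: <{uri}> ." for p, uri in PREFIXES.items())
    return header + "\n\n[\n" + " ;\n".join(body) + "\n] .\n"


turtle = row_to_turtle(
    "dgtwc:DataEntry",
    links=[("dbpedia-owl:state", "dbpedia:Alabama")],
    literals=[("dbpedia-owl:ethnicGroup", "Farms with women principal operators"),
              ("dbpedia-owl:number", 6444)],
)
print(turtle)
```

Cells that were linked to entities become resource references, while unlinked cells stay as literal values of the discovered properties.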

Evaluation of the baseline system

Dataset summary
- Number of tables: 15
- Total number of rows: 199
- Total number of columns: 56 (52)*
- Total number of entities: 639 (611)*

* The number in brackets is the count excluding columns that contained numbers.

Evaluation # 1 (MAP)
- Compared the system’s ranked list of labels against a human-ranked list of labels
- Metric: Average Precision (a.p.) [Mean Average Precision gives a mean over a set of queries]
- Commonly used in the Information Retrieval domain to compare two ranked sets

Evaluation # 1 (MAP)
MAP =
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
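For reference, the textbook Information Retrieval definition of average precision can be computed as below, treating the evaluator's labels as the relevant set. The exact variant used in this evaluation is not stated in the slides, so the normalization by the number of relevant labels is an assumption, and the printed value is not the figure from the original slide.

```python
def average_precision(system_ranked, relevant):
    """Average of precision@k over the ranks k where a relevant label appears,
    normalized by the total number of relevant labels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(system_ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(queries):
    """Mean of average precision over (system_ranked, relevant) pairs."""
    if not queries:
        return 0.0
    return sum(average_precision(s, r) for s, r in queries) / len(queries)


system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]
print(round(average_precision(system, evaluator), 4))
```

In this example the system places the general label Person first, so the two matching labels are only credited at ranks 2 and 3, which lowers the score.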

Accuracy for Entity Linking
Overall Accuracy: %

Lessons Learnt
- Sequential system: errors percolated from one phase to the next
- The current system favors general classes over specific ones (MAP score = 0.411)
- Largely a system driven by “heuristics”
- Although we consider evidence, we don’t do assignment jointly

T2LD Framework: Predict class for columns → Linking the table cells → Identify and discover relations

A “Domain Independent” Framework

Diagram components: input tables (a,b,c,…; m,n,o,…; x,y,z,…); Probabilistic Graphical Model / Joint Inference Model; KB; Domain Knowledge (Linked Data Cloud / Medical Domain / Open Govt. Domain); Query; Linked Data.

Joint Inference over evidence in a table
Probabilistic Graphical Models

Parameterized graphical model

Variable nodes:
- Column headers: C1, C2, C3
- Row values: R11, R12, R13, R21, R22, R23, R31, R32, R33

Factor nodes:
- A function that captures the affinity between the column headers and row values
- A factor that captures interaction between column headers
- A factor that captures interaction between row values
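To make the idea of joint assignment concrete, here is a toy sketch for a single column with one header variable and two row variables. It enumerates every joint assignment and scores it as a product of factors. All scores, entity names, and the affinity function are invented for illustration, and only the header-to-row factor is modeled; the header-header and row-row factors are left out.

```python
from itertools import product

# Hypothetical type of each candidate entity.
ENTITY_TYPE = {
    "Alabama (state)": "State",
    "Alabama (band)": "Band",
    "Arizona (state)": "State",
}

# Hypothetical local evidence for each candidate, e.g. from string matching.
LOCAL_SCORE = {
    "Alabama (state)": 1.0,
    "Alabama (band)": 1.1,  # locally slightly preferred
    "Arizona (state)": 1.0,
}


def affinity(column_class, entity):
    """Factor between a column header class and a row value entity."""
    return 1.0 if ENTITY_TYPE[entity] == column_class else 0.1


def best_joint_assignment(class_candidates, cell_candidates):
    """Brute-force search for the highest-scoring (class, entities) assignment."""
    best, best_score = None, -1.0
    for column_class in class_candidates:
        for entities in product(*cell_candidates):
            score = 1.0
            for entity in entities:
                score *= LOCAL_SCORE[entity] * affinity(column_class, entity)
            if score > best_score:
                best, best_score = (column_class, entities), score
    return best, best_score


assignment, score = best_joint_assignment(
    ["State", "Band"],
    [["Alabama (state)", "Alabama (band)"], ["Arizona (state)"]],
)
print(assignment, score)
```

Although the band is the locally preferred reading of "Alabama", the joint score favors the state reading because it agrees with the other row and with the column class. Brute-force enumeration only works for toy sizes; a real model would need an approximate inference procedure.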

Challenges

Challenges - Literals
Example literal columns: Population (690,… …,000) and Age.
- Population / Profit?
- Age / Percentage?
Use evidence from the rest of the table to decide.

Challenges - Metadata

More Challenges!
- Sampling and interpretation: data set 1425 has > 400,000 rows!
- Human in the loop

Conclusion
- Presented a framework for inferring the semantics of tables and generating Linked Data
- Evaluation of the baseline system shows that tackling the problem is feasible
- Work is in progress on building a framework grounded in graphical models and probabilistic reasoning
- Working on tackling challenges posed by tables from domains such as medical and open government data

References
1. Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. WebTables: exploring the power of tables on the web. PVLDB 1(1):538–549.
2. Hurst, M. Towards a theory of tables. IJDAR 8(2-3).
3. Embley, D. W.; Lopresti, D. P.; and Nagy, G. Notes on contemporary table recognition. In Document Analysis Systems.
4. Wang, Jingjing; Shao, Bin; Wang, Haixun; and Zhu, Kenny Q. Understanding tables on the web. Technical report, Microsoft Research Asia.
5. Venetis, Petros; Halevy, Alon; Madhavan, Jayant; Pasca, Marius; Shen, Warren; Wu, Fei; Miao, Gengxin; and Wu, Chung. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).
6. Limaye, Girija; Sarawagi, Sunita; and Chakrabarti, Soumen. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).

Thank You! Questions?
Project Page:

Backup slides

Evaluation # 2 (Correctness)
- Evaluated whether our predicted class labels were “fair and correct”
- A class label may not be the most accurate one, but may still be correct. E.g. dbpedia:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
- Three human judges evaluated our predicted class labels

Evaluation # 2 (Correctness)
- Column: Nationality. Prediction: MilitaryConflict
- Column: Birth Place. Prediction: PopulatedPlace
Overall Accuracy: %

Querying Wikitology

A graphical model for tables

Variable nodes: column headers C1, C2, C3 and row values R11, R12, R13, R21, R22, R23, R31, R32, R33.

Example: column “State” (class) with cell values (instances) Alabama, Arizona, Arkansas, California.

Dataset
Number of Farms; Farms with women principal operators; Alabama; <rdf:type rdf:resource=“ /data-gov-twc.rdf#DataEntry”/>; 6444