Generating Linked Data by inferring the semantics of tables. Varish Mulwad, University of Maryland, Baltimore County.


Generating Linked Data by inferring the semantics of tables
Varish Mulwad
University of Maryland, Baltimore County
September 2, 2011
Dr. Tim Finin, Dr. Anupam Joshi

Goal
Image from: Zagari RM, Bianchi-Porro G, Fiocca R, Gasbarrini G, Roda E, Bazzoli F. Comparison of 1 and 2 weeks of omeprazole, amoxicillin and clarithromycin treatment for Helicobacter pylori eradication: the HYPER Study. Gut. 2007;56: [PMID: ]

Contribution
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward
…nalBasketballAssociationTeams
dbprop:team
Map literals as values of properties

Contribution
  […] is rdfs:label of dbpedia-owl:BasketballPlayer.
  […] is rdfs:label of yago:NationalBasketballAssociationTeams.
  "Michael […] is rdfs:label of dbpedia:Michael Jordan.
  dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer.
  "Chicago […] is rdfs:label of dbpedia:Chicago Bulls.
  dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams.
All this in a completely automated way!

Introduction & Motivation

Tables are everywhere!
389,697 raw and geospatial datasets
The web: 154 million high-quality relational tables (Cafarella et al. 2008)
Introduction → Related Work → Baseline → Results → Joint Inference → Conclusion

Evidence-based medicine
The idea behind Evidence-based Medicine is to judge the efficacy of treatments or tests by meta-analyses or reviews of clinical trials. Key information in such trials is encoded in tables. However, the rate at which meta-analyses are published remains very low … hampers effective health care treatment …
Figure: number of clinical trials published in 2008 versus number of meta-analyses published in 2008. Source: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010.

Related Work
Extracting tables from documents and web pages: Hurst (2006), Embley et al. (2006)
Understanding semantics of tables: Wang et al. (2011), Venetis et al. (2011), Limaye et al. (2010)

Current systems
Use ‘semantically poor’ knowledge bases
Only one system focuses on complete table interpretation
Do not generate Linked Data
No system tackles literal data
  Literal data is a critical piece of evidence for interpreting medical tables
No system deals with tables in specialized domains (e.g. tables found in medical literature)

Building a table interpretation framework
Preliminary work / Baseline system
Analysis and evaluation of the baseline
Framework grounded in graphical models and probabilistic reasoning

The System’s Brain (Knowledge base)
Yago
Wikitology [1]: a hybrid knowledge base where structured data meets unstructured data
[1] Wikitology was created as part of Zareen Syed’s Ph.D. dissertation. Syed, Z., and Finin, T. Creating and Exploiting a Hybrid Knowledge Base for Linked Data, volume 129 of Revised Selected Papers Series: Communications in Computer and Information Science. Springer.

The Baseline System

T2LD Framework
Predict Class for Columns → Linking the table cells → Identify and Discover relations

Predicting Class Labels for column
Column: Team (Chicago, Philadelphia, Houston, San Antonio)
Instances: 1. Chicago Bulls 2. Chicago 3. Judy Chicago
Classes:
  {dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams}
  {dbpedia-owl:Place, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, yago:NationalBasketballAssociationTeams, …}
  {…}
Combined set of candidate classes: dbpedia-owl:Place, dbpedia-owl:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams, dbpedia-owl:PopulatedPlace, dbpedia-owl:Film, …
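One way to read this slide: each cell string is looked up in the knowledge base, the classes of its top candidate entities are pooled, and the column label is picked from the pooled set. The sketch below ranks pooled classes by how many cells support them. The plain vote is an assumption made for illustration and is not necessarily the scoring T2LD uses; the assignment of class sets to particular cells is also hypothetical.

```python
from collections import Counter

# Hypothetical lookup results: classes of the candidate entities for each
# cell of the "Team" column. The class names appear on the slide; which
# cell each set belongs to, and the Houston set, are made up.
candidate_classes = {
    "Chicago": {"dbpedia-owl:Place", "dbpedia-owl:City", "yago:WomenArtist",
                "yago:LivingPeople", "yago:NationalBasketballAssociationTeams"},
    "Philadelphia": {"dbpedia-owl:Place", "dbpedia-owl:PopulatedPlace",
                     "dbpedia-owl:Film", "yago:NationalBasketballAssociationTeams"},
    "Houston": {"dbpedia-owl:Place", "dbpedia-owl:City",
                "yago:NationalBasketballAssociationTeams"},
}

def rank_column_classes(classes_per_cell):
    """Rank classes by the number of cells whose candidates carry them."""
    votes = Counter()
    for classes in classes_per_cell.values():
        votes.update(classes)
    # Highest vote count first; ties broken alphabetically for a stable order.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = rank_column_classes(candidate_classes)
for cls, count in ranking[:3]:
    print(cls, count)
```

With this toy data a plain vote ties the general class dbpedia-owl:Place with the specific team class, which resembles the bias toward general classes reported under Lessons Learnt.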

Linking table cells to entities
Input: Michael Jordan + Chicago + Shooting Guard, dbpedia-owl:BasketballPlayer
Candidate entities: 1. Michael Jordan 2. Michael-Hakim Jordan
Classifier 1: SVM Rank (ranks the set of entities)
Classifier 2: SVM (computes confidence)
Outcome: link to the top-ranked entity, or don’t link
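The two-stage decision on this slide can be sketched as follows. The functions `rank_score` and `confidence` are hypothetical stand-ins for the SVM Rank and SVM classifiers, and the scores and the 0.5 threshold are invented for illustration; none of these values come from the talk.

```python
from typing import Callable, Optional, Sequence

def link_cell(
    candidates: Sequence[str],
    rank_score: Callable[[str], float],
    confidence: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not from the slides
) -> Optional[str]:
    """Return the entity to link, or None for "don't link".

    rank_score stands in for the ranking classifier (SVM Rank on the slide);
    confidence stands in for the second classifier (SVM on the slide).
    """
    if not candidates:
        return None
    best = max(candidates, key=rank_score)   # stage 1: top-ranked entity
    if confidence(best) >= threshold:        # stage 2: accept or reject it
        return best
    return None

# Hypothetical scores for the cell "Michael Jordan".
scores = {"Michael Jordan": 0.92, "Michael-Hakim Jordan": 0.31}

linked = link_cell(list(scores), lambda e: scores[e], lambda e: scores[e])
rejected = link_cell(list(scores), lambda e: scores[e], lambda e: 0.2)
print(linked, rejected)
```

Keeping the ranking and the accept/reject decision separate means a cell with no good match in the knowledge base can stay unlinked instead of being forced onto the least bad candidate.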

Identify Relations
Name            Team          Candidate relations
Michael Jordan  Chicago       Rel ‘A’
Allen Iverson   Philadelphia  Rel ‘A’, ‘C’
Yao Ming        Houston       Rel ‘A’, ‘B’, ‘C’
Tim Duncan      San Antonio   Rel ‘A’, ‘B’
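A minimal sketch of how the per-row candidate relations might be combined into one relation for the column pair: count how many rows support each relation and prefer the most frequent one. The counting rule is an assumption for illustration, and pairing each relation set with a row follows the order on the slide.

```python
from collections import Counter

# Candidate relations per row, using the placeholder names from the slide.
relations_per_row = [
    {"A"},            # Michael Jordan / Chicago
    {"A", "C"},       # Allen Iverson / Philadelphia
    {"A", "B", "C"},  # Yao Ming / Houston
    {"A", "B"},       # Tim Duncan / San Antonio
]

def rank_relations(rows):
    """Rank relations by the number of rows in which they hold."""
    votes = Counter()
    for relations in rows:
        votes.update(relations)
    # Most supported relation first; ties broken alphabetically.
    return sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))

ranked = rank_relations(relations_per_row)
print(ranked)
```

Relation ‘A’ is the only one supported by all four rows, so it ranks first under this rule.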

Generating a linked RDF
  […] is rdfs:label of dbpedia-owl:BasketballPlayer.
  […] is rdfs:label of yago:NationalBasketballAssociationTeams.
  "Michael […] is rdfs:label of dbpedia:Michael Jordan.
  dbpedia:Michael Jordan a dbpedia-owl:BasketballPlayer.
  "Chicago […] is rdfs:label of dbpedia:Chicago Bulls.
  dbpedia:Chicago Bulls a yago:NationalBasketballAssociationTeams.
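As a rough illustration of this output step, the sketch below serializes column classes and cell links as Turtle-style statements using plain string formatting. The function name `to_turtle`, the namespace URIs, the underscore form of the entity names, and the use of the raw header and cell text as labels are all assumptions; the slide's own literals are not fully preserved.

```python
# Namespace URIs are assumptions based on common DBpedia conventions.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "dbpedia": "http://dbpedia.org/resource/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "yago": "http://dbpedia.org/class/yago/",
}

def to_turtle(column_classes, cell_links):
    """Emit Turtle-style lines for column classes and linked cells.

    column_classes maps a column header to its predicted class.
    cell_links maps a cell string to (linked entity, entity class).
    """
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines.append("")
    for header, cls in column_classes.items():
        lines.append(f'{cls} rdfs:label "{header}" .')
    for cell_text, (entity, cls) in cell_links.items():
        lines.append(f'{entity} rdfs:label "{cell_text}" .')
        lines.append(f"{entity} a {cls} .")
    return "\n".join(lines)

turtle = to_turtle(
    {"Name": "dbpedia-owl:BasketballPlayer",
     "Team": "yago:NationalBasketballAssociationTeams"},
    {"Michael Jordan": ("dbpedia:Michael_Jordan", "dbpedia-owl:BasketballPlayer"),
     "Chicago": ("dbpedia:Chicago_Bulls", "yago:NationalBasketballAssociationTeams")},
)
print(turtle)
```

The statements are written subject-first here, whereas the slide uses the inverse "is rdfs:label of" form; both express the same label relationship.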

Evaluation of the baseline system

Dataset summary
Number of tables           15
Total number of rows       199
Total number of columns    56 (52)
Total number of entities   639 (611)
* The number in brackets is the count excluding columns that contained numbers

Evaluation # 1 (MAP)
Compared the system’s ranked list of labels against a human-ranked list of labels
Metric: Average Precision (a.p.); Mean Average Precision gives the mean over a set of queries
Commonly used in the Information Retrieval domain to compare two ranked sets

Evaluation # 1 (MAP)
MAP =
System ranked: 1. Person 2. Politician 3. President
Evaluator ranked: 1. President 2. Politician 3. OfficeHolder
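For reference, the sketch below computes Average Precision under the common Information Retrieval definition, treating the evaluator's labels as an unordered set of relevant items and dividing by the size of that set. The value shown on the slide is not preserved in this transcript, and the talk may have used a different variant, so the result here is only an illustration of the metric.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k where a relevant label occurs,
    divided by the total number of relevant labels."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked, start=1):
        if label in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(queries):
    """Mean of the average precision over (ranked, relevant) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)

system = ["Person", "Politician", "President"]
evaluator = ["President", "Politician", "OfficeHolder"]

ap = average_precision(system, evaluator)
print(round(ap, 3))  # (1/2 + 2/3) / 3 = 7/18, about 0.389
```

In this example the system's top label, Person, is not in the evaluator's list, so the first hit arrives only at rank 2; that is what pulls the score down.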

Evaluation # 2 (Correctness)
Evaluated whether our predicted class labels were “fair and correct”
A class label may not be the most accurate one, but may still be correct
  E.g. dbpedia:PopulatedPlace is not the most accurate, but still a correct label for a column of cities
Three human judges evaluated our predicted class labels

Evaluation # 2 (Correctness)
Column: Nationality; Prediction: MilitaryConflict
Column: Birth Place; Prediction: PopulatedPlace
Overall Accuracy: %

Accuracy for Entity Linking
Overall Accuracy: %

Lessons Learnt
Sequential system: errors percolated from one phase to the next
Current system favors general classes over specific ones (MAP score = 0.411)
Largely a system driven by “heuristics”
Although we consider evidence, we don’t do assignment jointly
T2LD Framework: Predict Class for Columns → Linking the table cells → Identify and Discover relations

Joint Inference over evidence in a table
Probabilistic Graphical Models
Markov Logic Networks

A graphical model for tables
Figure: a graph with nodes C1, C2, C3 and R11, R12, R13, R21, R22, R23, R31, R32, R33

Parameterized graphical model
Variable nodes: column headers (C1, C2, C3) and row values (R11 to R33)
Factor nodes:
  a function that captures the affinity between the column headers and row values
  a factor that captures interaction between column headers
  a factor that captures interaction between row values
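To make the three kinds of factor concrete, here is a toy sketch that scores every joint assignment for two column headers and one row, then picks the highest-scoring one by exhaustive enumeration. All candidate labels, potential values, and function names are hypothetical, and treating the row-value factor as linking cells in the same row is an assumption. Exhaustive search only works for a table this small; it is not presented as the inference method of the talk.

```python
from itertools import product

# Candidate values for each variable node (all hypothetical).
domains = {
    "C1": ["BasketballPlayer", "Person"],
    "C2": ["NBATeam", "City"],
    "R11": ["Michael_Jordan"],
    "R12": ["Chicago_Bulls", "Chicago"],
}

def header_value(cls, entity):
    """Factor: affinity between a column header class and a row value."""
    table = {
        ("BasketballPlayer", "Michael_Jordan"): 3.0,
        ("Person", "Michael_Jordan"): 2.0,
        ("NBATeam", "Chicago_Bulls"): 3.0,
        ("City", "Chicago"): 3.0,
    }
    return table.get((cls, entity), 0.1)

def header_header(c1, c2):
    """Factor: interaction between the two column headers."""
    return 2.0 if (c1, c2) == ("BasketballPlayer", "NBATeam") else 1.0

def value_value(e1, e2):
    """Factor: interaction between the two row values."""
    return 2.0 if (e1, e2) == ("Michael_Jordan", "Chicago_Bulls") else 1.0

def score(a):
    """Unnormalized score of a joint assignment: product of all factors."""
    return (header_value(a["C1"], a["R11"])
            * header_value(a["C2"], a["R12"])
            * header_header(a["C1"], a["C2"])
            * value_value(a["R11"], a["R12"]))

def best_assignment():
    names = list(domains)
    best = max(product(*(domains[n] for n in names)),
               key=lambda values: score(dict(zip(names, values))))
    return dict(zip(names, best))

best = best_assignment()
print(best, score(best))
```

Because every factor contributes to a single joint score, evidence from the player column can tip the ambiguous "Chicago" cell toward the team, which a sequential pipeline would decide in isolation.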

Challenges - Abbreviations
Other examples: state abbreviations, stock tickers, airport codes, currency codes
Preprocessing: parse and identify such columns
Replace abbreviations with expanded forms
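The preprocessing idea can be sketched as a dictionary lookup applied only when most of a column looks like known abbreviations. The function name, the 0.8 cut-off, and the four-entry lookup table are assumptions chosen for illustration; a real lookup would need a complete list per abbreviation type.

```python
# Small illustrative lookup of US state abbreviations (deliberately incomplete).
STATE_NAMES = {
    "MD": "Maryland",
    "WA": "Washington",
    "MA": "Massachusetts",
    "NC": "North Carolina",
}

def expand_abbreviation_column(cells, lookup, min_fraction=0.8):
    """Expand abbreviations if at least min_fraction of the cells are known.

    Columns that do not look like abbreviation columns are returned unchanged.
    """
    if not cells:
        return list(cells)
    known = sum(1 for cell in cells if cell.strip().upper() in lookup)
    if known / len(cells) < min_fraction:
        return list(cells)
    return [lookup.get(cell.strip().upper(), cell) for cell in cells]

expanded = expand_abbreviation_column(["MD", "WA", "MA", "NC"], STATE_NAMES)
unchanged = expand_abbreviation_column(["Chicago", "MD"], STATE_NAMES)
print(expanded)
print(unchanged)
```

Checking the whole column before expanding avoids rewriting an isolated cell that merely happens to match an abbreviation.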

Challenges - Literals
Example literal columns: Population, Age

Conclusion
Presented a framework for inferring the semantics of tables and generating Linked Data
Evaluation of the baseline system shows that tackling the problem is feasible
Work is in progress on a framework grounded in graphical models and probabilistic reasoning
Working on the challenges posed by tables from domains such as medicine and open government data

References
1. Cafarella, M. J.; Halevy, A. Y.; Wang, Z. D.; Wu, E.; and Zhang, Y. 2008. WebTables: exploring the power of tables on the web. PVLDB 1(1):538–549.
2. Hurst, M. 2006. Towards a theory of tables. IJDAR 8(2-3).
3. Embley, D. W.; Lopresti, D. P.; and Nagy, G. 2006. Notes on contemporary table recognition. In Document Analysis Systems.
4. Wang, J.; Shao, B.; Wang, H.; and Zhu, K. Q. 2011. Understanding tables on the web. Technical report, Microsoft Research Asia.
5. Venetis, P.; Halevy, A.; Madhavan, J.; Pasca, M.; Shen, W.; Wu, F.; Miao, G.; and Wu, C. 2011. Recovering semantics of tables on the web. In Proc. of the 37th Int'l Conference on Very Large Databases (VLDB).
6. Limaye, G.; Sarawagi, S.; and Chakrabarti, S. 2010. Annotating and searching web tables using entities, types and relationships. In Proc. of the 36th Int'l Conference on Very Large Databases (VLDB).

Thank You! Questions?
Web: