Tables to Linked Data Zareen Syed, Tim Finin, Varish Mulwad and Anupam Joshi University of Maryland, Baltimore County



Age of Big Data
- Availability of massive amounts of data is driving many technical advances
- Extracting linked data from text and tables will help
- Databases and spreadsheets are obvious sources for tables, but many are in documents and web pages, too
- A recent Google study found over 14B HTML tables (M. Cafarella, A. Halevy, D. Wang, E. Wu, Y. Zhang, WebTables: Exploring the Power of Tables on the Web, VLDB)
- Only about 1.1% had high-quality relational data
- But that’s about 150M tables!

Problem: given a table

Generate linked data

    dbp:Boston dbpo:PopulatedPlace/leaderName dbp:Thomas_Menino ;
        cyc:partOf dbp:Massachusetts ;
        dbpo:populationTotal "610000"^^xsd:integer .
    dbp:New_York_City …

- Use classes, properties and instances from a linked data collection, e.g. DBpedia + Cyc + Geonames
- Confirm existing facts and discover new ones
- Create new entities as needed
- Create new relations when possible (harder)
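The output shown on this slide can be sketched in a few lines of Python. This is only an illustration of the target format, not the authors' generator: the helper name `row_to_turtle` is hypothetical, the prefixes (`dbp:`, `dbpo:`, `cyc:`, `xsd:`) are copied from the slide and left undeclared, and the result is Turtle-style text that has not been checked against a Turtle parser.

```python
def row_to_turtle(subject, predicate_objects):
    """Serialize one subject and its (predicate, object) pairs as a
    Turtle-style statement block. Hypothetical helper for illustration."""
    pairs = [f"{predicate} {obj}" for predicate, obj in predicate_objects]
    # Pairs that share a subject are separated by ';' and the block ends with '.'
    body = " ;\n    ".join(pairs)
    return f"{subject} {body} ."


# Values copied from the slide's Boston example.
boston = row_to_turtle(
    "dbp:Boston",
    [
        ("dbpo:PopulatedPlace/leaderName", "dbp:Thomas_Menino"),
        ("cyc:partOf", "dbp:Massachusetts"),
        ("dbpo:populationTotal", '"610000"^^xsd:integer'),
    ],
)
print(boston)
```

A production system would more likely build the triples with an RDF library and let it handle prefix declarations and escaping; plain string formatting is used here only to keep the sketch self-contained.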

What data do we want?
- Link cell values to entities, e.g. dbpo:Baltimore and dbpo:Maryland
- Find relationships between columns, e.g. dbpo:largestCity

What do we want to extract, and what evidence can we find?
- Column one’s type is populated place, or is it US city, or a reference to an NBA team?
- Column two’s type is person (or politician?), but is ‘mayor’ a type or a relation and, if the latter, to what?
- Rows give important evidence too: Menino has a stronger connection to Boston than to Massachusetts
- Both cities and states have populations, …

A Web of Evidence
- Table: column headers, cell values, column position, column adjacency
- Language: headers have meaning, synonyms, …
- Ontologies: capitalOf is a 1:1 relation between a GPE (geo-political entity) region and a city
- Significance: PageRank-like metrics bias linking
- Facts: the LD KB asserts that Boston is in MA and that Boston’s population is close to 610K
- Graph analysis: PMI between Menino and Boston is much higher than between Menino and Massachusetts
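The graph-analysis bullet relies on pointwise mutual information (PMI). A minimal sketch of the standard definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from co-occurrence counts, is given below. The counts are invented purely for illustration and do not come from the talk; how the authors actually count co-occurrences in the knowledge base is not stated on the slide.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information in bits, estimated from raw counts.

    count_xy: number of contexts mentioning both x and y
    count_x, count_y: number of contexts mentioning x (resp. y)
    total: total number of contexts
    """
    if min(count_xy, count_x, count_y) <= 0:
        return float("-inf")  # no co-occurrence evidence at all
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: Menino co-occurs with Boston more often, relative to
# how common each term is, than with Massachusetts.
TOTAL = 1_000_000
menino_boston = pmi(count_xy=400, count_x=500, count_y=20_000, total=TOTAL)
menino_mass = pmi(count_xy=300, count_x=500, count_y=60_000, total=TOTAL)
print(menino_boston > menino_mass)
```

With these made-up counts the Boston pair scores higher, which is the kind of signal the slide uses to prefer the city over the state as the entity related to the mayor.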

Approach
Input: table headers and rows
1. Query the knowledge base
2. Predict a class for each column
3. Re-query the knowledge base using the new evidence
4. Link each cell value to an entity using the new results obtained
5. Identify relationships between columns
Output: linked data

Wikitology
- A hybrid KB of structured and unstructured information extracted from Wikipedia
- Augmented with knowledge from DBpedia, Freebase, YAGO and WordNet
- The interface is via a specialized IR index
- Good for systems that need to combine reasoning over text, graphs and semi-structured data

Querying the Knowledge Base
- For every cell from the table, the query is: Cell Value + Column Header + Row Content
- Wikitology returns: Top N entities, their types, PageRank (we use N = 5)
- Example query: Baltimore + City + MD + S.Dixon + 640,000
- Example results: 1. Baltimore_Maryland 2. Baltimore_County 3. John_Baltimore
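The query construction described above can be sketched as follows. The function names and the table layout are hypothetical, and the column headers other than City are assumed for the example. The sketch only assembles the query parts; it does not call Wikitology, whose actual interface is not shown on the slide.

```python
def build_query(table, row_idx, col_idx):
    """Collect the three pieces of evidence the slide lists for one cell:
    the cell value, its column header, and the rest of the row."""
    row = table["rows"][row_idx]
    return {
        "cell": row[col_idx],
        "header": table["headers"][col_idx],
        "row_context": [v for i, v in enumerate(row) if i != col_idx],
    }


def to_query_string(query):
    """Render the query in the 'a + b + c' notation used on the slide."""
    return " + ".join([query["cell"], query["header"], *query["row_context"]])


# Row values taken from the slide's example; header names are assumptions.
table = {
    "headers": ["City", "State", "Mayor", "Population"],
    "rows": [["Baltimore", "MD", "S.Dixon", "640,000"]],
}
query = build_query(table, row_idx=0, col_idx=0)
print(to_query_string(query))
```

Keeping the parts separate in a dictionary, rather than concatenating them immediately, leaves room to weight the cell value, header and row context differently when the query is sent to an index.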

Predicting Classes for Columns
- Collect the set of classes per column, e.g. dbpedia-owl:Place, dbpedia-owl:Area, yago:AmericanConductors, yago:LivingPeople, dbpedia-owl:PopulatedPlace, dbpedia-owl:Band, dbpedia-owl:Organisation, ...
- Score the classes: Score = w × (1 / R) + (1 − w) × PageRank, where R is the entity’s rank; e.g. [Baltimore, dbpedia:Area] = 0.89
- Select the class that maximizes its sum of scores over the entire column, e.g. [Baltimore, dbpedia:Area] + [Boston, dbpedia:Area] + [New York, dbpedia:Area] = 2.85
- Choose the top class from each of the four vocabularies: DBpedia, Freebase, WordNet and YAGO
- Example for Column: City: Dbpedia:PopulatedPlace, Wordnet:City, Freebase:Location, Yago:CitiesinUnitedStates
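A minimal sketch of this scoring scheme follows. Several details are assumptions because the slide does not specify them: the weight `w = 0.25`, PageRank values normalized to the range 0 to 1, taking the best-scoring entity when several candidates for one cell share a class, and identifying a vocabulary by the prefix before the colon. The candidate data is invented for illustration.

```python
from collections import defaultdict


def entity_class_score(rank, page_rank, w=0.25):
    """Score = w * (1 / R) + (1 - w) * PageRank, with R the entity's rank."""
    return w * (1.0 / rank) + (1.0 - w) * page_rank


def column_class_totals(cell_results, w=0.25):
    """Sum, over all cells in a column, each class's score for that cell.

    cell_results: one list per cell of (rank, page_rank, classes) tuples.
    Assumption: a class's score for a cell is the best score among the
    candidate entities that carry that class.
    """
    totals = defaultdict(float)
    for candidates in cell_results:
        best_for_cell = {}
        for rank, page_rank, classes in candidates:
            score = entity_class_score(rank, page_rank, w)
            for cls in classes:
                best_for_cell[cls] = max(best_for_cell.get(cls, 0.0), score)
        for cls, score in best_for_cell.items():
            totals[cls] += score
    return dict(totals)


def top_class_per_vocabulary(totals):
    """Pick the highest-scoring class within each vocabulary prefix."""
    best = {}
    for cls, score in totals.items():
        vocab = cls.split(":", 1)[0]
        if vocab not in best or score > totals[best[vocab]]:
            best[vocab] = cls
    return best


# Hypothetical candidates for a two-cell column.
cell_results = [
    [(1, 0.9, ["dbpedia-owl:PopulatedPlace", "yago:CitiesinUnitedStates"]),
     (2, 0.4, ["dbpedia-owl:Band"])],
    [(1, 0.8, ["dbpedia-owl:PopulatedPlace"]),
     (2, 0.3, ["yago:LivingPeople"])],
]
totals = column_class_totals(cell_results)
predicted = top_class_per_vocabulary(totals)
print(predicted)
```

Summing over the column is what lets a class that is only moderately ranked for each individual cell win overall when it is consistently present, which is the behavior the slide's 2.85 example illustrates.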

Linking table cells to entities
- Once the classes are predicted, we re-query the knowledge base with this new evidence
- Along with the original query, we also include the predicted types
- For every cell from the table, the query is: Cell Value + Column Header + Row Content + Predicted Column Type
- The KB returns: Top N entities and their types (we use N = 5)
- We pick the highest-ranking entity that matches the predicted type from the new results
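The selection step can be sketched as a simple filter over the ranked results. The function name, the candidate ordering and the type assignments are hypothetical. Two behaviors are assumptions not stated on the slide: an entity counts as a match if it carries any one of the predicted types, and the cell is left unlinked (`None`) when no candidate matches.

```python
def link_cell(candidates, predicted_types):
    """Return the highest-ranked candidate whose types overlap the
    predicted column types, or None if there is no such candidate.

    candidates: list of (entity, types) pairs, best-ranked first.
    """
    wanted = set(predicted_types)
    for entity, types in candidates:
        if wanted & set(types):
            return entity
    return None


# Hypothetical re-query results for the cell "Baltimore".
candidates = [
    ("John_Baltimore", ["dbpedia-owl:Person"]),
    ("Baltimore_Maryland", ["dbpedia-owl:PopulatedPlace", "yago:CitiesinUnitedStates"]),
    ("Baltimore_County", ["dbpedia-owl:PopulatedPlace"]),
]
linked = link_cell(candidates, ["dbpedia-owl:PopulatedPlace"])
print(linked)
```

In this made-up ranking the person entity comes first, and the type filter is what moves the link to the city, showing why the predicted column class is fed back into the linking step.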

Preliminary results: entity linking
- In a preliminary evaluation, we used 5 Google Squared tables comprising 23 columns and 39 rows, comparing our results with human judgments
- The next evaluation will be on selected tables from the Google collection of >2500, involving 6 domains: bibliography, car, course, country, movie, people

Classes used                              Accuracy
Class prediction for columns: DBpedia     85.7%
Class prediction for columns: Freebase    90.5%
Class prediction for columns: WordNet     71.4%
Class prediction for columns: YAGO        71.4%
Entity linking                            76.6%

Ongoing and Future Work
- Identifying relationships between columns
- Modules for common ‘special cases’, e.g. numbers, acronyms, phone numbers, stock symbols, addresses, URLs, etc.
- Replace heuristics by machine learning techniques for combining evidence and clustering

Conclusion
- There’s lots of data stored in tables: in spreadsheets, databases, Web pages and documents
- In some cases we can interpret them and generate a linked data representation
- In others we can at least link some cell values to LOD entities
- This can help contribute data to the Web in a form that is easy for machines to understand and use