SWIMs: From Structured Summaries to Integrated Knowledge Base


SWIMs: From Structured Summaries to Integrated Knowledge Base
ScAi Lab, CSD, UCLA
May 2014

Immense Knowledge From Web
Wikipedia: 280+ languages, 20M+ articles, 1B+ edits
DBpedia: 110+ languages, 2B+ facts, 4M+ subjects (en)
…

Semantic Applications Explored at UCLA
Semantic Search
By-Example Query supported by SWiPE
Multilingual Knowledge Base
Knowledge Maintenance
Essay Grading
Text Summarization
Reviewing summarization in systems such as Yelp and Amazon

But there are several challenges
Various sources of knowledge; inconsistency in format/content; inaccurate
[Figure panels: DBpedia Facts, Wikipedia Text, Wikipedia InfoBox]
Wikipedia Text: Rachel Anne McAdams (born November 17, 1978) is a Canadian actress. … She was hailed by the media as Hollywood's new "it girl" and received a BAFTA nomination for Best Rising Star …

Not easy to search
Keyword search: easy, but inaccurate
SPARQL: hard for users; requires knowledge of the terminology

Main Challenges
A lot of knowledge is available on the web, but it is not usable!
Current knowledge bases suffer from:
Limited coverage
Inconsistency in terminology
Hard to maintain
Expensive search

SWIMs: Large Scale Knowledge Integration
Semantic Web Information Management system
Collaborative project at the UCLA ScAi Lab
Our objective: a better knowledge base, built by performing the following tasks:
A. Integrate existing knowledge bases
B. Resolve inconsistencies
C. Provide a user-friendly interface for knowledge browsing and editing
D. Support query-by-example search over our KB

Integrate Existing Knowledge Bases
Collecting public KBs, unifying the knowledge representation format, and integrating the KBs into the IKBStore, represented in RDF format <Subject, Attribute, Value>
[Figure labels: IKBStore, NELL]
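As a rough illustration of what unifying the representation format can mean, the sketch below flattens attribute/value records into <Subject, Attribute, Value> triples and concatenates two sources. It is not the SWIMs implementation: the names `Triple`, `records_to_triples`, and `integrate`, and the record layouts of the two toy sources, are assumptions made for this example.

```python
from typing import Dict, Iterable, List, NamedTuple


class Triple(NamedTuple):
    subject: str
    attribute: str
    value: str


def records_to_triples(records: Iterable[Dict[str, str]], subject_key: str) -> List[Triple]:
    """Flatten attribute/value records into <Subject, Attribute, Value> triples."""
    triples = []
    for record in records:
        subject = record[subject_key]
        for attribute, value in record.items():
            if attribute != subject_key:
                triples.append(Triple(subject, attribute, str(value)))
    return triples


def integrate(*sources: List[Triple]) -> List[Triple]:
    """Concatenate sources, dropping exact duplicates and keeping first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for triple in source:
            if triple not in seen:
                seen.add(triple)
                merged.append(triple)
    return merged


# Two hypothetical sources that name the subject field differently.
kb_a = records_to_triples([{"name": "Rachel", "birthdate": "1978-11-17"}], "name")
kb_b = records_to_triples([{"title": "Rachel", "born": "1978-11-17"}], "title")
store = integrate(kb_a, kb_b)
```

Here `store` still holds two triples, because `birthdate` and `born` are different attribute names; reconciling them is the alignment problem.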

Knowledge Alignment
Align Subject: (i) DBpedia interlinks; (ii) links in Wikipedia (e.g. redirect and sameAs); (iii) synonyms from WordNet and OntoMiner.
Align Attribute: employing CS3 (Context Aware Synonym Suggestion System) to discover attribute synonyms.
  DBpedia: <Rachel, birthdate, 1978-11-17>
  Wikipedia: <Rachel, born, 1978-11-17>
  Attribute synonym from CS3: birthdate <==> born
  Combined: <Rachel, birthdate (born), 1978-11-17>
Align Category: (i) name matching; (ii) category similarity based on the number of shared subjects.
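The attribute-alignment step can be sketched as follows, assuming the synonyms have already been discovered. The dictionary `ATTRIBUTE_SYNONYMS` is a hand-written stand-in for CS3 output, and `align_attributes` is a hypothetical helper, not part of CS3.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical synonym table standing in for CS3 output: variant -> canonical name.
ATTRIBUTE_SYNONYMS: Dict[str, str] = {"born": "birthdate"}


def align_attributes(
    triples: List[Triple], synonyms: Dict[str, str]
) -> Dict[Triple, List[str]]:
    """Group triples that agree once attribute names are mapped to a canonical form.

    Returns {(subject, canonical_attribute, value): [original attribute names]}.
    """
    combined: Dict[Triple, List[str]] = {}
    for subject, attribute, value in triples:
        canonical = synonyms.get(attribute, attribute)
        names = combined.setdefault((subject, canonical, value), [])
        if attribute not in names:
            names.append(attribute)
    return combined


triples = [
    ("Rachel", "birthdate", "1978-11-17"),  # from DBpedia
    ("Rachel", "born", "1978-11-17"),       # from Wikipedia
]
combined = align_attributes(triples, ATTRIBUTE_SYNONYMS)
```

The two source triples collapse into a single entry keyed by `birthdate`, with `born` kept as an alternative name, which mirrors the combined triple <Rachel, birthdate (born), 1978-11-17>.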

Initial Integrated Knowledge Base

KB Name        No. of Subjects (10^6)    No. of InfoBoxes (10^6)
DBpedia        2.5                       63.82
Yago2          4.14                      23.21
MusicBrainz    0.69                      1.71
NELL           0.35                      0.4
GeoNames       5.31                      29.2
WikiData       1.44                      2.54
IKBStore       9.18                      105.4

*To improve the performance of online browsing (IBKB), we skip some rarely used facts in domain-specific KBs (e.g. MusicBrainz).

Further Integration: Learn From Text
To convert textual documents to knowledge, we employ our newly proposed text mining system IBMiner, which can generate structured summaries from free text.
Pipeline: Free Text -> (NLP) -> Semantic Links -> (Attribute Mapping) -> InfoBox Triples -> Integrated to IKBStore
Free text: Rachel is a Canadian actress.
Semantic links: <Rachel, is, actress>, <Rachel, is, Canadian>
InfoBox triples: <Rachel, occupation, actress>, <Rachel, nationality, Canadian>
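The slide does not say how IBMiner maps semantic links to attributes, so the following is only a toy sketch of the attribute-mapping stage. It uses a hand-written lookup in which the category of the value decides the attribute name. The tables `VALUE_CATEGORY` and `CATEGORY_TO_ATTRIBUTE` and the function `map_attribute` are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical lookups: the category of the value decides the attribute name.
VALUE_CATEGORY: Dict[str, str] = {
    "actress": "occupation_term",
    "Canadian": "nationality_term",
}
CATEGORY_TO_ATTRIBUTE: Dict[str, str] = {
    "occupation_term": "occupation",
    "nationality_term": "nationality",
}


def map_attribute(link: Triple) -> Optional[Triple]:
    """Turn a generic <subject, is, value> link into an InfoBox triple, if possible."""
    subject, verb, value = link
    if verb != "is":
        return None  # this toy mapper only handles "is" links
    category = VALUE_CATEGORY.get(value)
    attribute = CATEGORY_TO_ATTRIBUTE.get(category) if category else None
    if attribute is None:
        return None  # unknown value: leave the link unmapped
    return (subject, attribute, value)


semantic_links: List[Triple] = [("Rachel", "is", "actress"), ("Rachel", "is", "Canadian")]
mapped = [map_attribute(link) for link in semantic_links]
infobox_triples = [triple for triple in mapped if triple is not None]
```

A real system would have to learn such mappings from existing InfoBoxes or other evidence rather than rely on a fixed table; the table only makes the input and output of the stage concrete.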

Knowledge Browsing and Revising
IBMiner and other tools are automatic and scalable, even when NLP is required.
But human intervention is still required to validate and/or improve the results in terms of correctness, significance, and relevance.
Tools for knowledge browsing and revising (VLDB’13):
InfoBox Knowledge-Base Browser (IBKB)
InfoBox Editor (IBE)

InfoBox Knowledge-Base Browser
[Screenshot callouts: Feedback, Ranking, Search, Provenance, Synonyms]

InfoBox Editor
The UI is similar to that of IBKB.
IBE allows users to add more textual information and to extract InfoBoxes from the input text by using IBMiner.
IBE also suggests candidate category and attribute names for the generated InfoBoxes, which makes knowledge editing much easier.
With the help of IBE, the generated summaries follow a standard terminology.

Provenance of Knowledge
We annotate each piece of knowledge with provenance IDs and propagate the annotations during semantic integration.

Before removing duplicates:
Triple                    Prov
Rachel, born, 1978        p1
Rachel, born, 1978        p2
Rachel, gender, female    p3

After removing duplicates (projection π_Triple):
Triple'                   Prov
Rachel, born, 1978        p1 + p2
Rachel, gender, female    p3

p1, p2, p3: provenance IDs
p1 + p2: provenance polynomial, which encodes how the result is generated (we use + to represent projection and · to represent join)
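The duplicate-removal step on this slide can be sketched as below: triples that coincide are collapsed into one, and their provenance IDs are joined with +. Representing the polynomial as a plain string and the function name `remove_duplicates` are simplifications assumed for this example.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def remove_duplicates(annotated: List[Tuple[Triple, str]]) -> Dict[Triple, str]:
    """Project onto the triple and combine the provenance IDs of duplicates with '+'."""
    grouped: Dict[Triple, List[str]] = {}
    for triple, prov_id in annotated:
        grouped.setdefault(triple, []).append(prov_id)
    return {triple: " + ".join(ids) for triple, ids in grouped.items()}


annotated = [
    (("Rachel", "born", "1978"), "p1"),
    (("Rachel", "born", "1978"), "p2"),
    (("Rachel", "gender", "female"), "p3"),
]
deduplicated = remove_duplicates(annotated)
```

The result maps <Rachel, born, 1978> to `p1 + p2` and <Rachel, gender, female> to `p3`, matching the second table.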

Provenance of Knowledge
We can use a provenance polynomial to compute any type of provenance by
replacing provenance IDs with different annotations, and
replacing + and · with different operators.

Provenance     p1           p2         +
Lineage        {DBpedia}    {Yago2}    ∪ (union)
Reliability    0.8          0.6        max

Thus, for triple <Rachel, born, 1978> with provenance polynomial (p1 + p2), we can compute its provenance as follows:
Lineage: {DBpedia} ∪ {Yago2} = {DBpedia, Yago2}
Reliability: max(0.8, 0.6) = 0.8
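The substitution described on this slide can be sketched as a small evaluator over a polynomial tree. The slide specifies only the operator for +. The operators chosen below for · (union for lineage, min for reliability) are assumptions of this example, as are the nested-tuple encoding and the name `evaluate`.

```python
from typing import Any, Callable, Dict, Tuple, Union

# A polynomial is either a provenance ID or a tuple (operator, left, right).
Poly = Union[str, Tuple[str, Any, Any]]


def evaluate(
    poly: Poly,
    annotations: Dict[str, Any],
    plus: Callable[[Any, Any], Any],
    times: Callable[[Any, Any], Any],
) -> Any:
    """Replace IDs with annotations and +, · with the given operators."""
    if isinstance(poly, str):
        return annotations[poly]
    operator, left, right = poly
    left_value = evaluate(left, annotations, plus, times)
    right_value = evaluate(right, annotations, plus, times)
    if operator == "+":
        return plus(left_value, right_value)
    if operator == "*":  # "*" stands for the join operator ·
        return times(left_value, right_value)
    raise ValueError(f"unknown operator: {operator}")


polynomial: Poly = ("+", "p1", "p2")  # provenance of <Rachel, born, 1978>

lineage = evaluate(
    polynomial,
    {"p1": frozenset({"DBpedia"}), "p2": frozenset({"Yago2"})},
    plus=lambda a, b: a | b,
    times=lambda a, b: a | b,  # assumed: a join also unions the sources
)
reliability = evaluate(
    polynomial,
    {"p1": 0.8, "p2": 0.6},
    plus=max,
    times=min,  # assumed: a join is only as reliable as its weakest input
)
```

With these inputs, `lineage` is {DBpedia, Yago2} and `reliability` is 0.8, as computed on the slide. The same polynomial is reused for both; only the annotations and operators change.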

Semantic Search
Example: A law school with more than 120 faculty members and established before 1900?

Semantic Search
The power of the knowledge base via SPARQL engines is only available to those who can write SPARQL queries.
Solution: Query-By-Example. The InfoBox of a representative page is used as the input query, with its values edited into conditions.
Example: Cities in CA with > 10000 population?
Representative InfoBox (Los Angeles): State: CA; Population: 3,904,657; Time Zone: PST
Query InfoBox: State: CA; Population: > 10000; Time Zone: PST
Results: Anaheim, Bakersfield, Berkeley, San Diego, San Francisco, San Jose, …
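To make the query-by-example idea concrete, the sketch below treats an edited InfoBox as a set of conditions and returns the subjects whose InfoBoxes satisfy all of them. It is not how SWiPE is implemented (in this example the matching is a plain in-memory scan). The function names are hypothetical, and the two "Hypothetical" entries are made-up test data; only the Los Angeles values come from the slide.

```python
from typing import Callable, Dict, List, Optional, Union

Value = Union[str, int]
Condition = Union[Value, Callable[[Value], bool]]


def matches(value: Optional[Value], condition: Condition) -> bool:
    """A condition is either an exact value or a predicate such as '> 10000'."""
    if value is None:
        return False
    if callable(condition):
        return bool(condition(value))
    return value == condition


def query_by_example(
    example: Dict[str, Condition], infoboxes: Dict[str, Dict[str, Value]]
) -> List[str]:
    """Return the subjects whose InfoBox satisfies every condition in the example."""
    return [
        subject
        for subject, box in infoboxes.items()
        if all(matches(box.get(attr), cond) for attr, cond in example.items())
    ]


infoboxes: Dict[str, Dict[str, Value]] = {
    "Los Angeles": {"State": "CA", "Population": 3904657, "Time Zone": "PST"},
    # Made-up entries, for illustration only:
    "Hypothetical Town A": {"State": "CA", "Population": 5000, "Time Zone": "PST"},
    "Hypothetical City B": {"State": "NY", "Population": 50000, "Time Zone": "EST"},
}

# The Los Angeles InfoBox with the population value replaced by a condition.
example = {"State": "CA", "Population": lambda p: p > 10000, "Time Zone": "PST"}
cities = query_by_example(example, infoboxes)
```

The user never writes SPARQL here: the attribute names come from the representative page, so the terminology problem of hand-written queries is avoided.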

Multilingual Semantic Search
WikiData: a free collaborative knowledge base that links multilingual wiki pages and unifies their InfoBoxes.
Unfortunately, it is very difficult for users to query these rich multilingual databases, since this requires knowledge of SPARQL and of the internal WikiData names for attributes.
Solution: Combine SWiPE with WikiData
Example: Cities in Sardinia with > 10000 population?
Representative InfoBox (Rome): Region: Lazio; Population: 2,645,907; Time Zone: CET
Query InfoBox: Region: Sardinia; Population: > 10000; Time Zone: CET
Query InfoBox after mapping through WikiData (Roma): Regione: Sardegna; popolazione: > 10000; Fuso Orario: CET
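One way to picture the role of WikiData here is as a table of language-neutral attribute identifiers, each with a label per language. The sketch below rewrites the English query InfoBox into its Italian form through such a table. The identifiers (`attr_region` and so on), the tables, and `translate_query` are placeholders invented for this example; they are not real WikiData property names or SWiPE functions.

```python
from typing import Dict

# Placeholder language-neutral attribute IDs with per-language labels.
ATTRIBUTE_LABELS: Dict[str, Dict[str, str]] = {
    "attr_region": {"en": "Region", "it": "Regione"},
    "attr_population": {"en": "Population", "it": "popolazione"},
    "attr_timezone": {"en": "Time Zone", "it": "Fuso Orario"},
}
# Placeholder per-language labels for values that are themselves entities.
VALUE_LABELS: Dict[str, Dict[str, str]] = {"Sardinia": {"it": "Sardegna"}}


def translate_query(query: Dict[str, str], source: str, target: str) -> Dict[str, str]:
    """Rewrite attribute labels (and known values) from one language to another."""
    label_to_id = {labels[source]: attr_id for attr_id, labels in ATTRIBUTE_LABELS.items()}
    translated = {}
    for label, value in query.items():
        attr_id = label_to_id[label]
        target_label = ATTRIBUTE_LABELS[attr_id][target]
        # Values without a known translation (conditions, codes) pass through unchanged.
        translated[target_label] = VALUE_LABELS.get(value, {}).get(target, value)
    return translated


query_en = {"Region": "Sardinia", "Population": "> 10000", "Time Zone": "CET"}
query_it = translate_query(query_en, "en", "it")
```

The output reproduces the Italian query InfoBox from the slide: Regione: Sardegna; popolazione: > 10000; Fuso Orario: CET.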

Domain-Specific KB Management
Help expert users in advanced applications focused on more specific domains.
For instance, consider a medical center where information is usually available in many different formats: plain text, forms, images, tables, structured information.
Challenge: complexity and heterogeneity of data
What we can do:
IBMiner: extract structured information from free text
OnMiner: identify important terms in free text
IKBStore: enrich the medical knowledge base
SWiPE: support precise structured search over medical data

Conclusion
We propose SWIMs, an integrated set of systems and tools, to merge existing knowledge bases into a more complete and consistent knowledge base.
Ongoing work:
IBMiner for Large Text Corpora
By-Example Structured Query (BEStQ)
Multilingual Extension based on WikiData

More Details about IBMiner and Text Mining Techniques
Harvesting Wikipedia and Large Text Corpora