NaLIX: Natural Language Interface for Querying XML. Huahai Yang, Department of Information Studies. Joint work with Yunyao Li and H.V. Jagadish at University of Michigan.


NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University of Michigan

Outline
- Motivation
- Basic ideas
- System design
- Live demonstration of NaLIX
- User study
- Conclusion
- Q & A

XML is Everywhere
- Extensible Markup Language (XML) is used for data exchange and storage
- Almost every application domain is moving towards XML-based document formats
  - Digital libraries
  - Office documents
  - GIS
  - ...

How to Query XML?
We can of course use keyword search. Can we do better? After all, XML has structure.

Example XML Fragment
Book 1: One Fish Two Fish; John Meyer, Peter Smith; 7.95
Book 2: Goodnight Moon; Margaret Brown; 10.55

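Example Query
Query: Find titles and prices of books by 'Meyer'
[Figure: XML tree rooted at bookinfo with two book elements. One has title "Just Lost", authors "Mercy Meyer" and "Gina Meyer", and price $5.75; the other shows a title, the labels "Brown" and "Hedi", and price $13.95.]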

More Powerful Queries
- Aggregation: What are the books with more than 5 authors?
- Value join: Find all the books where an author of each book is the same as an author of Pride and Prejudice.
- Nesting: Find all the articles published in 2000 and with keyword "XML".

Standard Approach - XQuery
XQuery is a structured query language that searches for specific information within XML documents, primarily by means of path expressions.

Standard Approach - XQuery
Query: Find titles and prices of books by 'Meyer'
    for $b in doc("bib.xml")//book
    where $b/author[contains(., "Meyer")]
    return ($b/title, $b/price)
[Figure: XML tree rooted at bookinfo; one book has title, author, and price children, the other has only title and price.]

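Another Document Structure
The same XQuery no longer works.
[Figure: XML tree rooted at bookinfo in which books are grouped under author elements. Each author has a name (Dr. Meyer, M. Brown) and book children; each book has a title (Goodnight Moon, One Fish Two Fish, Cat in the Hat) and, where shown, a price ($12.50, $14.95).]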
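To make the point concrete, here is a minimal Python sketch of the same problem, using the standard library's `xml.etree.ElementTree` rather than an XQuery engine. The two sample documents, the pairing of books with authors, and the helper name `titles_by_author` are assumptions made for illustration; they are not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Layout A (hypothetical data): authors are children of each book.
BOOK_CENTRIC = """
<bookinfo>
  <book><title>One Fish Two Fish</title><author>John Meyer</author><price>7.95</price></book>
  <book><title>Goodnight Moon</title><author>Margaret Brown</author><price>10.55</price></book>
</bookinfo>
"""

# Layout B (hypothetical data): books are children of their author.
AUTHOR_CENTRIC = """
<bookinfo>
  <author><name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>12.50</price></book>
  </author>
  <author><name>M. Brown</name>
    <book><title>Goodnight Moon</title><price>14.95</price></book>
  </author>
</bookinfo>
"""

def titles_by_author(xml_text, needle):
    """Mimic the path-based query: for each book, test its author children."""
    root = ET.fromstring(xml_text)
    hits = []
    for book in root.iter("book"):
        # findall("author") only looks at direct children of the book element.
        if any(needle in (a.text or "") for a in book.findall("author")):
            hits.append(book.findtext("title"))
    return hits

print(titles_by_author(BOOK_CENTRIC, "Meyer"))    # the query matches layout A
print(titles_by_author(AUTHOR_CENTRIC, "Meyer"))  # the same query finds nothing in layout B
```

The query hard-codes the assumption that `author` sits below `book`, so it silently returns nothing once the nesting is inverted.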

Standard Approach - XQuery
XQuery is powerful, but ... the user may NOT KNOW the structure of bib.xml!

Solution 1 – Study Schema
Ask the user to study the structure of bib.xml and write a query in XQuery.

Solution 1 – Study Schema
More problems:
- Data evolution: the document structure may change over time
- Heterogeneity: multiple XML documents may have similar content but different structures
- Plainly unrealistic from a usability point of view

Solution 2 – Keyword Search
Keyword search (the IR approach):
- Discard all tags: "Mary"
- Treat tags as keywords: "book article year title author Mary"

Solution 2 – Keyword Search
Query: What are the titles and years of the publications of which Mary is an author? ===> "year title author Mary"
[Figure: bibliography tree with numbered nodes. bibliography (1) contains bib (2) and bib (11). bib (2) contains year (3), book (4) with title (5) and author (6), and article (7) with title (8), author (9), and author (10). bib (11) contains year (12), book (13) with title (14) and author (15), and article (16) with title (17) and author (18). Leaf values shown: XML, Bob, HTML, Mary, Database, Codd, C++, John, Joe.]

Solution 2 – Keyword Search
Pro:
- Requires no knowledge of the document structure
Con:
- Does not take advantage of the structure
- Cannot express complex query semantics
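The "treat tags as keywords" variant and its weakness can be sketched in a few lines of Python. This is only an illustration under assumptions: the document below follows the node layout of the bibliography figure, but the year values, the assignment of leaf values to nodes, and the helpers `terms` and `smallest_matches` are hypothetical, and real XML keyword-search systems use more refined matching.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the bibliography figure.
DOC = """<bibliography>
  <bib><year>1999</year>
    <book><title>XML</title><author>Bob</author></book>
    <article><title>HTML</title><author>Mary</author><author>Joe</author></article>
  </bib>
  <bib><year>2000</year>
    <book><title>Database</title><author>Codd</author></book>
    <article><title>C++</title><author>John</author></article>
  </bib>
</bibliography>"""

root = ET.fromstring(DOC)

def terms(elem):
    """Bag of searchable terms in a subtree: tag names plus text tokens."""
    bag = set()
    for node in elem.iter():
        bag.add(node.tag)
        if node.text and node.text.strip():
            bag.update(node.text.split())
    return bag

def smallest_matches(tree_root, keywords):
    """Subtrees containing every keyword, with no smaller matching subtree inside."""
    wanted = set(keywords)
    hits = [e for e in tree_root.iter() if wanted <= terms(e)]
    return [e for e in hits
            if not any(d is not e and wanted <= terms(d) for d in e.iter())]

for hit in smallest_matches(root, ["year", "title", "author", "Mary"]):
    print(hit.tag, [t.text for t in hit.iter("title")])
```

The smallest subtree containing all four keywords is the first `bib` element, so the answer includes the book "XML" even though Mary did not write it: the keywords match, but the structure that ties an author to a particular title is ignored.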

Our Approach
- Take advantage of whatever partial knowledge the user may have of the document schema
- Support a wide range of queries, from regular XQuery to keyword search
- Require minimum effort from the user: a natural language query is desirable

Basic Ideas
- Map the parsed natural language query into XQuery.
  - Proximity in the natural language parse tree should correspond to proximity in the matched XML tree: XML is human readable and human created!
- Interactive query formulation helps users pose queries the system can understand.

What are the Challenges?
- Challenge 1: Automatically understand the user intent behind an arbitrary natural language question.
- Challenge 2: How to map user intent to the XML schema?
  - Should I use "author" or "writer"?
  - Is "Gone with the Wind" a book or a movie?
  - Are books grouped by year or by author in the bibliography?

Our focus is on the second challenge
- User Intent => XML Schema
- Match the user's limited schema knowledge with the actual document schema.
  - Schema-Free XQuery
  - Meaningful Query Focus

Meaningful Query Focus
- Without knowing the document structure, the user can still specify WHICH nodes should be meaningfully related.
- The Meaningful Query Focus (MQF) is the most specific XML structure in which the nodes are related.
[Figure: two subtrees, each with author (Mary), title, and year nodes, illustrating MQF (year, title, author).]

Intuition
- A node in an XML document usually represents a real-world entity.
- Two nodes are related to each other by their lowest common ancestor.
- Example: article is one of the most specific entities that contain the entities title and author.

Intuition
- The entity represented by a lowest common ancestor node may not be the most specific type of entity that contains the types of entities each of the nodes represents.
- NOT all lowest common ancestors are meaningful!

Meaningful Lowest Common Ancestor
Given the sets of nodes with tag names title, year, and author:
[Figure: bibliography tree with nodes numbered (1) to (18): bibliography (1) contains bib (2) and bib (11); each bib contains a year, a book, and an article; each book and article contains a title and one or more author nodes. Leaf values shown: XML, Bob, HTML, Mary, Database, Codd, C++, John, Joe.]

Meaningful Lowest Common Ancestor Structure
Given the sets of nodes with tag names title, year, and author, the Meaningful Query Focus of these nodes is:
[Figure: the same bibliography tree, nodes numbered (1) to (18), with the meaningfully related groups of year, title, and author nodes marked.]
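The intuition can be tried out with a small Python sketch. The slides do not give a formal definition, so the test below is an assumption: two nodes count as meaningfully related when no other node with the same tag as one of them forms a strictly deeper lowest common ancestor with the other. It is a pairwise simplification, the sample document (including the year values) is hypothetical, and the names `lca`, `meaningfully_related`, and `find` are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the bibliography figure.
DOC = """<bibliography>
  <bib><year>1999</year>
    <book><title>XML</title><author>Bob</author></book>
    <article><title>HTML</title><author>Mary</author><author>Joe</author></article>
  </bib>
  <bib><year>2000</year>
    <book><title>Database</title><author>Codd</author></book>
    <article><title>C++</title><author>John</author></article>
  </bib>
</bibliography>"""

root = ET.fromstring(DOC)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent = {child: node for node in root.iter() for child in node}

def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def lca(a, b):
    """Lowest common ancestor of two nodes."""
    seen = {id(n) for n in ancestors(a)}
    return next(n for n in ancestors(b) if id(n) in seen)

def is_descendant(node, anc):
    """True if node lies strictly below anc."""
    return node is not anc and any(n is anc for n in ancestors(node))

def meaningfully_related(a, b):
    """Assumed pairwise test: no same-tag rival pairs up more closely."""
    c = lca(a, b)
    if any(is_descendant(lca(a, b2), c) for b2 in root.iter(b.tag)):
        return False
    if any(is_descendant(lca(a2, b), c) for a2 in root.iter(a.tag)):
        return False
    return True

def find(tag, text):
    return next(e for e in root.iter(tag) if e.text == text)

mary = find("author", "Mary")
answer = [(t.text, y.text)
          for t in root.iter("title") if meaningfully_related(t, mary)
          for y in root.iter("year") if meaningfully_related(y, t)]
print(answer)
```

Under these assumptions, the title "XML" and the author Mary share the first `bib` node as lowest common ancestor, but that ancestor is rejected because "XML" pairs more closely with Bob inside its own `book`. Only the `article` containing "HTML" and Mary survives, together with the year of the enclosing `bib`.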

Vocabulary Problem
Users may not know the exact tag names.
Solution: term expansion
- WordNet®: English nouns, verbs, adjectives, and adverbs are organized into synonym sets, each representing one underlying lexical concept.
- Ontology mapping: domain knowledge

Term expansion
The user may not know the exact tag names.
Query as posed, using the user's term "writer":
    for $b in MQF doc("bib.xml")//expand(writer),
        $c in MQF doc("bib.xml")//title,
        $d in MQF doc("bib.xml")//year
    where $b = "Mary"
    return ($c, $d)
After term expansion:
    for $b in doc("bib.xml")//author,
        $c in doc("bib.xml")//title,
        $d in doc("bib.xml")//year
    where mlca($b, $c, $d) and $b = "Mary"
    return ($c, $d)
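A rough Python sketch of the expansion step follows. It assumes a tiny hand-written synonym table standing in for a WordNet or ontology lookup; the table contents, the sample document, and the function name `expand` are hypothetical and only mirror the `expand(writer)` call on the slide.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a thesaurus lookup (assumption, not WordNet data).
SYNONYMS = {
    "writer": {"writer", "author"},
    "publication": {"publication", "book", "article"},
}

# Hypothetical one-entry bibliography.
DOC = """<bibliography>
  <bib><year>1999</year>
    <article><title>HTML</title><author>Mary</author></article>
  </bib>
</bibliography>"""

root = ET.fromstring(DOC)

def expand(term, tree_root):
    """Map a user term to the tag names that actually occur in the document."""
    candidates = SYNONYMS.get(term, {term})
    present = {e.tag for e in tree_root.iter()}
    return sorted(candidates & present)

print(expand("writer", root))       # tag names standing in for "writer"
print(expand("publication", root))  # only the synonyms present in this document
```

Intersecting the synonym set with the tags present in the document keeps the expansion from introducing names the data never uses.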

Query Translation Process
Stages: English Sentence -> Dependency Parse Tree -> Classified Parse Tree -> Schema-Free XQuery
English sentence: List the titles of all the books published by Addison-Wesley.
Dependency parse tree: List (V), title (N), published (A), by (Prep), Addison-Wesley (N), the (Det), book (N), of (Prep)
Classified parse tree: List (CMT), title (NT), published by (CM), publisher (NT), Addison-Wesley (VT), the (MM), book (NT), of (CM)
Schema-Free XQuery:
    for $m0 in timber-mlca($v0,document("sdblp.xml")//title,
                           $v1,document("sdblp.xml")//book,
                           $v2,document("sdblp.xml")//publisher)
    where $v2 = "Addison-Wesley"
    return {$v0}
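The last stage of the pipeline can be imitated with a short Python sketch that turns already-classified tokens into query text. It skips parsing and classification entirely, emits the generic `mlca(...)` form from the term-expansion slide rather than the Timber-specific `timber-mlca` call, and uses a hypothetical function name and parameters; it is not NaLIX's actual translator.

```python
def to_schema_free_xquery(doc, name_tokens, value_filter, returned):
    """Build query text from name tokens, one value predicate, and return names.

    doc          -- document file name
    name_tokens  -- element names recognised in the sentence
    value_filter -- (element name, literal value) pair from the sentence
    returned     -- element names the user asked to see
    """
    # Assign variables $a, $b, ... in the order the names were given.
    var = {name: f"${chr(ord('a') + i)}" for i, name in enumerate(name_tokens)}
    bindings = ",\n    ".join(f'{v} in doc("{doc}")//{n}' for n, v in var.items())
    filt_name, filt_value = value_filter
    lines = [
        "for " + bindings,
        f'where mlca({", ".join(var.values())}) and {var[filt_name]} = "{filt_value}"',
        "return " + ", ".join(var[n] for n in returned),
    ]
    return "\n".join(lines)

print(to_schema_free_xquery(
    "sdblp.xml",
    ["title", "book", "publisher"],
    ("publisher", "Addison-Wesley"),
    ["title"],
))
```

For the example sentence, the name tokens are title, book, and publisher, the value token Addison-Wesley attaches to publisher, and the command token "List" marks title as the element to return.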

System Architecture

Implementation
- NaLIX itself is a Java application
- A Web version is under development
- Contains off-the-shelf components
  - Natural language parser: MiniPar
  - XML database: Timber
  - Thesaurus: WordNet
  - Ontology: manually constructed
- Under development:
  - User-interactive domain ontology construction
  - Machine learning of domain ontology

Interactive Query Formulation
Invalid query: List all the books with the same author as an author of "Advanced Programming in the Unix Environment"
[Figure: screenshot of the interface response, with areas labelled Feedback Message, Suggestions, and Example Usage.]

Experiment
NaLIX vs. keyword search
- Subjects: 18 UAlbany students
  - No training on NaLIX was offered
- Tasks: 12 standard XMP use cases
- Dataset: subset of DBLP
  - All the books and SIGMOD conference articles by year 2003

Ease of Use
- Number of query reformulations
- Time to formulate an acceptable query

Search Quality
- Precision
- Recall

Conclusions
- By taking advantage of the inherent structure of XML, precise and powerful queries can be expressed in natural language.
- Using NaLIX, a naive user can search unfamiliar XML data effectively, with ease and precision.
- Iterative query reformulation can be an effective search strategy.

Thank you! Huahai Yang