NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University.


1 NaLIX Natural Language Interface for querying XML Huahai Yang Department of Information Studies Joint work with Yunyao Li and H.V. Jagadish at University of Michigan

2 Outline Motivation Basic ideas System design Live demonstration of NaLIX User study Conclusion Q & A

3 XML is Everywhere Extensible Markup Language (XML) for data exchange and storage. Almost every application domain is moving towards XML-based document formats: –Digital libraries –Office documents –GIS –...

4 How to Query XML? We can of course use keyword search. Can we do better? After all, XML has structure.

5 Example XML Fragment [XML listing; element values only] One Fish Two Fish / John Meyer / Peter Smith / 7.95; Goodnight Moon / Margaret Brown / 10.55

6 Example Query Query: Find titles and prices of books by ‘Meyer’ [Figure: bookinfo tree with book nodes and their title, author and price children; leaf values: Just Lost, Mercy Meyer, Gina Meyer, $5.75, Brown, Hedi, $13.95]

7 More Powerful Queries Aggregation –What are the books with more than 5 authors? Value Join –Find all the books where an author of each book is the same as an author of “Pride and Prejudice”. Nesting –Find all the articles published in 2000 and with keyword “XML”.

8 Standard Approach - XQuery XQuery: a structured query language that searches for specific information within XML documents, primarily based on path expressions

9 Standard Approach - XQuery Query: Find titles and prices of books by ‘Meyer’
    for $b in doc("bib.xml")//book
    where $b/author[contains(., "Meyer")]
    return ($b/title, $b/price)
[Figure: bookinfo tree with book nodes and their title, author and price children]

10 Another Document Structure The same XQuery no longer works. [Figure: bookinfo tree with author nodes (name, book) and book nodes (title, price); leaf values: Dr. Meyer, M. Brown, Goodnight Moon, One Fish Two Fish, $12.50, Cat in the Hat, $14.95]
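To make the point of slides 9 and 10 concrete, here is a minimal Python sketch (not part of the original talk) that runs one fixed, path-based query against two documents. The tag names and the nesting of both documents are assumptions read off the flattened slide figures; `titles_and_prices_by` is a hypothetical helper that mirrors the XQuery on slide 9 by looking for author children of each book.

```python
import xml.etree.ElementTree as ET

# Structure A: authors are children of each book (tag names are assumed).
DOC_A = """
<bookinfo>
  <book>
    <title>One Fish Two Fish</title>
    <author>John Meyer</author>
    <author>Peter Smith</author>
    <price>7.95</price>
  </book>
  <book>
    <title>Goodnight Moon</title>
    <author>Margaret Brown</author>
    <price>10.55</price>
  </book>
</bookinfo>
"""

# Structure B: books are nested under each author (layout is assumed).
DOC_B = """
<bookinfo>
  <author>
    <name>Dr. Meyer</name>
    <book><title>One Fish Two Fish</title><price>$12.50</price></book>
    <book><title>Cat in the Hat</title><price>$14.95</price></book>
  </author>
  <author>
    <name>M. Brown</name>
    <book><title>Goodnight Moon</title></book>
  </author>
</bookinfo>
"""


def titles_and_prices_by(xml_text, name):
    """Path-based query: for each book, test its author children."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.iter("book"):
        authors = [a.text or "" for a in book.findall("author")]
        if any(name in a for a in authors):
            rows.append((book.findtext("title"), book.findtext("price")))
    return rows


result_a = titles_and_prices_by(DOC_A, "Meyer")
result_b = titles_and_prices_by(DOC_B, "Meyer")
print(result_a)  # one matching book in structure A
print(result_b)  # empty: in structure B no book has an author child
```

The query is unchanged between the two calls; only the document layout differs, and that alone is enough to lose every answer.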

11 Standard Approach - XQuery XQuery is powerful, but ... the user may NOT KNOW the structure of bib.xml!

12 Solution 1 – Study Schema Ask the user to study the structure of bib.xml and write a query in XQuery

13 Solution 1 – Study Schema More problems: Data evolution: the document structure may change over time. Heterogeneity: multiple XML documents with similar content but different structures. Plainly unrealistic from a usability point of view.

14 Solution 2 – Keyword Search Keyword Search – IR approach -Discard all tags: “Mary” -Treat tags as keywords: “book article year title author Mary”

15 Solution 2 – Keyword Search Query: What are the titles and years of the publications of which Mary is an author? ===> “year title author Mary” [Figure: numbered document tree. bibliography (1) has children bib (2) and bib (11). bib (2): year (3), book (4) with title (5) and author (6), article (7) with title (8), author (9) and author (10). bib (11): year (12), book (13) with title (14) and author (15), article (16) with title (17) and author (18). Leaf values: 1999, 2000, XML, Bob, HTML, Mary, Database, Codd, C++, John, Joe]

16 Solution 2 – Keyword Search Pro -Requires no knowledge of document structure Con -Does not take advantage of the structure -Cannot express complex query semantics
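The weakness listed under "Con" can be illustrated with a small Python sketch. It is an assumption-laden toy, not the IR system the talk compares against: the document is rebuilt from the tree on slide 15 (which author value belongs to which node is a guess), and `keyword_hits` is a hypothetical helper that treats tag names and text as one bag of words and returns every element whose subtree contains all query keywords.

```python
import xml.etree.ElementTree as ET

# Document rebuilt from the slide-15 figure; the assignment of author
# names to author nodes is an assumption.
DOC = """
<bibliography>
  <bib>
    <year>1999</year>
    <book><title>XML</title><author>Bob</author></book>
    <article><title>HTML</title><author>Joe</author><author>Mary</author></article>
  </bib>
  <bib>
    <year>2000</year>
    <book><title>Database</title><author>Codd</author></book>
    <article><title>C++</title><author>John</author></article>
  </bib>
</bibliography>
"""


def tokens(elem):
    """Bag of lowercase words from tag names and text in elem's subtree."""
    bag = set()
    for node in elem.iter():
        bag.add(node.tag.lower())
        if node.text and node.text.strip():
            bag.update(node.text.lower().split())
    return bag


def keyword_hits(root, query):
    """Elements (in document order) whose subtree contains every keyword."""
    wanted = set(query.lower().split())
    return [e for e in root.iter() if wanted <= tokens(e)]


root = ET.fromstring(DOC)
hits = keyword_hits(root, "year title author Mary")
hit_tags = [e.tag for e in hits]
print(hit_tags)
# The smallest hit is a whole bib subtree, which also carries Bob's book.
print("bob" in tokens(hits[-1]))
```

No element smaller than a full bib subtree contains all four keywords, so the answer drags in an unrelated book and the user still has to pick the title and year out by hand.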

17 Our Approach Take advantage of whatever partial knowledge the user may have of the document schema. Support a wide range of queries –from regular XQuery to keyword search Minimum effort required from the user –Natural language query is desirable

18 Basic Ideas Map parsed natural language query into XQuery. –Proximity in natural language parse tree should correspond to proximity in matched XML tree XML is human readable and is human created! Interactive query formulation to help users pose system-understandable queries.

19 What are the Challenges? Challenge 1: Automatically understand user intent of an arbitrary natural language question. Challenge 2: How to map user intent to XML schema? - Should I use “author” or “writer”? - Is “Gone with the wind” a book or a movie? - Are books grouped by year or by author in the bibliography?

20 Our focus is on the second challenge User Intent => XML Schema Match users’ limited schema knowledge with the actual document schema. –Schema-Free XQuery –Meaningful Query Focus

21 Meaningful Query Focus Without knowing the document structure, the user can still specify WHICH nodes should be meaningfully related. The Meaningful Query Focus (MQF) is the most specific XML structure in which the nodes are related. [Figure: two example trees, each with author, title and year nodes and the value Mary, labeled MQF (year, title, author)]

22 Intuition - A node in an XML document usually represents a real-world entity - Two nodes are related to each other by their lowest common ancestor. Example: article is one of the most specific entities that contain the entities title and author

23 Intuition The entity represented by a lowest common ancestor node may not be the most specific type of entity that contains the types of entities each of the nodes represents. NOT all lowest common ancestors are meaningful!

24 Meaningful Lowest Common Ancestor Given the sets of nodes with tag name title, year and author [Figure: the numbered bibliography tree from slide 15, nodes (1) to (18), with leaf values 1999, 2000, XML, Bob, HTML, Mary, Database, Codd, C++, John, Joe]

25 Meaningful Lowest Common Ancestor Structure Given the sets of nodes with tag name title, year and author, the Meaningful Query Focus of these nodes is: [Figure: the numbered bibliography tree from slide 15, nodes (1) to (18), with leaf values 1999, 2000, XML, Bob, HTML, Mary, Database, Codd, C++, John, Joe]
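The intuition on slides 22 to 25 can be sketched in Python. This is a simplified reading, not the Timber implementation. It assumes that two nodes are meaningfully related when no other node with the same tag as either of them yields a lowest common ancestor strictly below theirs, and it treats a group of nodes as related when every pair is. The document and all helper names are illustrative, and which article author is Mary is a guess from the figure.

```python
import xml.etree.ElementTree as ET
from itertools import product

DOC = """
<bibliography>
  <bib>
    <year>1999</year>
    <book><title>XML</title><author>Bob</author></book>
    <article><title>HTML</title><author>Joe</author><author>Mary</author></article>
  </bib>
  <bib>
    <year>2000</year>
    <book><title>Database</title><author>Codd</author></book>
    <article><title>C++</title><author>John</author></article>
  </bib>
</bibliography>
"""

root = ET.fromstring(DOC)
# ElementTree keeps no parent pointers, so build a child -> parent map.
parent = {child: node for node in root.iter() for child in node}


def ancestors(node):
    """Return [node, its parent, ..., root]."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def lca(a, b):
    """Lowest common ancestor of a and b."""
    up_a = ancestors(a)
    for node in ancestors(b):
        if any(node is x for x in up_a):
            return node


def strictly_below(node, anc):
    return node is not anc and any(x is anc for x in ancestors(node))


def meaningfully_related(a, b):
    """Assumed pairwise rule: no same-tag alternative gives a lower LCA."""
    c = lca(a, b)
    if any(strictly_below(lca(a, b2), c) for b2 in root.iter(b.tag)):
        return False
    if any(strictly_below(lca(a2, b), c) for a2 in root.iter(a.tag)):
        return False
    return True


def titles_and_years_by(author_name):
    """Return (title, year) pairs meaningfully related to the author."""
    authors = [a for a in root.iter("author") if a.text == author_name]
    rows = []
    for a, t, y in product(authors, root.iter("title"), root.iter("year")):
        if (meaningfully_related(a, t) and meaningfully_related(a, y)
                and meaningfully_related(t, y)):
            rows.append((t.text, y.text))
    return rows


mary = titles_and_years_by("Mary")
codd = titles_and_years_by("Codd")
print(mary)
print(codd)
```

The query never names bib, book or article. The title "XML" is rejected for Mary because another title ("HTML") meets her author node lower in the tree, which is the sense in which not every lowest common ancestor is meaningful.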

26 Vocabulary Problem Users may not know the exact tag names. Solution: term expansion –WordNet®: English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. –Ontology mapping: domain knowledge

27 Term expansion User may not know the exact tag names
    for $b in MQF doc("bib.xml")//expand(writer)
        $c in MQF doc("bib.xml")//title
        $d in MQF doc("bib.xml")//year
    where $b = "Mary"
    return $c, $d
term expansion ===>
    for $b in doc("bib.xml")//author
        $c in doc("bib.xml")//title
        $d in doc("bib.xml")//year
    where mlca($b, $c, $d) and $b = "Mary"
    return $c, $d
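A minimal sketch of the expand step on slide 27, in Python. The synonym sets below are a hand-written stand-in for WordNet and the domain ontology, and `expand` is a hypothetical helper: it keeps only those synonyms of the user's word that actually occur as tag names in the document.

```python
# Hand-written synonym sets standing in for WordNet (an assumption).
SYNSETS = [
    {"author", "writer"},
    {"title", "name"},
    {"year", "date"},
]


def expand(term, document_tags):
    """Return the document tag names that the user's term may refer to."""
    candidates = {term}
    for synset in SYNSETS:
        if term in synset:
            candidates |= synset
    return sorted(candidates & set(document_tags))


tags_in_bib = {"bibliography", "bib", "year", "book", "article", "title", "author"}
expanded = expand("writer", tags_in_bib)
print(expanded)
```

With this table, `writer` resolves to the `author` tag, while a term with no synonym present in the document resolves to an empty list, which a system could use as a cue to ask the user for a different word.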

28 Query Translation Process English Sentence ==> Dependency Parse Tree ==> Classified Parse Tree ==> Schema-Free XQuery
English sentence: List the titles of all the books published by Addison-Wesley.
Dependency parse tree: List (V), title (N), published (A), by (Prep), Addison-Wesley (N), the (Det), book (N), of (Prep)
Classified parse tree: List (CMT), title (NT), published by (CM), publisher (NT), Addison-Wesley (VT), the (MM), book (NT), of (CM)
Schema-Free XQuery:
    for $m0 in timber-mlca($v0,document("sdblp.xml")//title,
                           $v1,document("sdblp.xml")//book,
                           $v2,document("sdblp.xml")//publisher)
    where $v2 = "Addison-Wesley"
    return {$v0}
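The last step of the pipeline on slide 28 can be imitated with a toy Python generator. This is only a sketch under stated assumptions, not NaLIX's translation algorithm. The tokens are supplied by hand in sentence order (chosen so that the variable numbering matches the slide). Each NT token gets a variable, a VT token constrains the most recent NT, and the NT that follows the CMT token is returned. The function name and these rules are hypothetical.

```python
def to_schema_free_xquery(tokens, doc):
    """Build a timber-mlca query string from (word, label) tokens."""
    names = []       # (variable, tag) pairs, one per NT token
    conditions = []  # value constraints from VT tokens
    returned = []    # variables to return
    after_command = False
    for word, label in tokens:
        if label == "CMT":
            after_command = True
        elif label == "NT":
            var = f"$v{len(names)}"
            names.append((var, word))
            if after_command:
                returned.append(var)
                after_command = False
        elif label == "VT":
            # Assumes a name token precedes every value token.
            conditions.append(f'{names[-1][0]} = "{word}"')
        # Other labels (CM, MM) are ignored in this sketch.
    bindings = ", ".join(f'{var},document("{doc}")//{tag}' for var, tag in names)
    query = f"for $m0 in timber-mlca({bindings})"
    if conditions:
        query += " where " + " and ".join(conditions)
    query += " return {" + ", ".join(returned) + "}"
    return query


# Classified tokens for the slide's sentence, in assumed sentence order.
TOKENS = [
    ("List", "CMT"), ("title", "NT"), ("of", "CM"), ("the", "MM"),
    ("book", "NT"), ("published by", "CM"), ("publisher", "NT"),
    ("Addison-Wesley", "VT"),
]

query = to_schema_free_xquery(TOKENS, "sdblp.xml")
print(query)
```

The generated string has the same shape as the query on the slide; a real translator would work from the tree structure of the classified parse rather than from a flat token list.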

29 System Architecture

30 Implementation ● NaLIX itself is a Java application ● Web version is under development ● Contains off-the-shelf components ● Natural Language Parser: MiniPar ● XML Database: Timber ● Thesaurus: WordNet ● Ontology: manually constructed Under development: –User interactive domain ontology construction –Machine learning of domain ontology

31 Interactive Query Formulation Invalid Query: List all the books with the same author as an author of “Advanced Programming in the Unix Environment” Feedback Message Suggestions Example Usage

32 Experiment NaLIX vs. keyword search Subjects: 18 UAlbany students -No training on NaLIX was offered Tasks: 12 standard XMP use cases Dataset: subset of DBLP -All the books and SIGMOD conference articles by year 2003.

33 Ease of Use Number of query reformulations Time to formulate an acceptable query

34 Search Quality Precision Recall

35 Conclusions By taking advantage of the inherent structure of XML, precise and powerful queries in natural language can be achieved. Using NaLIX, a naive user can effectively search unfamiliar XML data with ease and precision. Iterative query reformulation can be an effective search strategy.

36 Thank you! Huahai Yang hyang@albany.edu

