Module 7 XML and Information Retrieval (XQuery FullText, Research) 26MKT-ECHER-67FEX-44B6P.


1 Module 7 XML and Information Retrieval (XQuery FullText, Research) 26MKT-ECHER-67FEX-44B6P

2 The World is Flat (Friedman)
- Differences between Switzerland and Bangladesh are disappearing
- Of the traditional drivers "talent", "origin", and "luck", "talent" will dominate
- Individuals (not countries) compete
- Chocolate sauce today, vanilla tomorrow
- No premium for strength
- No premium for memory
- Premium for collaboration, communication
- Premium for understanding, adaptivity
- Premium for creativity
- Good for Computer Scientists (lazy + greedy)

3 The World is Flat (XML)
- Remove barriers between …
- … machines
  - Machines talk to machines
- … communication, processing, storage
  - One data model
- … design and implementation
  - Declarative programming, pay as you go along
- … data and meta-data, structured/unstructured
  - One data model
- … "here" & "there", "today" & "tomorrow"
  - Decouple data from schema / interpretation

4 References
- XQuery 1.0 and XPath 2.0 Full-Text
  - http://www.w3.org/XML/Query
  - Latest version: November 2005
  - Still work in progress!
- S. Amer-Yahia, J. Shanmugasundaram: XML Full-Text Search: Challenges and Opportunities
  - Tutorial for VLDB Conf., August 2005

5 Motivation
- A key benefit of XML is its ability to represent a mix of structured and unstructured (text) data
- Applications
  - Digital libraries
  - Content management
- Many such XML repositories already available
  - IEEE INEX collection
  - Library of Congress documents
  - Shakespeare's plays
  - SIGMOD, DBLP, …

6 XML in Library of Congress
http://thomas.loc.gov/home/gpoxmlc109/h2739_ih.xml
109th CONGRESS 1st Session
H. R. 2739
IN THE HOUSE OF REPRESENTATIVES
May 26, 2005
Mr. Tierney (for himself, Ms. McCollum of Minnesota, Mr. George Miller of California) introduced the following bill; which was referred to the Committee on Education and the Workforce
…

7 THOMAS: Library of Congress

8 INEX Data
K0271 10.1041/K0271s-2004
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING
1041-4347 /04/$20.00 © 2004 IEEE Published by the IEEE Computer Society Vol. 16, No. 2 FEBRUARY 2004
pp. 271-288 A Graph-Based Approach for Timing Analysis and Refinement of OPS5 Knowledge-Based Systems pp. 271-288 *
Albert Mo Kim Cheng Senior Member IEEE Hsiu-yen Tsai
Abstract: This paper examines the problem of predicting the timing behavior of knowledge-based systems for real-…

9 Current Query Languages
- Current XML query languages are mostly "database" languages
  - Examples: XQuery, XPath
- Provide very rudimentary text/IR support
  - fn:contains(e, keywords)
  - Returns true iff element e contains keywords
- No support for complex IR queries
  - Distance predicates, stemming, scoring, …

10 Example Queries
- XQuery Full-Text Use Cases Document
  - Find the titles of the books whose body contains the phrases "Usability" and "Web site" in that order, in the same paragraph, using stemming if necessary to match the tokens
  - Find the titles of the books whose body contains "Usability" and "testing" within a window of 3 words, and return them in score order

11 Example INEX Query
//article[about(.//abs, "data mining")]//sec[about(., "frequent itemsets")]
- Sections about frequent itemsets from articles with an abstract about data mining
- To be relevant, a component has to be a section about "frequent itemsets". For example, it could be about algorithms for finding frequent itemsets, or uses of frequent itemsets to generate rules. Also, the article must have an abstract about "data mining".
- I need this information for a paper that I am writing. It is a survey of different algorithms for finding frequent itemsets. The paper will also have a section on why we would want to find frequent itemsets.

12 Grand Challenge Queries [Kossmann 98]
- Which ETH professor plays field hockey and has developed an efficient algorithm for computing the Pareto curve?
- Who copied my IDP complexity proof?
- Who is in this photo?
- In which play does the ambitious wife drive her husband to murder?

13 Why not use SQL/MM?
- Key difference: No strict demarcation between structured and text data in XML
  - Can issue structured and text queries over same data
    - Find books with year > 1995
    - Find books containing keyword "1998"
  - Can embed structured queries in text queries
    - Find books that contain the keywords that occur in the title of Richard Dawkins' books
- Other important differences
  - XML/XQuery data model
  - Composability of full-text primitives

14 Challenges in XML FT Search
- Searching over Semi-Structured Data
  - Users may specify a search context and return context.
- Expressive Power and Extensibility
  - Users should be able to express complex full-text searches and combine them with structural searches.
- Scores and Ranking
  - Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
- The language should allow for an efficient implementation.

15 XML FT Search Definition
- Context expression: XML elements searched:
  - pre-defined XML nodes.
  - XPath/XQuery queries.
- Return expression: XML fragments returned:
  - pre-defined meaningful XML fragments.
  - XPath/XQuery to build answers.
- Search expression: FT search conditions:
  - Boolean keyword search.
  - proximity distance, scoping, thesaurus, stop words, stemming.
- Score expression:
  - system-defined scoring function.
  - user-defined scoring function.
  - query-dependent keyword weights.

16 Granularity of Results
- Keyword queries
  - compute possibly different scores for LCAs.
- Tag + Keyword queries
  - compute scores based on tags and keywords.
- Path Expression + Keyword queries
  - compute scores based on paths and keywords.
- XQuery + Complex full-text queries
  - compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).

17 Four Classes of Languages
- Keyword search (INEX Content-Only Queries)
  - "book xml"
- Tag + Keyword search
  - book: xml
- Path Expression + Keyword search
  - /book[./title about "xml db"]
- XQuery + Complex full-text search
  - for $b in /book let score $s := $b ftcontains "xml" && "db" distance 5

18 XPath [W3C 2005]
- Special function in XQuery for keyword search (first proposed: [Florescu&Kossmann 2000])
- fn:contains($e, string) returns true iff $e contains string
- What happens if string is generated by an expression that returns a sequence of strings?
- Does not allow specification of stemming, stop words, scoring, etc.
//section[fn:contains(./title, "XML Indexing")]
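To make the limitation concrete, here is a minimal Python sketch (not XQuery) contrasting a plain substring test with a token-based phrase match. It assumes fn:contains behaves as a case-sensitive substring test, which is the behaviour under the default codepoint collation; the helper names `fn_contains`, `tokens` and `token_phrase_match` are hypothetical and the tokenizer is a deliberately crude one.

```python
import re


def fn_contains(text: str, needle: str) -> bool:
    """Substring test in the spirit of fn:contains (assumed case-sensitive)."""
    return needle in text


def tokens(text: str) -> list[str]:
    """Crude tokenizer: lower-cased runs of word characters."""
    return re.findall(r"\w+", text.lower())


def token_phrase_match(text: str, phrase: str) -> bool:
    """True if the tokens of `phrase` occur consecutively among the tokens of `text`."""
    hay, pat = tokens(text), tokens(phrase)
    return any(hay[i:i + len(pat)] == pat for i in range(len(hay) - len(pat) + 1))
```

The substring test misses "XML indexing" because of case and accepts "text" inside "Hypertext"; the token-based match treats both cases the way a full-text user would probably expect.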

19 XQuery Full-Text
- Full-text search extension to XQuery
  - W3C Working Draft
- Tightly integrated with the XQuery data model
  - Provides well defined model for reasoning about full-text operations and integration with XQuery
- Composability
  - Fully composable full-text primitives, including Boolean connectives, distance predicates, stemming
  - Can embed XQuery Full-Text primitives in XQuery and vice versa
- Flexible scoring construct
- AllMatches Data Model: Tokenization!

20 XQuery Full-Text Evolution
[Timeline figure, 2002 to 2005: Quark Full-Text Language (Cornell); TeXQuery (Cornell, AT&T); IBM, Microsoft, Oracle proposals; XQuery Full-Text (Second Draft)]

21 Syntax Overview
- Two new XQuery constructs:
  1) FTContainsExpr
     - Expresses "Boolean" full-text search predicates
     - Seamlessly composes with other XQuery expressions
  2) FTScoreClause
     - Extension to FLWOR expression
     - Can score FTContainsExpr and other expressions

22 FTContainsExpr
- Like other XQuery expressions
  - Takes in a sequence of items (nodes) as input
  - Produces a sequence of items (nodes) as output
  - Seamlessly composes with other XQuery exprs.
- Do not confuse with fn:contains function!
[Figure: an XQuery Expression evaluates to a Sequence of items]

23 FTContainsExpr
ContextExpr ftcontains FTSelection
- ContextExpr (any XQuery expression) is context spec
- FTSelection is search spec
- Returns true iff at least one node in ContextExpr satisfies the FTSelection
- Examples
  - //book ftcontains 'Usability' && 'testing' distance 5
  - //book[./content ftcontains 'Usability' with stems]/title
  - //book ftcontains /article[author='Dawkins']/title
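The distance predicate in the first example can be illustrated with a small Python sketch over token positions. This is only an approximation under stated assumptions: "distance N" is read here as "at most N intervening tokens", the tokenizer is simplistic, and the names `positions` and `within_distance` are made up for illustration rather than taken from the draft.

```python
import re


def positions(text: str, word: str) -> list[int]:
    """Token positions (0-based) at which `word` occurs, ignoring case."""
    toks = re.findall(r"\w+", text.lower())
    return [i for i, tok in enumerate(toks) if tok == word.lower()]


def within_distance(text: str, w1: str, w2: str, max_gap: int) -> bool:
    """True if some occurrence of w1 and some occurrence of w2 are separated
    by at most `max_gap` intervening tokens (assumed reading of 'distance')."""
    return any(
        abs(p - q) - 1 <= max_gap
        for p in positions(text, w1)
        for q in positions(text, w2)
    )
```

A real engine would evaluate this over an index of positions rather than by re-tokenizing the text, but the predicate itself is the same comparison of positions.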

24 FTSelection
- Encapsulates all full-text conditions in FTContainsExpr
- Works in a new data model called AllMatch
  - Operates on positions within XML nodes (more fine grained than XQuery data model)
- Fully composable; similar to composition of relational (and XML) operators!
[Figure: an FTSelection evaluates to an AllMatch]

25 FTSelection Examples
- 'Usability'
- /book[author='Dawkins']/title
- 'Usability' && /book[author='Dawkins']/title
- ('Usability' && /book[author='Dawkins']/title) same sentence
- ('Usability' && /book[author='Dawkins']/title) same sentence window 5
- All of these evaluate to an AllMatch!
  - Allows arbitrary composition of full-text primitives

26 AllMatches Data Model
- Tokenization: Extend XQuery data model to represent each individual word (not just string or text). Each word is represented as a token
- AllMatches: Represent the results of FTSelections
- The following TokenInfo is kept for each match:
  - Word: string (the matching word itself)
  - Pos: integer (position of word within document)
  - Para: integer (position of paragraph)
  - Sentence: integer (position of sentence)
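The TokenInfo record can be sketched in Python as follows. This is an illustration only: the splitting rules (blank lines separate paragraphs, sentence-final punctuation separates sentences), the 0-based numbering, and the helper names `tokenize` and `same_sentence` are assumptions, since the draft leaves tokenization to the implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenInfo:
    word: str      # the matching word itself
    pos: int       # position of the word within the document
    para: int      # position of the paragraph
    sentence: int  # position of the sentence


def tokenize(text: str) -> list[TokenInfo]:
    """Split text into paragraphs, sentences and words (naive rules)."""
    out: list[TokenInfo] = []
    pos = sentence = 0
    for para, block in enumerate(re.split(r"\n\s*\n", text.strip())):
        for sent in re.split(r"(?<=[.!?])\s+", block.strip()):
            words = re.findall(r"\w+", sent)
            if not words:
                continue
            for word in words:
                out.append(TokenInfo(word, pos, para, sentence))
                pos += 1
            sentence += 1
    return out


def same_sentence(toks: list[TokenInfo], w1: str, w2: str) -> bool:
    """True if w1 and w2 co-occur in at least one sentence (case-insensitive)."""
    s1 = {t.sentence for t in toks if t.word.lower() == w1.lower()}
    s2 = {t.sentence for t in toks if t.word.lower() == w2.lower()}
    return bool(s1 & s2)
```

Because every token carries its paragraph and sentence number, predicates such as "same sentence" reduce to comparing integer fields, which is what makes the primitives easy to compose.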

27 FTContextModifier
- Can be applied on any FTSelection to specify aspects such as stemming, thesauri, case, etc.
- Fully composable with other context modifiers and FTSelections
- Examples
  - 'Usability' && 'testing' with stems
  - 'Usability' && 'testing' without stop words
  - 'Usability' && 'testing' case insensitive

28 Porter Algorithm for Stemming
- Transform the word (sequence of vowels and consonants) to a stem
- Works for the English language
- Applies a set of heuristics; e.g.,
  - Plural: sses -> ss; ies -> i
  - Tenses: eed -> ee (agreed -> agree); ed -> (empty string)
- Use thesauri to separate composite words
  - Particularly useful in German:
  - Schwimmvogel -> Schwimm, Vogel
- Stop Words: Lists are available on Internet
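The plural and tense heuristics listed on this slide can be sketched in Python. This is a small subset of Porter's first step, not the full algorithm: later steps and the clean-up rules after removing "ed" are omitted, and the function names are chosen here for illustration. The conditions used (measure greater than zero for "eed", a vowel in the stem for "ed") follow Porter's published rules.

```python
VOWELS = set("aeiou")


def is_consonant(word: str, i: int) -> bool:
    """Porter's definition: 'y' counts as a vowel when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Number of vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem: str) -> bool:
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def stem_subset(word: str) -> str:
    """Apply only the plural and tense rules shown on the slide."""
    w = word.lower()
    # Plural rules: sses -> ss, ies -> i, ss stays, trailing s is dropped.
    if w.endswith("sses") or w.endswith("ies"):
        w = w[:-2]
    elif not w.endswith("ss") and w.endswith("s"):
        w = w[:-1]
    # Tense rules: eed -> ee if the stem has measure > 0; ed -> "" if the stem has a vowel.
    if w.endswith("eed"):
        if measure(w[:-3]) > 0:
            w = w[:-1]
    elif w.endswith("ed") and has_vowel(w[:-2]):
        w = w[:-2]
    return w
```

The measure condition is what keeps short words such as "feed" intact while "agreed" is reduced to "agree".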

29 Full-Text Scoring
- Score value should reflect relevance of answer to user query.
  - Higher scores imply a higher degree of relevance.
- Queries return document fragments.
  - Granularity of returned results affects scoring.
- For queries containing conditions on structure, structural conditions may affect scoring.
- Existing proposals extend common scoring methods (standard does not care!):
  - probabilistic or vector-based similarity.

30 Vector Space Model
- Consider document as a vector of weights
  - One weight per word: tf * idf
  - Term frequency (tf)
  - Inverse document frequency (idf)
- Consider query as a vector of weights
  - One weight per word in query: tf * idf
- Compute similarity of vectors of doc and query
  - Textbook: cosine similarity
  - Black art in each search engine
  - Google: PageRank, based on random walk
- Goal: Maximize Precision and Recall
  - Defined by humans! (AI-complete, no rigorous approach)
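The textbook variant of this model can be sketched in a few lines of Python. The weighting used here (raw term count times log(N / df)) is one common choice among many and is an assumption of this sketch, as are the whitespace tokenizer and the function names; real engines tune all of these.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str], query: str):
    """Return (document vectors, query vector) as term -> tf * idf dicts."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))

    def weigh(toks: list[str]) -> dict[str, float]:
        tf = Counter(toks)
        # Terms unseen in the collection are skipped (no idf is defined for them).
        return {t: tf[t] * math.log(n / df[t]) for t in tf if df[t] > 0}

    return [weigh(toks) for toks in tokenized], weigh(query.lower().split())


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(docs: list[str], query: str) -> list[int]:
    """Indices of docs, most similar to the query first."""
    vecs, q = tfidf_vectors(docs, query)
    return sorted(range(len(docs)), key=lambda i: cosine(vecs[i], q), reverse=True)
```

Rare terms dominate the ranking because their idf is large, while a term that occurs in every document gets weight zero and cannot influence the result.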

31 FTScoreClause
- Two alternatives
  - Both extensions to FLWOR clause
- Alternative 1
  - Score "Boolean" XQuery expressions, including FTContainsExpr
  - Current working draft syntax
- Alternative 2
  - Score arbitrary XQuery expressions
  - Under discussion
- Exact scoring is implementation-dependent!!!
  - Standard imposes competition between vendors

32 Alternative 1
FOR … LET … SCORE $var AS Expr (clauses in any order; Expr returns Boolean)
WHERE …
ORDER BY …
RETURN
Example
FOR $b in /pubs/book
SCORE $s AS $b ftcontains 'software' weight 0.8 && 'testing' weight 0.2
ORDER BY $s
RETURN $b
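Since the exact scoring is implementation-dependent, the following Python sketch shows just one conceivable interpretation of the weighted example: the score of a book is the sum of the weights of the query terms it contains, and books are returned best first. The scoring rule, the sample data and the names `weighted_score` and `rank_by_score` are all assumptions for illustration, not what the draft prescribes.

```python
import re


def weighted_score(text: str, weighted_terms: dict[str, float]) -> float:
    """Sum of the weights of the query terms that occur in `text` (assumed rule)."""
    toks = set(re.findall(r"\w+", text.lower()))
    return sum(w for term, w in weighted_terms.items() if term.lower() in toks)


def rank_by_score(docs: dict[str, str], weighted_terms: dict[str, float]) -> list[str]:
    """Document ids ordered by descending score; ties keep insertion order."""
    scored = [(weighted_score(text, weighted_terms), doc_id) for doc_id, text in docs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda pair: -pair[0])]
```

With weights 0.8 for "software" and 0.2 for "testing", a book mentioning only "software" outranks one mentioning only "testing", which is the intent of per-keyword weights.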

33 Alternative 1
FOR … LET … SCORE $var AS Expr (clauses in any order; Expr returns Boolean)
WHERE …
ORDER BY …
RETURN
Example
FOR $b in /pubs/book
SCORE $s AS $b/price … $b

34 Alternative 1: Analysis
- Not powerful enough for some XML IR queries
  - Case study: XML INEX initiative
- Want to "relax" /pubs/book (in addition to full-text predicates)
  - Boolean scoring expressions insufficient
/pubs/book[. ftcontains 'Usability' && 'testing']

35 Alternative 2
FOR $v [SCORE $s]? [AT $i]? IN [FUZZY] Expr LET … (clauses in any order)
WHERE …
ORDER BY …
RETURN
Example
FOR $b SCORE $s in /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN $b

36 Alternative 2
FOR $v [SCORE $s]? [AT $i]? IN [FUZZY] Expr LET … (clauses in any order)
WHERE …
ORDER BY …
RETURN
Example
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" && "testing"]
ORDER BY $s
RETURN $b

37 Research Challenges

38 Challenge 1: System Architecture
[Figure: XQuery Engine and IR Engine as separate components, connected by an Integration Layer]

39 Challenge 1: System Architecture
[Figure: a single combined XQuery + IR Engine]

40 Challenge 2: Structural Relaxation
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains "Usability" with stems]
ORDER BY $s
RETURN $b

41 Adaptation of tf.idf to XML
Whirlpool [Marian et al ICDE 2005]
Correspondence between a document collection (Information Retrieval) and an XML document:
- Document corresponds to XML Node (result is a subtree rooted at a returned node with a given tag and satisfying structural predicates in the query)
- Keyword(s) correspond to Tree Pattern
- idf (inverse document frequency)
  - IR: a function of the fraction of documents that contain the keyword(s)
  - XML: a function of the fraction of returned nodes that match the query tree pattern
- tf (term frequency)
  - IR: a function of the number of occurrences of the keyword in the document
  - XML: a function of the number of ways the query tree pattern matches the returned node

42 Challenge 3: Search Over Views
LET $bookrevs :=
  FOR $book IN //book
  RETURN { $book } { FOR $rev IN //review WHERE $rev/bookid = $book/id RETURN $rev }
FOR $bookrev IN $bookrevs
SCORE $score AS $bookrev ftcontains 'Usability' with stems
ORDER BY $score
RETURN $bookrev

43 Challenge 4: LCA
- Given: Query keywords
- Compute: Least Common Ancestors (LCAs) that contain all query keywords, in ranked order

44 Naïve Method
- Naïve inverted lists:
  - Ricardo: 1; 5; 6; 8
  - XQL: 1; 5; 6; 7
- Problems:
  1. Space Overhead
  2. Spurious Results
- Main issue: Decouples representation of ancestors and descendants
[Figure: example XML tree with nodes numbered 1 to 8, including a date node ("28 July …") and text nodes "…XML and …", "David Carmel …", "XQL and …", "Ricardo …"]

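45 Dewey Encoding of IDs [1850s]
[Figure: the example XML tree labelled with Dewey IDs: root 0; below it 0.0 (date, "28 July …"), 0.1, 0.2 and 0.3; below 0.3 the nodes 0.3.0 and 0.3.1; below 0.3.0 the nodes 0.3.0.0 and 0.3.0.1; text nodes include "…XML and …", "David Carmel …", "XQL and …", "Ricardo …"]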
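The appeal of Dewey IDs for LCA computation is that the ID of a node contains the IDs of all its ancestors as prefixes, so the lowest common ancestor of a set of nodes is simply their longest common prefix. A minimal Python sketch, with helper names (`parse`, `fmt`, `lca`, `is_ancestor`) chosen here for illustration and IDs held as integer tuples:

```python
def parse(dewey: str) -> tuple[int, ...]:
    """'0.3.0.1' -> (0, 3, 0, 1)"""
    return tuple(int(part) for part in dewey.split("."))


def fmt(dewey: tuple[int, ...]) -> str:
    """(0, 3, 0) -> '0.3.0'"""
    return ".".join(str(part) for part in dewey)


def lca(ids: list[tuple[int, ...]]) -> tuple[int, ...]:
    """Longest common prefix of a non-empty list of Dewey IDs."""
    prefix = list(ids[0])
    for other in ids[1:]:
        k = 0
        while k < min(len(prefix), len(other)) and prefix[k] == other[k]:
            k += 1
        prefix = prefix[:k]
    return tuple(prefix)


def is_ancestor(a: tuple[int, ...], d: tuple[int, ...]) -> bool:
    """True if a is a proper ancestor of d, i.e. a strict prefix of d."""
    return len(a) < len(d) and d[: len(a)] == a
```

An inverted list then only needs to store the Dewey ID of each node that directly contains a keyword; ancestors are implied by the prefixes, which addresses the space overhead of listing every ancestor explicitly.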

46 Other Open Issues
- Experimental evaluation of scoring functions and ranking algorithms for XML (INEX).
- Search over a mix of HTML and XML.
- Joint scoring on full-text and scalar predicates.
- Score-aware algebra for XML for the joint optimization of queries on both structure and text.

47 Conclusion
- Unified querying of structured data and text is one of the most promising benefits of XML
- XQuery Full-Text is a language designed to enable this goal
- Many research challenges
  - System implementation
  - Scoring
  - Requirements of a new class of applications
- Starting to see research prototypes
  - Quark (Open-source software, Cornell)
  - GalaTeX (Reference implementation, AT&T)

