
Configurable Indexing and Ranking for XML Information Retrieval
Shaorong Liu, Qinghua Zou and Wesley W. Chu, UCLA Computer Science Department


1 Configurable Indexing and Ranking for XML Information Retrieval
Shaorong Liu, Qinghua Zou and Wesley W. Chu
UCLA Computer Science Department
{sliu, zou, wwc}@cs.ucla.edu
July 26, 2004

2 XML Basics
A format for defining the syntax and semantics of structured documents
An XML document is commonly modeled as an ordered labeled tree
[Figure: an example XML document and its tree representation, with element nodes article, year, body, sec, ref and author, and content nodes such as 2003, XML retrieval… and J. Webb]

3 XML Queries: content and structure (CAS)
Structure: XPath expressions
Content: about(path, string) functions
– Specify that a certain context, path, is about a specific content, string
– Basis for result ranking
Example
– /article/body/sec[about(., XML retrieval)]
[Figure: tree of the example document, with element nodes article, year, body, sec, ref and author]
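To make the split between the structure part and the content part concrete, here is a minimal sketch in Python. The sample document, the helper names `about_score` and `run_cas_query`, and the term-counting score are all assumptions made for illustration; the slides do not say how about() is scored at this point.

```python
import xml.etree.ElementTree as ET

# Toy document (an assumption; loosely modeled on the slide's example labels).
DOC = """
<article>
  <year>2003</year>
  <body>
    <sec>XML retrieval systems rank elements</sec>
    <sec>Relational databases</sec>
    <ref>XML retrieval survey</ref>
  </body>
  <author>J. Webb</author>
</article>
"""

def about_score(element, query):
    """Toy stand-in for about(): count query-term occurrences in the element's text."""
    tokens = " ".join(element.itertext()).lower().split()
    return sum(tokens.count(term) for term in query.lower().split())

def run_cas_query(root, path, query):
    """The structure part selects candidate elements; the content part ranks them."""
    candidates = root.findall(path)
    scored = [(about_score(e, query), e) for e in candidates]
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [(score, "".join(e.itertext()).strip()) for score, e in ranked if score > 0]

root = ET.fromstring(DOC)
# /article/body/sec[about(., XML retrieval)]: root is <article>, so the relative path is body/sec.
results = run_cas_query(root, "body/sec", "XML retrieval")
```

Only the sec element that mentions the query terms is returned; the ref element also mentions them but is excluded by the structure condition.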

4 Motivation
[Figure: XML Documents and CAS Queries go into an XML Retrieval System, which returns Ranked Results]

5 Text Retrieval
[Figure: Text Documents → Indexing → Indices; Query and Indices → Retrieval → Ranked Documents]

6 What’s new in XML retrieval
Structure!!!

7 XML Retrieval Challenges
Indexing
– Text retrieval: content information
– XML retrieval: content + structure information
Ranking
– Text retrieval: static document concept [1]
– XML retrieval: dynamic document concept [1]
[1] Querying and Ranking XML Documents (T. Schlieder and H. Meuss, 2000)

8 Related Work
XML Query Languages
– XIRQL [Fuhr et al., 2001]
– XXL [Theobald and Weikum, 2001]
– Searching XML documents via XML fragments [Carmel et al., 2003]
– Narrowed Extended XPath I (NEXI) [Trotman et al., 2004]
XML Search Engines
– HyREX [Fuhr et al., 2001]
– The XXL Search Engine [Theobald and Weikum, 2002]
– JuruXML [Mass et al., 2003]
– XSEarch [Cohen et al., 2003]
– Many others in INEX 02 & INEX 03

9 Goal: Fully utilize XML structure information to improve retrieval performance!!!

10 Our Approach: configurable XML retrieval
[Figure: XML Documents → Configurable Indexing → Content Indices and Structure Indices (Ctree); CAS Queries and the indices → XML Retrieval (Configurable Ranking) → Ranked Elements]

11 Roadmap
Background, Challenges & Related work
Our approach: configurable XML retrieval system
Configurable XML Indexing
XML Ranking
Experiments
Conclusion & Future work

12 Why Configurable Indexing?
Utilize structure information to customize indexing (filtering operations, index types) for different elements.
[Figure: removing stop words and stemming turns “… text retrieval…” into “… text retrief…”, and turns “J. Webb” into “web”]

13 Configurable XML Indexing
[Figure: the Index Builder scans the XML Documents and, guided by Index Configurations (filtering operation selection, index type selection), produces Structure Indices and Content Indices]
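A per-element index configuration could look like the following Python sketch. The configuration keys, the element names, the stop-word list and the crude `toy_stem` function are all hypothetical; the slides only state that filtering operations and index types are selected per element.

```python
STOP_WORDS = {"a", "an", "the", "of", "for", "and", "in"}

def toy_stem(token):
    """Very crude suffix stripper; a placeholder for a real stemmer (assumption)."""
    for suffix in ("ing", "al", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

# Hypothetical per-element configuration: which filters run and which index type is built.
INDEX_CONFIG = {
    "p":      {"remove_stop_words": True,  "stem": True,  "index_type": "invert"},
    "author": {"remove_stop_words": False, "stem": False, "index_type": "invert"},
    "year":   {"remove_stop_words": False, "stem": False, "index_type": "number"},
}
DEFAULT_CONFIG = {"remove_stop_words": True, "stem": True, "index_type": "invert"}

def tokens_for_index(tag, text):
    """Apply the filters configured for this element tag and return indexable tokens."""
    cfg = INDEX_CONFIG.get(tag, DEFAULT_CONFIG)
    tokens = text.lower().replace(".", " ").split()
    if cfg["remove_stop_words"]:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if cfg["stem"]:
        tokens = [toy_stem(t) for t in tokens]
    if cfg["index_type"] == "number":
        return [int(t) for t in tokens if t.isdigit()]
    return tokens

body_tokens = tokens_for_index("p", "The retrieval of XML documents")
author_tokens = tokens_for_index("author", "J. Webb")
year_tokens = tokens_for_index("year", "2003")
```

With this configuration, paragraph text is filtered and stemmed, while the author name is left intact and the year is stored as a number.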

14 Building Index
[Figure: the tree representation of the XML document collection (articles, article, fm, year, kwd, body, sec, with content such as 2000, 2003, XML…, Database… and XML retrieval…); the structure index Ctree (articles, article, fm, year, kwd, bdy, sec, with groups g1 to g7); a content index example: invert]
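As a rough illustration of building a structure index alongside a content index, the Python sketch below groups elements by their label path and records, for each term, which group and position it occurs in. This is a simplified path summary written for this transcript, not the actual Ctree layout; the sample collection and the name `build_indices` are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy collection (an assumption), using the element names that appear on the slide.
COLLECTION = """
<articles>
  <article><fm><year>2000</year></fm><body><sec>XML</sec><sec>Database</sec></body></article>
  <article><fm><year>2003</year><kwd>XML retrieval</kwd></fm><body><sec>XML retrieval</sec></body></article>
</articles>
"""

def build_indices(root):
    groups = defaultdict(list)    # structure: label path -> elements sharing that path
    inverted = defaultdict(list)  # content: term -> [(label path, position within group)]

    def visit(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        position = len(groups[path])
        groups[path].append(element)
        # Only the element's direct text is indexed here; tail text is ignored.
        for term in (element.text or "").lower().split():
            inverted[term].append((path, position))
        for child in element:
            visit(child, path)

    visit(root, "")
    return groups, inverted

groups, inverted = build_indices(ET.fromstring(COLLECTION))
sec_count = len(groups["/articles/article/body/sec"])
xml_postings = inverted["xml"]
```

Each posting points back into a path group, so a structure condition can be checked by comparing label paths before any content scoring happens.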

15 Roadmap
Background, Challenges & Related work
Our approach: configurable XML retrieval system
Configurable XML Indexing
XML Ranking
– Weighted term frequency
– Inverse element frequency
Experiments
Conclusion & Future work

16 XML Ranking: why weighted term frequency?
Hierarchical XML: the content of an element e is also considered part of the content of e’s ancestor elements.
How to estimate an element e’s relevancy to a term t?
– Example: //article[about(., ‘XML’)]
[Figure: article tree with nodes body, fm, bm, kwd, ref, sec, title and para; “XML” occurs under several of them]

17 XML Ranking: weighted term frequency
Basic idea
– Terms under different paths of an element e are of different importance.
Notations
– A path $l = x_1/x_2/\dots/x_n$
– $w(l)$: weight for a path $l$
Formula
– $e$: an element
– $t$: a term
– $l_i$: a path under element $e$ and containing $t$
– $m$: # of different paths under element $e$ and containing $t$
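The formula itself did not survive in this transcript. One natural reading of the notation, stated here as an assumption, is that the weighted term frequency sums the occurrences of t under each of the m paths, scaled by that path's weight: tf_w(e, t) = w(l_1)·tf(l_1, t) + … + w(l_m)·tf(l_m, t). A minimal Python sketch of that reading, with made-up counts and weights:

```python
def weighted_term_frequency(path_counts, path_weight):
    """Assumed form: sum over paths l_i of w(l_i) * (occurrences of t under l_i).

    path_counts: {path l_i: occurrences of term t under l_i}
    path_weight: callable returning w(l) for a path l
    """
    return sum(path_weight(path) * count for path, count in path_counts.items())

# Occurrences of 'XML' below an <article> element, keyed by path (toy numbers).
xml_counts = {"fm/kwd": 1, "body/sec/title": 1, "body/sec/para": 3, "bm/ref": 2}

# Hypothetical path weights: keywords and titles count more than references.
weights = {"fm/kwd": 3.0, "body/sec/title": 2.0, "body/sec/para": 1.0, "bm/ref": 0.5}

plain_tf = weighted_term_frequency(xml_counts, lambda path: 1.0)
weighted_tf = weighted_term_frequency(xml_counts, lambda path: weights[path])
```

With all weights equal to 1 this reduces to the ordinary term frequency (7 here); with the hypothetical weights, occurrences in keywords and titles contribute more than those in references.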

18 XML Ranking: how to assign path weight?
A straightforward method
– Assign weights to all possible paths
– Problem: too many combinations!
Our approach
– $l = x_1/x_2/\dots/x_n$
– $w(x)$: user-configurable weight for a node $x$
– Properties of $w(l) = f(w(x_1), w(x_2), \dots, w(x_n))$
$f(w(x_1), \dots, w(x_n))$ is monotonically increasing w.r.t. any $w(x_i)$ ($1 \le i \le n$)
$f(w(x_1), \dots, w(x_n)) = 0$ if any $w(x_i) = 0$ ($1 \le i \le n$)
– Example function
[Figure: article tree with nodes body, fm, bm, kwd, ref, sec, title and para]
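The example function is not preserved in this transcript. One function that satisfies both stated properties for non-negative node weights is the product of the node weights along the path; the Python sketch below uses it purely as an assumption, with hypothetical node weights and a default weight of 1 for unlisted nodes.

```python
from math import prod

DEFAULT_NODE_WEIGHT = 1.0  # nodes without a configured weight (assumption)

def path_weight(path, node_weights):
    """w(l) as the product of w(x_i) over the nodes x_1/x_2/.../x_n of path l."""
    labels = path.split("/")
    return prod(node_weights.get(label, DEFAULT_NODE_WEIGHT) for label in labels)

# Hypothetical user-configured node weights.
node_weights = {"fm": 3.0, "kwd": 2.0, "bm": 0.0}

w_kwd = path_weight("fm/kwd", node_weights)          # both nodes boosted
w_para = path_weight("body/sec/para", node_weights)  # all defaults
w_ref = path_weight("bm/ref", node_weights)          # zero weight on bm silences the path
```

Configuring n node weights instead of one weight per path keeps the number of settings small, and setting a node weight to 0 removes everything beneath that node from the score.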

19 XML Ranking: inverse element frequency
Vector space model
– Term frequency → Weighted term frequency
– Inverse document frequency → Inverse element frequency
Inverse element frequency (IEF)
– $q$: a content and structure query
– $N_1$: # of elements satisfying the structure condition in $q$
– $N_2$: # of elements that satisfy the structure condition in $q$ and contain term $t$
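The IEF formula is missing from this transcript. By analogy with inverse document frequency, the sketch below assumes ief(q, t) = log(N1 / N2), and assumes that an element's score is the sum over query terms of weighted term frequency times IEF. Both forms, and the names `inverse_element_frequency` and `element_score`, are illustrative assumptions rather than the authors' stated definitions.

```python
import math

def inverse_element_frequency(n1, n2):
    """Assumed form log(N1 / N2); n1 = elements matching the structure condition,
    n2 = those that also contain the term."""
    if n2 == 0:
        return 0.0  # the term occurs in no candidate element, so it cannot contribute
    return math.log(n1 / n2)

def element_score(weighted_tfs, iefs):
    """Assumed vector-space style score: sum of weighted tf * ief over query terms."""
    return sum(weighted_tfs[term] * iefs[term] for term in weighted_tfs)

ief_common = inverse_element_frequency(n1=100, n2=100)  # term in every candidate
ief_rare = inverse_element_frequency(n1=100, n2=10)     # term in 10% of candidates
score = element_score({"xml": 9.0, "retrieval": 2.0},
                      {"xml": ief_rare, "retrieval": ief_common})
```

Because N1 counts only elements that satisfy the structure condition of the query, the same term can be rare for one query and common for another.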

20 Roadmap
Background, Challenges & Related work
Our approach: configurable XML retrieval system
Configurable XML Indexing
XML Ranking
Experiments
Conclusion & Future work

21 Experiments: dataset
INEX (Initiative for the Evaluation of XML Retrieval)
– Similar to TREC for text retrieval
Document collections
– Scientific articles from the IEEE Computer Society, 1995–2002
– About 500 MB
– Each article consists of 1500 XML nodes on average
Query set: all 30 CAS queries in INEX 03
Evaluation metric: (exhaustiveness, specificity)

22 Experimental Setup: index configuration
Element content statistics
– # of digit tokens, e.g., ‘1990’
– # of word tokens, e.g., ‘retrieval’
– # of mixed tokens, e.g., ‘A1004s’
– Maximal content length
– Minimal content length
Token selection & index type selection
element | digit# word# mixed# | token selection | content index type
yr | 15502917726 | digit | Number
atl | 597410608694056 | word | Invert
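The statistics listed above can be gathered with a few lines of Python. In this sketch the token classes follow the slide's examples, content length is measured in tokens, and the majority rule in `suggest_configuration` is a hypothetical heuristic; the slides do not say how the selection was actually made.

```python
from collections import Counter

def token_kind(token):
    """Classify a token as 'digit' (e.g. 1990), 'word' (e.g. retrieval) or 'mixed' (e.g. A1004s)."""
    if token.isdigit():
        return "digit"
    if token.isalpha():
        return "word"
    return "mixed"

def content_statistics(texts):
    """Token-kind counts and min/max content length (in tokens) over a non-empty list of texts."""
    counts = Counter()
    lengths = []
    for text in texts:
        tokens = text.split()
        counts.update(token_kind(t) for t in tokens)
        lengths.append(len(tokens))
    return {"digit": counts["digit"], "word": counts["word"], "mixed": counts["mixed"],
            "min_len": min(lengths), "max_len": max(lengths)}

def suggest_configuration(stats):
    """Hypothetical rule: index as numbers when digit tokens dominate, else as an inverted index."""
    if stats["digit"] > stats["word"] + stats["mixed"]:
        return {"token_selection": "digit", "index_type": "Number"}
    return {"token_selection": "word", "index_type": "Invert"}

# Toy contents for two element types (made-up values).
yr_stats = content_statistics(["1990", "2003", "c1998"])
atl_stats = content_statistics(["XML retrieval", "Indexing A1004s data", "Top 10 queries"])
yr_config = suggest_configuration(yr_stats)
atl_config = suggest_configuration(atl_stats)
```

On these toy inputs the year element is indexed numerically and the article title goes into an inverted index, mirroring the two rows shown on the slide.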

23 Experimental Setup: index configuration
Element content statistics
– # of digit tokens, e.g., ‘1990’
– # of word tokens, e.g., ‘retrieval’
– # of mixed tokens, e.g., ‘A1004s’
– Maximal content length
– Minimal content length
Token selection & index types
Stop word removal
element | Min content length, Max content length | Remove stop word?
p | 122767 | Yes
fnm | 1123 | No

24 Experimental Setup: weight configuration
Weight configurations A & B (nodes that are indexed but not listed below have the default weight 1).
config | fm | bdy | bm | atl | abs | kwd | st
A | 3 | 1 | 1 | 3 | 1 | 2 | 3
B | 5 | 1 | 0 | 5 | 1 | 3 | 5
[Figure: tree representation of the schema for the dataset, with nodes article, fm, bdy, bm, tig, abs, kwd, atl, sec, st, ss1, …, annotated with Level 1 and Level 2]

25 Experimental Setup: weight configuration
Example 1: //article[about(., XML retrieval)]
Example 2: //vita[about(., XML retrieval)]
[Figure: article tree with nodes bdy, bm, vita, sec and p; “XML retrieval...” appears as content under p]

26 Experimental Setup: weight configuration
Weight configurations A & B (nodes that are indexed but not listed below have the default weight 1).
config | fm | bdy | bm | atl | abs | kwd | st
A | 3 | 1 | 1 | 3 | 1 | 2 | 3
B | 5 | 1 | 0 | 5 | 1 | 3 | 5
[Figure: tree representation of the schema for the dataset, with nodes article, fm, bdy, bm, tig, abs, kwd, atl, sec, st, ss1, …]

27 Experimental Results: run 1
CAS Topic 65: //article[./fm//yr > '1998' AND about(./, '"image retrieval"')]
Strict quantization
High precision at low recall regions
Adjusting weights properly improves retrieval performance
[Figure: precision-recall plot; the values 0.3 and 0.55 are marked on it]

28 Experimental Results: run 2
All 30 CAS topics with weight configuration B
Strict quantization
Avg. precision: 0.3309
[Figure: precision-recall curve, with Recall and Precision as axes]

29 Roadmap
Background, Challenges & Related work
Our approach: configurable XML retrieval system
Configurable XML Indexing
XML Ranking
Experiments
Conclusion & Future work

30 Conclusion
A configurable XML retrieval system that fully utilizes XML structures to improve retrieval performance
– Element-specific index configurations
– Configurable XML ranking: weighted term frequency and inverse element frequency
Experimental results
– High precision at low recall regions
– Adjusting weights properly improves retrieval performance
Future work
– Automate index configurations
– Optimize weight configurations

31 Acknowledgement
The Initiative for the Evaluation of XML Retrieval (INEX)
– http://qmir.dcs.qmw.ac.uk/inex/
Questions?
Thank You!

