Published by Marjorie Owens. Modified over 9 years ago.
1
Hierarchical Indexing and Flexible Element Retrieval for Structured Document
April 14, 2003
Hang Cui, School of Computing, NUS
Ji-Rong Wen, Microsoft Research Asia
Tat-Seng Chua, School of Computing, NUS
2
Presentation for ECIR’03, Pisa, Italy
Outline
Motivations and problems
Hierarchical index propagation and pruning
Flexible element retrieval
Evaluation
Conclusions
4
Motivations
More structured and semi-structured documents are appearing on the Web.
Users want to explore more of the document structure.
–Access only the relevant parts of a document, i.e. sections or paragraphs.
Conventional IR cannot help.
–The document is the smallest result unit. Not question answering!
–It cannot provide views of the internal document structure.
5
Encarta Articles – An Example
Online encyclopedia.
Well-structured XML documents.
Nodes (elements): documents, sections and paragraphs (leaf nodes).
Text is contained in paragraphs, which make up sections and documents.
6
Problems
A document covers multiple aspects of a central topic.
–Each aspect is represented by sections or paragraphs.
–Users usually want just one of the aspects.
How can the document structure be used to achieve this?
–Flexible element retrieval returns elements at an arbitrary level rather than only leaf nodes.
–Each element, at every level, should have a proper keyword description.
7
Our contributions
Building an index with the same hierarchical structure as the document.
–Not just indexing the leaf nodes.
Keyword propagation mechanism.
–Assign proper keywords to the nodes at each level (push broad-sense keywords to upper-level nodes).
–Why can’t the weight propagation technique be used?
–Considering the terms’ distributions.
Flexible element retrieval according to queries.
–With the hierarchical index, the system can access elements at an arbitrary level (documents, sections or paragraphs) w.r.t. queries.
–Avoids assembling separate text fragments, as retrieval over leaf nodes only would require.
9
Hierarchical Indexing for Structured Documents
Term weighting for the leaf nodes and the intermediate elements.
–Combining the statistics of term occurrences and term distributions.
–Term selection threshold.
Propagation and pruning of the index terms.
10
Term Weighting for Paragraphs
Paragraphs are “atomic”: they have no child elements.
Consider term occurrences only, using the TFIDF measure.
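The exact TFIDF formula from this slide is not in the transcript. As a rough sketch only, the following assumes a textbook variant (raw term count times log(N / df), with paragraphs as the counting unit) and whitespace tokenization; the function name `tfidf_weights` and the sample paragraphs are made up for illustration.

```python
import math
from collections import Counter

def tfidf_weights(paragraphs):
    """Return one {term: weight} dict per paragraph using plain TF-IDF.

    Assumption: tf is the raw count and idf = log(N / df). The talk's own
    formula may differ.
    """
    tokenized = [p.lower().split() for p in paragraphs]
    n = len(tokenized)
    # Document frequency: number of paragraphs that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

# Hypothetical sample paragraphs.
paras = ["qing dynasty history", "qing emperor", "london fleet street"]
w = tfidf_weights(paras)
```

A term that occurs in every paragraph gets weight 0 under this variant, which is one reason real systems often smooth the idf factor.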
11
Term Weighting for Intermediate Elements
Document-level or section-level elements.
Takes into account the term distributions in the immediate-descendant elements.
12
Measuring Term Distributions
Entropy-like measurement
–How evenly a term is distributed across all the immediate-descendant elements of an intermediate element.
–Normalization factor: the theoretical maximum entropy.
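One plausible reading of the entropy-like measure is Shannon entropy over a term's share of occurrences in each child, divided by the maximum entropy log(k) for k children. The sketch below follows that reading; the function name `distribution_evenness` and the handling of edge cases (no occurrences, a single child) are assumptions, not taken from the slides.

```python
import math

def distribution_evenness(counts):
    """Normalized entropy of a term's counts over an element's children.

    Returns a value in [0, 1]: 1 when the term is spread evenly over all
    children, 0 when it occurs in a single child (or not at all).
    """
    total = sum(counts)
    k = len(counts)
    if total == 0 or k < 2:
        return 0.0  # assumed convention for degenerate cases
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(k)  # log(k) is the maximum possible entropy
```

For example, counts of `[2, 2, 2, 2]` over four sections give a value of about 1, while `[5, 0, 0]` gives 0, so broadly used terms score higher than terms confined to one child.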
13
Term Selection
Term weights are normalized to the range between 0 and 1 for the purpose of comparison.
Compare the terms within one element.
–Select the terms whose weights exceed a threshold as the index terms for this element.
Repeat this process from the bottom up.
–Broader-sense terms can be propagated to upper-level elements.
–Term pruning avoids duplication of index terms.
14
Terms Propagation and Pruning Algorithm
1. For each leaf element, i.e. paragraph, calculate the weights of all its terms.
2. For each composite element Ej at the next upper level, calculate the terms’ weights by measuring the terms’ occurrences in this element and their distributions in the immediate-descendant elements of Ej.
3. For term ti, if Weight(ti, Ej) >= average(Ej) + std_dev(Ej), then ti is selected as an index term of Ej, and all the descendant elements of Ej eliminate ti from their index term lists. This process is called index term propagation and pruning.
4. Recursively perform step 2 onwards until the root node (i.e., the document) is reached.
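The four steps above can be sketched as a bottom-up tree walk. This is an illustration under several assumptions: the `Element` class is hypothetical, leaves start out indexed by all of their terms, a composite element's weight is taken to be occurrence count times normalized entropy (the slide's actual weighting formula is not in the transcript), and the standard deviation is the population one.

```python
import math
from collections import Counter
from statistics import mean, pstdev

class Element:
    """Hypothetical tree node: leaves carry text, composites carry children."""
    def __init__(self, name, text="", children=()):
        self.name = name
        self.children = list(children)
        self.counts = Counter(text.lower().split())
        self.index_terms = set()

def evenness(counts):
    """Entropy of the counts divided by the maximum entropy log(k)."""
    total, k = sum(counts), len(counts)
    if total == 0 or k < 2:
        return 0.0
    h = -sum(c / total * math.log(c / total) for c in counts if c > 0)
    return h / math.log(k)

def prune(element, term):
    """Remove a propagated term from every descendant's index term list."""
    for child in element.children:
        child.index_terms.discard(term)
        prune(child, term)

def build_index(element):
    """Bottom-up pass (steps 1-4); returns the term counts of the subtree."""
    if not element.children:
        # Step 1 (simplified): a leaf starts out indexed by all of its terms.
        element.index_terms = set(element.counts)
        return element.counts
    child_counts = [build_index(c) for c in element.children]
    total = sum(child_counts, Counter())
    if not total:
        return total
    # Step 2 (assumed formula): occurrences times distribution evenness.
    weights = {t: n * evenness([cc[t] for cc in child_counts])
               for t, n in total.items()}
    # Step 3: select terms at or above average + std_dev, then prune below.
    values = list(weights.values())
    threshold = mean(values) + pstdev(values)
    for term, weight in weights.items():
        if weight >= threshold and weight > 0:  # skip all-zero weights
            element.index_terms.add(term)
            prune(element, term)
    return total

# Hypothetical section with three paragraphs.
p1 = Element("p1", "qing dynasty emperor")
p2 = Element("p2", "qing dynasty trade")
p3 = Element("p3", "qing army")
section = Element("section", children=[p1, p2, p3])
build_index(section)
```

In this toy input only "qing" occurs evenly in all three paragraphs, so it is the one term pushed up to the section and removed from the paragraphs, which keep their more specific terms.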
15
An illustration of the process (figure not reproduced)
17
Flexible Element Retrieval
No term is duplicated along one path.
The path of an element includes all the elements from this node to the root.
Ranking relevant elements is equivalent to ranking their paths.
18
Path Ranking Algorithm
1. Find all elements that contain at least one query term.
2. Get the paths of all candidate elements and merge them: two paths are merged into one if one is part of the other.
3. Assign the weights of the query terms in the elements to their respective paths.
4. Rank these paths using the equation on the previous slide.
5. Return, in descending order, the elements corresponding to the ranked paths whose ranks satisfy the pre-defined threshold.
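A minimal sketch of the five steps follows. The ranking equation itself is not in the transcript, so a plain sum of query-term weights along the path stands in for it here; that scoring, the `rank_paths` name, the tuple-based path representation and the sample index are all assumptions. The sketch returns ranked paths and leaves open which element on a merged path is shown to the user.

```python
def rank_paths(index, query_terms, threshold=0.0):
    """Rank merged paths for a query.

    `index` maps an element's root-to-element path (a tuple of names)
    to that element's {term: weight} index entries.
    """
    query = set(query_terms)
    # Step 1: elements indexed by at least one query term.
    candidates = [p for p, terms in index.items() if query & set(terms)]
    # Step 2: merge paths by dropping any path that is a prefix of a
    # longer candidate path.
    merged = [p for p in candidates
              if not any(q != p and q[:len(p)] == p for q in candidates)]
    ranked = []
    for path in merged:
        # Step 3: collect query-term weights from every element on the path.
        score = 0.0
        for depth in range(1, len(path) + 1):
            entries = index.get(path[:depth], {})
            score += sum(w for t, w in entries.items() if t in query)
        # Step 4 (assumed scoring): the sum replaces the slide's equation.
        if score >= threshold:
            ranked.append((score, path))
    # Step 5: best-scoring paths first.
    return sorted(ranked, reverse=True)

# Hypothetical hierarchical index with no term repeated along a path.
index = {
    ("China",): {"china": 0.9},
    ("China", "History"): {"qing": 0.6, "dynasty": 0.5},
    ("China", "History", "p7"): {"manchu": 0.4},
    ("Qing Dynasty",): {"qing": 0.9, "dynasty": 0.8},
    ("London",): {"london": 0.9},
}
result = rank_paths(index, ["manchu", "qing", "dynasty"])
```

Because pruning leaves each term at only one level of a path, summing along the path does not double-count a term, which is what makes ranking paths a reasonable proxy for ranking elements.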
19
Result Browsing
The prototype interface can
–Highlight the relevant parts of the selected document.
–Allow the user to browse results in the original document structure.
Query example: “the Manchu Qing Dynasty”
–A section in “China”
–The whole document for “Qing Dynasty”
20
Prototype Interface
22
Evaluation
Data set
–41,942 XML documents on various topics from the Encarta online encyclopedia.
–Ten experimental queries
  Each can be answered by only part of the relevant document, e.g. “Fleet Street in London” is answered by a paragraph of the document London.
  Relevance judgments were made by human assessors: for each query, there is a group of paragraphs representing the relevant sections or paragraphs.
–Baseline system (TFIDF Para)
  Indexes paragraph nodes only.
  Applies the TFIDF measure to weight terms and uses cosine similarity to retrieve answers.
23
Performance Evaluation
Precision, recall and F-value are used as performance metrics.
Two sets of hierarchical indices
–One utilizing titles and one without considering titles.
Answer selection threshold
–Fixed values from 0.1 to 0.9, as used by most existing systems.
–Dynamic thresholds: Avg and Avg + Std_Dev.
Our system is compared with TFIDF Para using different answer selection thresholds.
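As a small sketch of how these metrics and the dynamic thresholds fit together, the code below selects answers whose scores reach Avg (or Avg + Std_Dev) and scores them against a set of relevant paragraphs. It assumes the balanced F-measure (the slides only say "F-value"), a population standard deviation, and evaluation at paragraph level; all function names and the sample scores are hypothetical.

```python
from statistics import mean, pstdev

def dynamic_threshold(scores, add_std_dev=False):
    """Avg, or Avg + Std_Dev, of the candidate scores for one query."""
    base = mean(scores)
    return base + pstdev(scores) if add_std_dev else base

def select_answers(scored, threshold):
    """Keep the items whose score reaches the threshold."""
    return {item for item, score in scored.items() if score >= threshold}

def precision_recall_f(retrieved, relevant):
    """Set-based precision, recall and balanced F-value."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value

# Hypothetical scores for one query and its judged-relevant paragraphs.
scored = {"p1": 0.9, "p2": 0.7, "p3": 0.2, "p4": 0.2}
relevant = {"p1", "p2", "p5"}
answers = select_answers(scored, dynamic_threshold(list(scored.values())))
precision, recall, f_value = precision_recall_f(answers, relevant)
```

A dynamic threshold adapts to the score distribution of each query, whereas a fixed cut-off such as 0.5 means different things for queries with mostly high or mostly low scores.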
24
Results of Performance Comparison
The improvements over the baseline are substantial.
–Precision improves by 48.83% (w/ titles) and 41.67% (w/o titles) on average.
–F-value improves by 56.02% (w/ titles) and 40.89% (w/o titles).
–Recall decreases slightly under some threshold settings (the index term selection threshold is too strict).
User feedback
–Our system finds more meaningful units instead of separate paragraphs, including some paragraphs that do not actually contain query terms.
–Users are clear about their context when browsing the answers within the original document structure.
25
Threshold Setting
Our system is less sensitive to the answer selection threshold settings.
A dynamic threshold is a good alternative for such structured document retrieval.
27
Conclusions
A novel hierarchical index propagation and pruning mechanism generates a structured index.
Flexible element retrieval, which returns relevant elements at an arbitrary level, is realized on the hierarchical index.
It can satisfy users better than previous passage retrieval systems.
More work can be done on generating hierarchical indices for federated search.
28
Thanks!