Published by David Short. Modified over 6 years ago.
1
An Efficient Algorithm for Incremental Update of Concept Space
Presented by Felix Cheung
2
Overview
Background
Introduction to Concept Space
The Problem of Concept Space
The Idea of the Solution
Performance Evaluation
Conclusion
3
Background
Vocabulary Problem
Retrieval failure is caused by the variety of terms people use, such as HIV vs. AIDS
Two people choose the same word with a probability of less than 20%
One solution: thesauri
4
Thesauri
A thesaurus is a book of words grouped together according to connections between their meanings
It helps to solve the vocabulary problem: if a search retrieves too few documents, the user can expand the query
The problem of thesauri: manual construction is very complex
5
Introduction to Concept Space
A concept space is an automatic approach to thesaurus construction
Given terms j and k, a concept space holds the associations Wjk and Wkj
The associations are asymmetric: Wjk and Wkj may differ
An association is a value between 0 and 1
6
Concept Space Construction
The construction of a concept space consists of two phases
Phase 1, automatic indexing: a document collection is processed to build inverted lists
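A minimal sketch of the indexing phase. It assumes each document is already a list of stemmed, stop-word-free terms. The input format and the function name `build_inverted_lists` are illustrative, not taken from the slides.

```python
from collections import Counter, defaultdict


def build_inverted_lists(docs):
    """Map each term to a list of (doc_id, tf) postings.

    `docs` is assumed to be a dict of doc_id -> list of terms
    (a hypothetical input format chosen for this sketch).
    """
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        # Counter gives the term frequency (tf) of each term in this document.
        for term, tf in sorted(Counter(docs[doc_id]).items()):
            inverted[term].append((doc_id, tf))
    return dict(inverted)


docs = {1: ["a", "b", "a"], 2: ["b", "c"], 3: ["a", "c", "c"]}
lists = build_inverted_lists(docs)
```

Each posting stores the document id with the term frequency, the two quantities that the co-occurrence analysis phase needs.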
7
Inverted Lists
[Figure: inverted lists for terms a and b; each entry holds a doc. id and a tf]
8
Concept Space Construction
The construction of a concept space consists of two phases
Phase 2, co-occurrence analysis: the association of every term pair is computed from the TFIDF sums and the weighting factor
9
The sum of TFIDF scores
The sum of all TFIDF scores of keyword j over all the documents is computed from:
the term frequency of j in doc i
the number of docs with j
the number of docs in the db
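The three quantities listed match the usual TFIDF weight. The block below is a sketch under that assumption, not the slide's own equation. The symbol tf_ij appears later in the deck; the names df_j (number of docs with j) and N (number of docs in the db) are introduced here for illustration, and the base of the logarithm is left unspecified.

```latex
\sum_{i=1}^{N} d_{ij},
\qquad
d_{ij} = tf_{ij} \times \log\frac{N}{df_j}
```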
10
Weighting Factor
The weighting factor is used to penalize general terms
11
The sum of co-occurrence TFIDF scores
The sum of all co-occurrence TFIDF scores of keywords j and k over all the documents is computed from:
the number of docs with both j and k
min(tfij, tfik), the smaller of the two term frequencies in doc i
the number of docs in the db
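The pieces above can be combined into one association value. The sketch below assumes the formulation commonly used in the concept-space literature. None of these three formulas is confirmed by this transcript:

- W_jk = (sum of d_ijk / sum of d_ij) × WeightingFactor(k)
- d_ijk = min(tf_ij, tf_ik) × log(N / df_jk)
- WeightingFactor(k) = log(N / df_k) / log N

The function name and the inverted-list format (term -> list of (doc_id, tf)) are also illustrative.

```python
import math


def association(inverted, n_docs, j, k):
    """Asymmetric association W_jk under the assumed formulas (see lead-in)."""
    if n_docs <= 1:
        return 0.0  # log(n_docs) would be zero; nothing meaningful to compute
    post_j = dict(inverted[j])  # doc_id -> tf of j
    post_k = dict(inverted[k])  # doc_id -> tf of k
    df_j, df_k = len(post_j), len(post_k)
    both = post_j.keys() & post_k.keys()  # docs containing both j and k
    df_jk = len(both)
    if df_jk == 0:
        return 0.0
    # Sum of TFIDF scores of j over all documents that contain j.
    sum_d_ij = sum(tf * math.log(n_docs / df_j) for tf in post_j.values())
    if sum_d_ij == 0:
        return 0.0  # j occurs in every document
    # Sum of co-occurrence TFIDF scores, using min(tf_ij, tf_ik).
    sum_d_ijk = sum(
        min(post_j[i], post_k[i]) * math.log(n_docs / df_jk) for i in both
    )
    # Weighting factor that penalizes general terms k.
    weighting = math.log(n_docs / df_k) / math.log(n_docs)
    return (sum_d_ijk / sum_d_ij) * weighting


inverted = {"a": [(1, 2), (3, 1)], "b": [(1, 1), (2, 1)], "c": [(2, 1), (3, 2)]}
w_ab = association(inverted, 4, "a", "b")
w_ba = association(inverted, 4, "b", "a")
```

On this toy data the two directions differ, which illustrates the asymmetry of Wjk and Wkj.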
12
A Complete Concept Space
A complete concept space is gigantic
Each term may have a few thousand related terms, which would overwhelm searchers
Only highly related terms are suggested
13
Highly related terms
There are 1,708,551 co-occurrence pairs
The maximum number of related terms per term is 100
If a term has more than 100 related terms, only the 100 with the highest association values (the strong associations) are retained
A concept space that contains only these highly ranked associations is called a partial concept space
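Keeping only the strongest associations per term can be sketched as a top-n selection. The data layout (term j -> {term k: Wjk}) and the name `keep_strongest` are assumptions made for this example.

```python
import heapq


def keep_strongest(associations, n=100):
    """Keep, for every term j, only its n highest-valued associations."""
    return {
        j: dict(heapq.nlargest(n, related.items(), key=lambda kv: kv[1]))
        for j, related in associations.items()
    }


full = {"hiv": {"aids": 0.9, "virus": 0.6, "patient": 0.2}}
partial = keep_strongest(full, n=2)
```

With n=2 the weakest association of the example term is dropped. The slides use n = 100.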
14
The Problem of Concept Space
In a dynamic environment the collection changes over time, so the concept space must be updated
The simplest approach: reconstruct it from scratch
Disadvantage: time consuming
Goal: to study the incremental update problem of partial concept spaces
15
The Definition
A document collection (D) has a constructed concept space (CSD)
A set of documents is added to D, giving a new document collection (D’)
CSD is updated to give an updated concept space (CSD’)
Only the n strong associations of each term are kept
16
The Idea of the Pruning Algorithm
Avoid scanning the inverted lists directly
Calculate an easily computed upper bound of W’jk
Compare the upper bound with a threshold for term j (written τj here)
The property of τj: if the upper bound is below τj, W’jk cannot be a strong association
17
The upper bound
18
How to determine the threshold τj
Compute the n associations W’jki (n ≥ i ≥ 1) for which Wjki is strong w.r.t. the document collection D
Set τj = min(W’jki)
Given a term p, if τj is greater than the upper bound of W’jp, then W’jp is smaller than all n W’jki’s
19
Pruning Algorithm
For each term j:
Compute the association W’jki w.r.t. D’ if Wjki is strong w.r.t. D
Determine the threshold τj among the n such associations of term j
Compute the upper bound of W’jp if Wjp is weak w.r.t. D
Compute W’jp only if its upper bound is at least τj
Keep only the n largest associations of j
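The steps for a single term j can be sketched as follows. The upper-bound formula itself is not part of this transcript, so the sketch takes it as a caller-supplied function. `exact`, `upper_bound` and `update_term` are hypothetical names. The only property assumed is that `upper_bound(j, p)` is never smaller than the true W’jp.

```python
import heapq


def update_term(j, old_strong, candidates, exact, upper_bound, n):
    """Return the n strongest associations of term j w.r.t. D'.

    old_strong:  terms k whose W_jk was strong w.r.t. D
    candidates:  terms p whose W_jp was weak w.r.t. D
    exact:       exact(j, k) computes W'_jk (expensive, scans inverted lists)
    upper_bound: upper_bound(j, p) is a cheap value >= W'_jp (assumed)
    """
    # Step 1: recompute the formerly strong associations exactly.
    new = {k: exact(j, k) for k in old_strong}
    # Step 2: the threshold is the smallest of them; with fewer than n
    # strong associations nothing can be pruned safely.
    threshold = min(new.values()) if len(new) >= n else 0.0
    # Steps 3 and 4: compute a weak association only if its bound passes.
    for p in candidates:
        if upper_bound(j, p) >= threshold:
            new[p] = exact(j, p)
    # Step 5: keep the n largest associations.
    return dict(heapq.nlargest(n, new.items(), key=lambda kv: kv[1]))


true_w = {("j", "a"): 0.8, ("j", "b"): 0.5, ("j", "c"): 0.7, ("j", "d"): 0.1}
bounds = {("j", "c"): 0.9, ("j", "d"): 0.3}
calls = []


def exact(j, k):
    calls.append((j, k))  # record which associations were actually computed
    return true_w[(j, k)]


result = update_term("j", ["a", "b"], ["c", "d"], exact,
                     lambda j, p: bounds[(j, p)], n=2)
```

In this toy run the association with "d" is never computed, because its bound (0.3) is below the threshold (0.5), while "c" replaces "b" among the strong associations.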
20
Quantization
The amount of storage required is very large
High precision is not needed
Some quantization techniques can be applied to reduce the storage requirement
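One simple possibility is uniform quantization of an association value in [0, 1] to a small integer code. The slides do not say which techniques were used, so the 8-bit uniform scheme and the function names below are assumptions for illustration only.

```python
def quantize(w, bits=8):
    """Map an association value in [0, 1] to an integer code of `bits` bits."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("association values are expected in [0, 1]")
    levels = (1 << bits) - 1  # 255 for 8 bits
    return round(w * levels)


def dequantize(q, bits=8):
    """Recover an approximate association value from its integer code."""
    levels = (1 << bits) - 1
    return q / levels


code = quantize(0.37)
approx = dequantize(code)
```

With 8 bits each value fits in one byte, and the round-trip error is at most half a quantization step (0.5 / 255).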
21
Performance Evaluation
“The Ohsumed Test Collection” is used
348,566 abstracts with terms
169 MB in size (after stop-word removal and stemming)
The algorithm is run on a 700 MHz Pentium III Xeon machine
22
Experiment I
Half of the documents are picked as the original collection D
The other half of the documents is partitioned into 10 equal parts
These parts are added to D successively and cumulatively
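The setup can be sketched as below. How the original half was picked is not stated, so taking the first half of the id list is an assumption, as are the function name and the handling of leftover documents.

```python
def cumulative_batches(doc_ids, parts=10):
    """Split doc ids into an original half and `parts` batches to add later."""
    half = len(doc_ids) // 2
    original, rest = doc_ids[:half], doc_ids[half:]
    size = len(rest) // parts
    batches = [rest[i * size:(i + 1) * size] for i in range(parts)]
    # Any leftover documents (when the split is uneven) join the last batch.
    batches[-1].extend(rest[parts * size:])
    return original, batches


original, batches = cumulative_batches(list(range(40)))

# Collection size after each successive, cumulative addition.
sizes = []
current = len(original)
for batch in batches:
    current += len(batch)
    sizes.append(current)
```

After the tenth addition the collection contains every document, so the last entry of `sizes` equals the full collection size.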
23
Experiment I Result (I)
24
Experiment I Result (II)
25
Experiment I Result (III)
26
Experiment I Result (IV)
27
Experiment I Result (V)
28
Experiment II
Another factor that affects performance is the size of the added document set
The number of added documents varies from 17,400 to 174,000
29
Experiment II Result
30
Storage requirement
31
Conclusion
The concept space approach is a very useful tool for information retrieval
Construction and incremental update are very time consuming
In many applications, only a partial concept space is needed
To reduce the storage requirement, some quantization methods are proposed
32
Conclusion (Cont’d)
The pruning algorithms are effective in avoiding the computation of weak associations
A 9-fold speedup can be achieved