An Efficient Algorithm for Incremental Update of Concept space

Presented by Felix Cheung

Overview
Background
Introduction to Concept Space
The Problem of Concept Space
The Idea of the Solution
Performance Evaluation
Conclusion

Background: Vocabulary Problem
Retrieval failure is caused by the variety of terms used for the same concept, such as HIV vs. AIDS.
Two people choose the same word for a concept less than 20% of the time.
One solution: thesauri.

Thesauri
A thesaurus is a book of words grouped together according to the connections between their meanings.
It helps solve the vocabulary problem: if a search retrieves too few documents, the user can expand the query.
The problem with thesauri: manual construction is very complex.

Introduction to Concept Space
It is an automatic approach to thesaurus construction.
Given terms j and k, a concept space holds the associations Wjk and Wkj.
The associations are asymmetric: Wjk and Wkj are generally not equal.
An association is a value between 0 and 1.

Concept Space Construction
The construction of a concept space consists of two phases.
Phase 1, automatic indexing: the document collection is processed to build inverted lists.

Inverted Lists
[Figure: inverted lists for terms a and b; each entry holds a doc. id and a term frequency (tf)]
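As a minimal sketch of the indexing phase, the inverted lists can be built as below. The function name `build_inverted_lists` and the input layout (a dict from doc id to a list of already indexed terms) are assumptions made for illustration, not something taken from the slides.

```python
from collections import Counter, defaultdict

def build_inverted_lists(docs):
    """Map each term to a list of (doc_id, tf) pairs.

    `docs` is assumed to be a dict of doc_id -> list of indexed terms
    (stop-word removal and stemming are assumed to be done elsewhere).
    """
    inverted = defaultdict(list)
    for doc_id in sorted(docs):
        # Counter gives the term frequency of every term in this document.
        for term, tf in sorted(Counter(docs[doc_id]).items()):
            inverted[term].append((doc_id, tf))
    return dict(inverted)

# Toy collection with the two terms shown in the figure, a and b.
docs = {1: ["a", "b", "a"], 2: ["b"], 3: ["a", "b", "b"]}
inverted = build_inverted_lists(docs)
print(inverted["a"])
print(inverted["b"])
```

Each list is ordered by doc id, so two lists can later be merged in a single pass to find the documents in which both terms occur.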

Concept Space Construction
The construction of a concept space consists of two phases.
Phase 2, co-occurrence analysis: the association of every term pair is computed from the two sums of TFIDF scores and the weighting factor described on the following slides.

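The sum of TFIDF scores
To compute the sum of all TFIDF scores of keyword j over all the documents: $\sum_{i=1}^{N} d_{ij}$, with $d_{ij} = tf_{ij} \times \log(N / df_j)$,
where $tf_{ij}$ is the term frequency of j in doc i, $df_j$ is the number of docs with j, and $N$ is the number of docs in the db.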

Weighting Factor
The weighting factor is used to penalize general terms.
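The formula on this slide is not preserved in the transcript. In the usual concept-space formulation, which is assumed here and not confirmed by this text, the weighting factor of a term k would be:

```latex
\mathrm{WeightingFactor}(k) = \frac{\log\left(N / df_k\right)}{\log N}
```

Here $df_k$ (number of docs with k) and $N$ (number of docs in the db) are the same quantities used for the TFIDF scores. Under this assumed form, a term that occurs in a single document gets a factor of 1, and a term that occurs in every document gets a factor of 0, which is what penalizing general terms means.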

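The sum of co-occurrence TFIDF scores
To find the sum of all co-occurrence TFIDF scores of keywords j and k over all the documents: $\sum_{i=1}^{N} d_{ijk}$, with $d_{ijk} = tf_{ijk} \times \log(N / df_{jk})$,
where $df_{jk}$ is the number of docs with both j and k, $tf_{ijk} = \min(tf_{ij}, tf_{ik})$, and $N$ is the number of docs in the db.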
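The co-occurrence analysis phase can be sketched as follows. The exact association equation is not preserved in this transcript, so the sketch assumes the usual concept-space form: Wjk is the sum of co-occurrence TFIDF scores of j and k, divided by the sum of TFIDF scores of j, multiplied by a weighting factor of k taken to be log(N / df_k) / log N. The function name `associations` and its arguments are hypothetical.

```python
import math
from collections import defaultdict

def associations(inverted, n_docs):
    """Return w[(j, k)] for every ordered pair of co-occurring terms.

    `inverted` maps term -> list of (doc_id, tf); `n_docs` is N.
    The formulas are an assumed reconstruction (see the text above).
    """
    df = {t: len(postings) for t, postings in inverted.items()}
    # Sum of TFIDF scores of each term over all documents.
    sum_d = {t: sum(tf for _, tf in postings) * math.log(n_docs / df[t])
             for t, postings in inverted.items()}

    # Regroup by document so that co-occurring terms can be paired.
    by_doc = defaultdict(dict)
    for t, postings in inverted.items():
        for doc_id, tf in postings:
            by_doc[doc_id][t] = tf

    sum_tf_jk = defaultdict(int)  # sum over docs of min(tf_ij, tf_ik)
    df_jk = defaultdict(int)      # number of docs with both j and k
    for terms in by_doc.values():
        names = sorted(terms)
        for a in range(len(names)):
            for b in range(a + 1, len(names)):
                pair = (names[a], names[b])
                sum_tf_jk[pair] += min(terms[pair[0]], terms[pair[1]])
                df_jk[pair] += 1

    w = {}
    for (j, k), s in sum_tf_jk.items():
        sum_d_jk = s * math.log(n_docs / df_jk[(j, k)])
        for x, y in ((j, k), (k, j)):
            if sum_d[x] > 0:  # a term found in every document has zero TFIDF
                factor = math.log(n_docs / df[y]) / math.log(n_docs)
                w[(x, y)] = sum_d_jk / sum_d[x] * factor
    return w

# Toy inverted lists over a database of 4 documents.
toy = {"hiv": [(1, 2), (2, 1)], "aids": [(1, 1), (2, 1), (3, 1)]}
w = associations(toy, n_docs=4)
print(w[("hiv", "aids")], w[("aids", "hiv")])
```

The two printed values differ, which illustrates why the associations of a term pair are stored in both directions.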

A Complete Concept Space
A complete concept space is gigantic.
Each term may have a few thousand related terms, which would overwhelm searchers.
Only highly related terms are suggested.

Highly related terms
There are 1,708,551 co-occurrence pairs.
The maximum number of related terms per term is 100.
If a term has more than 100 related terms, only the 100 with the highest association values are retained (strong associations).
A concept space that contains only these highly ranked associations is called a partial concept space.
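Trimming a complete concept space down to a partial one can be sketched as below. The function name `partial_concept_space` and the dict layout (ordered term pair mapped to association value) are assumptions for illustration.

```python
import heapq
from collections import defaultdict

def partial_concept_space(w, n=100):
    """Keep, for every term j, only its n strongest associations.

    `w` maps (j, k) -> association value; the result maps
    j -> list of (value, k), strongest first.
    """
    by_term = defaultdict(list)
    for (j, k), value in w.items():
        by_term[j].append((value, k))
    return {j: heapq.nlargest(n, related) for j, related in by_term.items()}

# Toy example with n = 2 instead of 100.
w = {("a", "b"): 0.9, ("a", "c"): 0.2, ("a", "d"): 0.5, ("b", "a"): 0.4}
partial = partial_concept_space(w, n=2)
print(partial)
```

`heapq.nlargest` avoids sorting the full list of related terms when only the top n are needed.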

The Problem of Concept Space
In a dynamic environment the collection changes with time, so the concept space must be updated.
The simplest approach is to reconstruct it from scratch.
Disadvantage: time consuming.
Goal: to study the incremental update problem of partial concept spaces.

The Definition
A set of documents is added to a document collection (D), giving a new document collection (D’).
The constructed concept space (CSD) is updated, giving an updated concept space (CSD’).
Only n strong associations are kept.

The Idea of the pruning algorithm
Avoid scanning the inverted lists directly.
Calculate an easily computed upper bound of W’jk.
Compare it with a threshold for term j.
The property of the threshold: if the upper bound is below the threshold of j, W’jk cannot be a strong association.

The upper bound

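How to determine the threshold of term j
Compute the n associations W’jki (n ≥ i ≥ 1) for which Wjki is strong w.r.t. the document collection D.
Set the threshold of j to min(W’jki).
Given a term p, if the threshold of j is greater than the upper bound of W’jp, then W’jp is smaller than all n W’jki's.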

Pruning Algorithm
For each term j:
1. Compute the association W’jki w.r.t. D’ if Wjki is strong w.r.t. D.
2. Determine the threshold of j among the n such associations of term j.
3. Compute the upper bound of W’jp if Wjp is weak w.r.t. D.
4. Compute W’jp if its upper bound is not below the threshold of j.
5. Only keep the n largest associations of j.
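The pruning steps for a single term j can be sketched as follows. The upper-bound formula is not preserved in this transcript, so the sketch takes it as a caller-supplied function. `update_term`, `exact` and `upper_bound` are hypothetical names; `exact(j, k)` stands for the expensive computation of W’jk from the inverted lists of D’, and `upper_bound(j, p)` for any cheap value guaranteed to be at least W’jp.

```python
import heapq

def update_term(j, strong_old, weak_old, exact, upper_bound, n):
    """Return the n strongest associations of term j w.r.t. D'.

    strong_old: terms k whose Wjk was strong w.r.t. D.
    weak_old:   terms p whose Wjp was weak w.r.t. D.
    """
    # Recompute the formerly strong associations exactly.
    new_w = {k: exact(j, k) for k in strong_old}
    # Threshold of j: the smallest of these n values. With fewer than
    # n values nothing can be pruned safely (guard added in this sketch).
    threshold = min(new_w.values()) if len(new_w) >= n else float("-inf")
    # Compute a formerly weak association only if its bound is not below the threshold.
    for p in weak_old:
        if upper_bound(j, p) >= threshold:
            new_w[p] = exact(j, p)
    # Keep the n largest associations of j.
    return heapq.nlargest(n, new_w.items(), key=lambda item: item[1])

# Toy run: made-up exact values and upper bounds for term "a".
true_w = {"b": 0.6, "c": 0.5, "d": 0.7, "e": 0.1}
bounds = {"d": 0.8, "e": 0.3}
calls = []

def toy_exact(j, k):
    calls.append(k)  # record which associations were actually computed
    return true_w[k]

result = update_term("a", ["b", "c"], ["d", "e"], toy_exact,
                     lambda j, p: bounds[p], n=2)
print(result, calls)
```

In the toy run the association with "e" is never computed exactly, because its bound of 0.3 is below the threshold of 0.5. This is where the time saving comes from.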

Quantization
... is in terms of ...
The amount of storage is very large.
High precision is not needed.
Some quantization techniques can be applied to reduce the storage requirement.
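The transcript does not say which quantization technique is used. As one possible illustration, assuming plain uniform quantization, an association in [0, 1] can be stored in a single byte. The names `quantize` and `dequantize` are hypothetical.

```python
def quantize(w, bits=8):
    """Map an association in [0, 1] to an integer code of `bits` bits.

    Uniform quantization is an assumption of this sketch, not
    necessarily the technique used in the presentation.
    """
    levels = (1 << bits) - 1
    w = min(1.0, max(0.0, w))  # clamp to the valid range
    return round(w * levels)

def dequantize(q, bits=8):
    """Recover an approximate association from its integer code."""
    levels = (1 << bits) - 1
    return q / levels

code = quantize(0.803)
print(code, dequantize(code))
```

With 8 bits the rounding error is at most half a quantization step, about 0.002, which is small when only the ranking of related terms matters.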

Performance Evaluation
"The Ohsumed Test Collection" is used.
It contains 348,566 abstracts with terms, 169 MB in size (after stop-word removal and stemming).
The algorithm is run on a 700 MHz Pentium III Xeon machine.

Experiment I
Half of the documents are picked as the original collection D.
The other half of the documents are partitioned into 10 equal parts.
These parts are added to D successively and cumulatively.
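The set-up of Experiment I can be sketched as a generator of growing collections. The name `cumulative_batches` is hypothetical, and taking the first half of the list as D is an assumption; the slides do not say how the half is picked.

```python
def cumulative_batches(doc_ids, parts=10):
    """Yield D, then D plus 1 part, D plus 2 parts, ... up to all documents."""
    half = len(doc_ids) // 2
    original, rest = doc_ids[:half], doc_ids[half:]
    size = len(rest) // parts
    collection = list(original)
    yield list(collection)
    for i in range(parts):
        # The last part absorbs any remainder left by the integer division.
        end = (i + 1) * size if i < parts - 1 else len(rest)
        collection.extend(rest[i * size:end])
        yield list(collection)

batches = list(cumulative_batches(list(range(40))))
print([len(b) for b in batches])
```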

Experiment I Result (I)

Experiment I Result (II)

Experiment I Result (III)

Experiment I Result (IV)

Experiment I Result (V)

Experiment II
Another factor that affects performance is the size of the added document set.
The number of added documents varies from 17,400 to 174,000.

Experiment II Result

Storage requirement

Conclusion
The concept space approach is a very useful tool for information retrieval.
Its construction and incremental update are very time consuming.
In many applications, only a partial concept space is needed.
To reduce the storage requirement, some quantization methods are proposed.

Conclusion (Cont'd)
The pruning algorithms are effective in avoiding the computation of weak associations.
A 9-fold speedup can be achieved.
