Automatically obtain a description for a larger cluster of relevant documents
Identify terms related to the query terms
  Synonyms, stemming variations, terms close to the query terms
Local analysis
  Use correlated terms from the retrieved documents for query expansion
Three types of clusters
Association clusters
  Stems that co-occur frequently inside documents have a synonymity association
  Un-normalized correlation factor: S_{u,v} = C_{u,v}
  Normalized correlation factor
  Build local association clusters as follows
  Find clusters for the query terms
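A minimal sketch of association clusters, under assumptions that the slides themselves do not spell out: C_{u,v} is taken to be the sum, over the local (retrieved) documents, of the product of the frequencies of stems u and v; the normalized factor is taken to be C_{u,v} / (C_{u,u} + C_{v,v} - C_{u,v}); and the cluster of a stem is its n most correlated stems. The function names and the frequency-matrix input are hypothetical.

```python
def correlation_matrix(freq):
    """freq[u][j] is the frequency of stem u in local document j.

    Returns c with c[u][v] = sum over j of freq[u][j] * freq[v][j].
    """
    return [[sum(fu * fv for fu, fv in zip(row_u, row_v)) for row_v in freq]
            for row_u in freq]


def normalize(c):
    """Assumed normalization: s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])."""
    size = len(c)
    s = [[0.0] * size for _ in range(size)]
    for u in range(size):
        for v in range(size):
            denom = c[u][u] + c[v][v] - c[u][v]
            # A zero denominator only occurs for stems that never appear.
            s[u][v] = c[u][v] / denom if denom else 0.0
    return s


def local_cluster(s, u, n):
    """Indices of the n stems v != u with the largest correlation s[u][v]."""
    others = [v for v in range(len(s)) if v != u]
    return sorted(others, key=lambda v: s[u][v], reverse=True)[:n]


# Three stems (rows) counted in three retrieved documents (columns).
freq = [[2, 0, 1],
        [1, 1, 0],
        [0, 3, 1]]
c = correlation_matrix(freq)          # un-normalized factors
s = normalize(c)                      # normalized factors
query_stem_cluster = local_cluster(s, 0, 2)
```

To expand a query, one would call `local_cluster` for each query stem and add the returned stems to the query.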
Metric clusters
  Consider the distance between two terms to compute their correlation factor
  Un-normalized correlation factor: S_{u,v} = C_{u,v}
  Normalized correlation factor
  Build local metric clusters as follows
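A rough sketch of a distance-based correlation, with several assumptions that go beyond the slide text: C_{u,v} is taken as the sum of 1 / distance over every pair of occurrences of stems u and v in the same document, the distance is the difference in word positions, occurrences in different documents contribute nothing, and the normalized factor divides by the number of distinct word variants of each stem. The `stem` callable and the function names are hypothetical.

```python
from collections import defaultdict


def metric_correlations(docs, stem):
    """docs is a list of tokenized documents; stem maps a word to its stem.

    Returns (c, variants), where c[(u, v)] accumulates 1 / distance for each
    pair of occurrences of stems u and v inside one document, and
    variants[s] is the set of distinct words that share stem s.
    """
    c = defaultdict(float)
    variants = defaultdict(set)
    for words in docs:
        stems = [stem(w) for w in words]
        for w, s in zip(words, stems):
            variants[s].add(w)
        for i, u in enumerate(stems):
            for j, v in enumerate(stems):
                if i != j and u != v:
                    # Closer occurrences contribute more to the correlation.
                    c[(u, v)] += 1.0 / abs(i - j)
    return c, variants


def normalized_metric(c, variants, u, v):
    """Assumed normalization: c[(u, v)] / (|variants[u]| * |variants[v]|)."""
    size = len(variants.get(u, ())) * len(variants.get(v, ()))
    return c.get((u, v), 0.0) / size if size else 0.0


# Toy stemmer for illustration only.
toy_stem = lambda w: {"queries": "query"}.get(w, w)
docs = [["query", "expansion", "queries"], ["expansion", "terms"]]
c, variants = metric_correlations(docs, toy_stem)
```

A cluster for a stem can then be formed the same way as for association clusters, by keeping the n stems with the largest factor.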
Scalar clusters
  Two stems with similar neighborhoods have some synonymity relationship
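One way to make "similar neighborhoods" concrete, sketched here as an assumption rather than the slide's own formula: describe each stem by its row of correlation factors (from an association or metric matrix) and compare two stems by the cosine of the angle between their rows. The function names are hypothetical.

```python
import math


def scalar_correlation(s, u, v):
    """Cosine between rows u and v of the correlation matrix s."""
    dot = sum(a * b for a, b in zip(s[u], s[v]))
    norm_u = math.sqrt(sum(a * a for a in s[u]))
    norm_v = math.sqrt(sum(b * b for b in s[v]))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def scalar_cluster(s, u, n):
    """Indices of the n stems v != u whose rows are most similar to row u."""
    others = [v for v in range(len(s)) if v != u]
    return sorted(others, key=lambda v: scalar_correlation(s, u, v),
                  reverse=True)[:n]


# Example correlation matrix for three stems (illustrative values).
s = [[5, 2, 1],
     [2, 2, 3],
     [1, 3, 10]]
nearest_to_first = scalar_cluster(s, 0, 1)
```

Because the comparison uses whole rows, two stems can be judged similar even when their direct correlation is small, as long as they correlate with the same other stems.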
A term S_u is a neighbor of S_v if S_u belongs to a cluster (of size n) associated with S_v
Neighbor stems that have a synonymity relationship are not necessarily synonyms in the grammatical sense
The union of the un-normalized and normalized clusters provides a better representation of the possible correlations
Metric clusters seem to perform better than purely association clusters
Global analysis
  Expand the query using information from the whole set of documents in the collection
  Build a thesaurus-like structure
  Select terms for expansion based on their similarity to the whole query
  Previous approaches that considered individual query terms failed to yield good results
Query expansion done in three steps
  1. Represent the query as follows
  2. Compute the similarity between each term correlated to the query terms and the whole query
  3. Expand the query with the top r ranked terms according to the computed similarity
Yields improved retrieval performance in the range of 20%
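The three steps can be sketched as follows, assuming the query is represented as a set of weighted terms and that the similarity between a candidate term v and the whole query is the weighted sum of its correlations with the query terms, sim(q, v) = sum over query terms u of w_u * c_{u,v}. The thesaurus dictionary, the weights, and the function name are hypothetical.

```python
def rank_expansion_terms(query_weights, correlation, r):
    """Return the r candidate terms most similar to the whole query.

    query_weights maps each query term u to its weight w_u.
    correlation maps (u, v) pairs to a term-term correlation c_{u,v}
    taken from a thesaurus-like structure built over the collection.
    """
    sims = {}
    for (u, v), c_uv in correlation.items():
        # Only score terms that are correlated to a query term
        # and are not already part of the query.
        if u in query_weights and v not in query_weights:
            sims[v] = sims.get(v, 0.0) + query_weights[u] * c_uv
    return sorted(sims.items(), key=lambda item: item[1], reverse=True)[:r]


query_weights = {"car": 1.0, "rental": 0.5}
correlation = {
    ("car", "auto"): 0.8, ("rental", "auto"): 0.2,
    ("car", "hire"): 0.1, ("rental", "hire"): 0.9,
    ("car", "rental"): 0.3, ("car", "banana"): 0.05,
}
top_terms = rank_expansion_terms(query_weights, correlation, 2)
```

Scoring against the whole query is what separates this from expanding each query term on its own: "hire" ranks highly here only because of its correlation with "rental", while "auto" is supported by both query terms.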