1 Special Topics in Computer Science: The Art of Information Retrieval
Chapter 5: Query Operations
Alexander Gelbukh, www.Gelbukh.com
2 Previous chapter: Conclusions
- Query languages (breadth): words, phrases, proximity, fuzzy Boolean, natural language
- Query languages (depth): pattern matching
- If queries return sets, the results can be combined using the Boolean model
- Combining with structure: hierarchical structure
- Standardized low-level languages: protocols
  - reusable
3 Previous chapter: Trends and research topics
- Models: to better understand the user's needs
- Query languages: flexibility, power, expressiveness, functionality
- Visual languages
  - Example: a library shown on the screen; the user acts: takes books, opens catalogs, etc.
  - Better Boolean queries: "I need books by Cervantes AND Lope de Vega"?!
4 Query operations
- Users have difficulties formulating queries
- The program improves the query:
  - interactive mode: using the user's feedback
  - using info from the retrieved set
  - using linguistic information or information from the collection
- Query expansion: add new terms
- Term re-weighting: modify the weights of existing terms
5 1st method: User relevance feedback
- The user examines the top 10 (or 20) docs and marks the relevant ones
- The system uses this to construct a new query:
  - moved toward the relevant docs
  - away from the irrelevant ones
- Good: simplicity
- Note: throughout the chapter, the correct spelling is Rocchio
6 User relevance feedback: Vector Space Model
- The best vector to distinguish relevant from non-relevant docs: the average of the relevant ones minus the average of the non-relevant ones (sketched below)
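The slide's formula was an image that did not survive extraction; as a sketch in the book's notation, with C_r the set of relevant documents in a collection of N documents, the optimal query vector is

```latex
\vec{q}_{opt} \;=\; \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
            \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```

i.e., the centroid of the relevant documents minus the centroid of the non-relevant ones.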
7 User relevance feedback: Vector Space Model
- The classical feedback variants give equally good results
- The original query gives important info: it is kept in the new query, with its own weight
- Relevant docs give more info than irrelevant ones: β > γ
- γ = 0: positive feedback (irrelevant docs are ignored)
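The slide's formulas were lost in extraction; the standard Rocchio update is $\vec{q}_m = \alpha\,\vec{q} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j$, where $D_r$ and $D_n$ are the docs the user marked relevant and irrelevant. A minimal sketch in Python (the default parameter values are common textbook choices, not taken from the slides):

```python
import numpy as np

def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio feedback step in the vector space model.
    query_vec   : term-weight vector of the original query
    rel_docs    : rows = vectors of docs the user marked relevant
    nonrel_docs : rows = vectors of docs the user marked irrelevant
    With gamma = 0 this degenerates to pure positive feedback."""
    q_new = alpha * np.asarray(query_vec, dtype=float)
    if len(rel_docs):
        q_new = q_new + beta * np.mean(rel_docs, axis=0)      # move toward relevant docs
    if len(nonrel_docs):
        q_new = q_new - gamma * np.mean(nonrel_docs, axis=0)  # move away from irrelevant docs
    return np.clip(q_new, 0.0, None)  # negative weights are usually dropped
```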
8 User relevance feedback: Probabilistic Model
- Term weights are re-estimated from the user's feedback
- Smoothing is usually applied
- Bad:
  - no document term weights are used
  - previous history is lost
  - no new terms: only the weights of the original terms are changed
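The slide's formulas were images; a sketch of the standard probabilistic re-weighting, in the book's notation, where $D_r$ is the set of docs the user marked relevant, $D_{r,i}$ those of them containing term $k_i$, $n_i$ the number of docs in the whole collection containing $k_i$, and $N$ the collection size:

```latex
sim(d_j, q) \;\sim\; \sum_{k_i \in q \,\cap\, d_j}
  \left[ \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
       + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right]
```

with the feedback estimates $P(k_i \mid R) = |D_{r,i}| / |D_r|$ and $P(k_i \mid \bar{R}) = (n_i - |D_{r,i}|) / (N - |D_r|)$. The usual smoothing adds 0.5 (or $n_i/N$) to the numerators and 1 to the denominators, so that a small feedback set does not produce zero probabilities.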
9 ... a variant for the Probabilistic Model
- The similarity is multiplied by TF (term frequency)
  - not exactly, but this is the idea
  - initially, IDF is also taken into account
  - details in the book
- Still no query expansion: only re-weighting of the original terms
10 Evaluation of relevance feedback
- Simplistic:
  - evaluate precision and recall after the feedback cycle
  - not realistic, since the result includes the user's own feedback
- Better:
  - consider only unseen data: use the rest of the collection (the residual collection)
  - the figures are not as good
  - useful to compare different methods, not to compare precision/recall before and after feedback
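A tiny sketch of this residual-collection evaluation (function and variable names are hypothetical):

```python
def residual_precision_recall(ranking, judged, relevant, k=10):
    """Precision and recall at k over the residual collection:
    docs already judged in the feedback cycle are removed from
    the post-feedback ranking before evaluating it.
    ranking  : doc ids ordered by the system after feedback
    judged   : set of doc ids the user saw in the feedback cycle
    relevant : set of all relevant doc ids"""
    residual = [d for d in ranking if d not in judged]
    rel_residual = relevant - judged
    hits = sum(1 for d in residual[:k] if d in rel_residual)
    precision = hits / k
    recall = hits / len(rel_residual) if rel_residual else 0.0
    return precision, recall
```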
11 2nd method: Automatic local analysis
- Idea: add to the query synonyms, stemming variations, collocations: thesaurus-like relationships
  - based on clustering techniques
- Global vs. local strategy:
  - global: the whole collection is used
  - local: only the retrieved set is used; similar to feedback, but automatic
- Local analysis seems to give better results (better adaptation to the specific query) but is time-consuming
  - good for local collections, not for the Web
- Build clusters of words; add to each keyword its neighbors
12 Clustering (words)
- Association clusters (sketched below)
  - terms that co-occur in the docs
  - the cluster of a query term is the n terms that most frequently co-occur with it (normalized vs. non-normalized counts)
- Metric clusters (better)
  - weight the co-occurrences by the proximity of the terms in the text
  - terms that occur in the same sentence are more related
- Scalar clusters
  - terms co-occurring with the same other terms are related
  - relatedness of two words = scalar product of the centroids of their association clusters
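A sketch of the first (and simplest) variant, local association clusters, under assumed inputs (docs as lists of stemmed words); the normalization s_uv = c_uv / (c_uu + c_vv − c_uv) is the usual one:

```python
import numpy as np

def association_clusters(docs, query_terms, n=3, normalized=True):
    """For each query term, return the n terms that co-occur with it
    most strongly in the retrieved documents (docs = lists of words)."""
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    F = np.zeros((len(vocab), len(docs)))   # term-document frequency matrix
    for j, d in enumerate(docs):
        for w in d:
            F[idx[w], j] += 1
    C = F @ F.T                              # c_uv = sum_j f_uj * f_vj
    if normalized:
        diag = np.diag(C)
        C = C / (diag[:, None] + diag[None, :] - C)
    clusters = {}
    for t in query_terms:
        if t not in idx:
            continue
        row = C[idx[t]].copy()
        row[idx[t]] = -1                     # exclude the term itself
        best = np.argsort(row)[::-1][:n]
        clusters[t] = [vocab[i] for i in best]
    return clusters
```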
13 ... variant (local clustering)
- Metric-like reasoning:
- Break the retrieved docs into passages (say, 300 words each)
- Use the passages as docs; use TF-IDF
- Choose the words related (by TF-IDF similarity) to the whole query
- Better, because words occurring near each other are more related
- Tune the parameters for each collection
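A one-line sketch of the passage-splitting step (the 300-word size comes from the slide; the names are hypothetical):

```python
def passages(words, size=300):
    """Split a doc (list of words) into fixed-size passages that are
    then indexed and TF-IDF-weighted as if they were separate docs."""
    return [words[i:i + size] for i in range(0, len(words), size)]
```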
14 3rd method: Automatic global analysis
- Uses all docs in the collection
- Builds a thesaurus
- The terms related to the whole query are added to it (query expansion)
15 Similarity thesaurus
- Relatedness = occurring in the same docs
- Matrix: term × doc frequency
  - inverse term frequency: divided by the size of the doc
- Relatedness of two terms = correlation between the corresponding rows of the matrix
- Query: a weighted centroid (weighted sum) of its terms; relatedness between a term and this centroid = cosine
- The best terms are added to the query, weighted by their similarity to it (sketched below)
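A sketch of this expansion under stated assumptions (F is a term-by-doc matrix of frequencies already scaled by inverse term frequency; the final weighting is simplified, not the book's exact formula):

```python
import numpy as np

def expand_query(F, query_weights, r=5):
    """Similarity-thesaurus expansion sketch.
    F             : term-by-doc matrix, ITF-scaled frequencies
    query_weights : {term index: weight in the query}
    Relatedness of two terms = cosine of their rows; the query is the
    weighted centroid of its term vectors; the r terms most similar to
    that centroid are returned, with the similarity as their weight."""
    rows = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    centroid = sum(w * rows[i] for i, w in query_weights.items())
    centroid = centroid / (np.linalg.norm(centroid) + 1e-12)
    sims = rows @ centroid
    sims[list(query_weights)] = -1.0        # skip the original query terms
    best = np.argsort(sims)[::-1][:r]
    return {int(i): float(sims[i]) for i in best}
```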
16 (Global) Statistical thesaurus...
- The terms added must be discriminative, i.e., of low frequency
  - but low-frequency terms are difficult to cluster (not enough info)
- Solution: first cluster the docs; within a cluster the frequencies add up
- Clustering docs, e.g.:
  - each doc starts as its own cluster
  - merge the two most similar clusters (= those whose docs are similar)
  - repeat until the desired clustering is obtained (details in the book, p. 136)
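A naive sketch of this merge loop (complete-link similarity is an assumption here; the slide does not name the linkage criterion):

```python
def agglomerate(S, target):
    """Agglomerative doc clustering from a doc-by-doc similarity
    matrix S. Start with singletons and repeatedly merge the two most
    similar clusters, complete-link style (cluster similarity = its
    weakest doc pair), until `target` clusters remain."""
    clusters = [[i] for i in range(len(S))]
    def sim(a, b):
        return min(S[i][j] for i in a for j in b)
    while len(clusters) > target:
        a, b = max(((x, y) for x in range(len(clusters))
                           for y in range(x + 1, len(clusters))),
                   key=lambda p: sim(clusters[p[0]], clusters[p[1]]))
        clusters[a].extend(clusters.pop(b))   # merge b into a
    return clusters
```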
17 ... statistical thesaurus
- Convert the cluster hierarchy into a set of clusters
  - use a threshold similarity level to cut the hierarchy
  - don't take too large clusters
- Consider only low-frequency (in terms of ITF) terms occurring in the docs of the same class
  - selected by a threshold
  - these give the clusters of words
- Calculate the weight of each class of terms; add these terms, with this weight, to the query terms
18 Research topics
- Interactive interfaces
  - graphical, 2D or 3D
- Refining global analysis techniques
- Application of linguistic methods: stemming, ontologies
- Local analysis for the Web (now too expensive)
- Combining the three techniques (feedback, local, global)
19 Conclusions
- Relevance feedback
  - simple, understandable
  - needs the user's attention
  - term re-weighting
- Local analysis for query expansion
  - co-occurrences in the retrieved docs
  - usually gives better results than global analysis
  - computationally expensive
- Global analysis
  - not as good results, since what is good for the whole collection is not good for a specific query
  - linguistic methods, dictionaries, ontologies, stemming, ...
20 Exam
- Questions and exercises: do what you consider appropriate
- Discussion on Oct 23, or maybe Nov 6 (??)
- The class of Oct 30 is moved to Oct 23
21 Thank you!
- Till October 23
- October 23: discussion of the midterm exam (the class of October 30 is moved to that date)