1
ISP433/633 Week 3 Query Structure and Query Operations
2
Outline
More on weight assignment
Exercise (as a review)
BREAK
Query structure
Query operations
–Boolean query parse
–Vector query reformulation
3
Vector Space Model Example
D1 = “computer information retrieval”
D2 = “computer, computer information”
Q1 = “information, information retrieval”

Vectors:
      Computer  Information  Retrieval
D1    1         1            1
D2    2         1            0
Q1    0         2            1
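The vectors on this slide are raw term counts. A minimal sketch of how they could be produced, assuming lowercase word tokens and a fixed three-term vocabulary (the function name `term_vector` is illustrative, not from the slides):

```python
import re
from collections import Counter

def term_vector(text, vocabulary):
    """Return the raw count of each vocabulary term in text."""
    # Assumption: a token is any run of letters, compared in lowercase.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[term] for term in vocabulary]

vocabulary = ["computer", "information", "retrieval"]
d1 = term_vector("computer information retrieval", vocabulary)
d2 = term_vector("computer, computer information", vocabulary)
q1 = term_vector("information, information retrieval", vocabulary)
print(d1, d2, q1)
```

Each position in a vector corresponds to one vocabulary term, so documents and queries can be compared component by component.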
4
More on Weight Assignment
Boolean Model: binary weight
Term Frequency as weight (freq)
–Raw frequency of a term inside a document
Problem with using raw freq
–Zipf’s law: non-distinguishing terms have high frequency
–Document length matters
5
Zipf’s law
Rank × Frequency ≈ Constant
6
Zipf’s law
[Figure: two plots, one on a linear scale and one on a log scale]
7
Inverse Document Frequency (idf)
Inverse of the proportion of documents that have the term among all the documents in the collection
–deals with Zipf’s law
$\mathrm{idf}_i = \log(N / n_i)$
–$N$: total number of documents in the collection
–$n_i$: the number of documents that have term $i$
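A minimal sketch of the idf definition above, applied to the two documents from the earlier vector space example. The slide does not state the base of the logarithm; the natural log is assumed here, and the function name `idf` is illustrative.

```python
import math

def idf(term, documents):
    """Inverse document frequency: log(N / n_i), natural log assumed."""
    n_total = len(documents)                                   # N
    n_with_term = sum(1 for doc in documents if term in doc)   # n_i
    if n_with_term == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log(n_total / n_with_term)

documents = [
    ["computer", "information", "retrieval"],   # D1
    ["computer", "computer", "information"],    # D2
]
print(idf("computer", documents))    # occurs in both documents
print(idf("retrieval", documents))   # occurs in one document only
```

A term that occurs in every document gets idf 0, which is how idf suppresses non-distinguishing terms.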
8
Benefit of idf idf provides high values for rare words and low values for common words
9
Normalized Term Frequency (tf)
Deals with document length
The most frequent term $m$ in document $j$
–$\mathrm{freq}_{m,j}$
Term $i$’s normalized term frequency in document $j$:
–$\mathrm{tf}_{i,j} = \mathrm{freq}_{i,j} / \mathrm{freq}_{m,j}$
For query:
–$\mathrm{tf}_{i,q} = 0.5 + 0.5 \cdot \mathrm{freq}_{i,q} / \mathrm{freq}_{m,q}$
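The two normalizations can be sketched as follows, assuming the text is already tokenized. The names `document_tf` and `query_tf` are illustrative; the inputs are D2 and Q1 from the earlier vector space example.

```python
from collections import Counter

def document_tf(tokens):
    """tf for a document: freq of each term / freq of the most frequent term."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

def query_tf(tokens):
    """tf for a query: 0.5 + 0.5 * freq / freq of the most frequent term."""
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: 0.5 + 0.5 * freq / max_freq for term, freq in counts.items()}

d2_tf = document_tf(["computer", "computer", "information"])
q1_tf = query_tf(["information", "information", "retrieval"])
print(d2_tf)
print(q1_tf)
```

With the query formula, every term that appears in the query gets a tf of at least 0.5, so short queries are not dominated by a single repeated word.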
10
tf*idf as weight Assign a tf * idf weight to each term in each document
11
Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
Q = “information, retrieval”
Compute the tf*idf weight for each term in D2 and Q
BREAK
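One way to check an answer to the exercise is to compute the weights directly. This is a sketch under stated assumptions: the collection consists of D1 and D2 only (N = 2, the query is not counted as a document), the natural log is used, and all helper names are illustrative.

```python
import math
from collections import Counter

# Assumption: the collection is D1 and D2 only, so N = 2.
documents = {
    "D1": "computer information retrieval".split(),
    "D2": "computer retrieval".split(),
}
query = ["information", "retrieval"]

def idf(term):
    """log(N / n_i); every term used here occurs in at least one document."""
    n_i = sum(1 for tokens in documents.values() if term in tokens)
    return math.log(len(documents) / n_i)

def document_tf(tokens):
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

def query_tf(tokens):
    counts = Counter(tokens)
    max_freq = max(counts.values())
    return {term: 0.5 + 0.5 * freq / max_freq for term, freq in counts.items()}

d2_weights = {t: tf * idf(t) for t, tf in document_tf(documents["D2"]).items()}
q_weights = {t: tf * idf(t) for t, tf in query_tf(query).items()}
print(d2_weights)
print(q_weights)
```

Under these assumptions both terms of D2 get weight 0, because “computer” and “retrieval” occur in every document; in Q only “information” receives a non-zero weight.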
12
Queries
Single-word queries
Context queries
–Phrases
–Proximity
Boolean queries
Natural Language queries
13
Pattern Match
Words
Prefixes
Suffixes
Substrings
Ranges
Regular expressions
Structured queries (e.g., XQuery to query XML, Z39.50)
14
Boolean Query Processing
The query must be parsed to determine:
–the search words
–any optional field or index qualifications
–the Boolean operators
and how they relate to one another
Typical parsing uses lexical analysers (like lex or flex) along with parser generators like yacc, Bison or LLgen
–These produce code to be compiled into programs.
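The slide names lex/yacc-style tools; as a stand-in, the sketch below uses a small hand-written recursive-descent parser in Python to show the same idea. The grammar is an assumption made for illustration: every search word is preceded by a field qualifier, the operators `and`, `or` and `not` have equal precedence and associate to the left, and parentheses group sub-queries. The tuple layout of the tree is also illustrative.

```python
import re

OPERATORS = {"and", "or", "not"}

def tokenize(query):
    """Split a query into parentheses and whitespace-separated words."""
    return re.findall(r"\(|\)|[^\s()]+", query)

def parse(query):
    """Parse e.g. 'Title XXX and Subject YYY' into a nested-tuple query tree."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return token

    def parse_operand():
        token = take()
        if token == "(":
            node = parse_expression()
            if take() != ")":
                raise ValueError("expected ')'")
            return node
        if token == ")" or token.lower() in OPERATORS:
            raise ValueError(f"unexpected token {token!r}")
        # Assumed form of an operand: field qualifier, then the search word.
        value = take()
        if value in ("(", ")") or value.lower() in OPERATORS:
            raise ValueError(f"expected a search word after {token!r}")
        return ("TERM", token.lower(), value)

    def parse_expression():
        node = parse_operand()
        while peek() is not None and peek().lower() in OPERATORS:
            operator = take().upper()
            node = (operator, node, parse_operand())
        return node

    tree = parse_expression()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return tree

print(parse("Title XXX and Subject YYY"))
print(parse("Subject XXX and (title yyy and author zzz)"))
```

A generated parser would normally replace the hand-written functions, but the output plays the same role: a tree whose leaves are search terms and whose inner nodes are Boolean operators.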
15
Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
    type-0    [0]    ANY,
    type-1    [1]    IMPLICIT RPNQuery,
    type-2    [2]    OCTET STRING,
    type-100  [100]  OCTET STRING,
    type-101  [101]  IMPLICIT RPNQuery,
    type-102  [102]  OCTET STRING}
16
Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
    attributeSet  AttributeSetId,
    rpn           RPNStructure}
17
RPN Structure
RPNStructure ::= CHOICE{
    op        [0] Operand,
    rpnRpnOp  [1] IMPLICIT SEQUENCE{
        rpn1  RPNStructure,
        rpn2  RPNStructure,
        op    Operator }
    }
18
Operand
Operand ::= CHOICE{
    attrTerm    AttributesPlusTerm,
    resultSet   ResultSetId,
    -- If version 2 is in force:
    --  - If query type is 1, one of the above two must be chosen;
    --  - resultAttr (below) may be used only if query type is 101.
    resultAttr  ResultSetPlusAttributes}
19
Operator
Operator ::= [46] CHOICE{
    and      [0] IMPLICIT NULL,
    or       [1] IMPLICIT NULL,
    and-not  [2] IMPLICIT NULL,
    -- If version 2 is in force:
    --  - For query type 1, one of the above three must be chosen;
    --  - prox (below) may be used only if query type is 101.
    prox     [3] IMPLICIT ProximityOperator}
20
Parse Result (Query Tree)
Z39.50 queries…
Title XXX and Subject YYY

Oper: AND
    left:  Operand: Index = Title, Value = XXX
    right: Operand: Index = Subject, Value = YYY
21
Parse Results
Subject XXX and (title yyy and author zzz)

Op: AND
    Oper: Index: Subject, Value: XXX
    Op: AND
        Oper: Index: Title, Value: YYY
        Oper: Index: Author, Value: ZZZ
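Once a query is in tree form, it can be evaluated bottom-up with set operations. The sketch below assumes a toy inverted index that maps (field, value) pairs to sets of document ids; the index contents, the tuple layout of the tree and the function name `evaluate` are all hypothetical.

```python
def evaluate(node, index):
    """Return the set of document ids matching a nested-tuple query tree."""
    kind = node[0]
    if kind == "TERM":
        _, field, value = node
        # Look up the posting set for this field/value pair (empty if absent).
        return set(index.get((field, value.lower()), set()))
    _, left, right = node
    left_ids = evaluate(left, index)
    right_ids = evaluate(right, index)
    if kind == "AND":
        return left_ids & right_ids
    if kind == "OR":
        return left_ids | right_ids
    if kind == "NOT":  # binary "and-not": in the left set but not the right
        return left_ids - right_ids
    raise ValueError(f"unknown operator {kind!r}")

# Hypothetical index: (field, value) -> ids of documents containing it.
index = {
    ("subject", "xxx"): {1, 2, 3},
    ("title", "yyy"): {2, 3, 4},
    ("author", "zzz"): {3, 5},
}

# Tree for: Subject XXX and (title yyy and author zzz)
tree = ("AND",
        ("TERM", "subject", "XXX"),
        ("AND", ("TERM", "title", "yyy"), ("TERM", "author", "zzz")))
result = evaluate(tree, index)
print(result)
```

The inner AND is resolved first, then intersected with the subject postings, mirroring the nesting of the tree.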
22
Relevance feedback
Popular query reformulation strategy
Used for
–Query expansion
–Term re-weighting
Type
–Manual
–Automatic
Scope
–Local
–Global
23
Vector Model
$D_r$: set of relevant documents identified by user
$D_n$: set of non-relevant documents identified
$\vec{q}$: vector of original query
$\vec{q}'$: vector of expanded query
A common strategy of query reformulation is:
–$\vec{q}' = \vec{q} + \sum_{\vec{d} \in D_r} \vec{d} - \sum_{\vec{d} \in D_n} \vec{d}$
24
Example
Q = “safety minivans”
D1 = “car safety minivans tests injury statistics” - relevant
D2 = “liability tests safety” - relevant
D3 = “car passengers injury reviews” - non-relevant
What should be the reformulated Q’?
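One possible answer applies the reformulation strategy from the previous slide (original query, plus the sum of the relevant document vectors, minus the sum of the non-relevant ones). The sketch assumes raw term counts as weights and whitespace tokenization; the names `vectorize` and `reformulate` are illustrative.

```python
from collections import Counter

def vectorize(text):
    """Raw term-count vector, stored sparsely as a Counter."""
    return Counter(text.lower().split())

def reformulate(query, relevant, non_relevant):
    """Return query + sum(relevant vectors) - sum(non-relevant vectors)."""
    new_query = Counter()
    new_query.update(query)          # Counter.update adds counts
    for doc in relevant:
        new_query.update(doc)
    for doc in non_relevant:
        new_query.subtract(doc)      # subtract keeps zero and negative counts
    return dict(new_query)

q = vectorize("safety minivans")
d1 = vectorize("car safety minivans tests injury statistics")   # relevant
d2 = vectorize("liability tests safety")                        # relevant
d3 = vectorize("car passengers injury reviews")                 # non-relevant

q_new = reformulate(q, [d1, d2], [d3])
print(q_new)
```

With raw counts, “safety” rises to 3, “minivans” and “tests” to 2, “car” and “injury” cancel to 0, and “passengers” and “reviews” end up at -1. Whether to keep, clip or drop the zero and negative weights is a design choice the slides do not address.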