ISP433/633 Week 3 Query Structure and Query Operations.

Outline
More on weight assignment
Exercise (as a review)
BREAK
Query structure
Query operations
–Boolean query parse
–Vector query reformulation

Vector Space Model Example
D1 = “computer information retrieval”
D2 = “computer, computer information”
Q1 = “information, information retrieval”

Vectors:
      Computer  Information  Retrieval
D1    1         1            1
D2    2         1            0
Q1    0         2            1
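The vectors on this slide can be reproduced with a short sketch. The helper name `term_vector`, the fixed vocabulary order, and the tokenization (lowercase, drop commas, split on whitespace) are assumptions made for illustration; the slides do not prescribe them.

```python
from collections import Counter

# Column order taken from the slide's table.
VOCABULARY = ["computer", "information", "retrieval"]

def term_vector(text, vocabulary=VOCABULARY):
    """Return the raw term counts of `text`, one entry per vocabulary term."""
    # Assumed tokenization: lowercase, drop commas, split on whitespace.
    counts = Counter(text.lower().replace(",", " ").split())
    return [counts[term] for term in vocabulary]

d1 = term_vector("computer information retrieval")
d2 = term_vector("computer, computer information")
q1 = term_vector("information, information retrieval")
print(d1, d2, q1)
```

Each list lines up with one row of the table, so D2 becomes `[2, 1, 0]` because "computer" occurs twice and "retrieval" not at all.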

More on Weight Assignment
Boolean Model: binary weight
Term Frequency as weight (freq)
–Raw frequency of a term inside a document
Problem with using raw freq
–Zipf’s law: non-distinguishing terms have high frequency
–Document length matters

Zipf’s law
Rank × Frequency ≈ Constant

Zipf’s law
[Figure: rank-frequency plots on a linear scale and on a log scale]

Inverse Document Frequency (idf)
Inverse of the proportion of documents in the collection that contain the term
–Deals with Zipf’s law
$\mathrm{idf}_i = \log(N / n_i)$
–$N$: total number of documents in the collection
–$n_i$: number of documents that contain term $i$
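A minimal sketch of the idf formula follows. The function name `idf`, the representation of each document as a set of terms, the sample collection, and the use of the natural logarithm are all assumptions; the slide does not state a log base.

```python
import math

def idf(term, documents):
    """Compute log(N / n_i) for `term` over a list of term sets."""
    n_total = len(documents)                                   # N
    n_with_term = sum(1 for doc in documents if term in doc)   # n_i
    if n_with_term == 0:
        raise ValueError(f"term {term!r} does not occur in the collection")
    # Natural log is an assumption; any base preserves the ranking of terms.
    return math.log(n_total / n_with_term)

# Hypothetical four-document collection.
docs = [
    {"computer", "information", "retrieval"},
    {"computer", "retrieval"},
    {"computer", "graphics"},
    {"computer", "network"},
]
print(idf("computer", docs))     # occurs in all 4 documents
print(idf("information", docs))  # occurs in 1 of 4 documents
```

A term found in every document gets an idf of 0, while a term found in only one document gets the largest value in the collection.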

Benefit of idf
idf provides high values for rare words and low values for common words

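Normalized Term Frequency (tf)
Deals with document length
Let $m$ be the most frequent term in document $j$, with raw frequency $\mathrm{freq}_{m,j}$
Term $i$’s normalized term frequency in document $j$:
–$\mathrm{tf}_{i,j} = \mathrm{freq}_{i,j} / \mathrm{freq}_{m,j}$
For the query $q$:
–$\mathrm{tf}_{i,q} = 0.5 + 0.5 \cdot \mathrm{freq}_{i,q} / \mathrm{freq}_{m,q}$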
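The two normalizations can be sketched as below. The function names `document_tf` and `query_tf` and the tokenization are assumptions, and the sketch assumes non-empty input text. The sample strings are D2 and Q1 from the vector space example earlier in the deck.

```python
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase, drop commas, split on whitespace.
    return text.lower().replace(",", " ").split()

def document_tf(text):
    """tf = freq / (frequency of the most frequent term in the document)."""
    counts = Counter(tokenize(text))
    max_freq = max(counts.values())
    return {term: freq / max_freq for term, freq in counts.items()}

def query_tf(text):
    """tf = 0.5 + 0.5 * freq / (frequency of the most frequent query term)."""
    counts = Counter(tokenize(text))
    max_freq = max(counts.values())
    return {term: 0.5 + 0.5 * freq / max_freq for term, freq in counts.items()}

print(document_tf("computer, computer information"))
print(query_tf("information, information retrieval"))
```

In the document, "computer" is the most frequent term and gets 1.0, while "information" gets 0.5. In the query, the smoothing term keeps every weight at or above 0.5, so "retrieval" gets 0.75 rather than 0.5.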

tf*idf as weight
Assign a tf * idf weight to each term in each document: $w_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$
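Putting the two factors together, a hedged sketch of tf*idf weighting for documents might look like this. The function name `tf_idf_weights`, the tokenization, and the natural logarithm are assumptions; the two sample documents are D1 and D2 from the vector space example earlier in the deck.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase, drop commas, split on whitespace.
    return text.lower().replace(",", " ").split()

def tf_idf_weights(documents):
    """Return one {term: tf * idf} dict per document in `documents`."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)  # N
    # n_i: number of documents that contain term i (each document counted once).
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        max_freq = max(counts.values())  # frequency of the most frequent term
        weights.append({
            term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
            for term, freq in counts.items()
        })
    return weights

docs = ["computer information retrieval", "computer, computer information"]
for doc_weights in tf_idf_weights(docs):
    print(doc_weights)
```

With only two documents, "computer" and "information" occur in both, so their idf and therefore their weight is 0 no matter how often they appear; only "retrieval" receives a positive weight.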

Exercise
D1 = “computer information retrieval”
D2 = “computer retrieval”
Q = “information, retrieval”
Compute the tf*idf weight for each term in D2 and Q

BREAK

Queries
Single-word queries
Context queries
–Phrases
–Proximity
Boolean queries
Natural Language queries

Pattern Match
Words
Prefixes
Suffixes
Substrings
Ranges
Regular expressions
Structured queries (e.g., XQuery to query XML, Z39.50)
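Several of these pattern types can be illustrated against a small word list. The word list, the variable names, and the reading of "ranges" as lexicographic ranges are assumptions for illustration only.

```python
import re

# Hypothetical index vocabulary.
words = ["retrieval", "retrieve", "information", "inform", "query", "queries"]

# Prefix: words that start with "retriev".
prefix = [w for w in words if w.startswith("retriev")]
# Suffix: words that end with "tion".
suffix = [w for w in words if w.endswith("tion")]
# Substring: words that contain "form" anywhere.
substring = [w for w in words if "form" in w]
# Range (assumed lexicographic, inclusive bounds): "inform" <= word <= "query".
in_range = [w for w in words if "inform" <= w <= "query"]
# Regular expression: "quer" followed by "y" or "ies", matching the whole word.
regex = [w for w in words if re.fullmatch(r"quer(y|ies)", w)]

print(prefix, suffix, substring, in_range, regex, sep="\n")
```

A real system would run these matches against an index rather than scanning a list, but the matching rules are the same.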

Boolean Query Processing
The query must be parsed to determine what the following are and how they relate to one another:
–Search words
–Optional field or index qualifications
–Boolean operators
Typical parsing uses lexical analysers (like lex or flex) along with parser generators like yacc, bison, or LLgen
–These produce code to be compiled into programs.

Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
    type-0   [0]   ANY,
    type-1   [1]   IMPLICIT RPNQuery,
    type-2   [2]   OCTET STRING,
    type-100 [100] OCTET STRING,
    type-101 [101] IMPLICIT RPNQuery,
    type-102 [102] OCTET STRING}

Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
    attributeSet AttributeSetId,
    rpn          RPNStructure}

RPN Structure
RPNStructure ::= CHOICE{
    op       [0] Operand,
    rpnRpnOp [1] IMPLICIT SEQUENCE{
        rpn1 RPNStructure,
        rpn2 RPNStructure,
        op   Operator } }

Operand
Operand ::= CHOICE{
    attrTerm   AttributesPlusTerm,
    resultSet  ResultSetId,
        -- If version 2 is in force:
        --   - If query type is 1, one of the above two must be chosen;
        --   - resultAttr (below) may be used only if query type is 101.
    resultAttr ResultSetPlusAttributes}

Operator
Operator ::= [46] CHOICE{
    and     [0] IMPLICIT NULL,
    or      [1] IMPLICIT NULL,
    and-not [2] IMPLICIT NULL,
        -- If version 2 is in force:
        --   - For query type 1, one of the above three must be chosen;
        --   - prox (below) may be used only if query type is 101.
    prox    [3] IMPLICIT ProximityOperator}
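To make the recursive shape of RPNStructure concrete, here is a rough Python mirror of it with a toy evaluator. This is not a Z39.50 implementation: the class names `Operand` and `RpnOp`, the reduction of an operand to an index name plus a term, the `postings` dictionary, and the `evaluate` function are all hypothetical, and proximity and result-set operands are left out.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str  # stands in for the attribute list, e.g. "title" (assumption)
    term: str

@dataclass
class RpnOp:
    rpn1: "RPNStructure"
    rpn2: "RPNStructure"
    op: str     # "and", "or", or "and-not"

# Like the ASN.1 CHOICE: a node is either an operand or an operator node.
RPNStructure = Union[Operand, RpnOp]

def evaluate(node, postings):
    """Return the set of document IDs that satisfy `node`.

    `postings` is a hypothetical in-memory index mapping (index, term)
    to a set of document IDs.
    """
    if isinstance(node, Operand):
        return postings.get((node.index, node.term), set())
    left = evaluate(node.rpn1, postings)
    right = evaluate(node.rpn2, postings)
    if node.op == "and":
        return left & right
    if node.op == "or":
        return left | right
    if node.op == "and-not":
        return left - right
    raise ValueError(f"unsupported operator: {node.op}")

postings = {("title", "xxx"): {1, 2, 3}, ("subject", "yyy"): {2, 3, 4}}
query = RpnOp(Operand("title", "xxx"), Operand("subject", "yyy"), "and")
print(evaluate(query, postings))
```

Because `rpn1` and `rpn2` are themselves RPNStructure values, arbitrarily nested Boolean queries fit the same two node types.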

Parse Result (Query Tree)
Z39.50 queries…
Query: Title XXX and Subject YYY

                 Oper: AND
           left /         \ right
Operand:                    Operand:
Index = Title               Index = Subject
Value = XXX                 Value = YYY

Parse Results
Query: Subject XXX and (title yyy and author zzz)

Op: AND
    Oper: Index: Subject, Value: XXX
    Op: AND
        Oper: Index: Title, Value: YYY
        Oper: Index: Author, Value: ZZZ
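A query tree of this shape can be produced by a small hand-written recursive-descent parser. This is only a sketch under assumptions: the grammar (an index name followed by one value, combined with `and`/`or` and parentheses, all operators left-associative with equal precedence), the dictionary node layout, and the function names are hypothetical. Production systems would more likely use lex/yacc-style generated parsers, as noted earlier in the deck.

```python
import re

OPERATORS = {"and", "or"}

def tokenize(query):
    # Parentheses are separate tokens; everything else splits on whitespace.
    return re.findall(r"[()]|[^\s()]+", query)

def parse(query):
    """Parse `query` into a nested dict tree."""
    tokens = tokenize(query)
    tree, pos = _parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return tree

def _parse_expr(tokens, pos):
    # expr := factor (OPERATOR factor)*
    left, pos = _parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos].lower() in OPERATORS:
        op = tokens[pos].upper()
        right, pos = _parse_factor(tokens, pos + 1)
        left = {"op": op, "left": left, "right": right}
    return left, pos

def _parse_factor(tokens, pos):
    # factor := "(" expr ")" | INDEX VALUE
    if pos >= len(tokens):
        raise ValueError("unexpected end of query")
    if tokens[pos] == "(":
        tree, pos = _parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("missing closing parenthesis")
        return tree, pos + 1
    if pos + 1 >= len(tokens) or tokens[pos + 1] in "()":
        raise ValueError("expected an index name followed by a value")
    return {"index": tokens[pos], "value": tokens[pos + 1]}, pos + 2

tree = parse("Subject XXX and (title yyy and author zzz)")
print(tree)
```

The parenthesized sub-query becomes the right child of the outer AND node, which matches the nesting of the tree on the slide.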

Relevance feedback
Popular query reformulation strategy
Used for
–Query expansion
–Term re-weighting
Type
–Manual
–Automatic
Scope
–Local
–Global

Vector Model
$D_r$: set of relevant documents identified by the user
$D_n$: set of non-relevant documents identified by the user
$\vec{q}$: vector of the original query
$\vec{q}\,'$: vector of the expanded query
A common strategy of query reformulation is:
–$\vec{q}\,' = \vec{q} + \sum_{\vec{d} \in D_r} \vec{d} - \sum_{\vec{d} \in D_n} \vec{d}$

Example
Q = “safety minivans”
D1 = “car safety minivans tests injury statistics” - relevant
D2 = “liability tests safety” - relevant
D3 = “car passengers injury reviews” - non-relevant
What should be the reformulated Q’?
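One possible way to work this example is sketched below, assuming raw term counts as vector weights (the slides do not say which weighting to use here) and applying the add-relevant, subtract-non-relevant strategy from the Vector Model slide. The function name `reformulate` and the whitespace tokenization are assumptions.

```python
from collections import Counter

def tokenize(text):
    # Assumed tokenization: lowercase and split on whitespace.
    return text.lower().split()

def reformulate(query, relevant, non_relevant):
    """q' = q + (sum of relevant vectors) - (sum of non-relevant vectors).

    Vectors are raw term counts, which is an assumption of this sketch.
    """
    new_query = Counter(tokenize(query))
    for doc in relevant:
        new_query.update(tokenize(doc))    # add the document's term counts
    for doc in non_relevant:
        new_query.subtract(tokenize(doc))  # subtract; weights may go negative
    return dict(new_query)

q_new = reformulate(
    "safety minivans",
    relevant=[
        "car safety minivans tests injury statistics",
        "liability tests safety",
    ],
    non_relevant=["car passengers injury reviews"],
)
# Print terms from highest to lowest weight.
for term, weight in sorted(q_new.items(), key=lambda item: (-item[1], item[0])):
    print(term, weight)
```

Under these assumptions "safety" ends up with weight 3, "minivans" and "tests" with 2, "liability" and "statistics" with 1, "car" and "injury" cancel to 0, and "passengers" and "reviews" become -1. Whether to keep or drop the zero and negative weights is a separate design choice that the slide leaves open.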
