Alexander Gelbukh www.Gelbukh.com Special Topics in Computer Science The Art of Information Retrieval Chapter 4: Query Languages
Previous chapter. Main measures: precision and recall, defined for sets. Rankings are evaluated through their initial subsets. Some measures combine the two into one; these involve user-defined preferences (in the F-measure, set to 50-50). There are many other characteristics; an algorithm can be good at some and bad at others. Averages are used, but they are not always meaningful. Reference collections with known answers exist for evaluating new algorithms.
Previous chapter: research issues. Different types of interfaces; interactive systems. What measures should be used? How do people judge relevance? How can “user satisfaction” be measured? Modeled?
Query languages. Query language = the types of queries that can be posed. The types of queries depend on the IR model. Types: IR (= ranked output); data retrieval; user-oriented; low-level (= protocols). We assume all pre-processing has been done: thesaurus, stop-words, ... (I think this must be a part of the language!) A query returns “documents” (chapter, paragraph, ...).
In this chapter: keyword-based languages; pattern matching; structure taken into account; protocols.
Keyword-based languages: single word. Intuitive, easy to express, fast ranking. Words can be highlighted in the output. What is a word? Letters and separators. Non-splitting characters: “on-line”; the database decides. TF-IDF is designed for words. Used in the main models (Boolean, vector, probabilistic).
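The question "what is a word?" can be made concrete with a small tokenizer. This is only a sketch under stated assumptions: the names `WORD` and `tokenize` are illustrative, and the choice of the hyphen and the apostrophe as non-splitting characters is one possible decision a database could make, not a standard.

```python
import re

# Assumption: a word is a run of letters or digits. A hyphen or an apostrophe
# is treated as a non-splitting character only when it sits between two such
# runs, so "on-line" stays a single token.
WORD = re.compile(r"[^\W_]+(?:[-'][^\W_]+)*")

def tokenize(text):
    """Lowercase the text and return its words in order of appearance."""
    return WORD.findall(text.lower())

tokens = tokenize("On-line retrieval: the user's query, re-ranked.")
print(tokens)
```

Dropping `[-']` from the pattern would split "on-line" into two words, which changes what a single-word query for "line" retrieves.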
Keyword-based languages: context queries. Ensure that the words are related. Phrase: “enhance retrieval”. Allows separators and stopwords: “enhance the retrieval”. Proximity: “enhance the quality of information retrieval”. Distance is measured in words or letters; the order may be required to be the same or not. It is not clear how to rank: a research issue.
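Phrase and proximity queries can be answered from a positional index. The sketch below assumes whitespace tokenization and distance measured in words; `build_positional_index` and `proximity` are hypothetical names, and stopword skipping and distance in letters are not handled.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each word to {doc_id: [positions]}; docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def proximity(index, first, second, max_gap, ordered=True):
    """Return ids of documents where the two words are at most max_gap words apart.

    max_gap=1 means adjacent (a two-word phrase).
    With ordered=True, `first` must occur before `second`.
    """
    hits = set()
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    for doc_id in postings_a.keys() & postings_b.keys():
        for pa in postings_a[doc_id]:
            for pb in postings_b[doc_id]:
                gap = pb - pa
                if 0 < gap <= max_gap or (not ordered and 0 < -gap <= max_gap):
                    hits.add(doc_id)
    return hits

docs = {
    1: "enhance retrieval",
    2: "enhance the quality of information retrieval",
    3: "retrieval systems enhance search",
}
index = build_positional_index(docs)
print(proximity(index, "enhance", "retrieval", 1))                 # phrase
print(proximity(index, "enhance", "retrieval", 5))                 # ordered proximity
print(proximity(index, "enhance", "retrieval", 5, ordered=False))  # any order
```

The result is a plain set of documents, which illustrates the ranking problem: nothing in the match itself says whether a gap of one word should score higher than a gap of five.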
Keyword-based languages: Boolean queries. Boolean expressions (can combine basic queries; basic = keyword, pattern). Query syntax tree, e.g., translation AND (syntax OR syntactic); the operations work on sets, and the result is a set. Operators: OR, AND, e1 BUT e2. NOT is not used: it could return (almost) all documents (= unsafe). Good: can highlight occurrences, sort. Bad: difficult for users. Remedy (?): fuzzy Boolean (see below).
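Evaluating a query syntax tree with set operations might look as follows. The tuple encoding of the tree, the toy `index`, and the function name `evaluate` are assumptions made for illustration.

```python
def evaluate(node, index):
    """Evaluate a query syntax tree bottom-up; every node yields a set of doc ids."""
    if isinstance(node, str):            # leaf: a basic keyword query
        return set(index.get(node, ()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "BUT":                      # e1 BUT e2: in e1 but not in e2
        return a - b
    raise ValueError(f"unknown operator: {op}")

# Toy inverted index: word -> set of document ids (made-up data).
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
    "semantic": {4},
}

query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(evaluate(query, index))
print(evaluate(("BUT", query, "semantic"), index))
```

BUT is a binary set difference, so its result is always a subset of its left operand. A unary NOT would need the complement over the whole collection, which is why it is considered unsafe.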
Keyword-based languages: fuzzy Boolean, natural language. Fuzzy Boolean: OR and AND both become “some”. AND punishes for absence, OR encourages multiple matches. Natural ranking: how many times? Natural language: OR = AND; BUT can be expressed (= penalty). How to rank? In different ways. Vector space model: the query is a vector. A document can be taken as a vector, too. Relevance feedback! Proximity is ignored. (Why? Research issue.)
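One of the "different ways" to rank a natural-language query is cosine similarity in the vector space model. The sketch below assumes raw term counts as weights (no IDF) and whitespace tokenization; `cosine` and `rank` are illustrative names.

```python
import math
from collections import Counter

def cosine(query, doc):
    """Cosine of the angle between the term-count vectors of two texts."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return ids of documents with a non-zero score, best first."""
    scored = [(cosine(query, text), doc_id) for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

docs = {
    1: "information retrieval",
    2: "retrieval of information from text",
    3: "database systems",
}
print(rank("information retrieval", docs))
```

Both arguments of `cosine` are treated the same way, so a whole document can be passed as the query. The vectors are bags of words, so word order and distance have no effect on the score: proximity is ignored.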
Pattern matching... Pattern = sequence of features. A text segment matches the pattern. Types: Words. Prefixes, suffixes, substrings: comput-, -ters, -any flow- (many flowers). Ranges: imply some order, e.g., lexicographical (= alphabetic). Allowing errors: Levenshtein (= edit) distance, e.g., historical / hysterical; it counts the number of insertions, deletions, and replacements. A threshold limits the allowed distance.
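The edit distance with a threshold can be sketched with the standard dynamic-programming recurrence. The helper `within_threshold` and the sample vocabulary are illustrative assumptions.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))       # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace ca by cb, or keep if equal
            ))
        prev = curr
    return prev[-1]

def within_threshold(word, vocabulary, k):
    """Words of the vocabulary at edit distance at most k from `word`."""
    return [w for w in vocabulary if levenshtein(word, w) <= k]

print(levenshtein("historical", "hysterical"))
print(within_threshold("historical", ["hysterical", "history", "historical"], 2))
```

For "historical" and "hysterical" the distance is 2 (two replacements), so a threshold of 1 rejects the pair and a threshold of 2 accepts it.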
...Pattern matching. ...Types: Regular expressions: union (= or): if e1, e2 are expressions, (e1 | e2) is too; concatenation: e1 e2; repetition: e* (0 or more occurrences). Extended patterns: user-friendly; can be internally converted into simple ones; case-insensitive, “anything” (wildcard), digit, vowel, ...; conditionals, optional parts; some parts match exactly and others with errors, etc.
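Python's `re` module can illustrate both the three basic operators and a few extended patterns. The concrete patterns below are made-up examples; matching some parts exactly and others with errors is not supported by `re` and is not shown.

```python
import re

# Basic operators only: union (ab | c), repetition (...)*, concatenation with d.
basic = re.compile(r"(ab|c)*d")

# Extended patterns: shorthand that could be rewritten with the basic operators.
colour = re.compile(r"colou?r", re.IGNORECASE)  # optional part, case-insensitive
year = re.compile(r"\d\d\d\d")                  # digit class
loose = re.compile(r"r.tr[aeiou]eval")          # "." = any character, [aeiou] = vowel

print(bool(basic.fullmatch("abcabd")))
print(bool(colour.fullmatch("COLOUR")))
print(year.findall("from 1999 to 2005"))
print(bool(loose.fullmatch("retrieval")))
```

For instance, `colou?r` is shorthand for the union `(color|colour)`, and `\d` for a union of the ten digits.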
Structural queries. Old days: fields. No nesting, no overlap, fixed order. Email: subject, body, sender, ... = relational database with a text type, treated as text should be. Versions of SQL with text operators. Hypertext: not well developed; too free. WebGlimpse: search the neighborhood. Hierarchical: intermediate level of freedom. Volumes, chapters, sections, paragraphs, sentences, ...
[Figure: three kinds of structure: too fixed, too free, intermediate]
Hierarchical models... PAT expressions: the hierarchy is defined at query time. Regions are included in the index, e.g., sections, italics, ... Regions of different types can overlap; regions of the same type cannot. Can query for words in a region, regions in a region, etc. Complex computation, unclear semantics. Overlapped lists: an evolution of PAT; areas of the same type can overlap (but not nest). Uses the same inverted file. Can combine regions, specify order, ... n-words: all (overlapping) areas of n words.
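Queries such as "words in a region" and "regions in a region" can be pictured with regions stored as (start, end) word positions. This is a simplified illustration under that assumption, not the actual PAT or overlapped-lists algorithms; all function names are hypothetical.

```python
def regions_containing(regions, positions):
    """Regions (start, end), inclusive, that contain at least one word position."""
    return [(s, e) for s, e in regions if any(s <= p <= e for p in positions)]

def regions_inside(inner, outer):
    """Regions of `inner` that lie completely inside some region of `outer`."""
    return [(s, e) for s, e in inner
            if any(o_s <= s and e <= o_e for o_s, o_e in outer)]

def n_words(text_length, n):
    """All overlapping areas of n consecutive words in a text of text_length words."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]

sections = [(0, 9), (10, 19)]            # made-up region lists
italics = [(2, 4), (12, 14)]
word_positions = [12]                    # where some query word occurs

print(regions_containing(sections, word_positions))
print(regions_inside(italics, [(10, 19)]))
print(n_words(5, 3))
```

The `n_words` areas overlap each other without nesting, which is the kind of region list that overlapped lists allow and PAT expressions do not.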
[Figure: overlapping lists]
...Hierarchical models... List of references: answers are references (pointers) to regions. Only one type of region (e.g., only sections). No nesting. Known at index time. Ancestry of nodes; can query paths. Proximal nodes: a compromise between expressiveness and efficiency. Many (overlapping) fixed hierarchies. Interesting queries: “3rd paragraph of each chapter”, ...
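A query like "3rd paragraph of each chapter" can be sketched over a fixed hierarchy held as a tree. The `Node` class and `nth_child_of_each` are illustrative assumptions and do not reproduce the proximal nodes model itself; only direct children of each chapter are considered.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                             # e.g. "book", "chapter", "paragraph"
    label: str
    children: list = field(default_factory=list)

def nth_child_of_each(root, parent_kind, child_kind, n):
    """For every node of parent_kind, return its n-th direct child of child_kind."""
    results = []
    stack = [root]
    while stack:                          # iterative pre-order traversal
        node = stack.pop()
        if node.kind == parent_kind:
            matching = [c for c in node.children if c.kind == child_kind]
            if len(matching) >= n:
                results.append(matching[n - 1])
        stack.extend(reversed(node.children))
    return results

book = Node("book", "IR", [
    Node("chapter", "1", [Node("paragraph", f"1.{i}") for i in (1, 2, 3)]),
    Node("chapter", "2", [Node("paragraph", f"2.{i}") for i in (1, 2)]),
    Node("chapter", "3", [Node("paragraph", f"3.{i}") for i in (1, 2, 3, 4)]),
])
print([p.label for p in nth_child_of_each(book, "chapter", "paragraph", 3)])
```

Chapter 2 has only two paragraphs, so it contributes nothing to the answer.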
[Figure: proximal nodes]
...Hierarchical models. Tree matching: the query is a tree, matched against the text tree. Ordered or unordered trees (are siblings ordered?). Prolog-like constraints on different parts of the tree; variables. Answer: the root of a match. Very inefficient (usually NP-hard), due to variables and unordered matching.
Research issues in hierarchical models. Static or dynamic: define the hierarchy at index time or at query time? Static: text markup. Dynamic: tags, indexed. Restrictions on the structure: restrict the structure or restrict the query language, for efficiency. Integration with text: which is of secondary importance, structure (in IR) or text (in DB)? Combine. Query language: standardization, expressiveness taxonomy, categorization.
Query protocols. Used internally. Standard: one client can query different libraries. In CD-ROMs: disk interchangeability. Z39.50: bibliographic (used for other types, too). WAIS (Wide Area Information Service): includes Z39.50. For CD-ROMs: CCL (Common Command Language); CD-RDx (Compact Disk Read only Data Exchange); SFQL (Structured Full-text Query Language), like DB.
[Figure: types of queries we have discussed]
Trends and research topics. Models: to better understand the user needs. Query languages: flexibility, power, expressiveness, functionality. Visual languages. Example: a library shown on the screen; actions: take books, open catalogs, etc. Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!
Conclusions. Width-wide: words, phrases, proximity, fuzzy Boolean, natural language. Depth-wide: pattern matching. If queries return sets, they can be combined using the Boolean model. Combining with structure: hierarchical structure. Standardized low-level languages: protocols. Reusable.
Thank you! Till October 16. October 23: midterm exam.