Alexander Gelbukh www.Gelbukh.com Special Topics in Computer Science The Art of Information Retrieval Chapter 4: Query Languages Alexander Gelbukh www.Gelbukh.com.


Previous chapter
Main measures: precision and recall.
  Defined for sets.
  Rankings are evaluated through their initial subsets.
There are measures that combine them into one.
  They involve user-defined preferences. In the F-measure the weights are set to 50-50.
Many other characteristics.
  An algorithm can be good at some and bad at others.
  Averages are used, but they are not always meaningful.
Reference collections with known answers exist for evaluating new algorithms.
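As a reminder of how these measures are computed for sets, here is a minimal sketch. The function name `precision_recall_f` and the document identifiers are illustrative assumptions; `beta=1.0` corresponds to the 50-50 weighting mentioned above.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall and F-measure (beta=1 weighs both equally)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical answer set and reference judgments.
p, r, f = precision_recall_f(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

To evaluate a ranking, the same function can be applied to each initial subset (top 1, top 2, ...) of the ranked list.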

Previous chapter: research issues
Different types of interfaces; interactive systems:
  What measures to use?
  How do people judge relevance?
  How can “user satisfaction” be measured? Modeled?

Query languages
Query language = types of possible queries.
Types of queries depend on the IR model.
Types:
  IR (= ranked output)
  Data retrieval
  User-oriented
  Low-level (= protocols)
Assume all pre-processing has been done.
  Thesaurus, stop-words, ... (I think this must be a part of the language!)
Returns “documents” (chapter, paragraph, ...).

In this chapter
Keyword-based languages
Pattern matching
Structure taken into account
Protocols

Keyword-based languages: Single word
Intuitive, easy to express, fast ranking.
Words can be highlighted in the output.
What is a word?
  Letters, separators.
  Non-splitting characters: on-line. The database decides.
TF-IDF is designed for words.
Used for the main models (Boolean, Vector, Probabilistic).
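The question of what counts as a word can be made concrete with a small tokenizer. This is only a sketch under one assumed policy: letters and digits form words, everything else separates them, and a hyphen between two word characters is treated as non-splitting. The names `WORD` and `tokenize` are illustrative.

```python
import re

# Assumed policy: runs of letters/digits, optionally joined by single hyphens.
WORD = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase words, keeping hyphenated words whole."""
    return WORD.findall(text.lower())


tokens = tokenize("On-line retrieval: fast, easy!")
```

A different database could just as well decide to split on the hyphen; the point is that the choice is made once, at indexing time, and queries must follow it.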

Keyword-based languages: Context Queries
Ensure that the words are related.
Phrase: “enhance retrieval”
  Allows separators and stopwords: “enhance the retrieval”
Proximity: “enhance the quality of information retrieval”
  Distance: in words or in letters. Order: same or not.
Not clear how to rank. Research issue.
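The two kinds of context query can be sketched as follows. The stopword list, the function names, and the choice to measure distance in words are all assumptions made for the illustration; ranking is deliberately left out.

```python
import re

STOPWORDS = {"the", "of", "a", "an"}  # illustrative list, not a standard one


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def content_words(text):
    return [w for w in tokenize(text) if w not in STOPWORDS]


def phrase_match(text, phrase):
    """True if the phrase occurs in the text, ignoring separators and stopwords."""
    words, target = content_words(text), content_words(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def proximity_match(text, w1, w2, max_distance, ordered=False):
    """True if w1 and w2 occur within max_distance words of each other."""
    words = tokenize(text)
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    for i in pos1:
        for j in pos2:
            gap = j - i
            if ordered and gap <= 0:
                continue  # w2 must come after w1
            if 0 < abs(gap) <= max_distance:
                return True
    return False


text = "enhance the quality of information retrieval"
```

With this text, `proximity_match(text, "enhance", "retrieval", 5)` succeeds, while a maximum distance of 4 does not.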

Keyword-based languages: Boolean Queries
Boolean expressions (can combine basic queries).
Query syntax tree: translation AND (syntax OR syntactic)
  Operations on the sets. Result: a set.
OR, AND, e1 BUT e2.
  NOT is not used: it could give (almost) all docs (= unsafe).
Good: can highlight occurrences, sort.
Bad: difficult for the users.
  Remedy (?): fuzzy Boolean (see below).
Basic = keyword, pattern.
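A Boolean query can be evaluated by walking its syntax tree and applying set operations to the answer sets of the basic queries. The sketch below assumes a tiny in-memory inverted index and a tuple encoding of the tree; both, along with the sample documents, are illustrative choices.

```python
from collections import defaultdict


def build_index(docs):
    """Inverted index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def evaluate(node, index):
    """Evaluate a syntax tree: a keyword string or (operator, left, right)."""
    if isinstance(node, str):
        return set(index.get(node, set()))
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    if op == "BUT":  # e1 BUT e2: safe replacement for NOT
        return a - b
    raise ValueError(f"unknown operator: {op}")


docs = {
    1: "query translation syntax",
    2: "syntactic translation",
    3: "translation memory",
    4: "syntax tree",
}
index = build_index(docs)
result = evaluate(("AND", "translation", ("OR", "syntax", "syntactic")), index)
excluded = evaluate(("BUT", "translation", "syntax"), index)
```

BUT only ever removes documents from an existing answer set, which is why it avoids the "almost all docs" problem of a free-standing NOT.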

Keyword-based languages: Fuzzy Boolean, Natural Language
Fuzzy Boolean: OR and AND are blurred into “some”.
  AND punishes for absence, OR encourages multiple.
  Natural ranking: how many times?
Natural language: OR = AND.
  BUT can be expressed (= penalty).
  How to rank? Different ways.
Vector space model
  Query is a vector.
  A doc can be taken as a vector, too. Hence: relevance feedback!
  Proximity is ignored. (Why? Research issue.)
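One of the "different ways" to rank a natural language query is the vector space model. The sketch below uses raw term frequencies and cosine similarity, which is a simplification (no IDF weighting); the function names and sample documents are assumptions.

```python
import math
from collections import Counter


def vector(text):
    """Bag-of-words vector with raw term frequencies."""
    return Counter(text.lower().split())


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Ids of documents with non-zero similarity, best match first."""
    q = vector(query)
    scored = [(cosine(q, vector(text)), doc_id) for doc_id, text in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


docs = {
    "a": "information retrieval query languages",
    "b": "query languages for databases",
    "c": "cooking recipes",
}
ranking = rank("information retrieval languages", docs)
```

Because the query is just a piece of text, a whole document can be passed as `query`, which is the idea behind relevance feedback. Word positions never enter the computation, so proximity is ignored.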

Pattern matching ...
Pattern = sequence of features.
Text segment matches the pattern.
Types:
  Words
  Prefixes, suffixes, substrings: comput-, -ters, -any flow- (many flowers).
  Ranges: imply some order, e.g., lexicographical (= alphabetic).
  Allowing errors
    Levenshtein (= edit) distance: historical / hysterical.
    Number of insertions, deletions, replacements. Threshold.
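Matching with errors can be illustrated with the classic dynamic-programming computation of the edit distance. The helper `matches_with_errors` and the sample vocabulary are assumptions added for the example.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]


def matches_with_errors(word, pattern, threshold):
    return levenshtein(word, pattern) <= threshold


# Prefix and suffix patterns reduce to simple string tests.
vocabulary = ["computer", "computing", "letters", "filters", "history"]
prefix_hits = [w for w in vocabulary if w.startswith("comput")]
suffix_hits = [w for w in vocabulary if w.endswith("ters")]
```

For the pair from the slide, `levenshtein("historical", "hysterical")` is 2 (two replacements), so the words match with a threshold of 2 but not with a threshold of 1.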

... Pattern matching
... Types:
  Regular expressions
    union (= or): if e1 and e2 are expressions, so is (e1 | e2)
    concatenation: e1 e2
    repetition: e* (0 or more occurrences)
  Extended patterns
    user-friendly; can be internally converted into simple ones
    case-insensitive, “anything” (wildcard), digit, vowel, ...
    conditionals, optional parts
    some parts match exactly and others with errors, etc.
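Python's standard `re` module is one concrete notation in which both the basic operators and several of the extended features can be tried. The specific patterns below are illustrative examples, not taken from the slides.

```python
import re

# Basic operators: union (|), concatenation, repetition (*).
basic = re.compile(r"(syntax|syntactic) trees*")

# Extended patterns, each convertible to the basic operators:
case_insensitive = re.compile(r"chapter \d+", re.IGNORECASE)  # digit class
optional_part = re.compile(r"colou?r")                        # optional "u"
vowel_class = re.compile(r"b[aeiou]t")                        # any vowel
wildcard = re.compile(r"wom.n")                               # any one character


def full(pattern, text):
    """True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None
```

For example, `full(basic, "syntactic tree")` and `full(case_insensitive, "Chapter 4")` both hold. Matching some parts exactly and others with errors is not covered by `re` and would need a separate approximate-matching step.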

Structural queries
Old days: fields.
  No nesting, no overlap, fixed order: subject, body, sender, ...
  = Relational database with a text type, treated as text should be.
  Versions of SQL with text operators.
Hypertext
  Not well developed. Too free.
  WebGlimpse: search the neighborhood.
Hierarchical
  Intermediate level of freedom.
  Volumes, chapters, sections, paragraphs, sentences, ...

[Figure: fields (too fixed), hypertext (too free), hierarchical structure (intermediate)]

Hierarchical models ...
PAT expressions
  Hierarchy is defined at query time.
  Regions are included in the index, e.g., sections, italics, ...
  Regions of different types can overlap; regions of the same type cannot.
  Can query for words in a region, regions in a region, etc.
  Complex computation, unclear semantics.
Overlapped lists
  Evolution of PAT: areas of the same type can overlap (but not nest).
  Uses the same inverted file.
  Can combine regions, specify order, ...
  n-words: all (overlapping) areas of n words.
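Region queries of this kind can be pictured with regions stored as (start, end) word positions. The sketch below is a deliberately naive illustration of "words in a region" and "regions in a region", plus the n-words construction; it is not the actual PAT or overlapped-lists algorithm, and all names and positions are assumptions.

```python
def words_in_regions(word_positions, regions):
    """Positions of a word that fall inside at least one region."""
    return [p for p in word_positions if any(s <= p <= e for s, e in regions)]


def regions_containing(regions, word_positions):
    """Regions that contain at least one occurrence of a word."""
    return [(s, e) for s, e in regions if any(s <= p <= e for p in word_positions)]


def regions_inside(inner, outer):
    """Regions of one type fully contained in some region of another type."""
    return [(s, e) for s, e in inner
            if any(o_s <= s and e <= o_e for o_s, o_e in outer)]


def n_words(text_length, n):
    """All (overlapping) areas of n consecutive words."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]


# Hypothetical index entries: regions of different types may overlap.
sections = [(0, 9), (10, 19)]
italics = [(3, 5), (8, 12)]  # the second one straddles a section boundary
occurrences = [4, 15, 25]    # positions of some query word
```

Every operation takes lists of positions or regions and returns a list of the same kind, which is what lets such queries be combined.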

Overlapping lists

... Hierarchical models ...
List of references
  Answers are references (pointers) to regions.
  Only one type of regions in an answer (e.g., only sections). No nesting.
  Hierarchy is known at index time.
  Ancestry of nodes. Can query paths.
Proximal nodes
  Compromise between expressiveness and efficiency.
  Many (overlapping) fixed hierarchies.
  Interesting queries: “3rd paragraph of each chapter”, ...
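A query such as “3rd paragraph of each chapter” can be sketched over a fixed hierarchy held as a tree. This is only an illustration of the query's meaning, with an assumed `Node` class and a direct tree walk; the proximal nodes model itself works on indexes, and only direct children are considered here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                      # e.g. "book", "chapter", "paragraph"
    text: str = ""
    children: list = field(default_factory=list)


def nth_child_of_each(root, parent_kind, child_kind, n):
    """For each parent_kind node, its n-th direct child of child_kind (1-based)."""
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.kind == parent_kind:
            kids = [c for c in node.children if c.kind == child_kind]
            if len(kids) >= n:
                results.append(kids[n - 1])
        stack.extend(reversed(node.children))  # keep document order
    return results


book = Node("book", children=[
    Node("chapter", children=[Node("paragraph", "c1p1"),
                              Node("paragraph", "c1p2"),
                              Node("paragraph", "c1p3")]),
    Node("chapter", children=[Node("paragraph", "c2p1"),
                              Node("paragraph", "c2p2")]),
    Node("chapter", children=[Node("title", "T"),
                              Node("paragraph", "c3p1"),
                              Node("paragraph", "c3p2"),
                              Node("paragraph", "c3p3")]),
])
third = [p.text for p in nth_child_of_each(book, "chapter", "paragraph", 3)]
```

The second chapter has only two paragraphs, so it contributes nothing to the answer.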

Proximal nodes

... Hierarchical models
Tree matching
  Query is a tree. Match the text tree.
  Ordered or unordered trees (are siblings ordered?).
  Prolog-like constraints on different parts of the tree.
    Variables
  Answer: root of a match.
  Very inefficient (usually NP-hard).
    Due to variables and unordered matching.
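The difference between ordered and unordered matching can be seen in a brute-force sketch. Trees are encoded as `(label, children)` tuples, variables are left out entirely, and pattern children are tried against every selection of text children, which hints at why the general problem is so expensive. The encoding and function names are assumptions.

```python
from itertools import permutations


def matches_at(pattern, node, ordered=True):
    """True if pattern matches at node: same label, and every pattern child
    matches a distinct child of node (in the same order if ordered)."""
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != n_label or len(p_children) > len(n_children):
        return False
    for combo in permutations(range(len(n_children)), len(p_children)):
        if ordered and list(combo) != sorted(combo):
            continue  # ordered matching keeps sibling order
        if all(matches_at(pc, n_children[i], ordered)
               for pc, i in zip(p_children, combo)):
            return True
    return False


def find_matches(pattern, node, ordered=True):
    """Answer: the roots of all matches, in document order."""
    found = [node] if matches_at(pattern, node, ordered) else []
    for child in node[1]:
        found.extend(find_matches(pattern, child, ordered))
    return found


section = ("section", [("figure", []), ("title", [])])
text_tree = ("chapter", [("title", []), section])
query = ("section", [("title", []), ("figure", [])])
```

Here the query asks for a title followed by a figure, while the text has them the other way round, so only the unordered match succeeds.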

Research issues in hierarchical models
Static or dynamic?
  Define the hierarchy at index time or at query time?
  Static: text markup. Dynamic: tags, indexed.
Restrictions on the structure
  Restrict the structure or restrict the query language, for efficiency.
Integration with text
  Which is of secondary importance: structure (in IR) or text (in DB)? Combine.
Query language
  Standardization, expressiveness taxonomy, categorization.

Query protocols
Used internally.
Standard: one client can query different libraries.
  In CD-ROMs: disk interchangeability.
Z39.50: bibliographic (used for other types, too).
WAIS (Wide Area Information Service)
  Includes Z39.50.
For CD-ROMs:
  CCL, Common Command Language
  CD-RDx (Compact Disk Read only Data Exchange)
  SFQL (Structured Full-text Query Language). Like DB query languages.

Types of queries we have discussed

Trends and research topics
Models: to better understand the user needs.
Query languages: flexibility, power, expressiveness, functionality.
Visual languages
  Example: library shown on the screen. Act: take books, open catalogs, etc.
  Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Conclusions
Width-wide: words, phrases, proximity, fuzzy Boolean, natural language.
Depth-wide: pattern matching.
If queries return sets, they can be combined using the Boolean model.
Combining with structure
  Hierarchical structure
Standardized low-level languages: protocols
  Reusable

Thank you!
Till October 16.
October 23: midterm exam.
