Special Topics in Computer Science: The Art of Information Retrieval. Chapter 4: Query Languages. Alexander Gelbukh, www.Gelbukh.com

Previous chapter
Main measures: precision and recall, defined for sets. Rankings are evaluated through their initial subsets. Some measures combine the two into one; these involve user-defined preferences, which the F-measure sets to 50-50. There are many other characteristics, and an algorithm can be good at some and bad at others. Averages are used, but they are not always meaningful. Reference collections with known answers exist for evaluating new algorithms.
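As a reminder of how the set-based measures fit together, here is a minimal sketch. The function name and the `beta` parameter are illustrative assumptions, not taken from the slides; `beta = 1` corresponds to the 50-50 preference mentioned above.

```python
def precision_recall_f(retrieved, relevant, beta=1.0):
    """Set-based precision, recall and F-measure (beta=1 weighs both equally)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f
```

For example, retrieving documents {1, 2, 3, 4} when {2, 4, 6} are relevant gives precision 1/2 and recall 2/3.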

Previous chapter: research issues
Different types of interfaces; interactive systems: What measures should be used? How do people judge relevance? How can “user satisfaction” be measured? Modeled?

Query languages
Query language = the types of queries that can be posed. The types of queries depend on the IR model. Types: IR (= ranked output); data retrieval; user-oriented; low-level (= protocols). We assume all pre-processing has been done: thesaurus, stop-words, ... (I think this must be a part of the language!) A query returns “documents” (chapter, paragraph, ...).

In this chapter
Keyword-based languages. Pattern matching. Structure taken into account. Protocols.

Keyword-based languages: single word
Intuitive, easy to express, fast ranking. Words can be highlighted in the output. What is a word? Letters delimited by separators. Some characters do not split words, as in “on-line”; the database decides. TF-IDF weights are designed for words. Single-word queries are used in the main models (Boolean, Vector, Probabilistic).
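A minimal sketch of one possible answer to “what is a word?” and of TF-IDF weighting. The tokenization rule (the hyphen as a non-splitting character) and the plain `tf * log(N / df)` formula are assumptions chosen for illustration; real systems make their own choices here.

```python
import math
import re
from collections import Counter

# Assumption: a word is a run of letters/digits; an inner hyphen does not split it.
TOKEN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def tokenize(text):
    """Lower-case the text and split it into words."""
    return TOKEN.findall(text.lower())

def tf_idf(docs):
    """Return one {word: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))  # document frequency
    weights = []
    for toks in tokenized:
        tf = Counter(toks)                                    # term frequency
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights
```

With this rule, “on-line” stays one word while “on line” gives two, and a word that occurs in every document gets weight 0.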

Keyword-based languages: context queries
Ensure that the words are related. Phrase: “enhance retrieval”. A phrase allows separators and stopwords in between: it also matches “enhance the retrieval”. Proximity: matches “enhance the quality of information retrieval”. Distance is measured in words or letters; the order can be required to be the same or not. It is not clear how to rank the results: a research issue.
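Phrase and proximity queries can be sketched over a positional index. This is an illustrative assumption about the data structure, with distance counted in words; for brevity the phrase check below requires strict adjacency and does not skip stopwords.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map word -> {doc_id: [word positions]}; docs is a list of strings."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def proximity(index, w1, w2, max_gap, ordered=True):
    """Doc ids where w2 occurs within max_gap word positions of w1."""
    result = set()
    common = index.get(w1, {}).keys() & index.get(w2, {}).keys()
    for doc_id in common:
        for p1 in index[w1][doc_id]:
            for p2 in index[w2][doc_id]:
                delta = p2 - p1
                if 0 < delta <= max_gap or (not ordered and 0 < -delta <= max_gap):
                    result.add(doc_id)
    return result

def phrase(index, w1, w2):
    """A two-word phrase is the ordered proximity query with distance 1."""
    return proximity(index, w1, w2, max_gap=1)
```

The `ordered` flag corresponds to “order: same or not”; how to rank the matching documents is left open, as on the slide.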

Keyword-based languages: Boolean queries
Boolean expressions (can combine basic queries). Query syntax tree: translation AND (syntax OR syntactic) → operations on the sets. Result: a set. Operators: OR, AND, e1 BUT e2. NOT is not used: it could return (almost) all docs (= unsafe). Good: can highlight occurrences, sort. Bad: difficult for the users. Remedy (?): fuzzy Boolean (see below). Basic query = keyword, pattern.
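The translation of a query syntax tree into set operations can be sketched as follows. The tuple encoding of the tree and the toy index (word → set of document ids) are assumptions made for the example.

```python
def evaluate(node, index):
    """Evaluate a query syntax tree bottom-up; leaves are keywords."""
    if isinstance(node, str):
        return index.get(node, set())
    op, left, right = node
    a, b = evaluate(left, index), evaluate(right, index)
    if op == "AND":
        return a & b          # intersection
    if op == "OR":
        return a | b          # union
    if op == "BUT":
        return a - b          # e1 BUT e2: satisfy e1 but not e2
    raise ValueError(f"unknown operator: {op}")

# Hypothetical inverted index: word -> set of document ids.
index = {
    "translation": {1, 2, 4},
    "syntax": {2, 3},
    "syntactic": {4, 5},
}
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
print(sorted(evaluate(query, index)))
```

BUT stays safe because it only removes documents from an already computed set, whereas a free-standing NOT would have to enumerate (almost) the whole collection.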

Keyword-based languages: fuzzy Boolean, natural language
Fuzzy Boolean: OR ≈ AND = “some”. AND punishes for absence, OR encourages multiple matches. Natural ranking: how many times? Natural language: OR = AND; BUT can be expressed (= penalty). How to rank? There are different ways. Vector space model: the query is a vector, and a doc can be taken as a vector → relevance feedback! Proximity is ignored (Why? Research issue.)
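One of the “different ways” to rank a natural language query is the vector space model. A minimal sketch, assuming raw term counts as vector components and cosine similarity as the score (both are simplifying assumptions; weights such as TF-IDF are common in practice):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * \
           math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank(query, docs):
    """Return ids of documents sharing a term with the query, best first."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), i) for i, d in enumerate(docs)]
    ordered = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ordered if score > 0]
```

Because the query and the documents live in the same space, a whole document can be passed as `query`, which is what makes relevance feedback natural here. Word positions never enter the computation, so proximity is ignored.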

Pattern matching...
Pattern = sequence of features. A text segment matches the pattern. Types: Words. Prefixes, suffixes, substrings: comput-, -ters, -any flow- (many flowers). Ranges: imply some order, e.g., lexicographical (= alphabetic). Allowing errors: Levenshtein (= edit) distance, e.g., historical / hysterical; it counts the number of insertions, deletions, and replacements, up to a threshold.
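The edit distance can be computed with the standard dynamic-programming recurrence. The helper `within_threshold` and its vocabulary argument are hypothetical names added to show how a threshold would be applied.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and replacements turning a into b."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                  # delete ca
                curr[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),     # replace (free if equal)
            ))
        prev = curr
    return prev[-1]

def within_threshold(word, vocabulary, k):
    """Vocabulary words at edit distance at most k from the query word."""
    return [w for w in vocabulary if levenshtein(word, w) <= k]
```

For the pair on the slide, historical / hysterical, two replacements (i → y, o → e) are needed, so the distance is 2.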

...Pattern matching
...Types: Regular expressions. Union (= or): if e1, e2 are expressions, (e1 | e2) is too. Concatenation: e1 e2. Repetition: e* (0 or more occurrences). Extended patterns: user-friendly; can be internally converted into simple ones. Case-insensitive, “anything” (wildcard), digit, vowel, ... Conditionals, optional parts. Some parts match exactly and others with errors, etc.
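These operators map directly onto Python's `re` module, used here only as a convenient stand-in for a retrieval system's pattern language. The example patterns are illustrative choices, not taken from the slides.

```python
import re

# Union (|), concatenation and repetition (*) of classic regular expressions.
# "s?" is shorthand for (s | empty) and "[012]" for (0 | 1 | 2).
REGEX = re.compile(r"pro(blem|tein)s?[012]*")

def matches_word(word):
    """True if the whole word matches the regular expression."""
    return REGEX.fullmatch(word) is not None

# Extended patterns: a case-insensitive prefix with a wildcard tail (comput-),
# and a digit class with an optional part.
PREFIX = re.compile(r"comput.*", re.IGNORECASE)
CHAPTER = re.compile(r"chapter \d+(\.\d+)?")
```

The extended forms are conveniences: `s?` and `[012]` could be written with plain union, which is the sense in which extended patterns are converted internally into simple ones.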

Structural queries
Old days: fields. No nesting, no overlap, fixed order. Email: subject, body, sender, ... = a relational database with a text type, treated as text should be. Versions of SQL with text operators. Hypertext: not well developed; too free. WebGlimpse: search the neighborhood. Hierarchical: intermediate level of freedom. Volumes, chapters, sections, paragraphs, sentences, ...

[Figure: fields (too fixed), hypertext (too free), hierarchical (intermediate)]

Hierarchical models...
PAT expressions. The hierarchy is defined at query time. Regions are included in the index, e.g., sections, italics, ... Regions of different types can overlap; regions of the same type can’t. Can query for words in a region, regions in a region, etc. Complex computation, unclear semantics.
Overlapped lists. Evolution of PAT: areas of the same type can overlap (but not nest). Uses the same inverted file. Can combine regions, specify order, ... n-words: all (overlapping) areas of n words.
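The region queries above (“words in a region, regions in a region”) can be sketched with regions stored as (start, end) word positions. This interval representation and the function names are assumptions for illustration; they are not the actual PAT or overlapped-lists implementation.

```python
def contained_in(inner, outer):
    """Regions from `inner` that lie fully inside some region of `outer`."""
    return [(s, e) for (s, e) in inner
            if any(o_start <= s and e <= o_end for (o_start, o_end) in outer)]

def word_regions(positions):
    """Treat each word occurrence as a region of length one."""
    return [(p, p) for p in positions]

def n_words(text_length, n):
    """All (overlapping) areas of n consecutive words in a text."""
    return [(i, i + n - 1) for i in range(text_length - n + 1)]
```

Since a word occurrence is just a one-word region, the same containment operation answers both kinds of query.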

Overlapping lists

...Hierarchical models...
List of references. Answers are references (pointers) to regions. Only one type of region (e.g., only sections). No nesting. Known at index time. Ancestry of nodes. Can query paths.
Proximal nodes. Compromise between expressiveness and efficiency. Many (overlapping) fixed hierarchies. Interesting queries: “3rd paragraph of each chapter”, ...
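A query such as “3rd paragraph of each chapter” can be sketched over a fixed hierarchy known at index time. The `Node` class and the helper names are hypothetical; the sketch shows the kind of positional query meant, not the proximal nodes evaluation strategy itself.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # e.g. "book", "chapter", "section", "paragraph"
    label: str
    children: list = field(default_factory=list)

def descendants(node, kind):
    """Yield all descendants of the given kind, in document order."""
    for child in node.children:
        if child.kind == kind:
            yield child
        yield from descendants(child, kind)

def nth_of_each(root, outer_kind, inner_kind, n):
    """The n-th inner_kind node (1-based) inside each outer_kind node."""
    result = []
    for outer in descendants(root, outer_kind):
        inner = list(descendants(outer, inner_kind))
        if len(inner) >= n:
            result.append(inner[n - 1])
    return result
```

Calling `nth_of_each(book, "chapter", "paragraph", 3)` returns one node per chapter that has at least three paragraphs, regardless of how the paragraphs are grouped into sections.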

Proximal nodes

...Hierarchical models
Tree matching. The query is a tree, matched against the text tree. Ordered or unordered trees (are siblings ordered?). Prolog-like constraints on different parts of the tree; variables. Answer: the root of a match. Very inefficient (usually NP-hard), due to variables and unordered matching.
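A brute-force sketch of tree matching, under several assumptions: trees are `(label, [children])` tuples, the label `"*"` matches anything, and pattern children must map to distinct children of the text node. Variables and Prolog-like constraints are not modeled.

```python
from itertools import permutations

def matches_at(pattern, node, ordered=True):
    """True if the pattern tree matches with its root placed at `node`."""
    p_label, p_children = pattern
    n_label, n_children = node
    if p_label != "*" and p_label != n_label:
        return False
    if ordered:
        # Pattern children must match node children in the same left-to-right
        # order; the shared iterator never goes back to earlier siblings.
        it = iter(n_children)
        return all(any(matches_at(pc, nc, True) for nc in it) for pc in p_children)
    # Unordered: try every assignment of pattern children to distinct node
    # children. This is exponential in the number of siblings.
    return any(
        all(matches_at(pc, nc, False) for pc, nc in zip(p_children, chosen))
        for chosen in permutations(n_children, len(p_children))
    )

def find_matches(pattern, root, ordered=True):
    """Answer: the roots of all matches, in document order."""
    found = [root] if matches_at(pattern, root, ordered) else []
    for child in root[1]:
        found.extend(find_matches(pattern, child, ordered))
    return found
```

The permutation loop in the unordered case gives a feel for why unordered matching is expensive.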

Research issues in hierarchical models
Static or dynamic? Define the hierarchy at index time or at query time? Static: text markup. Dynamic: tags, indexed. Restrictions on the structure: restrict the structure or restrict the query language, for efficiency. Integration with text: which is of secondary importance, structure (in IR) or text (in DB)? Combine. Query language: standardization, expressiveness taxonomy, categorization.

Query protocols
Used internally. Standard: one client can query different libraries. In CD-ROMs: disk interchangeability. Z39.50: bibliographic (used for other types, too). WAIS (Wide Area Information Service): includes Z39.50. For CD-ROMs: CCL (Common Command Language); CD-RDx (Compact Disk Read only Data Exchange); SFQL (Structured Full-text Query Language), like DB query languages.

Types of queries we have discussed

Trends and research topics
Models: to better understand the user needs. Query languages: flexibility, power, expressiveness, functionality. Visual languages. Example: a library shown on the screen; the user acts: takes books, opens catalogs, etc. Better Boolean queries: “I need books by Cervantes AND Lope de Vega”?!

Conclusions
Width-wide: words, phrases, proximity, fuzzy Boolean, natural language. Depth-wide: pattern matching. If queries return sets, they can be combined using the Boolean model. Combining with structure: hierarchical structure. Standardized low-level languages: protocols. Reusable.

Thank you! Till October 16. October 23: midterm exam.