Intelligent Information Retrieval CS 336 Xiaoyan Li Spring 2006 Modified from Lisa Ballesteros’s slides.



What is Information Retrieval?
Includes the following:
  – Organization
  – Storage/Representation
  – Manipulation/Analysis
  – Search/Retrieval
How far back in history can we find examples?

IR Through the Ages
3rd Century BCE
  – Library of Alexandria
      500,000 volumes
      catalogs and classifications
13th Century A.D.
  – First concordance of the Bible
      What is a concordance?
15th Century A.D.
  – Invention of printing
1600
  – University of Oxford Library
      All books printed in England

IR Through the Ages
1755
  – Johnson’s Dictionary
      Set standard for dictionaries
      Included common language
      Helped standardize spelling
1800
  – Library of Congress
1828
  – Webster’s Dictionary
      Significantly larger than previous dictionaries
      Standardized American spelling
1852
  – Roget’s Thesaurus

IR Through the Ages
1876
  – Dewey Decimal Classification
1880’s
  – Carnegie Public Libraries
      1,681 built (first public library 1850)
1930’s
  – Punched card retrieval systems
1940’s
  – Bush’s Memex
  – Shannon’s Communication Theory
  – Zipf’s “Law”

Historical Summary
1960’s
  – Basic advances in retrieval and indexing techniques
1970’s
  – Probabilistic and vector space models
  – Clustering, relevance feedback
  – Large, on-line, Boolean information services
  – Fast string matching
1980’s
  – Natural Language Processing and IR
  – Expert systems and IR
  – Off-the-shelf IR systems

IR Through the Ages
Late 1980’s
  – First mini-computer and PC systems incorporating “relevance ranking”
Early 1990’s
  – Information storage revolution
1992
  – First large-scale information service incorporating probabilistic retrieval (West’s legal retrieval system)

IR Through the Ages
Mid 1990’s to present
  – Multimedia databases
1994 to present
  – The Internet and Web explosion
      e.g. Google, Yahoo, Lycos, Infoseek (now Go)
1995 to present
  – Digital Libraries
  – Data Mining
  – Agents and Filtering
  – Knowledge and Distributed Intelligence
  – Information Organization
  – Knowledge Management

Historical Summary
1990’s
  – Large-scale, full-text IR and filtering experiments and systems (TREC)
  – Dominance of ranking
  – Many web-based retrieval engines
  – Interfaces and browsing
  – Multimedia and multilingual
  – Machine learning techniques

Trends in IR Technology
[Figure: growth of on-line information over time. Eras along the time axis: Batch systems, Interactive systems, Database Systems, Cheap Storage, Internet, Multimedia. Volume scale: Gigabytes, Terabytes, Petabytes. Technologies: Boolean Retrieval and Filtering, Ranked Retrieval, Distributed Retrieval, Concept-Based Retrieval, Image and Video Retrieval, Information Extraction, Visualization, Summarization, Data Mining, Ranked Filtering.]
A 1-page Word document without any images = ~10 kilobytes (KB) of disk space.
1 terabyte = one hundred million imageless Word docs.
1 petabyte = one thousand terabytes.
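The storage figures on this slide can be checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 KB = 1,000 bytes) and takes the slide's rough 10 KB per one-page document at face value; the variable names are made up for illustration.

```python
# Assumption: decimal (SI) units, as the slide's round numbers suggest.
KILOBYTE = 10**3
TERABYTE = 10**12
PETABYTE = 10**15

# Assumption from the slide: a one-page, image-free document is about 10 KB.
DOC_BYTES = 10 * KILOBYTE

docs_per_terabyte = TERABYTE // DOC_BYTES
terabytes_per_petabyte = PETABYTE // TERABYTE

print(f"documents per terabyte: {docs_per_terabyte:,}")
print(f"terabytes per petabyte: {terabytes_per_petabyte:,}")
```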

Historical Summary
The Future
  – Logic-based IR?
  – NLP?
  – Integration with other functionality
  – Distributed, heterogeneous database access
  – IR in context
  – “Anytime, Anywhere”

Information Retrieval
Ad Hoc Retrieval
  – Given a query and a large database of text objects, find the relevant objects
Distributed Retrieval
  – Many distributed databases
Information Filtering
  – Given a text object from an information stream (e.g. newswire) and many profiles (long-term queries), decide which profiles match
Multimedia Retrieval
  – Databases of other types of unstructured data, e.g. images, video, audio

Information Retrieval
Multilingual Retrieval
  – Retrieval in a language other than English
Cross-language Retrieval
  – Query in one language (e.g. Spanish), retrieve documents in other languages (e.g. Chinese, French, and Spanish)

What does an IR system do?
Generate a representation of each document
  – essentially pick best words and/or phrases
Generate query representation
  – if documents processed specially, queries must also be
  – possibly weight query words
Match queries and documents
  – find relevant documents
Perhaps, rank and sort documents
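The four steps on this slide can be sketched in a few lines. This is a minimal illustration, not how any particular system works: it assumes a bag-of-words representation and a simple term-overlap score, and the names `represent`, `score`, and `search`, along with the three sample documents, are invented for the example.

```python
from collections import Counter

def represent(text):
    """Bag-of-words representation: lowercase word counts."""
    return Counter(text.lower().split())

def score(query_rep, doc_rep):
    """Overlap score: for each query term, query weight times document count."""
    return sum(weight * doc_rep[term] for term, weight in query_rep.items())

def search(query, docs):
    """Represent, match, then rank; documents with no overlap are dropped."""
    query_rep = represent(query)  # queries are processed the same way as documents
    scored = [(score(query_rep, represent(text)), doc_id)
              for doc_id, text in docs.items()]
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ranked if s > 0]

# Hypothetical three-document collection.
docs = {
    "d1": "interest rates rise",
    "d2": "bank raises interest rates on bank loans",
    "d3": "river bank erosion",
}
results = search("bank interest", docs)
print(results)
```

Here d2 comes first because it matches both query words, and "bank" twice. The river-bank document d3 is still returned, which is the ambiguity problem raised later in the lecture.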

Information Retrieval
Text Representation (Indexing)
  – given a text document, identify the concepts that describe the content and how well they describe it
      what makes a “good” representation?
      how is a representation generated from text?
      what are retrievable objects and how are they organized?
Representing an Information Need (Query Formulation)
  – describe and refine information needs as explicit queries
      what is an appropriate query language?
      how can interactive query formulation and refinement be supported?

Information Retrieval
Comparing Representations (Retrieval)
  – compare text and information need representations to determine which documents are likely to be relevant
      what is a “good” model of retrieval?
      how is uncertainty represented?
Evaluating Retrieved Text (Feedback)
  – present documents for user evaluation and modify query based on feedback
      what are good metrics?
      what constitutes a good experimental testbed?

Information Retrieval and Filtering
[Diagram. Box labels: Information Need, Text Objects, Representation (appears twice), Query, Indexed Objects, Comparison, Retrieved Objects, Evaluation/Feedback]

Features of a Modern IR Product
  – Effective “relevance ranking”
  – Simple free text (“natural language”) query capability
  – Boolean and proximity operators
  – Term weighting
  – Query formulation assistance
  – Query by example
  – Filtering
  – Field-based retrieval
  – Distributed architecture
  – Index anything
  – Fast retrieval
  – Information Organization

Typical Systems
IR systems
  – Verity, Fulcrum, Excalibur
Database systems
  – Oracle, Informix
Web search and In-house systems
  – West, LEXIS/NEXIS, Dialog
  – Yahoo, Google, MSN, AskJeeves

IR vs. Database Systems
  – Emphasis on effective, efficient retrieval of unstructured data
  – IR systems typically have very simple schemas
  – Query languages emphasize free text, although Boolean combinations of words are also common

IR vs. Database Systems
Matching is more complex than with structured data (semantics less obvious)
  – easy to retrieve the wrong objects
  – need to measure accuracy of retrieval
Less focus on concurrency control and recovery, although update is very important

Ambiguity Complicates the Task
Synonyms: many ways to express a concept
  – lorry/truck, elevator/lift, pump/impeller, hypertension/high blood pressure
  – failure to use specific words => failure to get the document
Words have many meanings
  – How many different meanings are there for “bank”?

Ambiguity Complicates the Task
Difficult to Specify Important but Vague Concepts
  – e.g. will interest rates be raised in the next six months?
Spelling variants / spelling errors

Basic Automatic Indexing
Parse documents to recognize structure
  – e.g. title, date, other fields
Scan for word tokens
  – numbers, special characters, hyphenation, capitalization, etc.
  – languages like Chinese need segmentation
  – record positional information for proximity operators
Stopword removal
  – based on short list of common words such as “the”, “and”, “or”
  – saves storage overhead of very long indexes
  – can be dangerous (e.g. “Mr. The”, “and-or gates”)
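The tokenizing, stopword, and positional steps above can be sketched as a small positional inverted index. This is an illustrative sketch under simplifying assumptions: the regular-expression tokenizer handles only ASCII letters and digits (so it ignores the hyphenation and segmentation issues the slide mentions), the stopword list is just the three words named on the slide, and `tokenize`, `build_index`, and `near` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Assumption: the three-word stopword list given on the slide.
STOPWORDS = {"the", "and", "or"}

def tokenize(text):
    """Lowercase and split on anything that is not an ASCII letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map term -> doc id -> list of token positions, skipping stopwords."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            if token in STOPWORDS:
                continue  # positions still advance, so distances stay accurate
            index[token][doc_id].append(position)
    return index

def near(index, term_a, term_b, k):
    """Proximity operator: doc ids where the two terms occur within k positions."""
    hits = []
    for doc_id, positions_a in index.get(term_a, {}).items():
        positions_b = index.get(term_b, {}).get(doc_id, [])
        if any(abs(a - b) <= k for a in positions_a for b in positions_b):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical two-document collection.
docs = {
    "d1": "Information retrieval and the Web",
    "d2": "Retrieval of structured information from the database and the Web",
}
index = build_index(docs)
adjacent = near(index, "information", "retrieval", 1)
print(adjacent)
```

Because "the" never enters the index, a query such as “Mr. The” loses its second word, which is the danger the slide points out.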

Basic Automatic Indexing
Stem words
  – group word variants such as plurals via morphological processing
      computer, computers, computing, computed, computation, computerized, computerize, computerizable
  – can make mistakes but generally preferred
Optional
  – phrase indexing
  – thesaurus classes
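To make the stemming idea concrete, here is a deliberately naive suffix stripper. It is a toy, not the Porter stemmer or any production algorithm: the suffix list and the minimum stem length of 4 were picked by hand so that the slide's "computer" family collapses to one stem, and `toy_stem` is a made-up name.

```python
# Assumption: a hand-picked suffix list, tried longest first.
SUFFIXES = sorted(
    ["izable", "ation", "ized", "ize", "ing", "ers", "er", "ed", "s"],
    key=len,
    reverse=True,
)
MIN_STEM = 4  # assumption: never strip a word down to fewer than 4 letters

def toy_stem(word):
    """Repeatedly strip the longest matching suffix until none applies."""
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

variants = ["computer", "computers", "computing", "computed", "computation",
            "computerized", "computerize", "computerizable"]
stems = {toy_stem(v) for v in variants}
print(stems)

# Stemmers can make mistakes: this one conflates two unrelated words.
print(toy_stem("wander"), toy_stem("wand"))
```

All eight variants reduce to the single stem "comput", so a query on any one of them matches the others. The last line shows the cost: "wander" is wrongly merged with "wand".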

How do you rank results?
What does it mean for a document to be important/relevant?
  – Even human assessors do not agree with each other.
Word matching is imperfect. How do we decide which documents are most important?

How do you rank results?
How do we decide which documents are most important?
  – Count words
      high frequency words indicate document “aboutness”
  – Weight infrequent corpus words more strongly
      can be strong signifiers of meaning; easier to partition
  – Determine meaning by analyzing text surrounding a word
  – Give extra weight to title words, etc.
  – Make sense of references given, citations received, etc.
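The first two ideas above (count words in a document, and weight words that are rare in the corpus more strongly) are commonly combined in what is usually called tf-idf weighting. The slide does not name a formula, so the version below is an assumption: raw term frequency times log(N / document frequency), with whitespace tokenization. The function name and the sample documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf_scores(query, docs):
    """Score each document as the sum over query terms of tf * log(N / df)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        scores[doc_id] = sum(
            tf[term] * math.log(n_docs / df[term])
            for term in query.lower().split()
            if df[term] > 0  # terms absent from the corpus contribute nothing
        )
    return scores

# Hypothetical three-document collection.
docs = {
    "d1": "the bank raised the interest rate",
    "d2": "the river bank flooded",
    "d3": "the interest rate on the loan",
}
common_word_scores = tf_idf_scores("the", docs)
rare_word_scores = tf_idf_scores("bank loan", docs)
print(common_word_scores)
print(rare_word_scores)
```

A word that occurs in every document, such as "the", gets weight log(1) = 0 and so cannot separate documents, however often it appears. In the second query, "loan" occurs in only one document and therefore outweighs "bank", which occurs in two.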

Free Text Search Engines
Different engines use different ranking strategies (often a trade secret)
  – Word frequency
  – Placement in document
  – Popularity of document
  – Number of links to document
  – Business relationships, etc.

Announcement:
Writing assignment: due next Monday
  – Create 10 topics/queries and search on three popular Web search engines: Google.com, Yahoo.com and ask.com. Write a report to compare the three search engines and discuss why IR is so hard.
Next Lecture: Query languages. (Ch. 4)