Information Retrieval on the World Wide Web


Information Retrieval on the World Wide Web
Authors: Venkat N. Gudivada, Vijay V. Raghavan, William I. Grosky, Rajesh Kasanagottu
Presented by Rob von Behren

Roadmap
  Information Retrieval
  Implementation Issues and Techniques
  Analysis

Definitions
  Information retrieval - querying against a set of documents to find a subset of "relevant" documents
  Objective terms - external descriptions, not related to content
  Nonobjective terms (content terms) - descriptions of the informational content of the document
  Indexing exhaustivity - degree to which the index covers the document space
  Term specificity - how well a particular term limits search results
  Recall - relevant docs found / relevant docs in collection
  Precision - relevant docs found / total docs found
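
The recall and precision ratios defined above can be computed for a single query as in the sketch below. The document IDs and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: documents the system returned
    relevant:  documents in the collection that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs.
RETRIEVED = ["d1", "d2", "d3", "d4"]
RELEVANT = ["d2", "d4", "d7"]
PRECISION, RECALL = precision_recall(RETRIEVED, RELEVANT)
```

Here two of the four returned documents are relevant, and two of the three relevant documents were found, so precision and recall can move independently of each other.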

Key qualities
  Document and query representations
  Mechanisms for finding relevant documents and ranking the results
  Mechanisms for obtaining user feedback

Types of IR models
  Set Theoretic
  Algebraic
  Probabilistic
  Hybrid

Set Theoretic Models
  Boolean model - Simple Boolean queries regarding existence of terms within documents. Queries do not contain information about the context of the terms.
  Fuzzy set model - Slight expansion of Boolean. Allows results to include documents that meet most of the requirements of the Boolean search.
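
A minimal sketch of Boolean retrieval over an inverted index, assuming whitespace tokenization and a toy three-document collection. The `soft_and` function is only a crude stand-in for the "meets most of the requirements" idea; it is not the formal fuzzy set model, which uses graded membership values.

```python
from collections import defaultdict

# Hypothetical toy collection.
DOCS = {
    "d1": "web search engines index pages",
    "d2": "boolean search uses set operations",
    "d3": "engines rank pages by relevance",
}


def build_index(docs):
    """Map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Documents containing every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index, terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))


def soft_and(index, terms, min_matches):
    """Documents containing at least min_matches of the terms."""
    counts = defaultdict(int)
    for t in terms:
        for doc_id in index.get(t, set()):
            counts[doc_id] += 1
    return {d for d, c in counts.items() if c >= min_matches}


INDEX = build_index(DOCS)
```

With this collection, `boolean_and(INDEX, ["search", "engines"])` matches only `d1`, while `soft_and(INDEX, ["engines", "pages", "relevance"], 2)` also admits `d1`, which has two of the three terms.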

Algebraic models (vector-space model)
  Documents are represented by n-dimensional vectors
    Typically one dimension per term
    Also possible to treat signatures as bit vectors
  Queries are n-dimensional vectors
  Query relevance is the scalar product of the document with the query
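
A small sketch of the scalar-product ranking described above, assuming raw term counts as vector components and one dimension per vocabulary term. The documents, the query, and the helper names are illustrative only.

```python
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "d1": "web search engines index web pages",
    "d2": "boolean search uses set operations",
    "d3": "engines rank pages by relevance",
}

# One dimension per distinct term in the collection.
VOCABULARY = sorted({t for text in DOCS.values() for t in text.lower().split()})


def to_vector(text, vocabulary):
    """Represent text as a vector of raw term counts."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in vocabulary]


def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


QUERY = to_vector("web search", VOCABULARY)
SCORES = {d: dot(to_vector(text, VOCABULARY), QUERY) for d, text in DOCS.items()}
RANKING = sorted(SCORES, key=SCORES.get, reverse=True)
```

Real systems usually weight and normalize the components rather than using raw counts; the raw counts here just keep the arithmetic easy to follow.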

Probabilistic models
  Start with some user-supplied relevance information about a "training set" of documents
  Compute P(relevant | T) and P(non-relevant | T) based on the terms observed in the training set
  Useful for theoretical analysis, but probably not in practice (?)
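
One naive way to estimate these probabilities, assuming T is a single term and using plain relative frequencies over the judged documents. The slides do not say which estimator is meant, so this is an illustration rather than the presented method; the training data is invented.

```python
from collections import Counter

# Hypothetical training set: (terms in document, user judged it relevant?).
TRAINING = [
    ({"web", "search"}, True),
    ({"web", "crawler"}, True),
    ({"cooking", "search"}, False),
    ({"cooking", "recipes"}, False),
]


def term_relevance(training):
    """Estimate P(relevant | term) as a relative frequency."""
    seen = Counter()      # judged documents containing the term
    relevant = Counter()  # of those, how many were judged relevant
    for terms, is_relevant in training:
        for t in terms:
            seen[t] += 1
            if is_relevant:
                relevant[t] += 1
    return {t: relevant[t] / seen[t] for t in seen}


P_REL = term_relevance(TRAINING)
P_NONREL = {t: 1 - p for t, p in P_REL.items()}
```

With so few judgments the estimates hit 0 and 1 immediately, which hints at why a training set of useful size is hard to obtain in practice.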

Hybrid models (extended Boolean model)
  Represent documents as vectors
  Use the L-p norm to allow definition of Boolean operations on vectors
    p = 1 ==> the vector model
    p = infinity and terms are equally weighted ==> Boolean model
    Empirically best values: 2 <= p <= 5
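
The slides do not spell out the formulas. The sketch below uses the usual p-norm definitions of OR and AND, assuming equally weighted query terms and document term weights in [0, 1]. It shows the two limiting cases listed above: at p = 1 both operators collapse to an average of the weights, and for large p they approach max and min.

```python
def pnorm_or(weights, p):
    """p-norm OR: normalized distance from the all-zeros point."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1 / p)


def pnorm_and(weights, p):
    """p-norm AND: one minus normalized distance from the all-ones point."""
    n = len(weights)
    return 1 - (sum((1 - w) ** p for w in weights) / n) ** (1 / p)


# Hypothetical weights of two query terms in one document.
WEIGHTS = [0.2, 0.8]

OR_P1, AND_P1 = pnorm_or(WEIGHTS, 1), pnorm_and(WEIGHTS, 1)        # both about 0.5
OR_P2, AND_P2 = pnorm_or(WEIGHTS, 2), pnorm_and(WEIGHTS, 2)        # intermediate
OR_BIG, AND_BIG = pnorm_or(WEIGHTS, 200), pnorm_and(WEIGHTS, 200)  # near max, min
```

Intermediate values of p give a soft version of the Boolean operators: OR still rewards the strongest term and AND still penalizes the weakest, but neither ignores the other term entirely.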

User feedback
  Modify query representation (can be done by the user)
    modify term weights
    query expansion (add new terms)
    split the query
  Modify document representation
    change term weights within the database
    agent-based filtering
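
The slides name no specific algorithm. As one illustration of modifying term weights and expanding the query from feedback, the sketch below uses a Rocchio-style update, a technique swapped in here for the example. The vectors and the alpha, beta, gamma values are arbitrary assumptions.

```python
from collections import defaultdict


def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Move the query toward relevant documents and away from non-relevant ones.

    All arguments are sparse vectors given as {term: weight} dicts.
    """
    updated = defaultdict(float)
    for term, w in query.items():
        updated[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            updated[term] += beta * w / len(relevant)
    for doc in nonrelevant:
        for term, w in doc.items():
            updated[term] -= gamma * w / len(nonrelevant)
    # Keep only terms that end up with positive weight.
    return {t: w for t, w in updated.items() if w > 0}


# Hypothetical query and user-judged documents.
QUERY = {"web": 1.0}
RELEVANT = [{"web": 1.0, "crawler": 1.0}, {"web": 1.0, "search": 1.0}]
NONRELEVANT = [{"cooking": 1.0, "search": 1.0}]

NEW_QUERY = rocchio(QUERY, RELEVANT, NONRELEVANT)
```

The updated query gives "web" a higher weight and gains the new term "crawler", while "search" and "cooking" are cancelled out by the non-relevant document.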

Roadmap
  Information Retrieval
  Implementation Issues and Techniques
  Analysis

Web Crawling
  WWW is a directed graph
  Use your favorite graph traversal algorithm!!
  Netizenship issues
  starting points:
    individual page
    set of pages
    domain name searching (good because the web isn't necessarily connected)
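
A sketch of crawling as graph traversal, using breadth-first search over an in-memory link graph instead of real HTTP requests. The graph, the page names, and the `max_pages` limit are assumptions for illustration; a real crawler would also need the politeness measures that the netizenship point refers to.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
# Pages "x" and "y" are not reachable from "a".
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "x": ["y"],
    "y": [],
}


def crawl(seeds, links, max_pages=100):
    """Visit pages breadth-first from the seeds; return the visit order."""
    seen = set()
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    order = []
    while queue and len(order) < max_pages:
        page = queue.popleft()
        order.append(page)
        for target in links.get(page, []):
            if target not in seen:  # never enqueue a page twice
                seen.add(target)
                queue.append(target)
    return order


FROM_ONE_SEED = crawl(["a"], LINKS)
FROM_TWO_SEEDS = crawl(["a", "x"], LINKS)
```

Starting from "a" alone never reaches "x" or "y", which is why the choice of starting points matters when the graph is not connected.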

Automatic Indexing
  single term - Just look at the existence or non-existence of the term in the document
  phrase - Additionally store other information about the position of the term in the document, and the positions of other terms relative to it
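
The difference between the two kinds of index can be sketched with a positional index, assuming whitespace tokenization. The collection and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical toy collection.
DOCS = {
    "d1": "information retrieval on the web",
    "d2": "retrieval of web information",
}


def build_positional_index(docs):
    """Map term -> document -> list of positions where the term occurs."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def contains_all(index, terms):
    """Single-term view: documents where every term occurs somewhere."""
    if not terms or any(t not in index for t in terms):
        return set()
    return set.intersection(*(set(index[t]) for t in terms))


def phrase_match(index, phrase):
    """Phrase view: documents where the terms occur at consecutive positions."""
    terms = phrase.lower().split()
    hits = set()
    for doc_id in contains_all(index, terms):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits


INDEX = build_positional_index(DOCS)
```

Both documents contain "information" and "retrieval", but only `d1` contains them adjacent and in that order, so the phrase query is the stricter one.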

Automatic Indexing (cont)
  Statistical - Term weights depend on how well they differentiate between documents
  Information-theoretic - Signal to noise. Similar to some types of statistical indexing
  Probabilistic - Compute the importance of terms based on user feedback on a subset of the documents
  Linguistic - Use language syntax information such as part of speech
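
One common statistical weighting is tf-idf, used here only as an example of weights that reflect how well a term differentiates between documents; the slides do not name a particular scheme. The collection is a toy assumption, and the idf form log(N / df) is one of several variants.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "d1": "the web is a graph",
    "d2": "the crawler walks the graph",
    "d3": "the index stores terms",
}


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for terms in tokenized.values() for t in set(terms))
    weights = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


WEIGHTS = tf_idf(DOCS)
```

A term that appears in every document, such as "the", gets weight zero because it cannot distinguish one document from another, while "crawler", which appears in only one document, gets the highest weight in `d2`.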

Current Search Engines
  type 1: automatically indexed
  type 2: (partially) human indexed, hierarchically organized
  Common features
    allow Boolean searches
    do vector-like queries to find document relevance

Current Search Engines (cont)
  Type 1: AltaVista, Excite, HotBot, InfoSeek, Lycos, OpenText
  Type 2: Yahoo, Magellan, WWW Virtual Library, Galaxy

Roadmap
  Information Retrieval
  Implementation Issues and Techniques
  Analysis

Analysis
  disjunctive >= conjunctive >= phrase (DUH!)
  Flaws
    No tailoring of search to intent of query (by adding/excluding terms) or doing more complicated Boolean expressions
    No tailoring of search to specific capabilities of search engine (lowest common denominator)

Future Directions
  Use META tags to note content
  Add user feedback mechanisms
  Have small, specific databases, rather than monolithic databases
  Create common interfaces (federation of databases)
  Possibly allow better management of index content?