1
Information Retrieval on the World Wide Web
Authors: Venkat N. Gudivada, Vijay V. Raghavan, William I. Grosky, Rajesh Kasanagottu
Presented by Rob von Behren
2
Roadmap
Information Retrieval
Implementation Issues and Techniques
Analysis
3
Definitions
Information retrieval - querying against a set of documents to find a subset of "relevant" documents
Objective terms - external descriptions, not related to content
Nonobjective terms (content terms) - descriptions of the informational content of the document
Indexing exhaustivity - degree to which the index covers the document space
Term specificity - how well a particular term limits search results
Recall - relevant docs found / relevant docs in collection
Precision - relevant docs found / total docs found
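The recall and precision ratios above can be sketched in a few lines of Python. The function name and the document ids are made up for illustration; only the two ratios come from the slide.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents judged relevant in the whole collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant docs found
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 docs returned, 3 of them relevant, 6 relevant overall.
p, r = precision_recall(
    {"d1", "d2", "d3", "d9"},
    {"d1", "d2", "d3", "d4", "d5", "d6"},
)
print(p, r)  # 0.75 0.5
```

Returning more documents can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.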
4
Key qualities
Document and query representations
Mechanisms for finding relevant documents and ranking the results
Mechanisms for obtaining user feedback
5
Types of IR models
Set Theoretic
Algebraic
Probabilistic
Hybrid
6
Set Theoretic Models
Boolean model - Simple Boolean queries regarding existence of terms within documents. Queries do not contain information about the context of the terms.
Fuzzy set model - Slight expansion of Boolean. Allows results to include documents that meet most of the requirements of the Boolean search.
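A minimal sketch of the Boolean model, assuming a tiny made-up collection: each term maps to the set of documents that contain it, and AND, OR and NOT become set intersection, union and difference. The fuzzy set model is not covered here.

```python
# Hypothetical three-document collection.
docs = {
    "d1": "information retrieval on the web",
    "d2": "web crawling and indexing",
    "d3": "probabilistic retrieval models",
}

# Inverted index: term -> set of ids of the documents containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

all_docs = set(docs)


def postings(term):
    """Documents containing `term` (empty set if the term is unknown)."""
    return index.get(term, set())


both = postings("web") & postings("retrieval")    # web AND retrieval
either = postings("web") | postings("retrieval")  # web OR retrieval
not_web = all_docs - postings("web")              # NOT web
```

A document either matches or it does not; nothing in the result says which match is better, which is the gap the ranking models on the following slides address.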
7
Algebraic models (vector-space model)
Documents are represented by n-dimensional vectors
  Typically one dimension per term
  Also possible to treat signatures as bit vectors
Queries are n-dimensional vectors
Query relevance is the scalar product of the document with the query
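As a rough illustration of the scalar-product ranking, the sketch below uses raw term counts as vector components. The collection, the query and the choice of raw counts are assumptions; real systems normally use weighted and normalized vectors.

```python
def to_vector(text, vocabulary):
    """Term-count vector with one dimension per vocabulary term."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]


def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical collection and query.
docs = {
    "d1": "information retrieval on the web",
    "d2": "web crawling and indexing",
    "d3": "fuzzy set models",
}
vocabulary = sorted({w for text in docs.values() for w in text.lower().split()})

q = to_vector("web retrieval", vocabulary)
scores = {doc_id: dot(to_vector(text, vocabulary), q) for doc_id, text in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['d1', 'd2', 'd3']
```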
8
Probabilistic models
Start with some user-supplied relevance information about a “training set” of documents
Compute P(relevant | T) and P(non-relevant | T) based on the terms observed in the training set
Useful for theoretical analysis, but probably not in practice (?)
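One common way to turn a judged training set into term weights is a smoothed log-odds estimate, sketched below. The slide does not say which estimator is meant, so the formula, the 0.5 smoothing constant and the sample documents are all assumptions.

```python
import math


def term_weights(relevant_docs, non_relevant_docs):
    """Log-odds weight per term; each document is a set of terms."""
    R, S = len(relevant_docs), len(non_relevant_docs)
    terms = set().union(*relevant_docs, *non_relevant_docs)
    weights = {}
    for t in terms:
        r = sum(t in d for d in relevant_docs)      # relevant docs containing t
        s = sum(t in d for d in non_relevant_docs)  # non-relevant docs containing t
        p = (r + 0.5) / (R + 1.0)  # smoothed P(t present | relevant)
        q = (s + 0.5) / (S + 1.0)  # smoothed P(t present | non-relevant)
        weights[t] = math.log(p * (1 - q) / (q * (1 - p)))
    return weights


# Hypothetical user judgments on four documents.
relevant = [{"web", "search"}, {"web", "crawler"}]
non_relevant = [{"cooking", "search"}, {"cooking", "recipes"}]
weights = term_weights(relevant, non_relevant)
```

Terms seen mostly in relevant documents ("web") get a positive weight, terms seen equally often on both sides ("search") get a weight near zero, and terms seen mostly in non-relevant documents ("cooking") get a negative weight.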
9
Hybrid models (extended Boolean model)
Represent documents as vectors
Use the L-p norm to allow definition of Boolean operations on vectors
  p=1 ==> the vector model
  p=infinity & terms are equally weighted ==> Boolean model
Empirically best values: 2 <= p <= 5
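The slide does not give the formula, so the sketch below assumes the commonly cited p-norm form with equally weighted query terms: OR is a generalized mean of the document's term weights, and AND is one minus the generalized mean of their complements. Function names and sample weights are illustrative.

```python
def p_norm_or(weights, p):
    """Similarity of a document to (t1 OR t2 OR ...), term weights in [0, 1]."""
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)


def p_norm_and(weights, p):
    """Similarity of a document to (t1 AND t2 AND ...), term weights in [0, 1]."""
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)


# Hypothetical document: strong match on the first term, weak on the second.
doc = [0.9, 0.1]
for p in (1, 2, 5, 50):
    print(p, round(p_norm_or(doc, p), 3), round(p_norm_and(doc, p), 3))
```

At p=1 both operators reduce to the plain average of the weights, so AND and OR are indistinguishable, as in the vector model. As p grows, OR approaches the largest weight and AND approaches the smallest, which is the strict Boolean behavior.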
10
User feedback
Modify query representation (can be done by the user)
  modify term weights
  query expansion (add new terms)
  split the query
Modify document representation
  change term weights within the database
Agent-based filtering
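Query reweighting and query expansion are often illustrated with Rocchio-style feedback, sketched below. The slides do not name this method; it is swapped in here as one well-known example, and the alpha, beta and gamma values and the sample data are assumptions.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted, expanded query; all inputs are {term: weight} dicts."""
    terms = set(query)
    for d in relevant + non_relevant:
        terms |= set(d)

    new_query = {}
    for t in terms:
        pos = sum(d.get(t, 0.0) for d in relevant) / len(relevant) if relevant else 0.0
        neg = sum(d.get(t, 0.0) for d in non_relevant) / len(non_relevant) if non_relevant else 0.0
        w = alpha * query.get(t, 0.0) + beta * pos - gamma * neg
        if w > 0:  # drop terms pushed to zero or below
            new_query[t] = w
    return new_query


# Hypothetical feedback: the user wanted the animal, not the car.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]
new_query = rocchio(query, relevant, non_relevant)
```

The new query gains "cat" (query expansion), raises the weight of "jaguar" (term reweighting) and never picks up "car".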
11
Roadmap
Information Retrieval
Implementation Issues and Techniques
Analysis
12
Web Crawling
WWW is a directed graph
Use your favorite graph traversal algorithm!!
Netizenship issues
Starting points:
  individual page
  set of pages
  domain name searching (good because the web isn't necessarily connected)
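A breadth-first traversal is one such algorithm. The sketch below runs over an in-memory link graph rather than the network; the `get_links` callback, the `max_pages` cap (a crude stand-in for polite crawling) and the example hosts are all hypothetical.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl from one or more starting points.

    get_links(url) returns the outgoing links of a page.
    """
    visited = []
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:  # never queue a page twice
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical web: c.example is not reachable from a.example.
web = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": [],
    "c.example/": ["c.example/y"],
    "c.example/y": [],
}


def get_links(url):
    return web.get(url, [])


from_one = crawl(["a.example/"], get_links)
from_two = crawl(["a.example/", "c.example/"], get_links)
```

Starting from a single page misses everything under c.example, while adding a second seed reaches it. That is the reason for seeding from a set of pages or from domain names when the graph is not connected.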
13
Automatic Indexing
Single term - Just look at the existence or non-existence of the term in the document
Phrase - Additionally store other information about the position of the term in the document, and the positions of other terms relative to it
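The difference between the two index types can be sketched as follows: a single-term index records only which documents contain a term, while a positional index also records where, which is what a phrase query needs. The data and helper names are illustrative.

```python
# Hypothetical two-document collection.
docs = {
    "d1": "information retrieval on the world wide web",
    "d2": "the web of information and retrieval systems",
}

term_index = {}        # term -> {doc ids}
positional_index = {}  # term -> {doc id -> [word positions]}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        term_index.setdefault(term, set()).add(doc_id)
        positional_index.setdefault(term, {}).setdefault(doc_id, []).append(pos)


def phrase_match(first, second):
    """Documents where `second` occurs immediately after `first`."""
    hits = set()
    for doc_id, positions in positional_index.get(first, {}).items():
        following = positional_index.get(second, {}).get(doc_id, [])
        if any(p + 1 in following for p in positions):
            hits.add(doc_id)
    return hits


conjunctive = term_index["information"] & term_index["retrieval"]
phrase = phrase_match("information", "retrieval")
```

Both documents contain both words, so the conjunctive query returns d1 and d2, but only d1 contains them as an adjacent phrase.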
14
Automatic Indexing (cont)
Statistical - Term weights depend on how well they differentiate between documents
Information-theoretic - Signal to noise. Similar to some types of statistical indexing
Probabilistic - Compute the importance of terms based on user feedback on a subset of the documents
Linguistic - Use language syntax information such as part of speech
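Inverse document frequency (IDF) is one widely used statistical weight of this kind. The slides do not name a specific scheme, so IDF is an assumption here, as is the sample collection.

```python
import math

# Hypothetical three-document collection.
docs = {
    "d1": "the web is a directed graph",
    "d2": "the crawler follows the links",
    "d3": "the index stores term weights",
}


def idf(term):
    """log(N / df): rarer terms discriminate better and get higher weights."""
    df = sum(1 for text in docs.values() if term in text.split())
    return math.log(len(docs) / df) if df else 0.0


for term in ("the", "graph"):
    print(term, round(idf(term), 3))
```

"the" appears in every document and gets weight 0, because it cannot distinguish one document from another; "graph" appears in only one and gets the highest weight in this collection.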
15
Current Search Engines
Type 1: automatically indexed
Type 2: (partially) human indexed, hierarchically organized
Common features
  allow Boolean searches
  do vector-like queries to find document relevance
16
Current Search Engines (cont)
Type 1: AltaVista, Excite, HotBot, InfoSeek, Lycos, OpenText
Type 2: Yahoo, Magellan, WWW Virtual Library, Galaxy
17
Roadmap
Information Retrieval
Implementation Issues and Techniques
Analysis
18
Analysis
disjunctive >= conjunctive >= phrase (DUH!)
Flaws
  No tailoring of search to intent of query (by adding/excluding terms) or doing more complicated Boolean expressions.
  No tailoring of search to specific capabilities of search engine (lowest common denominator)
19
Future Directions
Use META tags to note content
Add user feedback mechanisms
Have small, specific databases, rather than monolithic databases
Create common interfaces (federation of databases)
Possibly allow better management of index content?