Download presentation
Presentation is loading. Please wait.
Published byConstance Lynette Walton Modified over 9 years ago
1
IR: representation, storage, organization of, and access to information items Focus is on the user information need User information need: Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament. Emphasis is on the retrieval of information (not data)
2
Motivation ► IR at the center of the stage IR in the last 20 years: ► classification and categorization ► systems and languages ► user interfaces and visualization Still, area was seen as of narrow interest Advent of the Web changed this perception once and for all ► universal repository of knowledge ► free (low cost) universal access ► no central editorial board ► many problems though: IR seen as key to finding the solutions!
3
Motivation ► Data retrieval which docs contain a set of keywords? Well defined semantics a single erroneous object implies failure! ► Information retrieval information about a subject or topic semantics is frequently loose small errors are tolerated ► IR system: interpret contents of information items generate a ranking which reflects relevance notion of relevance is most important
4
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
5
Salton-1989: ” Information-retrieval systems process files of records and requests for information, and identify and retrieve from files certain records is response to the information requests. the retrieval depends on similarity between records and queries. Kowalski-1997: ” An information retrieval systems is a systems that is capable of storage, retrieval, and maintenance of information.
6
IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information. IR involves helping users find information that matches their information needs. System- centered View User- centered
7
IR systems contain three components: System People Documents (information items) User System Documents
8
Basic Concepts ► The User Task Retrieval ► information or data Browsing ( main objectives are not clearly defined and might change during the interaction ) ► Task of Hypertest Modern Digital libraries and web interfaces - try to combine retreival + browsing ► WWW(retrieval & browsing) either Pulling or Pushing Retrieval Browsing Database
9
Basic Concepts ► Logical view of the documents Text operations or transformation Reduce the complexity of the document representation and allow moving the logical view from that of full text to that of a set of index terms ► Several intermediate logical views (of a document) might be adopted structure Accents spacing stopwords Noun groups stemming Manual indexing Docs structureFull textIndex terms
10
► “ Stop Words ” Certain words are considered irrelevant and not placed in the bag(e.g., the, HTML Tage like ) “ Stemming ” and other cotent analysis Stimming:Reduce terms to their roots before indexing Use English-specific rules, convert word to their basic form example( surfing, surfed surf Identification of noun groups:eliminates adjectives, adverbs, and verbs ► logical view of docs might shift
11
Processor Document Input Feedback Output Queries
12
User Interface Text Operations Query Operations Indexing Searching Ranking Index Text query user need user feedback ranked docs retrieved docs logical view inverted file DB Manager Module 4, 10 6, 7 58 2 8 Text Database Text The Retrieval Process
13
Text Operation: forms index words (token) Tokenization-Stopwords removal Stemming Indexing: construct an inverted index of words to document pointers. Mapping from keyword to document ids Searching: retrieves documents that contain a given query token from the inverted index Ranking: scores all retrieved documents according to a relevance metric A ranking is based on fundamental premises regarding the notion of relevance such as: Common set of index terms Sharing of weight terms Likelihood of relevance Each set of premises leads to distinct IR Model
14
User Interface: Manages interaction with user: Query input and document output Visualization of result Relevance feedback Query operation: transform the query to improve retrieval: Query expansion using a thesaurus.chpt2 Query transformation using relevance feedback chpt2
15
Simplest notion of relevance is that the query string appears verbatim (order is important) in the document Slightly less strict notion is that the word in the query appears frequently in the document, in any order (bag of words) May not retrieve relevant documents that include synonymous terms Restaurant vs. caf é May retrieve irrelevant document that include ambiguous terms Apple (company or fruit) Bit (unit of data or act of eating)
16
Data String of symbols associated with objects, people, and events Values of an attribute Data need not have meaning to everyone Data must be interpreted with associated attributes.
17
Information The meaning of the data interpreted by a person or a system Data that changes the state of a person or system that perceives it. Data that reduces uncertainty. if data contain no uncertainty, there are no information with the data. Examples: It snows in the winter. It does not snow this winter.
18
knowledge Structured information through structuring, information becomes understandable Processed Information through processing, information becomes meaningful and useful information shared and agreed upon within a community Data information knowledge
19
Strings of ASCII symbols or Unicode structured by the author indexed by information service providers Representation of natural languages people use To convey meanings To communicate between readers and authors. Data or information? If it can be understood, it ’ s information. by Whom? A person or a system?
20
Logical unit of text articles, books, links, web pages Other components that come with the text figures, charts, graphics multimedia
21
Repository of human intellectuals Rich and diverse resources for all answers. If it is written, it is there (in text) Meaningful and understandable (to users). Simple ASCII representation Free of pre-formatted structures continuous separated into documents Easy to process by the computer Machine Intensive (not labor intensive)
22
Massive Any IR system needs the capability of large scale data processing. Use of indexes and various representations are required. Inconsistent It’s a human language Syntactical and semantic variances Same information expressed in different ways. Different information expressed in similar ways. Incomplete It uses common knowledge. It’s an open system.
23
Retrieval What do we retrieve? Data Information Knowledge We retrieve documents that contains text which carries information. Information can be anywhere in the text, in the links, in the process of text.
24
Are they the same? Text retrieval Document retrieval Information retrieval
25
Conceptually, information retrieval is used to cover all related problems in finding needed information Historically, information retrieval is about document retrieval, emphasizing document as the basic unit Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.
26
Data retrieval Information retrieval ContentData Information Data objectTable Document MatchingExact match Partial match, best match Items wantedMatching Relevant Query languageSQL(artificial) Natural Query specificationComplete Incomplete ModelDeterministic Probabilistic Highly structured less structure
27
The goal of IR systems is to help users find information that satisfies their information needs. The main process of IR systems is to match data abstracted from the real world to queries abstracted from user ’ s information needs. Information retrieval is much more difficult than data retrieval.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.