Presentation is loading. Please wait.

Presentation is loading. Please wait.

 IR: representation, storage, organization of, and access to information items  Focus is on the user information need  User information need:  Find.

Similar presentations


Presentation on theme: " IR: representation, storage, organization of, and access to information items  Focus is on the user information need  User information need:  Find."— Presentation transcript:

1  IR: representation, storage, organization of, and access to information items  Focus is on the user information need  User information need:  Find all docs containing information on college tennis teams which: (1) are maintained by a USA university and (2) participate in the NCAA tournament.  Emphasis is on the retrieval of information (not data)

2 Motivation ► IR at the center of the stage  IR in the last 20 years: ► classification and categorization ► systems and languages ► user interfaces and visualization  Still, area was seen as of narrow interest  Advent of the Web changed this perception once and for all ► universal repository of knowledge ► free (low cost) universal access ► no central editorial board ► many problems though: IR seen as key to finding the solutions!

3 Motivation ► Data retrieval  which docs contain a set of keywords?  Well defined semantics  a single erroneous object implies failure! ► Information retrieval  information about a subject or topic  semantics is frequently loose  small errors are tolerated ► IR system:  interpret contents of information items  generate a ranking which reflects relevance  notion of relevance is most important

4 Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).

5  Salton-1989: ” Information-retrieval systems process files of records and requests for information, and identify and retrieve from files certain records is response to the information requests. the retrieval depends on similarity between records and queries.  Kowalski-1997: ” An information retrieval systems is a systems that is capable of storage, retrieval, and maintenance of information.

6  IR is a branch of applied computer science focusing on the representation, storage, organization, access, and distribution of information.  IR involves helping users find information that matches their information needs. System- centered View User- centered

7  IR systems contain three components:  System  People  Documents (information items) User System Documents

8 Basic Concepts ► The User Task  Retrieval ► information or data  Browsing ( main objectives are not clearly defined and might change during the interaction ) ► Task of Hypertest  Modern Digital libraries and web interfaces - try to combine retreival + browsing ► WWW(retrieval & browsing) either Pulling or Pushing Retrieval Browsing Database

9 Basic Concepts ► Logical view of the documents Text operations or transformation Reduce the complexity of the document representation and allow moving the logical view from that of full text to that of a set of index terms ► Several intermediate logical views (of a document) might be adopted structure Accents spacing stopwords Noun groups stemming Manual indexing Docs structureFull textIndex terms

10 ► “ Stop Words ”  Certain words are considered irrelevant and not placed in the bag(e.g., the, HTML Tage like )  “ Stemming ” and other cotent analysis  Stimming:Reduce terms to their roots before indexing  Use English-specific rules, convert word to their basic form example( surfing, surfed  surf Identification of noun groups:eliminates adjectives, adverbs, and verbs ► logical view of docs might shift

11 Processor Document Input Feedback Output Queries

12 User Interface Text Operations Query Operations Indexing Searching Ranking Index Text query user need user feedback ranked docs retrieved docs logical view inverted file DB Manager Module 4, 10 6, 7 58 2 8 Text Database Text The Retrieval Process

13  Text Operation: forms index words (token)  Tokenization-Stopwords removal  Stemming  Indexing: construct an inverted index of words to document pointers.  Mapping from keyword to document ids  Searching: retrieves documents that contain a given query token from the inverted index  Ranking: scores all retrieved documents according to a relevance metric  A ranking is based on fundamental premises regarding the notion of relevance such as:  Common set of index terms  Sharing of weight terms  Likelihood of relevance  Each set of premises leads to distinct IR Model

14  User Interface: Manages interaction with user:  Query input and document output  Visualization of result  Relevance feedback  Query operation: transform the query to improve retrieval:  Query expansion using a thesaurus.chpt2  Query transformation using relevance feedback chpt2

15  Simplest notion of relevance is that the query string appears verbatim (order is important) in the document  Slightly less strict notion is that the word in the query appears frequently in the document, in any order (bag of words)  May not retrieve relevant documents that include synonymous terms  Restaurant vs. caf é  May retrieve irrelevant document that include ambiguous terms  Apple (company or fruit)  Bit (unit of data or act of eating)

16  Data  String of symbols associated with objects, people, and events  Values of an attribute  Data need not have meaning to everyone  Data must be interpreted with associated attributes.

17  Information  The meaning of the data interpreted by a person or a system  Data that changes the state of a person or system that perceives it.  Data that reduces uncertainty.  if data contain no uncertainty, there are no information with the data.  Examples: It snows in the winter. It does not snow this winter.

18  knowledge  Structured information  through structuring, information becomes understandable  Processed Information  through processing, information becomes meaningful and useful  information shared and agreed upon within a community Data information knowledge

19  Strings of ASCII symbols or Unicode  structured by the author  indexed by information service providers  Representation of natural languages people use  To convey meanings  To communicate between readers and authors.  Data or information?  If it can be understood, it ’ s information.  by Whom? A person or a system?

20  Logical unit of text  articles, books,  links, web pages  Other components that come with the text  figures, charts, graphics  multimedia

21  Repository of human intellectuals  Rich and diverse resources for all answers.  If it is written, it is there (in text)  Meaningful and understandable (to users).  Simple ASCII representation  Free of pre-formatted structures  continuous  separated into documents  Easy to process by the computer  Machine Intensive (not labor intensive)

22  Massive  Any IR system needs the capability of large scale data processing.  Use of indexes and various representations are required.  Inconsistent  It’s a human language  Syntactical and semantic variances  Same information expressed in different ways.  Different information expressed in similar ways.  Incomplete  It uses common knowledge.  It’s an open system.

23  Retrieval  What do we retrieve?  Data  Information  Knowledge  We retrieve documents that contains text which carries information.  Information can be anywhere  in the text, in the links, in the process of text.

24  Are they the same?  Text retrieval  Document retrieval  Information retrieval

25  Conceptually, information retrieval is used to cover all related problems in finding needed information  Historically, information retrieval is about document retrieval, emphasizing document as the basic unit  Technically, information retrieval refers to (text) string manipulation, indexing, matching, querying, etc.

26 Data retrieval Information retrieval ContentData Information Data objectTable Document MatchingExact match Partial match, best match Items wantedMatching Relevant Query languageSQL(artificial) Natural Query specificationComplete Incomplete ModelDeterministic Probabilistic Highly structured less structure

27  The goal of IR systems is to help users find information that satisfies their information needs.  The main process of IR systems is to match data abstracted from the real world to queries abstracted from user ’ s information needs.  Information retrieval is much more difficult than data retrieval.


Download ppt " IR: representation, storage, organization of, and access to information items  Focus is on the user information need  User information need:  Find."

Similar presentations


Ads by Google