Basics of Information Retrieval. W. Arms, Digital Libraries, 1999 manuscript, as background reading.

Information discovery  Searching vs browsing  When do you use one over the other?  Do we need both?  Is one a special case of the other?  Types of information seeking  Comprehensive search  Known (specific) item  Facts  Introduction or overview  Related information

Item descriptions  Metadata  Catalogs  Library catalog records are time-consuming to produce. They contain more than just easily available information about the item.  Services produce the catalog records and distribute them to libraries.  OCLC: Online Computer Library Center  Abstracting and indexing services  Alternative to catalog, more detailed description  Specific to a discipline  Automating the process is a subject of research and experiment

Paper topic suggestion  What is the state of the art in automatic indexing and abstracting? In what fields is this most fully researched? Who is leading the efforts?  What has been accomplished in automatic e-mail indexing and summarizing?

Controlled Vocabularies and Ontologies  Effective description of materials requires unambiguous descriptive terms  Natural language is inherently ambiguous  Controlled vocabularies force use of a restricted set of terms  ACM CCS  Regularly updated, difficult to use  Computing Ontology

Dublin Core  Standard set of metadata fields for entries in digital libraries:  Title, creator, subject, description, publisher, contributor, date, type, format, identifier, source, language, relation, coverage, rights

Dublin Core elements:
 Title
 Creator - entity primarily responsible for making the content of the resource
 Subject - C
 Description
 Publisher - entity making the resource available
 Contributor - contributor to the content of the resource
 Date - YYYY-MM-DD
 Type - C; ex: collection, dataset, event, image
 Format - C; what is needed to display or operate the resource
 Identifier - unambiguous ID
 Source - resource from which this one was derived
 Language - standards RFC 3066, ISO 639
 Relation - reference to a related resource
 Coverage - C; space, time, jurisdiction
 Rights - rights management information
C = controlled vocabulary recommended.

Metadata  What does metadata look like?  Metadata is data about data  Information about a resource, encoded in the resource or associated with the resource.  The language of metadata: XML  eXtensible Markup Language

XML  XML is a markup language  XML describes features  There is no standard XML  Use XML to create a resource type  Separately develop software to interact with the data described by the XML codes. Source: tutorial at w3school.com

XML rules  Easy rules, but very strict  The first line declares the XML version and the character set used  The rest is user-defined tags  Every tag has an opening and a closing

Element naming  XML elements must follow these naming rules:  Names can contain letters, numbers, and other characters  Names must not start with a number or punctuation character  Names must not start with the letters xml (or XML, Xml, etc.)  Names cannot contain spaces

Elements and attributes  Use elements to describe data  Use attributes to present information that is not part of the data  For example, the file type or some other information that would be useful in processing the data, but is not part of the data.
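As a sketch of the distinction, with hypothetical element and attribute names (not taken from the slides):

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- The element content (1, sugar) is the data itself; the "unit" and
     "type" attributes carry processing information about that data. -->
<ingredient type="dry">
  <amount unit="cup">1</amount>
  <name>sugar</name>
</ingredient>
```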

Repeating elements  Naming an element means it appears exactly once  Name+ means it appears one or more times  Name* means it appears zero or more times  Name? means it appears zero or one time

Using XML - an example  Define the fields of a recipe collection. ISO 8859 is a character set.

Processing the XML data  How do we know what to do with the information in an XML file?  Document Type Definition (DTD)  Put in the same file as the data -- immediate reference  Put a reference to an external description  Provides the definition of the legitimate content for each element

Document Type Definition  <!DOCTYPE recipe [ … ]>  The element declarations go between the brackets; a "*" after an element name means it repeats 0 or more times.
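The element declarations were dropped from the slide; a hedged reconstruction of what the internal subset of a recipe DTD might look like, using illustrative element names:

```xml
<!DOCTYPE recipe [
  <!ELEMENT recipe (title, ingredient*, directions)>
  <!-- "ingredient*" repeats 0 or more times; "+" would mean 1 or more -->
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT ingredient (amount, name)>
  <!ELEMENT amount (#PCDATA)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT directions (#PCDATA)>
]>
```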

Meringue cookies  3 egg whites  1 cup sugar  1 teaspoon vanilla  2 cups mini chocolate chips  Beat the egg whites until stiff. Stir in sugar, then vanilla. Gently fold in chocolate chips. Place in warm oven at 200 degrees for an hour. Alternatively, place in an oven at 350 degrees. Turn oven off and leave overnight.  Not the way that I want to see a recipe in a magazine! What could we do with a large collection of such entries? How would we get the information entered into a collection?  (Slide annotation: external reference to DTD)
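One plausible tagging of this recipe. The element names and the "recipe.dtd" filename are hypothetical; the DOCTYPE line shows an external reference to a DTD:

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE recipe SYSTEM "recipe.dtd">
<recipe>
  <title>Meringue cookies</title>
  <ingredient><amount>3</amount><name>egg whites</name></ingredient>
  <ingredient><amount>1 cup</amount><name>sugar</name></ingredient>
  <ingredient><amount>1 teaspoon</amount><name>vanilla</name></ingredient>
  <ingredient><amount>2 cups</amount><name>mini chocolate chips</name></ingredient>
  <directions>Beat the egg whites until stiff. Stir in sugar, then vanilla.
  Gently fold in chocolate chips. Place in warm oven at 200 degrees for an
  hour. Alternatively, place in an oven at 350 degrees. Turn oven off and
  leave overnight.</directions>
</recipe>
```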

XML exercise  Design an XML schema for an application of your choice. Keep it simple.  Examples -- address book, TV program listing, DVD collection, …

Another example  A paper with content encoded in XML.  The first few lines give the title, "Standards E-learning and their possible support for a rich pedagogic approach in an 'Integrated Learning' context", and an author, Rodolophe Borer.  The DTD, "ePBLpaper11.dtd", is shown on the next slide.  This paper is no longer available online.

%foreign-dtd; -- a parameter entity reference that pulls in the external DTD.

Vocabulary  Given the need for processing, do you want free text or restricted entries?  Free text gives more flexibility for the person making the entry  Controlled vocabulary helps with  Consistent processing  Comparison between entries  Controlled vocabulary limits  Options for what is said

Vocabulary example  Recipe example  What text should be controlled?  What should be free text?  Ingredients  Ingredient-amount  Ingredient-name  Should we revise how we coded ingredient amount?  Directions

A DSpace example  CITIDEL:

IEEE LOM  Example of a specialized metadata scheme  Learning Object Metadata  Specifically for collections of educational materials  Includes all of Dublin Core  See the IEEE LOM specification.

Information Retrieval  Until now, information description  Now, how to match the information need to the resources available  Query - expresses the information need  Composed of individual words or symbols called search terms  Search types  Full text search  Compare search terms to every word in the text  Fielded search  Match the search terms to the relevant parts of the text

Information retrieval techniques  Eliminate stop words  Words that do not contribute to identifying useful resources  Typically: articles (a, an, the), prepositions (in, of, with, to, on, …), conjunctions (and, or), pronouns (he, she, it, they, them, …), auxiliary verbs or verb parts (to, be, was, …)  Making the stop list is not trivial.  Arms gives the example of the query "to be or not to be", composed entirely of words usually considered stop words.

IR techniques - 2  Inverted files  List the words in the whole document collection and append a pointer to each place the word appears  Extract all the words  Alphabetize  "Tokenize": split the text into individual terms, stripping punctuation, etc.  Stemming: reduce each word to its basic stem  Index: link each word to its location in the document  Example - handout and exercise
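The steps above can be sketched in Python; the stop list, the suffix-stripping stemmer, and the exact index layout are simplifying assumptions for illustration, not the method from the slides:

```python
import re
from collections import defaultdict

# Illustrative stop list; building a real one is not trivial (see above).
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "are", "and", "or", "to"}

def tokenize(text):
    """Lowercase the text and split on non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def stem(word):
    """Very crude stemming: strip a few common suffixes (illustration only)."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_index(documents):
    """Map each stemmed term to {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(documents):
        for position, token in enumerate(tokenize(text)):
            if token not in STOP_WORDS:
                index[stem(token)][doc_id].append(position)
    return index

docs = ["Computing majors are needed.", "Is computer science relevant?"]
index = build_inverted_index(docs)
# "computing" and "computer" both stem to "comput", so a single index
# entry points into both documents.
```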

Inverted file exercise  Given the short documents,  Each team takes a document and produces an alphabetical list of all the words in the document.  Make a stop list (what words will you put on it?)  Reduce each word to its stem. (For computer, computing, etc. use “compute” as the stem.)  List the location of each word by counting the word position in the document.  Make your list show the word, then the document number, then the number of times the word occurs in that document and the locations in which the word appears.

Using our inverted file  Search for Computer Science and Election  How well does our inverted file serve our purpose?

Search result evaluation  Basic search question -- Is this word (or are these words) in the document?  Answer -- yes or no.  Is that good enough? Are all yes responses the same?  Boolean search --  Using the inverted list, we can find which terms are in the document and also the relative positions of the words.  Basic Boolean search is for an exact match.  Tokenizing and stemming improve performance, but more is possible.
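Exact-match Boolean AND over an inverted file is typically computed by intersecting the sorted posting lists of the query terms; a sketch with hypothetical postings:

```python
def intersect(postings_a, postings_b):
    """Merge-intersect two sorted lists of document IDs."""
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# Hypothetical inverted file: term -> sorted IDs of documents containing it.
postings = {"computer": [1, 3, 5, 8], "science": [2, 3, 5, 9], "election": [4, 5]}
print(intersect(postings["computer"], postings["science"]))  # [3, 5]
```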

The Vector model  Simplest version -- Boolean vectors  Vector for a document  Each position represents a word  There is a 1 if the document contains the word and a 0 if it does not  Vector for a query  Each position is the same as for the document  There is a 1 wherever the word corresponds to a term in the query and a 0 everywhere else.

Example Boolean vector  Consider the “documents”  (1) Is computer science relevant?  (2) Computing majors are needed.  Index terms:  compute major need science relevant  Document vectors:  (1) { 1 0 0 1 1 }  (2) { 1 1 1 0 0 }  Consider a query: Computing relevance  Query vector { 1 0 0 0 1 }

Compare document and query vectors  Document vectors:  (1) { 1 0 0 1 1 }  (2) { 1 1 1 0 0 }  Consider a query: Computing relevance  Query vector { 1 0 0 0 1 }  Doc 1 & Query: { 1 0 0 0 1 }  Doc 2 & Query: { 1 0 0 0 0 }  Note that this tells us that document 1 contains exactly the terms of the query, but does not tell us how many occurrences there are, or the relative positions of the terms. If we had many documents and computed the same vectors for several, how would we decide which is best? Can we rank the results? Does the notion of a vector make sense?
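The comparison can be sketched as an element-wise AND of Boolean vectors; the 0/1 values below are derived from the two example documents and the index terms compute, major, need, science, relevant:

```python
terms = ["compute", "major", "need", "science", "relevant"]
doc1  = [1, 0, 0, 1, 1]   # "Is computer science relevant?"
doc2  = [1, 1, 1, 0, 0]   # "Computing majors are needed."
query = [1, 0, 0, 0, 1]   # "Computing relevance"

def boolean_and(doc, query):
    """Element-wise AND of a document vector and a query vector."""
    return [d & q for d, q in zip(doc, query)]

print(boolean_and(doc1, query))  # [1, 0, 0, 0, 1] -- every query term matched
print(boolean_and(doc2, query))  # [1, 0, 0, 0, 0] -- only "compute" matched
```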

The vector model  Let  k_i be an index term (keyword)  N be the total number of documents in the collection  n_i be the number of documents that contain k_i  freq(i,j) be the raw frequency of k_i within document d_j  A normalized tf (term frequency) factor is given by  tf(i,j) = freq(i,j) / max_l freq(l,j)  where the maximum is computed over all terms l that occur within the document d_j  The idf (inverse document frequency) factor is computed as  idf_i = log (N/n_i)  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.

Consider these terms  tf(i,j) = freq(i,j) / max_l freq(l,j)  idf_i = log (N/n_i)

Vector Model - 2  These expressions allow us to give weights to terms within documents.  tf: term frequency, quantifies intra-document occurrence (also called term density in a document)  idf: inverse document frequency, quantifies inter-document differentiation. If a word is common to nearly all the documents in the collection, it will not be very useful in finding good matches to a query.  The weight assigned to word i in document j is  w(i,j) = tf(i,j) * idf_i  This is called the tf-idf weighting scheme  This method generally does as well as any other ranking scheme and has the advantages of simplicity and computational efficiency.  Weights may also be assigned to words in the query.
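A minimal sketch of the tf-idf scheme as defined above; the toy tokenized documents are invented for illustration:

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """documents: list of token lists. Returns a {term: weight} dict per
    document, using tf(i,j) = freq(i,j) / max_l freq(l,j) and
    idf_i = log(N / n_i)."""
    N = len(documents)
    doc_freq = Counter()              # n_i: number of docs containing term i
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        freq = Counter(tokens)
        max_freq = max(freq.values())  # max_l freq(l,j) within this document
        weights.append({term: (count / max_freq) * math.log(N / doc_freq[term])
                        for term, count in freq.items()})
    return weights

docs = [["compute", "science", "relevant"],
        ["compute", "major", "need"],
        ["major", "need", "need"]]
w = tf_idf_weights(docs)
# "compute" appears in 2 of 3 documents, so its idf is log(3/2);
# a term occurring in every document would get idf = log(1) = 0.
```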

Some examples  The following examples come from slides provided by the authors of the textbook:  Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto  Addison Wesley Longman Publishing Company  All the slides that go with the book are available from the book's companion site.

The Vector Model: Example I  [Slide table omitted: rows for keywords k1, k2, k3, columns for documents d1 through d7, plus a column of query-match scores.]  Here, all the keywords are equally weighted in the documents and in the query; the score column tells us how well each document matches the query.

The Vector Model: Example II  [Slide table omitted: keywords k1, k2, k3 against documents d1 through d7.]  Query terms are weighted, but the documents are not.

The Vector Model: Example III  [Slide table omitted: keywords k1, k2, k3 against documents d1 through d7.]  Document and query terms are weighted. Compare the three results for document recommendations.

Evaluating IR results  Precision  Of the results returned, what percentage were relevant?  Recall  Of the relevant documents available in the collection, what percentage were returned?
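The two measures as a small sketch; the returned and relevant sets are hypothetical:

```python
def precision_recall(returned, relevant):
    """Precision and recall of a returned result set against the set of
    documents actually relevant to the query."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: 4 documents returned, 3 of them relevant, out of
# 6 relevant documents in the whole collection.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7})
print(p, r)  # 0.75 0.5
```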

This session  Talked about the way that content is described.  Looked at how a document is indexed  Looked at how a query is matched to a document  Looked at the value of weighting the occurrence of words in a document  Some specific things: Dublin Core, XML, Boolean and Vector Space Modeling of Information Retrieval.