Basics of Information Retrieval Lillian N. Cassel Some of these slides are taken or adapted from Source: http://www.stanford.edu/class/cs276/cs276-2006-syllabus.html

Basic ideas  Information overload  The challenging byproduct of the information age  Huge amounts of information available -- how to find what you need when you need it  Think about addresses, e-mail messages, files of interesting articles, etc.  Information retrieval is the formal study of efficient and effective ways to extract the right bit of information from a collection.  The web is a special case, as we will discuss.

Some distinctions  Data, information, knowledge  How do you distinguish among them?  http://www.systems-thinking.org/dikw/dikw.htm  Information sources  Very well organized, indexed, controlled  Totally unorganized, uncharacterized, uncontrolled  Something in between

Databases  Databases hold specific data items  Organization is explicit  Keys relate items to each other  Queries are constrained, but effective in retrieving the data that is there  Databases generally respond to specific queries with specific results  Browsing is difficult  Searching for items not anticipated by the designers can be difficult

The Web  The Web contains many kinds of elements  Organization?  There are no keys to relate items to each other  Queries are unconstrained; effectiveness depends on the tools used.  Web search tools generally respond to general queries with specific results  Browsing is possible, though somewhat complicated  There are no designers of the overall Web structure.  Describe how you frequently use the web  What works easily?  What has been difficult?

Digital Library  Something in between the very structured database and the unstructured Web.  Content is controlled. Someone makes the entries. (Maybe a lot of people make the entries, but there are rules for admission.)  Searching and browsing are somewhat open, not controlled by fixed keys and anticipated queries.  Nature of the collection regulates indexing somewhat.

In all cases  Trying to connect an information user to the specific information wanted.  Concerned with efficiency and effectiveness  Effectiveness - how well did we do?  Efficiency - how well did we use available resources?

Effectiveness  Two measures:  Precision  Of the results returned, what percentage are meaningful to the goal of the query?  Recall  Of the materials available that match the query, what percentage were returned?  Ex. Search returns 590,000 responses and 195 are relevant. How well did we do?  Not enough information.  Did the 590,000 include all relevant responses? If so, recall is perfect.  195/590,000 is not good precision!
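
The two measures can be worked through with the numbers from the slide. This is a minimal sketch; the function names are illustrative, and the recall call uses a hypothetical total because the slide does not say how many relevant items exist in the collection.

```python
def precision(relevant_retrieved: int, retrieved: int) -> float:
    """Fraction of the returned results that are relevant."""
    return relevant_retrieved / retrieved if retrieved else 0.0


def recall(relevant_retrieved: int, relevant_total: int) -> float:
    """Fraction of all relevant items in the collection that were returned."""
    return relevant_retrieved / relevant_total if relevant_total else 0.0


# Numbers from the slide: 590,000 responses, 195 of them relevant.
p = precision(195, 590_000)
print(f"precision = {p:.6f}")  # roughly 0.00033, i.e. about 0.033%

# Recall cannot be computed from the slide alone: it needs the total number
# of relevant items. ASSUMPTION for illustration only: the 195 relevant
# responses are all the relevant items that exist, so recall is perfect.
r = recall(195, 195)
print(f"recall = {r:.2f}")
```

The example makes the slide's point concrete: precision is known and poor, while recall depends on a number the search results themselves do not reveal.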

The process  Query entered → Query interpreted → Index searched → Items retrieved → Results ranked

The Collection  Where does the collection come from?  How is the index created?  Those are important distinguishing characteristics  Inverted Index -- Ordered list of terms related to the collected materials. Each term has an associated pointer to the related material(s).  www.cs.cityu.edu.hk/~deng/5286/T51.doc
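
An inverted index as described above can be sketched in a few lines. The document ids and texts are made up for illustration, and splitting on whitespace is a simplifying assumption; real systems tokenize more carefully.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, list[str]]:
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Ordered list of terms, each pointing to its related documents.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical three-document collection.
docs = {
    "d1": "information retrieval basics",
    "d2": "web search and retrieval",
    "d3": "digital library basics",
}
index = build_inverted_index(docs)
for term, doc_ids in index.items():
    print(term, "->", doc_ids)
```

Looking up a term is then a single dictionary access instead of a scan over every document.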

Crawling the web  Misnomer as the spider or robot does not actually move about the web  Program sends a normal request for the page, just as a browser would.  Retrieve the page and parse it.  Look for anchors -- pointers to other pages. Put them on a list of URLs to visit  Extract key words (possibly all words) to use as index terms related to that page  Take the next URL and do it again  Actually, the crawling and processing are parallel activities
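
The crawl loop above can be sketched as follows. To keep the example self-contained, `FAKE_WEB` and `fetch` are hypothetical stand-ins for real HTTP requests, and the sketch runs sequentially rather than in parallel.

```python
from collections import deque
from html.parser import HTMLParser


class LinkAndWordParser(HTMLParser):
    """Collects anchor targets and the words of the page text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


# ASSUMPTION: a tiny in-memory "web" instead of real network requests.
FAKE_WEB = {
    "/home": '<a href="/about">About us</a> <a href="/news">News</a>',
    "/about": '<a href="/home">Home</a> retrieval basics',
    "/news": 'web search news <a href="/missing">gone</a>',
}


def fetch(url):
    """Hypothetical fetcher; returns None when the page does not exist."""
    return FAKE_WEB.get(url)


def crawl(seed):
    frontier = deque([seed])  # list of URLs still to visit
    seen = {seed}             # avoid visiting the same URL twice
    page_words = {}           # url -> candidate index terms
    while frontier:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkAndWordParser()
        parser.feed(html)
        parser.close()
        page_words[url] = parser.words
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return page_words


pages = crawl("/home")
```

A real crawler would also resolve relative URLs, respect robots.txt, and limit its request rate; none of that is shown here.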

Responding to search queries  Use the query string provided  Form a boolean query  Join all words with AND? With OR?  Find the related index terms  Return the information available about the pages that correspond to the query terms.  Many variations on how to do this. Usually proprietary to the company.
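
The AND versus OR choice can be illustrated with set operations over an inverted index. This is only a sketch of the basic idea; the small index is hypothetical, and production search engines use their own, usually proprietary, variations.

```python
def boolean_search(index, query, mode="AND"):
    """Join all query words with AND (intersection) or OR (union)."""
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    terms = query.lower().split()
    postings = [set(index.get(term, ())) for term in terms]
    if not postings:
        return []
    if mode == "AND":
        result = set.intersection(*postings)
    else:
        result = set.union(*postings)
    return sorted(result)


# Hypothetical inverted index: term -> ids of documents containing it.
index = {
    "information": ["d1"],
    "retrieval": ["d1", "d2"],
    "web": ["d2"],
}

and_hits = boolean_search(index, "web retrieval", mode="AND")
or_hits = boolean_search(index, "information web", mode="OR")
```

AND narrows the result set and tends to favor precision; OR widens it and tends to favor recall.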

Making the connections  Stemming  Making sure that simple variations in word form are recognized as equivalent for the purpose of the search: exercise, exercises, exercised, for example.  Indexing  A keyword or group of selected words  Any word (more general)  How to choose the most relevant terms to use as index elements for a set of documents.  Build an inverted file for the chosen index terms.
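
To make the stemming idea concrete, here is a deliberately naive suffix stripper. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length are assumptions chosen so that the slide's example words collapse to one form.

```python
# ASSUMPTION: a tiny, hand-picked suffix list, checked in this order.
SUFFIXES = ("ing", "es", "ed", "s", "e")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {w: naive_stem(w) for w in ("exercise", "exercises", "exercised")}
print(stems)
```

All three variations map to the same stem, so a query containing one of them can match documents containing any of the others. A stripper this crude will also conflate unrelated words, which is why real stemmers use more careful rules.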

The Vector model  Let  N be the total number of documents in the collection  n_i be the number of documents that contain the term k_i  freq(i,j) be the raw frequency of k_i within document d_j  A normalized tf (term frequency) factor is given by  tf(i,j) = freq(i,j) / max_l freq(l,j)  where the maximum is computed over all terms k_l that occur within the document d_j  The idf (inverse document frequency) factor is computed as  idf(i) = log(N / n_i)  The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term k_i.
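
The tf and idf factors defined above can be computed directly. The sketch below multiplies them into a single weight, which is the usual tf-idf combination but is an assumption here, since the slide defines the two factors without stating how they are combined. The natural log and the two toy documents are also assumptions.

```python
import math
from collections import Counter


def tf_idf(docs: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return weights[doc][term] = tf(term, doc) * idf(term)."""
    N = len(docs)  # total number of documents
    counts = {doc: Counter(terms) for doc, terms in docs.items()}

    # n[term] = number of documents that contain the term
    n = Counter()
    for term_counts in counts.values():
        n.update(term_counts.keys())

    weights = {}
    for doc, term_counts in counts.items():
        # Largest raw frequency of any term in this document.
        max_freq = max(term_counts.values()) if term_counts else 1
        weights[doc] = {
            term: (freq / max_freq) * math.log(N / n[term])
            for term, freq in term_counts.items()
        }
    return weights


# Hypothetical collection, already split into terms.
docs = {
    "d1": ["web", "web", "search"],
    "d2": ["web", "library"],
}
w = tf_idf(docs)
```

Here "web" occurs in every document, so its idf is log(2/2) = 0 and it carries no weight, while "library" occurs in only one document and gets the full log(2).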

Anatomy of a web page  Metatags: Information about the page  Primary source of indexing information for a search engine.  Ex. Title. Never mind what has an H1 tag (though that may be considered), what is in the brackets?  Other tags provide information about the page. This is easier for the search engine to use than determining the meaning of the text of the page.  Dealing with the cheaters  False information provided in the web page to make the search engine return this page  False metatags, invisible words (repeated many times), etc
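
As a rough illustration of how an indexer might read this information, the sketch below pulls the title and the named meta tags out of a page with Python's standard `html.parser`. The sample page and the class name are hypothetical, and nothing here detects the false or repeated metatags mentioned above.

```python
from html.parser import HTMLParser


class HeadInfoParser(HTMLParser):
    """Collects the page title and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page used only for this example.
PAGE = """<html><head>
<title>Basics of Information Retrieval</title>
<meta name="description" content="Intro slides on IR">
<meta name="keywords" content="retrieval, index, web">
</head><body><h1>Welcome</h1></body></html>"""

parser = HeadInfoParser()
parser.feed(PAGE)
parser.close()
print(parser.title)
print(parser.meta)
```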

Standard Metatags  The Dublin Core (http://dublincore.org/)  15 common items to use in labeling any web document:  Title, Creator, Subject, Description, Publisher, Contributor, Date, Resource type, Format, Identifier, Source, Language, Relation, Coverage, Rights

Hubs and authorities  Hub points to a lot of other places.  CITIDEL is a hub for computing information  NSDL is a hub for science, technology, engineering and mathematics education.  Authorities are pointed to by a lot of other places.  W3C.org is an authority for information about the web.  When Hub or Authority status is captured, the search can be more accurate.  If several pages match a query, and one is an authority page, it will be ranked higher.  When a hub matches a query, the pages it points to are likely to be relevant.
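
One well-known way to capture hub and authority status is the iterative HITS scheme, in which a page's authority score sums the hub scores of pages linking to it and its hub score sums the authority scores of pages it links to. The slides do not name a specific algorithm, so treat this as one possible sketch. The link graph is hypothetical and does not describe the real sites.

```python
import math


def hits(links: dict[str, list[str]], iterations: int = 20):
    """Return (hub, authority) scores for a graph given as page -> outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: pointed to by good hubs.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for target in targets:
                auth[target] += hub[src]
        # Hub: points to good authorities.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# ASSUMPTION: an invented link structure, for illustration only.
links = {
    "portal": ["w3c", "nsdl", "blog"],
    "blog": ["w3c"],
    "nsdl": ["w3c"],
}
hub, auth = hits(links)
top_hub = max(hub, key=hub.get)
top_authority = max(auth, key=auth.get)
```

In this toy graph the page with the most outgoing links to well-cited pages comes out as the top hub, and the page that everyone points to comes out as the top authority.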

Some Digital Library examples  Between the chaos of the Web and the strict structure of a database, the digital library contains an organized collection.  We saw the digital collection at the Falvey library session.  See also:  NSDL www.nsdl.org  And the computing component, CITIDEL: citidel.villanova.edu  American Memory http://memory.loc.gov/ammem/index.html

Conclusions  The plan was to introduce the basic concepts of information retrieval in a form accessible to most students, before you have read anything about it.  We will look more deeply at these subjects in the coming weeks.  A word about the pattern for these slides …
