LIS618 lecture 3 Thomas Krichel 2004-02-15

LIS618 lecture 3 Thomas Krichel

Structure
Revision of what was done last week.
Theory: discussion of the Boolean model
Theory: the vector model
Practice: introducing Nexis
More Nexis next week

advantages of Boolean model
supposedly easy to grasp by the user
precise semantics of queries
implemented in the majority of commercial systems

problems of Boolean model
sharp distinction between relevant and irrelevant documents
no ranking possible
users find it difficult to formulate Boolean queries
users find it difficult to resolve Boolean queries

vector model
associates weights with each index term appearing in the query and in each database document.
relevance can be calculated as the cosine between the two vectors, i.e. their dot product divided by the product of their lengths, where the length of a vector is the square root of the sum of its squared weights.
Since the weights are not negative, this measure varies between 0 and 1.
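
The cosine measure can be sketched in a few lines of Python. This is only an illustration: the function name `cosine_similarity`, the dictionary representation of the vectors, and the sample weights are assumptions made for the example, not part of the lecture.

```python
import math

def cosine_similarity(query, document):
    """Cosine of the angle between two term-weight vectors given as dicts."""
    # Dot product: only terms present in both vectors contribute.
    dot = sum(w * document[t] for t, w in query.items() if t in document)
    # Length of each vector: square root of the sum of squared weights.
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in document.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_q * norm_d)

# Hypothetical weights: query and document share one of two terms each.
query = {"a": 1.0, "b": 1.0}
doc = {"a": 1.0, "c": 1.0}
print(cosine_similarity(query, doc))  # about 0.5
```

A vector compared with itself gives a value of about 1, and two vectors with no term in common give 0.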

tf/idf
stands for term frequency / inverse document frequency
This refers to a technique that gives a term a high rank in a document if
–the term appears frequently in the document
–the term does not appear frequently in other documents
We will look at each component one at a time.

absolute & maximum term frequency
Let F_t_d be the number of times term t appears in the document d. This is its absolute term frequency in the document.
Let m_d be the maximum absolute term frequency achieved by any term in document d.
Examples
–Document 1: a b a a b c c d
m_1 = 3, because "a" appears 3 times
–Document 2: a b a f f f e d f a a
m_2 = 4, because "a" and "f" each appear 4 times

relative document term frequency
The relative term frequency f_t_d is given by
f_t_d = F_t_d / m_d
that is, the absolute term frequency of term t in document d divided by the maximum absolute term frequency of document d.
This completes the "term frequency" part of the tf/idf formula. Let us look at this part through an example.

main example, part I
Consider three documents
–1: a b c a f o n l p o f t y x
–2: a m o e e e n n n a n p l
–3: r a e e f n l i f f f f x l
First, look at the maximum frequency achieved by any term in a given document.
m_1 = 2 ("a", "f" and "o" are there twice)
m_2 = 4 ("n" is there four times)
m_3 = 5 ("f" is there five times)

main example, part II
Now look at some examples of absolute term frequency
F_a_1 = 2
F_e_2 = 3
F_x_3 = 1
and some examples of relative term frequency
f_a_1 = F_a_1 / m_1 = 2 / 2 = 1
f_e_2 = F_e_2 / m_2 = 3 / 4 = 0.75
f_x_3 = F_x_3 / m_3 = 1 / 5 = 0.2
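
The relative term frequencies of the main example can be recomputed with a short Python sketch. The three documents are taken from the slides; the names `documents` and `relative_term_frequency` are made up for this illustration.

```python
from collections import Counter

# The three documents of the main example, split into terms.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def relative_term_frequency(term, doc_terms):
    """f_t_d = F_t_d / m_d for one term and one document."""
    counts = Counter(doc_terms)       # absolute frequency F_t_d of every term
    m_d = max(counts.values())        # maximum absolute frequency in the document
    return counts[term] / m_d

print(relative_term_frequency("a", documents[1]))  # 1.0
print(relative_term_frequency("e", documents[2]))  # 0.75
print(relative_term_frequency("x", documents[3]))  # 0.2
```

A term that does not occur in a document gets a relative frequency of 0, because `Counter` returns 0 for missing keys.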

inverse document frequency
Let N be the number of documents in the database. N = 3 in our example.
Let n_t be the number of documents where the term t appears. In our example
n_a = 3
n_e = 2
n_x = 2
N/n_t is an indication of the inverse document frequency of a term. It is larger the less a term appears across documents in the database.
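
As a small check, the document counts n_t and the ratio N/n_t can be computed in Python. The documents are those of the main example; the names `documents` and `document_count` are chosen only for this sketch.

```python
# The three documents of the main example, split into terms.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

N = len(documents)  # number of documents in the database

def document_count(term):
    """n_t: the number of documents in which the term appears at least once."""
    return sum(1 for terms in documents.values() if term in terms)

for term in ("a", "e", "x"):
    n_t = document_count(term)
    print(term, n_t, N / n_t)
# a 3 1.0
# e 2 1.5
# x 2 1.5
```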

intermezzo: the logarithm
The logarithm, written log(), is a mathematical function. You should know that
–log() is an increasing function, i.e. the bigger x is, the bigger log(x) is.
–log(1) = 0
–log(x) > 0 if x > 1
Your calculator will tell you what the logarithm of a number is.

tf/idf formula
Term frequency and inverse document frequency have to be combined.
The final formula for the weight combines the terms as follows
w_t_d = f_t_d * log( N / n_t )

main example, part III
N = 3
w_a_1 = 1 * log(3/3) = log(1) = 0 (the weight is zero!)
w_e_2 = 0.75 * log(3/2)
w_x_3 = 0.2 * log(3/2)
where log(3/2) = 0.176, approximately, using the logarithm to base 10
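
The three weights can be worked out numerically with the following Python sketch. It assumes the base-10 logarithm, which is the base that gives log(3/2) = 0.176; the function name `weight` is made up for the example.

```python
import math
from collections import Counter

# The three documents of the main example, split into terms.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def weight(term, doc_id):
    """w_t_d = f_t_d * log(N / n_t), with a base-10 logarithm (assumed)."""
    counts = Counter(documents[doc_id])
    f_t_d = counts[term] / max(counts.values())   # relative term frequency
    n_t = sum(1 for terms in documents.values() if term in terms)
    if n_t == 0:
        return 0.0  # term appears nowhere in the database
    return f_t_d * math.log10(len(documents) / n_t)

print(weight("a", 1))            # 0.0
print(round(weight("e", 2), 3))  # 0.132
print(round(weight("x", 3), 3))  # 0.035
```

The term "a" appears in every document, so its weight is zero however often it occurs within a single document.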

practical operation
The computer will search the documents for the query term and return the documents where the weight of the term in the index for that document is strictly positive, ordered by weight, highest to lowest.
If there are several query terms the computer will perform a more complicated operation that we will not study further here, so we limit ourselves to the case of one query term.
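
A single-term search of this kind can be sketched in Python as follows. This is a toy illustration on the three documents of the main example, not a description of how any real system is built; the names `weight` and `search` are invented here, and the base-10 logarithm is an assumption.

```python
import math
from collections import Counter

# The three documents of the main example, split into terms.
documents = {
    1: "a b c a f o n l p o f t y x".split(),
    2: "a m o e e e n n n a n p l".split(),
    3: "r a e e f n l i f f f f x l".split(),
}

def weight(term, doc_id):
    """tf/idf weight of a term in a document (base-10 logarithm assumed)."""
    counts = Counter(documents[doc_id])
    n_t = sum(1 for terms in documents.values() if term in terms)
    if n_t == 0:
        return 0.0
    f_t_d = counts[term] / max(counts.values())
    return f_t_d * math.log10(len(documents) / n_t)

def search(term):
    """Return the ids of documents with strictly positive weight, best first."""
    weighted = [(doc_id, weight(term, doc_id)) for doc_id in documents]
    hits = [(doc_id, w) for doc_id, w in weighted if w > 0]
    hits.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in hits]

print(search("x"))  # [1, 3]
```

For the query term "x", document 1 comes first because its relative term frequency (1/2) is higher than that of document 3 (1/5), while the inverse document frequency part is the same for both.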

practical tests
You ask the computer to query the term "a" in our example. What documents are being returned?
–Compare with the result of the Boolean model.
You ask the computer to query the term "e". What documents are being returned, and in what order?

advantages of vector model
term weighting improves performance
sorting is possible
easy to compute, therefore fast
results are difficult to improve without
–query expansion
–user feedback circle

Lexis/Nexis
Lexis is a specialized legal research service
Nexis is primarily a news service
adds an important temporal component to all its contents
restricts contents as compared to Dialog
potentially bad competition from Google
lives at

compilation of Nexis
Uses a number of news sources such as newspapers.
Uses company reports databases
Uses web sites, the URLs of which are found in the news sources. Some of the material there can be of low value (remember the comments in the first lecture)

SmartIndexing
There is a controlled vocabulary of indexing terms
A document is indexed
–In full text view (except web sites)
–With automatic addition of index terms that correspond to the document.
Index terms are added
Weight of index terms is calculated
nexis.com/infopro/products/index/ has more on it.
nexis.com/

equivalents
Nexis has a number of "equivalents" where, depending on sources, it replaces one with the other. Contrary to their claims they also work in quick search
–First (second, third, etc.) = 1st (2nd, 3rd, etc.)
–Monday (all days except Sunday) = Mon (Tues, Weds, etc.)
–January (abbreviations work) = Jan (Feb, Mar, etc.)
–One (all numbers < 20) = 1 (2, 3, etc.)
–and = &
–company = co
–corporation = corp
–incorporated = inc

Six interfaces to Nexis
Quick search
Subject directory
Power search
Personal news
Search forms
Real time news
In the remainder of the lecture I will go through some of these

Quick search
Implicit OR between terms
Use quotes to require adjacency of terms
You can select from a drop-down box of sources
You can set the date range, though it is unclear what it means
It seems to OR a plural to your search term.
Sometimes returns documents with none of the search terms.
she is the one

Quick search
It is not clear what parts of documents are being searched
Apparently it does not search the full text. But it seems to prioritize
–TERM, i.e. smart keywords extracted,
–HLEAD for news
–TITLE for legal documents
–WEB-SEARCH-TEXT for web pages

relevance ranking
concerns
–where terms appear within the document
–how many occurrences of the terms appear in the document
–how often those search terms appear throughout the document
apparently not how much they occur; as an example, search for "the" or "the the"
it seems that they keep the algorithm a secret

Subject directory
you can follow the subject tree
but there seems to be only a small number of documents
categories are not particularly deep or developed
there is a "more like this" feature of limited use, Thomas finds

Power search
You can first create a customized set of sources to search
Do this at the start: you browse a menu, then click done, search now
This is a lot more efficient than trying to build a search strategy on a large set.

Power search truncation
* represents a single character, present or absent
–wom*n
–labo*r
! truncates to the end of the word
–bookk!
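
To make the meaning of the two truncation symbols concrete, here is a Python sketch that translates such a pattern into a regular expression. It only mimics the behaviour described on the slide and is not how Nexis itself works; the function names are invented, and treating "a character" as a word character (`\w`) is an assumption.

```python
import re

def pattern_to_regex(pattern):
    """Translate a truncation pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w?")   # one character, present or absent
        elif ch == "!":
            parts.append(r"\w*")   # any number of characters up to the word end
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, word):
    """True if the whole word fits the truncation pattern."""
    return pattern_to_regex(pattern).fullmatch(word) is not None

print(matches("wom*n", "woman"), matches("wom*n", "women"))    # True True
print(matches("labo*r", "labor"), matches("labo*r", "labour")) # True True
print(matches("bookk!", "bookkeeper"), matches("bookk!", "book"))  # True False
```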

Power search connectors
OR
AND
AND NOT
PRE/n, n is a number, ordered proximity
W/n, n is a number, unordered proximity
W/S words in the same sentence
W/P words in the same paragraph
Use parentheses!
There is no implicit OR as in Quick search, so forget about the double quotes.
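
The difference between ordered and unordered proximity can be illustrated with a small Python sketch. It assumes that "within n" means the two terms are at most n word positions apart, which is a guess at the semantics rather than documented Nexis behaviour; the function names and the sample sentence are invented.

```python
def positions(term, words):
    """All word positions at which the term occurs."""
    return [i for i, w in enumerate(words) if w == term]

def within(words, first, second, n):
    """W/n: the two terms occur at most n positions apart, in either order."""
    return any(i != j and abs(i - j) <= n
               for i in positions(first, words)
               for j in positions(second, words))

def precedes(words, first, second, n):
    """PRE/n: the first term occurs before the second, at most n positions apart."""
    return any(0 < j - i <= n
               for i in positions(first, words)
               for j in positions(second, words))

words = "the central bank raised interest rates".split()
print(within(words, "rates", "bank", 3))    # True: order does not matter
print(precedes(words, "rates", "bank", 3))  # False: "rates" comes after "bank"
print(precedes(words, "bank", "rates", 3))  # True
```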

Power search expressions
Parentheses group terms together
* for one or no letter
! for any number of letters
ATLEAST n (term), where n is a minimum number of occurrences
PLURAL (term) only the plural of term
SINGULAR (term) only the singular of term
ALLCAPS (term) only capitals of term
NOCAPS (term) no capitals of term
CAPS (term) capitalized term only

Thank you for your attention!