Information Retrieval (Csci6403), Dr. Carolyn Watters


2 Information Retrieval (Csci6403), Dr. Carolyn Watters

3 Outline
Definitions
Information Retrieval
Information Theory
Feature Sets & Term Characteristics

4 General Terms & Concepts
Data Information Retrieval Document Question Answering Filtering Clustering Browsing

5 History
Card catalog
Hole punch
Databases and queries
Multimedia (images, audio, etc.)
Web

6 Examples
New York Times
Google
Amazon
Medline
Lexis/Nexis

7 http://www.lexis-nexis.com

8 IR and CS?
What are these systems based on?
How can we make them better?
How do we know if they are effective?
What else could we do using these techniques?

9 IR and Databases
             IR                    DBMS
Data         unstructured          structured
Attributes   vague                 well defined
Queries      keyword & features    SQL defined
Results      imprecise             exact

10 Basic Ideas/Problems Behind IR
Retrieve text that contains the answer
Use keywords to represent query
Assume user can articulate need
No universal categorization of data
Relevant items are similar to query
Relevant items are similar to each other
More than one right answer
Results may “satisfice”

11 Similarity
Query -> document
Document -> document
Similar?
– String matching
– Controlled vocabulary match
– Same meaning
– Probability about same topic
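The first two notions of "similar" on this slide can be contrasted with a small Python sketch. The vocabulary table, function names, and sample texts are hypothetical and only illustrate the idea; a real controlled vocabulary would be far larger and curated.

```python
# Hypothetical controlled vocabulary: maps surface terms to one preferred term.
CONTROLLED_VOCAB = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
}


def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def string_match(query, document):
    """True if any query word appears verbatim among the document's words."""
    doc_words = set(tokens(document))
    return any(word in doc_words for word in tokens(query))


def vocabulary_match(query, document, vocab=CONTROLLED_VOCAB):
    """True if query and document share a term after mapping to preferred terms."""
    def normalize(text):
        return {vocab.get(word, word) for word in tokens(text)}

    return bool(normalize(query) & normalize(document))


query = "car"
document = "Used auto for sale"
print(string_match(query, document))      # False: "car" never appears literally
print(vocabulary_match(query, document))  # True: both map to "automobile"
```

The other two notions on the slide (same meaning, probability of being about the same topic) need more than a lookup table and are not attempted here.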

12 Using Keywords as Feature Set
Bag of Words Approach
Compare words as independent tokens
Why would we do this?
For example:
– DOW weathers storm
– storm weathers door
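A minimal Python sketch of the bag-of-words idea, using the two phrases from the slide. The function name is an assumption, and the third phrase ("storm weathers DOW") is added here purely to show that word order is discarded.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text as word counts, ignoring case and word order."""
    return Counter(text.lower().split())


a = bag_of_words("DOW weathers storm")
b = bag_of_words("storm weathers door")

# Words the two bags have in common (Counter & Counter keeps the minimum counts).
shared = a & b
print(sorted(shared))  # ['storm', 'weathers']

# Reordering the words gives exactly the same bag (illustrative phrase, not from the slide).
reordered = bag_of_words("storm weathers DOW")
print(a == reordered)  # True
```

Treating words as independent tokens makes the comparison cheap, at the cost of losing who did what to whom.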

13 Important Words?
Enron Ruling Leaves Corporate Advisers Open to Lawsuits
By KURT EICHENWALD
A ruling last week by a federal judge in Houston may well have accomplished what a year's worth of reform by lawmakers and regulators has failed to achieve: preventing the circumstances that led to Enron's stunning collapse from happening again.
To casual observers, Friday's decision by the judge, Melinda F. Harmon, may seem innocuous and not surprising. In it, she held that banks, law firms and investment houses — many of them criticized on Capitol Hill for helping Enron construct off-the-books partnerships that led to its implosion — could be sued by investors who are seeking to regain billions of dollars they lost in the debacle.

14 IR and the Bag of Words
Find all words in document
Compare query words to these words
Works pretty well!!!
Improvements
– ???
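The two steps on this slide (find the words in each document, compare the query words to them) can be sketched as follows. This is one simple reading of the slide, not a method taken from the course: documents are scored by how many distinct query words they contain. The document collection and function names are hypothetical; "d1" reuses the headline from the previous slide.

```python
def words(text):
    """Set of lowercase words in the text, with simple punctuation stripped."""
    cleaned = {word.strip(".,;:!?\"'()").lower() for word in text.split()}
    return cleaned - {""}


def rank(query, documents):
    """Return ids of documents sharing at least one word with the query, best first."""
    query_words = words(query)
    scored = [(len(query_words & words(text)), doc_id)
              for doc_id, text in documents.items()]
    # Sort by descending score, then by id so that ties are deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


# Hypothetical three-document collection.
DOCUMENTS = {
    "d1": "Enron Ruling Leaves Corporate Advisers Open to Lawsuits",
    "d2": "DOW weathers storm",
    "d3": "Federal judge issues ruling on Enron investors",
}

print(rank("Enron ruling", DOCUMENTS))  # ['d1', 'd3']
```

Every matching word counts the same here, which is one obvious place to look for the "improvements" the slide asks about.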

15 Information Theory & IR
Shannon 1948
Information content (value) of a message depends on both the receiver’s knowledge and the message content

16 Try this
Merry ….
Happy ….
Prime Minister ….
Professor ….. teaches computer science.
Tomorrow we expect a high temperature of ….
Warning …….

17 Information Theory
Value or content of a message is based on how much the receiver’s uncertainty (entropy) is reduced
Predictability of the message (impact of content)
– Very predictable – low uncertainty – low entropy
  Hello, good day, how are you? Fine.
– Unpredictable – high uncertainty – high entropy
  Move your car. Leave the building.

18 Information Content
Function H defines the information content: $H(p) = -\log p$
p is the a priori probability that a message could be predicted
So, if a receiver can predict a message with p = 1, then H(1) = 0
If the receiver cannot predict the message, then p = 0 and H(0) is undefined
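A short Python sketch of the information content function. The slide does not state the base of the logarithm; base 2 is assumed here so that the result is in bits, and the function name is an assumption.

```python
import math


def information_content(p):
    """Information content -log2(p), in bits, of a message with a priori probability p."""
    if not 0 < p <= 1:
        # -log(0) diverges, so H(0) is left undefined, as on the slide.
        raise ValueError("p must satisfy 0 < p <= 1; H(0) is undefined")
    # log2(1/p) equals -log2(p) and avoids printing -0.0 when p == 1.
    return math.log2(1.0 / p)


print(information_content(1))     # 0.0: a fully predictable message carries no information
print(information_content(0.5))   # 1.0
print(information_content(0.25))  # 2.0
```

The less predictable the message, the larger the value: halving p adds one bit.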

19 Calculation of Entropy
Example: receive one letter of the alphabet
– $H = -\log_2 (1/26) = \log_2 26 \approx 4.7$ bits if all letters are equally likely
– 4.14 bits given the known letter distribution
Given n messages, the average information content (bits) of any one of those messages is
$H = -\sum_{r=1}^{n} p_r \log_2 p_r$
Average entropy is maximized when?
– All messages are equally likely
– When would this occur?
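The alphabet example can be checked with a few lines of Python. The `entropy` helper is a sketch, and the skewed distribution is made up for illustration; it is not the real letter distribution behind the 4.14-bit figure.

```python
import math


def entropy(probabilities):
    """Average information content, in bits, of a discrete distribution."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Terms with p == 0 contribute nothing, so they are skipped.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# 26 equally likely letters.
uniform = [1 / 26] * 26
print(round(entropy(uniform), 2))  # 4.7

# Hypothetical skewed distribution: one letter takes half the probability mass.
skewed = [0.5] + [0.5 / 25] * 25
print(entropy(skewed) < entropy(uniform))  # True
```

Any departure from the uniform case lowers the average, which is why the known letter distribution gives fewer bits than 4.7.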

20 Entropy and Words
Given D unique words in a vocabulary
$H = -\sum_{r=1}^{D} p_r \log_2 p_r$
Turns out that
D          H (bits)
10,000     9.5
50,000     10.9
100,000    11.4
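The slide does not say which word distribution produces these figures. One assumption that lands within roughly 0.1 bit of them is Zipf's law, where the word of rank r has probability proportional to 1/r. The Python sketch below computes the entropy under that assumption only; the function name is hypothetical and the slide may have used a different model.

```python
import math


def zipf_entropy(vocabulary_size):
    """Entropy in bits of a vocabulary whose rank-r word has probability ~ 1/r.

    Zipf's law is an assumption made for this sketch, not something the
    slide states.
    """
    weights = [1.0 / rank for rank in range(1, vocabulary_size + 1)]
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights)


for size in (10_000, 50_000, 100_000):
    print(size, round(zipf_entropy(size), 2))
```

For comparison, a uniform distribution over 10,000 words would need log2(10,000), about 13.3 bits, so the skew in word frequencies accounts for several bits of the difference.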

21 Using Entropy
Information content is additive: $H(p_1, p_2) = H(p_1) + H(p_2)$
So what??
Google queries
– some terms have more information value
– some retrieval messages have more information value
SO??
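Additivity follows from the logarithm when the two messages are independent: their joint probability is the product of p_1 and p_2, and -log(p_1 p_2) = -log p_1 - log p_2. The Python sketch below checks this and applies it to query terms. The term probabilities are invented for illustration, base 2 is assumed, and treating query terms as independent is a simplification.

```python
import math


def information_content(p):
    """Information content -log2(p), in bits."""
    return math.log2(1.0 / p)


# Additivity for two independent messages.
p1, p2 = 0.5, 0.25
print(information_content(p1 * p2))                         # 3.0
print(information_content(p1) + information_content(p2))    # 3.0

# Hypothetical probabilities that a term occurs in a document.
TERM_PROBABILITY = {"the": 0.5, "ruling": 0.01, "enron": 0.001}


def query_information(terms, probabilities=TERM_PROBABILITY):
    """Total information of a query, assuming its terms are independent."""
    return sum(information_content(probabilities[term]) for term in terms)


# A rare term carries more information than a common one.
print(query_information(["enron"]) > query_information(["the"]))  # True
```

Under these assumptions, rare terms contribute most of a query's information value.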

22 Next?
Examine the nature of these words
Why?
What is the relative value of search terms?
What is the relative value of terms in the document set?

23 For Tuesday
Read the handout article
Prepare a review for it using the form found on the web site
Articles on reviewing can be found at the end of the Topics page
Think about the notion of finding information based only on the words used in the text!

