EE3J2 Data Mining, Lecture 7: Topic Spotting & Query Expansion
Martin Russell

Objectives
• To introduce Topic Spotting
  – Salience and Usefulness
  – Example: The AT&T “How May I Help You?” system
• Text retrieval revisited
  – Query expansion
  – Synonyms & hyponyms
  – WordNet

Topic Spotting
• Type of dedicated IR system
  – Always looking for the same type of documents
  – Documents about a particular topic
  – Corpus from which data is retrieved is dynamic
• Examples
  – Detect all weather forecasts in BBC Radio 4 broadcasts
  – Find all documents written by Charlotte Bronte
  – …

Topic Spotting Queries
• Examples rather than explicit queries
• ‘Find more like this’
• May have substantial quantity of example data
• But why is Topic Spotting different to IR?
• Can exploit large amount of query data
• Calculate better measures of the usefulness of a term than simple IDF

IDF Revisited
• Recall the definition of IDF
• IDF is concerned only with whether or not a term is contained in a document
• But there is also information in the number of times the term occurs in documents
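The point about IDF ignoring term counts can be illustrated with a minimal sketch. It assumes the common form IDF(t) = log(N / n_t), where N is the number of documents and n_t the number containing t; the natural logarithm, the function name and the toy documents are all assumptions made for illustration.

```python
import math

def idf(term, documents):
    """IDF under the assumed form log(N / n_t); documents are token lists."""
    n_docs = len(documents)
    # Only presence or absence of the term in each document matters here.
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        raise ValueError("term does not occur in any document")
    return math.log(n_docs / n_containing)

# Two toy collections (illustrative data): 'rain' occurs three times in the
# first document of docs_a but only once in the first document of docs_b.
docs_a = [["rain", "rain", "rain", "wind"], ["sun"], ["rain"], ["cloud"]]
docs_b = [["rain", "wind"], ["sun"], ["rain"], ["cloud"]]

# Both collections give the same IDF, because the repeat counts are ignored.
print(idf("rain", docs_a) == idf("rain", docs_b))  # True
```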

Salience
• Suppose we have a set of ‘training’ documents
  – Some ‘on topic’
  – Some ‘off topic’
• Then, for a term t, we can estimate the probabilities:
  – P(t | Topic)
  – P(t | Not Topic)
  – P(t)
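One way to estimate these three probabilities is by relative term frequency, the same counting scheme the worked example later in the lecture uses. The sketch below assumes documents are already tokenised into lists of terms; the function name and the toy data are hypothetical.

```python
from collections import Counter

def term_probabilities(term, on_topic_docs, off_topic_docs):
    """Return (P(t|Topic), P(t|Not Topic), P(t)) as relative term frequencies."""
    on_counts = Counter(tok for doc in on_topic_docs for tok in doc)
    off_counts = Counter(tok for doc in off_topic_docs for tok in doc)
    n_on = sum(on_counts.values())    # total terms in on-topic documents
    n_off = sum(off_counts.values())  # total terms in off-topic documents
    p_t_topic = on_counts[term] / n_on
    p_t_not_topic = off_counts[term] / n_off
    p_t = (on_counts[term] + off_counts[term]) / (n_on + n_off)
    return p_t_topic, p_t_not_topic, p_t

# Illustrative training data: one on-topic and two off-topic documents.
on_topic = [["rain", "wind", "rain", "cloud"]]
off_topic = [["goal", "match", "rain", "score"], ["vote", "poll", "tax", "law"]]

print(term_probabilities("rain", on_topic, off_topic))  # (0.5, 0.125, 0.25)
```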

‘Salience’ and ‘Usefulness’
• Given a term t and a topic T, define the salience of t (relative to T) by:
• Similarly, the usefulness of t (relative to T) is given by:
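The defining equations are not preserved in this transcript. The forms below are a reconstruction, not a quotation of the slide: they are chosen to agree with the relation S(w) = (P(T)/p(w))·U(w) and the log ratio log(p(w|T)/p(w)) that appear in the worked example later in the lecture.

```latex
\begin{align*}
S(t) &= P(T \mid t)\,\log\frac{P(T \mid t)}{P(T)} \\
U(t) &= P(t \mid T)\,\log\frac{P(t \mid T)}{P(t)}
\end{align*}
```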

Bayes’ Theorem
• Remember Bayes’ Theorem? For a term t and a topic T:
  P(T | t) = P(t | T) P(T) / P(t)

Salience and Usefulness

Salience and Usefulness
• Now, T is the topic, so P(T) is fixed. Therefore
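The step from Bayes' theorem to this conclusion can be sketched as follows. It assumes salience S(t) = P(T|t) log(P(T|t)/P(T)) and usefulness U(t) = P(t|T) log(P(t|T)/P(t)); these are assumed forms, chosen to agree with the relation S(w) = (P(T)/p(w))·U(w) given in the worked example. Bayes' theorem makes the two log ratios equal, so:

```latex
\begin{align*}
\frac{P(T \mid t)}{P(T)} &= \frac{P(t \mid T)}{P(t)} \\
S(t) &= \frac{P(t \mid T)\,P(T)}{P(t)}\,\log\frac{P(t \mid T)}{P(t)}
      = \frac{P(T)}{P(t)}\,U(t) \\
S(t) &\propto \frac{U(t)}{P(t)} \qquad \text{(for fixed } P(T)\text{)}
\end{align*}
```

Read the other way, U(t) is proportional to P(t)·S(t), so a salient term only scores highly on usefulness if it is also frequent.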

Salience and Usefulness
• So, the main difference between salience and usefulness is that, to have high usefulness, a term must occur frequently

Example
• A term w occurs:
  – t_1 times in documents about topic T
  – t_2 times in documents which are not about topic T
• Total number of terms:
  – in documents about topic T is N_1
  – in documents not about topic T is N_2
• Then: P(w|T) = t_1/N_1, P(w) = (t_1 + t_2)/(N_1 + N_2)
  – So

Example
• A term w occurs:
  – 150 times in documents about topic T
  – 230 times in documents which are not about topic T
• Total number of terms:
  – in documents about topic T is 12,500
  – in documents not about topic T is 23,100
• So:
  – p(w|T) = 0.012, p(w) = 0.0107, log10(p(w|T)/p(w)) = 0.051
  – U(w) = 0.00054

Example (continued)
• P(T) = N_T/N, where N_T and N are the total number of documents about topic T and the total number of documents, respectively
• So, if, say, 1 document in 100 is about the topic, then P(T) = 0.01
• Then S(w) = (P(T)/p(w)) * U(w) = (0.01/0.0107) * U(w)
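The arithmetic of this example can be checked with a short sketch. It assumes usefulness has the form U(w) = p(w|T)·log10(p(w|T)/p(w)); that definition is an assumption, not text from the slides, and the function names are hypothetical. The salience line uses the relation S(w) = (P(T)/p(w))·U(w) stated above. Under the assumed definition the usefulness comes out near 0.00061 rather than the quoted 0.00054; 0.00054 is what multiplying the log ratio by p(w) instead of p(w|T) gives.

```python
import math

def usefulness(p_w_given_topic, p_w):
    """Assumed form: U(w) = p(w|T) * log10(p(w|T) / p(w))."""
    return p_w_given_topic * math.log10(p_w_given_topic / p_w)

def salience(p_topic, p_w, u):
    """Relation from the example: S(w) = (P(T) / p(w)) * U(w)."""
    return (p_topic / p_w) * u

# Counts from the example.
t1, t2 = 150, 230          # occurrences of w on topic / off topic
n1, n2 = 12_500, 23_100    # total terms on topic / off topic

p_w_given_topic = t1 / n1             # 0.012
p_w = (t1 + t2) / (n1 + n2)           # about 0.0107
log_ratio = math.log10(p_w_given_topic / p_w)
u = usefulness(p_w_given_topic, p_w)
s = salience(0.01, p_w, u)            # P(T) = 0.01, as in the example

print(f"p(w|T)={p_w_given_topic:.4f} p(w)={p_w:.4f} log10 ratio={log_ratio:.3f}")
print(f"U(w)={u:.5f} S(w)={s:.5f}")
```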

Topic spotter
[Figure: block diagram with the labels ‘Data stream’ and ‘Topic Spotter’]

Example
• The AT&T “How May I Help You?” system
• Task: to understand what AT&T customers say to operators
• Look up HMIHY? on the web

AT&T How May I Help You?
[Figure: block diagram with the labels Speech Recognition, Language Processing, Salient word list, Service 1, Service 2, Service 3, Service 15]

AT&T How May I Help You?
• HMIHY? treats telephone network services as topics or documents, to be detected or retrieved
• Example salient words:

Word        Salience    Word         Salience
Difference  4.04        Dialled      1.29
Cost        3.39        Area         1.28
Rate        3.37        Time         1.23
Much        3.24        Person       1.23
Emergency   2.23        Charge       1.22
Misdialed   1.43        Home         1.13
Wrong       1.37        Information  1.11
code        1.36        credit       1.11

Allen Gorin, “Processing of semantic information in fluent spoken language”, Proc. ICSLP 1996

HMIHY Demonstrations
• See http://www.research.att.com/~algor/hmihy/samples.html

Query Processing
• Remember how we previously processed a query:
• Example:
  – “I need information on distance running”
• Stop word removal
  – information, distance, running
• Stemming
  – information, distance, run
• But what about:
  – “The London marathon will take place…”
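The two processing steps above can be sketched in a few lines. The stop list and the one-rule stemmer are hypothetical, with just enough in them to reproduce the slide's example. A real stemmer such as Porter's applies many more rules and would generally produce different stems.

```python
import re

# Hypothetical stop list: just enough to reproduce the slide's example.
STOP_WORDS = {"i", "need", "on", "the", "will"}

def toy_stem(word):
    """Strip a trailing 'ing' and undo consonant doubling (running -> run)."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    return word

def process_query(text):
    """Lower-case, tokenise, remove stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    return [toy_stem(tok) for tok in kept]

query_terms = process_query("I need information on distance running")
doc_terms = process_query("The London marathon will take place")

print(query_terms)                        # ['information', 'distance', 'run']
print(set(query_terms) & set(doc_terms))  # set()
```

The empty intersection is the problem the slide points to: the document is relevant to the query, yet after processing the two share no terms.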

Synonyms, hyponyms, hypernyms & antonyms
• We know there is a relationship between
  – run, distance, and
  – marathon
• We know that a ‘marathon’ is a ‘long distance run’
• Words with the same meaning are synonyms
• If a query q contains a word w_1 and w_2 is a synonym of w_1, then w_2 should be added to q
• This is an example of query expansion

Thesaurus
• A thesaurus is a ‘dictionary’ of synonyms and semantically related words and phrases
• E.g. Roget’s Thesaurus
• Example: physician
  – syn: croaker, doc, doctor, MD, medical, mediciner, medico
  – rel: medic, general practitioner, surgeon

Hyponyms
• Not only synonyms are useful for query expansion
• Query q = “Tell me about England”
• Document d = “A visit to London should be on everyone’s itinerary”
• ‘London’ is a hyponym of ‘England’
• Hyponym ~ subordinate ~ subset
• If a query q contains a word w_1 and w_2 is a hyponym of w_1, then w_2 should be added to q

Hypernyms
• Hypernyms are also useful for query expansion
• Query q = “Tell me about England”
• Document d = “Places to visit in the British Isles”
• ‘British Isles’ is a hypernym of ‘England’
• Hypernym ~ generalisation ~ superset
• If a query q contains a word w_1 and w_2 is a hypernym of w_1, then w_2 should be added to q
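The expansion rule stated for synonyms, hyponyms and hypernyms is the same in each case: if w_1 is in the query and w_2 is related to it, add w_2. A minimal sketch follows, using a toy lexicon built only from the examples in these slides; the table names and the function are hypothetical.

```python
# Toy lexicon; every entry is illustrative and taken from the slide examples.
SYNONYMS = {"physician": ["doctor", "doc", "medico"]}
HYPONYMS = {"england": ["london"]}
HYPERNYMS = {"england": ["british isles"]}

def expand_query(terms, relations=(SYNONYMS, HYPONYMS, HYPERNYMS)):
    """Return the query terms followed by any related words, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for table in relations:
            for related in table.get(term, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded

print(expand_query(["england"]))  # ['england', 'london', 'british isles']
```

With the expanded query, both example documents (the one mentioning London and the one mentioning the British Isles) now share a term with the query.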

Antonyms
• An antonym is a word which is opposite in meaning to another (e.g. bad and good)
• The occurrence of an antonym can also be relevant

WordNet
• Online lexical database for the English Language
• http://www.cogsci.princeton.edu/~wn

Category    Forms   Meanings (syn sets)
Nouns       57,000  48,800
Adjectives  19,500  10,000
Verbs       21,000  8,400

See Belew, chapter 6

WordNet
• WordNet is organised as a set of hierarchical trees
• For nouns, there are 25 trees
• Children of a node correspond to hyponyms
• Words become more specific as you move deeper into the tree
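The tree organisation described above can be sketched with a small hand-built hierarchy. This is a toy structure, not WordNet data or its API: the words, the dictionary layout and the function names are all assumptions made for illustration.

```python
# Toy hyponym tree: each word maps to its children (its direct hyponyms).
HYPONYM_TREE = {
    "animal": ["dog", "cat"],
    "dog": ["terrier", "spaniel"],
    "cat": [],
    "terrier": [],
    "spaniel": [],
}

def all_hyponyms(word, tree=HYPONYM_TREE):
    """Collect every descendant of `word`, depth first."""
    found = []
    for child in tree.get(word, []):
        found.append(child)
        found.extend(all_hyponyms(child, tree))
    return found

def depth(word, root="animal", tree=HYPONYM_TREE):
    """Number of edges from `root` down to `word`, or None if not found."""
    if word == root:
        return 0
    for child in tree.get(root, []):
        below = depth(word, child, tree)
        if below is not None:
            return below + 1
    return None

print(all_hyponyms("animal"))  # ['dog', 'terrier', 'spaniel', 'cat']
print(depth("terrier"))        # 2
```

Greater depth corresponds to a more specific word, matching the last bullet above.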

Noun Categories
act, action, activity      natural object
animal, fauna              natural phenomenon
artefact                   person, human being
attribute, property        plant, flora
body, corpus               possession
cognition, knowledge       process
communication              quantity, amount
event, happening           relation
feeling, emotion           shape
food                       state, condition
group, collection          substance
location, place            time
motive

Query-document scoring
• A query q is expanded to include hyponyms and synonyms
• Recall that for a document d

Query expansion
• Suppose:
  – t is the original term in the query,
  – t' is a synonym or hyponym of t which occurs in d
• Then we could define:
• where the weighting for the pair (t, t') depends on how ‘far’ apart t and t' are according to WordNet
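The scoring formula itself is not preserved in this transcript, so the sketch below is only one plausible reading of the idea: a related term t' found in the document contributes the product of the query and document term weights, scaled down by a factor that shrinks with the distance between t and t'. The geometric decay, the unnormalised sum and all names are assumptions.

```python
def expansion_weight(distance, decay=0.5):
    """Assumed weighting: 1 for an exact match, halving with each step apart."""
    return decay ** distance

def expanded_score(query_weights, doc_weights, distances):
    """Sum weighted matches between query terms and related document terms.

    query_weights: {term: weight in the query}
    doc_weights:   {term: weight in the document}
    distances:     {query term: {related term: distance between the two}}
    """
    score = 0.0
    for t, w_tq in query_weights.items():
        candidates = {t: 0}                       # the term itself, distance 0
        candidates.update(distances.get(t, {}))   # its synonyms and hyponyms
        for t_prime, dist in candidates.items():
            if t_prime in doc_weights:
                score += expansion_weight(dist) * w_tq * doc_weights[t_prime]
    return score

# Illustrative weights for the England / London example.
query = {"england": 1.0}
document = {"london": 2.0, "visit": 1.0}

print(expanded_score(query, document, {}))                         # 0.0
print(expanded_score(query, document, {"england": {"london": 1}}))  # 1.0
```

Without expansion the document scores zero. With ‘london’ one step from ‘england’ it scores half of what an exact match with the same weights would.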

Summary
• Topic spotting:
  – Salience and usefulness
  – How May I Help You?
• Query expansion
• WordNet
