Slide 1 EE3J2 Data Mining Lecture 7: Topic Spotting & Query Expansion – Martin Russell

Slide 2 EE3J2 Data Mining Objectives
• To introduce Topic Spotting
– Salience and Usefulness
– Example: The AT&T “How May I Help You?” system
• Text retrieval revisited
– Query expansion
– Synonyms & hyponyms
– WordNet

Slide 3 EE3J2 Data Mining Topic Spotting
• A type of dedicated IR system
– Always looking for the same type of documents
– Documents about a particular topic
– The corpus from which data is retrieved is dynamic
• Examples
– Detect all weather forecasts in BBC Radio 4 broadcasts
– Find all documents written by Charlotte Bronte
– …

Slide 4 EE3J2 Data Mining Topic Spotting Queries
• Examples rather than explicit queries
• ‘Find more like this’
• May have a substantial quantity of example data
• But why is Topic Spotting different to IR?
• Can exploit the large amount of query data
• Calculate better measures of the usefulness of a term than simple IDF

Slide 5 EE3J2 Data Mining IDF Revisited
• Recall the definition of IDF: for a term t, IDF(t) = log( D / d(t) ), where D is the total number of documents in the collection and d(t) is the number of documents that contain t
• IDF is concerned only with whether or not a term is contained in a document
• But there is also information in the number of times the term occurs in documents
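As a reminder of how this document-level weight is computed, here is a minimal Python sketch (the corpus representation and function name are illustrative, not from the lecture):

    import math
    from collections import Counter

    def idf(corpus):
        """IDF(t) = log(D / d(t)): D documents in total, d(t) of them contain t."""
        D = len(corpus)
        doc_freq = Counter()
        for doc in corpus:                 # corpus is a list of tokenised documents
            doc_freq.update(set(doc))      # each term counted at most once per document
        return {t: math.log(D / df) for t, df in doc_freq.items()}

    print(idf([["distance", "run"], ["run", "marathon"], ["weather"]]))

Note that the term counts within each document never enter the calculation, which is exactly the limitation the next slides address.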

Slide 6 EE3J2 Data Mining Salience
• Suppose we have a set of ‘training’ documents
– Some ‘on topic’
– Some ‘off topic’
• Then, for a term t, we can estimate the probabilities:
– P(t | Topic)
– P(t | Not Topic)
– P(t)

Slide 7 EE3J2 Data Mining ‘Salience’ and ‘Usefulness’
• Given a term t and a topic T, define the salience of t (relative to T) by:
– S(t) = P(T | t) × log( P(T | t) / P(T) )
• Similarly, the usefulness of t (relative to T) is given by:
– U(t) = P(t | T) × log( P(t | T) / P(t) )
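A small Python sketch of how these two quantities might be estimated from labelled training data, using the definitions above (the function and variable names are illustrative, not part of the lecture):

    import math
    from collections import Counter

    def salience_usefulness(on_topic_docs, off_topic_docs, term):
        """Estimate S(term) and U(term) from tokenised on-topic / off-topic documents."""
        on_counts = Counter(t for doc in on_topic_docs for t in doc)
        off_counts = Counter(t for doc in off_topic_docs for t in doc)
        N1, N2 = sum(on_counts.values()), sum(off_counts.values())

        if on_counts[term] == 0:               # term never seen on topic
            return 0.0, 0.0

        p_t_given_T = on_counts[term] / N1                          # P(t | T)
        p_t = (on_counts[term] + off_counts[term]) / (N1 + N2)      # P(t)
        p_T = len(on_topic_docs) / (len(on_topic_docs) + len(off_topic_docs))  # P(T)
        p_T_given_t = p_T * p_t_given_T / p_t                       # Bayes' rule (slide 8)

        usefulness = p_t_given_T * math.log(p_t_given_T / p_t)
        salience = p_T_given_t * math.log(p_T_given_t / p_T)
        return salience, usefulness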

Slide 8 EE3J2 Data Mining Bayes’ Theorem
• Remember Bayes’ Theorem?
– P(T | t) = P(t | T) × P(T) / P(t)

Slide 9 EE3J2 Data Mining Salience and Usefulness
• By Bayes’ theorem, P(T | t) / P(T) = P(t | T) / P(t), so salience and usefulness share the same logarithmic factor:
– S(t) = P(T | t) × log( P(t | T) / P(t) )
– U(t) = P(t | T) × log( P(t | T) / P(t) )

Slide 10 EE3J2 Data Mining Salience and Usefulness
• Now, T is the topic, so P(T) is fixed. Therefore
– S(t) = P(T | t) × log( P(t | T) / P(t) ) = (P(T) / P(t)) × P(t | T) × log( P(t | T) / P(t) ) = (P(T) / P(t)) × U(t)

Slide 11 EE3J2 Data Mining Salience and Usefulness
• So, the main difference between Salience and Usefulness is that, to have high usefulness, a term must occur frequently (P(t | T) must be large)

Slide 12 EE3J2 Data Mining Example
• A term w occurs:
– t1 times in documents about topic T
– t2 times in documents which are not about topic T
• Total number of terms:
– in documents about topic T is N1
– in documents not about topic T is N2
• Then: P(w | T) = t1/N1, P(w) = (t1 + t2)/(N1 + N2)
– So U(w) = (t1/N1) × log( (t1/N1) / ((t1 + t2)/(N1 + N2)) )

Slide 13 EE3J2 Data Mining Example
• A term w occurs:
– 150 times in documents about topic T
– 230 times in documents which are not about topic T
• Total number of terms:
– in documents about topic T is 12,500
– in documents not about topic T is 23,100
• So:
– P(w | T) = 150/12,500 = 0.012, P(w) = 380/35,600 ≈ 0.0107
– log(P(w | T)/P(w)) = ln(0.012/0.0107) ≈ 0.115 (using natural logarithms)
– U(w) ≈ 0.012 × 0.115 ≈ 0.0014

Slide 14 EE3J2 Data Mining Example (continued)
• P(T) = N_T / N, where N_T and N are the total number of documents about topic T and the total number of documents, respectively
• So, if, say, 1 document in 100 is about the topic, then P(T) = 0.01
• Then S(w) = (P(T)/P(w)) × U(w) = (0.01/0.0107) × U(w)
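The arithmetic on the last three slides can be checked with a few lines of Python (natural logarithms are assumed here, as above):

    import math

    t1, t2 = 150, 230        # occurrences of w in on-topic / off-topic documents
    N1, N2 = 12500, 23100    # total number of terms in on-topic / off-topic documents
    P_T = 0.01               # assume 1 document in 100 is about the topic

    p_w_given_T = t1 / N1                 # 0.012
    p_w = (t1 + t2) / (N1 + N2)           # about 0.0107
    U_w = p_w_given_T * math.log(p_w_given_T / p_w)
    S_w = (P_T / p_w) * U_w               # using the relation S(w) = (P(T)/P(w)) * U(w)
    print(U_w, S_w)                       # roughly 0.0014 and 0.0013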

Slide 15 EE3J2 Data Mining Topic Spotter
• Diagram: a data stream is fed into the topic spotter

Slide 16 EE3J2 Data Mining Example
• The AT&T “How May I Help You?” system
• Task: to understand what AT&T customers say to operators
• Look ‘HMIHY?’ up on the web

Slide 17 EE3J2 Data Mining AT&T How May I Help You?
• Diagram: speech recognition → language processing (using a salient word list) → routing to one of 15 services (Service 1, Service 2, Service 3, …, Service 15)

Slide 18 EE3J2 Data Mining AT&T How May I Help You?
• HMIHY? treats telephone network services as topics or documents, to be detected or retrieved
• Example salient words (word, salience):
– Difference 4.04, Cost 3.39, Rate 3.37, Much 3.24, Emergency 2.23, Misdialed 1.43, Wrong 1.37, Code 1.36
– Dialled 1.29, Area 1.28, Time 1.23, Person 1.23, Charge 1.22, Home 1.13, Information 1.11, Credit 1.11
• Allen Gorin, “Processing of semantic information in fluent spoken language”, Proc. ICSLP 1996

Slide 19 EE3J2 Data Mining HMIHY Demonstrations
• See

Slide 20 EE3J2 Data Mining Query Processing
• Remember how we previously processed a query
• Example:
– “I need information on distance running”
• Stop word removal:
– information, distance, running
• Stemming:
– information, distance, run
• But what about:
– “The London marathon will take place…”
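A minimal sketch of this pre-processing pipeline, assuming NLTK is available for stemming and using a tiny illustrative stop-word list (not the lecture's own list):

    from nltk.stem import PorterStemmer   # assumes NLTK is installed

    STOP_WORDS = {"i", "need", "on", "the", "will", "take", "place"}  # illustrative only
    stemmer = PorterStemmer()

    def preprocess(query):
        """Lower-case, remove stop words, then stem the remaining terms."""
        terms = [w for w in query.lower().split() if w not in STOP_WORDS]
        return [stemmer.stem(w) for w in terms]

    print(preprocess("I need information on distance running"))
    # ['inform', 'distanc', 'run']  (Porter stemming is more aggressive than the slide's illustration)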

Slide 21 EE3J2 Data Mining Synonyms, hyponyms, hypernyms & antonyms
• We know there is a relationship between
– run, distance, and
– marathon
• We know that a ‘marathon’ is a ‘long distance run’
• Words with the same meaning are synonyms
• If a query q contains a word w1 and w2 is a synonym of w1, then w2 should be added to q
• This is an example of query expansion

Slide 22 EE3J2 Data Mining Thesaurus
• A thesaurus is a ‘dictionary’ of synonyms and semantically related words and phrases
• e.g. Roget’s Thesaurus
• Example entry: physician
– syn: croaker, doc, doctor, MD, medical, mediciner, medico
– rel: medic, general practitioner, surgeon

Slide 23 EE3J2 Data Mining Hyponyms
• Not only synonyms are useful for query expansion
• Query q = “Tell me about England”
• Document d = “A visit to London should be on everyone’s itinerary”
• ‘London’ is a hyponym of ‘England’
• Hyponym ~ subordinate ~ subset
• If a query q contains a word w1 and w2 is a hyponym of w1, then w2 should be added to q

Slide 24 EE3J2 Data Mining Hypernyms
• Hypernyms are also useful for query expansion
• Query q = “Tell me about England”
• Document d = “Places to visit in the British Isles”
• ‘British Isles’ is a hypernym of ‘England’
• Hypernym ~ generalisation ~ superset
• If a query q contains a word w1 and w2 is a hypernym of w1, then w2 should be added to q
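These relations can be explored programmatically. A small sketch using NLTK's WordNet interface (assuming the wordnet corpus has been downloaded; the function name is illustrative) that collects synonyms, hyponyms and hypernyms of a word as candidate expansion terms:

    from nltk.corpus import wordnet as wn   # assumes nltk and the wordnet corpus are installed

    def expansion_terms(word):
        """Collect synonyms, hyponyms and hypernyms of `word` from WordNet."""
        terms = set()
        for synset in wn.synsets(word, pos=wn.NOUN):
            terms.update(l.name() for l in synset.lemmas())        # synonyms
            for hypo in synset.hyponyms():
                terms.update(l.name() for l in hypo.lemmas())      # more specific terms
            for hyper in synset.hypernyms():
                terms.update(l.name() for l in hyper.lemmas())     # more general terms
        terms.discard(word)
        return terms

    print(sorted(expansion_terms("marathon")))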

Slide 25 EE3J2 Data Mining Antonyms
• An antonym is a word which is opposite in meaning to another (e.g. bad and good)
• The occurrence of an antonym can also be relevant

Slide 26 EE3J2 Data Mining WordNet
• An online lexical database for the English language
• Category / Forms / Meanings (syn sets):
– Nouns: 57,000 forms, 48,800 meanings
– Adjectives: 19,500 forms, 10,000 meanings
– Verbs: 21,000 forms, 8,400 meanings
• See Belew, chapter 6

Slide 27 EE3J2 Data Mining WordNet
• WordNet is organised as a set of hierarchical trees
• For nouns, there are 25 trees
• Children of a node correspond to hyponyms
• Words become more specific as you move deeper into the tree
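The tree structure can be inspected directly. A short sketch (again assuming NLTK's WordNet corpus is available) that follows hypernym links from a word's first noun sense up towards the root of its tree, showing how terms become more general:

    from nltk.corpus import wordnet as wn

    def path_to_root(word):
        """Follow hypernym links from the first noun sense of `word` up to its tree root."""
        synset = wn.synsets(word, pos=wn.NOUN)[0]
        path = [synset]
        while synset.hypernyms():
            synset = synset.hypernyms()[0]
            path.append(synset)
        return [s.name() for s in path]

    print(path_to_root("marathon"))   # senses become more general towards the root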

Slide 28 EE3J2 Data Mining Noun Categories
• act, action, activity
• animal, fauna
• artefact
• attribute, property
• body, corpus
• cognition, knowledge
• communication
• event, happening
• feeling, emotion
• food
• group, collection
• location, place
• motive
• natural object
• natural phenomenon
• person, human being
• plant, flora
• possession
• process
• quantity, amount
• relation
• shape
• state, condition
• substance
• time

Slide 29 EE3J2 Data Mining Query-document scoring
• A query q is expanded to include hyponyms and synonyms
• Recall that for a document d, the query–document score is a sum, over the query terms t, of contributions based on the frequency f(d, t) of t in d

Slide 30 EE3J2 Data Mining Query expansion
• Suppose:
– t is the original term in the query,
– t′ is a synonym or hyponym of t which occurs in d
• Then we could define the effective frequency of t in d as f(d, t) + α(t, t′) × f(d, t′)
• where α(t, t′) is a weighting depending on how ‘far’ apart t and t′ are according to WordNet
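A minimal sketch of this kind of weighted expansion, with hand-set α values standing in for a WordNet-derived distance (all names and numbers here are illustrative, not from the lecture):

    def expanded_tf(doc_tf, query_term, expansions, alpha):
        """Effective term frequency of query_term in a document,
        counting related terms t' with weight alpha[(query_term, t')]."""
        tf = doc_tf.get(query_term, 0)
        for t_prime in expansions.get(query_term, []):
            tf += alpha.get((query_term, t_prime), 0.0) * doc_tf.get(t_prime, 0)
        return tf

    # Illustrative data: term frequencies for one document, and hand-set weights
    doc_tf = {"marathon": 2}
    expansions = {"run": ["marathon"]}
    alpha = {("run", "marathon"): 0.5}     # closer in WordNet -> larger weight

    print(expanded_tf(doc_tf, "run", expansions, alpha))   # 1.0

In this way a document that never mentions ‘run’ can still score against the expanded query, with the weight controlling how much the related term contributes.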

Slide 31 EE3J2 Data Mining Summary
• Topic spotting:
– Salience and usefulness
– How May I Help You?
• Query expansion
• WordNet