Pattern Recognition Problems in Computational Linguistics

Information Retrieval:
– Is this doc more like relevant docs or irrelevant docs?
Author Identification:
– Is this doc more like author A’s docs or author B’s docs?
Word Sense Disambiguation:
– Is the context of this use of “bank” more like sense 1’s contexts or like sense 2’s contexts?
Machine Translation:
– Is the context of this use of “drug” more like those that were translated as “drogue” or those that were translated as “médicament”?

Applications of Naïve Bayes
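All four problems above reduce to the same comparison: under which class’s model is this document more probable? A minimal Naïve Bayes sketch of that log-odds comparison (the word counts and class names below are invented for illustration):

    import math
    from collections import Counter

    def log_odds(doc, counts_a, counts_b, alpha=1.0):
        """Sum of per-word log-likelihood ratios, with add-alpha smoothing."""
        vocab = set(counts_a) | set(counts_b)
        total_a = sum(counts_a.values()) + alpha * len(vocab)
        total_b = sum(counts_b.values()) + alpha * len(vocab)
        score = 0.0
        for w in doc:
            p_a = (counts_a.get(w, 0) + alpha) / total_a
            p_b = (counts_b.get(w, 0) + alpha) / total_b
            score += math.log(p_a / p_b)
        return score  # > 0: more like class A; < 0: more like class B

    # Toy counts: "relevant" vs. "irrelevant" documents
    relevant = Counter({"stock": 5, "market": 4, "shares": 2})
    irrelevant = Counter({"recipe": 3, "garden": 4, "market": 1})
    print(log_odds(["stock", "market"], relevant, irrelevant))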

Classical Information Retrieval (IR)

Boolean Combinations of Keywords
– Dominated the market (before the web)
– Popular with intermediaries (librarians)
Rank Retrieval (Google)
– Sort a collection of documents (e.g., scientific papers, abstracts, paragraphs) by how much they ‘‘match’’ a query
– The query can be a (short) sequence of keywords or arbitrary text (e.g., one of the documents)

IR Models

Keywords (and Boolean combinations thereof)
Vector-Space ‘‘Model’’ (Salton, chap 10.1)
– Represent the query and the documents as V-dimensional vectors
– Sort vectors by similarity to the query (typically the cosine of the angle between the vectors)
Probabilistic Retrieval Model (Salton, chap 10.3)
– Sort documents by the estimated probability that the document is relevant to the query
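A minimal sketch of vector-space ranking, using raw term counts in place of the usual TF-IDF weights (the documents and query are invented):

    import math
    from collections import Counter

    def cosine(u, v):
        """Cosine of the angle between two sparse term-count vectors."""
        dot = sum(c * v[w] for w, c in u.items() if w in v)
        norm_u = math.sqrt(sum(c * c for c in u.values()))
        norm_v = math.sqrt(sum(c * c for c in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    docs = {
        "d1": Counter("the cat sat on the mat".split()),
        "d2": Counter("dogs and cats make good pets".split()),
    }
    query = Counter("cat mat".split())

    # Rank retrieval: sort the collection by how much each document matches the query.
    for name in sorted(docs, key=lambda d: -cosine(query, docs[d])):
        print(name, round(cosine(query, docs[name]), 3))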

Motivation for Information Retrieval (circa 1990, about 5 years before the web)

Text is available like never before
– Currently N ≈ 100 million words, and projections run as high as 10^12 bytes by 2000!
What can we do with it all?
– It is better to do something simple than nothing at all
IR vs. Natural Language Understanding
– Revival of 1950s-style empiricism

How Large is Very Large?

From a keynote to the EMNLP conference, formerly the Workshop on Very Large Corpora

Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology

1985: “There is no data like more data”
– Fighting words uttered by radical fringe elements (Mercer at Arden House)
1993: Workshop on Very Large Corpora
– Perfect timing: just before the web
– Couldn’t help but succeed
– Fate
1995: The web changes everything
All you need is data (magic sauce)
– No linguistics
– No artificial intelligence (representation)
– No machine learning
– No statistics
– No error analysis

“It never pays to think until you’ve run out of data” – Eric Brill

Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
– Fire everybody and spend the money on data
– More data is better data!
– No consistently best learner
– Quoted out of context
Moore’s Law constant: data collection rates → improvement rates

The rising tide of data will lift all boats!

TREC Question Answering & Google: What is the highest point on Earth?

The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data: Dictionaries, Ontologies, WordNets, Language Models, etc.

    England    Japan        Cat        cat
    France     China        Dog        more
    Germany    India        Horse      ls
    Italy      Indonesia    Fish       rm
    Ireland    Malaysia     Bird       mv
    Spain      Korea        Rabbit     cd
    Scotland   Taiwan       Cattle     cp
    Belgium    Thailand     Rat        mkdir
    Canada     Singapore    Livestock  man
    Austria    Australia    Mouse      tail
    Australia  Bangladesh   Human      pwd

More data → better results
– TREC Question Answering
  Remarkable performance: Google and not much else
  – Norvig (ACL-02)
  – AskMSR (SIGIR-02)
– Lexical Acquisition
  Google Sets
  – We tried similar things, but with tiny corpora, which we called large

Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology

Entropy of Search Logs
– How Big is the Web?
– How Hard is Search?
– With Personalization? With Backoff?

Qiaozhu Mei†, Kenneth Church‡
† University of Illinois at Urbana-Champaign
‡ Microsoft Research

How Big is the Web? 5B? 20B? More? Less?

What if a small cache of millions of pages
– Could capture much of the value of billions?
Could a big bet on a cluster in the clouds
– Turn into a big liability?
Examples of Big Bets
– Computer Centers & Clusters
  Capital (Hardware)
  Expense (Power)
  Dev (MapReduce, GFS, Big Table, etc.)
– Sales & Marketing >> Production & Distribution

Millions (Not Billions)

Population Bound

With all the talk about the Long Tail
– You’d think that the web was astronomical
– Carl Sagan: Billions and billions…
Lower distribution $$ → sell less of more
But there are limits to this process
– Netflix: 55k movies (not even millions)
– Amazon: 8M products
– Vanity searches: infinite???
  Personal home pages << phone book < population
  Business home pages << yellow pages < population
Millions, not billions (until market saturates)

It Will Take Decades to Reach the Population Bound

Most people (and products)
– don’t have a web page (yet)
Currently, I can find famous people (and academics) but not my neighbors
– There aren’t that many famous people (and academics)…
– Millions, not billions (for the foreseeable future)

Equilibrium: Supply = Demand

If there is a page on the web,
– And no one sees it,
– Did it make a sound?
How big is the web?
– Should we count “silent” pages
– That don’t make a sound?
How many products are there?
– Do we count “silent” flops
– That no one buys?

Demand Side Accounting

Consumers have limited time
– Telephone usage: 1 hour per line per day
– TV: 4 hours per day
– Web: ??? hours per day
Suppliers will post as many pages as consumers can consume (and no more)
Size of web: O(Consumers)
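A back-of-envelope version of that bound (every number below is an illustrative assumption, not data from the talk):

    # Demand-side sizing sketch: pages consumed per year bounds the pages worth indexing.
    consumers = 2e9        # people online (assumed)
    hours_per_day = 1.0    # web time per consumer per day (assumed)
    pages_per_hour = 60    # pages viewed per hour (assumed)
    days = 365

    pages_consumed = consumers * hours_per_day * pages_per_hour * days
    print(f"Upper bound on demanded page-views/year: {pages_consumed:.2e}")
    # The bound scales with O(consumers), not with how many pages suppliers could post.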

How Big is the Web? Related questions come up in language

How big is English?
– Dictionary marketing
– Education (testing of vocabulary size)
– Psychology
– Statistics
– Linguistics
Two very different answers
– Chomsky: language is infinite
– Shannon: 1.25 bits per character
How many words do people know?
What is a word? Person? Know?

Chomskian Argument: Web is Infinite

One could write a malicious spider trap
– Not just an academic exercise
The web is full of benign examples, like online calendars
– Infinitely many months
– Each month has a link to the next
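A minimal sketch of the idea: a dynamically generated site whose pages always link to one more page, so a crawler never runs out of “new” URLs (the page-generation scheme here is invented for illustration):

    def month_page(year, month):
        """Generate a calendar page that links to the next month, forever."""
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        return (
            f"<html><body><h1>Calendar: {year}-{month:02d}</h1>"
            f'<a href="/cal/{next_year}/{next_month:02d}">next month</a>'
            "</body></html>"
        )

    # A crawler that follows every link requests /cal/2009/12, /cal/2010/01, ...
    # and never terminates: the set of reachable URLs is unbounded.
    print(month_page(2009, 12))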

How Big is the Web? 5B? 20B? More? Less?
More (Chomsky) – Less (Shannon)

Entropy (H), MSN Search Log, 1 month → ×18:
  Query          21.1 → 22.9
  URL            22.1 → 22.4
  IP             22.1 → 22.6
  All But IP     23.9
  All But URL    26.0
  All But Query  27.1
  All Three      27.2

More practical answer: Millions (not Billions)
– Cluster in Cloud → Desktop → Flash
– Comp Ctr ($$$$) → Walk in the Park ($)

Entropy (H)
– Size of search space; difficulty of a task
– H = 20 → 1 million items distributed uniformly
Powerful tool for sizing challenges and opportunities
– How hard is search?
– How much does personalization help?
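A minimal sketch of the measure (the query distribution is a toy; real estimates come from search logs):

    import math
    from collections import Counter

    def entropy(counts):
        """Shannon entropy in bits of an empirical distribution."""
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # A uniform distribution over 2**20 items has H = 20 bits:
    # 2**H is the "effective size" of the search space.
    queries = Counter({"weather": 50, "facebook": 30, "msg": 10, "rare query": 1})
    h = entropy(queries)
    print(f"H = {h:.2f} bits; effective search space ~ 2^H = {2**h:.1f} items")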

How Hard Is Search? Millions, not Billions

Traditional Search
– H(URL | Query) = 2.8 (= 23.9 − 21.1)
Personalized Search (IP)
– H(URL | Query, IP) = 1.2 (= 27.2 − 26.0)

Entropy (H):
  Query          21.1
  URL            22.1
  IP             22.1
  All But IP     23.9
  All But URL    26.0
  All But Query  27.1
  All Three      27.2

Personalization cuts H in Half!
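The arithmetic behind those figures is the chain rule for entropy, reading rows of the table as joint entropies (e.g., “All But IP” = H(Query, URL)):

    H(\text{URL} \mid \text{Query}) = H(\text{Query}, \text{URL}) - H(\text{Query}) = 23.9 - 21.1 = 2.8
    H(\text{URL} \mid \text{Query}, \text{IP}) = H(\text{Query}, \text{URL}, \text{IP}) - H(\text{Query}, \text{IP}) = 27.2 - 26.0 = 1.2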

How Hard are Query Suggestions? The Wild Thing? C* Rice → Condoleezza Rice

Traditional Suggestions
– H(Query) = 21 bits
Personalized (IP)
– H(Query | IP) = 5 bits (= 26 − 21)

(same entropy table as on the previous slide)

Personalization cuts H in Half! Twice

Personalization with Backoff

Ambiguous query: MSG
– Madison Square Garden
– Monosodium Glutamate
Disambiguate based on user’s prior clicks
When we don’t have data
– Back off to classes of users
Proof of concept:
– Classes defined by IP addresses
Better:
– Market segmentation (demographics)
– Collaborative filtering (other users who click like me)
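A minimal sketch of backoff, assuming click counts keyed by (query, url) at the user, IP-class, and global levels; the interpolation weights and toy counts are invented, and a real system would renormalize:

    from collections import Counter

    def p_click(url, query, user_clicks, class_clicks, global_clicks,
                lambdas=(0.6, 0.3, 0.1)):
        """Interpolated click probability: back off from user to IP class to everyone."""
        def level_prob(counts):
            total = sum(c for (q, _), c in counts.items() if q == query)
            return counts.get((query, url), 0) / total if total else 0.0
        l_user, l_class, l_all = lambdas
        return (l_user * level_prob(user_clicks)
                + l_class * level_prob(class_clicks)
                + l_all * level_prob(global_clicks))

    # Toy counts: this user never searched "msg", so their IP class disambiguates.
    user = Counter()
    ip_class = Counter({("msg", "madisonsquaregarden.com"): 8,
                        ("msg", "wikipedia.org/wiki/MSG"): 2})
    everyone = Counter({("msg", "madisonsquaregarden.com"): 500,
                        ("msg", "wikipedia.org/wiki/MSG"): 500})
    print(p_click("madisonsquaregarden.com", "msg", user, ip_class, everyone))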

Conclusions: Millions (not Billions)

How big is the web?
– Upper bound: O(Population)
  Not billions
  Not infinite: Shannon >> Chomsky
How hard is search?
– Query suggestions?
– Personalization?
Cluster in Cloud ($$$$) → Walk-in-the-Park ($)
Entropy is a great hammer

Noisy Channel Model for Web Search (Michael Bendersky)

Input → Noisy Channel → Output
– Input′ ≈ ARGMAX_Input Pr(Input) · Pr(Output | Input)
Speech
– Words → Acoustics
– Pr(Words) · Pr(Acoustics | Words)
Machine Translation
– English → French
– Pr(English) · Pr(French | English)
Web Search
– Web Pages → Queries
– Pr(Web Page) · Pr(Query | Web Page)
  (prior)        (channel model)
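A minimal sketch of ranking under this model, assuming a precomputed document prior (e.g., static rank) and a smoothed unigram channel model; the priors, counts, and vocabulary size are invented:

    import math

    V = 1000  # assumed vocabulary size for add-one smoothing (not from the slides)

    def score(page, query_terms):
        """log Pr(page) + log Pr(query | page) under a unigram channel model."""
        s = math.log(page["prior"])
        total = sum(page["term_counts"].values())
        for t in query_terms:
            p = (page["term_counts"].get(t, 0) + 1) / (total + V)  # smoothed
            s += math.log(p)
        return s

    pages = [
        {"url": "cnn.com", "prior": 0.05, "term_counts": {"cnn": 50, "news": 40}},
        {"url": "example.com", "prior": 0.001, "term_counts": {"news": 5}},
    ]
    query = ["cnn", "news"]
    for page in sorted(pages, key=lambda p: -score(p, query)):
        print(page["url"], round(score(page, query), 2))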

Document Priors

Page Rank (Brin & Page, 1998)
– Incoming link votes
Browse Rank (Liu et al., 2008)
– Clicks, toolbar hits
Textual Features (Kraaij et al., 2002)
– Document length, URL length, anchor text
– Wikipedia

Page Rank (named after Larry Page)
aka Static Rank & the Random Surfer Model

Page Rank = 1st Eigenvector
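PageRank is the dominant eigenvector of the damped random-surfer transition matrix, which power iteration finds by repeatedly applying the transition. A minimal sketch (the three-page graph is a toy; the damping factor 0.85 is the conventional choice, not stated on the slide):

    def pagerank(links, damping=0.85, iters=50):
        """Power iteration: repeatedly apply the random-surfer transition."""
        pages = list(links)
        rank = {p: 1 / len(pages) for p in pages}
        for _ in range(iters):
            # With prob. (1 - damping) the surfer teleports to a random page;
            # otherwise it follows one of the current page's out-links.
            new = {p: (1 - damping) / len(pages) for p in pages}
            for p, outs in links.items():
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
            rank = new
        return rank

    # Toy web: incoming links act as votes.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(links))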

Document Priors

Human Ratings (HRS): Perfect judgments → more likely
Static Rank (Page Rank): higher → more likely
Textual Overlap: match → more likely
– “cnn” (match)
Popular: lots of clicks → more likely (toolbar, slogs, glogs)
Diversity/Entropy: fewer plausible queries → more likely
Broder’s Taxonomy
– Applies to documents as well
– “cnn” (navigational)

Learning to Rank: Task Definition

What will determine future clicks on the URL?
– Past clicks?
– High static rank?
– High toolbar visitation counts?
– Precise textual match?
– All of the above?
~3k queries from the extracts
– 350k URLs
– Past clicks: February – May, 2008
– Future clicks: June, 2008
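A minimal sketch of the setup: combine the candidate features into one score with a learned linear model. The feature values, weights, and logistic-regression learner below are invented for illustration; the slides do not specify the actual learner used:

    import math

    FEATURES = ["past_clicks", "static_rank", "toolbar_visits", "text_match"]

    def score(weights, x):
        """Linear combination of document features and log features."""
        return sum(weights[f] * x[f] for f in FEATURES)

    def train(examples, lr=0.1, epochs=200):
        """Logistic-regression fit: does this URL get a future click (1) or not (0)?"""
        w = {f: 0.0 for f in FEATURES}
        for _ in range(epochs):
            for x, y in examples:
                p = 1 / (1 + math.exp(-score(w, x)))
                for f in FEATURES:
                    w[f] += lr * (y - p) * x[f]
        return w

    # Invented training data: past behavior (Feb-May) predicting future clicks (June).
    examples = [
        ({"past_clicks": 0.9, "static_rank": 0.7, "toolbar_visits": 0.8, "text_match": 1.0}, 1),
        ({"past_clicks": 0.0, "static_rank": 0.2, "toolbar_visits": 0.1, "text_match": 0.3}, 0),
        ({"past_clicks": 0.5, "static_rank": 0.9, "toolbar_visits": 0.4, "text_match": 0.8}, 1),
    ]
    weights = train(examples)
    print({f: round(v, 2) for f, v in weights.items()})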

Informational vs. Navigational Queries

– Fewer plausible URLs → easier query
– Click entropy: less is easier
– Broder’s Taxonomy: navigational / informational
Navigational is easier:
– “BBC News” (navigational) is easier than “news”
– Less opportunity for personalization (Teevan et al., 2008)
“bbc news” vs. “news”: navigational queries have smaller click entropy
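A minimal sketch of click entropy as the difficulty measure (the click counts are invented):

    import math

    def click_entropy(clicks):
        """Entropy (bits) of the click distribution over URLs for one query."""
        total = sum(clicks.values())
        return -sum((c / total) * math.log2(c / total) for c in clicks.values())

    # Navigational: almost everyone wants one URL.  Informational: clicks spread out.
    print(click_entropy({"bbc.co.uk/news": 95, "en.wikipedia.org/wiki/BBC_News": 5}))
    print(click_entropy({"cnn.com": 30, "bbc.co.uk/news": 25,
                         "news.google.com": 25, "reuters.com": 20}))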

Informational/Navigational by Residuals
[Figure: click entropy vs. log(#clicks); queries with positive residuals labeled informational, negative residuals navigational]

Alternative Taxonomy: Click Types

Classify queries by type
– Problem: query logs have no “informational/navigational” labels
Instead, we can use logs to categorize queries
– Commercial intent → more ad clicks
– Malleability → more query-suggestion clicks
– Popularity → more future clicks (anywhere)
Predict future clicks (anywhere)
– Past clicks: February – May, 2008
– Future clicks: June, 2008

[Figure: anatomy of a search results page – query box, mainline ads, right rail, left rail, spelling suggestions, snippets]

Paid Search

Summary: Information Retrieval (IR)

Boolean Combinations of Keywords
– Popular with intermediaries (librarians)
Rank Retrieval
– Sort a collection of documents (e.g., scientific papers, abstracts, paragraphs) by how much they ‘‘match’’ a query
– The query can be a (short) sequence of keywords or arbitrary text (e.g., one of the documents)
Logs of User Behavior (Clicks, Toolbar)
– Solitaire → multi-player game: authors, users, advertisers, spammers
– More users than authors → more information in logs than docs
– Learning to Rank: use machine learning to combine doc features & log features