Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University.

Slides:



Advertisements
Similar presentations
A Comparison of Implicit and Explicit Links for Web Page Classification Dou Shen 1 Jian-Tao Sun 2 Qiang Yang 1 Zheng Chen 2 1 Department of Computer Science.
Advertisements

CWS: A Comparative Web Search System Jian-Tao Sun, Xuanhui Wang, § Dou Shen Hua-Jun Zeng, Zheng Chen Microsoft Research Asia University of Illinois at.
Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos.
Yansong Feng and Mirella Lapata
Date: 2013/1/17 Author: Yang Liu, Ruihua Song, Yu Chen, Jian-Yun Nie and Ji-Rong Wen Source: SIGIR12 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Adaptive.
Improvements and extras Paul Thomas CSIRO. Overview of the lectures 1.Introduction to information retrieval (IR) 2.Ranked retrieval 3.Probabilistic retrieval.
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Document Summarization using Conditional Random Fields Dou Shen, Jian-Tao Sun, Hua Li, Qiang Yang, Zheng Chen IJCAI 2007 Hao-Chin Chang Department of Computer.
Ziv Bar-YossefMaxim Gurevich Google and Technion Technion TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AA A A AA.
Farag Saad i-KNOW 2014 Graz- Austria,
Improved TF-IDF Ranker
A Machine Learning Approach for Improved BM25 Retrieval
Implicit Queries for Vitor R. Carvalho (Joint work with Joshua Goodman, at Microsoft Research)
Ke Liu1, Junqiu Wu2, Shengwen Peng1,Chengxiang Zhai3, Shanfeng Zhu1
Contextual Advertising by Combining Relevance with Click Feedback D. Chakrabarti D. Agarwal V. Josifovski.
Query Dependent Pseudo-Relevance Feedback based on Wikipedia SIGIR ‘09 Advisor: Dr. Koh Jia-Ling Speaker: Lin, Yi-Jhen Date: 2010/01/24 1.
Ao-Jan Su † Y. Charlie Hu ‡ Aleksandar Kuzmanovic † Cheng-Kok Koh ‡ † Northwestern University ‡ Purdue University How to Improve Your Google Ranking: Myths.
Information Retrieval in Practice
Search and Retrieval: More on Term Weighting and Document Ranking Prof. Marti Hearst SIMS 202, Lecture 22.
Learning to Advertise. Introduction Advertising on the Internet = $$$ –Especially search advertising and web page advertising Problem: –Selecting ads.
Methods for Domain-Independent Information Extraction from the Web An Experimental Comparison Oren Etzioni et al. Prepared by Ang Sun
Text Classification Using Stochastic Keyword Generation Cong Li, Ji-Rong Wen and Hang Li Microsoft Research Asia August 22nd, 2003.
Overview of Search Engines
Learning Table Extraction from Examples Ashwin Tengli, Yiming Yang and Nian Li Ma School of Computer Science Carnegie Mellon University Coling 04.
Keyphrase Extraction in Scientific Documents Thuy Dung Nguyen and Min-Yen Kan School of Computing National University of Singapore Slides available at.
Slide Image Retrieval: A Preliminary Study Guo Min Liew and Min-Yen Kan National University of Singapore Web IR / NLP Group (WING)
Title Extraction from Bodies of HTML Documents and its Application to Web Page Retrieval Microsoft Research Asia Yunhua Hu, Guomao Xin, Ruihua Song, Guoping.
Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification on Reviews Peter D. Turney Institute for Information Technology National.
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
TREC 2009 Review Lanbo Zhang. 7 tracks Web track Relevance Feedback track (RF) Entity track Blog track Legal track Million Query track (MQ) Chemical IR.
1 A study on automatically extracted keywords in text categorization Authors:Anette Hulth and Be´ata B. Megyesi From:ACL 2006 Reporter: 陳永祥 Date:2007/10/16.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
Improving Web Spam Classification using Rank-time Features September 25, 2008 TaeSeob,Yun KAIST DATABASE & MULTIMEDIA LAB.
Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.
A Probabilistic Graphical Model for Joint Answer Ranking in Question Answering Jeongwoo Ko, Luo Si, Eric Nyberg (SIGIR ’ 07) Speaker: Cho, Chin Wei Advisor:
Feng Zhang, Guang Qiu, Jiajun Bu*, Mingcheng Qu, Chun Chen College of Computer Science, Zhejiang University Hangzhou, China Reporter: 洪紹祥 Adviser: 鄭淑真.
Date: 2013/8/27 Author: Shinya Tanaka, Adam Jatowt, Makoto P. Kato, Katsumi Tanaka Source: WSDM’13 Advisor: Jia-ling Koh Speaker: Chen-Yu Huang Estimating.
Text Feature Extraction. Text Classification Text classification has many applications –Spam detection –Automated tagging of streams of news articles,
Term Frequency. Term frequency Two factors: – A term that appears just once in a document is probably not as significant as a term that appears a number.
Detecting Dominant Locations from Search Queries Lee Wang, Chuang Wang, Xing Xie, Josh Forman, Yansheng Lu, Wei-Ying Ma, Ying Li SIGIR 2005.
Automatic Keyphrase Extraction (Jim Nuyens) Keywords are an everyday part of looking up topics and specific content. What are some of the ways of obtaining.
Introduction to Digital Libraries hussein suleman uct cs honours 2003.
Contextual Ranking of Keywords Using Click Data Utku Irmak, Vadim von Brzeski, Reiner Kraft Yahoo! Inc ICDE 09’ Datamining session Summarized.
Personalizing Web Search using Long Term Browsing History Nicolaas Matthijs, Cambridge Filip Radlinski, Microsoft In Proceedings of WSDM
A Scalable Machine Learning Approach for Semi-Structured Named Entity Recognition Utku Irmak(Yahoo! Labs) Reiner Kraft(Yahoo! Inc.) WWW 2010(Information.
Personalization with user’s local data Personalizing Search via Automated Analysis of Interests and Activities 1 Sungjick Lee Department of Electrical.
Scott Wen-tau Yih (Microsoft Research) Joint work with Hannaneh Hajishirzi (University of Illinois) Aleksander Kolcz (Microsoft Bing)
Creating Subjective and Objective Sentence Classifier from Unannotated Texts Janyce Wiebe and Ellen Riloff Department of Computer Science University of.
Ranking Definitions with Supervised Learning Methods J.Xu, Y.Cao, H.Li and M.Zhao WWW 2005 Presenter: Baoning Wu.
UWMS Data Mining Workshop Content Analysis: Automated Summarizing Prof. Marti Hearst SIMS 202, Lecture 16.
Mining Dependency Relations for Query Expansion in Passage Retrieval Renxu Sun, Chai-Huat Ong, Tat-Seng Chua National University of Singapore SIGIR2006.
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
Post-Ranking query suggestion by diversifying search Chao Wang.
ASSOCIATIVE BROWSING Evaluating 1 Jinyoung Kim / W. Bruce Croft / David Smith for Personal Information.
26/01/20161Gianluca Demartini Ranking Categories for Faceted Search Gianluca Demartini L3S Research Seminars Hannover, 09 June 2006.
NTU & MSRA Ming-Feng Tsai
Learning in a Pairwise Term-Term Proximity Framework for Information Retrieval Ronan Cummins, Colm O’Riordan Digital Enterprise Research Institute SIGIR.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
Predicting Short-Term Interests Using Activity-Based Search Context CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Usefulness of Quality Click- through Data for Training Craig Macdonald, ladh Ounis Department of Computing Science University of Glasgow, Scotland, UK.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 15: Text Classification & Naive Bayes 1.
Language Identification and Part-of-Speech Tagging
Applying Key Phrase Extraction to aid Invalidity Search
Introduction to Information Retrieval
Learning Term-weighting Functions for Similarity Measures
Information Retrieval and Web Design
Presentation transcript:

Finding Advertising Keywords on Web Pages Scott Wen-tau YihJoshua Goodman Microsoft Research Vitor R. Carvalho Carnegie Mellon University

Content-targeted Ads Important funding for free web services The system automatically –Finds the keywords on a web page –Displays advertisements based on those keywords Quality of the extracted keywords –Relevance of the advertisements More useful or interesting to readers Higher click-through rate Generate more revenue

Introduction A machine learning based system –Significantly better than simple TF  IDF baseline –Better than an existing system, KEA Explore different frameworks of choosing keyword candidates –Phrases vs. Words Will show that looking at whole phrases is better –Combined vs. Separate Will show that looking at all instances of a phrase together (combined) is better Extensive feature study –TF and DF Instead of TF  IDF, use them as separate features –Search Query Log Keywords that people use to query are good features to find keywords people like

Outline System Architecture –Preprocessor –Candidate selector –Classifier –Postprocessor Experiments –Data preparation –Performance measures –Results Related Work

System Architecture Pre-processor Candidate Selector Classifier Post-processor HTML Documents PowerShot0.17 Canon0.14 Canon’s S-series0.06 Digital Camera0.07 …

Pre-processor Facilitate keyword candidate selection and feature extraction Transform HTML documents into sentence-split plain-text documents –No sophisticated parsing –No block detection –Preserve/Augment some information Linguistic analysis: POS tagging Pre-processor Candidate Selector Classifier Post-processor

Candidate Selector Monolithic (1/2) Consider every consecutive words up to length 5 as candidates Digital Camera Review The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Pre-processor Candidate Selector Classifier Post-processor

Consider every consecutive words up to length 5 as candidates Digital Camera Review The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Candidate Selector Monolithic (1/2) Pre-processor Candidate Selector Classifier Post-processor

Consider every consecutive words up to length 5 as candidates Digital Camera Review The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Candidate Selector Monolithic (1/2) Pre-processor Candidate Selector Classifier Post-processor

Consider every consecutive words up to length 5 as candidates Digital Camera Review The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Candidate Selector Monolithic (1/2) Pre-processor Candidate Selector Classifier Post-processor

Consider every consecutive words up to length 5 as candidates Digital Camera Review The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Candidate Selector Monolithic (1/2) Pre-processor Candidate Selector Classifier Post-processor

Consider every consecutive words up to length 5 as candidates Digital Camera Review The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Candidate Selector Monolithic (1/2) Pre-processor Candidate Selector Classifier Post-processor

Combined vs. Separate –Information extraction community looks at keywords separately, while previous work in this area has combined all instances together Digital Camera Review The new flagship of Canon’s S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Candidate Selector Monolithic (2/2) Pre-processor Candidate Selector Classifier Post-processor

Classifier Once we have candidates, must determine which ones are best Two steps: –For each phrase, find “features” of phrase –From features, determine score of the phrase Pre-processor Candidate Selector Classifier Post-processor

Features (1/2) Digital Camera Review The new flagship of Canon’s S- series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Capitalization Linguistics (noun) Location Phrase Length Length –Sentence Information Retrieval –Term Frequency –Document Frequency Pre-processor Candidate Selector Classifier Post-processor

Features (2/2) Digital Camera Review The new flagship of Canon’s S- series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels. Hypertext Title Meta Tags –Description –Keywords –Title URL string Search Query Log –Most frequent 7.5 million queries Pre-processor Candidate Selector Classifier Post-processor

Logistic Regression Need to combine the features to get a score for each phrase For each feature, compute a weight –For a given phrase, find weighted sum of features, add them up Need to find the weights –Use training data (more later) with list of “correct” keyphrases for each document –Use “logistic regression” to find best weights y is 1 if word/phrase is relevant x is the features of the word/phrase (a vector of numbers) Learning: find weights that match the labeled training data Pre-processor Candidate Selector Classifier Post-processor

Monolithic Combined –Direct output what classifier predicts Monolithic Separate –Output the largest probability estimation of identical candidates Pre-processor Candidate Selector Classifier Post-processor

Experiments How do we collect data to train and evaluate our system? How good is our system? –How to measure performance –Which framework is the best? –Compare it with other systems Feature contribution

Data Annotation Raw data: 828 web pages –Have content-targeted advertising –Remove advertisements 5 annotators pick keywords –Asked them to choose only words/phrases that occurred in the documents –Asked them to label phrases about “things they might want to buy if reading this page” 10-fold cross validation for experiments

Performance Measures Accuracy or Recall are not very meaningful –Hard to define/pick a complete set of keywords –Rank of keywords is also important Top- n scores –We return our top n phrases –Get 1 point for each correct phrase we return (Annotator listed that keyphrase) –Divide by maximum points any system could possibly get Score is between 0 and 1 (1 is best) –K i : set of top n keywords chosen by the system for page i –A i : keywords selected by the annotators for page i –Score =

Top- n Score for 1 Document Digital Camera PowerShot S80 Canon’s S-series S PowerShot0.17 Canon0.14 Canon’s S-series0.06 Digital Camera0.07 S-series0.04 … Top-1 score? Top-1 score: 1/1 = 1.0

Top- n Score for 1 Document Digital Camera PowerShot S80 Canon’s S-series S PowerShot0.17 Canon0.14 Canon’s S-series0.06 Digital Camera0.07 S-series0.04 … Top-5 score?Top-5 score: 3/4= 0.75

Performance Comparison PhraseWord Combining identical phrases as candidates is the best framework

Performance Comparison PhraseWord Better than KEA

Performance Comparison PhraseWord Learning weights for TF and DF separately is better than TF  IDF

IR + One Set of Features

Related Work Keyword extraction (from scientific papers) –GenEx: rules + GA [Turney '00] –KEA: Naïve Bayes using 3 features [Frank et al. '99] TFxIDF, Loc, keyphrase-frequency Impedance coupling [Ribeiro-Neto et al. '05] –Match advertisements to web pages directly News Query Extraction [Henzinger, Page, et al. '03] –Extract keywords from TV news caption –Using TF  IDF and its variations to score phrases Implicit Queries from s [Goodman&Carvalho '05]

Conclusions Keyword extraction drives content-targeted advertising –Foundation of free web services –Very successful business model Extensive experimental study –TF, DF, Search Query Log are the three most useful features –Machine learning is important in tuning the weights –Monolithic combined (combine identical phrases together) is the best approach Our system is substantially better than KEA – the only publicly available keyword extraction system

Search Engine Query Log 2 nd helpful feature Size could be too large especially for client-side applications –7.5 million queries, 20 bytes per query –20 languages –3GB query log files Effects of Using a smaller query log file Restrict candidates by query log

Using Different Sizes of Query Log File