Presenter: Lung-Hao Lee ( 李龍豪 ) January 7, 309.

Presentation transcript:


• Introduction
• Calculating Page Similarity
• Finding Similar Pages
  ◦ Click Data Model (CDM)
  ◦ Query Constraint (QC) algorithm
• Experimental Results
• Discussion
• Conclusion

• Annotating the data carries a large labor cost
• Click data aggregated across many users over time provides valuable information
• Leverage click logs to augment training data by propagating class labels to similar unlabeled documents

• “Two pages that tend to be clicked by the same user queries tend to be topically similar”

[Figure: pages A and B with the queries “How to tie a tie”, “How to tie a neck tie knots” and “Tying a tie”; one page is labeled “Positive” (class “How-to”), the other has an unknown label: “Positive”?]

• A page is represented as a node in the similarity graph
• Normalize all the URLs, e.g. the following 4 URLs are treated as the same: (1) “ (2) “ (3) “ (4) “
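URL normalization can be pictured with a small sketch. The slide does not state the normalization rules and its example URLs did not survive, so the rules below (ignore the scheme, lowercase the host, drop a leading `www.`, a trailing slash and a default `index.html`) and the `example.com` URLs are illustrative assumptions only.

```python
from urllib.parse import urlsplit


def normalize_url(url: str) -> str:
    """Map URL variants to one canonical key (assumed rules, not the paper's)."""
    if "://" not in url:
        url = "http://" + url          # assume http when no scheme is given
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]      # treat www.example.com like example.com
    path = parts.path
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]   # default document maps to its directory
    path = path.rstrip("/")            # ignore trailing slashes
    return host + path


# Hypothetical variants that collapse to the same graph node.
variants = [
    "http://www.example.com/",
    "http://example.com",
    "example.com/index.html",
    "HTTP://WWW.EXAMPLE.COM/index.html",
]
keys = {normalize_url(u) for u in variants}
```

With these assumed rules all four variants map to the single key `example.com`, so their clicks are pooled on one node.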

• Each URL is represented as a vector of queries that users issued and clicked through to the page (Pantel & Lin, 2002)

• Compute the similarity between two pages as the cosine similarity of their feature vectors
• sim(p1, p2) > sim(p1, p3)
• sim(p1, p2) > sim(p2, p3)
  because p1 and p2 share more queries with each other than either shares with p3
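A minimal sketch of the similarity computation, assuming each page is stored as a dict from query string to a weight. Using raw click counts as weights is an assumption (the slides do not say how the vector entries are weighted), and the pages and queries below are made up for illustration.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse query vectors."""
    dot = sum(w * v[q] for q, w in u.items() if q in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0                     # a page with no clicks is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical pages: query -> number of clicks leading to the page.
p1 = {"how to tie a tie": 5, "tying a tie": 3}
p2 = {"how to tie a tie": 4, "tying a tie": 2, "tie knots": 1}
p3 = {"tie knots": 2, "buy silk tie": 6}

sim_12 = cosine(p1, p2)
sim_13 = cosine(p1, p3)
sim_23 = cosine(p2, p3)
```

Here p1 and p2 share two heavily clicked queries, p2 and p3 share only one lightly clicked query, and p1 and p3 share none, so sim(p1, p2) is the largest of the three.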

• What is a “seed set”? A set of labeled data
• Two algorithms for seed set expansion
  ◦ Click Data Model (CDM)
  ◦ Query Constraints (QC) algorithm

• Two phases
  ◦ Updating score phase
  ◦ Filtering phase
• Input
  ◦ S1 (positive set)
  ◦ S2 (negative set)
  ◦ G (click graph)
• Output
  ◦ E1 (positive)
  ◦ E2 (negative)
• Thresholds
  ◦ 0.1 < T1 < 0.6
  ◦ 0.6 < T2 < 1.2
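The two phases can be sketched as follows. This slide (presumably describing the Click Data Model) names the phases, inputs and outputs but not the update rule or how T1 and T2 are applied, so everything else here is an assumption: the score of an unlabeled page is taken as the summed similarity to positive seeds minus the summed similarity to negative seeds, and a single hypothetical `min_score` cutoff stands in for the filtering step.

```python
def expand_seed_set(graph, positive, negative, min_score=0.5):
    """Propagate seed labels to unlabeled neighbors (illustrative rule only).

    graph:    dict page -> dict of neighbor page -> similarity (G)
    positive: set of positive seed pages (S1)
    negative: set of negative seed pages (S2)
    Returns (E1, E2), the newly labeled positive and negative pages.
    """
    scores = {}
    # Updating score phase: accumulate evidence from labeled neighbors.
    for page, neighbors in graph.items():
        if page in positive or page in negative:
            continue
        score = 0.0
        for other, sim in neighbors.items():
            if other in positive:
                score += sim
            elif other in negative:
                score -= sim
        scores[page] = score
    # Filtering phase: keep only pages with strong enough evidence.
    e1 = {p for p, s in scores.items() if s >= min_score}
    e2 = {p for p, s in scores.items() if s <= -min_score}
    return e1, e2


# Hypothetical click graph with one positive seed "a" and one negative seed "b".
graph = {
    "a": {"u": 0.9},
    "b": {"v": 0.8},
    "u": {"a": 0.9, "w": 0.2},
    "v": {"b": 0.8},
    "w": {"u": 0.2},
}
e1, e2 = expand_seed_set(graph, positive={"a"}, negative={"b"})
```

Page `w` is connected only to an unlabeled page, so it receives no score and stays unlabeled in this one-step sketch.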

• An additional module that checks whether the queries shared by two nodes contain certain term patterns
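One way to read this check, as a hedged sketch: an edge between two pages is kept only if at least one query they share matches a task-specific pattern. The function name, the regular-expression form of the pattern and the sample queries are assumptions; the term “review” is taken from the experiments described later in the deck.

```python
import re


def passes_query_constraint(queries_a, queries_b, pattern):
    """True if any query common to both pages matches the term pattern."""
    common = set(queries_a) & set(queries_b)
    return any(re.search(pattern, q, re.IGNORECASE) for q in common)


# Hypothetical query sets for three pages.
page_a = {"digital camera reviews", "best digital camera"}
page_b = {"digital camera reviews", "camera shop"}
page_c = {"best digital camera", "camera shop"}

keep_ab = passes_query_constraint(page_a, page_b, r"\breviews?\b")
keep_ac = passes_query_constraint(page_a, page_c, r"\breviews?\b")
```

Pages A and B share a query containing “reviews”, so the edge survives; A and C share only “best digital camera”, so label propagation along that edge would be blocked.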

• Reduce the amount of human annotation effort by leveraging the click data
• Build an expansion model from labeled training data and use it to select the next round of training data

• Click data
  ◦ Collected during December 2008 from the Yahoo! search engine
  ◦ Only the top 10 URLs are considered
  ◦ URLs with fewer than 10 clicks are excluded
• Three classification tasks
  ◦ How-to
  ◦ Adult
  ◦ Review
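The two click-data filters can be sketched as below. The record layout `(query, url, rank)` with one tuple per click is hypothetical, and reading “top 10 URLs” as “result position 10 or better” is an assumption about what the slide means.

```python
from collections import Counter


def filter_clicks(records, max_rank=10, min_clicks=10):
    """Keep clicks on top-ranked results whose URL has enough total clicks.

    records: iterable of (query, url, rank) tuples, one per click.
    """
    # Assumed reading of "top 10 URLs": drop clicks below position max_rank.
    kept = [r for r in records if r[2] <= max_rank]
    # Count the remaining clicks per URL and drop rarely clicked URLs.
    clicks_per_url = Counter(url for _, url, _ in kept)
    return [r for r in kept if clicks_per_url[r[1]] >= min_clicks]


# Hypothetical log: u1 has 10 clicks, u2 only 9, u3 is clicked at position 11.
log = (
    [("how to tie a tie", "u1", 1)] * 10
    + [("how to tie a tie", "u2", 2)] * 9
    + [("how to tie a tie", "u3", 11)] * 20
)
filtered = filter_clicks(log)
```

Only the ten clicks on `u1` survive: `u2` falls below the click threshold and `u3` is outside the top positions.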

• Training sets
  ◦ 10,000 manually labeled positive and negative examples
  ◦ For the “Review” classifier, queries such as “digital camera reviews” or “baby swing reviews”
  ◦ For the “How-to” classifier, queries such as “how to clean uggs” or “best way to loose weight”
• Testing sets

• Classifier
  ◦ Gradient Boosting Decision Tree (GBDT)
• Features
  ◦ Textual, Link, URL, HTML, Other features
• Metrics
  ◦ Area Under the ROC Curve (AUC) (Fawcett, 2003)
  ◦ F score
  ◦ Accuracy
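For reference, the three metrics can be computed as in this small sketch. Taking “F score” to mean the balanced F1 is an assumption (the slide gives no weighting), AUC is computed by its pairwise-ranking definition, and the labels and scores are invented.

```python
def f_score(y_true, y_pred):
    """Balanced F1: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label is correct."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores; predictions use a 0.5 cutoff.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
```

AUC depends only on how the scores rank positives against negatives, while F score and accuracy depend on the chosen cutoff, which is why the three can move by different amounts.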

• The big improvement from CDM is observed with a model that uses 5,000 labeled examples as the seed set (+1.07% in F-score, in Accuracy and +0.25% in AUC)

• Reduces the manual labor by 50%
• QC (excluding pages that do not have “review” in their query terms) is useful when the labeled data set is small

• With 1,000 and 2,000 human-labeled examples, CDM performs worse than the baseline
• QC (excluding pages that do not have “How-to” in their query terms)

• Baseline: Type A
• CDM: Type C

• From the “How-to” classifier
• Seed 1
• Seed 2 (human labels from Expand1)
• Expand2

• A random sample of 50 positive and 50 negative examples from the “How-to” classifier
• The positive class has 82.3% precision, whereas the negative class has 83.6% precision

• Is the proposed method always useful for web page classification?
• How can we improve the quality of automatically labeled data from unlabeled data?

• Presents a method for improving web page classification by leveraging click data to augment training data
• Augments manually labeled data by modeling the similarity between pages in a click graph

• Thank you very much
• Questions & Answers