CIKM Competition 2014 Second Place Solution

Presentation transcript:

CIKM Competition 2014 Second Place Solution
Team: FAndy
Zhanpeng Fang, Jie Tang
Department of Computer Science, Tsinghua University

Task
Input: a sequence of query sessions, partially labeled. Example:
  Class1 Query1 –
  Class1 Query1 Title1
  Class2 Query2 –
  Class2 Query2 Title2
  Class2 Query2 Title3
Output: the class label of test queries.
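
One plausible way to read this log into per-query records is sketched below. The tab-separated layout and the "-" placeholder for a search with no clicked title are assumptions about the file format, not given on the slide.

```python
def parse_session(lines):
    """Parse one session's log lines into [label, query, clicked_titles] records.

    Assumes tab-separated fields `class<TAB>query<TAB>title`, with "-" marking
    a search that produced no clicked title; consecutive lines that repeat the
    same (label, query) pair are merged into a single record. Both conventions
    are illustrative assumptions, not taken from the slides.
    """
    records = []
    for line in lines:
        label, query, title = line.split("\t")
        if records and records[-1][0] == label and records[-1][1] == query:
            # Same query event as the previous line: collect its clicked title.
            if title != "-":
                records[-1][2].append(title)
        else:
            records.append([label, query, [] if title == "-" else [title]])
    return records
```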

Challenges
- Character encoding: only little prior knowledge can be used.
- Heterogeneous data: queries, titles, session information.
- User search behavior: how to incorporate user search behavior to help the classification task?
- Unlabeled data: how to utilize the large-scale unlabeled data?

Result
- 0.9245 (public score) / 0.9245 (private score)
- 2nd place winner
- Achieved in 4 days, from Sep. 27th to Sep. 30th EST

Our Approach
- Feature extraction
  - Bag of words
  - User search behavior
- Learning models
  - Logistic regression
  - Gradient boosted decision trees
  - Factorization machines
- Ensemble

Feature Extraction – Bag of Words
Given a query Q:
- One gram, two grams, last gram of Q: 0 -> 0.8452
- One gram, two grams of the clicked titles: 0.8452 -> 0.9091, top 12 in the leaderboard!
- More bag-of-words features? Queries of the same session? Titles of the same session?
  Performance decreases: 0.9091 -> 0.89x. How to use the session information?
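
The query-side features above can be sketched as follows. Whitespace tokenization of the character-encoded queries and the "1g:"/"2g:"/"last:" feature-name prefixes are illustrative assumptions, not taken from the slides.

```python
def bow_features(query):
    """One-gram, two-gram, and last-gram features for a single query string."""
    tokens = query.split()
    # One grams of the query.
    feats = set("1g:" + t for t in tokens)
    # Two grams (adjacent token pairs).
    feats.update("2g:" + a + "_" + b for a, b in zip(tokens, tokens[1:]))
    # The last gram gets its own feature.
    if tokens:
        feats.add("last:" + tokens[-1])
    return feats
```

The clicked-title features on this slide are built the same way, from the tokens of each clicked title instead of the query.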

Feature Extraction – Search Behavior
Given a query Q:
- Macro features: #total searches, average length of clicked titles, length of the query. 0.9091 -> 0.9105
- Session class features: for each potential class C, calculate #class C queries in the same session and #class C queries in the next/previous query. 0.9105 -> 0.9145
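
A minimal sketch of the session class features, assuming each query in a session carries a (given or predicted) class label, with None for unlabeled queries; the feature names are hypothetical.

```python
from collections import Counter

def session_class_features(session_labels, idx, num_classes):
    """Session class features for the query at position `idx` in one session.

    For each potential class C, emit: the count of class-C queries elsewhere
    in the same session, and indicators for class C appearing as the previous
    or next query, as the slide describes.
    """
    counts = Counter(c for i, c in enumerate(session_labels)
                     if i != idx and c is not None)
    feats = {}
    for c in range(num_classes):
        feats["session_cnt_%d" % c] = counts.get(c, 0)
        feats["prev_is_%d" % c] = int(idx > 0 and session_labels[idx - 1] == c)
        feats["next_is_%d" % c] = int(idx + 1 < len(session_labels)
                                      and session_labels[idx + 1] == c)
    return feats
```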

Feature Extraction – Search Behavior
- Same session's queries can help but might contain noise, so only use similar queries!
- Same-session query features: bag-of-words features for same-session queries that are similar to query Q, using Jaccard similarity to measure similarity between queries.
- 0.9145 -> 0.9182, utilizing the large-scale unlabeled data!
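
The similarity filter can be sketched as below. The similarity threshold value is a guess; the slides do not state which cutoff was used.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sequences, as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similar_session_queries(query, session_queries, threshold=0.3):
    """Same-session queries similar enough to `query` to contribute
    bag-of-words features (threshold is a hypothetical value)."""
    return [q for q in session_queries
            if q != query and jaccard(query, q) >= threshold]
```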

Feature Extraction – Search Behavior
Further adding the clicked titles of same-session similar queries decreases performance: 0.9182 -> 0.9176.

Feature Extraction – Feature List
- 1 gram, 2 grams of the query
- 1 gram, 2 grams of the clicked titles of the query
- Macro statistics of the query
- #specific class of queries in the same session
- 1 gram, 2 grams of similar queries in the same session

Learning Models
- Logistic regression: use the implementation of Liblinear
- Factorization machine: use the implementation of LibFM
- Gradient boosted decision trees: use the implementation of XGBoost

  Method                  Implementation   Score on leaderboard
  Logistic Regression     Liblinear        0.9182
  Factorization Machine   LibFM            0.9151
  GBDT                    XGBoost          0.9225
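
As a rough stand-in for the Liblinear setup, here is a minimal sketch using scikit-learn's liblinear-backed logistic regression. The toy feature matrix and labels are placeholders; real runs would use sparse matrices built from the extracted n-gram features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy dense matrix standing in for the sparse bag-of-words features
# (illustrative data, not from the competition).
X_train = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]])
y_train = np.array([0, 0, 1, 1])

# solver="liblinear" uses the same underlying library as the slides mention.
clf = LogisticRegression(solver="liblinear")
clf.fit(X_train, y_train)
pred = clf.predict(np.array([[1, 0, 0], [0, 0, 1]]))
```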

Ensemble
- Ensemble the prediction results from the different models with logistic regression.
- Ensembling effectively improves the performance:

  Method                  Implementation   Score on validation
  Logistic Regression     Liblinear        0.9182
  Factorization Machine   LibFM            0.9151
  GBDT                    XGBoost          0.9225
  Ensemble                –                0.9245
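
The logistic-regression ensemble can be sketched as a simple stacker over the base models' predicted probabilities. All numbers below are made up for illustration, and the real task is multi-class, so the stacker would consume per-class probabilities from each model; binary labels keep the sketch short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical held-out probabilities from the three base models for four
# validation queries (illustrative values only).
p_lr   = np.array([0.90, 0.20, 0.70, 0.10])
p_fm   = np.array([0.80, 0.30, 0.60, 0.20])
p_gbdt = np.array([0.95, 0.10, 0.80, 0.05])
y_val  = np.array([1, 0, 1, 0])

# Stack the per-model predictions as features and fit a logistic
# regression on top of them, as the slide describes.
X_stack = np.column_stack([p_lr, p_fm, p_gbdt])
stacker = LogisticRegression(solver="liblinear").fit(X_stack, y_val)
ensemble_pred = stacker.predict(X_stack)
```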

Summary
"Tricks" on how to win 2nd place:
- Use unlabeled data
- Train multiple models
- Ensemble different results

Thank you! Questions?