Implementing Query Classification. HYP: End of Semester Update, prepared by Minh.


Previously…
Web search queries:
◦ Understand the user's goal
Broder (2002):
◦ Queries are classified into 3 categories:
  ▪ Informational
  ▪ Navigational
  ▪ Transactional

Previously…
Functional Faceted Web Query Classification
  ▪ Ambiguity: Polysemous, General, Specific
  ▪ Authority Sensitivity: Yes - No
  ▪ Spatial Sensitivity: Yes - No
  ▪ Temporal Sensitivity: Yes - No
◦ Query's 4-tuple:
◦ 3 * 2 * 2 * 2 = 24 different combinations.
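The count of 24 combinations can be checked by enumerating the facet values. This is a minimal sketch; the tuple ordering (ambiguity, authority, spatial, temporal) and the constant names are assumptions made for illustration.

```python
from itertools import product

# Facet values as listed on the slide; the names below are illustrative.
AMBIGUITY = ("Polysemous", "General", "Specific")
YES_NO = ("Yes", "No")

# One tuple per query: (ambiguity, authority, spatial, temporal).
facet_tuples = list(product(AMBIGUITY, YES_NO, YES_NO, YES_NO))

print(len(facet_tuples))  # 3 * 2 * 2 * 2 = 24
```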

Temporal Sensitivity
Definition:
◦ A keyword is temporally sensitive if the results returned by querying it on a web search engine tend to change over time.
◦ Example:
  ▪ Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc.
  ▪ Not temporally sensitive: video, buying car, etc.

Up-to-date Project Scope
Objective: to analyze the temporal sensitivity facet of web search queries.
Problem: find the temporal correlation between web queries.

Web Query Histogram
Periodic queries: Champions League Final
Non-periodic queries: Liverpool

Query Correlation
Correlation observation: two keywords are temporally related to each other

Proposed System Framework
1. Ask Google Trends for the query's histogram
2. Use a histogram digitizer program (Plotparser by WeiHua) to get the numerical data
3. Query correlation: calculate the correlation coefficient between queries
4. Query classification

Google Trends

Histogram Digitizer

Query Correlation: 1st attempt
Calculate the correlation coefficient:
◦ Using 45 months of data: January 2004 until September 2007
◦ Calculate the coefficient based on the entire histograms
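The slides do not say which correlation coefficient was used. The sketch below assumes the Pearson coefficient over two equal-length monthly series; the function name and the choice to return 0.0 for a flat series are illustrative assumptions.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric series (assumed metric)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no shape to correlate (assumption)
    return cov / sqrt(var_x * var_y)
```

With 45 monthly points per keyword, `pearson(hist_a, hist_b)` would give the single whole-histogram coefficient used in this first attempt.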

Result classification: 1st attempt
Data of 15 different popular keywords, of which:
◦ Periodic keywords:
  ▪ Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Chrismas(!).
◦ Related keywords:
  ▪ PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami
All keywords are compared with each other based on the correlation coefficient of their histograms.
(15*14)/2 = 105 instances

Result classification: 1st attempt
Classification based on threshold method:
◦ Statistical result:
  ▪ Threshold value: 0.25
Correlation Prediction   True Positive Rate   False Positive Rate
Yes                      88.89%               10.34%
No                       89.66%               11.11%
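A threshold classifier of this kind can be sketched as follows. It assumes a pair is predicted "correlated" when its coefficient is at or above the threshold; the slides do not state whether the comparison is strict, and the function names are illustrative.

```python
def predict_related(coef, threshold=0.25):
    """Predict 'Yes' when the coefficient reaches the threshold (assumed rule)."""
    return coef >= threshold

def yes_class_rates(coefs, labels, threshold=0.25):
    """Return (true positive rate, false positive rate) for the 'Yes' class."""
    tp = fp = tn = fn = 0
    for coef, actual in zip(coefs, labels):
        predicted = predict_related(coef, threshold)
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

Sweeping `threshold` over a grid and comparing the resulting rates is one way such a cut-off could be chosen.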

1st attempt problems:
Very low threshold value
◦ Only one feature is used.
The entire histogram is used, while some keywords are temporally related to each other only during certain periods.
◦ Example: Valentine - Chocolate (correlation appears during February)

Query Correlation: 2nd attempt
Interesting period:
◦ A period in which two queries are highly related to each other -> segmentation (clustering) problem

Clustering
Using Simple K-means
Algorithm to predict the number of clusters
Use WEKA to cluster the histogram
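The project used WEKA's K-means implementation. The sketch below is only a stand-in that shows the idea on one-dimensional query volumes; the deterministic initialisation and the choice to cluster on volume alone are assumptions, since the slides do not say which attributes were clustered.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain k-means on scalar values; returns (assignment, centres)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    # Deterministic start: spread the initial centres across the sorted data.
    centres = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    assignment = []
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda j: abs(v - centres[j])) for v in values
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for j in range(k):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centres[j] = sum(members) / len(members)
    return assignment, centres

volumes = [1, 2, 1, 2, 10, 11, 10, 1, 2]
labels, centres = kmeans_1d(volumes, k=2)
print(labels)  # months 4 to 6 fall in the high-volume cluster
```

Note that `k` has to be supplied by the caller, which is the weakness the later slides point out.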

Query Correlation: 2nd attempt
Periodic keyword detection:
◦ Identify the repeated pattern using correlation
◦ A periodic query tends to have a high correlation coefficient across its repeated parts.
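One way to test for a repeated pattern is to cut the monthly series into consecutive years and correlate each year with the next. This is a sketch under assumptions: a 12-month period, the Pearson coefficient, and averaging the year-to-year scores are all choices made here, not details from the slides.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def periodicity_score(series, period=12):
    """Average correlation between each full period and the following one."""
    chunks = [series[i:i + period]
              for i in range(0, len(series) - period + 1, period)]
    if len(chunks) < 2:
        raise ValueError("need at least two full periods")
    scores = [pearson(a, b) for a, b in zip(chunks, chunks[1:])]
    return sum(scores) / len(scores)

# Hypothetical data: a yearly spike in the same month versus drifting spikes.
yearly = [0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0]
periodic = yearly * 3
drifting = [0] * 36
for month in (2, 19, 34):
    drifting[month] = 5
```

Here `periodicity_score(periodic)` is close to 1, while `periodicity_score(drifting)` is slightly negative.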

Interesting Periods Projection
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram

Result Classification: 2nd attempt
Using the previous dataset
Each related keyword is compared with each periodic keyword for correlation
Result:
◦ Managed to increase the threshold value to 0.5

2nd attempt problems
K-means clustering does not guarantee correct detection of interesting periods:
◦ Because K-means requires the number of clusters as input
  ▪ -> the algorithm implemented to determine the number of clusters failed to provide the correct value
Small training data set.
The threshold detector is too simple a method.

Query Correlation: 3rd attempt
Need to find another way to identify interesting periods.
Peak period:
◦ A period in which there is a high peak in query volume
Peak detection problem:
◦ Mapping and smoothing using convolution

Clustering using peak detection Mapping:

Clustering using peak detection Smoothing using convolution:
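Smoothing by convolution can be sketched with a small averaging kernel. The kernel shape and width, and the zero padding at the edges, are assumptions for illustration; the slides do not specify them.

```python
def convolve_same(signal, kernel):
    """Discrete convolution returning an output the same length as the signal.

    Values outside the signal are treated as zero, so points near the
    edges are damped. The kernel length is expected to be odd.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + half - j  # the kernel is flipped, as in true convolution
            if 0 <= idx < len(signal):
                acc += signal[idx] * weight
        out.append(acc)
    return out

# A 3-point moving average spreads an isolated spike over its neighbours.
smoothed = convolve_same([0, 0, 3, 0, 0], [1 / 3, 1 / 3, 1 / 3])
```

A wider kernel removes more noise but also flattens narrow peaks, so the width trades off against the peak detection step that follows.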

Clustering using peak detection
Peak detection: using a simple slope-change algorithm to determine peaks and valleys
◦ (with threshold value: mean)
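A slope-change detector can be sketched as below. It assumes that the mean threshold means a peak only counts when its value exceeds the mean of the series; how the original program applied the threshold is not stated.

```python
def find_peaks_and_valleys(series):
    """Return (peak indices, valley indices) found by slope sign changes."""
    mean = sum(series) / len(series)
    peaks, valleys = [], []
    for i in range(1, len(series) - 1):
        rise_in = series[i] - series[i - 1]
        rise_out = series[i + 1] - series[i]
        if rise_in > 0 and rise_out <= 0 and series[i] > mean:
            peaks.append(i)    # slope turns from up to down, above the mean
        elif rise_in < 0 and rise_out >= 0:
            valleys.append(i)  # slope turns from down to up
    return peaks, valleys

peaks, valleys = find_peaks_and_valleys([1, 2, 8, 3, 1, 2, 3, 2, 9, 4])
print(peaks, valleys)  # [2, 8] [4, 7]
```

The small bump at index 6 is ignored because it lies below the mean; the valleys on either side of a peak can then delimit a peak period.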

Interesting Periods Projection
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram, and vice versa
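Projection can be read as taking the index ranges of the interesting periods found in one histogram and computing the correlation over the same ranges in the other. The sketch assumes half-open `(start, end)` index pairs and the Pearson coefficient; both are illustrative choices.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def project_and_correlate(periods, source, target):
    """Correlate source and target inside each period detected on source."""
    return [pearson(source[start:end], target[start:end])
            for start, end in periods]

# Hypothetical histograms: the first period matches, the second does not.
related = [0, 1, 5, 1, 0, 0, 2, 6, 2, 0]
periodic = [0, 2, 10, 2, 0, 0, 6, 2, 1, 0]
coefs = project_and_correlate([(0, 5), (5, 10)], related, periodic)
```

Calling the function a second time with the roles swapped, using periods detected on the periodic keyword, gives the "vice versa" direction.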

Result Classification: 3rd attempt
Use a larger training data set:
◦ 47 popular keywords, of which:
  ▪ 15 periodic keywords and 32 related keywords
  ▪ Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef).
◦ Data size: 15 * 32 = 480 instances

Result Classification: 3rd attempt
Apply Naïve Bayes Classifier (WEKA):
  ▪ 6 features:
  ▪ Average Coef from related keyword projection (AveRCoef)
  ▪ Average Coef from periodic keyword projection (AvePCoef)
  ▪ Overall Average Coef [= (AveRCoef+AvePCoef)/2]
  ▪ Max Coef from related keyword projection (MaxRCoef)
  ▪ Max Coef from periodic keyword projection (MaxPCoef)
  ▪ Average Max Coef [= (MaxRCoef+MaxPCoef)/2]
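The six features follow directly from the two lists of per-period coefficients. In this sketch the function name and the dictionary keys for the two combined features are illustrative; the Naïve Bayes step itself was run in WEKA and is not reproduced here.

```python
def build_features(related_coefs, periodic_coefs):
    """Build the six features from per-period correlation coefficients.

    related_coefs:  coefficients from projecting the related keyword's periods
    periodic_coefs: coefficients from projecting the periodic keyword's periods
    """
    ave_r = sum(related_coefs) / len(related_coefs)
    ave_p = sum(periodic_coefs) / len(periodic_coefs)
    max_r = max(related_coefs)
    max_p = max(periodic_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,
    }

features = build_features([0.8, 0.6], [0.4, 0.2])
```

Each of the 480 keyword pairs would yield one such feature row plus a Yes/No label for training.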

Result Classification: 3rd attempt
Statistical Result:
Correlation Prediction   True Positive Rate   False Positive Rate   Recall   F-Measure
Yes                      89.3%                5.2%
No                       94.8%                10.7%
Confusion Matrix:
A     B     <- classified as
25    3     A = Yes
16    294   B = No
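The quoted rates can be re-derived from the confusion matrix. The sketch reads the matrix as 25 and 3 in the "Yes" row and 16 and 294 in the "No" row, an interpretation of the flattened slide text that reproduces the stated percentages.

```python
# Rows are actual classes, columns are predicted classes (WEKA layout).
yes_as_yes, yes_as_no = 25, 3    # actual Yes
no_as_yes, no_as_no = 16, 294    # actual No

actual_yes = yes_as_yes + yes_as_no
actual_no = no_as_yes + no_as_no

rates = {
    "Yes": {"tpr": yes_as_yes / actual_yes, "fpr": no_as_yes / actual_no},
    "No": {"tpr": no_as_no / actual_no, "fpr": yes_as_no / actual_yes},
}

for label, r in rates.items():
    print(label, round(100 * r["tpr"], 1), round(100 * r["fpr"], 1))
# Yes 89.3 5.2
# No 94.8 10.7
```

Recall for each class equals its true positive rate, so the class-wise recall follows from the same counts.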

Future attempt: Query Normalization
Search volume tends to increase as the Internet becomes more popular
Histogram for the top 20 most popular keywords of all time:

Future attempt: Normalization
Histograms need to be normalized to remove this trend's effect!
Proposed action:
◦ Subtract the time effect
◦ Current problem: scaling introduces further distortion.
  ▪ -> Histograms from Google have already been scaled, and we have no information about the raw data.
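One possible reading of "subtract the time effect" is to fit a straight-line trend by least squares and keep the residuals. A linear trend is an assumption made here; the slides do not say what model of the time effect was intended, and this does not address the scaling problem they mention.

```python
def detrend_linear(series):
    """Subtract the least-squares straight line from a series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    denom = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (y - mean_y)
                for t, y in enumerate(series)) / denom
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

# Hypothetical series: steady growth plus one spike at index 3.
growing = [t + (5 if t == 3 else 0) for t in range(7)]
residuals = detrend_linear(growing)
```

After detrending, the spike at index 3 stands out while the steady growth is removed.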

Future attempt: From Periodic to Non-periodic
Find the correlation between two non-periodic queries.
Proposed problem: some keywords are searched heavily after other keywords
◦ Example: “tsunami” is usually searched after “earthquake” has been searched.
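A follow-on relationship of this kind could be probed with a lagged correlation: shift the follower series back by a number of steps and see which shift correlates best with the leader. The function names, the Pearson coefficient, and the sample data are assumptions for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is flat."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def best_lag(leader, follower, max_lag):
    """Return (lag, coefficient) where follower best matches a delayed leader."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        coef = pearson(leader[:-lag], follower[lag:])
        if coef > best[1]:
            best = (lag, coef)
    return best

# Hypothetical volumes: "tsunami" repeats "earthquake" one step later.
earthquake = [0, 0, 9, 1, 0, 0, 0, 7, 1, 0, 0, 0]
tsunami = [0, 0, 0, 9, 1, 0, 0, 0, 7, 1, 0, 0]
lag, coef = best_lag(earthquake, tsunami, max_lag=3)
```

A positive best lag with a high coefficient would suggest that the second keyword tends to follow the first.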

Future attempt: From Periodic to Non-Periodic Tsunami Earthquake

Potential Applications
Results re-ranking:
◦ Move more up-to-date results higher in the result list
  ▪ Example: when a user searches for Beyonce around the time of the Grammys, results related to Grammy are ranked higher
Server buffering:
◦ When a user queries Beyonce, web pages related to Grammy are buffered on the local server, in the expectation that the user will go on to search for Grammy.

Questions?

The End