Implementing Query Classification HYP: End of Semester Update, prepared by Minh
Previously…
Web search queries:
◦ Understand the user's goal
Broder (2002):
◦ Queries are classified into 3 categories: Informational, Navigational, Transactional
Previously…
Functional Faceted Web Query Classification
Ambiguity: Polysemous, General, Specific
Authority Sensitivity: Yes - No
Spatial Sensitivity: Yes - No
Temporal Sensitivity: Yes - No
◦ Query's 4-tuple: (Ambiguity, Authority Sensitivity, Spatial Sensitivity, Temporal Sensitivity)
◦ 3 * 2 * 2 * 2 = 24 different combinations.
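The count of 24 can be checked by enumerating the facet values. This is a small sketch; the dictionary keys and the function name are illustrative, not taken from the project.

```python
from itertools import product

# Facet values as listed on the slide; the key names are illustrative.
FACETS = {
    "ambiguity": ["Polysemous", "General", "Specific"],
    "authority_sensitive": ["Yes", "No"],
    "spatial_sensitive": ["Yes", "No"],
    "temporal_sensitive": ["Yes", "No"],
}

def all_facet_tuples():
    """Return every (ambiguity, authority, spatial, temporal) combination."""
    return list(product(*FACETS.values()))

combos = all_facet_tuples()
print(len(combos))  # 3 * 2 * 2 * 2 = 24
```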
Temporal Sensitivity
Definition:
◦ A keyword is temporally sensitive if the results a web search engine returns for it tend to change over time.
◦ Example:
Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc.
Not temporally sensitive: video, buying car, etc.
Up-to-date Project Scope Objective: to analyze the temporal sensitivity facet of web search queries. Problem: find the temporal correlation between web queries
Web Query Histogram
Periodic queries: Champions League Final
Non-periodic queries: Liverpool
Queries Correlation
Correlation Observation: two keywords can be temporally related to each other
Proposed System Framework
1. Ask Google Trends for the query's histogram
2. Use a histogram digitizer program (Plotparser by WeiHua) to get the numerical data
3. Query correlation: calculate the correlation coefficient between queries
4. Query classification
Google Trends
Histogram Digitizer
Queries Correlation: 1st attempt
Calculate the correlation coefficient:
◦ Using 45 months of data: January 2004 until September 2007
◦ Calculate the coefficient over the entire histograms
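The slides do not name the coefficient. Assuming it is Pearson's correlation coefficient over the two digitized monthly series, a minimal sketch could look like this (the function name and the handling of flat series are illustrative choices):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a flat series as uncorrelated
    return cov / sqrt(var_x * var_y)
```

In the 1st attempt both inputs would be the full 45-value histograms of the two keywords.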
Result classification: 1st attempt
Data of 15 different popular keywords, of which:
◦ Periodic keywords: Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Chrismas(!).
◦ Related keywords: PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami
All keywords are compared with each other based on the correlation coefficient of their histograms.
(15*14)/2 = 105 instances
Result classification: 1st attempt
Classification based on threshold method:
◦ Statistical result (threshold value: 0.25):
Correlation Prediction   True Positive Rate   False Positive Rate
Yes                      88.89%               10.34%
No                       89.66%               11.11%
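The threshold method can be sketched as follows. Whether the comparison is `>=` or `>` is not stated on the slide, so `>=` is an assumption. The coefficients and labels in the usage lines are made-up illustration values, not the project's data.

```python
def classify(coef, threshold=0.25):
    """Predict 'Yes' (temporally correlated) when coef reaches the threshold."""
    return "Yes" if coef >= threshold else "No"

def rates(coefs, labels, threshold=0.25, positive="Yes"):
    """Return (true positive rate, false positive rate) for the positive class."""
    tp = fp = fn = tn = 0
    for coef, label in zip(coefs, labels):
        predicted = classify(coef, threshold)
        if label == positive:
            if predicted == positive:
                tp += 1
            else:
                fn += 1
        else:
            if predicted == positive:
                fp += 1
            else:
                tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Hypothetical coefficients and hand-assigned labels, for illustration only.
example_coefs = [0.8, 0.3, 0.1, 0.4, -0.2]
example_labels = ["Yes", "Yes", "Yes", "No", "No"]
print(rates(example_coefs, example_labels))
```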
1st attempt Problems:
Very low threshold value
◦ Only one feature used.
The entire histogram is used, while some keywords are temporally related to each other only during certain periods.
◦ Example: Valentine – Chocolate (correlation appears during February)
Queries Correlation: 2nd attempt
Interesting period:
◦ Period in which two queries are highly related to each other -> Segmentation (Clustering) problem
Clustering
Using Simple K-means
Algorithm to predict the no. of clusters
Use WEKA to cluster the histogram
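The project used WEKA for this step. The sketch below is only a plain-Python illustration of the k-means idea. It assumes clustering on the volume values alone and reads the "interesting periods" as contiguous runs of the highest-volume cluster; neither assumption is confirmed by the slides. The evenly spread initial centroids are an illustrative choice, not WEKA's initialization.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain k-means on scalar values; returns (centroids, labels)."""
    if k < 1 or k > len(values):
        raise ValueError("k must be between 1 and len(values)")
    lo, hi = min(values), max(values)
    if k == 1:
        centroids = [(lo + hi) / 2]
    else:
        # Deterministic start: centroids spread evenly between min and max.
        centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        new_labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                      for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, new_labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels

def high_volume_periods(values, k=2):
    """Return inclusive (start, end) index runs of the highest-centroid cluster."""
    centroids, labels = kmeans_1d(values, k)
    top = max(range(k), key=lambda c: centroids[c])
    periods, start = [], None
    for i, label in enumerate(labels):
        if label == top and start is None:
            start = i
        elif label != top and start is not None:
            periods.append((start, i - 1))
            start = None
    if start is not None:
        periods.append((start, len(values) - 1))
    return periods
```

Note that `k` still has to be supplied by the caller, which is the limitation reported for the 2nd attempt.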
Query Correlation: 2nd attempt
Periodic keywords detection:
◦ Identify repeated patterns using correlation
◦ A periodic query tends to have a high correlation coefficient between its repeated parts.
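One way to turn this into a number is to correlate consecutive windows of the histogram. The sketch assumes Pearson's coefficient and a 12-month period (yearly events); both are assumptions, and `periodicity_score` is a hypothetical helper name.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is flat."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def periodicity_score(series, period=12):
    """Mean correlation between consecutive full windows of length `period`."""
    windows = [series[i:i + period]
               for i in range(0, len(series) - period + 1, period)]
    if len(windows) < 2:
        raise ValueError("need at least two full periods")
    coefs = [pearson(a, b) for a, b in zip(windows, windows[1:])]
    return sum(coefs) / len(coefs)
```

A score close to 1 suggests that the same shape repeats every period; any trailing partial window is ignored.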
Interesting Periods Projection
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram
Result Classification: 2nd Attempt
Using the previous dataset
Related keywords are compared with each of the periodic keywords for correlation
Result:
◦ Managed to increase the threshold value to 0.5
2nd attempt problems
K-means clustering does not guarantee correct detection of interesting periods:
◦ K-means requires the no. of clusters as input -> the algorithm implemented to determine the no. of clusters failed to provide the correct value
Small training data set.
The threshold detector is too simple a method.
Queries Correlation: 3rd attempt
Need to find another way to identify interesting periods.
Peak period:
◦ Period in which there is a high peak in query volume
Peak detection problem:
◦ Mapping and smoothing using convolution
Clustering using peak detection Mapping:
Clustering using peak detection Smoothing using convolution:
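The slide does not state which kernel was used. As a minimal sketch, assuming a simple box (moving-average) kernel and zero padding at the ends, smoothing by convolution could be written as:

```python
def convolve_same(signal, kernel):
    """Discrete convolution with output aligned to the input ('same' length).

    Samples beyond the ends of the signal are treated as zero.
    The kernel length must be odd so that it has a centre tap.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + half - j  # convolution flips the kernel
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out

def smooth(signal, width=3):
    """Moving-average smoothing: convolve with a box kernel of `width` taps."""
    return convolve_same(signal, [1.0 / width] * width)

print(smooth([0, 0, 3, 0, 0]))  # the single spike is spread over three samples
```

A wider kernel removes more of the month-to-month noise but also flattens short peaks, so `width` is a tuning choice.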
Clustering using peak detection
Peak Detection: using a simple slope-change algorithm to determine peaks and valleys
◦ (with threshold value: mean)
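A slope-change detector can be sketched as below. Two points are assumptions rather than details from the slides: the mean threshold is applied to peaks only, and a peak period is taken to run from the valley before a peak to the valley after it.

```python
def find_peaks_and_valleys(series):
    """Return (peaks, valleys) as index lists, using sign changes of the slope.

    Only peaks whose value exceeds the series mean are kept.
    """
    mean = sum(series) / len(series)
    peaks, valleys = [], []
    for i in range(1, len(series) - 1):
        left = series[i] - series[i - 1]
        right = series[i + 1] - series[i]
        if left > 0 and right <= 0 and series[i] > mean:
            peaks.append(i)    # slope turns from rising to falling or flat
        elif left < 0 and right >= 0:
            valleys.append(i)  # slope turns from falling to rising or flat
    return peaks, valleys

def peak_periods(series):
    """Return inclusive (start, end) ranges around each peak, valley to valley."""
    peaks, valleys = find_peaks_and_valleys(series)
    bounds = [0] + valleys + [len(series) - 1]
    periods = []
    for p in peaks:
        start = max(b for b in bounds if b <= p)
        end = min(b for b in bounds if b >= p)
        periods.append((start, end))
    return periods

print(peak_periods([1, 3, 2, 1, 6, 9, 4, 2, 3, 1]))  # [(3, 7)]
```

The small bumps at indices 1 and 8 are ignored because they lie below the mean of 3.2.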
Interesting Periods Projection
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram, and vice versa
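Read as code, projection means computing the coefficient only inside the periods detected on one histogram, using the same index ranges on the other. A sketch under that reading, with Pearson's coefficient assumed and `projected_coefs` a hypothetical name:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is flat."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def projected_coefs(periods, hist_a, hist_b):
    """Correlate hist_a and hist_b inside each inclusive (start, end) period."""
    coefs = []
    for start, end in periods:
        a = hist_a[start:end + 1]
        b = hist_b[start:end + 1]
        if len(a) >= 2:  # a single sample has no defined correlation
            coefs.append(pearson(a, b))
    return coefs

# Hypothetical histograms: related only around the peak at index 3.
hist_a = [1, 1, 2, 5, 2, 1, 1, 1]
hist_b = [3, 9, 4, 10, 4, 7, 2, 8]
print(projected_coefs([(2, 4)], hist_a, hist_b))
print(pearson(hist_a, hist_b))
```

Projecting in both directions means calling the function once with the periods found on the related keyword and once with those found on the periodic keyword.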
Result Classification: 3rd attempt
Use larger training data:
◦ 47 popular keywords, of which: 15 periodic keywords and 32 related keywords
Each related keyword is compared with every periodic keyword to get the correlation coefficient (Coef).
◦ Data size: 15 * 32 = 480 instances
Result Classification: 3rd attempt
Apply Naïve Bayes Classifier (WEKA):
6 features:
◦ Average Coef from related keyword projection (AveRCoef)
◦ Average Coef from periodic keyword projection (AvePCoef)
◦ Overall Average Coef [= (AveRCoef+AvePCoef)/2]
◦ Max Coef from related keyword projection (MaxRCoef)
◦ Max Coef from periodic keyword projection (MaxPCoef)
◦ Average Max Coef [= (MaxRCoef+MaxPCoef)/2]
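Given the per-period coefficients from the two projections, the six features follow directly from the formulas on the slide. In this sketch the function name and the last two dictionary keys are illustrative, and the input lists are assumed to hold one Coef per interesting period.

```python
def pair_features(r_coefs, p_coefs):
    """Build the six features for one (related, periodic) keyword pair.

    r_coefs: Coefs from the related keyword's projection.
    p_coefs: Coefs from the periodic keyword's projection.
    """
    if not r_coefs or not p_coefs:
        raise ValueError("each projection needs at least one coefficient")
    ave_r = sum(r_coefs) / len(r_coefs)
    ave_p = sum(p_coefs) / len(p_coefs)
    max_r = max(r_coefs)
    max_p = max(p_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,
    }

# Made-up coefficients, for illustration only.
print(pair_features([0.2, 0.8], [0.4, 0.6]))
```

Each of the 480 keyword pairs would yield one such feature vector plus a Yes/No label for the classifier.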
Result Classification: 3rd attempt
Statistical Result:
Correlation Prediction   True Positive Rate   False Positive Rate   Recall   F-Measure
Yes                      89.3%                5.2%
No                       94.8%                10.7%
Confusion Matrix:
   A     B   <- classified as
  25     3   A = Yes
  16   294   B = No
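The rates can be re-derived from the confusion matrix. The sketch reads the matrix rows as (25, 3) for actual Yes and (16, 294) for actual No; that is an interpretation of the flattened digits on the slide, chosen because it reproduces the reported percentages.

```python
def rates_from_confusion(matrix):
    """matrix[i][j] = number of class-i instances classified as class j (2x2)."""
    (yes_yes, yes_no), (no_yes, no_no) = matrix
    return {
        "Yes": {"tpr": yes_yes / (yes_yes + yes_no),
                "fpr": no_yes / (no_yes + no_no)},
        "No": {"tpr": no_no / (no_yes + no_no),
               "fpr": yes_no / (yes_yes + yes_no)},
    }

result = rates_from_confusion([[25, 3], [16, 294]])
for label, r in result.items():
    print(f"{label}: TPR {r['tpr']:.1%}, FPR {r['fpr']:.1%}")
# Yes: TPR 89.3%, FPR 5.2%
# No: TPR 94.8%, FPR 10.7%
```

Recall for a class is the same quantity as its true positive rate.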
Future attempt: Query Normalization
Search volumes tend to increase as the Internet becomes more popular
Histogram for the Top 20 most popular keywords of all time:
Future attempt: Normalization
Histograms need to be normalized to remove this trend's effect!
Proposed action:
◦ Subtract the time effect
◦ Current problem: more distortion is added because of a scaling problem -> the histograms from Google have already been scaled, and we have no information about the raw data.
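How the time effect should be modelled is left open on the slide. One simple possibility, shown here only as a sketch, is to assume a linear growth trend and subtract the least-squares line from each histogram. The linear model and the function name are assumptions.

```python
def detrend_linear(series):
    """Subtract the least-squares straight line (over time index) from the series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = (n - 1) / 2
    mean_y = sum(series) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(series))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return [y - (intercept + slope * t) for t, y in enumerate(series)]

print(detrend_linear([1, 3, 5, 7]))  # a pure linear trend leaves residuals near zero
```

This does not address the scaling problem noted above: if Google has already rescaled each histogram, the fitted line is only a trend of the scaled values.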
Future attempt: From Periodic to Non-periodic
Find the correlation between two non-periodic queries.
Proposed problem: some keywords are searched heavily after other keywords
◦ Example: “tsunami” is usually searched after “earthquake” is issued.
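The slides do not propose a method for this yet. One candidate, sketched here purely as an assumption, is a lagged correlation: shift the follower query back by a few time steps and keep the shift with the highest Pearson coefficient. `best_lag` and `max_lag` are hypothetical names, and the series are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is flat."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def best_lag(leader, follower, max_lag=6):
    """Return (lag, coef) for the delay of `follower` behind `leader` that correlates best."""
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        if len(leader) - lag < 2:
            break
        # Compare leader[t] with follower[t + lag].
        coef = pearson(leader[:-lag], follower[lag:])
        if coef > best[1]:
            best = (lag, coef)
    return best

# Made-up series: the second query repeats the first one step later.
earthquake = [0, 0, 5, 1, 0, 0, 0, 4, 0, 0]
tsunami = [0, 0, 0, 5, 1, 0, 0, 0, 4, 0]
print(best_lag(earthquake, tsunami))
```

With monthly Google Trends data a lag of a few days would not be visible, so a finer time resolution would probably be needed for this kind of pair.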
Future attempt: From Periodic to Non-Periodic Tsunami Earthquake
Potential Applications
Results re-ranking:
◦ Move more up-to-date results higher in the result list
Example: when a user searches for Beyonce around the time of the Grammy -> results related to Grammy get a higher rank
Server buffering:
◦ When a user queries Beyonce, web pages related to Grammy are buffered on the local server, in the expectation that the user will eventually search for Grammy.
Questions?
The End