Implementing Query Classification HYP: End of Semester Update, prepared by Minh.


1 Implementing Query Classification HYP: End of Semester Update, prepared by Minh

2 Previously…
Web search queries:
  ◦ Understand the user's goal
Broder (2002):
  ◦ Queries are classified into 3 categories:
    ▪ Informational
    ▪ Navigational
    ▪ Transactional

3 Previously…
Functional Faceted Web Query Classification
    ▪ Ambiguity: Polysemous, General, Specific
    ▪ Authority Sensitivity: Yes - No
    ▪ Spatial Sensitivity: Yes - No
    ▪ Temporal Sensitivity: Yes - No
  ◦ Query's 4-tuple: one value for each of the four facets above
  ◦ 3 * 2 * 2 * 2 = 24 different combinations.
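
The count of 24 can be checked by enumerating the facet values. This is only an illustrative sketch; the variable names are not from the slides.

```python
from itertools import product

# Facet values as listed on the slide.
AMBIGUITY = ("Polysemous", "General", "Specific")
YES_NO = ("Yes", "No")

# One value per facet: (ambiguity, authority, spatial, temporal).
facet_tuples = list(product(AMBIGUITY, YES_NO, YES_NO, YES_NO))

print(len(facet_tuples))  # 24
```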

4 Temporal Sensitivity
Definition:
  ◦ A keyword is temporally sensitive if the results a web search engine returns for it tend to change over time.
  ◦ Example:
    ▪ Temporally sensitive: Liverpool, Beyonce, Jennifer Hawkins, etc.
    ▪ Not temporally sensitive: video, buying car, etc.

5 Up-to-date Project Scope Objective: to analyze the temporal sensitivity facet of web search queries. Problem: find the temporal correlation between web queries

6 Web Query Histogram
Periodic queries: Champions League Final [histogram]
Non-periodic queries: Liverpool [histogram]

7 Queries Correlation Correlation Observation: 2 keywords are temporally related to each other

8 Proposed System Framework
1. Ask Google Trends for the query's histogram
2. Use a histogram digitizer program (Plotparser by WeiHua) to get the numerical data
3. Query correlation: calculate the correlation coefficient between queries
4. Query classification

9 Google Trends

10 Histogram Digitizer

11 Queries Correlation: 1st attempt
Calculate the correlation coefficient:
  ◦ Using 45 months of data: January 2004 until September 2007
  ◦ Calculate the coefficient based on the entire histograms
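
The slides do not say which correlation coefficient was used. A minimal sketch, assuming the Pearson coefficient over two equally long monthly series; the sample volumes are made up.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # flat series: treated as uncorrelated (a convention chosen here)
    return cov / sqrt(var_x * var_y)

# Made-up monthly volumes, for illustration only.
a = [1, 2, 8, 2, 1, 1]
b = [2, 3, 9, 3, 2, 2]   # same shape as a, shifted up by one
c = [8, 2, 1, 1, 2, 8]   # peaks where a is low

print(pearson(a, b), pearson(a, c))
```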

12 Result classification: 1st attempt
Data of 15 different popular keywords, of which:
  ◦ Periodic keywords:
    ▪ Champions League Final, Grammy, Pro Evolution Soccer, Oscar Winner, Valentine, Chrismas(!).
  ◦ Related keywords:
    ▪ PS2, Xbox, Jack Nicholson, Beyonce, chocolate, chocolateNews, Liverpool, EA Sport, Konami
All keywords are compared to each other based on the correlation coefficient of their histograms.
(15*14)/2 = 105 instances
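
The 105 instances are the unordered keyword pairs. A short sketch that enumerates them, with the keyword strings copied from the slide:

```python
from itertools import combinations

# Keywords exactly as listed on the slide.
periodic = ["Champions League Final", "Grammy", "Pro Evolution Soccer",
            "Oscar Winner", "Valentine", "Chrismas"]
related = ["PS2", "Xbox", "Jack Nicholson", "Beyonce", "chocolate",
           "chocolateNews", "Liverpool", "EA Sport", "Konami"]

keywords = periodic + related
pairs = list(combinations(keywords, 2))  # unordered pairs, no self-pairs

print(len(keywords), len(pairs))  # 15 105
```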

13 Result classification: 1st attempt
Classification based on threshold method:
  ◦ Statistical result:
    ▪ Threshold value: 0.25
Correlation Prediction | True Positive Rate | False Positive Rate
Yes                    | 88.89%             | 10.34%
No                     | 89.66%             | 11.11%
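
A sketch of the threshold method and its evaluation, assuming a pair is predicted as correlated when its coefficient is at least the threshold. The slides do not state whether the comparison is strict, and the coefficients and labels below are made up.

```python
def rates(coefs, truly_related, threshold=0.25):
    """True/false positive rate of the rule 'coef >= threshold means related'."""
    tp = fp = tn = fn = 0
    for coef, related in zip(coefs, truly_related):
        predicted = coef >= threshold  # assumption: non-strict comparison
        if predicted and related:
            tp += 1
        elif predicted:
            fp += 1
        elif related:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr

# Made-up coefficients and hand labels, for illustration only.
coefs = [0.9, 0.4, 0.1, 0.3, 0.05, -0.2]
labels = [True, True, True, False, False, False]
tpr, fpr = rates(coefs, labels)
print(tpr, fpr)
```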

14 1st attempt
Problems:
Very low threshold value
  ◦ Only one feature is used.
The entire histogram is used, while some keywords are temporally related to each other only during certain periods.
  ◦ Example: Valentine – Chocolate (correlation appears during February)

15 Queries Correlation: 2nd attempt
Interesting period:
  ◦ A period in which two queries are highly related to each other -> segmentation (clustering) problem

16 Clustering
Use the Simple K-means algorithm, with a predicted number of clusters
Use WEKA to cluster the histogram
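
For orientation, a plain k-means on one-dimensional volume values is sketched below. It is not WEKA's SimpleKMeans: the evenly spaced starting centroids, the choice to cluster on volume alone, and the sample data are all assumptions made for this illustration.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's k-means on scalar values; returns (centroids, labels)."""
    lo, hi = min(values), max(values)
    # Deterministic start: centroids spread evenly over the value range.
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every value.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(sum(members) / len(members)
                                 if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up monthly volumes with two bursts of interest.
volumes = [2, 3, 2, 20, 22, 3, 2, 2, 21, 19, 3, 2]
centroids, labels = kmeans_1d(volumes, k=2)
print(centroids, labels)
```

Note that `k` has to be supplied by the caller, which is the limitation the deck reports for the 2nd attempt.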

17 Query Correlation: 2nd attempt
Periodic keyword detection:
  ◦ Identify the repeated pattern using correlation
  ◦ A periodic query tends to have a high correlation coefficient between its repeated parts.

18 Interesting Periods Projection
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram

19 Result Classification: 2nd attempt
Using the previous dataset
Related keywords are compared with each of the periodic keywords for correlation
Result:
  ◦ Managed to increase the threshold value to 0.5

20 2nd attempt problems
K-means clustering does not guarantee correct detection of interesting periods:
  ◦ K-means requires the number of clusters to be provided
    ▪ -> the algorithm implemented to determine the number of clusters failed to provide the correct value
Small training data set.
The threshold detector is too simple a method.

21 Queries Correlation: 3rd attempt
Need to find another way to identify the interesting period.
Peak period:
  ◦ A period in which there is a high peak in query volume
Peak detection problem:
  ◦ Mapping and smoothing using convolution

22 Clustering using peak detection Mapping:

23 Clustering using peak detection Smoothing using convolution:
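
The kernel used for smoothing is not given in the transcript. A minimal sketch, assuming a 3-point moving-average kernel and zero padding at the ends:

```python
def convolve_same(signal, kernel):
    """Discrete convolution whose output is aligned with `signal` and has
    the same length. Samples beyond the ends are treated as zero.
    Intended for odd-length kernels."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + half - j  # convolution reads the signal in reverse
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out

# Made-up spiky series and an assumed moving-average kernel.
noisy = [0, 0, 3, 0, 0, 6, 0, 0]
smooth = convolve_same(noisy, [1 / 3, 1 / 3, 1 / 3])
print(smooth)
```

Each isolated spike is spread over its two neighbours, so a single noisy month no longer looks like a sharp peak.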

24 Clustering using peak detection
Peak detection: using a simple slope-change algorithm to determine peaks and valleys
  ◦ (with threshold value: mean)
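
A sketch of a slope-change detector. How the mean threshold was applied is not stated, so this version assumes it filters peaks only; flat plateaus are ignored for simplicity, and the sample data is made up.

```python
def peaks_and_valleys(series):
    """Indices where the slope changes sign.
    A peak must also lie above the series mean (assumed use of the threshold)."""
    mean = sum(series) / len(series)
    peaks, valleys = [], []
    for i in range(1, len(series) - 1):
        before = series[i] - series[i - 1]   # slope coming in
        after = series[i + 1] - series[i]    # slope going out
        if before > 0 and after < 0 and series[i] > mean:
            peaks.append(i)
        elif before < 0 and after > 0:
            valleys.append(i)
    return peaks, valleys

# Made-up smoothed volumes: two real peaks and one small bump below the mean.
smoothed = [1, 2, 6, 2, 1, 2, 1, 1, 7, 3]
peaks, valleys = peaks_and_valleys(smoothed)
print(peaks, valleys)  # [2, 8] [4]
```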

25 Interesting Periods Projection
Interesting periods from the related keyword's histogram are projected onto the periodic keyword's histogram, and vice versa

26 Result Classification: 3rd attempt
Use a larger training data set:
  ◦ 47 popular keywords, of which:
    ▪ 15 periodic keywords and 32 related keywords
    ▪ Each related keyword is compared with every periodic keyword to get a correlation coefficient (Coef).
  ◦ Data size: 15 * 32 = 480 instances

27 Result Classification: 3rd attempt
Apply Naïve Bayes Classifier (WEKA):
    ▪ 6 features:
    ▪ Average Coef from related keyword projection (AveRCoef)
    ▪ Average Coef from periodic keyword projection (AvePCoef)
    ▪ Overall Average Coef [= (AveRCoef+AvePCoef)/2]
    ▪ Max Coef from related keyword projection (MaxRCoef)
    ▪ Max Coef from periodic keyword projection (MaxPCoef)
    ▪ Average Max Coef [= (MaxRCoef+MaxPCoef)/2]
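
The six features follow directly from the two lists of per-period coefficients. In the sketch below, the keys `OverallAveCoef` and `AveMaxCoef` are hypothetical names for the two derived features, which the slide leaves unnamed; the input values are made up.

```python
def build_features(r_coefs, p_coefs):
    """r_coefs: coefficients from the related keyword's projection.
    p_coefs: coefficients from the periodic keyword's projection."""
    ave_r = sum(r_coefs) / len(r_coefs)
    ave_p = sum(p_coefs) / len(p_coefs)
    max_r = max(r_coefs)
    max_p = max(p_coefs)
    return {
        "AveRCoef": ave_r,
        "AvePCoef": ave_p,
        "OverallAveCoef": (ave_r + ave_p) / 2,  # hypothetical key name
        "MaxRCoef": max_r,
        "MaxPCoef": max_p,
        "AveMaxCoef": (max_r + max_p) / 2,      # hypothetical key name
    }

# Made-up coefficients for one (related, periodic) keyword pair.
features = build_features([0.8, 0.6], [0.4, 0.2])
print(features)
```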

28 Result Classification: 3rd attempt
Statistical Result:
Correlation Prediction | True Positive Rate | False Positive Rate | Recall | F-Measure
Yes                    | 89.3%              | 5.2%                | 0.893  | 0.725
No                     | 94.8%              | 10.7%               | 0.948  | 0.969
Confusion Matrix:
  A |   B | <- classified as
 25 |   3 | A = Yes
 16 | 294 | B = No
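
The reported rates can be recomputed from the confusion matrix, reading rows as the actual class and columns as the predicted class (WEKA's usual layout). The function and variable names are illustrative.

```python
def class_metrics(tp, fn, fp, tn):
    """Per-class metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)            # true positive rate, same as recall
    fpr = fp / (fp + tn)            # false positive rate
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tpr / (precision + tpr)
    return {"tpr": tpr, "fpr": fpr,
            "precision": precision, "f_measure": f_measure}

# Counts from the slide: rows = actual class, columns = predicted class.
yes_as_yes, yes_as_no = 25, 3
no_as_yes, no_as_no = 16, 294

yes = class_metrics(tp=yes_as_yes, fn=yes_as_no, fp=no_as_yes, tn=no_as_no)
no = class_metrics(tp=no_as_no, fn=no_as_yes, fp=yes_as_no, tn=yes_as_yes)

for name, m in (("Yes", yes), ("No", no)):
    print(name, round(m["tpr"], 3), round(m["fpr"], 3), round(m["f_measure"], 3))
```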

29 Future attempt: Query Normalization
Search volumes tend to increase as the Internet becomes more popular
Histogram for the top 20 most popular keywords of all time:

30 Future attempt: Normalization
Histograms need to be normalized to remove the effect of this trend!
Proposed action:
  ◦ Subtract the time effect
  ◦ Current problem: subtracting adds further distortion because of scaling.
    ▪ -> histograms from Google have already been scaled. We have no information about the raw data.
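
The slides do not define how the time effect would be subtracted. One simple possibility, assumed here, is to fit a least-squares straight line to the series and keep the residuals. As the slide notes, this cannot undo the unknown scaling Google applies.

```python
def detrend(series):
    """Subtract a least-squares straight line from the series (needs >= 2 points)."""
    n = len(series)
    mean_x = (n - 1) / 2
    mean_y = sum(series) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(series))
             / sum((x - mean_x) ** 2 for x in range(n)))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in enumerate(series)]

# Made-up series: steady growth, plus one month with extra interest.
growth = [10 + 2 * t for t in range(6)]
bumped = list(growth)
bumped[3] += 6

flat = detrend(growth)
residual = detrend(bumped)
print(flat, residual)
```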

31 Future attempt: From Periodic to Non-periodic
Find the correlation between two non-periodic queries.
Proposed problem: some keywords are searched heavily after other keywords
  ◦ Example: "tsunami" is usually searched after "earthquake" is issued.

32 Future attempt: From Periodic to Non-Periodic Tsunami Earthquake

33 Potential Applications
Results re-ranking:
  ◦ Move more up-to-date results higher in the result list
    ▪ Example: when a user asks for Beyonce around the time of the Grammy -> results related to Grammy get a higher rank
Server buffering:
  ◦ When a user queries Beyonce, web pages related to Grammy are buffered on the local server, in the expectation that the user will search for Grammy next.

34 Questions?

35 The End

