Improving Classification Accuracy Using Automatically Extracted Training Data
Ariel Fuxman, A. Kannan, A. Goldberg, R. Agrawal, P. Tsaparas, J. Shafer
Search Labs, Microsoft Research – Silicon Valley, Mountain View, CA
Web as a Source of Training Data
For classification tasks, large amounts of training data can significantly improve accuracy
How do we create large training sets?
– Conventional methods that rely on human labelers are expensive and do not scale
Thesis: the Web can be used to automatically create labeled data
In this talk
– Validate the thesis on a task of practical importance: retail intent identification in Web search
– Present desirable properties of sources of labeled data
– Show how to extract labeled data from the sources
Importance of Retail Intent Queries
[Chart: Share of Searches (% of total search queries) vs. Share of Paid Clicks (% of queries leading to paid clicks). Source: "Just Behave: A Look at Searcher Behavior – Total U.S. Market", comScore, Feb 2009]
Application of Retail Intent
Provide an enhanced user experience around Commerce Search
Retail intent identification
Definition: a query posed to a search engine has retail intent if most users who type the query have the intent to buy a tangible product
Examples:

Queries with retail intent    Queries without retail intent
Zune 80 gb                    Medical insurance
Buy ipod                      Free ringtones
Digital camera lenses         Digital camera history
Data Sources for Retail Intent
Sources
– Web sites of retailers (e.g., Amazon, Walmart, Buy.com)
Training data
– Queries typed directly into the search box of retailers
– Extracted from toolbar logs
[Figure: URL in toolbar log]
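The extraction step described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the retailer domains and query-parameter names in the mapping are assumptions for the example, since the slide does not specify the URL formats seen in the toolbar logs.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping from retailer domain to its search-query URL parameter.
# The real toolbar-log extraction would use the actual URL patterns of each site.
QUERY_PARAMS = {
    "www.amazon.com": "k",   # e.g. https://www.amazon.com/s?k=zune+80+gb
    "www.walmart.com": "q",
}

def extract_query(url):
    """Return the query typed into a retailer's search box, or None if the
    URL is not a recognized retailer search URL."""
    parts = urlparse(url)
    param = QUERY_PARAMS.get(parts.netloc)
    if param is None:
        return None
    values = parse_qs(parts.query).get(param)
    return values[0] if values else None

print(extract_query("https://www.amazon.com/s?k=zune+80+gb"))   # zune 80 gb
print(extract_query("https://en.wikipedia.org/wiki/Zune"))      # None
```

Each extracted query inherits a label from the site it was typed on (e.g., positive for retailer sites), which is how the logs yield labeled training data without human annotation.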
Desirable Properties of Web Data Sources
Popularity
– Sources should yield large amounts of data
Orthogonality
– Sources should provide training data about different regions of the training space
Separation
– Sources should provide either positive or negative examples of the target class, but not both
Popularity
Sources should yield large amounts of data
For retail intent identification:
– Web site traffic is a proxy for popularity
– More traffic means more queries
– Choose Web sites of retailers based on a publicly available traffic report (Hitwise)
Orthogonality
Sources should provide training data about different regions of the training space
For retail intent identification:
– Positive examples: top sites from "Departmental Stores" and "Classified Ads" (Amazon and Craigslist)
– Negative examples: top site from "Reference" (Wikipedia)
Separation
Training examples must unambiguously reflect the intended meaning of most users
– Example: there is a book called "World War I", but the intent of that query is mostly non-commercial
Separation can be enforced by removing groups of confusable queries from the sources
Method to Enforce Separation
– Create "groups" of positive queries
– Compare the word frequency distribution of each group against the negative class using Jensen-Shannon divergence
– Remove groups with low divergence (i.e., groups whose word distribution resembles the negative class and is therefore confusable)
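The divergence computation in the steps above can be sketched in a few lines. This is an illustrative implementation of Jensen-Shannon divergence over word frequency distributions, not the paper's exact code; the `word_dist` helper and the choice of base-2 logarithm (which bounds JS divergence in [0, 1]) are assumptions for the example.

```python
import math
from collections import Counter

def word_dist(queries):
    """Word frequency distribution over a set of queries (hypothetical helper)."""
    counts = Counter(w for q in queries for w in q.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl(p, m):
    """Kullback-Leibler divergence KL(P || M), base 2."""
    return sum(pv * math.log2(pv / m[w]) for w, pv in p.items() if pv > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q from their mixture M."""
    vocab = set(p) | set(q)
    m = {w: (p.get(w, 0.0) + q.get(w, 0.0)) / 2 for w in vocab}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# A group whose distribution matches the negative class gets divergence ~0
# and would be removed; a disjoint vocabulary gives the maximum value 1.
group = word_dist(["world war i", "world war i book"])
negatives = word_dist(["world war i", "world war i history"])
print(js_divergence(group, negatives))
```

Because the mixture M assigns positive mass to every word that appears in either distribution, the KL terms are always finite, which makes JS divergence well suited to comparing sparse query vocabularies.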
Groups for Retail Intent
Extracting groups from the toolbar log
[Figure: URL in toolbar log]
Enforcing the separation property
– JS divergence of Amazon and Craigslist with respect to Wikipedia
– See the paper for experimental validation
Experiments
Setup
– Built multiple classifiers using manual and automatically extracted labels in the training sets
– Classification method: logistic regression, using unigrams and bigrams as features
– Test set: 5K queries randomly sampled from a query log and labeled using Mechanical Turk
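The classification setup (logistic regression over unigram and bigram features) can be sketched as follows. This is a minimal self-contained version trained by plain gradient descent on toy queries taken from the examples slide; the paper's actual implementation, hyperparameters, and training data are not specified here.

```python
import math
from collections import defaultdict

def featurize(query):
    """Unigram and bigram count features of a query, as in the paper's setup."""
    words = query.lower().split()
    feats = defaultdict(float)
    for w in words:
        feats[w] += 1.0
    for a, b in zip(words, words[1:]):
        feats[(a, b)] += 1.0           # bigram features as word pairs
    return feats

def train(data, epochs=200, lr=0.5):
    """Logistic regression via stochastic gradient descent (illustrative sketch)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for query, label in data:
            feats = featurize(query)
            z = sum(w[f] * v for f, v in feats.items())
            p = 1.0 / (1.0 + math.exp(-z))
            for f, v in feats.items():
                w[f] += lr * (label - p) * v
    return w

def predict(w, query):
    """True if the query is classified as having retail intent."""
    z = sum(w.get(f, 0.0) * v for f, v in featurize(query).items())
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5

# Toy training data from the examples slide; 1 = retail intent.
train_data = [("zune 80 gb", 1), ("buy ipod", 1), ("digital camera lenses", 1),
              ("medical insurance", 0), ("free ringtones", 0),
              ("digital camera history", 0)]
weights = train(train_data)
print(predict(weights, "buy zune"))
```

Note how the bigram features let the model separate "digital camera lenses" from "digital camera history" even though the unigrams "digital" and "camera" appear in both classes.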
Automatic vs. Manual
Accuracy of the classifier trained on extracted labels is on par with the classifier trained on manual labels
Combining Manual and Automatically Extracted Labels
Results are only marginally different from using automatically extracted labels alone
Using Unlabeled Data
Performance of the automatic-labels classifier remains on par with classifiers that start from manual labels and exploit unlabeled data via self-training
Conclusions
– By carefully choosing the data sources, we can extract valuable training data
– Using large amounts of automatically extracted training data, we can build classifiers that are on par with those trained with manual labels
– As future work, we would like to apply this experience to other classification tasks