Presentation is loading. Please wait.

Presentation is loading. Please wait.

Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007.

Similar presentations


Presentation on theme: "Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007."— Presentation transcript:

1 Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007

2 Motivation  Name entities  essential during the construction of knowledge bases from Web  helpful in various NLP tasks; like parsing, coreference resolution …  constitute a significant part of the Web search queries  helpful in building verticals in Web search

3 Previous works  Mining query logs  to improve various IR tasks re-ranking of retrieved documents query expansion spelling correction  Large-scale IE  mainly on document collections  ignoring the collective knowledge embedded in noisy search queries  This is the first work that applies name entity finding to Web search query logs

4 Extraction from Query Logs  Given  A set of target classes  A set of seed instances  The goal  To extract relevant class instances from query logs  Without using Any domain knowledge Any handcrafted extraction pattern

5 Overview of the System

6 Step 1: identification of query templates that match the seed instances

7 Step 2: identification of candidate instances

8 Step 3: internal representation of candidate instances  query: prefix candidate_instance postfix  entry: prefix postfix  weight of an entry = frequency of the query

9 Step 4: internal representation of seed instances  introducing weak supervision in the extraction process  the vectors associated with the seed instance are merged into a reference search-signature vector  a loose search fingerprint of the desired output type with respect to the class

10 Step 5: instance ranking  ranking based on the similarity score (computed with Jensen-Shannon) between each candidate vector and class vector

11 Reference search-signature vectors  A series of queries that can be asked about instances of a class  Given a set of candidate phrases, the system guess which candidate phrases are more likely to belong to the target class by looking at the queries

12 Experimental setting - 1  Target Classes  10 classes with 5 seed instances for each class City Country Drug Food Location Movie Newspaper Person University VideoGame

13 Experimental setting - 2  Data  A random sample of 50 million unique fully-anonymized queries submitted to Google  Evaluation Procedure  Top 250 candidates of each class are manually assigned a correctness label 1 : correct 0 : incorrect  Precision at rank N has been calculated for several N values

14 Quality of Extracted Instances

15 Does the popularity of seed instances in query logs correlated with precision ? -0.17 -0.19 more queries with seed instances more accurate scoring better internal representation ?

16 Comparing the usefulness of query logs vs. Web documents in NE finding  M. Pasca. Acquisition of categorized named entities for Web search. CIKM 2004  Target classes are incrementally acquired from Web documents along with their respective instances by using hand crafted extraction patterns (D-patt) Class [such as|including] Instance  Manual one-to-one mapping of chosen target classes with acquired classes

17 Comparing the usefulness of query logs vs. Web documents in NE finding  Instances extracted from Web documents are also manually evaluated as correct and incorrect  Except City, Newspaper and Country classes, seed based extraction from queries outperformed D-patt in every other class

18 Conclusion  Search queries, which are thought as noisy, keyword based approximations of underspecified user information needs, proved to be useful in name entity discoveries even with a small set of seed instances  with absolute precision (or precision improvement relative to web based hand crafted system) 0.96 (29%) for prec@50 0.90 (26%) for prec@150 0.80 (15%) for prec@250

19 Questions ?


Download ppt "Reyyan Yeniterzi Weakly-Supervised Discovery of Named Entities Using Web Search Queries Marius Pasca Google CIKM 2007."

Similar presentations


Ads by Google