KnowItAll and TextRunner
Key Ideas: So Far
- High-precision, low-coverage extractors plus large redundant corpora (macro-reading)
  - Hearst patterns ("cities such as Pittsburgh, Cleveland, and …")
  - Regular structure in tables, etc. (Brin, …)
- Semi-supervised learning
  - Self-training/bootstrapping or co-training
  - Other semi-supervised methods:
    - Expectation-maximization
    - Transductive margin-based methods (e.g., transductive SVM, logistic regression with entropic regularization, …)
    - Graph-based methods
      - Label propagation
      - Label propagation via random walk with reset (see the sketch below)
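A minimal numpy sketch of the last item, label propagation via random walk with reset; the value of alpha, the row normalization, and the iteration count are conventional choices, not from the slides:

    import numpy as np

    def label_prop_rwr(W, Y, alpha=0.85, iters=50):
        # W: (n, n) nonnegative affinity matrix over nodes (e.g. candidate
        # extractions); Y: (n, k) one-hot seed labels, zero rows if unlabeled.
        S = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
        F = Y.astype(float).copy()
        for _ in range(iters):
            # One random-walk step with probability alpha; reset to the seed
            # labels with probability (1 - alpha).
            F = alpha * (S @ F) + (1 - alpha) * Y
        return F  # soft label scores; argmax over columns gives predictions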
Bootstrapping
[Timeline figure, built up over several slides: a map of bootstrapping research, grouped by emphasis.]
- Clustering by distributional similarity: Lin & Pantel '02
- Deeper linguistic features, free text: Hearst '92; Riloff & Jones '99 (Hearst-like patterns, Brin-like bootstrapping, plus "meta-level" bootstrapping, on MUC data); Collins & Singer '99 (boosting-based co-training method using content & context features, with context based on Collins' parser, learning to classify three types of named entities)
- Learning, semi-supervised learning, dual feature spaces: Blum & Mitchell '98; Cucerzan & Yarowsky '99 (EM-like co-training method with context & content both defined by character-level tries)
- Scalability, surface patterns, use of web crawlers: Brin '98; Etzioni et al. 2005; TextRunner; ReadTheWeb
Today’s paper: the KnowItAll system
Architecture
- Input: a set of [disjoint?] predicates to consider, plus two names for each (a possible input format is sketched below)
- Context: keywords from the user, used to filter out non-domain pages
- … ? ~= [H92]
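A minimal sketch of what such an input specification might look like; the class and field names are hypothetical, since the paper does not give a concrete format:

    from dataclasses import dataclass, field

    @dataclass
    class PredicateSpec:
        names: tuple          # two surface names for the class, e.g. ("city", "town")
        context_keywords: list = field(default_factory=list)  # filter non-domain pages

    specs = [PredicateSpec(("city", "town"), ["travel"]),
             PredicateSpec(("film", "movie"), ["cinema"])]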
Architecture
Bootstrapping - 1
[Figure: a generic extraction template is instantiated into a rule for the class "city", which in turn generates a search-engine query; see the sketch below.]
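A hedged sketch of this instantiation step: a Hearst-style template is bound to the class name to yield the literal query string. The template strings follow [H92]; the helper function is hypothetical:

    # Hearst-style templates; {NP} is the (plural) class name, {NPList} is the
    # slot from which the instantiated rule extracts candidate instances.
    TEMPLATES = ["{NP} such as {NPList}",
                 "such {NP} as {NPList}",
                 "{NPList} and other {NP}"]

    def to_query(template, plural_class_name):
        # Drop the extraction slot to get the search-engine query,
        # e.g. to_query("{NP} such as {NPList}", "cities") -> "cities such as".
        return template.replace("{NP}", plural_class_name).replace("{NPList}", "").strip()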
Bootstrapping - 2
Each discriminator U is a function:
    f_U(x) = hits("city x") / hits("x")
i.e., f_U("Pittsburgh") = hits("city Pittsburgh") / hits("Pittsburgh")
These are then used to create features: f_U(x) > θ and f_U(x) < θ
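A minimal sketch of a discriminator and its thresholded features; hits() is a hypothetical wrapper around a search-engine hit-count API, passed in as a parameter:

    def f_U(x, U, hits):
        # PMI-style score: hit count for the discriminator phrase with x,
        # normalized by the hit count for x alone, e.g.
        # f_U("Pittsburgh", "city", hits) ~ hits("city Pittsburgh") / hits("Pittsburgh").
        return hits(f'"{U} {x}"') / max(hits(f'"{x}"'), 1)

    def threshold_features(x, discriminators, theta, hits):
        # One boolean feature per discriminator U: is f_U(x) above theta?
        return [f_U(x, U, hits) > theta for U in discriminators]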
Bootstrapping - 3
1. Submit the queries & apply the rules to produce initial seeds.
2. Evaluate each seed with each discriminator U: e.g., compute PMI stats like |hits("city Boston")| / |hits("Boston")|.
3. Take the top seeds from each class and call them POSITIVE, then use the disjointness of the classes to find NEGATIVE seeds.
4. Train a Naive Bayes classifier using the thresholded U's as features (see the sketch below).
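A sketch of the training step, assuming binary features from the thresholded discriminators; this is a hand-rolled Bernoulli Naive Bayes for illustration, not the authors' implementation:

    import math
    from collections import defaultdict

    def train_naive_bayes(examples):
        # examples: list of (features, label) pairs, where features is a list
        # of booleans from threshold_features() and label is "POSITIVE" or
        # "NEGATIVE" from the seed-selection step above.
        counts = defaultdict(lambda: [1, 1])   # Laplace-smoothed [off, on] counts
        totals = defaultdict(int)
        for feats, label in examples:
            totals[label] += 1
            for j, on in enumerate(feats):
                counts[(label, j)][int(on)] += 1
        n = sum(totals.values())

        def log_posterior(feats, label):
            # Unnormalized log P(label | feats) under the Naive Bayes assumption.
            lp = math.log(totals[label] / n)
            for j, on in enumerate(feats):
                off_c, on_c = counts[(label, j)]
                lp += math.log((on_c if on else off_c) / (off_c + on_c))
            return lp

        return log_posterior

To classify a new extraction, compare log_posterior(feats, "POSITIVE") against log_posterior(feats, "NEGATIVE").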
Bootstrapping - 4
- Estimate the correctness of new extractions using the classifier built from the previously trained discriminators.
- Some ad hoc stopping conditions… (a "signal to noise" ratio)
Architecture - 2
Extensions to KnowItAll
Problem: Unsupervised learning finds clusters, but what if the text doesn't support the clustering we want?
- E.g., the target is "scientist", but the natural clusters are "biologist", "physicist", "chemist"
Solution: subclass extraction
- Modify the template/rule system to extract subclasses of the target class (e.g., scientist → chemist, biologist, …)
- Check extracted subclasses with WordNet (sketched below) and/or a PMI-like method (as for instances)
- Extract from each subclass recursively
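A minimal sketch of the WordNet check, using NLTK's WordNet interface; this assumes nltk and its wordnet corpus are installed, and the PMI-like check would reuse the discriminators above:

    from nltk.corpus import wordnet as wn  # requires the wordnet corpus

    def is_wordnet_subclass(candidate, target):
        # True if some noun sense of `candidate` has some noun sense of
        # `target` among its hypernym ancestors,
        # e.g. is_wordnet_subclass("chemist", "scientist") -> True.
        targets = set(wn.synsets(target, pos=wn.NOUN))
        for syn in wn.synsets(candidate, pos=wn.NOUN):
            # closure() walks the hypernym hierarchy upward from syn.
            ancestors = set(syn.closure(lambda s: s.hypernyms()))
            if ancestors & targets:
                return True
        return False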
Extensions to KnowItAll
Problem: The set of rules is limited, since it is derived from a fixed set of "templates" (general patterns, roughly from [H92])
Solution 1: Pattern learning — augment the initial set of rules derivable from templates:
- Search for instances I on the web
- Generate patterns: some substring of I in context, "b1 … b4 I a1 … a4" (up to four tokens before and after I)
- Assume classes are disjoint and estimate the recall/precision of each pattern P
- Exclude patterns that cover only one seed (very low recall)
- Take the top 200 remaining patterns and
  - evaluate them as extractors "using PMI" (?)
  - evaluate them as discriminators (in the usual way?)
- Examples: "headquartered in <city>", "<city> hotels", …
(A sketch of the window-generation step follows this list.)
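A hedged sketch of pattern generation: for each seed occurrence, emit context windows of up to four tokens on each side, then keep patterns that cover more than one seed. The corpus representation and tokenization are placeholders:

    from collections import defaultdict

    def generate_patterns(sentences, seeds, max_window=4):
        # sentences: lists of tokens; seeds: known instances of the class.
        # Returns {pattern: set of seeds it covers}, where a pattern is a
        # (before-context, after-context) pair around the instance slot.
        coverage = defaultdict(set)
        seed_set = {s.lower() for s in seeds}
        for toks in sentences:
            for i, tok in enumerate(toks):
                if tok.lower() not in seed_set:
                    continue
                for b in range(1, max_window + 1):
                    for a in range(1, max_window + 1):
                        if i - b < 0 or i + a >= len(toks):
                            continue
                        pat = (tuple(toks[i - b:i]), tuple(toks[i + 1:i + 1 + a]))
                        coverage[pat].add(tok.lower())
        # Patterns covering only one seed have very low recall; drop them.
        return {p: s for p, s in coverage.items() if len(s) > 1}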
Extensions to KnowItAll
Solution 2: List extraction — augment the initial set of rules with rules that are local to a specific web page:
- Search for pages containing small sets of instances (e.g., "London Paris Rome Pittsburgh")
- For each page P:
  - Find subtrees T of the DOM tree that contain >k seeds
  - Find the longest common prefix/suffix of the seeds in T [some heuristics added to generalize this further]
  - Find all other strings inside T with the same prefix/suffix
- Heuristically select the "best" wrapper for a page
- Wrapper = (P, T, prefix, suffix)
(See the sketch after this list.)
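A minimal sketch of the prefix/suffix step, assuming the strings inside a DOM subtree T have already been collected; the DOM parsing and the extra generalization heuristics are omitted:

    import os

    def common_prefix(strings):
        return os.path.commonprefix(strings)

    def common_suffix(strings):
        # Longest common suffix = reversed common prefix of reversed strings.
        return os.path.commonprefix([s[::-1] for s in strings])[::-1]

    def induce_wrapper(subtree_strings, seeds):
        # Find the longest common prefix/suffix of the seed-bearing strings
        # in T, then extract every other string in T matching them.
        seed_strs = [s for s in subtree_strings if any(seed in s for seed in seeds)]
        if len(seed_strs) < 2:
            return []
        pre, suf = common_prefix(seed_strs), common_suffix(seed_strs)
        return [s[len(pre):len(s) - len(suf)]
                for s in subtree_strings
                if s.startswith(pre) and s.endswith(suf)
                and len(s) > len(pre) + len(suf)]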
Results - City
Results - Film
Results - Scientist