KnowItAll and TextRunner
Key Ideas: So Far
- High-precision, low-coverage extractors plus large redundant corpora (macro-reading)
  - Hearst patterns ("cities such as Pittsburgh, Cleveland, and …")
  - Regular structure in tables, etc. (Brin, …)
- Semi-supervised learning
  - Self-training/bootstrapping or co-training
  - Other semi-supervised methods:
    - Expectation-maximization
    - Transductive margin-based methods (e.g., transductive SVM, logistic regression with entropic regularization, …)
    - Graph-based methods
      - Label propagation
      - Label propagation via random walk with reset (see the sketch below)
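A minimal numpy sketch of the last item, label propagation via random walk with reset; the value of alpha, the row normalization, and the iteration count are conventional choices, not from the slides:

    import numpy as np

    def label_prop_rwr(W, Y, alpha=0.85, iters=50):
        # W: (n, n) nonnegative affinity matrix over nodes (e.g. candidate
        # extractions); Y: (n, k) one-hot seed labels, zero rows if unlabeled.
        S = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-stochastic
        F = Y.astype(float).copy()
        for _ in range(iters):
            # One random-walk step with probability alpha; reset to the seed
            # labels with probability (1 - alpha).
            F = alpha * (S @ F) + (1 - alpha) * Y
        return F  # soft label scores; argmax over columns gives predictions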
Bootstrapping
[Timeline figure, built up over several slides: a map of bootstrapping research, grouped by emphasis.]
- Clustering by distributional similarity: Lin & Pantel '02
- Deeper linguistic features, free text: Hearst '92; Riloff & Jones '99 (Hearst-like patterns, Brin-like bootstrapping, plus "meta-level" bootstrapping, on MUC data); Collins & Singer '99 (boosting-based co-training method using content & context features, with context based on Collins' parser, learning to classify three types of named entities)
- Learning, semi-supervised learning, dual feature spaces: Blum & Mitchell '98; Cucerzan & Yarowsky '99 (EM-like co-training method with context & content both defined by character-level tries)
- Scalability, surface patterns, use of web crawlers: Brin '98; Etzioni et al. 2005; TextRunner; ReadTheWeb
Today’s paper: the KnowItAll system
Architecture
- Input: a set of [disjoint?] predicates to consider, plus two names for each (a possible input format is sketched below)
- Context: keywords from the user, used to filter out non-domain pages
- … ? ~= [H92]
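A minimal sketch of what such an input specification might look like; the class and field names are hypothetical, since the paper does not give a concrete format:

    from dataclasses import dataclass, field

    @dataclass
    class PredicateSpec:
        names: tuple          # two surface names for the class, e.g. ("city", "town")
        context_keywords: list = field(default_factory=list)  # filter non-domain pages

    specs = [PredicateSpec(("city", "town"), ["travel"]),
             PredicateSpec(("film", "movie"), ["cinema"])]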
Architecture
Bootstrapping - 1
[Figure: a generic extraction template is instantiated into a rule for the class "city", which in turn generates a search-engine query; see the sketch below.]
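A hedged sketch of this instantiation step: a Hearst-style template is bound to the class name to yield the literal query string. The template strings follow [H92]; the helper function is hypothetical:

    # Hearst-style templates; {NP} is the (plural) class name, {NPList} is the
    # slot from which the instantiated rule extracts candidate instances.
    TEMPLATES = ["{NP} such as {NPList}",
                 "such {NP} as {NPList}",
                 "{NPList} and other {NP}"]

    def to_query(template, plural_class_name):
        # Drop the extraction slot to get the search-engine query,
        # e.g. to_query("{NP} such as {NPList}", "cities") -> "cities such as".
        return template.replace("{NP}", plural_class_name).replace("{NPList}", "").strip()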
Bootstrapping - 2
Each discriminator U is a function:
    f_U(x) = hits("city x") / hits("x")
i.e., f_U("Pittsburgh") = hits("city Pittsburgh") / hits("Pittsburgh")
These are then used to create features: f_U(x) > θ and f_U(x) < θ
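A minimal sketch of a discriminator and its thresholded features; hits() is a hypothetical wrapper around a search-engine hit-count API, passed in as a parameter:

    def f_U(x, U, hits):
        # PMI-style score: hit count for the discriminator phrase with x,
        # normalized by the hit count for x alone, e.g.
        # f_U("Pittsburgh", "city", hits) ~ hits("city Pittsburgh") / hits("Pittsburgh").
        return hits(f'"{U} {x}"') / max(hits(f'"{x}"'), 1)

    def threshold_features(x, discriminators, theta, hits):
        # One boolean feature per discriminator U: is f_U(x) above theta?
        return [f_U(x, U, hits) > theta for U in discriminators]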
Bootstrapping - 3
1. Submit the queries & apply the rules to produce initial seeds.
2. Evaluate each seed with each discriminator U: e.g., compute PMI stats like |hits("city Boston")| / |hits("Boston")|.
3. Take the top seeds from each class and call them POSITIVE, then use the disjointness of the classes to find NEGATIVE seeds.
4. Train a Naive Bayes classifier using the thresholded U's as features (see the sketch below).
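A sketch of the training step, assuming binary features from the thresholded discriminators; this is a hand-rolled Bernoulli Naive Bayes for illustration, not the authors' implementation:

    import math
    from collections import defaultdict

    def train_naive_bayes(examples):
        # examples: list of (features, label) pairs, where features is a list
        # of booleans from threshold_features() and label is "POSITIVE" or
        # "NEGATIVE" from the seed-selection step above.
        counts = defaultdict(lambda: [1, 1])   # Laplace-smoothed [off, on] counts
        totals = defaultdict(int)
        for feats, label in examples:
            totals[label] += 1
            for j, on in enumerate(feats):
                counts[(label, j)][int(on)] += 1
        n = sum(totals.values())

        def log_posterior(feats, label):
            # Unnormalized log P(label | feats) under the Naive Bayes assumption.
            lp = math.log(totals[label] / n)
            for j, on in enumerate(feats):
                off_c, on_c = counts[(label, j)]
                lp += math.log((on_c if on else off_c) / (off_c + on_c))
            return lp

        return log_posterior

To classify a new extraction, compare log_posterior(feats, "POSITIVE") against log_posterior(feats, "NEGATIVE").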
Bootstrapping - 4
- Estimate the correctness of new extractions using the classifier built from the previously trained discriminators.
- Some ad hoc stopping conditions… (a "signal to noise" ratio)
Architecture - 2
Extensions to KnowItAll
Problem: Unsupervised learning finds clusters, but what if the text doesn't support the clustering we want?
- E.g., the target is "scientist", but the natural clusters are "biologist", "physicist", "chemist"
Solution: subclass extraction
- Modify the template/rule system to extract subclasses of the target class (e.g., scientist → chemist, biologist, …)
- Check extracted subclasses with WordNet (sketched below) and/or a PMI-like method (as for instances)
- Extract from each subclass recursively
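A minimal sketch of the WordNet check, using NLTK's WordNet interface; this assumes nltk and its wordnet corpus are installed, and the PMI-like check would reuse the discriminators above:

    from nltk.corpus import wordnet as wn  # requires the wordnet corpus

    def is_wordnet_subclass(candidate, target):
        # True if some noun sense of `candidate` has some noun sense of
        # `target` among its hypernym ancestors,
        # e.g. is_wordnet_subclass("chemist", "scientist") -> True.
        targets = set(wn.synsets(target, pos=wn.NOUN))
        for syn in wn.synsets(candidate, pos=wn.NOUN):
            # closure() walks the hypernym hierarchy upward from syn.
            ancestors = set(syn.closure(lambda s: s.hypernyms()))
            if ancestors & targets:
                return True
        return False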
Extensions to KnowItAll
Problem: The set of rules is limited, since it is derived from a fixed set of "templates" (general patterns, roughly from [H92])
Solution 1: Pattern learning — augment the initial set of rules derivable from templates:
- Search for instances I on the web
- Generate patterns: some substring of I in context, "b1 … b4 I a1 … a4" (up to four tokens before and after I)
- Assume classes are disjoint and estimate the recall/precision of each pattern P
- Exclude patterns that cover only one seed (very low recall)
- Take the top 200 remaining patterns and
  - evaluate them as extractors "using PMI" (?)
  - evaluate them as discriminators (in the usual way?)
- Examples: "headquartered in <city>", "<city> hotels", …
(A sketch of the window-generation step follows this list.)
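A hedged sketch of pattern generation: for each seed occurrence, emit context windows of up to four tokens on each side, then keep patterns that cover more than one seed. The corpus representation and tokenization are placeholders:

    from collections import defaultdict

    def generate_patterns(sentences, seeds, max_window=4):
        # sentences: lists of tokens; seeds: known instances of the class.
        # Returns {pattern: set of seeds it covers}, where a pattern is a
        # (before-context, after-context) pair around the instance slot.
        coverage = defaultdict(set)
        seed_set = {s.lower() for s in seeds}
        for toks in sentences:
            for i, tok in enumerate(toks):
                if tok.lower() not in seed_set:
                    continue
                for b in range(1, max_window + 1):
                    for a in range(1, max_window + 1):
                        if i - b < 0 or i + a >= len(toks):
                            continue
                        pat = (tuple(toks[i - b:i]), tuple(toks[i + 1:i + 1 + a]))
                        coverage[pat].add(tok.lower())
        # Patterns covering only one seed have very low recall; drop them.
        return {p: s for p, s in coverage.items() if len(s) > 1}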
Extensions to KnowItAll
Solution 2: List extraction — augment the initial set of rules with rules that are local to a specific web page:
- Search for pages containing small sets of instances (e.g., "London Paris Rome Pittsburgh")
- For each page P:
  - Find subtrees T of the DOM tree that contain >k seeds
  - Find the longest common prefix/suffix of the seeds in T [some heuristics added to generalize this further]
  - Find all other strings inside T with the same prefix/suffix
- Heuristically select the "best" wrapper for a page
- Wrapper = (P, T, prefix, suffix)
(See the sketch after this list.)
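A minimal sketch of the prefix/suffix step, assuming the strings inside a DOM subtree T have already been collected; the DOM parsing and the extra generalization heuristics are omitted:

    import os

    def common_prefix(strings):
        return os.path.commonprefix(strings)

    def common_suffix(strings):
        # Longest common suffix = reversed common prefix of reversed strings.
        return os.path.commonprefix([s[::-1] for s in strings])[::-1]

    def induce_wrapper(subtree_strings, seeds):
        # Find the longest common prefix/suffix of the seed-bearing strings
        # in T, then extract every other string in T matching them.
        seed_strs = [s for s in subtree_strings if any(seed in s for seed in seeds)]
        if len(seed_strs) < 2:
            return []
        pre, suf = common_prefix(seed_strs), common_suffix(seed_strs)
        return [s[len(pre):len(s) - len(suf)]
                for s in subtree_strings
                if s.startswith(pre) and s.endswith(suf)
                and len(s) > len(pre) + len(suf)]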
Results - City
Results - Film
Results - Scientist