1 Natural Language Processing for the Web
Prof. Kathleen McKeown, 722 CEPSR, 939-7118. Office Hours: Wed 1-2; Tues 4-5.
TA: Yves Petinot, 719 CEPSR, 939-7116. Office Hours: Thurs 12-1, 8-9.
2 Logistics
Class evaluation: please do it.
- If there were topics you particularly liked, please say so.
- If there were topics you particularly disliked, please say so.
- Note anything you particularly liked or disliked about the class format.
Project presentations
- Need eight people to go first, April 29th. Not necessary to have all results.
- 2nd date: May 13, 7:10pm UNLESS….
- Sign up by end of class or I will sign you up: http://www.cs.columbia.edu/~kathy/NLPWeb/finalpresentations.htm
3 Machine Reading
Goal: to read all texts on the web, extract all knowledge, and represent it in DB/KB format.
DARPA program on machine reading.
4 Issues
- Background theory and text facts may be inconsistent -> probabilistic representation.
- Beliefs may only be implicit -> need inference.
- Supervised learning is not an option due to the variety of relations on the web -> IE is not a valid solution.
- May require many steps of entailment -> need a more general approach than textual entailment.
5 Initial Approaches
- Systems that learn relations using examples (supervised).
- Systems that learn how to learn patterns using a seed set: Snowball (semi-supervised).
- Systems that can label their own training examples using domain-independent patterns: KnowItAll (self-supervised).
6 KnowItAll
- Requires no hand-tagged data.
- A generic pattern such as "<class> such as <instance>" learns Seattle, New York City, and London as examples of cities.
- Learns new patterns, e.g., "headquartered in <city>", to find more cities.
- Problem: relation-specific, requiring bootstrapping for each relation.
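The bootstrapping idea above can be made concrete with a small sketch: a generic Hearst-style pattern harvests seed cities, and a relation-specific pattern learned from those seeds ("headquartered in <city>") finds more. The toy corpus, the regexes, and the extract() helper are illustrative assumptions, not KnowItAll's actual implementation (which issues web search queries and assesses candidates probabilistically).

```python
import re

# Toy corpus standing in for web text (illustrative only).
corpus = [
    "Major cities such as Seattle, New York City, and London draw tourists.",
    "The company is headquartered in Seattle.",
    "Many firms are headquartered in London these days.",
]

# A capitalized name: one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Generic, domain-independent pattern: "<class> such as <instance list>".
GENERIC = re.compile(rf"cities such as ({NAME}(?:, (?:and )?{NAME})*)")

# Relation-specific pattern learned from the seeds: "headquartered in <city>".
LEARNED = re.compile(rf"headquartered in ({NAME})")

def extract(pattern, texts):
    """Collect every name captured by the pattern across the corpus."""
    found = set()
    for text in texts:
        for match in pattern.finditer(text):
            for name in re.split(r", and |, | and ", match.group(1)):
                found.add(name.strip(" ."))
    return found

seeds = extract(GENERIC, corpus)        # {"Seattle", "New York City", "London"}
more_cities = extract(LEARNED, corpus)  # {"Seattle", "London"}
print(seeds | more_cities)
```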
7 TextRunner
"The use of NERs as well as syntactic or dependency parsers is a common thread that unifies most previous work. But this rather “heavy” linguistic technology runs into problems when applied to the heterogeneous text found on the Web."
Self-supervised learner
- Given a small corpus as an example; uses the Stanford parser.
- Retains a tuple if:
  - all entities are found in the parse;
  - there is a dependency path between the two entities shorter than a certain length;
  - the path from e1 to e2 does not cross a sentence-like boundary (e.g., a relative clause);
  - neither e1 nor e2 is a pronoun.
- Learns a classifier that tags tuples as "trustworthy".
  - Each tuple is converted to a feature vector.
  - Features: the POS sequence, the number of stop words in r, the number of tokens in r.
  - The learned classifier contains no relation-specific or lexical features.
Single-pass extractor (sketched in the code below)
- No parsing, only POS tagging and a lightweight NP chunker.
- Entities = NP chunks; relations = the words in between, heuristically eliminating words such as prepositions.
- Generates one or more candidate tuples per sentence and retains those the classifier determines are trustworthy.
Redundancy-based assessor
- Assigns a probability to each tuple based on a probabilistic model of redundancy.
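Below is a minimal sketch of the single-pass extraction step and the relation-independent features listed above. It assumes NLTK's tokenizer, POS tagger, and regexp chunker as stand-ins for TextRunner's lightweight components; the chunking grammar, stop-word list, and function names are illustrative, and the trained trustworthiness classifier itself is omitted.

```python
# Requires NLTK plus its tokenizer and tagger data,
# e.g. nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

STOPWORDS = {"of", "to", "in", "at", "the", "a", "an", "that", "for"}  # small illustrative list

# Lightweight NP chunker over POS tags (cruder than TextRunner's own chunker).
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def extract_tuples(sentence):
    """Single pass: POS-tag, chunk NPs, and pair adjacent NPs with the words between them."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    spans = []  # (phrase, start_token, end_token) for each NP chunk
    i = 0
    for node in tree:
        if isinstance(node, nltk.Tree):
            words = [w for w, _ in node.leaves()]
            spans.append((" ".join(words), i, i + len(words)))
            i += len(words)
        else:
            i += 1
    tuples = []
    for (e1, _, end1), (e2, start2, _) in zip(spans, spans[1:]):
        rel_tokens = [w for w, _ in tagged[end1:start2]]
        # Heuristically drop function words such as prepositions from the relation string.
        rel = " ".join(w for w in rel_tokens if w.lower() not in STOPWORDS)
        if rel:
            tuples.append((e1, rel, e2))
    return tuples, tagged

def tuple_features(t, tagged):
    """Relation-independent features of the kind the slide lists (no lexical features)."""
    _, rel, _ = t
    rel_words = rel.split()
    pos = {w: p for w, p in tagged}
    return {
        "pos_sequence": tuple(pos.get(w, "UNK") for w in rel_words),
        "num_stopwords_in_rel": sum(w.lower() in STOPWORDS for w in rel_words),
        "num_tokens_in_rel": len(rel_words),
    }

tuples, tagged = extract_tuples("TextRunner was developed at the University of Washington.")
for t in tuples:
    print(t, tuple_features(t, tagged))
```

In TextRunner, feature vectors of this kind are what the self-supervised classifier scores; only candidates judged trustworthy are kept.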
8 TextRunner Capabilities
- Tuple outputs are placed in a graph.
- TextRunner operates at large scale, processing 90 million web pages and producing 1 billion tuples, with an estimated 70% accuracy.
- Problems: inconsistencies, polysemy, synonymy, entity duplication.
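A toy sketch of placing extracted tuples in an entity-keyed graph and scoring each assertion by how many extractions support it. The noisy-or score below is only a stand-in for TextRunner's actual probabilistic redundancy model, and the tuples and parameter p are illustrative assumptions.

```python
from collections import Counter

# Tuples as produced by the extractor: (entity1, relation, entity2), one per supporting extraction.
extractions = [
    ("TextRunner", "was developed at", "the University of Washington"),
    ("TextRunner", "was developed at", "the University of Washington"),
    ("TextRunner", "processes", "web pages"),
]

# Graph view: entities are nodes; each distinct relation with its support count is a labelled edge.
counts = Counter(extractions)
graph = {}
for (e1, rel, e2), k in counts.items():
    graph.setdefault(e1, []).append((rel, e2, k))

def redundancy_score(k, p=0.5):
    # Stand-in scoring: noisy-or over k supporting extractions, each assumed
    # independently correct with probability p (NOT TextRunner's actual model).
    return 1 - (1 - p) ** k

for e1, edges in graph.items():
    for rel, e2, k in edges:
        print(f"({e1}, {rel}, {e2})  support={k}  score={redundancy_score(k):.2f}")
```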
9 How close are we to realizing the dream of machine reading?