1 Natural Language Processing for the Web
Prof. Kathleen McKeown, 722 CEPSR, 939-7118. Office Hours: Wed 1-2; Tues 4-5.
TA: Yves Petinot, 719 CEPSR, 939-7116. Office Hours: Thurs 12-1, 8-9.
2 Logistics
Class evaluation: please do it.
- If there were topics you particularly liked, please say so.
- If there were topics you particularly disliked, please say so.
- Note anything you particularly liked or disliked about the class format.
Project presentations
- Need eight people to go first, April 29th. Not necessary to have all results.
- 2nd date: May 13, 7:10pm UNLESS….
- Sign up by end of class or I will sign you up: http://www.cs.columbia.edu/~kathy/NLPWeb/finalpresentations.htm
3 Machine Reading
Goal: to read all texts on the web, extract all knowledge, and represent it in DB/KB format.
DARPA program on machine reading.
4 Issues
- Background theory and text facts may be inconsistent -> probabilistic representation.
- Beliefs may only be implicit -> need inference.
- Supervised learning is not an option due to the variety of relations on the web -> IE is not a valid solution.
- May require many steps of entailment -> need a more general approach than textual entailment.
5 Initial Approaches
- Systems that learn relations using examples (supervised).
- Systems that learn how to learn patterns using a seed set: Snowball (semi-supervised).
- Systems that can label their own training examples using domain-independent patterns: KnowItAll (self-supervised).
6 KnowItAll
- Requires no hand-tagged data.
- A generic pattern such as "<class> such as <instance>" learns Seattle, New York City, and London as examples of cities.
- Learns new patterns, e.g., "headquartered in <city>", to find more cities.
- Problem: relation-specific, requiring bootstrapping for each relation.
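The bootstrapping idea above can be made concrete with a small sketch: a generic Hearst-style pattern harvests seed cities, and a relation-specific pattern learned from those seeds ("headquartered in <city>") finds more. The toy corpus, the regexes, and the extract() helper are illustrative assumptions, not KnowItAll's actual implementation (which issues web search queries and assesses candidates probabilistically).

```python
import re

# Toy corpus standing in for web text (illustrative only).
corpus = [
    "Major cities such as Seattle, New York City, and London draw tourists.",
    "The company is headquartered in Seattle.",
    "Many firms are headquartered in London these days.",
]

# A capitalized name: one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"

# Generic, domain-independent pattern: "<class> such as <instance list>".
GENERIC = re.compile(rf"cities such as ({NAME}(?:, (?:and )?{NAME})*)")

# Relation-specific pattern learned from the seeds: "headquartered in <city>".
LEARNED = re.compile(rf"headquartered in ({NAME})")

def extract(pattern, texts):
    """Collect every name captured by the pattern across the corpus."""
    found = set()
    for text in texts:
        for match in pattern.finditer(text):
            for name in re.split(r", and |, | and ", match.group(1)):
                found.add(name.strip(" ."))
    return found

seeds = extract(GENERIC, corpus)        # {"Seattle", "New York City", "London"}
more_cities = extract(LEARNED, corpus)  # {"Seattle", "London"}
print(seeds | more_cities)
```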
7 TextRunner
"The use of NERs as well as syntactic or dependency parsers is a common thread that unifies most previous work. But this rather “heavy” linguistic technology runs into problems when applied to the heterogeneous text found on the Web."
Self-supervised learner
- Given a small corpus as an example; uses the Stanford parser.
- Retains a tuple if:
  - all entities are found in the parse;
  - there is a dependency path between the two entities shorter than a certain length;
  - the path from e1 to e2 does not cross a sentence-like boundary (e.g., a relative clause);
  - neither e1 nor e2 is a pronoun.
- Learns a classifier that tags tuples as "trustworthy".
  - Each tuple is converted to a feature vector.
  - Features: the POS sequence, the number of stop words in r, the number of tokens in r.
  - The learned classifier contains no relation-specific or lexical features.
Single-pass extractor (sketched in the code below)
- No parsing, only POS tagging and a lightweight NP chunker.
- Entities = NP chunks; relations = the words in between, heuristically eliminating words such as prepositions.
- Generates one or more candidate tuples per sentence and retains those the classifier determines are trustworthy.
Redundancy-based assessor
- Assigns a probability to each tuple based on a probabilistic model of redundancy.
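Below is a minimal sketch of the single-pass extraction step and the relation-independent features listed above. It assumes NLTK's tokenizer, POS tagger, and regexp chunker as stand-ins for TextRunner's lightweight components; the chunking grammar, stop-word list, and function names are illustrative, and the trained trustworthiness classifier itself is omitted.

```python
# Requires NLTK plus its tokenizer and tagger data,
# e.g. nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
import nltk

STOPWORDS = {"of", "to", "in", "at", "the", "a", "an", "that", "for"}  # small illustrative list

# Lightweight NP chunker over POS tags (cruder than TextRunner's own chunker).
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def extract_tuples(sentence):
    """Single pass: POS-tag, chunk NPs, and pair adjacent NPs with the words between them."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tree = chunker.parse(tagged)
    spans = []  # (phrase, start_token, end_token) for each NP chunk
    i = 0
    for node in tree:
        if isinstance(node, nltk.Tree):
            words = [w for w, _ in node.leaves()]
            spans.append((" ".join(words), i, i + len(words)))
            i += len(words)
        else:
            i += 1
    tuples = []
    for (e1, _, end1), (e2, start2, _) in zip(spans, spans[1:]):
        rel_tokens = [w for w, _ in tagged[end1:start2]]
        # Heuristically drop function words such as prepositions from the relation string.
        rel = " ".join(w for w in rel_tokens if w.lower() not in STOPWORDS)
        if rel:
            tuples.append((e1, rel, e2))
    return tuples, tagged

def tuple_features(t, tagged):
    """Relation-independent features of the kind the slide lists (no lexical features)."""
    _, rel, _ = t
    rel_words = rel.split()
    pos = {w: p for w, p in tagged}
    return {
        "pos_sequence": tuple(pos.get(w, "UNK") for w in rel_words),
        "num_stopwords_in_rel": sum(w.lower() in STOPWORDS for w in rel_words),
        "num_tokens_in_rel": len(rel_words),
    }

tuples, tagged = extract_tuples("TextRunner was developed at the University of Washington.")
for t in tuples:
    print(t, tuple_features(t, tagged))
```

In TextRunner, feature vectors of this kind are what the self-supervised classifier scores; only candidates judged trustworthy are kept.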
8 TextRunner Capabilities
- Tuple outputs are placed in a graph.
- TextRunner operates at large scale, processing 90 million web pages and producing 1 billion tuples, with an estimated 70% accuracy.
- Problems: inconsistencies, polysemy, synonymy, entity duplication.
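A toy sketch of placing extracted tuples in an entity-keyed graph and scoring each assertion by how many extractions support it. The noisy-or score below is only a stand-in for TextRunner's actual probabilistic redundancy model, and the tuples and parameter p are illustrative assumptions.

```python
from collections import Counter

# Tuples as produced by the extractor: (entity1, relation, entity2), one per supporting extraction.
extractions = [
    ("TextRunner", "was developed at", "the University of Washington"),
    ("TextRunner", "was developed at", "the University of Washington"),
    ("TextRunner", "processes", "web pages"),
]

# Graph view: entities are nodes; each distinct relation with its support count is a labelled edge.
counts = Counter(extractions)
graph = {}
for (e1, rel, e2), k in counts.items():
    graph.setdefault(e1, []).append((rel, e2, k))

def redundancy_score(k, p=0.5):
    # Stand-in scoring: noisy-or over k supporting extractions, each assumed
    # independently correct with probability p (NOT TextRunner's actual model).
    return 1 - (1 - p) ** k

for e1, edges in graph.items():
    for rel, e2, k in edges:
        print(f"({e1}, {rel}, {e2})  support={k}  score={redundancy_score(k):.2f}")
```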
9 How close are we to realizing the dream of machine reading?