3 Machine Reading
- Goal: read all text on the web, extract all the knowledge it contains, and represent it in database/knowledge-base (DB/KB) format
- DARPA program on machine reading
4 Issues
- Background theory and text facts may be inconsistent -> probabilistic representation
- Beliefs may only be implicit -> need inference
- Supervised learning is not an option due to the variety of relations on the web -> IE is not a valid solution
- May require many steps of entailment -> need a more general approach than textual entailment
5 Initial Approaches
- Supervised: systems that learn relations from examples
- Semi-supervised: systems that learn extraction patterns from a seed set (Snowball)
- Self-supervised: systems that label their own training examples using domain-independent patterns (KnowItAll)
6 KnowItAll
- Requires no hand-tagged data
- A generic pattern such as
- Learns Seattle, New York City, London as examples of cities
- Learns new patterns such as “Headquartered in ” to find more cities
- Problem: relation-specific, requiring bootstrapping for each relation
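
The bootstrapping loop on this slide can be illustrated with a minimal Python sketch. It is not KnowItAll's actual implementation: the slide's generic pattern is not legible in the source, so the sketch assumes a "cities such as A, B and C" list pattern, and the function names (`extract_instances`, `learn_prefix_patterns`, `apply_prefix_pattern`) are hypothetical.

```python
import re
from collections import Counter


def extract_instances(text, class_plural):
    """Collect list items that follow '<class_plural> such as ...' (assumed pattern)."""
    pattern = re.compile(
        r"\b" + re.escape(class_plural) + r"\s+such\s+as\s+([^.;]+)",
        re.IGNORECASE,
    )
    instances = []
    for match in pattern.finditer(text):
        # Split "A, B and C" on commas and on the words "and" / "or".
        parts = re.split(r"\s*,\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+", match.group(1))
        for item in parts:
            item = item.strip()
            if item and item not in instances:
                instances.append(item)
    return instances


def learn_prefix_patterns(sentences, seeds, width=2):
    """Count the `width` words that appear immediately before a known instance."""
    counts = Counter()
    for sentence in sentences:
        for seed in seeds:
            idx = sentence.find(seed)
            if idx > 0:
                left = sentence[:idx].split()[-width:]
                if len(left) == width:
                    counts[" ".join(left).lower()] += 1
    return counts


def apply_prefix_pattern(text, prefix):
    """Use a learned prefix to propose new instances (runs of capitalized words)."""
    pattern = re.compile(
        r"(?i:" + re.escape(prefix) + r")\s+([A-Z]\w*(?:\s[A-Z]\w*)*)"
    )
    return pattern.findall(text)


# Step 1: the generic pattern yields seed instances of the class "cities".
seeds = extract_instances(
    "He has lived in cities such as Seattle, New York City and London.", "cities"
)

# Step 2: contexts around the seeds suggest a new, relation-specific pattern.
patterns = learn_prefix_patterns(
    ["Microsoft is headquartered in Seattle.", "The bank is headquartered in London."],
    seeds,
)

# Step 3: the new pattern finds instances the generic pattern missed.
best_prefix = patterns.most_common(1)[0][0]
new_cities = apply_prefix_pattern("Boeing is headquartered in Chicago.", best_prefix)
```

The learned prefix only helps with the class it was bootstrapped for, which is the limitation the slide points out: every new relation needs its own round of bootstrapping.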
7 TextRunner
“The use of NERs as well as syntactic or dependency parsers is a common thread that unifies most previous work. But this rather “heavy” linguistic technology runs into problems when applied to the heterogeneous text found on the Web.”
- Self-supervised learner
  - Given a small corpus as an example
  - Uses the Stanford parser and finds all entities in each parse
  - Keeps a tuple (e1, r, e2) if:
    - there is a dependency path between the two entities that is shorter than a certain length
    - the path from e1 to e2 does not cross a sentence-like boundary (e.g., a relative clause)
    - neither e1 nor e2 is a pronoun
  - Learns a classifier that tags tuples as “trustworthy”
  - Each tuple is converted to a feature vector
    - Feature = POS sequence
    - Number of stop words in r
    - Number of tokens in r
  - The learned classifier contains no relation-specific or lexical features
- Single-pass extractor
  - No parsing, only POS tagging and a lightweight NP chunker
  - Entities = NP chunks
  - Relations = the words in between, heuristically eliminating words like prepositions
  - Generates one or more candidate tuples per sentence and retains those that the classifier determines are trustworthy
- Redundancy-based assessor
  - Assigns a probability to each tuple based on a probabilistic model of redundancy
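
The single-pass extractor and the classifier features can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TextRunner's code: the input is assumed to be already POS-tagged with Penn Treebank tags, the NP chunker is a toy rule (maximal runs of determiner, adjective, and noun tags), the POS-sequence feature is assumed to be computed over r, and the stop-word list is a small hypothetical one. The `noisy_or` function is a simple stand-in for the redundancy-based assessor; the slide does not specify the actual probabilistic model.

```python
# Assumed tag set for the toy NP chunker (Penn Treebank tags).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"is", "was", "are", "the", "a", "an", "of", "in", "by", "to"}


def np_chunks(tagged):
    """Return (start, end) spans of maximal runs of NP-like tags."""
    spans, start = [], None
    for i, (_, tag) in enumerate(tagged):
        if tag in NP_TAGS:
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tagged)))
    return spans


def candidate_tuples(tagged, drop_tags=frozenset({"IN"})):
    """Build (e1, r, e2) candidates from adjacent NP chunks in one pass.

    Words between the chunks form r; tokens whose tag is in `drop_tags`
    (prepositions by default, following the slide) are removed.
    """
    spans = np_chunks(tagged)
    candidates = []
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        rel = [(w, t) for w, t in tagged[e1:s2] if t not in drop_tags]
        if not rel:
            continue  # nothing left to serve as a relation
        candidates.append({
            "e1": " ".join(w for w, _ in tagged[s1:e1]),
            "r": " ".join(w for w, _ in rel),
            "e2": " ".join(w for w, _ in tagged[s2:e2]),
            # The three features listed on the slide; none is lexical
            # or tied to a particular relation.
            "features": {
                "pos_sequence": tuple(t for _, t in rel),
                "num_stop_words": sum(1 for w, _ in rel if w.lower() in STOP_WORDS),
                "num_tokens": len(rel),
            },
        })
    return candidates


def noisy_or(count, p_correct):
    """Stand-in assessor: probability that at least one of `count`
    independent extractions of the same tuple is correct."""
    return 1.0 - (1.0 - p_correct) ** count


sentence = [
    ("Microsoft", "NNP"), ("is", "VBZ"), ("headquartered", "VBN"),
    ("in", "IN"), ("Seattle", "NNP"),
]
tuples = candidate_tuples(sentence)
```

In the pipeline the slide describes, the feature dictionaries would be passed to the trained classifier, and only tuples it marks as trustworthy would reach the assessor. Because pronouns are not in `NP_TAGS`, they never become entities in this sketch.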
8 TextRunner Capabilities
- Tuple outputs are placed in a graph
- TextRunner operates at large scale: it processed 90 million web pages and produced 1 billion tuples, with an estimated accuracy of 70%
- Problems: inconsistencies, polysemy, synonymy, entity duplication
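
A small Python sketch shows how tuples might be placed in a graph and why entity duplication becomes a problem. The adjacency-list representation, the `build_graph` function, and the lowercase normalization are illustrative assumptions; the slides do not describe TextRunner's actual graph store or any deduplication method.

```python
def build_graph(tuples, normalize=None):
    """Store (e1, r, e2) tuples as a labeled, directed adjacency list.

    Each entity string is a node; each tuple is an edge labeled with r.
    `normalize` is an optional function applied to entity strings first.
    """
    graph = {}
    for e1, r, e2 in tuples:
        if normalize is not None:
            e1, e2 = normalize(e1), normalize(e2)
        graph.setdefault(e1, []).append((r, e2))
        graph.setdefault(e2, [])  # make sure the target node exists too
    return graph


extractions = [
    ("Microsoft", "is headquartered in", "Seattle"),
    ("microsoft", "acquired", "GitHub"),
]

# Without normalization, "Microsoft" and "microsoft" become separate nodes,
# so facts about one entity are split across two places in the graph.
raw = build_graph(extractions)

# A crude normalization (lowercasing) merges them. It would not handle
# synonymy or polysemy, which need more than string matching.
merged = build_graph(extractions, normalize=str.lower)
```

Lowercasing removes only the most superficial duplicates. The other problems the slide lists, such as synonymy and polysemy, cannot be fixed by string normalization alone.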
9 How close are we to realizing the dream of machine reading?