The Road to the Semantic Web
Michael Genkin
SDBI
"The Semantic Web is not a separate Web but an extension of the current one, in which information is given well-defined meaning, better enabling computers and people to work in cooperation."
Tim Berners-Lee, James Hendler and Ora Lassila; Scientific American, May 2001
- Over 25 billion RDF triples (October 2010)
- More than 24 billion web pages (June 2010)
- Probably more than one triple per page, a lot more
How will we populate the Semantic Web?
- Humans will enter structured data
- Data-store owners will share their data
- Computers will read unstructured data
Read the Web (or google it)
Roadmap
- Motivation
- Some definitions
  - Natural language processing
  - Machine learning
- Macro reading the web
- Coupled training
- NELL
- Demo
- Summary
Some Definitions
- Natural Language Processing
- Machine Learning
Natural Language Processing
- Part-of-speech tagging (e.g. noun, verb)
- Noun phrase: a phrase that normally consists of a (modified) head noun.
  - “Pre-modified” (e.g. this, that, the red…)
  - “Post-modified” (e.g. …with long hair, …where I live)
- Proper noun: a noun that represents a unique entity (e.g. Jerusalem, Michael)
- Common noun: a noun that represents a class of entities (e.g. car, university)
Learning: What is it?
Training Methods
[Figure: training methods, labeled Supervised and Unsupervised]
- A middle way between supervised and unsupervised learning.
- Use a minimal amount of labeled examples and a large amount of unlabeled ones.
- Learn the structure of D in an unsupervised manner, but use the labeled examples to constrain the results.
- Repeat. This is known as bootstrapping.
[Figure: training methods, labeled Supervised, Semi-Supervised and Unsupervised]
Bootstrapping
Iterative semi-supervised learning
[Figure: bootstrapping loop. Instances: Jerusalem, Tel Aviv, Haifa. Patterns: "mayor of arg1", "life in arg1". Instances: Ness-Ziona, London, denial, anxiety, selfishness, Amsterdam. Patterns: "arg1 is home of", "traits such as arg1".]
Under-constrained! Semantic drift.
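The bootstrapping loop and its drift can be illustrated with a minimal sketch. The toy corpus and the function name `bootstrap` are invented for illustration. This is plain unconstrained bootstrapping, not the algorithm of any particular system.

```python
from typing import Iterable, Set, Tuple

# Hypothetical toy corpus of (pattern, noun phrase) co-occurrences.
CORPUS = [
    ("mayor of arg1", "Jerusalem"),
    ("mayor of arg1", "Ness-Ziona"),
    ("life in arg1", "Tel Aviv"),
    ("life in arg1", "London"),
    ("life in arg1", "denial"),
    ("traits such as arg1", "denial"),
    ("traits such as arg1", "selfishness"),
]

def bootstrap(seeds: Set[str],
              corpus: Iterable[Tuple[str, str]],
              iterations: int) -> Tuple[Set[str], Set[str]]:
    """Alternate between learning patterns from instances and
    instances from patterns, with no constraints at all."""
    pairs = list(corpus)
    instances = set(seeds)
    patterns: Set[str] = set()
    for _ in range(iterations):
        # Any pattern seen with a known instance is accepted.
        patterns |= {p for p, np in pairs if np in instances}
        # Any noun phrase seen with a known pattern is accepted.
        instances |= {np for p, np in pairs if p in patterns}
    return instances, patterns

cities, city_patterns = bootstrap({"Jerusalem", "Tel Aviv"}, CORPUS, iterations=2)
```

After one iteration "denial" is already accepted as a city via "life in arg1". After two, the pattern "traits such as arg1" and the instance "selfishness" follow, which is the drift the slide warns about.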
Macro Reading the Web
Populating the Semantic Web by Macro-Reading Internet Text. T.M. Mitchell, J. Betteridge, A. Carlson, E.R. Hruschka Jr., and R.C. Wang. Invited paper, in Proceedings of the International Semantic Web Conference (ISWC), 2009.
Problem Specification (1): Input
- An initial ontology that contains:
  - Dozens of categories and relations (e.g. Company, CompanyHeadquarteredInCity)
  - Relations between categories and relations (e.g. mutual exclusion, type constraints)
  - A few seed examples of each predicate in the ontology
- The web
- Occasional access to a human trainer
Problem Specification (2): The Task
- Run forever (24x7)
- Each day:
  - Run over ~500 million web pages.
  - Extract new facts and relations from the web to populate the ontology.
  - Perform better than the day before.
- Populate the semantic web.
A Solution?
An automatic, learning macro-reader.
Micro vs. Macro Reading (1)
- Micro-reading: the traditional NLP task of annotating a single web page to extract the full body of information contained in the document.
  - NLP is hard!
- Macro-reading: the task of “reading” a large corpus of web pages (e.g. the web) and returning a large collection of facts expressed in the corpus.
  - But not necessarily all the facts.
Micro vs. Macro Reading (2)
- Macro-reading is easier than micro-reading. Why?
  - Macro-reading doesn’t require extracting every bit of information available.
  - In text corpora as large as the web, many important facts are stated redundantly, thousands of times, using different wordings.
    - Benefit by ignoring complex sentences.
    - Benefit by statistically combining evidence from many fragments to determine a belief in a hypothesis.
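One simple way to picture "statistically combining evidence from many fragments" is a noisy-or. Here each fragment is assumed to support the hypothesis independently with some confidence. The noisy-or rule and the name `noisy_or` are assumptions chosen for illustration. The slides do not say which combination rule is used.

```python
from math import prod
from typing import Iterable

def noisy_or(confidences: Iterable[float]) -> float:
    """Belief that a fact holds, given independent per-fragment
    confidences: one minus the chance that every fragment is wrong."""
    return 1.0 - prod(1.0 - c for c in confidences)

# Ten weak mentions, each only 30% reliable, still add up to a strong belief.
belief = noisy_or([0.3] * 10)
```

The point of the sketch is that redundancy on the web lets many low-confidence extractions from simple sentences substitute for one high-confidence extraction from a hard sentence.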
Why an Input Ontology?
- The problem with understanding free text is that it can mean virtually anything.
- By formulating the problem of macro-reading as populating an ontology, we allow the system to focus only on relevant documents.
- The ontology can define meta-properties of its categories and relations.
- It allows populating the parts of the semantic web for which an ontology is available.
Machine Learning Methods
- Semi-supervised (use an ontology to learn).
- Learn textual patterns for extraction.
- Employ methods such as Coupled Training to improve accuracy.
- Expand the ontology to improve performance.
Coupled Training
Bootstrapping – Revised
Iterative semi-supervised learning
[Figure: bootstrapping loop. Instances: Jerusalem, Tel Aviv, Haifa. Patterns: "mayor of arg1", "life in arg1". Instances: Ness-Ziona, London, denial, anxiety, selfishness, Amsterdam. Patterns: "arg1 is home of", "traits such as arg1".]
Coupled Training
- Couple the training of multiple functions to make unlabeled data more informative.
- This makes the learning task easier by adding constraints.
Coupling (1): Output Constraints
[Figure: example "Nir Barkat is the mayor of Jerusalem" with argument arg1. Panel labels: X1 = arg1, Y = city?; X2 = arg1, Y = country?; X2 = arg1, Y = city?]
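Mutual exclusion between categories, which the input ontology supplies, is one example of an output constraint: a noun phrase accepted as a country should not also be accepted as a city. A minimal sketch follows. The data layout and the function names are assumptions for illustration, and the filter is simplified to a hard reject.

```python
from typing import Dict, List, Set

def violates_mutex(noun_phrase: str, category: str,
                   beliefs: Dict[str, Set[str]],
                   mutex: Dict[str, Set[str]]) -> bool:
    """True if the noun phrase is already believed to belong to a
    category that is mutually exclusive with `category`."""
    return any(noun_phrase in beliefs.get(other, set())
               for other in mutex.get(category, set()))

def filter_candidates(candidates: List[str], category: str,
                      beliefs: Dict[str, Set[str]],
                      mutex: Dict[str, Set[str]]) -> List[str]:
    """Keep only candidates that do not break a mutual-exclusion constraint."""
    return [np for np in candidates
            if not violates_mutex(np, category, beliefs, mutex)]

# Hypothetical ontology fragment and current beliefs.
MUTEX = {"city": {"country"}, "country": {"city"}}
BELIEFS = {"city": {"Jerusalem"}, "country": {"Israel"}}

kept = filter_candidates(["Haifa", "Israel"], "city", BELIEFS, MUTEX)
```

Because the classifiers for city and country are trained together, a confident belief in one category acts as a negative example for the other.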
Coupling (2): Compositional Constraints
[Figure: example "Nir Barkat is the mayor of Jerusalem" with relation MayorOf(X1, X2). Each argument is labeled: city? location? politician?]
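The compositional constraint in this example is a type check: a candidate MayorOf(X1, X2) is only plausible if X1 can be a politician and X2 can be a city. The sketch below assumes a hypothetical signature table and category membership sets. The names are illustrative and are not taken from the system.

```python
from typing import Dict, Set, Tuple

# Hypothetical relation signatures: relation -> (type of arg1, type of arg2).
SIGNATURES: Dict[str, Tuple[str, str]] = {"MayorOf": ("politician", "city")}

# Hypothetical category candidates gathered so far.
MEMBERS: Dict[str, Set[str]] = {
    "politician": {"Nir Barkat"},
    "city": {"Jerusalem", "Haifa"},
}

def type_checks(relation: str, arg1: str, arg2: str,
                members: Dict[str, Set[str]],
                signatures: Dict[str, Tuple[str, str]]) -> bool:
    """True if both arguments belong to the categories the relation expects."""
    type1, type2 = signatures[relation]
    return (arg1 in members.get(type1, set())
            and arg2 in members.get(type2, set()))

ok = type_checks("MayorOf", "Nir Barkat", "Jerusalem", MEMBERS, SIGNATURES)
swapped = type_checks("MayorOf", "Jerusalem", "Nir Barkat", MEMBERS, SIGNATURES)
```

The check couples the relation extractor to the category extractors: a mistake in one can be caught by the other.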
Coupling (3): Multi-view Agreement
NELL – Never-Ending Language Learning
- Coupled Semi-Supervised Learning for Information Extraction. A. Carlson, J. Betteridge, R.C. Wang, E.R. Hruschka Jr. and T.M. Mitchell. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM),
- Never Ending Language Learning. Tom Mitchell's invited talk in the Univ. of Washington CSE Distinguished Lecture Series, October 21,
Motivation
- Humans learn many things, for years, and become better learners over time.
- Why not machines?
Coupled Constraints (1)
Coupled Constraints (2)
- Unstructured and semi-structured text features:
  - Noun phrases appear on the web in free-text context or in semi-structured context.
  - Unstructured and semi-structured classifiers will make independent mistakes,
  - but each is sufficient for classification.
  - Both classifiers must agree.
Coupled Pattern Learner (CPL): Overview
- Learns to extract category and relation instances.
- Learns high-precision textual patterns,
  - e.g. "arg1 scored a goal for arg2".
Coupled Pattern Learner (CPL): Extracting
- Runs forever; on each iteration, bootstraps the patterns promoted on the last iteration to extract instances.
  - Selects the 1000 that co-occur with the most patterns.
- A similar procedure is used for patterns, but using recently promoted instances.
- Uses PoS heuristics to accomplish extraction,
  - e.g. a per-category proper/common noun specification; a pattern is a sequence of verbs followed by adjectives, prepositions, or determiners (and optionally preceded by nouns).
Coupled Pattern Learner (CPL): Filtering and Ranking
Coupled Pattern Learner (CPL): Promoting Candidates
- For each predicate, promotes at most 100 instances and 5 patterns: the highest rated.
- Instances and patterns are promoted only if they co-occur with at least two promoted patterns or instances.
- Relation instances are promoted only if their arguments are candidates for the specified categories.
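The promotion rule for instances can be sketched as below. The ranking score used here (the number of promoted patterns a candidate co-occurs with) is an assumption made for illustration, and the function and parameter names are hypothetical. Only the cap of 100 and the minimum support of two come from the slide.

```python
from typing import Dict, List, Set

def promote_instances(candidates: Dict[str, Set[str]],
                      promoted_patterns: Set[str],
                      max_promotions: int = 100,
                      min_support: int = 2) -> List[str]:
    """candidates maps each candidate instance to the patterns it
    co-occurs with. Returns the instances to promote, best first."""
    scored = []
    for instance, patterns in candidates.items():
        support = len(patterns & promoted_patterns)
        # Require co-occurrence with at least `min_support` promoted patterns.
        if support >= min_support:
            scored.append((support, instance))
    # Highest support first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [instance for _, instance in scored[:max_promotions]]

# Hypothetical candidates for the category "city".
CANDIDATES = {
    "Haifa": {"mayor of arg1", "life in arg1"},
    "denial": {"life in arg1"},
    "London": {"mayor of arg1", "life in arg1", "arg1 is home of"},
}
PROMOTED = {"mayor of arg1", "life in arg1", "arg1 is home of"}

promoted = promote_instances(CANDIDATES, PROMOTED)
```

Here "denial" is dropped because a single supporting pattern is not enough, which is how the support threshold limits semantic drift.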
Coupled SEAL (1)
- SEAL is an established wrapper induction algorithm.
  - Creates page-specific extractors.
  - Independent of language.
- Category wrappers are defined by a prefix and a postfix; relation wrappers are defined by an infix.
- Wrappers for each predicate are learned independently.
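A category wrapper defined by a prefix and a postfix can be applied to a page roughly as follows. This is a minimal sketch of applying a wrapper that is already known, not of SEAL's induction step. The page content and the name `apply_wrapper` are invented for illustration.

```python
import re
from typing import List

def apply_wrapper(page: str, prefix: str, postfix: str) -> List[str]:
    """Return every string on the page enclosed by the given prefix and postfix."""
    # re.escape treats the markup literally; (.+?) captures the shortest match.
    pattern = re.escape(prefix) + r"(.+?)" + re.escape(postfix)
    return re.findall(pattern, page)

# Hypothetical semi-structured page listing cities.
PAGE = "<li><b>Haifa</b></li><li><b>Jerusalem</b></li><li><i>advert</i></li>"

extracted = apply_wrapper(PAGE, prefix="<li><b>", postfix="</b></li>")
```

Because the wrapper is pure character context, it works on any language or markup, which is why such extractors are page specific and language independent.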
Coupled SEAL (2)
- Coupled SEAL adds mutual exclusion and type-checking constraints to SEAL.
- Bootstraps recently promoted wrappers.
- Filters candidates that are mutually exclusive or not of the right type for the relation.
- Uses a single page per domain for ranking.
- Promotes the top 100 instances extracted by at least two wrappers.
Meta-Bootstrap Learner
- Couples the training of multiple extraction techniques.
- Intuition: different extractors will make independent errors.
- Replaces the PROMOTE step of the subordinate extractor algorithms.
- Promotes any instance recommended by all the extractors, as long as mutual exclusion and type checks hold.
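The shared promotion step can be sketched as an intersection of the extractors' recommendations followed by a constraint check. The sketch handles only category instances and mutual exclusion and leaves out type checking of relations. All names and the data layout are assumptions for illustration.

```python
from typing import Dict, List, Set

def meta_promote(recommendations: List[Set[str]], category: str,
                 beliefs: Dict[str, Set[str]],
                 mutex: Dict[str, Set[str]]) -> Set[str]:
    """Promote instances that every extractor recommends and that are
    not already believed to be in a mutually exclusive category."""
    if not recommendations:
        return set()
    agreed = set.intersection(*recommendations)
    blocked: Set[str] = set()
    for other in mutex.get(category, set()):
        blocked |= beliefs.get(other, set())
    return agreed - blocked

# Hypothetical recommendations for "city" from two subordinate extractors.
FROM_PATTERNS = {"Haifa", "London", "denial", "Israel"}
FROM_WRAPPERS = {"Haifa", "London", "Israel"}

new_cities = meta_promote([FROM_PATTERNS, FROM_WRAPPERS], "city",
                          beliefs={"country": {"Israel"}},
                          mutex={"city": {"country"}})
```

"denial" is rejected because only one extractor proposed it, and "Israel" because it conflicts with an existing belief. Requiring agreement pays off when the extractors' errors are independent.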
Learning New Constraints
- Data-mine the KB to infer new beliefs.
- Generates probabilistic first-order Horn clauses.
- Connects previously uncoupled predicates.
- Rules are filtered manually.
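To make the idea of a probabilistic Horn clause concrete, the sketch below applies one rule of the form body1(x, y) AND body2(y, z) => head(x, z) to a small knowledge base. The predicates, the rule, and its confidence value are all hypothetical. How such rules are mined is not shown.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, str]

def apply_rule(kb: Dict[str, Set[Fact]], body1: str, body2: str,
               head: str, confidence: float) -> Set[Tuple[str, str, str, float]]:
    """Apply body1(x, y) AND body2(y, z) => head(x, z).
    Each inferred belief carries the rule's confidence."""
    inferred = set()
    for x, y in kb.get(body1, set()):
        for y2, z in kb.get(body2, set()):
            if y == y2:  # the shared variable must bind to the same value
                inferred.add((head, x, z, confidence))
    return inferred

# Hypothetical knowledge base and rule.
KB = {
    "MayorOf": {("Nir Barkat", "Jerusalem")},
    "CityInCountry": {("Jerusalem", "Israel"), ("London", "UK")},
}

new_beliefs = apply_rule(KB, "MayorOf", "CityInCountry",
                         head="HoldsOfficeInCountry", confidence=0.9)
```

A rule like this connects predicates that were previously learned without reference to each other, so evidence for one becomes evidence for the other.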
Demo Time
Summary
Populating the semantic web by using NELL for macro reading
Populating the Semantic Web
- Many ways to accomplish it.
- Use an initial ontology to focus and constrain the learning task.
- Couple the learning of many, many extractors.
- Macro reading: instead of annotating a single page at a time, read many pages simultaneously.
- A never-ending task.
Macro-Reading
- Helps to improve accuracy.
- Still doesn’t help to annotate a single page, but…
  - Many things that are true for a single page are also true for many pages.
- Helps to populate databases with frequently mentioned knowledge.
Future Directions
- Coupling with external sources
  - DBpedia, Freebase
- Ontology extension
  - New relations through reading, subcategories
- Use a macro-reader to train a micro-reader
- Self-reflection, self-correction
- Distinguishing tokens from entities
- Active learning: crowdsourcing
Questions?