Seminar: Efficient NLP, Session 2: NLP behind Broccoli
November 2nd, 2011
Elmar Haußmann, Chair for Algorithms and Data Structures, Department of Computer Science, University of Freiburg

Agenda
• Motivation and Problem Definition
• Rule based Approach
• Machine Learning based Approach
• Conclusion / Current and Future Work

Motivation and Problem Definition
• The idea of semantic full-text search:
– Search in full-text
– But combined with "structured information"
• Broccoli performs the following NLP tasks:
– Entity recognition, based on the links inside Wikipedia articles and heuristics
– Anaphora resolution, based on simple, yet efficient heuristics
– Contextual Sentence Decomposition (this talk)

Motivation and Problem Definition
• The motivation for Contextual Sentence Decomposition – the "heavy" NLP task behind Broccoli
Example Query: plant edible leaves
Result Sentence: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Motivation and Problem Definition
• Many false positives are caused by words that appear in the same sentence but belong to a different context
➡ Apply natural language processing to decompose the sentence based on context and search the resulting „sentences“ independently
Result Sentence: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Motivation and Problem Definition
Original Sentence: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.
Decomposed Sentence:
• The usable parts of rhubarb are the medicinally used roots
• The usable parts of rhubarb are the edible stalks
• its leaves are toxic

Motivation and Problem Definition: Contextual Sentence Decomposition
Problem Definition: Contextual Sentence Decomposition is the process of performing
1. Sentence Constituent Identification, followed by
2. Sentence Constituent Recombination

Motivation and Problem Definition: Sentence Constituent Identification
• Identify specific parts of a sentence
• Differentiate 4 types of constituents:
• Relative clauses: Albert Einstein, who was born in Ulm, ...
• Appositions: Albert Einstein, a well-known scientist, ...
• List items: Albert Einstein published papers on Brownian motion, the photoelectric effect and special relativity.
• Separators: Albert Einstein was recognized as a leading scientist and in 1921 he received the Nobel Prize in Physics.

Motivation and Problem Definition
Original Sentence with Identified Constituents (list items and a separator are marked): The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Motivation and Problem Definition: Sentence Constituent Recombination
• Recombine identified constituents into sub-sentences
• Split sentences at separators
• Attach relative clauses and appositions to the noun (phrase) they describe
• Apply the „distributive law“ to list items

Motivation and Problem Definition
Original Sentence: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.
Decomposed Sentence:
• The usable parts of rhubarb are the medicinally used roots
• The usable parts of rhubarb are the edible stalks
• its leaves are toxic
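
As a minimal illustration of the recombination step above, the sketch below applies the „distributive law“ to list items and keeps the clause split off at the separator as its own sub-sentence. The function name and the flat (head, list items, separated clauses) representation are assumptions made for this example, not Broccoli's actual data structures.

```python
# Minimal sketch of Sentence Constituent Recombination, assuming the
# constituents have already been identified. The flat representation used
# here is hypothetical.

def recombine(head, list_items, separated_clauses):
    """Apply the "distributive law" to list items and keep clauses that were
    split off at separators as their own sub-sentences."""
    sub_sentences = [f"{head} {item}" for item in list_items]
    sub_sentences.extend(separated_clauses)
    return sub_sentences

# Worked example from the slides:
for s in recombine(
    head="The usable parts of rhubarb are",
    list_items=["the medicinally used roots", "the edible stalks"],
    separated_clauses=["its leaves are toxic"],
):
    print(s)
# The usable parts of rhubarb are the medicinally used roots
# The usable parts of rhubarb are the edible stalks
# its leaves are toxic
```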

Motivation and Problem Definition: Remarks
• Given identified constituents, recombination is comparably simple; identification is the challenging part
• Constituents can be nested, e.g. a relative clause can contain an enumeration
• The resulting sub-sentences are often grammatically correct, but are not required to be
• The approach must be feasible in terms of efficiency (English Wikipedia: ~30 GB of raw text)

Motivation and Problem Definition: And… Natural Language is Tricky
• Ambiguous, even for humans:
– “Time flies like an arrow; fruit flies like a banana.”
– “Flying planes can be dangerous.”
– “I once shot an elephant in my pajamas. How he got into my pajamas, I'll never know.”
• Focus: the large part of less complicated sentences

Motivation and Problem Definition: …Natural Language is Tricky
Difficult Sentence: Panofsky was known to be friends with Wolfgang Pauli, one of the main contributors to quantum physics and atomic theory, as well as Albert Einstein, born in Ulm and famous for his discovery of the law of the photoelectric effect and theories of relativity.
• Even if the meaning is clear to a human: arbitrarily deep nesting and syntactic ambiguity
• The apposition looks similar to an element of an enumeration
• The relative clause contains an enumeration and starts in reduced form

Agenda
• Motivation and Problem Definition
• Rule based Approach
• Machine Learning based Approach
• Conclusion / Current and Future Work

Rule based Approach: Idea
• Devise hand-crafted rules by closely inspecting sentence structure
Sentence containing a Relative Clause: Kofi Annan, who is the current U.N. Secretary General, has spent much of his tenure working to promote peace in the Third World.
• Example rule: a relative clause is set off by a comma, starts with the word „who“ and extends to the next comma
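
As a toy illustration of such a rule, the sketch below marks a „who“ relative clause that is set off by commas. The regular expression and the function name are assumptions made for illustration only; the rules actually used are far richer (more trigger words, part-of-speech information, nesting).

```python
import re

# Toy version of the example rule: a relative clause is set off by a comma,
# starts with "who" and extends to the next comma.
RELATIVE_CLAUSE = re.compile(r",\s*(who[^,]*),")

def find_relative_clause(sentence):
    match = RELATIVE_CLAUSE.search(sentence)
    return match.group(1) if match else None

sentence = ("Kofi Annan, who is the current U.N. Secretary General, has spent "
            "much of his tenure working to promote peace in the Third World.")
print(find_relative_clause(sentence))
# who is the current U.N. Secretary General
```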

Rule based Approach: Basic Approach
• Identify „stop-words“
• For each marked word, decide if and which constituent it starts
• Determine the corresponding constituent ends
Original Sentence with marked Stop-words: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Rule based Approach: Determine Constituent Starts
Original Sentence with Identified Stop-words: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Rule based Approach: Determine Constituent Starts
• If a verb follows but a noun precedes it: separator
Original Sentence with Identified Separator: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Rule based Approach: Determine Constituent Starts
• If a verb follows but a noun precedes it: separator
• If it is no relative clause or apposition: the next word is a list item start
Original Sentence with Identified List Item Start: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Rule based Approach: Determine Constituent Starts
• If a verb follows but a noun precedes it: separator
• If it is no relative clause or apposition: the next word is a list item start
• The first list item starts at the noun phrase preceding the already discovered list item start
Original Sentence with all Identified List Item Starts: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.
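
A heavily simplified sketch of these start rules is shown below. Tokens are assumed to already carry part-of-speech tags from an external tagger; the trigger-word list, the tag checks and the labels are assumptions for illustration. Omitted here are the checks for relative clauses and appositions and the backward search that places the first list item start at the preceding noun phrase.

```python
# Illustrative trigger words; the actual stop-word list is not given on the
# slides and also covers punctuation such as commas.
STOP_WORDS = {"and", "or", "however"}

def constituent_starts(tagged_tokens):
    """Return (index, label) pairs marking constituent starts.
    tagged_tokens: list of (word, POS tag) pairs."""
    starts = []
    stops = [i for i, (w, _) in enumerate(tagged_tokens) if w.lower() in STOP_WORDS]
    for pos, i in enumerate(stops):
        span_end = stops[pos + 1] if pos + 1 < len(stops) else len(tagged_tokens)
        # "a noun precedes": some noun occurs before the stop-word.
        noun_before = any(t.startswith("NN") for _, t in tagged_tokens[:i])
        # "a verb follows": some verb occurs after it, before the next stop-word.
        verb_after = any(t.startswith("VB") for _, t in tagged_tokens[i + 1:span_end])
        if noun_before and verb_after:
            starts.append((i, "separator"))
        else:
            # No relative clause / apposition handling in this sketch:
            # the next word starts a list item.
            starts.append((i + 1, "list_item_start"))
    return starts

tagged = [("The", "DT"), ("usable", "JJ"), ("parts", "NNS"), ("of", "IN"),
          ("rhubarb", "NN"), ("are", "VBP"), ("the", "DT"), ("medicinally", "RB"),
          ("used", "VBN"), ("roots", "NNS"), ("and", "CC"), ("the", "DT"),
          ("edible", "JJ"), ("stalks", "NNS"), (",", ","), ("however", "RB"),
          ("its", "PRP$"), ("leaves", "NNS"), ("are", "VBP"), ("toxic", "JJ")]
print(constituent_starts(tagged))
# [(11, 'list_item_start'), (15, 'separator')]
```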

Rule based Approach: Determine Constituent Ends
• For each start, assign a matching end
Original Sentence with all Identified List Item Starts: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Rule based Approach: Determine Constituent Ends
• For each start, assign a matching end
• A list item extends to the next constituent start or the sentence end
Original Sentence with Identified Constituents: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.
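
The end rule translates almost directly into code. The sketch below pairs each start marker with the next constituent start or the sentence end; the (index, label) input format matches the previous sketch and is, again, only an assumption, and the list-item rule is applied uniformly to all starts for simplicity.

```python
# Sketch of the "determine constituent ends" rule: a constituent extends to
# the next constituent start or to the end of the sentence.

def constituent_spans(starts, sentence_length):
    """Turn (index, label) start markers into (start, end, label) spans,
    where `end` is exclusive."""
    spans = []
    ordered = sorted(starts)
    for pos, (start, label) in enumerate(ordered):
        end = ordered[pos + 1][0] if pos + 1 < len(ordered) else sentence_length
        spans.append((start, end, label))
    return spans

print(constituent_spans([(11, "list_item_start"), (15, "separator")], 20))
# [(11, 15, 'list_item_start'), (15, 20, 'separator')]
```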

Agenda
• Motivation and Problem Definition
• Rule based Approach
• Machine Learning based Approach
• Conclusion / Current and Future Work

Machine Learning based Approach: Idea
• Use supervised learning to train classifiers that identify the start and end of constituents
• Train Support Vector Machines for each constituent start and end
Original Sentence: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Machine Learning based Approach: Basic Approach
• Apply the classifiers in turn to each word:
1. Apply the separator classifier
2. Apply the list item start classifier
3. Apply the list item end classifier
• Ideally, this would already give a correct solution
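
A minimal sketch of the per-word classifier idea from the previous two slides, using scikit-learn SVMs, is shown below. The word-window features, the training-data format and the function names are assumptions made for this example; the slides do not specify the feature set actually used in the system.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def window_features(tokens, i):
    """Very small feature window around word i (illustrative only)."""
    return {
        "word": tokens[i].lower(),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }

def train_boundary_classifier(sentences, labels):
    """Train one binary SVM, e.g. "does this word start a list item?".
    `sentences`: list of token lists; `labels`: parallel list of 0/1 lists."""
    X = [window_features(toks, i) for toks in sentences for i in range(len(toks))]
    y = [l for sentence_labels in labels for l in sentence_labels]
    return make_pipeline(DictVectorizer(), LinearSVC()).fit(X, y)

# One such classifier is trained per boundary type (separator, list item
# start, list item end, ...) and then applied to every word of a new sentence.
```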

Machine Learning based Approach
• However, classifiers are not perfect
• Some additional ends and beginnings might be identified
• Decisions are local and do not consider admissible constituent structure

Machine Learning based Approach
• Train classifiers that identify whether a span of the sentence denotes a valid constituent (e.g. apply a list item classifier to candidate spans)
• Still, identified constituents might overlap
• Structural constraints must be satisfied

Machine Learning based Approach
➡ Reduce to the maximum weight independent set (MWIS) problem
• Determine the MWIS using enumeration, or a greedy approach for large problem sizes
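
The greedy variant can be sketched as follows: candidate constituents become nodes weighted, for instance, by classifier confidence, overlapping candidates conflict, and the highest-weighted non-conflicting candidates are kept. The (start, end, weight) span format and the use of overlap as the only conflict rule are simplifying assumptions; the real constraints on admissible constituent structure are richer.

```python
# Greedy selection for the MWIS formulation over candidate constituent spans.

def overlaps(a, b):
    """Two half-open spans (start, end, ...) conflict if they overlap."""
    return a[0] < b[1] and b[0] < a[1]

def greedy_mwis(candidates):
    """candidates: list of (start, end, weight) spans.
    Returns a conflict-free subset, preferring higher weights."""
    chosen = []
    for cand in sorted(candidates, key=lambda c: c[2], reverse=True):
        if all(not overlaps(cand, kept) for kept in chosen):
            chosen.append(cand)
    return chosen

print(greedy_mwis([(7, 13, 0.9), (11, 20, 0.6), (14, 20, 0.8)]))
# [(7, 13, 0.9), (14, 20, 0.8)]  (the overlapping low-weight candidate is dropped)
```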

Machine Learning based Approach
• The final result adheres to the structural constraints
• More resistant to wrong „local“ classifications
Original Sentence with Identified Constituents: The usable parts of rhubarb are the medicinally used roots and the edible stalks, however its leaves are toxic.

Agenda
• Motivation and Problem Definition
• Rule based Approach
• Machine Learning based Approach
• Conclusion / Current and Future Work

Evaluation / Conclusion: Evaluation on three levels
1. Compare the identification using a ground truth
2. Compare the resulting decomposition using a ground truth
3. Evaluate the influence on search quality against a ground truth

Evaluation / Conclusion: Results
• The rule based approach is viable, with a clear improvement
• The Machine Learning based approach is viable, but currently less effective
• Search quality increases depend on the exact query, but go up to a doubling of precision with hardly any loss in recall
• Contextual Sentence Decomposition is an integral part of Semantic Full-Text Search

Evaluation / Conclusion: Current Work
• Increasing the quality of the decomposition by:
o efficient additional NLP (deep parsers?…)
o improvements of the rules
o a better understanding of what extent of decomposition is reasonable and necessary

Thank you for your attention!