
1 IBM Research © Copyright IBM Corporation 2005
A Development Environment for Configurable Meta-Annotators in a Pipelined NLP Architecture
Youssef Drissi, Branimir Boguraev, Mary Neff, David Ferrucci, Paul Keyser and Anthony Levas
IBM T.J. Watson Research Center
{youssefd,bran,ferrucci,pkeyser,levas}@us.ibm.com

2 Outline
  Background:
    - Text Analytics
    - Unstructured Information Management Architecture (UIMA)
  The Challenges
    - The Consumability Challenges
  Our Approach to meet these challenges
    - The Concept-Centric Approach
    - Our Text Analytics Development Cycle
  A Scenario (Demo)
    - Detecting sentiments about cars from a corpus of car reviews

3 Text Analytics
[Figure: an example sentence (extracted words: Fred, is, the, Center, CEO, of, Center Micros) annotated at three levels. Parser: NP, VP, PP. Named Entity: Person, Organization. Relationship: CeoOf (Arg1:Person, Arg2:Org).]
UIMA: Unstructured Information Management Architecture

4 UIMA: A runtime framework for Text Analytics
UIMA: Unstructured Information Management Architecture
[Figure: data flows through a chain of annotators (Tokenizer, POS Tagger, PERSON Finder, COMPANY Finder, CEO Relationship) to produce analysis results. Concepts: PERSON, COMPANY, CEO Relationship. Models represented by: lists of terms, dictionaries, regular expressions, pattern files, statistical models, etc.]
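
The annotator chain on this slide can be illustrated with a small plain-Java sketch. It does not use the UIMA API: `Annotation`, `Annotator`, `termFinder`, `ceoRelationFinder` and `run` are hypothetical names chosen for illustration, the sample sentence is an assumption loosely based on the words shown on the Text Analytics slide, and the "CEO of" rule is a deliberately naive stand-in for a real relationship model.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PipelineSketch {
    /** A typed span over the text; a hypothetical stand-in for a CAS annotation. */
    static final class Annotation {
        final String type;
        final int begin;
        final int end;
        Annotation(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** A pipeline stage: reads the text and earlier results, appends its own. */
    interface Annotator {
        void process(String text, List<Annotation> annotations);
    }

    /** Finder configured by a list of terms (one of the model kinds on the slide). */
    static Annotator termFinder(String type, List<String> terms) {
        return (text, annotations) -> {
            for (String term : terms) {
                if (term.isEmpty()) continue; // an empty term would match everywhere
                int idx = text.indexOf(term);
                while (idx >= 0) {
                    annotations.add(new Annotation(type, idx, idx + term.length()));
                    idx = text.indexOf(term, idx + term.length());
                }
            }
        };
    }

    /** Links a PERSON to a later COMPANY when "CEO of" appears between them. */
    static Annotator ceoRelationFinder() {
        return (text, annotations) -> {
            List<Annotation> found = new ArrayList<>();
            for (Annotation p : annotations) {
                if (!p.type.equals("PERSON")) continue;
                for (Annotation c : annotations) {
                    if (c.type.equals("COMPANY") && c.begin >= p.end
                            && text.substring(p.end, c.begin).contains("CEO of")) {
                        found.add(new Annotation("CeoOf", p.begin, c.end));
                    }
                }
            }
            annotations.addAll(found);
        };
    }

    /** Runs the stages in order over one shared list of annotations. */
    static List<Annotation> run(String text, List<Annotator> pipeline) {
        List<Annotation> annotations = new ArrayList<>();
        for (Annotator stage : pipeline) {
            stage.process(text, annotations);
        }
        return annotations;
    }

    public static void main(String[] args) {
        String text = "Fred Center is the CEO of Center Micros."; // assumed sample
        List<Annotator> pipeline = Arrays.asList(
                termFinder("PERSON", Arrays.asList("Fred Center")),
                termFinder("COMPANY", Arrays.asList("Center Micros")),
                ceoRelationFinder());
        for (Annotation a : run(text, pipeline)) {
            System.out.println(a.type + ": " + text.substring(a.begin, a.end));
        }
    }
}
```

The point of the sketch is the ordering dependency: the relationship stage only works because the PERSON and COMPANY finders have already written their results into the shared annotation list.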

5 Sample Annotator: Java Code

    /**
     * This annotator searches for person titles using simple string matching.
     *
     * @param aTCAS TCAS containing document text and previously discovered
     *        annotations, and to which new annotations are to be written.
     * @param aResultSpec A list of output types and features that this annotator
     *        should produce.
     *
     * @see com.ibm.uima.analysis_engine.annotator.TextAnnotator#process(TCAS, ResultSpecification)
     */
    public void process(TCAS aTCAS, ResultSpecification aResultSpec)
        throws AnnotatorProcessException {
      try {
        // If the ResultSpec doesn't include the PersonTitle type, we have
        // nothing to do.
        if (!aResultSpec.containsType("example.PersonTitle")) {
          return;
        }
        if (mContainingType == null) {
          // Search the whole document for PersonTitle annotations
          String text = aTCAS.getDocumentText();
          annotateRange(aTCAS, text, 0, aResultSpec);
        } else {
          // Search only within annotations of type mContainingType
          // Get an iterator over the annotations of type mContainingType.
          FSIterator it = aTCAS.getAnnotationIndex(mContainingType).iterator();
          // Loop over the iterator.
          while (it.isValid()) {
            // Get the next annotation from the iterator
            AnnotationFS annot = (AnnotationFS) it.get();
            // Get text covered by this annotation
            String coveredText = annot.getCoveredText();
            // Get begin position of this annotation
            int annotBegin = annot.getBegin();
            // Search for matches within this annotation
            annotateRange(aTCAS, coveredText, annotBegin, aResultSpec);
            // Advance the iterator.
            it.moveToNext();
          }
        }
      } catch (Exception e) {
        throw new AnnotatorProcessException(e);
      }
    }
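
The slide calls `annotateRange` but does not show it. The following is a minimal plain-Java sketch of what such a helper might do, assuming it scans a range of text for a fixed list of titles and reports offsets relative to the whole document. The `TITLES` list, the return type and the two-argument signature are assumptions made for illustration; the real method takes the TCAS and result specification and writes annotations into the TCAS instead of returning spans.

```java
import java.util.ArrayList;
import java.util.List;

public class TitleMatcher {
    // Hypothetical title list; the original annotator's configuration is not shown.
    static final String[] TITLES = {"Mr.", "Mrs.", "Dr.", "Prof."};

    /**
     * Finds every title inside rangeText using simple string matching.
     * Each result is {begin, end} in document coordinates, obtained by
     * adding rangeBegin (the start of the range within the document).
     */
    static List<int[]> annotateRange(String rangeText, int rangeBegin) {
        List<int[]> spans = new ArrayList<>();
        for (String title : TITLES) {
            int pos = rangeText.indexOf(title);
            while (pos >= 0) {
                spans.add(new int[] {rangeBegin + pos, rangeBegin + pos + title.length()});
                pos = rangeText.indexOf(title, pos + title.length());
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Pretend this sentence starts at offset 100 of a larger document.
        for (int[] span : annotateRange("Dr. Smith met Mr. Jones", 100)) {
            System.out.println(span[0] + "-" + span[1]);
        }
    }
}
```

Passing the begin offset of the containing annotation, as `process` does with `annotBegin`, is what keeps the reported positions valid for the full document even when only a sub-range is searched.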

6 Sample Annotator: AFST Grammar Syntax

    # Shallow parser cascade: level 8
    honour % SUB[], PSUB[], Phrase[] ;
    boundary % Sentence[] ;
    #_____
    #
    auxtensed = Token[_unilex=~"VB+AUX:P"] | Token[_unilex=~"VB+AUX:Z"] | Token[_unilex=~"VB+AUX:D"] ;
    vrbtensed = Token[_unilex=~"VB-AUX:P"] | Token[_unilex=~"VB-AUX:Z"] | Token[_unilex=~"VB-AUX:D"] ;
    vrbuntensed = Token[_unilex=~"VB-AUX:I"] ;
    vrbgrpmodal =
      ( VG[@descend].
        Token[_unilex=~"MD"].
        Token[_unilex=~"RB"]*.
        ( ( Token[_unilex=~"VB-AUX:I"] )
        | ( Token[_unilex=~"VB+AUX:I"]. Token[_unilex=~"VB-AUX:G"] ) ).
        Token[_unilex=~"RB"]*. )
      | ( PVG[@descend].
          Token[_unilex=~"MD"].
          Token[_unilex=~"RB"]*.
          Token[_unilex=~"VB+AUX:I"].
          Token[_unilex=~"RB"]*.
          Token[_unilex=~"VB-AUX:N"].
          Token[_unilex=~"RB"]*. ) ;
    vrbgrpinfform = VG[@descend]. Token[_orth=~*SWORD]*. Token[_unilex=~"VB:I"]. ;
    #_____
    simplenp = NP[] ;        # simple noun phrase
    possnp = PNP[] ;         # possessive noun phrase
    npp = NPP[] ;            # noun phrase with a trailing PP
    nplist = NPList[] ;      # a list of NP's
    complexnp = CNP[] ;      # complex (appositive) NP
    npphrase = :simplenp | :possnp | :npp | :nplist | :complexnp ;  # an entity behaving like an NP
    #______
    export scannerEight =
      ( :vrbgrptensed | :vrbgrpinfform ).
      Token[_unilex=~"RP"]|.
      /[OBJ. :npphrase. /]OBJ ;

7 Sample Annotator: Semantic Dictionary Authority File

8 The Consumability Challenge
  Building Analytics is a complex process
    - Requires highly trained individuals:
        NLP Experts
        UIMA Experts
        Advanced Java programmers with XML skills
    - Is very time consuming:
        Need time for learning the UIMA framework
        Need time for building the annotators

9 Key Features
  End-to-End Text Analytics Development Tool
    - Supports the full cycle of text analytics development activities
  Ease of Use
    - Insulates the user from the complexity of the underlying frameworks
  Concept-Centric
    - Lets the user think in terms of concepts as opposed to annotators and software components
  Extensibility
    - Supports plugging in new model types, model editors, results viewers, and exploration tools

10 Text Analytics Development Cycle
[Figure: development cycle diagram, entered at "Start". Activity labels: Corpus & Domain Exploration; Identify Domain-Relevant Concepts; Type System Development; Develop Concept Models; Configure & Assemble Application Analysis Engine; Run Analytics; Evaluate Discovery Results. Artifact labels: Ontology (Type System); Concept Models; Concept Finder; Structured Information; Evaluation Results.]

11 Scenario: Detecting Sentiments about Cars and Car Features

12 Demo

13 Conclusion
  This work addresses the text analytics consumability challenges with a platform that provides:
    - Support for the full cycle of text analytics development activities
    - Ease of use
    - Support for a concept-centric development process
    - Extensibility

14 Thank You
Merci
Shoukran

15 Overview
  Concepts
    - Concepts to find in text
  Documents
    - Corpora that can be used in analysis
  Concept Finders
    - Analysis Engines built from concept models
  Results
    - Results from running a Concept Finder on corpora

17 Domain Exploration
GlossEx: Domain Exploration Tool

18 Ontologies, Concepts and Models
  Ontology
    - A group of concepts in a domain
  Concept
    - A concept in the domain
  Model
    - Analytic for finding a specific concept

19 Building Models For Concepts
Build CarAspectModel using Semantic Dictionary CAT
  1. Enter a representative term
  2. Select synonyms (e.g. from WordNet)
  3. Store terms in a dictionary

20 Building Models For Concepts
Build CarAspectModel using Semantic Dictionary CAT
  1. Add representative terms
  2. Select synonyms (e.g. from WordNet)
  3. Store terms in a dictionary
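
The three dictionary-building steps can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the tool's implementation: the `SYNONYMS` map is a hard-coded, hypothetical stand-in for a lexical resource such as WordNet, and `SemanticDictionarySketch`, `addTerm` and `lookup` are invented names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SemanticDictionarySketch {
    // Hypothetical synonym source; a real tool might query WordNet instead.
    static final Map<String, List<String>> SYNONYMS = new HashMap<>();
    static {
        SYNONYMS.put("engine", Arrays.asList("motor", "powertrain"));
        SYNONYMS.put("brakes", Arrays.asList("braking system"));
    }

    // Dictionary entries: lower-cased term mapped to the concept it signals.
    private final Map<String, String> entries = new TreeMap<>();

    /**
     * Step 1: take a representative term.
     * Step 2: keep only the synonyms the user selected.
     * Step 3: store all kept terms under the concept.
     */
    void addTerm(String concept, String representative, Set<String> selected) {
        entries.put(representative.toLowerCase(Locale.ROOT), concept);
        List<String> candidates = SYNONYMS.get(representative);
        if (candidates == null) {
            candidates = Collections.emptyList();
        }
        for (String synonym : candidates) {
            if (selected.contains(synonym)) {
                entries.put(synonym.toLowerCase(Locale.ROOT), concept);
            }
        }
    }

    /** Returns the concept for a term, or null if the term is not in the dictionary. */
    String lookup(String term) {
        return entries.get(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        SemanticDictionarySketch dict = new SemanticDictionarySketch();
        dict.addTerm("CarAspect", "engine", new HashSet<>(Arrays.asList("motor")));
        System.out.println(dict.lookup("Motor"));
        System.out.println(dict.lookup("powertrain"));
    }
}
```

Keeping the synonym selection explicit mirrors step 2 on the slide: the resource proposes candidates, but only the terms the user accepts end up in the model.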

21 Building Models
Build CarSentimentModel using AFST CAT
  1. Drag and drop ConceptModels onto the WorkArea
  2. Interconnect to define the pattern sequence
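
The idea of interconnecting concept models into a pattern sequence can be illustrated with a much simpler mechanism than AFST: a fixed-length sequence match over per-token concept labels. This sketch is not a finite-state transducer and does not reflect how the AFST CAT compiles patterns; the labels (`CarAspect`, `Copula`, `Sentiment`, `O`) and the `findMatches` helper are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SequencePatternSketch {
    /**
     * Returns the start index of every position where the token labels
     * match the pattern exactly, label by label, with no gaps.
     */
    static List<Integer> findMatches(List<String> labels, List<String> pattern) {
        List<Integer> starts = new ArrayList<>();
        if (pattern.isEmpty()) {
            return starts; // nothing to match
        }
        for (int i = 0; i + pattern.size() <= labels.size(); i++) {
            boolean matches = true;
            for (int j = 0; j < pattern.size(); j++) {
                if (!pattern.get(j).equals(labels.get(i + j))) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                starts.add(i);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Labels for the assumed token sequence "the engine is great".
        List<String> labels = Arrays.asList("O", "CarAspect", "Copula", "Sentiment");
        // Hypothetical CarSentiment pattern: an aspect, a linking verb, a sentiment word.
        List<String> pattern = Arrays.asList("CarAspect", "Copula", "Sentiment");
        System.out.println(findMatches(labels, pattern));
    }
}
```

A grammar-based matcher adds optional elements, repetition and alternation on top of this, which is what the sample AFST grammar earlier in the deck expresses with `*` and `|`.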

22 Building ConceptFinders
Build a ConceptFinder for CarSentiments
  1. Select all relevant concepts
  2. The system generates a ConceptFinder for the selected concepts

23 Running Analytics to get Results
Run ConceptFinder on a Corpus
  1. Select ConceptFinder
  2. Select Corpus
  3. Run the analysis

24 Results Evaluation
Annotations Viewer

25 Iterative Refinement Tools
Concordance Viewer
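
A concordance view lists each occurrence of a term together with its surrounding context, which helps when refining a model iteratively. The sketch below shows one simple way to compute such lines in plain Java; `ConcordanceSketch`, `concordance` and the bracket formatting are assumptions for illustration and say nothing about how the tool's own viewer or indexer works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConcordanceSketch {
    /**
     * Returns one keyword-in-context line per case-insensitive occurrence of
     * keyword, showing up to width characters on each side and the hit in brackets.
     */
    static List<String> concordance(List<String> documents, String keyword, int width) {
        List<String> lines = new ArrayList<>();
        if (keyword.isEmpty()) {
            return lines; // an empty keyword would match at every position
        }
        int len = keyword.length();
        for (String doc : documents) {
            int pos = 0;
            while (pos + len <= doc.length()) {
                if (doc.regionMatches(true, pos, keyword, 0, len)) {
                    int start = Math.max(0, pos - width);
                    int end = Math.min(doc.length(), pos + len + width);
                    lines.add(doc.substring(start, pos)
                            + "[" + doc.substring(pos, pos + len) + "]"
                            + doc.substring(pos + len, end));
                    pos += len; // continue after this hit
                } else {
                    pos++;
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "The engine is loud but the Engine is strong.");
        for (String line : concordance(corpus, "engine", 4)) {
            System.out.println(line);
        }
    }
}
```

Scanning with `regionMatches` avoids lower-casing the whole document, so the reported offsets always refer to the original text.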

26 Results Evaluation
Collection Level Statistics: Comparing Results
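
Comparing results at collection level can be as simple as counting annotations per concept type across all documents of a run and then subtracting the counts of one run from another. The sketch below assumes each document's results are available as a list of type names; `CollectionStatsSketch`, `countByType` and `delta` are hypothetical names, and the deck does not say which statistics the tool actually reports.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CollectionStatsSketch {
    /** Counts annotations per type across all documents of one run. */
    static Map<String, Integer> countByType(List<List<String>> typesPerDocument) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> documentTypes : typesPerDocument) {
            for (String type : documentTypes) {
                counts.merge(type, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Per-type change in count from run A to run B (B minus A). */
    static Map<String, Integer> delta(Map<String, Integer> runA, Map<String, Integer> runB) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, Integer> e : runA.entrySet()) {
            result.put(e.getKey(), -e.getValue());
        }
        for (Map.Entry<String, Integer> e : runB.entrySet()) {
            result.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Two assumed runs over the same two-document corpus.
        Map<String, Integer> before = countByType(Arrays.asList(
                Arrays.asList("CarAspect", "Sentiment"),
                Arrays.asList("CarAspect")));
        Map<String, Integer> after = countByType(Arrays.asList(
                Arrays.asList("CarAspect", "Sentiment", "Sentiment"),
                Arrays.asList("CarAspect", "CarSentiment")));
        System.out.println(delta(before, after));
    }
}
```

A per-type delta like this makes it easy to see whether a model change added coverage for one concept while leaving the others untouched.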

27 Plugin Components: CATs & KoGs
[Figure: two plugin frameworks. CATs Plugin Framework labels: CAT, Semantic Dictionary UI, Dictionary Configurable Annotator, Configurable Annotator. KoGs Plugin Framework labels: KoG, Concordance Explorer UI, Concordance Indexer.]

