2009.04.29 - SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

Presentation transcript:

SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture 22: NLP for IR

SLIDE 2IS 240 – Spring 2009 Today Review –Cheshire III Design – GRID-based DLs NLP for IR Text Summarization Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer

SLIDE 3IS 240 – Spring 2009 Grid middleware Chemical Engineering Applications Application Toolkits Grid Services Grid Fabric Climate Data Grid Remote Computing Remote Visualization Collaboratories High energy physics Cosmology Astrophysics Combustion.…. Portals Remote sensors..… Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Storage, networks, computers, display devices, etc. and their associated local services Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.)

SLIDE 4IS 240 – Spring 2009 Chemical Engineering Applications Application Toolkits Grid Services Grid Fabric Grid middleware Climate Data Grid Remote Computing Remote Visualization Collaboratories High energy physics Cosmology Astrophysics Combustion Humanities computing Digital Libraries … Portals Remote sensors Text Mining Metadata management Search & Retrieval … Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Storage, networks, computers, display devices, etc. and their associated local services Grid Architecture (ECAI/AS Grid Digital Library Workshop) Bio-Medical

SLIDE 5IS 240 – Spring 2009 Grid IR Issues Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed). Very large-scale distribution of resources is a challenge for sub-second retrieval. Unlike most other typical Grid processes, IR is potentially less computing intensive and more data intensive. In many ways Grid IR replicates the process (and problems) of metasearch or distributed search.
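The metasearch comparison above can be illustrated with a minimal sketch of merging ranked result lists returned by several nodes. The function name `merge_results` and the sample scores are hypothetical, and the sketch assumes that scores from different nodes are directly comparable, which is exactly the assumption that tends to fail in real distributed search.

```python
import heapq
from itertools import islice


def merge_results(per_node_results, limit):
    """Merge per-node hit lists (each already sorted by descending score)."""
    merged = heapq.merge(*per_node_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, limit))


# Hypothetical (score, document id) hits from two grid nodes.
node_a = [(0.9, "a1"), (0.4, "a2")]
node_b = [(0.7, "b1"), (0.2, "b2")]

top = merge_results([node_a, node_b], limit=3)
```

Because `heapq.merge` is lazy, only as many hits as `limit` requires are pulled from each node's list.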

SLIDE 6IS 240 – Spring 2009 Context Environmental Requirements: –Very Large scale information systems Terabyte scale (Data Grid) Computationally expensive processes (Comp. Grid) Digital Preservation Analysis of data, not just retrieval (Data/Text Mining) Ease of Extensibility, Customizability (Python) Open Source Integrate not Re-implement "Web 2.0" – interactivity and dynamic interfaces

SLIDE 7IS 240 – Spring 2009 Context Data Grid Layer Data Grid SRB iRODS Digital Library Layer Application Layer Web Browser Multivalent Dedicated Client User Interface Apache+ Mod_Python+ Cheshire3 Protocol Handler Process Management Kepler Cheshire3 Query Results Query Results ExportParse Document Parsers Multivalent,... Natural Language Processing Information Extraction Text Mining Tools Tsujii Labs,... Classification Clustering Data Mining Tools Orange, Weka,... Query Results Search / Retrieve Index / Store Information System Cheshire3 User Interface MySRB PAWN Process Management Kepler iRODS rules Term Management Termine WordNet... Store

SLIDE 8IS 240 – Spring 2009 Cheshire3 Object Model UserStore User ConfigStore Object Database Query Record Transformer Records Protocol Handler Normaliser IndexStore Terms Server Document Group Ingest Process Documents Index RecordStore Parser Document Query ResultSet DocumentStore Document PreParser Extracter

SLIDE 9IS 240 – Spring 2009 Object Configuration One XML 'record' per non-data object Very simple base schema, with extensions as needed Identifiers for objects unique within a context (e.g., unique at individual database level, but not necessarily between all databases) Allows workflows to reference by identifier but act appropriately within different contexts. Allows multiple administrators to define objects without reference to each other

SLIDE 10IS 240 – Spring 2009 Grid Focus on ingest, not discovery (yet) Instantiate architecture on every node Assign one node as master, rest as slaves. Master then divides the processing as appropriate. Calls between slaves possible Calls as small, simple as possible: (objectIdentifier, functionName, *arguments) Typically: ('workflow-id', 'process', 'document-id')
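A minimal sketch of how a call tuple of the form (objectIdentifier, functionName, *arguments) might be dispatched on a worker node. The `Workflow` class, the `REGISTRY` dictionary, and `dispatch` are hypothetical names used for illustration only; they are not Cheshire3's API.

```python
class Workflow:
    """Stand-in for a configured object that can process documents."""

    def __init__(self, identifier):
        self.identifier = identifier

    def process(self, document_id):
        # A real worker would fetch the document from the data grid here.
        return f"{self.identifier} processed {document_id}"


# Every node instantiates the same architecture, keyed by object identifier.
REGISTRY = {"workflow-id": Workflow("workflow-id")}


def dispatch(call):
    """Resolve the object and method named in the tuple, then invoke it."""
    object_id, function_name, *arguments = call
    target = REGISTRY[object_id]
    return getattr(target, function_name)(*arguments)


result = dispatch(("workflow-id", "process", "document-id"))
```

Keeping the message down to identifiers means only small strings cross the network; each node resolves them against its own local copy of the configuration.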

SLIDE 11IS 240 – Spring 2009 Grid Architecture Master Task Slave Task 1 Slave Task N Data Grid GPFS Temporary Storage (workflow, process, document) fetch document document extracted data

SLIDE 12IS 240 – Spring 2009 Grid Architecture - Phase 2 Master Task Slave Task 1 Slave Task N Data Grid GPFS Temporary Storage (index, load) store index fetch extracted data

SLIDE 13IS 240 – Spring 2009 Workflow Objects Written as XML within the configuration record. Rewrites and compiles to Python code on object instantiation Current instructions: –object –assign –fork –for-each –break/continue –try/except/raise –return –log (= send text to default logger object) Yes, no if!

SLIDE 14IS 240 – Spring 2009 Workflow example workflow.SimpleWorkflow Unparsable Record ”Loaded Record:” + input.id

SLIDE 15IS 240 – Spring 2009 Text Mining Integration of Natural Language Processing tools Including: –Part of Speech taggers (noun, verb, adjective,...) –Phrase Extraction –Deep Parsing (subject, verb, object, preposition,...) –Linguistic Stemming (is/be fairy/fairy vs is/is fairy/fairi) Planned: Information Extraction tools
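The is/be, fairy/fairy versus is/is, fairy/fairi contrast can be reproduced with a toy sketch. The lemma table and the single suffix rule below are tiny hypothetical stand-ins: the rule imitates one step of a Porter-style stemmer, not any particular tool integrated in the system.

```python
# Hypothetical mini-lexicon for dictionary-based (linguistic) normalisation.
LEMMAS = {"is": "be", "was": "be", "fairies": "fairy"}


def lemmatise(word):
    """Look the word up in the lexicon; unknown words pass through unchanged."""
    return LEMMAS.get(word.lower(), word.lower())


def strip_suffix(word):
    """Porter-style rule: a final 'y' becomes 'i' when the stem contains a vowel."""
    word = word.lower()
    stem = word[:-1]
    if word.endswith("y") and any(ch in "aeiou" for ch in stem):
        return stem + "i"
    return word


sentence = ["is", "fairy"]
linguistic = [lemmatise(w) for w in sentence]      # dictionary lookup
algorithmic = [strip_suffix(w) for w in sentence]  # blind suffix rewriting
```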

SLIDE 16IS 240 – Spring 2009 Data Mining Integration of toolkits difficult unless they support sparse vectors as input - text is high dimensional, but has lots of zeroes Focus on automatic classification for predefined categories rather than clustering Algorithms integrated/implemented: –Perceptron, Neural Network (pure python) –Naïve Bayes (pure python) –SVM (libsvm integrated with python wrapper) –Classification Association Rule Mining (Java)
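A small sketch of why sparse input matters for text: each document is stored as a dictionary of only its non-zero term counts, and a dot product touches only the terms that are present. The helper names `to_sparse` and `dot` are illustrative assumptions, not part of any toolkit named above.

```python
from collections import Counter


def to_sparse(tokens):
    """Represent a document as {term: count}, omitting all zero entries."""
    return dict(Counter(tokens))


def dot(u, v):
    """Dot product that iterates over the smaller of the two sparse vectors."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(term, 0) for term, weight in u.items())


doc_a = to_sparse("grid search grid retrieval".split())
doc_b = to_sparse("grid retrieval evaluation".split())
similarity = dot(doc_a, doc_b)
```

A dense representation would need one slot per vocabulary term for every document, almost all of them zero.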

SLIDE 17IS 240 – Spring 2009 Data Mining Modelled as multi-stage PreParser object (training phase, prediction phase) Plus need for AccumulatingDocumentFactory to merge document vectors together into single output for training some algorithms (e.g., SVM) Prediction phase attaches metadata (predicted class) to document object, which can be stored in DocumentStore Document vectors generated per index per document, so integrated NLP document normalization for free

SLIDE 18IS 240 – Spring 2009 Data Mining + Text Mining Testing integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies. Computational grid for distributing expensive NLP analysis. Results show better accuracy with fewer attributes.

SLIDE 19IS 240 – Spring 2009 Applications (1) Automated Collection Strength Analysis Primary aim: Test if data mining techniques could be used to develop a coverage map of items available in the London libraries. The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic level metadata records. This involved very large scale processing of records to: –Deduplicate millions of records –Enrich deduplicated records against database of 45 million –Automatically reclassify enriched records using machine learning processes (Naïve Bayes)

SLIDE 20IS 240 – Spring 2009 Applications (1) Data mining enhances collection mapping strategies by making a larger proportion of the data usable, by discovering hidden relationships between textual subjects and hierarchically based classification systems. The graph shows the comparison of numbers of books classified in the domain of Psychology originally and after enhancement using data mining

SLIDE 21IS 240 – Spring 2009 Applications (2) Assessing the Grade Level of NSDL Education Material The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines for all grade levels. These are harvested into the SRB data grid. Working with SDSC we assessed the grade-level relevance by examining the vocabulary used in the material present at each registered URL. We determined the vocabulary-based grade-level with the Flesch-Kincaid grade level assessment. The domain of each website was then determined using data mining techniques (TF-IDF derived fast domain classifier). This processing was done on the Teragrid cluster at SDSC.

SLIDE 22IS 240 – Spring 2009 Applications (2) The formula for the Flesch Reading Ease Score: FRES = 206.835 - 1.015 * ((total words)/(total sentences)) - 84.6 * ((total syllables)/(total words)) The Flesch-Kincaid Grade Level Formula: FKGLF = 0.39 * ((total words)/(total sentences)) + 11.8 * ((total syllables)/(total words)) - 15.59 The Domain was determined by: –Basing the domains on the AAAS Benchmarks –Taking samples from each of the domain areas being examined and producing scored and ranked vocabulary lists for each domain –Passing each token in a document through a lookup function against this table and calculating tallies for the entire document –Using these tallies to rank the likelihood of the document being about each topic; a statistical pass over the results returns only those topics that are above a certain threshold.
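A hedged sketch of both steps on this slide. The readability functions use the standard published constants for the Flesch Reading Ease and Flesch-Kincaid Grade Level formulas (206.835, 1.015, 84.6 and 0.39, 11.8, 15.59). The domain vocabulary table, its weights, the threshold, and the function names are hypothetical; the project's actual TF-IDF derived tables are not given in the slides.

```python
from collections import defaultdict


def flesch_reading_ease(words, sentences, syllables):
    """Higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical counts for one page of text.
fres = flesch_reading_ease(words=100, sentences=5, syllables=150)
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)

# Hypothetical lookup table: token -> {domain: weight}.
DOMAIN_VOCAB = {
    "cell": {"biology": 2.0},
    "energy": {"physics": 1.5, "biology": 0.5},
    "force": {"physics": 2.0},
}


def rank_domains(tokens, threshold):
    """Tally per-domain weights, rank them, and keep those above the threshold."""
    tallies = defaultdict(float)
    for token in tokens:
        for domain, weight in DOMAIN_VOCAB.get(token.lower(), {}).items():
            tallies[domain] += weight
    ranked = sorted(tallies.items(), key=lambda item: item[1], reverse=True)
    return [(domain, score) for domain, score in ranked if score > threshold]


domains = rank_domains("Force and energy act on the cell".split(), threshold=3.0)
```

Counting syllables reliably is the hard part in practice, which is why the sketch takes the counts as inputs rather than estimating them.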

SLIDE 23IS 240 – Spring 2009 Today Natural Language Processing and IR –Based on Papers in Reader and on David Lewis & Karen Sparck Jones, “Natural Language Processing for Information Retrieval”, Communications of the ACM, 39(1), Jan. Text summarization: Lecture from Ed Hovy (USC)

SLIDE 24IS 240 – Spring 2009 Natural Language Processing and IR The main approach in applying NLP to IR has been to attempt to address –Phrase usage vs individual terms –Search expansion using related terms/concepts –Attempts to automatically exploit or assign controlled vocabularies

SLIDE 25IS 240 – Spring 2009 NLP and IR Much early research showed that (at least in the restricted test databases tested) –Indexing documents by individual terms corresponding to words and word stems produces retrieval results at least as good as when indexes use controlled vocabularies (whether applied manually or automatically) –Constructing phrases or “pre-coordinated” terms provides only marginal and inconsistent improvements

SLIDE 26IS 240 – Spring 2009 NLP and IR Not clear why intuitively plausible improvements to document representation have had little effect on retrieval results when compared to statistical methods –E.g. Use of syntactic role relations between terms has shown no improvement in performance over “bag of words” approaches

SLIDE 27IS 240 – Spring 2009 General Framework of NLP Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 28IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation John runs. Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 29IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation John runs. John run+s. P-N V 3-pre N plu Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 30IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation John runs. John run+s. P-N V 3-pre N plu S NP P-N John VP V run Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 31IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation John runs. John run+s. P-N V 3-pre N plu S NP P-N John VP V run Pred: RUN Agent:John Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 32IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation John runs. John run+s. P-N V 3-pre N plu S NP P-N John VP V run Pred: RUN Agent:John John is a student. He runs. Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 33IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Domain Analysis Appelt:1999 Tokenization Part of Speech Tagging Term recognition (Ananiadou) Inflection/Derivation Compounding Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 34IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 35IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge Incomplete Lexicons Open class words Terms Term recognition Named Entities Company names Locations Numerical expressions Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 36IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge Incomplete Grammar Syntactic Coverage Domain Specific Constructions Ungrammatical Constructions Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 37IS 240 – Spring 2009 Syntactic Analysis General Framework of NLP Morphological and Lexical Processing Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge Incomplete Domain Knowledge Interpretation Rules Predefined Aspects of Information Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 38IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge (2) Ambiguities: Combinatorial Explosion Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 39IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge (2) Ambiguities: Combinatorial Explosion Most words in English are ambiguous in terms of their parts of speech. runs: v/3pre, n/plu clubs: v/3pre, n/plu and two meanings Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
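A minimal sketch of why part-of-speech ambiguity leads to combinatorial explosion: the number of candidate tag sequences is the product of the number of tags each word allows. The lexicon below is a toy assumption using tag names adapted from the slides.

```python
from itertools import product
from math import prod

# Toy lexicon: each word maps to the tags it may carry.
LEXICON = {
    "john": ["P-N"],
    "runs": ["V-3pre", "N-plu"],
    "clubs": ["V-3pre", "N-plu"],
}


def tag_sequences(words):
    """Enumerate every combination of per-word tags."""
    options = [LEXICON[w.lower()] for w in words]
    return list(product(*options))


def count_readings(words):
    """Count the combinations without enumerating them."""
    return prod(len(LEXICON[w.lower()]) for w in words)


sequences = tag_sequences(["John", "runs", "clubs"])
readings = count_readings(["John", "runs", "clubs"])
```

With two tags per word, a 20-word sentence already allows over a million tag sequences before any structural ambiguity is considered.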

SLIDE 40IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge (2) Ambiguities: Combinatorial Explosion Structural Ambiguities Predicate-argument Ambiguities Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 41IS 240 – Spring 2009 Structural Ambiguities (1)Attachment Ambiguities John bought a car with large seats. John bought a car with $3000. (2) Scope Ambiguities young women and men in the room (3)Analytical Ambiguities Visiting relatives can be boring. The manager of Yaxing Benz, a Sino-German joint venture The manager of Yaxing Benz, Mr. John Smith John bought a car with Mary. $3000 can buy a nice car. Semantic Ambiguities(1) Semantic Ambiguities(2) Every man loves a woman. Co-reference Ambiguities Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 42IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge (2) Ambiguities: Combinatorial Explosion Structural Ambiguities Predicate-argument Ambiguities Combinatorial Explosion Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 43IS 240 – Spring 2009 Note: Ambiguities vs Robustness More comprehensive knowledge: More Robust big dictionaries comprehensive grammar More comprehensive knowledge: More ambiguities Adaptability: Tuning, Learning Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 44IS 240 – Spring 2009 Framework of IE IE as compromise NLP Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 45IS 240 – Spring 2009 Syntactic Analysis General Framework of NLP Morphological and Lexical Processing Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge Incomplete Domain Knowledge Interpretation Rules Predefined Aspects of Information Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 46IS 240 – Spring 2009 Syntactic Analysis General Framework of NLP Morphological and Lexical Processing Semantic Analysis Context processing Interpretation Difficulties of NLP (1) Robustness: Incomplete Knowledge Incomplete Domain Knowledge Interpretation Rules Predefined Aspects of Information Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 47IS 240 – Spring 2009 Techniques in IE (1) Domain Specific Partial Knowledge: Knowledge relevant to information to be extracted (2) Ambiguities: Ignoring irrelevant ambiguities Simpler NLP techniques (4) Adaptation Techniques: Machine Learning, Trainable systems (3) Robustness: Coping with Incomplete dictionaries (open class words) Ignoring irrelevant parts of sentences Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 48IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation Open class words: Named entity recognition (ex) Locations Persons Companies Organizations Position names Domain specific rules:, Inc. Mr.. Machine Learning: HMM, Decision Trees Rules + Machine Learning Part of Speech Tagger FSA rules Statistic taggers 95 % F-Value 90 Domain Dependent Local Context Statistical Bias Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester
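The domain-specific rules hinted at here ("Inc.", "Mr.") can be sketched as two regular expressions. The patterns and the sample sentence are illustrative assumptions, not the rules of any system named in the slides, and they would miss many real names.

```python
import re

# A capitalised word sequence following "Mr." is taken to be a person.
PERSON = re.compile(r"\bMr\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")

# A capitalised word sequence followed by "Inc." is taken to be a company.
COMPANY = re.compile(r"\b([A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*),?\s+Inc\.")

text = "Mr. John Smith joined Acme Widgets, Inc. last year."

persons = PERSON.findall(text)
companies = COMPANY.findall(text)
```

Rules like these need no lexicon entry for the name itself, which is how they cope with open-class words.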

SLIDE 49IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation FASTUS 1. Complex Words: Recognition of multi-words and proper names 2. Basic Phrases: Simple noun groups, verb groups and particles 3. Complex phrases: Complex noun groups and verb groups 4. Domain Events: Patterns for events of interest to the application Basic templates are to be built. 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. Based on finite-state automata (FSA) Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 50IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation FASTUS 1. Complex Words: Recognition of multi-words and proper names 2. Basic Phrases: Simple noun groups, verb groups and particles 3. Complex phrases: Complex noun groups and verb groups 4. Domain Events: Patterns for events of interest to the application Basic templates are to be built. 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. Based on finite-state automata (FSA) Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 51IS 240 – Spring 2009 General Framework of NLP Morphological and Lexical Processing Syntactic Analysis Semantic Analysis Context processing Interpretation FASTUS 1. Complex Words: Recognition of multi-words and proper names 2. Basic Phrases: Simple noun groups, verb groups and particles 3. Complex phrases: Complex noun groups and verb groups 4. Domain Events: Patterns for events of interest to the application Basic templates are to be built. 5. Merging Structures: Templates from different parts of the texts are merged if they provide information about the same entity or event. Based on finite-state automata (FSA) Slides from Prof. J. Tsujii, Univ of Tokyo and Univ of Manchester

SLIDE 52IS 240 – Spring 2009 Using NLP Strzalkowski (in Reader) TextNLPrepres Dbase search TAGGER NLP: PARSERTERMS

SLIDE 53IS 240 – Spring 2009 Using NLP INPUT SENTENCE The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin. TAGGED SENTENCE The/dt former/jj Soviet/jj President/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

SLIDE 54IS 240 – Spring 2009 Using NLP TAGGED & STEMMED SENTENCE the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn invade/vbd wisconsin/np ./per

SLIDE 55IS 240 – Spring 2009 Using NLP PARSED SENTENCE [assert [[perf [have]][[verb[BE]] [subject [np[n PRESIDENT][t_pos THE] [adj[FORMER]][adj[SOVIET]]]] [adv EVER] [sub_ord[SINCE [[verb[INVADE]] [subject [np [n TANK][t_pos A] [adj [RUSSIAN]]]] [object [np [name [WISCONSIN]]]]]]]]]

SLIDE 56IS 240 – Spring 2009 Using NLP EXTRACTED TERMS & WEIGHTS President soviet President+soviet president+former Hero hero+local Invade tank Tank+invade tank+russian Russian wisconsin
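A simplified sketch of how head+modifier terms such as president+former or tank+russian could be derived from the tagged and stemmed sentence. It pairs each noun with the adjectives immediately before it using the tags alone; this is an assumption made for illustration, since the system described here works from the full parse and also yields relations such as tank+invade that this sketch cannot produce.

```python
TAGGED = (
    "the/dt former/jj soviet/jj president/nn have/vbz be/vbn a/dt "
    "local/jj hero/nn ever/rb since/in a/dt russian/jj tank/nn "
    "invade/vbd wisconsin/np ./per"
)


def parse_tagged(text):
    """Split 'word/tag' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


def head_modifier_pairs(tagged):
    """Pair each noun with the run of adjectives directly preceding it."""
    pairs = []
    modifiers = []
    for word, tag in tagged:
        if tag == "jj":
            modifiers.append(word)
        elif tag == "nn":
            pairs.extend(f"{word}+{modifier}" for modifier in modifiers)
            modifiers = []
        else:
            modifiers = []  # any other tag breaks the adjective run
    return pairs


pairs = head_modifier_pairs(parse_tagged(TAGGED))
```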

SLIDE 57IS 240 – Spring 2009 Same Sentence, different sys INPUT SENTENCE The former Soviet President has been a local hero ever since a Russian tank invaded Wisconsin. TAGGED SENTENCE (using uptagger from Tsujii) The/DT former/JJ Soviet/NNP President/NNP has/VBZ been/VBN a/DT local/JJ hero/NN ever/RB since/IN a/DT Russian/JJ tank/NN invaded/VBD Wisconsin/NNP ./.

SLIDE 58IS 240 – Spring 2009 Same Sentence, different sys CHUNKED Sentence (chunkparser – Tsujii) (TOP (S (NP (DT The) (JJ former) (NNP Soviet) (NNP President) ) (VP (VBZ has) (VP (VBN been) (NP (DT a) (JJ local) (NN hero) ) (ADVP (RB ever) ) (SBAR (IN since) (S (NP (DT a) (JJ Russian) (NN tank) ) (VP (VBD invaded) (NP (NNP Wisconsin) ) ) ) ) ) ) (..) )

SLIDE 59IS 240 – Spring 2009 Same Sentence, different sys Enju Parser
ROOT ROOT ROOT ROOT -1 ROOT been be VBN VB 5
been be VBN VB 5 ARG1 President president NNP NNP 3
been be VBN VB 5 ARG2 hero hero NN NN 8
a a DT DT 6 ARG1 hero hero NN NN 8
a a DT DT 11 ARG1 tank tank NN NN 13
local local JJ JJ 7 ARG1 hero hero NN NN 8
The the DT DT 0 ARG1 President president NNP NNP 3
former former JJ JJ 1 ARG1 President president NNP NNP 3
Russian russian JJ JJ 12 ARG1 tank tank NN NN 13
Soviet soviet NNP NNP 2 MOD President president NNP NNP 3
invaded invade VBD VB 14 ARG1 tank tank NN NN 13
invaded invade VBD VB 14 ARG2 Wisconsin wisconsin NNP NNP 15
has have VBZ VB 4 ARG1 President president NNP NNP 3
has have VBZ VB 4 ARG2 been be VBN VB 5
since since IN IN 10 MOD been be VBN VB 5
since since IN IN 10 ARG1 invaded invade VBD VB 14
ever ever RB RB 9 ARG1 since since IN IN 10

SLIDE 60IS 240 – Spring 2009 NLP & IR Indexing –Use of NLP methods to identify phrases Test weighting schemes for phrases –Use of more sophisticated morphological analysis Searching –Use of two-stage retrieval Statistical retrieval Followed by more sophisticated NLP filtering
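A minimal sketch of two-stage retrieval: a cheap statistical pass ranks all documents, and a stricter filter runs only over the top candidates. Here the second stage is a plain exact-phrase check standing in for "more sophisticated NLP filtering"; the function names, scoring, and sample documents are all hypothetical.

```python
def statistical_score(query_terms, doc_tokens):
    """Stage 1: bag-of-words score, the summed frequency of the query terms."""
    return sum(doc_tokens.count(term) for term in query_terms)


def contains_phrase(phrase, doc_tokens):
    """Stage 2 stand-in: require the query terms to occur as an adjacent phrase."""
    n = len(phrase)
    return any(doc_tokens[i:i + n] == phrase for i in range(len(doc_tokens) - n + 1))


def two_stage_search(query, docs, top_k):
    phrase = query.lower().split()
    tokenised = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    scores = {doc_id: statistical_score(phrase, toks) for doc_id, toks in tokenised.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    candidates = [doc_id for doc_id in ranked[:top_k] if scores[doc_id] > 0]
    # Only the short candidate list pays for the more expensive check.
    return [doc_id for doc_id in candidates if contains_phrase(phrase, tokenised[doc_id])]


DOCS = {
    "d1": "the russian tank crossed the border",
    "d2": "a tank built by a russian engineer",
    "d3": "local hero wins award",
}

hits = two_stage_search("russian tank", DOCS, top_k=2)
```

Both d1 and d2 score equally in the statistical pass; only the second stage separates the document that actually contains the phrase.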

SLIDE 61IS 240 – Spring 2009 NLP & IR Lewis and Sparck Jones suggest research in three areas –Examination of the words, phrases and sentences that make up a document description and express the combinatory, syntagmatic relations between single terms –The classificatory structure over the document collection as a whole, indicating the paradigmatic relations between terms and permitting controlled vocabulary indexing and searching –Using NLP-based methods for searching and matching

SLIDE 62IS 240 – Spring 2009 NLP & IR Issues Is natural language indexing using more NLP knowledge needed? Or should controlled vocabularies be used? Can NLP in its current state provide the improvements needed? How to test?

SLIDE 63IS 240 – Spring 2009 NLP & IR New “Question Answering” track at TREC has been exploring these areas –Usually statistical methods are used to retrieve candidate documents –NLP techniques are used to extract the likely answers from the text of the documents