CROSSMARC kick-off meeting, 27/03/01

Presentation transcript:

LTG Background
XML-based Processing
  – Several years of experience in developing XML-based software
  – LT XML Tools
  – Pipeline architecture
Named Entity Recognition
  – LT TTT Tools
  – MUC-7 system

LT XML
A suite of tools which communicate using the LT XML API.
All use the same query language to access and manipulate subparts of XML documents.
Simple tools can be composed together into complex applications.
Programs include sggrep, sgcount, sgsort, xmlnorm, rxp, knit.
Additional programs: xmlperl, xmlquery.

Pipeline Architecture
An XML document is piped through a series of programs.
Each program targets a particular part of the document via a particular query.
Each program performs some operation, e.g. adding or removing mark-up, making other modifications to the structure of the XML, or extracting or counting subparts of the document.
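The pipeline idea can be illustrated with a minimal Python sketch. It does not use the LT XML tools or their query language: the stage builders `add_attribute` and `count_elements`, the `checked` attribute, and the sample document are all hypothetical, and the queries are ElementTree path expressions standing in for LT XML queries.

```python
import xml.etree.ElementTree as ET


def add_attribute(query, name, value):
    """Build a stage that sets an attribute on every element the query selects."""
    def stage(root):
        for elem in root.findall(query):
            elem.set(name, value)
        return root
    return stage


def count_elements(query, counts):
    """Build a stage that records how many elements the query selects."""
    def stage(root):
        counts[query] = len(root.findall(query))
        return root
    return stage


def run_pipeline(xml_text, stages):
    """Parse the document once, pass it through each stage in order, serialise it."""
    root = ET.fromstring(xml_text)
    for stage in stages:
        root = stage(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical input document and stage list, for illustration only.
counts = {}
doc = "<TEXT><P>First paragraph.</P><P>Second paragraph.</P></TEXT>"
result = run_pipeline(doc, [
    add_attribute(".//P", "checked", "yes"),  # adds mark-up to the P elements
    count_elements(".//P", counts),           # counts the same subparts
])
print(result)
print(counts)
```

Each stage pairs a query with one operation, so stages can be reordered or reused without knowing about one another, which is the property the slide attributes to the pipeline design.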

LT TTT: Text Tokenisation Tool
A suite of XML tools designed to tokenise from the most basic level through to high-level mark-up.
Useful for many linguistic applications including corpus annotation.
Used by the LTG for their MUC-7 system.

LT TTT: programs
ltpos: a part-of-speech tagger and sentence boundary disambiguator
fsgmatch: a transducer operating over strings of characters or strings of XML elements using hand-written grammar rules
Other programs
  – sggrep, xmlperl, sgdelmarkup

LT TTT: grammar files for fsgmatch
Titles and paragraphs
Sub-word character sequences
Words
Numbers (300, three hundred)
MUC-7 style NUMEX and TIMEX elements
In-text citations
Reference lists
Chunks: noun groups and verb groups (LT CHUNK)
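To give a feel for what a hand-written rule for the "Numbers" entry might do, here is a rough Python sketch. It is not fsgmatch and does not use the fsgmatch grammar format: a regular expression stands in for the grammar rule, and the `NUM` element name, the word lists, and the function `mark_numbers` are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately tiny word lists; a real grammar would cover far more.
UNITS = "one|two|three|four|five|six|seven|eight|nine|ten"
MULTIPLIERS = "hundred|thousand|million"

# One "rule": either a digit string (optionally with thousands separators)
# or a spelled-out unit followed by any number of multipliers.
NUMBER_RULE = re.compile(
    rf"\b(?:\d+(?:,\d{{3}})*|(?:{UNITS})(?:\s+(?:{MULTIPLIERS}))*)\b",
    re.IGNORECASE,
)


def mark_numbers(text):
    """Wrap each match of the rule in a NUM element (assumes plain text input)."""
    return NUMBER_RULE.sub(lambda m: f"<NUM>{m.group(0)}</NUM>", text)


marked = mark_numbers("They sold 300 units and three hundred more.")
print(marked)
```

The sketch covers both forms listed on the slide, digits ("300") and words ("three hundred"), with a single rule; the word boundaries keep it from firing inside longer words such as "often".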

ltpos
Statistical (maximum entropy) component
Disambiguates full stops (and optionally adds sentence mark-up)
Also disambiguates sentence-initial capitals
Uses Penn Treebank tagset; trained on the Brown corpus
Adds POS tag as value of attribute on W element
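The last point, storing the tag as an attribute on each W element, can be sketched in Python as follows. This is not ltpos: a toy lookup table replaces the maximum entropy model, and the attribute name `C`, the lexicon, and the function `tag_words` are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon with Penn Treebank tags; stands in for the statistical model.
TOY_LEXICON = {
    "the": "DT",
    "a": "DT",
    "company": "NN",
    "growth": "NN",
    "announced": "VBD",
}


def tag_words(xml_text, attr="C", default="NN"):
    """Set a POS attribute on every W element, falling back to a default tag."""
    root = ET.fromstring(xml_text)
    for w in root.iter("W"):
        word = (w.text or "").lower()
        w.set(attr, TOY_LEXICON.get(word, default))
    return ET.tostring(root, encoding="unicode")


tagged = tag_words(
    "<P><W>The</W> <W>company</W> <W>announced</W> <W>a</W> <W>growth</W></P>"
)
print(tagged)
```

Keeping the tag in an attribute leaves the text content of the document untouched, so later stages that query W elements can read the tag without re-tokenising.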

LT TTT: example pipeline
plain2xml.perl \
| fsgmatch -q ".*/TEXT" GRAM/char/paras.gr \
| fsgmatch -q ".*/P" GRAM/char/words.gr \
| ltpos -q ".*/TEXT" -qs ".*/P" -qw ".*/W" -std_form \
  -sent SENT resource.xml \
| fsgmatch -q ".*/P" GRAM/xml/numbers.gr \
| fsgmatch -q ".*/P" GRAM/xml/numex.gr \
| fsgmatch -q ".*/P" GRAM/xml/timex.gr

LT TTT: example input
In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share. Late last night the company announced a growth of 20%.

LT TTT: example output
In July 1995 CEG Corp. posted net of $ 102 million, or 34 cents a share. Late last night the company announced a growth of 20 %.

Named Entity Recognition: MUC-7 mark-up
He was one of 118 Nazi rocket engineers secretly brought to the United States after the war. The scientists included Wernher von Braun, the father of the American rocket programs.
MCI has long said it would be a bidder and would start the bidding at $175 million. MCI has teamed up with News Corp..

LTG’s MUC-7 System
A pipeline made up of calls to LT TTT tools: ltpos and many calls to fsgmatch using different resource grammars.
Early stages (before tagging) recognise NUMEX and TIMEX elements.
Complex final stages (after tagging) recognise ENAMEX elements. These involve calls to fsgmatch using ENAMEX grammars and lexical resources (e.g. first names, gazetteers of place names), interspersed with calls to a statistical (maximum entropy) component.
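As a rough illustration of the lexical-resource step only, the Python sketch below wraps gazetteer entries in MUC-7 style ENAMEX elements. It is not the LTG system: the real pipeline combines grammars with a statistical component, whereas this is a plain lookup, and the gazetteer contents and the function `mark_entities` are hypothetical. The entity names are taken from the mark-up example earlier in the talk.

```python
import re

# Hypothetical hand-made gazetteer mapping names to MUC-7 ENAMEX types.
GAZETTEER = {
    "United States": "LOCATION",
    "Wernher von Braun": "PERSON",
    "MCI": "ORGANIZATION",
    "News Corp.": "ORGANIZATION",
}

# Longest names first, so a longer entry wins over a shorter one it contains.
_names = sorted(GAZETTEER, key=len, reverse=True)
_pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(name) for name in _names) + r")(?!\w)"
)


def mark_entities(text):
    """Wrap every gazetteer match in an ENAMEX element with its TYPE attribute."""
    def wrap(match):
        name = match.group(0)
        return f'<ENAMEX TYPE="{GAZETTEER[name]}">{name}</ENAMEX>'
    return _pattern.sub(wrap, text)


marked_up = mark_entities("MCI has teamed up with News Corp..")
print(marked_up)
```

A lookup like this cannot resolve ambiguous or unseen names, which is one reason the system described above intersperses grammar rules with a statistical component rather than relying on gazetteers alone.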

Platforms
LT XML
  – Unix (Solaris and Linux)
  – Windows/NT
LT TTT
  – Unix (Solaris and Linux)
  – planned Windows/NT version

Further LTG Expertise
XML
  – XSLT for document rendering
  – Document linking and stand-off annotation
  – XML query languages
  – Schemas
NL Generation
Automatic summarisation

What we hope to gain from CROSSMARC
Continued maintenance and development of our existing tools.
Extending our expertise beyond NER to fact extraction.
Opportunity to experiment with the symbolic/statistical balance in our system and to experiment with alternative statistical methods.
Automatic induction of NER rules.