27/03/01 CROSSMARC kick-off meeting

LTG Background

XML-based Processing
– Several years of experience in developing XML-based software
– LT XML Tools
– Pipeline architecture
Named Entity Recognition
– LT TTT Tools
– MUC-7 system

LT XML

A suite of tools which communicate using the LT XML API.
All use the same query language to access and manipulate subparts of XML documents.
Simple tools can be composed together into complex applications.
Programs include sggrep, sgcount, sgsort, xmlnorm, rxp, knit.
Additional programs: xmlperl, xmlquery.

Pipeline Architecture

An XML document is piped through a series of programs.
Each program targets a particular part of the document via a particular query.
Each program performs some operation, e.g. adding or removing mark-up, making other modifications to the structure of the XML, extracting or counting subparts of the document.
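
The pipeline idea can be sketched in a few lines of Python. This is only an illustration of the architecture, not the LT XML tools themselves: the stage functions `add_word_markup` and `count_elements` are hypothetical, and the queries use the limited XPath subset of Python's `xml.etree.ElementTree` rather than the LT XML query language.

```python
import xml.etree.ElementTree as ET

def add_word_markup(root, query):
    """Wrap each whitespace-separated token of the targeted elements in a W element."""
    for elem in root.iterfind(query):
        tokens = (elem.text or "").split()
        elem.text = None
        for token in tokens:
            w = ET.SubElement(elem, "W")
            w.text = token
    return root

def count_elements(root, query):
    """Record how many elements match the query as an attribute on the root."""
    root.set("count", str(len(root.findall(query))))
    return root

def run_pipeline(root, stages):
    """Pass the document through each (function, query) stage in order."""
    for stage, query in stages:
        root = stage(root, query)
    return root

doc = ET.fromstring("<TEXT><P>MCI has teamed up</P><P>Late last night</P></TEXT>")
# Each stage targets its own part of the document via its own query.
stages = [(add_word_markup, "./P"), (count_elements, "./P/W")]
result = run_pipeline(doc, stages)
print(ET.tostring(result, encoding="unicode"))
```

Each stage sees only the elements its query selects, so stages can be reordered or reused with different queries without changing their code.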

LT TTT: Text Tokenisation Tool

A suite of XML tools designed to tokenise text, from the most basic level through to high-level mark-up.
Useful for many linguistic applications, including corpus annotation.
Used by the LTG for their MUC-7 system.

LT TTT: programs

ltpos: a part-of-speech tagger and sentence boundary disambiguator
fsgmatch: a transducer operating over strings of characters or strings of XML elements using hand-written grammar rules
Other programs
– sggrep, xmlperl, sgdelmarkup

LT TTT: grammar files for fsgmatch

Titles and paragraphs
Sub-word character sequences
Words
Numbers (300, three hundred)
MUC-7 style NUMEX and TIMEX elements
In-text citations
Reference lists
Chunks: noun groups and verb groups (LT CHUNK)
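
As a rough illustration of what a hand-written rule over character strings does, the sketch below marks up both forms of number mentioned above (300, three hundred). It uses a Python regular expression, not fsgmatch grammar syntax; the element name `NUM` and the small word list are assumptions made for the example.

```python
import re

# Hypothetical rule: a digit string, or a run of spelled-out number words.
NUMBER_WORDS = (
    r"(?:one|two|three|four|five|six|seven|eight|nine|ten"
    r"|hundred|thousand|million)"
)
NUMBER_RULE = re.compile(
    rf"\b(?:\d+(?:,\d{{3}})*|{NUMBER_WORDS}(?:\s+{NUMBER_WORDS})*)\b",
    re.IGNORECASE,
)

def mark_numbers(text):
    """Wrap every span matched by NUMBER_RULE in a (hypothetical) NUM element."""
    return NUMBER_RULE.sub(lambda m: f"<NUM>{m.group(0)}</NUM>", text)

print(mark_numbers("They sold 300 units, not three hundred."))
```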

ltpos

Statistical (maximum entropy) component
Disambiguates full stops (and optionally adds sentence mark-up)
Also disambiguates sentence-initial capitals
Uses the Penn Treebank tagset; trained on the Brown corpus
Adds the POS tag as the value of an attribute on the W element
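
The last point, storing the tag as an attribute on each W element, can be sketched as follows. This is not ltpos: a toy lookup table stands in for the maximum entropy model, and the attribute name `C` and the fallback tag are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Toy lexicon standing in for the statistical model; values are Penn Treebank tags.
TOY_LEXICON = {
    "the": "DT",
    "company": "NN",
    "announced": "VBD",
    "a": "DT",
    "growth": "NN",
}

def tag_words(root, word_query=".//W", attr="C", default="NN"):
    """Set a POS tag attribute on every element selected by word_query."""
    for w in root.iterfind(word_query):
        w.set(attr, TOY_LEXICON.get((w.text or "").lower(), default))
    return root

para = ET.fromstring(
    "<P><W>the</W> <W>company</W> <W>announced</W> <W>a</W> <W>growth</W></P>"
)
tag_words(para)
print(ET.tostring(para, encoding="unicode"))
```

Keeping the tag in an attribute leaves the text content untouched, so later stages can match on either the words or their tags.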

LT TTT: example pipeline

    plain2xml.perl \
    | fsgmatch -q ".*/TEXT" GRAM/char/paras.gr \
    | fsgmatch -q ".*/P" GRAM/char/words.gr \
    | ltpos -q ".*/TEXT" -qs ".*/P" -qw ".*/W" -std_form \
        -sent SENT resource.xml \
    | fsgmatch -q ".*/P" GRAM/xml/numbers.gr \
    | fsgmatch -q ".*/P" GRAM/xml/numex.gr \
    | fsgmatch -q ".*/P" GRAM/xml/timex.gr

LT TTT: example input

In July 1995 CEG Corp. posted net of $102 million, or 34 cents a share. Late last night the company announced a growth of 20%.

LT TTT: example output

In July 1995 CEG Corp. posted net of $ 102 million, or 34 cents a share. Late last night the company announced a growth of 20 %.

Named Entity Recognition: MUC-7 mark-up

He was one of 118 Nazi rocket engineers secretly brought to the United States after the war.
The scientists included Wernher von Braun, the father of the American rocket programs.
MCI has long said it would be a bidder and would start the bidding at $175 million.
MCI has teamed up with News Corp..
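
For readers unfamiliar with the MUC-7 conventions, the sketch below shows inline ENAMEX and NUMEX elements with their TYPE attributes and lists the entities they mark. The annotation of the sample sentence is illustrative only and is not taken from the slide; the wrapping `S` element and the function name `list_entities` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Illustrative annotation (assumed, not from the slide) in MUC-7 style.
SAMPLE = (
    '<S><ENAMEX TYPE="ORGANIZATION">MCI</ENAMEX> has long said it would be '
    'a bidder and would start the bidding at '
    '<NUMEX TYPE="MONEY">$175 million</NUMEX>.</S>'
)

MUC7_ELEMENTS = {"ENAMEX", "TIMEX", "NUMEX"}

def list_entities(xml_text):
    """Return (element name, TYPE value, text) for each MUC-7 style element."""
    root = ET.fromstring(xml_text)
    return [
        (elem.tag, elem.get("TYPE"), elem.text)
        for elem in root.iter()
        if elem.tag in MUC7_ELEMENTS
    ]

for tag, kind, text in list_entities(SAMPLE):
    print(tag, kind, text)
```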

LTG's MUC-7 System

A pipeline made up of calls to LT TTT tools: ltpos and many calls to fsgmatch using different resource grammars.
Early stages (before tagging) recognise NUMEX and TIMEX elements.
The more complex final stages (after tagging) recognise ENAMEX elements: calls to fsgmatch using ENAMEX grammars and lexical resources (e.g. first names, gazetteers of place names) are interspersed with calls to a statistical (maximum entropy) component.
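
The role of a lexical resource such as a gazetteer of place names can be sketched as a simple lookup stage. This is a minimal illustration, not the LTG system: the function `mark_from_gazetteer` and the toy gazetteer are hypothetical, and a real system would combine such lookups with grammar rules and the statistical component rather than rely on them alone.

```python
import re

def mark_from_gazetteer(text, gazetteer, entity_type):
    """Wrap each gazetteer entry found in text in an ENAMEX element."""
    if not gazetteer:
        return text
    # Longest entries first, so "United States" is preferred over "States".
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, names)) + r")\b")
    return pattern.sub(
        lambda m: f'<ENAMEX TYPE="{entity_type}">{m.group(0)}</ENAMEX>', text
    )

PLACE_NAMES = {"United States", "States"}  # toy gazetteer for the example
print(mark_from_gazetteer(
    "secretly brought to the United States after the war", PLACE_NAMES, "LOCATION"
))
```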

Platforms

LT XML
– Unix (Solaris and Linux)
– Windows/NT
LT TTT
– Unix (Solaris and Linux)
– Planned Windows/NT version

Further LTG Expertise

XML
– XSLT for document rendering
– Document linking and stand-off annotation
– XML query languages
– Schemas
NL Generation
Automatic summarisation

What we hope to gain from CROSSMARC

Continued maintenance and development of our existing tools.
Extending our expertise beyond NER to fact extraction.
Opportunity to experiment with the symbolic/statistical balance in our system and to experiment with alternative statistical methods.
Automatic induction of NER rules.