Published by Alexandrina Hutchinson. Modified over 8 years ago.
1
The Workhorse System ● Andrew J. Dougherty ● FRDCSA Project
2
Overview ● Dedicated to various processing jobs – specifically, processing properly licensed textual materials, e.g. Gutenberg, Wikipedia, Google Books, etc. ● Performing NLU, ontology population, information extraction, text mining, etc.
3
Systems Involved ● NLU, Kmax, Sayer, Suppositional Decomposer, Formalize, Capability::TextAnalysis, etc., for adding semantic markup ● Corpus Manager for interfacing with collected, properly licensed corpora ● Namazu for full-text search, etc. ● QUAC for question answering
4
Types of Semantic Markup ● Given a text with complex formatting, we intend to mark up every noticeable feature of the text ● For instance, with poetry, you would mark the type of poem, stanzas, quatrains, alliterative words, and so on ● Markup can be either standoff or inline
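To make the standoff/inline distinction concrete, here is a minimal sketch in Python. It is not taken from the FRDCSA code base: the `Annotation` type, the `to_inline` function and the `allit` tag name are all hypothetical. Standoff markup keeps the text untouched and stores character offsets separately; inline markup writes the tags into the text itself.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    """A hypothetical standoff annotation: offsets into an unmodified text."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    tag: str


def to_inline(text: str, annotations: List[Annotation]) -> str:
    """Render non-overlapping standoff annotations as inline tags."""
    out = []
    pos = 0
    for ann in sorted(annotations):
        if ann.start < pos:
            # Inline tags cannot express overlap; standoff annotations can.
            raise ValueError("overlapping annotations cannot be inlined by this sketch")
        out.append(text[pos:ann.start])
        out.append(f"<{ann.tag}>{text[ann.start:ann.end]}</{ann.tag}>")
        pos = ann.end
    out.append(text[pos:])
    return "".join(out)


line = "wild west wind"
standoff = [Annotation(0, 4, "allit"), Annotation(5, 9, "allit")]
inline = to_inline(line, standoff)
```

One reason to prefer standoff storage is that several tools can annotate the same text independently, including with overlapping spans, while the inline form is easier to read and edit by hand.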
5
Range ● Perform markup over all textual sources, building a freely available corpus of advanced marked-up text ● Allow manual corrections and additions with Kmax ● Train information extraction systems like Mallet using annotated text
6
Progress to Date ● Deployed a 32-bit server with an installation of FRDCSA (built release instructions for FRDCSA from this install) ● Server is running out of its small disk space ● Wrote basic infrastructure for Corpus Manager ● Still collecting corpora (disk space issue) ● Debated upgrading server, not enough funds ● NLU system needs to be completed and installed
7
Types of Markup ● Wish to be able to mark up the document structure, layout, and formatting using a tag hierarchy similar to HTML, the UIMA tag hierarchy, etc. ● Building an ontology of textual information terms/concepts ● If we can get SPADE to work, we can annotate for Rhetorical Structure Theory (for documents tagged as argumentative)
8
Types of Markup ● Capability::TextAnalysis generates a wide range of outputs, which we align to the original text to produce standoff annotations in the form of assertions in FreeKBS2 ● Adding new systems daily, such as Knext for world knowledge extraction
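The alignment step can be sketched as follows. This is an illustration only: the `align_tokens` function, the `has-tag` predicate and the tuple layout are assumptions, and the real FreeKBS2 assertion format is not shown in these slides. The idea is to locate each token reported by an analysis tool in the original text, scanning left to right, and record its character offsets.

```python
from typing import List, Tuple

# Hypothetical assertion shape: (predicate, start, end, tag).
Assertion = Tuple[str, int, int, str]


def align_tokens(text: str, tagged: List[Tuple[str, str]]) -> List[Assertion]:
    """Locate each (token, tag) pair in text and emit offset-based assertions."""
    assertions = []
    cursor = 0  # never search behind the previous match
    for token, tag in tagged:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        end = start + len(token)
        assertions.append(("has-tag", start, end, tag))
        cursor = end
    return assertions


sentence = "The cat sat."
tool_output = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")]
assertions = align_tokens(sentence, tool_output)
```

This simple search fails when a tool rewrites tokens (for example, changing quotation marks or whitespace), so a production aligner would need a more tolerant matching strategy.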
9
Conceptual Problems ● Different ways to proceed with marking text – can mark up relative to the whole text or relative to a part of the text (e.g. starting from the last paragraph) ● How to fragment text correctly to send to the markup systems ● Currently using Text::Fragment, which, while a quick and useful fix, does not segment according to document type – it will split in the middle of a sentence, etc.
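One possible direction for the fragmentation problem is sketched below. It is not how Text::Fragment works; the `fragment` function and its naive sentence-boundary regex are assumptions made for illustration. The sketch packs whole sentences into fragments of at most `max_chars` characters and keeps each fragment's starting offset, so that markup produced for a fragment can be mapped back to the whole text.

```python
import re
from typing import List, Tuple

# Naive boundary: whitespace that follows '.', '!' or '?'. Real text
# (abbreviations, verse, headings) needs document-type-aware rules.
SENTENCE_GAP = re.compile(r"(?<=[.!?])\s+")


def fragment(text: str, max_chars: int) -> List[Tuple[int, str]]:
    """Split text into (offset, fragment) pairs without cutting sentences."""
    if not text:
        return []
    starts = [0] + [m.end() for m in SENTENCE_GAP.finditer(text)]
    bounds = list(zip(starts, starts[1:] + [len(text)]))
    fragments = []
    frag_start, frag_end = bounds[0]
    for start, end in bounds[1:]:
        if end - frag_start <= max_chars:
            frag_end = end  # the sentence still fits in the current fragment
        else:
            fragments.append((frag_start, text[frag_start:frag_end]))
            frag_start, frag_end = start, end
    fragments.append((frag_start, text[frag_start:frag_end]))
    return fragments


sample = "One two. Three four. Five."
pieces = fragment(sample, max_chars=21)
```

A single sentence longer than `max_chars` is emitted as one oversized fragment rather than being cut, which trades strict size limits for intact sentences.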
10
File system and KBFS ● Need to develop KBFS (Knowledge Based File System) to assert theorems about the document files themselves ● File type; if text, which character set; file metadata; md5sum; crypto-signing; file history (where it was acquired, which modifications were made, etc.); estimate of the file's contents; whether the file is an archive and, if so, facts about its subcontents
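As a rough sketch of the kind of facts KBFS might record, the Python below gathers a few of the properties listed above for one file. KBFS itself is described here only as a plan, so the `describe_file` function, the predicate names and the tuple format are all hypothetical; only the standard-library calls are real.

```python
import hashlib
import tarfile
import zipfile
from pathlib import Path
from typing import List, Tuple

# Hypothetical fact shape: (predicate, file path, value).
Fact = Tuple[str, str, object]


def describe_file(path: Path) -> List[Fact]:
    """Collect a few checkable facts about a file."""
    name = str(path)
    data = path.read_bytes()
    facts: List[Fact] = [
        ("size-bytes", name, len(data)),
        ("md5sum", name, hashlib.md5(data).hexdigest()),
        ("is-archive", name, zipfile.is_zipfile(name) or tarfile.is_tarfile(name)),
    ]
    try:
        data.decode("utf-8")
        facts.append(("charset", name, "utf-8"))
    except UnicodeDecodeError:
        # Character set detection proper is out of scope for this sketch.
        facts.append(("charset", name, "unknown"))
    return facts
```

Calling `describe_file(Path("some-corpus-file.txt"))` returns a list of tuples that could then be translated into whatever assertion syntax the knowledge base expects. File history and signing are not covered because they require information from outside the file.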
11
Reasoning with KBFS & NLU Markup ● How to properly make inferences from the annotations ● How to interface with other applications – such as Kmax for rapidly interacting with text contents
12
Current Bugs ● Knext not being run correctly ● NLU alignment of POS tagging failing due to misplaced carriage returns ● NLU in general has carriage return alignment problems ● Decision procedure for additional markup needs to be developed, formalized, learned, etc. ● Need to incorporate deliberative reasoning, maybe à la NL-Soar, etc.
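The slides do not say exactly how the carriage returns break alignment. One common cause, assumed here, is that a tool processes text with carriage returns stripped, so its offsets no longer match the original file. The sketch below (function names are hypothetical) strips carriage returns while keeping an index map, so spans computed on the cleaned text can be translated back to offsets in the original.

```python
from typing import List, Tuple


def strip_cr(text: str) -> Tuple[str, List[int]]:
    """Remove carriage returns; index_map[i] is the original offset of clean[i]."""
    clean = []
    index_map = []
    for i, ch in enumerate(text):
        if ch != "\r":
            clean.append(ch)
            index_map.append(i)
    return "".join(clean), index_map


def to_original_span(index_map: List[int], start: int, end: int) -> Tuple[int, int]:
    """Map a non-empty half-open span on the clean text back to the original."""
    if not 0 <= start < end <= len(index_map):
        raise ValueError("span must be non-empty and inside the clean text")
    return index_map[start], index_map[end - 1] + 1


original = "a\r\nbc\r\nd"
clean, index_map = strip_cr(original)
span = to_original_span(index_map, 2, 4)  # "bc" in the clean text
```

Keeping the map alongside the cleaned text means downstream tools never need to know that the input was modified.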
13
Current Bugs ● Too few tags are actually being applied to the data ● Nlu.el handles the mapping from tags to text properties poorly; it creates Lisp symbols that are prefixed with "nlu-" and contain wild characters ● Nlu.el cannot annotate because the tag hierarchy/ontology is nonexistent ● Reasoning with tag data is difficult, owing to the strange structure of the assertions
14
Sample Outputs
15
Related Projects ● NLU system will help with information extraction efforts by allowing advanced markup of the text and training over already marked-up datasets, such as the job/resume corpus ● Chris Lampkin's Biblioborg and Kuranatron systems can be partially implemented by translating extracted knowledge to multi-agent deontic logics ● Can analyze all ethical systems in this manner
16
Other Benefits ● Being able to extract this knowledge from the text enables advanced text querying, understanding, reasoning. Difficult to precisely define, but such a corpus, especially a free one, will be a boon to language research and practical knowledge extraction ● Eventually plan to index all properly licensed text ● Can always add new analyses
17
Other Markup ● Shalmaneser semantic role labeling, affect labeling, canonical NLP labeling, named entity, coreference, RST, fact extraction, relation extraction, goal/skill/ability extraction, poetry and other specialized markup, document structure, file system metadata
18
Other Benefits ● Completed system may lazily or dynamically analyze our local text