William Underwood Georgia Tech Research Institute Atlanta, Georgia

Presentation transcript:

Natural Language Processing Applied to Archival Description of Textual E-records
William Underwood
Georgia Tech Research Institute
Atlanta, Georgia
WVU/NETL/ERA Workshop on Digital Preservation of Complex Engineering Data
WVU NRCCE, Morgantown, West Virginia
April 20-21, 2009
We are grateful for the support of this research by the Army Research Laboratory and the ERA program of NARA.

Overview
  Archival Description
  Method for extracting metadata from textual e-records
  Use of the metadata in archival description
  Next Steps

Archival Description
Archival description includes:
  The titling of records that do not have titles.
  The summary of the content of records, folders of records and series of records.
  When time allows, the creation of other finding aids, such as subject indexes to record series.

Archival Description: Research Motivation
  Archivists cannot describe a record series until it has been manually read and reviewed.
  With increasing volumes of e-records, it may be decades, even centuries, before new acquisitions are described.
  In responding to FOIA requests, archivists need to be able to search collections of e-records with high precision and recall.
  However, at the time of responding to FOIA requests, archivists have not read all of the records, so they cannot index the records and search on document types, dates of records, authors' and addressees' names, and the topics of records.
  The result set of a query is a list of file names, not record titles and summaries of content.

Archival Description: Item Scope and Content Note
  Descriptions of records include the names of author(s) and addressee(s), topics, actions and sometimes dates.
  Example of an item (record) description from NARA’s Archival Research Catalog (ARC):
    This letter was typewritten by President George H. W. Bush and addressed to his children: George, Jeb, Neil, Marvin, and Doro. He expresses his happiness at their Christmas celebration held at Camp David, then writes concerning his conflicted feelings as he prepares for the possibility of war with Iraq.

Archival description of textual records is necessary to support access to the records and understanding of them. Description covers individual items, file units and record series, as well as the context of the records. Guidelines for archival description, such as ISAD(G) or NARA’s Life Cycle Data Requirements, state that the actions of the record should be indicated in the description. In this example, “expresses his happiness” and “expresses his conflicted feelings” are the actions.

A Method for Extracting Metadata for Archival Description
  Input: Textual Document
  1. Information Extraction
  2. Document Type Recognition
  3. Speech Act Transducer
  4. Discourse Analysis for Topic Recognition
  Output: [document(e1), author(e1, S), addressee(e1, H), act(e1, F(P)), topic(e1, T), date(e1, D)]

The method achieves 90% performance in identifying person names, organization and location names, and dates. It automatically identifies documentary forms such as memos, correspondence, agendas, minutes of meetings and press releases, and extracts metadata such as date, author(s), addressee(s) and topics.

Information Extraction: Method
Information extraction (semantic tagging) is a technology used to identify and annotate semantic categories in text (e.g. names of persons, organizations and locations, job titles, dates).
  1. Document Reader
  2. English Tokenizer
  3. Wordlist Lookup + enhanced wordlists
  4. Sentence Splitter
  5. Hepple POS Tagger + lexicon
  6. Semantic Tagger + Named Entity Rules
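
The stage order above can be illustrated with a toy pipeline. This is a minimal sketch in plain Python, not the GATE components the slides name: the function names, the two tiny wordlists and the single "first name followed by surname" rule are all assumptions made for illustration, and the POS tagging stage is omitted.

```python
import re

# Hypothetical miniature wordlists; the real lists are far larger.
FIRST_NAMES = {"sam", "boyden"}
SURNAMES = {"skinner", "gray"}


def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)


def lookup(tokens):
    """Attach a wordlist label (or None) to each token."""
    labelled = []
    for tok in tokens:
        low = tok.lower()
        if low in FIRST_NAMES:
            labelled.append((tok, "first_name"))
        elif low in SURNAMES:
            labelled.append((tok, "surname"))
        else:
            labelled.append((tok, None))
    return labelled


def split_sentences(labelled):
    """Group labelled tokens into sentences ending at '.', '?' or '!'."""
    sentences, current = [], []
    for tok, label in labelled:
        current.append((tok, label))
        if tok in {".", "?", "!"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def tag_persons(sentence):
    """Toy named-entity rule: first_name followed by surname is a Person."""
    persons = []
    for (tok1, lab1), (tok2, lab2) in zip(sentence, sentence[1:]):
        if lab1 == "first_name" and lab2 == "surname":
            persons.append(f"{tok1} {tok2}")
    return persons


def extract_persons(text):
    """Run the stages in the slide's order and collect Person annotations."""
    found = []
    for sentence in split_sentences(lookup(tokenize(text))):
        found.extend(tag_persons(sentence))
    return found
```

Each stage only consumes the previous stage's output, which is the property that lets wordlists and rules be enhanced independently.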

Information Extraction: Wordlist Lookup
  Person_female_first.lst (8,263)
  Person_female_first_ambig.lst (117)
  Person_male_first.lst (3,704)
  Person_male_first_ambig.lst (1,117)
  Person_surname.lst (83,805)
  Person_surname_ambig.lst (6,802)
  Person_headofstate_90.lst (478)
  Location_city_US.lst (33,017)
  Location_city_us_ambig.lst (5,478)
  Location_foreign_city.lst (3,802)
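
One way to read the split into plain and "_ambig" lists is that a lookup reports both a category and whether the match is ambiguous, so that later rules can require more context for ambiguous hits. The sketch below assumes that reading. The file names follow the slide, but the sample entries and the `lookup_token` function are invented for illustration.

```python
# File names follow the slide; the sample entries are hypothetical.
WORDLISTS = {
    "Person_female_first.lst": {"susan", "sally"},
    "Person_female_first_ambig.lst": {"april"},
    "Person_surname.lst": {"skinner", "leighton"},
    "Person_surname_ambig.lst": {"gray", "black"},
}


def lookup_token(token):
    """Return (category, is_ambiguous) for every wordlist containing token."""
    hits = []
    for filename, entries in WORDLISTS.items():
        if token.lower() in entries:
            stem = filename[: -len(".lst")]
            ambiguous = stem.endswith("_ambig")
            category = stem[: -len("_ambig")] if ambiguous else stem
            hits.append((category, ambiguous))
    return hits
```

For example, `lookup_token("Gray")` reports a surname match flagged as ambiguous, while a token found in no list yields an empty result.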

Java Annotation Pattern Engine (JAPE) Rules

Annotated Person Names and Job Titles

Information Extraction: Performance

Document Types
Agenda, Bar Chart, Biography, Briefing Memo, Decision Memo, Correspondence, Diary, Executive Order, Information Memo, Job Application, List of Candidates for Federal Office, Mailing List, Memo, Minutes of Meeting, National Security Directive (NSD), Newsletter, Nomination to Federal Office, Notes, Presidential Statement, Press Pool Report, Press Release, Referral Memo, Resume, Schedule, Signature Memo, Situation Report, Summary, Transcript of Speech, Telephone Call Recommendation, Transcript of News Conference

Document Type Recognition
  Input: Annotated text from Information Extractor
  1. Intellectual Element Annotator + Intellectual Element Rules
  2. SUPPLE Parser/Interpreter + Document Type Grammars augmented with Semantics
  3. Extract Metadata
  Output: [document(e1), author(e1, S), addressee(e1, H), topic(e1, T), date(e1, D)]
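
The slides use a parser with document type grammars for this step. As a much simpler stand-in, the sketch below uses regular expressions to recognize one assumed memo header layout and return the output fields. The header layout, the `HEADER_FIELDS` patterns and the `recognize_memo` function are all hypothetical; real memoranda vary far more than this.

```python
import re

# Hypothetical header layout: one field per line, in any order.
HEADER_FIELDS = {
    "date": re.compile(r"^(?P<v>[A-Z][a-z]+ \d{1,2}, \d{4})$", re.M),
    "addressee": re.compile(r"^MEMORANDUM FOR:?[ \t]+(?P<v>.+)$", re.M),
    "author": re.compile(r"^FROM:[ \t]+(?P<v>.+)$", re.M),
    "topic": re.compile(r"^SUBJECT:[ \t]+(?P<v>.+)$", re.M),
}


def recognize_memo(text):
    """Return a metadata dict if every memo header field is found, else None."""
    metadata = {}
    for field, pattern in HEADER_FIELDS.items():
        match = pattern.search(text)
        if match is None:
            return None  # a required intellectual element is missing
        metadata[field] = match.group("v").strip()
    metadata["document_type"] = "memo"
    return metadata
```

Requiring every field before assigning the type mirrors the idea that a documentary form is defined by the presence of its intellectual elements, though a grammar can also express ordering and optional elements, which this sketch cannot.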

Document Types: Intellectual Element Recognition
The illustration on the left shows a document whose dates, times, and person, location and organization names have been annotated by the first six steps of the method. The illustration on the right shows the same document after recognition of its intellectual elements.

Document Types: Grammar for the Structure of a Memorandum

Document Types: Grammar for Memorandum with Semantic Rules

Parse Tree and Semantics of a Document

Extracted Metadata and Item Description
  Document_Type = memo
  Date = April 27, 1992
  Author = SAM SKINNER
  Addressee = EDE HOLIDAY
  Topic = California Earthquake
Item description: A memorandum dated April 27, 1992 from Sam Skinner to Ede Holiday regarding California Earthquake.
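
Turning the extracted fields into a description can be sketched as template filling. The example below is an assumption about how such a step might look, not the project's generator: `describe_item` and the `DOC_TYPE_NAMES` mapping are hypothetical, and the author and addressee are taken directly from the Author and Addressee fields.

```python
# Hypothetical mapping from type codes to the noun used in descriptions.
DOC_TYPE_NAMES = {"memo": "memorandum"}


def describe_item(metadata):
    """Fill a fixed sentence template from extracted metadata fields."""
    code = metadata["Document_Type"]
    doc_type = DOC_TYPE_NAMES.get(code, code)
    article = "An" if doc_type[0].lower() in "aeiou" else "A"
    author = metadata["Author"].title()        # "SAM SKINNER" -> "Sam Skinner"
    addressee = metadata["Addressee"].title()
    return (
        f"{article} {doc_type} dated {metadata['Date']} "
        f"from {author} to {addressee} regarding {metadata['Topic']}."
    )


memo = {
    "Document_Type": "memo",
    "Date": "April 27, 1992",
    "Author": "SAM SKINNER",
    "Addressee": "EDE HOLIDAY",
    "Topic": "California Earthquake",
}
description = describe_item(memo)
```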

Speech Act Transducer
  Annotation of Explicit Speech Acts
  Annotation of Implicit Speech Acts
  Annotation of Speech Acts Indicated by Text Structure
  Annotation of Indirect Speech Acts
  Annotation of the Primary Speech Acts

Speech Acts
  Performative verb: a verb whose action is accomplished merely by saying it or writing it.
    Example: I recommend that you attend the conference.
  Illocutionary force of the message: recommend
  Propositional content of the message: you attend the conference
  An explicit performative sentence is a sentence in which the illocutionary force is made explicit by naming the force.
    Example: I promise to be there.
  An implicit performative sentence is a sentence in which the illocutionary force is not made explicit by naming the force.
    Example: I shall be there.

John Austin, a philosopher of language, observed that language is used not only to describe acts but also to perform them. Verbs such as recommend, request and promise, whose action is performed merely by saying them, are termed performative verbs. Austin also distinguished the propositional content of a message from its illocutionary force. For instance, in “I recommend that you attend the conference,” the illocutionary force is recommend and the propositional content is “you attend the conference.” An explicit performative sentence is one in which the illocutionary force is made explicit by the use of a performative verb, for example, “I promise to be there.” An implicit performative sentence is one in which the illocutionary force is not made explicit by a performative verb, for example, “I shall be there.” The illocutionary force is still promise.
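
The split of an explicit performative sentence into force and content can be sketched with a single pattern. This is a toy illustration under strong assumptions: it only handles first-person sentences of the form "I <verb> ...", the verb set is limited to the three verbs named above, and `parse_explicit_performative` is a hypothetical name.

```python
import re

# The three performative verbs named in the notes; a real list is much longer.
PERFORMATIVE_VERBS = {"recommend", "request", "promise"}

# "I <verb> [that] <content>[.]" with the content captured without final punctuation.
PATTERN = re.compile(r"^I\s+(?P<verb>\w+)\s+(?:that\s+)?(?P<content>.+?)[.!]?$")


def parse_explicit_performative(sentence):
    """Return (illocutionary force, propositional content), or None."""
    match = PATTERN.match(sentence.strip())
    if match and match.group("verb").lower() in PERFORMATIVE_VERBS:
        return match.group("verb").lower(), match.group("content")
    return None
```

An implicit performative such as "I shall be there." returns `None` here, because no performative verb names the force; recognizing it needs other cues.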

Speech Acts: Implicit
Declarative, imperative and interrogative sentences also express speech acts.
  Declarative (state): You completed the report.
  Imperative (request): Please, complete the report.
  Interrogative (ask): Did you complete the report?

These are additional ways in which implicit speech acts are expressed.
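
The mapping from sentence type to implicit speech act can be sketched with surface cues alone. The version below is a deliberately crude assumption: it treats a final question mark as interrogative and a small hypothetical set of opening words as imperative, whereas a real system would rely on parsing. `classify_mood` and `IMPERATIVE_CUES` are invented names.

```python
# Hypothetical opening words that signal an imperative in this toy example.
IMPERATIVE_CUES = {"please", "complete", "attend", "pass"}


def classify_mood(sentence):
    """Return (sentence type, implicit speech act) from surface cues."""
    text = sentence.strip()
    if not text:
        raise ValueError("empty sentence")
    if text.endswith("?"):
        return "interrogative", "ask"
    first_word = text.split()[0].strip(",").lower()
    if first_word in IMPERATIVE_CUES:
        return "imperative", "request"
    return "declarative", "state"
```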

Speech Acts
An indirect speech act is a speech act that is performed indirectly by way of performing another.
  Example: “Can you pass the salt?” (ask) in the appropriate context means “Please, pass the salt.” (request)
Textual structure can also indicate illocutionary force.
  Example: a section heading RECOMMENDATIONS can indicate that the sentences in the section have the illocutionary force recommend.
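
The textual-structure case can be sketched by carrying the force of the most recent heading forward to the sentences beneath it. This is a minimal illustration, assuming headings are lines written entirely in capitals; `HEADING_FORCE` contains only the one mapping mentioned above, and `forces_from_structure` is a hypothetical name.

```python
# Only the heading-to-force mapping given in the slide; others would be added.
HEADING_FORCE = {"RECOMMENDATIONS": "recommend"}


def forces_from_structure(lines):
    """Pair each body line with the force implied by its section heading."""
    current_force = None
    annotated = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        if text.isupper():  # assumed heading convention: all capitals
            current_force = HEADING_FORCE.get(text)
            continue
        annotated.append((text, current_force))
    return annotated
```

A line under a heading with no known mapping gets `None`, leaving its force to be decided by the other annotation steps.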

Speech Acts in Presidential Records
44 of the speech acts were previously identified by Vanderveken:
  assert, deny, state, declare(1), tell(1), report, advise(1), remind, inform, certify(1), agree(1), acknowledge, praise(1), commit, pledge, direct, request, ask(1), ask(2), urge, encourage, invite, order(1), prohibit, suggest(2), propose, recommend, declare(2), resign, confirm, nominate, appoint, authorize, pray, terminate, veto, approve(1), disapprove, revoke, mourn, congratulate, thank, apologize, and welcome(2).
I identified and defined 23 additional speech acts:
  concur, salute, amend, counsel, welcome(1), tender(2), call on, block, retire, proclaim, delegate, designate, determine, find, reject(2), endorse, appreciate, regret, trust(1), believe, want, desire, and intend.

Uses of Extracted Metadata in Automatic Description
  Signature Memorandum from Boyden Gray to the President recommending the nomination of Ronald B. Leighton to be a US District Judge.
  Letter from President Bush to President Mikhail Gorbachev suggesting an informal meeting.
  Memorandum from President Bush to Boyden Gray requesting an analysis of the War Powers Resolution.
  Letter from Susan Black to President Bush expressing appreciation for nomination and commitment to serve.
  Referral Memorandum from Sally Kelley to FEMA requesting appropriate action to a letter from Beryl Anthony to the President.
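
Descriptions of this shape combine the document type, author and addressee with the primary speech act and its propositional content. The sketch below shows one possible record layout and template. The `ItemMetadata` class, the `PARTICIPLES` table and the `describe` function are assumptions for illustration, not the project's data model.

```python
from dataclasses import dataclass


@dataclass
class ItemMetadata:
    """Hypothetical record for one item's extracted metadata."""
    doc_type: str
    author: str
    addressee: str
    act: str       # primary speech act, e.g. "request"
    content: str   # propositional content, phrased as a noun phrase


# Present participles for the acts used in the examples above.
PARTICIPLES = {
    "recommend": "recommending",
    "suggest": "suggesting",
    "request": "requesting",
}


def describe(item):
    """Render 'Type from Author to Addressee <act>ing <content>.'"""
    return (
        f"{item.doc_type} from {item.author} to {item.addressee} "
        f"{PARTICIPLES[item.act]} {item.content}."
    )


item = ItemMetadata(
    doc_type="Memorandum",
    author="President Bush",
    addressee="Boyden Gray",
    act="request",
    content="an analysis of the War Powers Resolution",
)
```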

Next Steps
  Induce grammars for documentary forms from samples.
  Create rules for annotating implicit speech acts and speech acts indicated by textual structure.
  Evaluate the performance of the speech act recognition method.
  Recognize the topics of sentences.
  Apply discourse analysis to identify the primary topic(s) of records.
  Generate item, folder and series descriptions and evaluate the method.

Implications for Digital Curation Curriculum
To the extent that digital curators and digital curation curricula address textual records, they must also address the description of textual records. Many descriptions include an indication of the acts carried out by the records, and these acts are explicit, implicit or indirect speech acts. If we are successful in our research, digital curators will have tools to automatically create descriptions of large volumes of records, relieving them of the work needed to describe those records manually and allowing them to focus on other digital curation needs.

Additional Information
  Website: perpos.gtri.gatech.edu
  W. Underwood and S. Isbell. Semantic Annotation of Presidential E-Records. Technical Report ITTL/CSITD 08-01, May 2008.
  W. Underwood and S. Laib. Automatic Recognition of Documentary Forms. Technical Report ITTL/CSITD 08-02, May 2008.
  W. Underwood. Recognizing Communication Acts in Presidential E-Records. Technical Report ITTL/CSITD 08-03, October 2008.