Information Extraction from Single and Multiple Sentences
Mark Stevenson
Department of Computer Science, University of Sheffield, UK
Introduction
Information Extraction is often viewed as the process of identifying events described in text.
It is generally accepted that an event may be described across more than one sentence:
– "Pace American Group Inc. said it notified two top executives it intends to dismiss them because an internal investigation found evidence of 'self-dealing' and 'undisclosed financial relationships'."
– "The executives are Don H. Pace, cofounder, president and chief executive officer; and Greg S. Kaplan, senior vice president and chief financial officer."
Sentence-limited Approaches
Some approaches treat each sentence in isolation and extract only the events described within it:
– Zelenko et al. (2003) – SVM
– Soderland (1999) – rule generalisation
– Chieu and Ng (2002) – maximum entropy
– Yangarber et al. (2000) – pattern learning
This restriction often makes IE more practical for machine learning.
But how can results be compared against systems which extract all events?
And how much can be achieved by analysing within sentences alone?
Experiment
Compare two alternative annotations of the same corpus:
– Complete annotation identifies all events described in a document
– Within sentence annotation marks only events described within a single sentence
The corpus used is the MUC-6 evaluation texts:
– Documents describe management succession events
– The complete annotation was produced as part of the formal evaluation
– The within sentence annotation is due to Soderland (1999)
Event Definition
The two annotations of this corpus use different definitions of what constitutes an event.
Events in both annotations are transformed into a common representation scheme which:
– contains the information encoded by both schemes
– allows comparison
– provides a method for defining what constitutes an event
Each event is stored as a database entry consisting of four fields:
– type: person_in or person_out
– person, post, organisation
A minimal event description contains at least two elements.
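A minimal sketch of this representation in Python (the Event class, the set-valued fields for aliases, and the two-element check are illustrative assumptions, not code from the paper):

```python
from dataclasses import dataclass, field

# Illustrative record for the common representation scheme.
# A field may hold several acceptable fillers (aliases), hence sets.
@dataclass
class Event:
    type: str = ""                 # "person_in" or "person_out"
    person: set[str] = field(default_factory=set)
    post: set[str] = field(default_factory=set)
    org: set[str] = field(default_factory=set)

    def is_minimal(self) -> bool:
        # "Minimal event description": at least two elements present
        # (counting type as one element is an assumption here).
        elements = [bool(self.type), bool(self.person),
                    bool(self.post), bool(self.org)]
        return sum(elements) >= 2
```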
MUC Annotation
Annotations are stored in a complex nested template structure.
The core SUCCESSION_EVENT refers to a specific position.
It contains IN_AND_OUT events, each recording the movement of a single executive relative to that position.
Aliases list alternative ways of referring to event objects.
The representation does not directly link event objects to the text, so the proportion of events described within a sentence is difficult to compute directly.
MUC Annotation: Example
SUCCESSION_EVENT :=
  SUCCESSION_ORG: <ORGANIZATION>
  POST: "chairman"
  IN_AND_OUT: <IN_AND_OUT>
IN_AND_OUT :=
  IO_PERSON: <PERSON>
  NEW_STATUS: IN
  OTHER_ORG: <ORGANIZATION>
ORGANIZATION :=
  ORG_NAME: "McCann-Erickson"
  ORG_ALIAS: "McCann"
PERSON :=
  PER_NAME: "John J. Dooner Jr."
  PER_ALIAS: "John Dooner", "Dooner"
Common representation:
type(person_in)
post('chairman')
org('McCann-Erickson' | 'McCann')
person('John J. Dooner Jr.' | 'John Dooner' | 'Dooner')
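The flattening from nested template to database entry could be sketched as below; the dictionary layout only mirrors the example on this slide, not the full MUC template format, and Event is the record sketched earlier:

```python
# Hypothetical nested template mirroring the example above; real MUC
# templates link objects through pointers rather than nested dicts.
muc_template = {
    "POST": "chairman",
    "IN_AND_OUT": {
        "NEW_STATUS": "IN",
        "IO_PERSON": {"PER_NAME": "John J. Dooner Jr.",
                      "PER_ALIAS": ["John Dooner", "Dooner"]},
    },
    "SUCCESSION_ORG": {"ORG_NAME": "McCann-Erickson",
                       "ORG_ALIAS": ["McCann"]},
}

def flatten(tmpl: dict) -> Event:
    io = tmpl["IN_AND_OUT"]
    per = io["IO_PERSON"]
    org = tmpl["SUCCESSION_ORG"]
    return Event(
        type="person_in" if io["NEW_STATUS"] == "IN" else "person_out",
        person={per["PER_NAME"], *per.get("PER_ALIAS", [])},
        post={tmpl["POST"]},
        org={org["ORG_NAME"], *org.get("ORG_ALIAS", [])},
    )
```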
Within Sentence Annotation
Soderland (1999) produced an alternative annotation of the same corpus.
The annotation is linked directly to the source sentence, so only events described within a single sentence are included.
Annotations use a flat structure inspired by case frames.
Within Sentence Annotation: Example
"Daniel Glass was named president and chief executive officer of EMI Record Group"
Succession {PersonIn DANIEL GLASS} {Post PRESIDENT AND CHIEF EXECUTIVE OFFICER} {Org EMI RECORD GROUP}
event 1: type(person_in) person('Daniel Glass') org('EMI Record Group') post('president')
event 2: type(person_in) person('Daniel Glass') org('EMI Record Group') post('chief executive officer')
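One frame yields one event per post, which is how the conjoined "president and chief executive officer" becomes two events; a sketch under the same assumptions as the earlier Event record (the frame dict is hypothetical):

```python
def frame_to_events(frame: dict) -> list[Event]:
    # Generate one event per post named in the frame, so a conjoined
    # post phrase expands into several events.
    return [Event(type="person_in",
                  person={frame["PersonIn"]},
                  post={post},
                  org={frame["Org"]})
            for post in frame["Posts"]]

frame = {"PersonIn": "Daniel Glass",
         "Posts": ["president", "chief executive officer"],
         "Org": "EMI Record Group"}
events = frame_to_events(frame)   # two person_in events
```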
Matching
Two levels of match are allowed between events in the two sets:
– Full match: the events contain the same fields and each field shares at least one filler
– Partial match: the events share some fields and each of those fields shares at least one filler
The matching process compares each event in the within sentence set with each event in the MUC set.
Only one-to-one mappings are allowed for full matches, but many within sentence events can partially match a single MUC event.
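A sketch of the two match levels (using set intersection as the "shares at least one filler" test is an assumption; the paper gives no pseudocode):

```python
def fields(e: Event) -> dict[str, set[str]]:
    # Non-empty fields of an event, keyed by field name.
    d = {"type": {e.type} if e.type else set(),
         "person": e.person, "post": e.post, "org": e.org}
    return {k: v for k, v in d.items() if v}

def full_match(a: Event, b: Event) -> bool:
    fa, fb = fields(a), fields(b)
    # Same fields, and each field shares at least one filler.
    return fa.keys() == fb.keys() and all(fa[k] & fb[k] for k in fa)

def partial_match(a: Event, b: Event) -> bool:
    fa, fb = fields(a), fields(b)
    shared = fa.keys() & fb.keys()
    # Some fields in common, each sharing at least one filler.
    return bool(shared) and all(fa[k] & fb[k] for k in shared)
```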
Matching: Example
Fully matching events:
  MUC event:             type(person_in) person('R. Wayne Diesel' | 'Diesel') org('Mechanical Technology Inc.' | 'Mechanical Technology') post('chief executive officer')
  Within sentence event: type(person_in) person('R. Wayne Diesel') org('Mechanical Technology') post('chief executive officer')
Partially matching events:
  MUC event:             type(person_in) person('R. Wayne Diesel' | 'Diesel') org('Mechanical Technology Inc.' | 'Mechanical Technology') post('chief executive officer')
  Within sentence event: type(person_in) person('R. Wayne Diesel') org('Mechanical Technology')
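Applying the sketched match functions to this example (a usage check, reusing Event, full_match and partial_match from above):

```python
muc = Event(type="person_in",
            person={"R. Wayne Diesel", "Diesel"},
            org={"Mechanical Technology Inc.", "Mechanical Technology"},
            post={"chief executive officer"})

ws_full = Event(type="person_in", person={"R. Wayne Diesel"},
                org={"Mechanical Technology"},
                post={"chief executive officer"})
ws_partial = Event(type="person_in", person={"R. Wayne Diesel"},
                   org={"Mechanical Technology"})

assert full_match(muc, ws_full)        # all four fields overlap
assert partial_match(muc, ws_partial)  # post missing, shared fields overlap
assert not full_match(muc, ws_partial)
```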
Event Analysis (counts in parentheses)

            All events     Within sentence events
Full        40.6% (112)    45.2% (112)
Partial     39.1% (108)    47.6% (118)
No match    20.3% (56)      7.3% (18)
Mismatching Events
Of the 18 within sentence events with no match in the MUC data:
– Spurious events in the limited annotation set, not matched to any event in the MUC corpus: 9
– Events mentioned in the limited annotation and the text but excluded from the MUC data by its strict guidelines: 8
– Event mentioned in the limited annotation and the text but not in the MUC data: 1
Event Field Analysis

         Full match   Partial match   No match   TOTAL
Type     112/112      100/108         0/56       76.8%
Person   112/112      100/108         0/56       76.8%
Org      112/112        6/108         0/53       43.2%
Post     111/111       74/108         0/50       68.8%
Total    447/447      280/432         0/215      66.5%
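The TOTAL column is the proportion of fillers matched at either level, e.g. for Type: (112 + 100) / (112 + 108 + 56) = 212/276 ≈ 76.8%, and overall: (447 + 280) / (447 + 432 + 215) = 727/1094 ≈ 66.5%.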
Text Style
Variation between the event fields can be explained by the structure of documents in this corpus.
A succession event is often introduced at the start of a document and is generally complete:
– "Washington Post Co. said Katherine Graham stepped down as chairman and will be succeeded by her son, Donald E. Graham, the company's chief executive."
Further events may not be described fully:
– "Alan G. Spoon, 42, will succeed Mr. Graham as chief executive of the company."
– "Mr. Jones is succeeded by Mr. Green."
– "Mr. Smith assumed the role of CEO."
Conclusion
Analysis of a commonly used IE evaluation corpus showed that only 40.6% of events are fully described within a single sentence.
A larger proportion of events are at least partially described within a sentence, but there is wide variation between the event fields, due to document style.
These results should be borne in mind during the design and evaluation of IE systems.
There are additional implications for summarisation systems which select sentences, and for question answering systems.