Information Extraction from Biomedical Text
  Jerry R. Hobbs
  Artificial Intelligence Center
  SRI International
Introduction
  Information Extraction:
    Extract entities, relations, and events
    Capture structured information
  Domain specific
    Focus only on the relevant parts
    Mainly on economic and military interests?
    Biomedical domain
Cascaded Finite-State Transducers
  Separate processing into several stages
  FASTUS (Finite-State Automaton Text Understanding System)
  Earlier stages:
    Smaller linguistic objects
    Domain independent
  Later stages:
    Domain-dependent patterns
Cascaded Finite-State Transducers
  Complex Words
  Basic Phrases
  Complex Phrases
  Domain Patterns
  Merging Structures
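The stage list above can be read as a chain of functions, each consuming the output of the one before it. The following is a minimal Python sketch of that control flow only; the stage names and the placeholder bodies are hypothetical and are not taken from FASTUS.

```python
from functools import reduce
from typing import Callable, List

Stage = Callable[[list], list]

def run_cascade(stages: List[Stage], data: list) -> list:
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda current, stage: stage(current), stages, data)

def make_stage(name: str) -> Stage:
    """Hypothetical placeholder stage: it only records that it ran."""
    def stage(data: list) -> list:
        return data + [name]
    return stage

# Assumed names for the five levels listed on the slide.
STAGE_NAMES = ["complex_words", "basic_phrases", "complex_phrases",
               "domain_patterns", "merging_structures"]

cascade = [make_stage(name) for name in STAGE_NAMES]
trace = run_cascade(cascade, [])
```

In a real system each stage would rewrite the token or phrase sequence rather than append a label; the point here is only that later stages see what earlier stages produced.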
Example
  gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
Target Database
  Reaction Object: Attributes
    ID
    Pathway
    Enzyme
    ..
  Enzyme Object: Attributes
    ID
    Name
    Molecular-Weight
    Subunit-Component
    Subunit-Number
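One way to picture the target database is as two record types. The sketch below uses Python dataclasses; the field types, the optional defaults, and the IDs "E1" and "R1" are assumptions made for illustration, and the attributes the slide elides with ".." are deliberately not filled in.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Enzyme:
    id: str
    name: Optional[str] = None
    molecular_weight: Optional[int] = None
    subunit_component: Optional[str] = None  # left unset in this sketch
    subunit_number: Optional[int] = None

@dataclass
class Reaction:
    id: str
    pathway: Optional[str] = None
    enzyme: Optional[Enzyme] = None
    # The slide lists further attributes only as ".."; none are invented here.

# Values read off the example passage; the IDs are hypothetical.
enzyme = Enzyme(id="E1", name="gamma-Glutamyl kinase",
                molecular_weight=236000, subunit_number=6)
reaction = Reaction(id="R1", pathway="proline", enzyme=enzyme)
```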
Complex Words
  gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
  Recognizes multiword fixed phrases and proper names
  These are abundant in the biological domain
  Use a lexicon, or machine-learning and statistical methods
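The lexicon option can be sketched as a greedy longest-match lookup over the token stream. This is a minimal illustration, not the method used in FASTUS; the three-entry lexicon and the function name are hypothetical.

```python
# Hypothetical lexicon of multiword terms, stored as lowercase token tuples.
LEXICON = {
    ("gamma-glutamyl", "kinase"),
    ("escherichia", "coli"),
    ("3,4-dehydroproline",),
}
MAX_LEN = max(len(entry) for entry in LEXICON)

def group_complex_words(tokens):
    """Greedily replace the longest lexicon match at each position
    with a single multiword token; other tokens pass through."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if window in LEXICON:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No lexicon entry starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out

tokens = "purified from an Escherichia coli strain".split()
grouped = group_complex_words(tokens)
```

Here `grouped` keeps "Escherichia coli" together as one unit, so later stages can treat it as a single word.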
Basic Phrases
  Segment a sentence into noun groups, verb groups, and particles
  Use the Sager (1981) grammar
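Segmentation into basic phrases can be illustrated with a single regular expression over part-of-speech tags, which stands in for a finite-state grammar. The sketch below is far simpler than the Sager grammar named on the slide; the one-letter tag set and the hand-assigned tags are assumptions made for the example.

```python
import re

# Hypothetical one-letter tags: D=determiner, A=adjective, N=noun or number,
# X=auxiliary, R=adverb, V=verb, P=preposition or particle.
TAGGED = [("The", "D"), ("enzyme", "N"), ("had", "V"), ("a", "D"),
          ("native", "A"), ("molecular", "A"), ("weight", "N"),
          ("of", "P"), ("236,000", "N")]

# Because every tag is one character, match offsets in the tag string
# are also token indices.
CHUNK = re.compile(r"(?P<NG>D?A*N+)|(?P<VG>X*R*V+)|(?P<P>P)")

def chunk(tagged):
    """Return (label, text) pairs for noun groups, verb groups, particles."""
    tags = "".join(tag for _, tag in tagged)
    groups = []
    for match in CHUNK.finditer(tags):
        words = " ".join(word for word, _ in tagged[match.start():match.end()])
        groups.append((match.lastgroup, words))
    return groups

groups = chunk(TAGGED)
```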
Complex Phrases
  Attach appositives to their head noun groups
  Attach “of” prepositional phrases to their head noun groups
Complex Phrases
  Structures of basic and complex phrases, entities and events
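Attaching an "of" prepositional phrase to its head noun group can be sketched as a pass over the basic phrases. The (label, text) representation and the labels "NG", "VG", and "P" are hypothetical conventions chosen for this example; appositive attachment is not shown.

```python
def attach_of_phrases(groups):
    """Fold each 'of' + noun group into the noun group that precedes it."""
    out, i = [], 0
    while i < len(groups):
        label, text = groups[i]
        # Absorb every following "of" + noun group into the current noun group.
        while (label == "NG" and i + 2 < len(groups)
               and groups[i + 1] == ("P", "of")
               and groups[i + 2][0] == "NG"):
            text = f"{text} of {groups[i + 2][1]}"
            i += 2
        out.append((label, text))
        i += 1
    return out

basic = [("NG", "The enzyme"), ("VG", "had"),
         ("NG", "a native molecular weight"), ("P", "of"), ("NG", "236,000")]
complex_phrases = attach_of_phrases(basic)
```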
Clause-Level Domain Patterns
  The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
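A domain pattern for this sentence can be approximated with regular expressions that fill Enzyme attributes. This is a simplified sketch: a cascaded system would match over the phrase structures built by the earlier stages rather than over the raw string, and the pattern wording, the number-word table, and the key names are assumptions.

```python
import re

SENTENCE = ("The enzyme had a native molecular weight of 236,000 and was "
            "apparently comprised of six identical 40,000-dalton subunits.")

# Small hypothetical lookup for spelled-out counts.
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6}

MW_PATTERN = re.compile(r"molecular weight of (?P<mw>[\d,]+)")
SUBUNIT_PATTERN = re.compile(
    r"comprised of (?P<count>\w+) identical (?P<mw>[\d,]+)-dalton subunits")

def to_int(text):
    """Convert a number written with thousands separators, e.g. '236,000'."""
    return int(text.replace(",", ""))

def extract_enzyme_facts(sentence):
    """Return whichever attributes the two patterns find in the sentence."""
    facts = {}
    mw = MW_PATTERN.search(sentence)
    if mw:
        facts["molecular_weight"] = to_int(mw.group("mw"))
    sub = SUBUNIT_PATTERN.search(sentence)
    if sub:
        facts["subunit_number"] = NUMBER_WORDS.get(sub.group("count"))
        facts["subunit_weight"] = to_int(sub.group("mw"))
    return facts

facts = extract_enzyme_facts(SENTENCE)
```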
Merging Structures
  First 4 levels: process within a single sentence
  This level: collect and combine information about one entity or relationship
  Three criteria:
    The internal structure of the noun groups
    Nearness along some metric
    Consistency and compatibility of the 2 structures
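The third criterion, compatibility of the two structures, can be sketched as a check that no attribute filled in both structures has conflicting values. The sketch below models only that criterion; noun-group structure and nearness are not modeled, and the dictionary representation is an assumption.

```python
def compatible(a, b):
    """Two structures are compatible if every attribute that is filled
    in both of them has the same value."""
    return all(a[key] == b[key] for key in a.keys() & b.keys()
               if a[key] is not None and b[key] is not None)

def merge(a, b):
    """Return the combined structure, or None when the two conflict."""
    if not compatible(a, b):
        return None
    merged = dict(a)
    for key, value in b.items():
        if merged.get(key) is None:
            merged[key] = value
    return merged

# Partial structures as two sentences of the example might yield them.
first = {"name": "gamma-Glutamyl kinase", "molecular_weight": None}
second = {"name": None, "molecular_weight": 236000, "subunit_number": 6}
merged = merge(first, second)

# Conflicting weights block the merge.
conflict = merge({"molecular_weight": 236000}, {"molecular_weight": 40000})
```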
Compile-Time Transformations
  Start from a Subject-Verb-Object pattern
  Expand it into linguistic variants (passive, relative clauses, etc.)
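The idea of writing one Subject-Verb-Object pattern and generating its surface variants ahead of time can be sketched as plain string templates. The slot names "<Enzyme>" and "<Reaction>", the verb "catalyze", and the three variants shown are hypothetical choices; a real pattern compiler would produce finite-state patterns rather than strings.

```python
def expand_svo(subject, verb_3sg, participle, obj):
    """Generate a few surface variants from a single S-V-O specification."""
    return {
        "active": f"{subject} {verb_3sg} {obj}",
        "passive": f"{obj} is {participle} by {subject}",
        "relative": f"{subject} that {verb_3sg} {obj}",
    }

patterns = expand_svo("<Enzyme>", "catalyzes", "catalyzed", "<Reaction>")
```

The pattern author writes only the first form; the other entries in `patterns` come for free from the expansion step.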
Types of Specialized Domains
  “Noun-driven” approach
    The type of an entity is highly predictive of its role in an event
    Loose S-V-O patterns
  “Verb-driven” approach
    The roles of the entities in events cannot be predicted from their types
    Tight S-V-O patterns
Limitations of IE Technology
  MUC (1990):
    Name recognition: ~95% recall and precision
    Event recognition: ~60% recall and precision
  Possible reasons:
    Process of merging
    Only works with explicit information
    Common cases are covered, but what about the rare cases?
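For reference, recall and precision over a set of extracted items can be computed as below. This is a simplified exact-match sketch with made-up item IDs; the official MUC scoring procedure was more elaborate than this.

```python
def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / gold."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Made-up example: 5 gold events, 5 extracted, 3 of them correct.
gold = {"e1", "e2", "e3", "e4", "e5"}
extracted = {"e1", "e2", "e3", "x1", "x2"}
precision, recall = precision_recall(extracted, gold)
```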