Information Extraction from Biomedical Text Jerry R. Hobbs Artificial Intelligence Center SRI International.


1 Information Extraction from Biomedical Text Jerry R. Hobbs Artificial Intelligence Center SRI International

2 Introduction
Information Extraction:
- Extracts entities, relations, and events
- Captures structured information
- Domain specific
- Focuses only on the relevant parts of a text
- Mainly applied in domains of economic and military interest
- Biomedical domain

3 Cascaded Finite-State Transducers
Separate processing into several stages
FASTUS (Finite-State Automaton Text Understanding System)
Earlier stages:
- Smaller linguistic objects
- Domain independent
Later stages:
- Domain-dependent patterns

4 Cascaded Finite-State Transducers
1. Complex Words
2. Basic Phrases
3. Complex Phrases
4. Domain Patterns
5. Merging Structures

5 Example
gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.

6 Target Database
Reaction Object, attributes:
- ID
- Pathway
- Enzyme
- ..
Enzyme Object, attributes:
- ID
- Name
- Molecular-Weight
- Subunit-Component
- Subunit-Number
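The target database above can be pictured as two record types. The following is only a minimal sketch: the attribute names come from the slide, but the field types, the use of string IDs to link objects, and the idea that a subunit is stored as a second Enzyme record are assumptions made for illustration. The attributes elided by ".." on the slide are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Enzyme:
    """Enzyme object; attribute names follow the slide, types are assumed."""
    id: str
    name: Optional[str] = None
    molecular_weight: Optional[int] = None   # assumed to be in daltons
    subunit_component: Optional[str] = None  # assumed: ID of another Enzyme record
    subunit_number: Optional[int] = None


@dataclass
class Reaction:
    """Reaction object; only the attributes named on the slide are modeled."""
    id: str
    pathway: Optional[str] = None
    enzyme: Optional[str] = None             # assumed: ID of an Enzyme record


# Records one might hope to fill from the example text (IDs are hypothetical).
subunit = Enzyme(id="E2", molecular_weight=40000)
enzyme = Enzyme(
    id="E1",
    name="gamma-Glutamyl kinase",
    molecular_weight=236000,
    subunit_component=subunit.id,
    subunit_number=6,
)
reaction = Reaction(id="R1", pathway="proline biosynthesis", enzyme=enzyme.id)
```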

7 Complex Words
gamma-Glutamyl kinase, the 1st enzyme of the proline biosynthetic pathway, was purified to homogeneity from an Escherichia coli strain resistant to the proline analog 3,4-dehydroproline. The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
- Recognizes multiword fixed phrases and proper names
- These are abundant in the biological domain
- Uses a lexicon, or machine-learning and statistical methods
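The lexicon option for complex words can be sketched as a greedy longest-match lookup over tokens. This is an illustrative toy, not the FASTUS implementation: the function name, the tiny lexicon, and its type labels (ENZYME, ORGANISM, COMPOUND) are all hypothetical.

```python
def find_complex_words(tokens, lexicon, max_len=4):
    """Return (start, end, label) spans for the longest lexicon match at each position."""
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multiword entries win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in lexicon:
                match = (i, i + n, lexicon[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]          # continue after the matched phrase
        else:
            i += 1
    return spans


# Hypothetical lexicon with lower-cased keys.
LEXICON = {
    "gamma-glutamyl kinase": "ENZYME",
    "escherichia coli": "ORGANISM",
    "3,4-dehydroproline": "COMPOUND",
}

tokens = "gamma-Glutamyl kinase was purified from an Escherichia coli strain".split()
spans = find_complex_words(tokens, LEXICON)
```

Here `spans` marks tokens 0 to 2 as one ENZYME name and tokens 6 to 8 as one ORGANISM name, so later stages can treat each as a single word.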

8 Basic Phrases
- Segment a sentence into noun groups, verb groups, and particles
- Uses the Sager 1981 grammar
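Segmentation into basic phrases can be approximated with a finite-state pattern over part-of-speech tags. The sketch below assumes the input is already tagged; the tag set, the one-letter encoding, and the two patterns are simplifications invented for this example and are far smaller than the Sager 1981 grammar the slide refers to.

```python
import re

# Assumed coarse tag set, mapped to one-letter codes so a regex can scan the tag string.
TAG_CODE = {"DET": "D", "ADJ": "J", "NUM": "C", "NOUN": "N", "VERB": "V", "ADV": "R"}

# Noun group: optional determiner, modifiers, then noun or number heads.
# Verb group: optional auxiliaries/adverbs followed by a verb.
GROUP_RE = re.compile(r"(?P<NG>D?[JC]*[NC]+)|(?P<VG>[VR]*V)")


def basic_phrases(tagged):
    """Split a list of (word, tag) pairs into noun groups, verb groups, and particles."""
    codes = "".join(TAG_CODE.get(tag, "O") for _, tag in tagged)
    phrases = []
    pos = 0
    while pos < len(codes):
        m = GROUP_RE.match(codes, pos)
        if m:
            words = " ".join(word for word, _ in tagged[m.start():m.end()])
            phrases.append((m.lastgroup, words))
            pos = m.end()
        else:
            # Anything outside a noun or verb group is passed through as a particle.
            phrases.append(("PARTICLE", tagged[pos][0]))
            pos += 1
    return phrases


tagged = [
    ("The", "DET"), ("enzyme", "NOUN"), ("had", "VERB"),
    ("a", "DET"), ("native", "ADJ"), ("molecular", "ADJ"), ("weight", "NOUN"),
    ("of", "PREP"), ("236,000", "NUM"),
]
phrases = basic_phrases(tagged)
```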

9 Complex Phrases
Attach:
- Appositives to their head noun groups
- “of” prepositional phrases to their head noun groups

10 Complex Phrases
Structures built from basic and complex phrases represent entities and events

11 Clause-Level Domain Patterns
The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.

12 Clause-Level Domain Patterns
The enzyme had a native molecular weight of 236,000 and was apparently comprised of six identical 40,000-dalton subunits.
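A rough idea of what clause-level domain patterns pull out of this sentence can be shown with two regular expressions. This is a simplified sketch: a FASTUS-style system matches patterns over the phrase structures built by the earlier stages, not over raw strings, and the pattern wording, the helper names, and the "Subunit-Weight" key are assumptions made for this example.

```python
import re

# Hypothetical surface patterns for two enzyme attributes.
WEIGHT_RE = re.compile(r"molecular weight of (?P<weight>[\d,]+)")
SUBUNIT_RE = re.compile(
    r"comprised of (?P<count>\w+) identical (?P<weight>[\d,]+)-dalton subunits"
)
NUMBER_WORDS = {"two": 2, "three": 3, "four": 4, "five": 5, "six": 6}


def to_int(text):
    """Convert a number written with thousands separators, e.g. '236,000'."""
    return int(text.replace(",", ""))


def extract_enzyme_facts(sentence):
    """Fill Enzyme attributes from one sentence; missing facts are simply absent."""
    facts = {}
    m = WEIGHT_RE.search(sentence)
    if m:
        facts["Molecular-Weight"] = to_int(m.group("weight"))
    m = SUBUNIT_RE.search(sentence)
    if m:
        count = m.group("count").lower()
        facts["Subunit-Number"] = int(count) if count.isdigit() else NUMBER_WORDS.get(count)
        facts["Subunit-Weight"] = to_int(m.group("weight"))  # assumed attribute name
    return facts


sentence = (
    "The enzyme had a native molecular weight of 236,000 and was apparently "
    "comprised of six identical 40,000-dalton subunits."
)
facts = extract_enzyme_facts(sentence)
```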

13 Merging Structures
First 4 levels: process within a single sentence
This level: collect and combine information about one entity or relationship
Three criteria:
- The internal structure of the noun groups
- Nearness along some metric
- Consistency and compatibility of the two structures
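The third criterion, consistency and compatibility, can be sketched as a unification-style merge of two partial records. This is only an illustration under stated assumptions: records are plain dictionaries, `None` means "not yet known", and the other two criteria (noun-group structure and nearness) are not modeled at all.

```python
def compatible(a, b):
    """Two partial structures are compatible if no shared, filled attribute disagrees."""
    return all(
        a[key] == b[key]
        for key in a.keys() & b.keys()
        if a[key] is not None and b[key] is not None
    )


def merge(a, b):
    """Combine two compatible structures; return None if they conflict."""
    if not compatible(a, b):
        return None
    merged = dict(a)
    for key, value in b.items():
        if merged.get(key) is None:
            merged[key] = value
    return merged


# Partial structures as they might come out of two different sentences.
first = {"Name": "gamma-Glutamyl kinase", "Molecular-Weight": None}
second = {"Molecular-Weight": 236000, "Subunit-Number": 6}
merged = merge(first, second)

# A structure describing the 40,000-dalton subunit conflicts and is kept separate.
conflict = merge(merged, {"Molecular-Weight": 40000})
```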

14

15 Compile-Time Transformations
A Subject-Verb-Object pattern is expanded into its linguistic variants (passive, relative clauses, etc.)
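The idea of compile-time transformations can be illustrated by expanding one S-V-O specification into several surface patterns before any text is processed. The sketch below is hypothetical throughout: the function name, the pattern representation, the three variants chosen, and the example verb "catalyze" are assumptions, not the actual pattern language.

```python
def expand_svo(subject_type, verb, object_type):
    """Expand one S-V-O specification into a few surface pattern variants."""
    active_vg = ("VG", verb, "active")
    passive_vg = ("VG", verb, "passive")
    return {
        # ENZYME catalyzes REACTION
        "active": [subject_type, active_vg, object_type],
        # REACTION is catalyzed by ENZYME
        "passive": [object_type, passive_vg, "by", subject_type],
        # ENZYME that catalyzes REACTION
        "relative": [subject_type, "REL-PRONOUN", active_vg, object_type],
    }


patterns = expand_svo("ENZYME", "catalyze", "REACTION")
```

The pattern author writes only the simple S-V-O form; the variants are generated once, so matching at run time stays a plain finite-state lookup.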

16 Types of Specialized Domains
“Noun-driven” approach:
- The type of an entity is highly predictive of its role in an event
- Loose S-V-O patterns
“Verb-driven” approach:
- The role of the entities in events cannot be predicted from their type
- Tight S-V-O patterns

17 Limitations of IE Technology
MUC (1990):
- Name recognition: ~95% recall and precision
- Event recognition: ~60% recall and precision
Possible reasons:
- The merging process
- Works only with explicit information
- Common cases are covered, but what about rare cases?

