A Probabilistic Term Variant Generator for Biomedical Terms
Yoshimasa Tsuruoka and Jun'ichi Tsujii
CREST, JST; The University of Tokyo
Outline
- Probabilistic Term Variant Generator
- Generation Algorithm
- Application: Dictionary expansion
Background
- Information extraction from biomedical documents
- Recognizing technical terms (e.g. DNA, protein names)
- Example: We measured glucocorticoid receptors ( GR ) in mononuclear leukocytes ( MNL ) isolated …
Technical Term Recognition
- Machine learning based: identifies the regions of terms; gives no ID information
- Dictionary-based: compares the strings with each entry in the dictionary; gives ID information
Problems of Dictionary-based Approaches
- Spelling variation degrades recall. Remedy: approximate string searching
- False positives degrade precision. Remedy: filtering by machine learning
Exact String Searching
- Example text: Phorbol myristate acetate induced Egr-1 mRNA …
- Dictionary: EGP; EGR-1; EGR-1 binding protein; …
- None of the dictionary entries matches the text exactly.
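The failure on this slide can be reproduced with a minimal sketch. The function name `exact_matches` is hypothetical, and plain substring comparison is assumed as the matching method; the text and dictionary entries are the ones shown on the slide.

```python
# Toy reproduction of the slide's example: exact, case-sensitive matching.
dictionary = ["EGP", "EGR-1", "EGR-1 binding protein"]
text = "Phorbol myristate acetate induced Egr-1 mRNA"


def exact_matches(text, dictionary):
    """Return the dictionary entries that occur verbatim in the text."""
    return [entry for entry in dictionary if entry in text]


# "Egr-1" differs from "EGR-1" only in letter case, so nothing is found.
print(exact_matches(text, dictionary))
```

The mention in the text differs from the dictionary entry only in letter case, which is the kind of spelling variation that the rest of the talk addresses.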
Edit Distance
- Defines the distance between two strings as the cost of a sequence of three kinds of operations: substitution, insertion, deletion
- Ex.) board → abord: Cost = 2 (delete 'a' and insert 'a')
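As an illustration, the standard unit-cost (Levenshtein) edit distance can be computed by dynamic programming. This is a generic sketch, not necessarily the cost model used in the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance (substitution, insertion, deletion)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("board", "abord"))  # 2, as on the slide
```

Under this cost model every operation costs the same, so a change of letter case is penalized as much as any other substitution.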
Automatic Generation of Spelling Variants
- Input to the generator: NF-Kappa B
- Generated variants: NF-Kappa B (1.0), NF Kappa B (0.9), NF kappa B (0.6), NF kappaB (0.5), NFkappaB (0.3), …
- Each generated variant is associated with its generation probability
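Generation Algorithm
- Recursive generation: T cell (1.0) → T-cell (0.5), T cells (0.2) → T-cells (0.1)
- The probability of a variant is the probability of the term it was generated from multiplied by the probability of the operation applied: $P_{\text{variant}} = P_{\text{source}} \times P_{\text{op}}$ (e.g. T-cells: 0.5 × 0.2 = 0.1)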
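The recursion can be sketched as follows. The two operations in `apply_ops` (hyphenation with probability 0.5, pluralization with probability 0.2) are toy stand-ins chosen to reproduce the numbers on the slide, not learned rules. Keeping the highest probability when a variant is reachable by several paths, and the `threshold` cut-off, are also assumptions of this sketch.

```python
def apply_ops(term):
    """Yield (variant, p_op) pairs for one step (toy, hypothetical operations)."""
    if " " in term:
        yield term.replace(" ", "-", 1), 0.5  # replace first space with a hyphen
    if not term.endswith("s"):
        yield term + "s", 0.2                 # append a plural "s"


def generate(term, threshold=0.05):
    """Recursively generate variants with P_variant = P_source * P_op."""
    variants = {term: 1.0}

    def expand(current, p):
        for variant, p_op in apply_ops(current):
            p_new = p * p_op
            if p_new < threshold:
                continue  # prune low-probability branches
            if p_new <= variants.get(variant, 0.0):
                continue  # already reached with an equal or higher probability
            variants[variant] = p_new
            expand(variant, p_new)

    expand(term, 1.0)
    return variants


for variant, p in sorted(generate("T cell").items(), key=lambda kv: -kv[1]):
    print(f"{variant} ({p:.1f})")
```

Because every operation probability is at most 1, probabilities never increase along a path, so the threshold guarantees that the recursion terminates.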
Collecting Examples of Spelling Variation
- Abbreviation extraction (Schwartz 2003): extracts short form and long form pairs
- Short form: AA
- Long forms: Alcoholic Anonymous American Americans Arachidonic acid arachidonic acid anaemia anemia …
Learning Operation Rules
- Operations for generating variants: substitution, deletion, insertion
- Context: character-level context, i.e. the preceding (following) two characters
- Operation probability
Probabilistic Rules
- Columns: Probability | Left context | Target | Right context | Operation
- 0.96 | * | | End of String | Delete
- 0.96 | Start of String | I | m | Replace I with i
- 0.95 | * | H | yd | Replace H with h
- …
- 0.75 | ph | | End of String | Insert y
- …
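One way to apply such context-dependent rules is sketched below. The `Rule` class, the `^` and `$` markers for start and end of string, and the reading of `*` as "any context" are assumptions of this sketch. The three rules are taken from the table above; the input terms are made-up illustrations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    left: str         # required left context; "*" = any, "^" = start of string
    target: str       # substring to rewrite; "" for an insertion
    right: str        # required right context; "*" = any, "$" = end of string
    replacement: str  # text written in place of the target
    prob: float       # operation probability


def apply_rule(term, rule):
    """Return (variant, prob) for every position where the rule applies."""
    padded = "^" + term + "$"  # make string boundaries matchable as context
    results = []
    for i in range(1, len(padded)):
        j = i + len(rule.target)
        if j > len(padded) - 1:
            break  # the target would run into the end marker
        if padded[i:j] != rule.target:
            continue
        if rule.left != "*" and not padded[:i].endswith(rule.left):
            continue
        if rule.right != "*" and not padded[j:].startswith(rule.right):
            continue
        results.append((padded[1:i] + rule.replacement + padded[j:-1], rule.prob))
    return results


rules = [
    Rule("^", "I", "m", "i", 0.96),   # Replace I with i at start of string
    Rule("*", "H", "yd", "h", 0.95),  # Replace H with h before "yd"
    Rule("ph", "", "$", "y", 0.75),   # Insert y after "ph" at end of string
]

for term in ["Immune response", "Hydrogen peroxide", "radiograph"]:
    for rule in rules:
        for variant, p in apply_rule(term, rule):
            print(f"{term} -> {variant} ({p})")
```

An insertion is modeled as a rule with an empty target, so the same matching loop handles substitution, deletion (empty replacement), and insertion.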
Example (1)
- Columns: Generation Probability | Generated Variants | Frequency
- 1.0 (Input): NF-kappa B
- Generated variants: NF-kappaB; nF-kappa B; Nf-kappa B; NF kappa B; NF-kappa b (frequency: 0); …
Example (2)
- Columns: Generation Probability | Generated Variants | Frequency
- 1.0 (Input): antiinflammatory effect
- Generated variants: anti-inflammatory effect; antiinflammatory effects; Antiinflammatory effect; antiinflammatory-effect; anti-inflammatory effects (frequency: 23); …
Example (3)
- Columns: Generation Probability | Generated Variants | Frequency
- 1.0 (Input): tumour necrosis factor alpha
- Generated variants: tumor necrosis factor alpha; tumour necrosis factor-alpha; Tumour necrosis factor alpha; tumor necrosis factor alpha; Tumor necrosis factor alpha (frequency: 8); …
Application: Dictionary Expansion
- Expanding each entry in the dictionary
- Threshold of generation probability: 0.1
- Max number of variants for each entry: 20
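The expansion step with these two settings can be sketched as follows. The names `expand_dictionary` and `toy_generate` and the entry ID `P1` are hypothetical. The generator is a stub that returns the NF-Kappa B variants from the earlier slide, and counting the original term toward the limit of 20 is an assumption.

```python
def toy_generate(term):
    """Stub generator (hypothetical): variant -> generation probability."""
    table = {
        "NF-Kappa B": {
            "NF-Kappa B": 1.0,
            "NF Kappa B": 0.9,
            "NF kappa B": 0.6,
            "NF kappaB": 0.5,
            "NFkappaB": 0.3,
        }
    }
    return table.get(term, {term: 1.0})


def expand_dictionary(entries, generate, threshold=0.1, max_variants=20):
    """Map every retained variant string back to the ID of its source entry."""
    expanded = {}
    for entry_id, term in entries.items():
        scored = generate(term)
        # Keep variants at or above the threshold, most probable first.
        kept = sorted(
            (v for v in scored if scored[v] >= threshold),
            key=lambda v: scored[v],
            reverse=True,
        )[:max_variants]
        for variant in kept:
            # If two entries produce the same variant, the first one wins.
            expanded.setdefault(variant, entry_id)
    return expanded


entries = {"P1": "NF-Kappa B"}  # hypothetical entry ID
print(expand_dictionary(entries, toy_generate))
```

Because each variant keeps the ID of the entry it came from, a match against a variant still yields the ID information that the dictionary-based approach is meant to provide.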
Protein Name Recognition
- Information extraction
- Longest match
- GENIA corpus
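Longest-match lookup can be sketched as below. Whitespace tokenization, the `max_len` limit, and the function name `longest_match` are assumptions of this sketch; the slide does not say how matching was implemented. The dictionary entries are the ones from the exact string searching slide, and the sentence is a made-up illustration.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Scan left to right, preferring the longest dictionary entry at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                matches.append((i, i + n, candidate))
                i += n  # continue after the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return matches


dictionary = {"EGP", "EGR-1", "EGR-1 binding protein"}
tokens = "the EGR-1 binding protein was induced".split()
print(longest_match(tokens, dictionary))
```

Here the three-token entry is preferred over the shorter entry "EGR-1" that starts at the same position.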
Results of Dictionary Expansion
Conclusion
- Probabilistic variant generator
- Learning from actual examples
- Dictionary expansion by the generator improves recall without loss of precision.