Published by Lillian Steele. Modified over 10 years ago.
1
A Probabilistic Term Variant Generator for Biomedical Terms
Yoshimasa Tsuruoka and Jun'ichi Tsujii
CREST, JST
The University of Tokyo
2
Outline
- Probabilistic Term Variant Generator
- Generation Algorithm
- Application: Dictionary expansion
3
Background
- Information extraction from biomedical documents
- Recognizing technical terms (e.g., DNA, protein names)
- Example: "We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated …"
4
Technical Term Recognition
- Machine learning based
  - Identifies the regions of terms
  - Provides no ID information
- Dictionary-based
  - Compares the strings with each entry in the dictionary
  - Provides ID information
5
Problems of Dictionary-based Approaches
- Spelling variation degrades recall
  - Remedy: approximate string searching
- False positives degrade precision
  - Remedy: filtering by machine learning
6
Exact String Searching
- Example text: Phorbol myristate acetate induced Egr-1 mRNA …
- Dictionary: EGP, EGR-1, EGR-1 binding protein, …
- None of the dictionary entries matches the text exactly.
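The failure on this slide can be reproduced with a minimal sketch. The function name `exact_matches` is hypothetical and a plain case-sensitive substring test is assumed; the text and dictionary entries are the ones shown on the slide.

```python
# Dictionary entries and example text as shown on the slide.
dictionary = ["EGP", "EGR-1", "EGR-1 binding protein"]
text = "Phorbol myristate acetate induced Egr-1 mRNA"


def exact_matches(text, entries):
    """Return the entries that occur verbatim (case-sensitive) in the text."""
    return [entry for entry in entries if entry in text]


# "Egr-1" in the text differs from "EGR-1" only in case, so nothing matches.
print(exact_matches(text, dictionary))  # []
```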
7
Edit Distance
- Defines the distance between two strings as the minimum cost of a sequence of operations that turns one into the other
- Three kinds of operations:
  - Substitution
  - Insertion
  - Deletion
- Example: board → abord, cost = 2 (delete 'a' and insert 'a')
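As an illustration, the standard dynamic-programming formulation of edit distance can be sketched as follows. Unit cost for each of the three operations is an assumption, chosen because it agrees with the cost of 2 in the slide's example; `edit_distance` is a hypothetical name.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit cost per substitution, insertion, deletion."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # 2
print(edit_distance("Egr-1", "EGR-1"))  # 2
```

Under these costs the text form "Egr-1" is two substitutions away from the dictionary entry "EGR-1", which is why approximate searching can recover matches that exact searching misses.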
8
Automatic Generation of Spelling Variants
- Input to the generator: NF-Kappa B
- Generated variants (generation probability):
  - NF-Kappa B (1.0)
  - NF Kappa B (0.9)
  - NF kappa B (0.6)
  - NF kappaB (0.5)
  - NFkappaB (0.3)
  - …
- Each generated variant is associated with its generation probability.
9
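Generation Algorithm
- Recursive generation: applying an operation to a term yields a new variant, which can itself be expanded
- P(variant) = P(parent) × P(op)
- Example:
  - T cell (1.0)
  - T-cell (0.5), from T cell by an operation with probability 0.5
  - T cells (0.2), from T cell by an operation with probability 0.2
  - T-cells (0.1), since 0.5 × 0.2 = 0.1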
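A minimal sketch of the recursive generation, reproducing the "T cell" numbers from the slide. The two operations (`hyphenate`, `pluralize`), the pruning threshold, and the choice to keep the highest probability when a variant is reachable by several paths are all assumptions made for illustration, not details given in the slides.

```python
def hyphenate(term):
    """Hypothetical operation: replace the first space with a hyphen."""
    return term.replace(" ", "-", 1) if " " in term else None


def pluralize(term):
    """Hypothetical operation: append a plural 's'."""
    return term + "s" if not term.endswith("s") else None


# (operation, operation probability), with the values shown on the slide.
OPERATIONS = [(hyphenate, 0.5), (pluralize, 0.2)]


def generate_variants(term, operations, threshold=0.05,
                      probability=1.0, variants=None):
    """Recursively expand `term`; each child gets P(parent) * P(op)."""
    if variants is None:
        variants = {}
    # Stop when the probability is too low or a path at least as good is known.
    if probability < threshold or probability <= variants.get(term, 0.0):
        return variants
    variants[term] = probability
    for apply_op, p_op in operations:
        child = apply_op(term)
        if child is not None:
            generate_variants(child, operations, threshold,
                              probability * p_op, variants)
    return variants


for variant, p in sorted(generate_variants("T cell", OPERATIONS).items(),
                         key=lambda item: -item[1]):
    print(f"{variant} ({p:.1f})")
```

Because every operation probability is below 1, the probability shrinks along each path, so the recursion stops once it falls under the threshold.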
10
Collecting Examples of Spelling Variation
- Abbreviation extraction (Schwartz 2003)
- Extracts pairs of short forms and long forms
Short form   Long form
AA           Alcoholic Anonymous
             American
             Americans
             Arachidonic acid
             arachidonic acid
             anaemia
             anemia
…
11
Learning Operation Rules
- Operations for generating variants
  - Substitution
  - Deletion
  - Insertion
- Context
  - Character-level context: the two characters preceding (following) the target
- Operation probability
12
Probabilistic Rules
Probability   Left context      Target   Right context   Operation
0.96          *                          End of String   Delete
0.96          Start of String   I        m               Replace I with i
0.95          *                 H        yd              Replace H with h
…
0.75          ph                         End of String   Insert y
…
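One way to read such a rule table is as a list of context-conditioned rewrites. The sketch below applies three of the rules above to example terms. The tuple encoding, the reading of `*` as "any context", and the example terms are assumptions; the deletion rule is left out because its target is not legible in the slide.

```python
START, END, ANY = "<start>", "<end>", "*"

# (probability, left context, target, right context, replacement)
# An empty target means insertion at that position.
RULES = [
    (0.96, START, "I", "m", "i"),  # Replace I with i
    (0.95, ANY, "H", "yd", "h"),   # Replace H with h
    (0.75, "ph", "", END, "y"),    # Insert y
]


def apply_rule(term, rule):
    """Return (variant, probability) for every position where the rule fires."""
    probability, left, target, right, replacement = rule
    results = []
    for start in range(len(term) + 1):
        end = start + len(target)
        if term[start:end] != target:
            continue
        if left == ANY:
            left_ok = True
        elif left == START:
            left_ok = start == 0
        else:
            left_ok = term[:start].endswith(left)
        if right == ANY:
            right_ok = True
        elif right == END:
            right_ok = end == len(term)
        else:
            right_ok = term.startswith(right, end)
        if left_ok and right_ok:
            results.append((term[:start] + replacement + term[end:], probability))
    return results


# Example terms are illustrative only; they do not come from the slides.
print(apply_rule("Immunoglobulin", RULES[0]))  # [('immunoglobulin', 0.96)]
print(apply_rule("Hydrocortisone", RULES[1]))  # [('hydrocortisone', 0.95)]
print(apply_rule("radiograph", RULES[2]))      # [('radiography', 0.75)]
```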
13
Example (1)
Generation probability   Generated variant   Frequency
1.0 (input)              NF-kappa B          857
0.417                    NF-kappaB           692
0.417                    nF-kappa B          0
0.337                    Nf-kappa B          0
0.275                    NF kappa B          25
0.226                    NF-kappa b          0
…
14
Example (2)
Generation probability   Generated variant            Frequency
1.0 (input)              antiinflammatory effect      7
0.462                    anti-inflammatory effect     33
0.393                    antiinflammatory effects     6
0.356                    Antiinflammatory effect      0
0.286                    antiinflammatory-effect      0
0.181                    anti-inflammatory effects    23
…
15
Example (3)
Generation probability   Generated variant               Frequency
1.0 (input)              tumour necrosis factor alpha    15
0.492                    tumor necrosis factor alpha     126
0.356                    tumour necrosis factor-alpha    30
0.235                    Tumour necrosis factor alpha    2
0.175                    tumor necrosis factor alpha     182
0.115                    Tumor necrosis factor alpha     8
…
16
Application: Dictionary Expansion
- Expand each entry in the dictionary with its generated variants
- Threshold of generation probability: 0.1
- Maximum number of variants for each entry: 20
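The expansion step could look roughly like the sketch below, using the two settings from the slide. The function names, the entry ID, the inclusive threshold, and counting the original term among the 20 variants are assumptions. The toy generator simply returns fixed probabilities taken from Example (2), plus one hypothetical low-probability variant to show the cutoff.

```python
def expand_dictionary(entries, generate, threshold=0.1, max_variants=20):
    """Map every retained variant to the ID of the entry it was generated from.

    entries:  {entry_id: term}
    generate: function returning {variant: generation probability} for a term
    """
    expanded = {}
    for entry_id, term in entries.items():
        candidates = [(variant, p) for variant, p in generate(term).items()
                      if p >= threshold]
        candidates.sort(key=lambda item: item[1], reverse=True)
        for variant, _ in candidates[:max_variants]:
            # Keep the first ID if two entries produce the same variant.
            expanded.setdefault(variant, entry_id)
    return expanded


TOY_OUTPUT = {
    "antiinflammatory effect": {
        "antiinflammatory effect": 1.0,
        "anti-inflammatory effect": 0.462,
        "antiinflammatory effects": 0.393,
        "anti-inflammatory effects": 0.181,
        "anti inflammatory effect": 0.05,  # hypothetical, below the threshold
    }
}


def toy_generate(term):
    """Stand-in for the variant generator, for illustration only."""
    return TOY_OUTPUT.get(term, {term: 1.0})


entries = {"ID-1": "antiinflammatory effect"}  # hypothetical entry ID
for variant, entry_id in expand_dictionary(entries, toy_generate).items():
    print(entry_id, variant)
```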
17
Protein Name Recognition
- Information extraction
- Longest match
- GENIA corpus
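Longest-match lookup against an expanded dictionary might be sketched as follows. Whitespace tokenization, the entry IDs, and the presence of "Egr-1" as a generated variant of "EGR-1" are assumptions made for the example; the slides do not specify the matching procedure beyond "longest match".

```python
def longest_match(tokens, dictionary):
    """Scan left to right; at each position take the longest matching entry.

    dictionary: {term: entry_id}; terms may span several tokens.
    Returns a list of (start, end, term, entry_id) with end exclusive.
    """
    max_len = max((len(term.split()) for term in dictionary), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in dictionary:
                matches.append((i, i + length, candidate, dictionary[candidate]))
                i += length
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Hypothetical IDs; "Egr-1" stands for a variant added by dictionary expansion.
dictionary = {
    "EGP": "ID-1",
    "EGR-1": "ID-2",
    "Egr-1": "ID-2",
    "EGR-1 binding protein": "ID-3",
}

print(longest_match("Phorbol myristate acetate induced Egr-1 mRNA".split(), dictionary))
# [(4, 5, 'Egr-1', 'ID-2')]
print(longest_match("the EGR-1 binding protein gene".split(), dictionary))
# [(1, 4, 'EGR-1 binding protein', 'ID-3')]
```

Trying the longest candidate first is what makes "EGR-1 binding protein" win over the shorter entry "EGR-1" in the second example.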
18
Results of Dictionary Expansion
19
Conclusion
- Probabilistic variant generator
- Learning from actual examples
- Dictionary expansion by the generator improves recall without loss of precision.