A Probabilistic Term Variant Generator for Biomedical Terms. Yoshimasa Tsuruoka and Jun'ichi Tsujii. CREST, JST; The University of Tokyo.


1 A Probabilistic Term Variant Generator for Biomedical Terms. Yoshimasa Tsuruoka and Jun'ichi Tsujii. CREST, JST; The University of Tokyo

2 Outline: Probabilistic Term Variant Generator; Generation Algorithm; Application: Dictionary expansion

3 Background: Information extraction from biomedical documents. Recognizing technical terms (e.g. DNA, protein names). Example: "We measured glucocorticoid receptors (GR) in mononuclear leukocytes (MNL) isolated …"

4 Technical Term Recognition. Machine learning based: identifies the regions of terms; gives no ID information. Dictionary-based: compares strings in the text with each entry in the dictionary; gives ID information.

5 Problems of Dictionary-based Approaches. Spelling variation degrades recall (remedy: approximate string searching). False positives degrade precision (remedy: filtering by machine learning).

6 Exact String Searching: Example. Text: "Phorbol myristate acetate induced Egr-1 mRNA …" Dictionary: EGP; EGR-1; EGR-1 binding protein; … None of the entries matches the text exactly.

7 Edit Distance. Defines the distance between two strings as the minimum cost of a sequence of three kinds of operations: substitution, insertion, deletion. Ex.) board → abord: cost = 2 (delete 'a' and insert 'a')
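The definition above can be sketched as a standard dynamic-programming routine. This is a minimal illustration assuming unit cost for every operation (the slides do not state the costs used); the function name `edit_distance` is ours, not the authors'.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance, assuming cost 1 for substitution, insertion, deletion."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting i characters turns source[:i] into ""
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # the slide's example
print(edit_distance("Egr-1", "EGR-1"))  # text string vs. dictionary entry from slide 6
```

Under these unit costs the text string "Egr-1" is two substitutions away from the dictionary entry "EGR-1", which is why approximate searching can recover matches that exact searching misses.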

8 Automatic Generation of Spelling Variants. Input: NF-Kappa B → Variant Generator → Output: NF-Kappa B (1.0), NF Kappa B (0.9), NF kappa B (0.6), NF kappaB (0.5), NFkappaB (0.3), … Each generated variant is associated with its generation probability.

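9 Generation Algorithm. Recursive generation: applying an operation to a term yields a variant with probability $P_{\mathrm{variant}} = P_{\mathrm{parent}} \times P_{\mathrm{op}}$. Example: T cell (1.0) → T-cell (0.5), with $P_{\mathrm{op}} = 0.5$; T cell (1.0) → T cells (0.2), with $P_{\mathrm{op}} = 0.2$; applying both operations gives T-cells (0.1 = 0.5 × 0.2).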
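A minimal sketch of the recursive generation on this slide follows. The two operations (`hyphenate`, `pluralize`) and their probabilities are hypothetical stand-ins chosen to reproduce the "T cell" example; the actual generator uses learned, context-dependent rules. The pruning threshold is also an assumption.

```python
def hyphenate(term):
    """Hypothetical operation: replace the first space with a hyphen."""
    return term.replace(" ", "-", 1) if " " in term else None


def pluralize(term):
    """Hypothetical operation: append 's' unless the term already ends in 's'."""
    return term + "s" if not term.endswith("s") else None


# (operation, operation probability) pairs; values taken from the slide's example
OPERATIONS = [(hyphenate, 0.5), (pluralize, 0.2)]


def generate_variants(term, prob=1.0, threshold=0.05, found=None):
    """Recursively apply operations; each variant gets parent probability x P_op."""
    if found is None:
        found = {}
    # Prune low-probability branches and variants already reached at least as cheaply.
    if prob < threshold or prob <= found.get(term, 0.0):
        return found
    found[term] = prob
    for operation, p_op in OPERATIONS:
        variant = operation(term)
        if variant is not None:
            generate_variants(variant, prob * p_op, threshold, found)
    return found


for variant, p in sorted(generate_variants("T cell").items(), key=lambda kv: -kv[1]):
    print(f"{variant} ({p:.1f})")
```

Keeping only the highest probability seen for each string means a variant reachable by several operation orders (here "T-cells") is stored once.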

10 Collecting Examples of Spelling Variation. Abbreviation extraction (Schwartz 2003): extracts pairs of short and long forms. Short form / Long form: AA / Alcoholic Anonymous American Americans Arachidonic acid arachidonic acid anaemia anemia …

11 Learning Operation Rules. Operations for generating variants: substitution, deletion, insertion. Context: character-level context, i.e. the two characters preceding and the two characters following the target. Operation probability.

12 Probabilistic Rules
Probability | Left context | Target | Right context | Operation
0.96 | * | | End of String | Delete
0.96 | Start of String | I | m | Replace I with i
0.95 | * | H | yd | Replace H with h
…
0.75 | ph | | End of String | Insert y
…
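One way such context-sensitive rules could be represented and applied is sketched below. The `Rule` class, the `"^"` and `"$"` markers for start and end of string, and the reading of `*` as "any context" (encoded here as an empty string) are all assumptions made for illustration; the input words are hypothetical examples, not data from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    probability: float
    left: str         # text required before the target; "^" = start of string, "" = any
    target: str       # text to rewrite; "" for a pure insertion
    right: str        # text required after the target; "$" = end of string, "" = any
    replacement: str  # text that replaces the target


def apply_rule(term, rule):
    """Return every string obtained by applying the rule once to the term."""
    results = []
    for start in range(len(term) + 1):
        end = start + len(rule.target)
        if term[start:end] != rule.target:
            continue
        if rule.left == "^":
            left_ok = start == 0
        else:
            left_ok = term[:start].endswith(rule.left)
        if rule.right == "$":
            right_ok = end == len(term)
        else:
            right_ok = term[end:].startswith(rule.right)
        if left_ok and right_ok:
            results.append(term[:start] + rule.replacement + term[end:])
    return results


lowercase_h = Rule(0.95, "", "H", "yd", "h")    # "Replace H with h"
lowercase_i = Rule(0.96, "^", "I", "m", "i")    # "Replace I with i"
insert_y = Rule(0.75, "ph", "", "$", "y")       # "Insert y"

print(apply_rule("Hydrogen", lowercase_h))
print(apply_rule("Immunoglobulin", lowercase_i))
print(apply_rule("hypertroph", insert_y))
```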

13 Example (1)
Generation Probability | Generated Variant | Frequency
1.0 (input) | NF-kappa B | 857
0.417 | NF-kappaB | 692
0.417 | nF-kappa B | 0
0.337 | Nf-kappa B | 0
0.275 | NF kappa B | 25
0.226 | NF-kappa b | 0
…

14 Example (2)
Generation Probability | Generated Variant | Frequency
1.0 (input) | antiinflammatory effect | 7
0.462 | anti-inflammatory effect | 33
0.393 | antiinflammatory effects | 6
0.356 | Antiinflammatory effect | 0
0.286 | antiinflammatory-effect | 0
0.181 | anti-inflammatory effects | 23
…

15 Example (3)
Generation Probability | Generated Variant | Frequency
1.0 (input) | tumour necrosis factor alpha | 15
0.492 | tumor necrosis factor alpha | 126
0.356 | tumour necrosis factor-alpha | 30
0.235 | Tumour necrosis factor alpha | 2
0.175 | tumor necrosis factor-alpha | 182
0.115 | Tumor necrosis factor alpha | 8
…

16 Application: Dictionary Expansion Expanding each entry in the dictionary Threshold of Generation Probability: 0.1 Max number of variants for each entry: 20
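The expansion step can be sketched as follows, using the threshold (0.1) and cap (20) from this slide. The generator is replaced by a toy lookup holding the variants listed on slide 13; the entry ID `"ID-1"`, the function names, and the choice to count the original term toward the cap are assumptions.

```python
# Variants and probabilities copied from slide 13; a real generator would compute them.
SLIDE_13_VARIANTS = {
    "NF-kappa B": [
        ("NF-kappa B", 1.0), ("NF-kappaB", 0.417), ("nF-kappa B", 0.417),
        ("Nf-kappa B", 0.337), ("NF kappa B", 0.275), ("NF-kappa b", 0.226),
    ]
}


def toy_generate(term):
    """Stand-in for the variant generator: returns (variant, probability) pairs."""
    return SLIDE_13_VARIANTS.get(term, [(term, 1.0)])


def expand_dictionary(entries, generate, threshold=0.1, max_variants=20):
    """Map each sufficiently probable variant string to the ID of its entry."""
    expanded = {}
    for entry_id, term in entries.items():
        candidates = [(v, p) for v, p in generate(term) if p >= threshold]
        candidates.sort(key=lambda vp: vp[1], reverse=True)
        for variant, _ in candidates[:max_variants]:
            # Keep the first ID seen if two entries generate the same variant.
            expanded.setdefault(variant, entry_id)
    return expanded


dictionary = {"ID-1": "NF-kappa B"}  # hypothetical entry ID
for variant, entry_id in expand_dictionary(dictionary, toy_generate).items():
    print(entry_id, variant)
```

Because every variant keeps the ID of the entry it came from, the expanded dictionary still provides the ID information that motivates the dictionary-based approach.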

17 Protein Name Recognition Information Extraction Longest match GENIA corpus
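Longest-match lookup against a dictionary can be sketched as below. Whitespace tokenization, the `max_len` window, and the sample sentence are assumptions for illustration; the slides do not describe the matching procedure in more detail.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Scan left to right, taking the longest dictionary entry at each position."""
    matches = []
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length])
            if candidate in dictionary:
                matches.append((i, i + length, candidate))
                i += length  # continue after the matched span
                break
        else:
            i += 1  # no entry starts at this token
    return matches


entries = {"EGP", "EGR-1", "EGR-1 binding protein"}  # dictionary from slide 6
tokens = "the EGR-1 binding protein and EGR-1".split()  # hypothetical sentence
print(longest_match(tokens, entries))
```

Preferring the longest entry keeps "EGR-1 binding protein" from being reported as the shorter entry "EGR-1".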

18 Results of Dictionary Expansion [results figure not preserved in the transcript]

19 Conclusion. Probabilistic variant generator: learns from actual examples. Dictionary expansion by the generator improves recall without loss of precision.

