Published by Percival Benson. Modified over 9 years ago.
1
Computational Linguistic Techniques Applied to Drugname Matching
Bonnie J. Dorr, University of Maryland
Greg Kondrak, University of Alberta
June 26, 2003
2
Drugname Matching
String matching to rank similarity between drug names
Two classes of string matching
  – orthographic: Compare strings in terms of spelling, without reference to sound
  – phonological: Compare strings on the basis of a phonetic representation
Two methods of matching
  – distance: How far apart are two strings?
  – similarity: How close are two strings?
3
Distance and Similarity Measures: Orthographic/Phonological
Orthographic
  – Distance: string-edit
    Ex: contac / zantac = 2/6 = 0.33
  – Similarity: LCSR, DICE
    Ex (LCSR): contac / zantac = 4/6 = 0.67
    Ex (DICE): co on nt ta ac / za an nt ta ac = (2 ∙ 3)/(5 + 5) = 6/10 = 0.60
Phonological
  – Distance: Soundex
    Ex: contac / zantac = 1/4 = 0.25
  – Similarity: ALINE
    Ex: contac / zantac = 0.64
4
Distance vs. Similarity: Examples
Example 1: hordes vs. lords
  – Distance = 2 (replace h with l, and delete e).
  – Similarity = 2 (bigrams or and rd in common).
Example 2: water vs. wine
  – Distance = 3 (replace a with i, replace t with n, delete r).
  – Similarity = 0 (no bigrams in common).
We can compare (global) similarity and distance:
  – sim(w1, w2)/length
  – 1 − dist(w1, w2)/length
5
Orthographic Distance: string-edit
Count the number of steps it takes to transform one string into another
Examples:
  – Distance between hordes and lords is 2.
  – Distance between water and wine is 3.
For "global distance", we can divide by the length of the longest string: 2/6 and 3/5 above
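The step counting described on this slide can be sketched as a standard Levenshtein computation. This is a minimal illustration, assuming unit cost for each insertion, deletion, and substitution; the function names `edit_distance` and `normalized_distance` are chosen here and do not come from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b (unit costs assumed)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """"Global distance": edit distance divided by the longer length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


print(edit_distance("hordes", "lords"))   # 2
print(edit_distance("water", "wine"))     # 3
print(normalized_distance("water", "wine"))
```

Only two rows of the table are kept at a time, so memory grows with the shorter dimension while the running time stays proportional to the product of the two lengths.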
6
Orthographic Similarity: LCSR, DICE
LCSR: Divide the length of the longest common subsequence by the length of the longest string
  – Example: reagir and repair have longest common subsequence reair. Similarity score = 5/max(6,6) = 5/6 = 0.83
DICE: Double the number of shared character bigrams and divide by the total number of bigrams in the two strings
  – Example: reagir and repair have bigram sets {re,ea,ag,gi,ir} and {re,ep,pa,ai,ir}, respectively, and shared bigrams are {re,ir}. Similarity score = (2 ∙ 2)/(5+5) = 2/5 = 0.40
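Both measures on this slide can be sketched in a few lines. This is an illustration under stated assumptions: the names `lcs_length`, `lcsr`, `bigrams`, and `dice` are chosen here, and repeated bigrams are counted as a multiset (the slide speaks of bigram sets; the two readings agree whenever no bigram repeats within a name).

```python
from collections import Counter


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """LCS length divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 0.0


def bigrams(s: str) -> Counter:
    """Multiset of adjacent character pairs (assumption: multiset, not set)."""
    return Counter(s[i:i + 2] for i in range(len(s) - 1))


def dice(a: str, b: str) -> float:
    """Twice the shared bigram count over the total bigram count."""
    ba, bb = bigrams(a), bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    shared = sum((ba & bb).values())   # multiset intersection
    return 2 * shared / total if total else 0.0


print(lcsr("reagir", "repair"))   # 5/6
print(dice("reagir", "repair"))   # 4/10
```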
7
Phonological Matching
Distance-based phonological matching
  – Soundex
Similarity-based phonological matching
  – ALINE
8
Phonological Distance
Soundex
Examples:
  – king and khyngge reduce to k52
  – knight and night reduce to k523 and n23
  – pulpit and phlebotomy reduce to p413

Code  Characters
0     a e h i o u w y
1     b f p v
2     c g j k q s x z
3     d t
4     l
5     m n
6     r
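The reductions on this slide can be reproduced with a simplified Soundex sketch. It is an assumption-laden variant rather than the official algorithm: it treats h and w as code 0 like the vowels (as the code table does), collapses adjacent equal codes, drops code 0, truncates to four characters, and does not pad short codes with zeros. The names `CODE_OF` and `soundex` are chosen here.

```python
# Code table from the slide; code 0 letters are dropped after collapsing.
SOUNDEX_CODES = {
    "0": "aehiouwy",
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
CODE_OF = {ch: code for code, chars in SOUNDEX_CODES.items() for ch in chars}


def soundex(word: str, length: int = 4) -> str:
    """Simplified Soundex: first letter plus up to length-1 digits, unpadded."""
    letters = [ch for ch in word.lower() if ch in CODE_OF]
    if not letters:
        return ""
    codes = [CODE_OF[ch] for ch in letters]
    collapsed = [codes[0]]
    for code in codes[1:]:
        if code != collapsed[-1]:        # collapse runs of the same code
            collapsed.append(code)
    digits = [c for c in collapsed[1:] if c != "0"]   # skip first letter's code
    return (letters[0] + "".join(digits))[:length]


for name in ("king", "khyngge", "knight", "night", "pulpit", "phlebotomy"):
    print(name, soundex(name))
```

Under the same assumptions, contac and zantac reduce to c532 and z532, which differ in one of four positions.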
9
What went wrong?
Truncation of word to four characters
  – Alternative: Use entire string
Ignoring vowels
  – Alternative: Use more sophisticated phonetic rules
Using numbers instead of decomposable features
  – Alternative: Use decomposable features
10
Phonological Similarity
Another possible approach: Compare syllable count, initial/final sounds, stress locations
  – Misses frequently confused pairs
Alternative: Use phonological features to compare two words by their sounds.
  – x# → k(s): +consonantal, +velar, +stop, -voice
  – #x → z: +consonantal, +alveolar, +fricative, +voice
Phonological similarity of two words: Optimal match between their phonological features.
  – Zantac
  – Xanax
11
Kondrak – ALINE (2000)
Two fundamental components of ALINE:
  – Similarity function: Uses linguistic feature analysis, with measurements based on salience, e.g., ±alveolar and ±stop are more salient than ±voice
  – Method for choosing optimal alignment: Creates alignment based on a weighted multi-feature analysis
Designed to align phonetic sequences for many different CL applications
  – Developed originally for identifying cognates in vocabularies of related languages (e.g., colour, couleur)
  – Feature weights can be fine-tuned for specific application.
Efficient: Dynamic programming algorithm, quadratic
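The two components named on this slide, a salience-weighted similarity function and a dynamic-programming alignment, can be illustrated with a toy sketch. This is not ALINE itself: it is a plain global alignment over a tiny consonant inventory. The feature table, the salience weights, the skip penalty, and the names `segment_similarity` and `aligned_similarity` are all hypothetical, chosen only so that place and manner weigh more than voice.

```python
# Hypothetical feature values per segment: (place, manner, voice, nasal).
# Stop = 1.0 and fricative = 0.8 follow the manner-of-articulation slide;
# every other number is an illustrative guess, not a published ALINE value.
FEATURE_NAMES = ("place", "manner", "voice", "nasal")
SEGMENTS = {
    "p": (1.00, 1.0, 0.0, 0.0),
    "b": (1.00, 1.0, 1.0, 0.0),
    "t": (0.85, 1.0, 0.0, 0.0),
    "d": (0.85, 1.0, 1.0, 0.0),
    "n": (0.85, 1.0, 1.0, 1.0),
    "s": (0.85, 0.8, 0.0, 0.0),
    "z": (0.85, 0.8, 1.0, 0.0),
    "k": (0.60, 1.0, 0.0, 0.0),
    "g": (0.60, 1.0, 1.0, 0.0),
}
# Hypothetical salience weights (sum to 1): place and manner outweigh voice.
SALIENCE = {"place": 0.4, "manner": 0.4, "voice": 0.1, "nasal": 0.1}
SKIP_PENALTY = 0.2  # hypothetical cost of leaving a segment unaligned


def segment_similarity(p: str, q: str) -> float:
    """1.0 for identical segments, lower as weighted feature gaps grow."""
    diff = sum(
        SALIENCE[name] * abs(x - y)
        for name, x, y in zip(FEATURE_NAMES, SEGMENTS[p], SEGMENTS[q])
    )
    return 1.0 - diff


def aligned_similarity(a: str, b: str) -> float:
    """Best global alignment score divided by the longer length.
    Fills a (len(a)+1) x (len(b)+1) table, hence quadratic time."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] - SKIP_PENALTY
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] - SKIP_PENALTY
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + segment_similarity(a[i - 1], b[j - 1]),
                score[i - 1][j] - SKIP_PENALTY,   # skip a segment of a
                score[i][j - 1] - SKIP_PENALTY,   # skip a segment of b
            )
    longest = max(len(a), len(b))
    return score[-1][-1] / longest if longest else 0.0


# Consonant skeletons only, since the toy table has no vowels.
print(aligned_similarity("zntk", "sntk"))
print(aligned_similarity("zntk", "kntk"))
```

Because the segment score is built from decomposable features, z scores closer to s (same place and manner, different voicing) than to k, which a single-digit code cannot express.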
12
ALINE Features: Weights and Values
13
Places of Articulation: Numerical Values
14
Manner of Articulation: Numerical Values

Manner     Value  Example
stop       1.0    p, b
affricate  0.9    th
fricative  0.8    f, v
15
Tuning of ALINE Parameters
Parameters have default settings for the cognate matching task, but these are not appropriate for drugname matching
Parameter tuning:
  – calculate weights for drugname matching
  – "Hill Climbing" search against gold standard
Tuned parameters for drugname task
  – maximum score
  – insertion/deletion penalty
  – vowel penalty
  – phonological feature values
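The hill-climbing search mentioned on this slide can be sketched generically. The slides do not describe the actual search procedure, so this is only one plausible form: perturb one parameter at a time and keep the change if the objective improves. The function `hill_climb`, its arguments, and the toy objective are hypothetical; in the drugname setting the objective would be some score against the gold standard, such as precision.

```python
import random


def hill_climb(params, objective, step=0.05, iterations=500, seed=0):
    """Greedy search: nudge one randomly chosen parameter up or down by
    `step` and keep the change only if `objective` strictly improves."""
    rng = random.Random(seed)
    best = dict(params)                 # do not mutate the caller's dict
    best_score = objective(best)
    names = sorted(best)
    for _ in range(iterations):
        candidate = dict(best)
        name = rng.choice(names)
        candidate[name] += rng.choice((-step, step))
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy stand-in objective with a known optimum at x = 0.3, y = 0.7.
def toy_objective(p):
    return -((p["x"] - 0.3) ** 2 + (p["y"] - 0.7) ** 2)


start = {"x": 0.0, "y": 0.0}
tuned, tuned_score = hill_climb(start, toy_objective)
print(tuned, tuned_score)
```

Greedy search of this kind can stall at a local optimum on a less regular objective, which is one reason to restart from several initial settings.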
16
Comparison of Outputs

        zantac/xanax  zantac/contac  xanax/contac
ALINE   0.792         0.639          0.486
EDIT    0.500         0.667          0.333
LCSR    0.545         0.667          0.364
DICE    0.222         0.600          0.000
17
Evaluation
Precision and recall against online gold standard: USP Quality Review, Mar. 2001
582 unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorically induced)
Example (using DICE):
  + 0.889 atgam / ratgam
  + 0.875 herceptin / perceptin
  - 0.870 zolmitriptan / zolomitriptan
  + 0.857 quinidine / quinine
  - 0.857 cytosar / cytosar-u
  + 0.842 amantadine / rimantadine
  :
  - 0.800 erythrocin / erythromycin
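The evaluation on this slide can be sketched as precision measured at a chosen recall level over a ranked list of pairs. This is an illustration, assuming that "+" marks a pair found in the gold standard and "-" one that is not; the function `precision_at_recall` and the toy gold set (only the six top-ranked rows shown on the slide, not the real 399 pairs) are constructed here.

```python
from math import comb

# Unordered pairs among 582 names, matching the count on the slide.
POSSIBLE_PAIRS = comb(582, 2)   # 169071


def precision_at_recall(ranked, gold, recall_level):
    """Walk down the ranking until `recall_level` of the gold pairs has
    been found; return precision at that point, or None if never reached."""
    needed = recall_level * len(gold)
    hits = 0
    for rank, pair in enumerate(ranked, start=1):
        if frozenset(pair) in gold:
            hits += 1
            if hits >= needed:
                return hits / rank
    return None


# Top six rows of the DICE example, in ranked order.
ranked = [
    ("atgam", "ratgam"),
    ("herceptin", "perceptin"),
    ("zolmitriptan", "zolomitriptan"),
    ("quinidine", "quinine"),
    ("cytosar", "cytosar-u"),
    ("amantadine", "rimantadine"),
]
# Toy gold set: the four rows marked "+" (assumed to be true confusion pairs).
toy_gold = {
    frozenset(p)
    for p in [ranked[0], ranked[1], ranked[3], ranked[5]]
}

for level in (0.5, 0.75, 1.0):
    print(level, precision_at_recall(ranked, toy_gold, level))
```

Storing each pair as a `frozenset` makes the lookup independent of the order in which the two names are listed.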
18
Comparison of Precision at Different Recall Values
19
Precision of Techniques with Phonetic Transcription
20
Conclusion
Experimentation with different algorithms and their combinations against gold standard.
ALINE: Strong foundation for search modules in automating the minimization of medication errors.
Fine-tuning based on comparisons with gold standard (e.g., re-weighting of phonological features).
Related to pattern recognition: Discover patterns of predictable matches based on feature values.