Computational Linguistic Techniques Applied to Drugname Matching
Bonnie J. Dorr, University of Maryland
Greg Kondrak, University of Alberta
June 26, 2003

Drugname Matching
String matching to rank similarity between drug names
Two classes of string matching
  – orthographic: Compare strings in terms of spelling without reference to sound
  – phonological: Compare strings on the basis of a phonetic representation
Two methods of matching
  – distance: How far apart are two strings?
  – similarity: How close are two strings?

Distance and Similarity Measures: Orthographic/Phonological
Orthographic
  – Distance: string-edit
    Ex: contac / zantac = 2/6 = 0.33
  – Similarity: LCSR, DICE
    Ex (LCSR): contac / zantac = 4/6 = 0.67
    Ex (DICE): co on nt ta ac / za an nt ta ac = (2 ∙ 3)/(5+5) = 6/10 = 0.60
Phonological
  – Distance: Soundex
    Ex: contac / zantac = 1/4 = 0.25
  – Similarity: ALINE
    Ex: contac / zantac = 0.64

Distance vs. Similarity: Examples
Example 1: hordes vs lords
  – Distance = 2 (replace h with l, and delete e).
  – Similarity = 2 (bigrams or and rd in common).
Example 2: water vs wine
  – Distance = 3 (replace a w/ i, t w/ n, delete r).
  – Similarity = 0 (no bigrams in common).
We can compare (global) similarity and distance:
  – sim(w1, w2)/length
  – 1 − dist(w1, w2)/length
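The bigram counts in these two examples, and the 1 − dist/length form, can be reproduced with a short Python sketch. The function names are illustrative and do not come from the slides, and `length` is assumed here to mean the length of the longer word.

```python
from collections import Counter

def bigrams(word):
    """Return the character bigrams of a word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def shared_bigrams(w1, w2):
    """Count bigrams common to both words (multiset intersection)."""
    common = Counter(bigrams(w1)) & Counter(bigrams(w2))
    return sum(common.values())

def global_similarity_from_distance(dist, w1, w2):
    """1 - dist/length, assuming length is that of the longer word."""
    return 1 - dist / max(len(w1), len(w2))

print(shared_bigrams("hordes", "lords"))  # 2 (or, rd)
print(shared_bigrams("water", "wine"))    # 0
```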

Orthographic Distance: string-edit
Count up the number of steps it takes to transform one string into another
Examples:
  – Distance between hordes and lords is 2.
  – Distance between water and wine is 3.
For “global distance”, we can divide by length of longest string: 2/6 and 3/5 above
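As a minimal sketch of string-edit distance, the usual Levenshtein dynamic program (unit cost for insertion, deletion and substitution) reproduces the numbers on this slide. The slides do not say which cost scheme was used, so unit costs are an assumption, and the function names are illustrative.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit costs (assumed cost scheme)."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # delete a
                            curr[j - 1] + 1,      # insert b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def global_distance(s, t):
    """Edit distance divided by the length of the longer string."""
    longest = max(len(s), len(t))
    return edit_distance(s, t) / longest if longest else 0.0

print(edit_distance("hordes", "lords"))  # 2
print(edit_distance("water", "wine"))    # 3
```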

Orthographic Similarity: LCSR, DICE
LCSR: Divide length of longest common subsequence by length of longest string
  – Example: reagir and repair have longest common subsequence reair. Similarity score = 5/max(6,6) = 5/6 = 0.83
DICE: Double the number of shared character bigrams and divide by the total number of bigrams in the two strings
  – Example: reagir and repair have bigram sets {re,ea,ag,gi,ir} and {re,ep,pa,ai,ir}, respectively, and shared bigrams are {re,ir}. Similarity score = (2 ∙ 2)/(5+5) = 2/5 = 0.40
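Both measures are easy to sketch in Python. The version below follows the definitions on this slide; treating repeated bigrams as a multiset is an assumption (the slide's example has no repeats), and the function names are illustrative.

```python
from collections import Counter

def lcs_length(s, t):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(s, t):
    """LCS length divided by the length of the longer string."""
    longest = max(len(s), len(t))
    return lcs_length(s, t) / longest if longest else 0.0

def dice(s, t):
    """Twice the shared bigram count over the total bigram count."""
    bs = Counter(s[i:i + 2] for i in range(len(s) - 1))
    bt = Counter(t[i:i + 2] for i in range(len(t) - 1))
    total = sum(bs.values()) + sum(bt.values())
    shared = sum((bs & bt).values())
    return 2 * shared / total if total else 0.0

print(round(lcsr("reagir", "repair"), 2))  # 0.83
print(round(dice("reagir", "repair"), 2))  # 0.4
```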

Phonological Matching
Distance-based phonological matching
  – Soundex
Similarity-based phonological matching
  – ALINE

Phonological Distance
Soundex
Examples:
  – king and khyngge reduce to k52
  – knight and night reduce to k523 and n23
  – pulpit and phlebotomy reduce to p413

Code  Characters
0     a e h i o u w y
1     b f p v
2     c g j k q s x z
3     d t
4     l
5     m n
6     r
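A minimal Soundex sketch using the standard code table is given below. It follows the common convention of keeping the first letter, dropping vowel-class letters, collapsing adjacent repeats, and padding with zeros to four characters, so king comes out as K520 where the slide writes k52. The position-wise `soundex_distance` is an assumption about how a figure such as 1/4 for contac/zantac could be obtained; the slides do not spell it out.

```python
GROUPS = ["aehiouwy", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]
CODES = {ch: str(digit) for digit, letters in enumerate(GROUPS) for ch in letters}

def soundex(word, length=4):
    """Standard Soundex: first letter plus digits, padded or truncated."""
    letters = [c for c in word.lower() if c in CODES]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES[letters[0]]
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES[ch]
        if code != "0" and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "0" * length)[:length]

def soundex_distance(w1, w2, length=4):
    """Fraction of code positions that differ (assumed distance definition)."""
    a, b = soundex(w1, length), soundex(w2, length)
    return sum(x != y for x, y in zip(a, b)) / length

print(soundex("pulpit"), soundex("phlebotomy"))  # P413 P413
print(soundex_distance("contac", "zantac"))      # 0.25
```

The pulpit/phlebotomy collision shows the weakness the next slide discusses: truncation and vowel dropping make very different words indistinguishable.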

What went wrong?
Truncation of word to four characters
  – Alternative: Use entire string
Ignoring vowels
  – Use more sophisticated phonetic rules
Using numbers instead of decomposable features
  – Use decomposable features

Phonological Similarity
Another possible approach: Compare syllable count, initial/final sounds, stress locations
  – Misses frequently confused pairs
Alternative: Use phonological features to compare two words by their sounds.
  – x#→k(s): +consonantal, +velar, +stop, -voice
  – #x→z: +consonantal, +alveolar, +fricative, +voice
Phonological similarity of two words: Optimal match between their phonological features.
  – Zantac
  – Xanax

Kondrak – ALINE (2000)
Two fundamental components of ALINE:
  – Similarity Function: Uses linguistic feature analysis measurements based on salience, e.g., ±alveolar and ±stop more salient than ±voice
  – Method for choosing optimal alignment: creates alignment based on a weighted multi-feature analysis
Designed to align phonetic sequences for many different CL applications
  – Developed originally for identifying cognates in vocabularies of related languages (e.g., colour, couleur)
  – Feature weights can be fine-tuned for specific application.
Efficient: Dynamic programming algorithm: quadratic
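To illustrate the two components (a similarity function plus a quadratic dynamic-programming search for the best alignment), here is a toy Python sketch. It is not ALINE: the similarity function below only distinguishes identical letters, same broad class (vowel or consonant), and everything else, and the scores and skip penalty are made-up values. A real system would score phonetic segments from weighted features instead.

```python
VOWELS = set("aeiou")

def toy_similarity(a, b):
    """Hypothetical stand-in for a feature-based similarity function."""
    if a == b:
        return 1.0
    if (a in VOWELS) == (b in VOWELS):
        return 0.4  # same broad class (made-up score)
    return 0.0

def align_score(s, t, sim=toy_similarity, skip=-0.5):
    """Best global alignment score; O(len(s) * len(t)) time."""
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * skip
    for j in range(1, cols):
        score[0][j] = j * skip
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(s[i - 1], t[j - 1]),  # align two symbols
                score[i - 1][j] + skip,                         # skip a symbol of s
                score[i][j - 1] + skip,                         # skip a symbol of t
            )
    return score[-1][-1]

print(align_score("zantac", "zantac"))  # 6.0
```

Swapping in a different `sim` function or `skip` penalty changes the ranking without touching the search, which is what makes the parameters tunable per application.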

ALINE Features: Weights and Values

Places of Articulation: Numerical Values

Manner of Articulation: Numerical Values

Manner     Value  Example
stop       1.0    p, b
affricate  0.9    th
fricative  0.8    f, v

Tuning of ALINE Parameters
Parameters have default settings for the cognate matching task, but these are not appropriate for drugname matching
Parameter tuning:
  – calculate weights for drugname matching
  – “Hill Climbing” search against gold standard
Tuned parameters for drugname task
  – maximum score
  – insertion/deletion penalty
  – vowel penalty
  – phonological feature values
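Hill climbing itself is generic, so a small Python sketch may help. It repeatedly moves to the best neighbouring parameter setting and stops when no neighbour improves the objective. Everything named here is hypothetical: in the drugname task the objective would be something like precision against the gold standard, whereas `toy_objective` is a made-up function with a known peak, used only so the example runs on its own.

```python
def hill_climb(objective, start, step=0.25, max_rounds=100):
    """Steepest-ascent hill climbing over a dict of numeric parameters."""
    current = dict(start)
    current_score = objective(current)
    for _ in range(max_rounds):
        best_neighbor, best_score = None, current_score
        for name in current:
            for delta in (-step, step):
                candidate = dict(current)
                candidate[name] += delta
                score = objective(candidate)
                if score > best_score:
                    best_neighbor, best_score = candidate, score
        if best_neighbor is None:
            break  # local optimum: no neighbour scores higher
        current, current_score = best_neighbor, best_score
    return current, current_score

def toy_objective(params):
    """Made-up objective peaking at indel_penalty=1.0, vowel_penalty=-0.5."""
    return -((params["indel_penalty"] - 1.0) ** 2
             + (params["vowel_penalty"] + 0.5) ** 2)

best, score = hill_climb(toy_objective, {"indel_penalty": 0.0, "vowel_penalty": 0.0})
print(best)  # {'indel_penalty': 1.0, 'vowel_penalty': -0.5}
```

Hill climbing only finds a local optimum, so the result can depend on the starting point and step size.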

Comparison of Outputs
ALINE: 0.792   zantac xanax   zantac contac   xanax contac
EDIT:  0.500   zantac xanax   zantac contac   xanax contac
LCSR:  0.545   zantac xanax   zantac contac   xanax contac
DICE:  0.222   zantac xanax   zantac contac   xanax contac

Evaluation
Precision and recall against online gold standard: USP Quality Review, Mar, unique drug names, 399 true confusion pairs, 169,071 possible pairs (combinatorically induced)
Example (using DICE):
  atgam          ratgam
  herceptin      perceptin
  zolmitriptan   zolomitriptan
  quinidine      quinine
  cytosar        cytosar-u
  amantadine     rimantadine
  :              :
  erythrocin     erythromycin
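Precision and recall for a ranked list of candidate pairs can be sketched as follows. The cutoff `k`, the function name, and the tiny gold set are all invented for illustration; the drug names are borrowed from the slide, but which pairs count as "true" here is not taken from the USP data.

```python
def precision_recall(ranked_pairs, gold_pairs, k):
    """Precision and recall of the top-k ranked pairs against a gold set.

    Pairs are unordered, so (a, b) and (b, a) count as the same pair.
    """
    gold = {frozenset(p) for p in gold_pairs}
    top = {frozenset(p) for p in ranked_pairs[:k]}
    hits = len(top & gold)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Toy data: labels invented for this example only.
toy_gold = [("herceptin", "perceptin"), ("quinidine", "quinine")]
toy_ranked = [("perceptin", "herceptin"),
              ("atgam", "cytosar"),
              ("quinidine", "quinine")]

print(precision_recall(toy_ranked, toy_gold, k=2))  # (0.5, 0.5)
```

Sweeping `k` from 1 to the length of the ranked list gives the precision value at each recall level, which is how the different measures can be compared on one plot.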

Comparison of Precision at Different Recall Values

Precision of Techniques with Phonetic Transcription

Conclusion
Experimentation with different algorithms and their combinations against gold standard.
ALINE: Strong foundation for search modules in automating the minimization of medication errors
Fine-tuning based on comparisons with gold standard (e.g., re-weighting of phonological features).
Related to pattern recognition: Discover patterns of predictable matches based on feature values