Using String Similarity Metrics for Terminology Recognition Jonathan Butters March 2008 LREC 2008 – Marrakech, Morocco.



Introduction - Terms
Objects are discrete; the terms people use to describe objects are usually not!
Different groups of people tend to use different terms to refer to identical objects: sublanguage (Harris, 1968).
Terms can differ due to orthographical differences, abbreviations, acronyms and synonyms.
Football, Foot-ball, Soccerball, Footy, Icosahedron
Design, Personnel, Maintenance, Other Countries!

Introduction – Relating Terms
There are many applications where the ability to relate different terms would be useful, for example:
Predictive text suggestions
Matching component concepts
Reducing lists of options
String similarity metrics can be used to relate terms.
String similarity metrics inherently take into consideration aspects such as word order, acronyms and abbreviations.

An Example Application
We have a list of locations.
Some are similar.
Most are dissimilar (irrelevant).
How do we choose the most similar? Top 10? Top 100? Top 1000? Top 10%?

Introduction – Selecting Terms
Background in Aerospace Engineering, specialising in Avionics.
Electronic noise is a problem, but it can be filtered!
Can dissimilar string matches be identified as noise?
Can this noise be removed... automatically?

String Similarity Metrics

Introduction – Similarity Metrics
String metrics automatically calculate how similar (or dissimilar) two strings are.
Two strings are identical if they have the same characters in the same order.
Each similarity measure assigns a numeric value based on the relative similarity between the two strings.
Measures can be vector based or cost based.

Metric Selection - Examples
Query String = “language resources and evaluation conference 2008”
String A = “language resources and evaluation conference 2009”
String B = “lrec 2008”
[Table: Metric Name, String A score, String B score for Levenshtein, Monge Elkan, Jaro Winkler, Euclidean Distance and Jaccard Similarity; score values missing]

Metric Selection - SimMetrics
SimMetrics: a Java library of 23 string similarity metrics.
Developed at the University of Sheffield (Chapman, 2004).
Outputs a normalised similarity score!
[Table: Metric Name, String A score, String B score for Levenshtein, Monge Elkan, Jaro Winkler, Euclidean Distance and Jaccard Similarity; score values missing]
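
To make the idea of a normalised similarity score concrete, here is a minimal Python sketch of a cost-based metric (Levenshtein) applied to the example strings from the slides. It is not SimMetrics itself (that is a Java library). The function names are hypothetical, and scaling the edit distance by the length of the longer string is an assumption about how such a score is commonly normalised.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Scale the edit distance into [0, 1]; 1.0 means the strings are identical.

    Assumption: normalise by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


query = "language resources and evaluation conference 2008"
string_a = "language resources and evaluation conference 2009"
string_b = "lrec 2008"

print("String A:", levenshtein_similarity(query, string_a))
print("String B:", levenshtein_similarity(query, string_b))
```

Under this normalisation String A, which differs from the query by one character, scores far higher than the acronym form String B, which is one reason a single metric does not capture every kind of term variation.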

Metric Selection

Metric Selection - Investigation
Investigation focused on Aerospace domain terms.
Reduce the list of components presented to the user.
298 automatically extracted sublanguage engine component terms.
513 official component terms.
The similarity of each combination of the 298 terms was calculated: C(298, 2) = 44,253 comparisons.
Carried out for each of the 23 metrics in SimMetrics.
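
The all-pairs comparison can be sketched in a few lines of Python. This is an illustration only: the `similarity` function below uses `difflib.SequenceMatcher` from the standard library as a stand-in metric, not one of the 23 SimMetrics metrics, and `score_all_pairs` is a hypothetical helper name. The four terms are taken from the example slide.

```python
from difflib import SequenceMatcher
from itertools import combinations
from math import comb


def similarity(a: str, b: str) -> float:
    """Stand-in metric (assumption): difflib's ratio, which lies in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def score_all_pairs(terms, metric):
    """Score every unordered pair of distinct terms exactly once."""
    return [(a, b, metric(a, b)) for a, b in combinations(terms, 2)]


terms = [
    "Air Oil Heat Exchanger",
    "Air/Oil Heat Exchanger",
    "Starter Valve",
    "Tail Bearing Oil Feed Tube ASSY.",
]

scored = score_all_pairs(terms, similarity)
print(len(scored), "pairs for", len(terms), "terms")
print(comb(298, 2), "pairs for 298 terms")
```

The number of pairs grows quadratically with the number of terms, so for 298 terms each metric has to be evaluated 44,253 times.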

Metric Selection - Investigation
For each metric, the string pairs (and their scores) were ordered by decreasing similarity.
Few string pairs scored highly, and those were spread over a wide similarity band.
The vast majority scored low.
The scores were divided into bands, and the number of string pairs that scored within each band was totalled.
The distribution graphs were Gaussian or Dirac, depending on the scoring mechanism of the similarity metric.
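
The ordering and banding step might look like the following Python sketch. The number of bands (10) and the helper name `band_counts` are assumptions, and the scores are made up purely for illustration; the slides do not state the band width that was used.

```python
from collections import Counter


def band_counts(scores, n_bands=10):
    """Count how many scores fall into each of n_bands equal-width bands over [0, 1].

    A score of exactly 1.0 is placed in the top band.
    """
    counts = Counter(min(int(score * n_bands), n_bands - 1) for score in scores)
    return [counts.get(band, 0) for band in range(n_bands)]


# Made-up scores, for illustration only.
scores = [0.05, 0.12, 0.18, 0.15, 0.22, 0.95, 1.0]

ranked = sorted(scores, reverse=True)  # decreasing similarity
histogram = band_counts(scores)

for band, count in enumerate(histogram):
    print(f"{band / 10:.1f}-{(band + 1) / 10:.1f}: {'#' * count}")
```

Plotting the band totals gives the distribution graph for a metric: a tall peak of low scores with a thin tail of high scores.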

Metric Selection - Results
[Figures: Dirac distributions; Gaussian distributions]

Metric Selection - Levenshtein
Levenshtein was selected because:
Jaro-Winkler gave consistently relatively high scores to unrelated strings.
Levenshtein grouped dissimilar strings further towards the lower end of the scale, leaving the more similar strings spread over a wider range.

Metric Selection - Example
“Air Oil Heat Exchanger” & “Air/Oil Heat Exchanger”
“Starter Valve” & “Tail Bearing Oil Feed Tube ASSY.”

Noise Detection & Removal
The peak is formed by the strings that are dissimilar.
If two random strings are compared, they will have a random similarity score.
As there are many randomly similar string pairs, their scores form a Gaussian noise pattern...
Approximately 100% of a normally distributed variable falls below four standard deviations above the mean.
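
The "approximately 100%" figure can be made explicit, assuming the noise scores X follow an exact normal distribution with mean μ and standard deviation σ (an idealisation; real score distributions are only approximately Gaussian). Φ denotes the standard normal cumulative distribution function.

```latex
X \sim \mathcal{N}(\mu, \sigma^2)
\quad\Longrightarrow\quad
P(X < \mu + 4\sigma) = \Phi(4) \approx 0.99997,
\qquad
P(X \ge \mu + 4\sigma) = 1 - \Phi(4) \approx 3.2 \times 10^{-5}
```

Under that idealisation, out of 44,253 purely random comparisons only about one or two would be expected to score above the cut-off by chance.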

Noise Detection & Removal
Strings that scored outside the randomly distributed scores were... by definition, not randomly distributed!
Strings that were not randomly distributed tended to include terms that were relevant to one another!...
The noise peak can be located and isolated by disregarding all similarities below four standard deviations above the mean:
cut-off = mean + 4 × standard deviation
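
A minimal Python sketch of this cut-off follows. The helper names are hypothetical, the use of the population standard deviation is an assumption (the slides do not say which estimator was used), and the scores are made up so that the behaviour is easy to follow.

```python
from statistics import mean, pstdev


def noise_cutoff(scores, k=4.0):
    """Cut-off = mean + k standard deviations of all scores (k = 4 in the slides)."""
    return mean(scores) + k * pstdev(scores)


def remove_noise(scored_items, k=4.0):
    """Keep only the (item, score) pairs whose score lies above the cut-off."""
    cutoff = noise_cutoff([score for _, score in scored_items], k)
    return [(item, score) for item, score in scored_items if score > cutoff]


# Made-up data: 99 low "noise" scores plus one genuinely similar match.
noise = [(f"unrelated term {i}", [0.18, 0.20, 0.22][i % 3]) for i in range(99)]
scored = noise + [("Air/Oil Heat Exchanger", 0.95)]

print("cut-off:", noise_cutoff([score for _, score in scored]))
print("kept:", remove_noise(scored))
```

Because the cut-off is derived from the scores themselves, no fixed "top n" has to be chosen in advance. Note that the approach needs a reasonably large set of scores: with very few comparisons no single score can lie four standard deviations above the mean.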

Noise Detection & Removal
[Figure: a standard Gaussian (normal) distribution]

Shorter Terms
Although the dataset contained mostly long strings, the noise removal method remains effective for the shorter strings within it.
A shorter term constitutes a small, random match against the longer, more typical strings, so the longer strings are now randomly distributed!
The mean similarity tends to be lower and hence the cut-off similarity automatically reduces, so similar shorter strings now fall above the automatic cut-off.

Noise Detection & Removal
Advantages of this automatic method:
1. Scales with source data size
2. Avoids a fixed choice such as the top 10, which may include irrelevant results or exclude relevant ones!
3. Can be used to pick out strings that are more similar than, or stand out from, the rest of the strings

Results
The 298 extracted terms were compared against each of the 513 official terms.
After noise was automatically removed, in some cases more than one relevant result was suggested; in these cases the first n results were considered, as follows:
[Table: n, Recall at n, Precision at n; surviving percentage values: 88.83%, 89.04%, 91.56%, 92.08%; remaining values missing]
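
The slides do not define "recall at n" and "precision at n", so the Python sketch below uses the conventional information-retrieval definitions as an assumption: precision is the fraction of the first n results that are relevant, and recall is the fraction of all relevant items found in the first n results. The function name and the toy data are hypothetical.

```python
def precision_recall_at_n(ranked_results, relevant, n):
    """Return (precision at n, recall at n) for one ranked result list.

    Assumption: conventional IR definitions, not confirmed by the slides.
    """
    top = ranked_results[:n]
    hits = sum(1 for result in top if result in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: results that remain after noise removal, best match first.
ranked_results = ["a", "b", "c", "d"]
relevant = {"a", "c"}

for n in (1, 2, 3):
    print(n, precision_recall_at_n(ranked_results, relevant, n))
```

Averaging these per-query values over all queries gives one precision and one recall figure for each n.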

Example – List Reduction
List of unique UK locations
1. Query checked against list
2. Noise removed
[Table: Query, Automatic Cut Off, # of Results above cut off; “Bradford” (0.745%), “Huddersfield” (1.005%), “Chipping Norton” (0.054%); cut-off values and result counts missing]

Conclusions
Dissimilar string matches can be modelled as a noise pattern.
The noise pattern can be removed!
Methodology is applicable to any set of strings, not only Aerospace domain terms!
Method is scalable.
Can be used to automatically remove obviously incorrect matches.
Provides users with fewer options: faster selection!
Can be used to extract strings that are more similar than, or stand out from, the rest.

Future Work
Integrate approach into many apps (form filling)
Improved similarity metrics
Domain specific datasets (Aerospace)
Stop words, mutually exclusive words
Combine metrics to break ties

Thank you

Refs
Butters, Jonathan (2007). A Terminology Recognizer for the Aerospace Domain. Master's Thesis, The University of Sheffield.
Harris, Z. (1968). Mathematical Structures of Language. John Wiley & Sons, New York.
Chapman, Sam. SimMetrics.