TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases Tony Rees CSIRO Marine and Atmospheric Research,

Presentation transcript:

TAXAMATCH, a “fuzzy” matching algorithm for taxon names, and potential applications in taxonomic databases Tony Rees CSIRO Marine and Atmospheric Research, Australia TDWG 2008 Annual Conference – Perth, October 2008

Tony Rees, CSIRO: TAXAMATCH and fuzzy matching applications for taxon names

The problem
A given taxon name can exist in multiple variants (legitimate and/or misspelled), for example (from the uBio site): [list of variants shown on slide] (etc., etc.)

The problem (other parts)
- Genus discrepancies…
- Authority discrepancies… same?
…need to consider potential errors in species epithet alone, genus alone, or both (and also authority similarity).


Error types (simple classification for this study) - all real examples

Type 1: single-character error (in genus or species epithet alone)
  Type 1a: extra / missing / different character (except at word start)
    flaveolata / faveolata (extra character)
    antactica / antarctica (missing character)
    tricarinatus / tricarinatum (different character)
  Type 1b: transposed character (except at word start)
    Acropaginula / Arcopaginula
    abrohlensis / abrolhensis
  Type 1c: error at word start
    Meosarmatium / Neosarmatium
    janthina / ianthina
Type 2: 2-character error (in genus or species epithet alone), excluding 2-char transpositions
    carchias / carcharias
    triangulatum / triangulum
Type 3: multi-character error (in genus or species epithet alone), plus 2-char transpositions
    capricornicus / capricornensis
    serrulatus / serratulus (2-char transposition)
Type 4: error in both genus and species epithet
    Soleniscus stolonifera / Soleneiscus stolonifer
    Eogynodiastylus aganaktilos / Eogynodastylis aganaktikos
(NB: each type potentially includes both phonetic and non-phonetic errors.)

- Types 3 and 4 are the rarest (5% or less), but arguably as important to detect as the others (if not more so).
- Phonetic errors are rapid to detect, but typically comprise only 40-50% of all errors, so an edit-distance-type approach is needed as well (slow!).
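To make the Type 1 and Type 2 counts concrete, here is a minimal Python sketch of a standard Damerau-Levenshtein style distance (the "optimal string alignment" variant), which counts an adjacent transposition as a single edit. It is an illustration only: it is not the modified MDLD test used in TAXAMATCH, and the function name `osa_distance` is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions,
    substitutions and adjacent transpositions each cost 1.
    (Illustrative stand-in, not the talk's MDLD algorithm.)"""
    prev2 = None                       # row i-2, needed for transpositions
    prev = list(range(len(b) + 1))     # row i-1 (distance from empty prefix)
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1,          # delete ca
                         cur[j - 1] + 1,       # insert cb
                         prev[j - 1] + cost)   # match / substitute
            # adjacent transposition, e.g. "cr" <-> "rc"
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


# Name pairs taken from the error-type slide.
for wrong, right in [("flaveolata", "faveolata"),      # Type 1a
                     ("Acropaginula", "Arcopaginula"),  # Type 1b
                     ("Meosarmatium", "Neosarmatium"),  # Type 1c
                     ("carchias", "carcharias")]:       # Type 2
    print(wrong, right, osa_distance(wrong, right))
```

The Type 1 pairs come out at distance 1 and the Type 2 pair at distance 2. A plain distance like this does not treat a multi-character transposition such as serrulatus / serratulus as a single event, which is one reason a modified test is of interest.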

The perfect algorithm…
- Maximum recall (find all “true” target near matches) and high precision (few false hits)
- Traps both phonetic and non-phonetic errors
- Executes in (e.g.) <2 sec. (average) per input name in real-world use (e.g. web interface against 1.4m target names), faster for deduplication runs

Available off-the-shelf methods are inadequate in recall, precision, or efficiency (e.g. edit distance tests are typically slow if all names are tested, and return large numbers of false hits as the threshold is widened to catch “all” hits).

Result of this work: a hybrid approach developed over …, termed “TAXAMATCH”, based on 2 custom comparison methods:
- “Rees near match 2007” phonetic algorithm
- “Modified Damerau-Levenshtein Distance” [MDLD] test (Boehmer & Rees in press, 2008)
…plus rule-based filtering, in a cascading model (i.e. test the genus portion first, then the species as a second / contingent step).
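The cascading idea (genus first, species only for genera that already look close) can be sketched roughly as follows. This is a simplified illustration, not the TAXAMATCH code: it uses plain Levenshtein distance in place of the phonetic and MDLD tests, the thresholds `max_genus` and `max_species` are arbitrary assumptions, and the tiny reference dictionary holds only two names taken from the error-type slide.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in for the talk's custom tests)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def cascade_match(genus, species, reference, max_genus=2, max_species=2):
    """Test the genus first; compare species epithets only inside
    genera that passed the genus test (thresholds are hypothetical)."""
    hits = []
    for ref_genus, epithets in reference.items():
        g_dist = levenshtein(genus.lower(), ref_genus.lower())
        if g_dist > max_genus:
            continue  # genus too different: its species are never tested
        for epithet in epithets:
            s_dist = levenshtein(species.lower(), epithet.lower())
            if s_dist <= max_species:
                hits.append((g_dist + s_dist, f"{ref_genus} {epithet}"))
    return [name for _, name in sorted(hits)]


# Toy reference data: genus -> list of species epithets.
REFERENCE = {
    "Soleneiscus": ["stolonifer"],
    "Eogynodastylis": ["aganaktikos"],
}

print(cascade_match("Soleniscus", "stolonifera", REFERENCE))
print(cascade_match("Eogynodiastylus", "aganaktilos", REFERENCE))
```

Because species are only compared within genera that survive the first step, most of the species table is never touched for a given input name.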

Key components used in this approach
- Pre-filtering (a.k.a. “blocking”): avoid testing all names (e.g. test ~2% of genera, 0.02% of species), to avoid long processing times
- Testing: a custom edit distance-based test pulls in some of the more complex matches; the phonetic algorithm traps others
- Post-filtering: use heuristic rules to improve precision (discriminate “true” from “false” matches of equal similarity)
- Result shaping (dynamic filter): look for more distant hits only if no close ones are detected (can be disabled if needed for a more complete result set, but with an increase in false hits)
- Authority similarity measure: can be useful in distinguishing between homonyms, or near homonyms of the same numeric similarity
… plus initial pre-processing (parsing and normalization): split into the correct name elements, remove bad characters and other qualifiers (cf., aff., etc.), and more.
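Three of these components are simple enough to sketch in Python. Everything below is an assumption made for illustration: the slides do not give the actual normalization rules, pre-filter criteria, or result-shaping logic, so the qualifier list, the length-difference pre-filter, and the "keep only the closest tier" rule are stand-ins, and all function names are hypothetical.

```python
import re

# Qualifiers named on the slide; the real list is presumably longer.
QUALIFIERS = {"cf.", "cf", "aff.", "aff"}


def normalize(name: str) -> str:
    """Drop qualifiers and non-letter characters, then lowercase."""
    tokens = [t for t in name.split() if t.lower() not in QUALIFIERS]
    cleaned = [re.sub(r"[^A-Za-z-]", "", t) for t in tokens]
    return " ".join(t for t in cleaned if t).lower()


def length_prefilter(query: str, candidates, max_diff: int = 2):
    """Cheap blocking step: two strings whose lengths differ by more
    than max_diff cannot be within edit distance max_diff, so they
    can be skipped without running the expensive test."""
    return [c for c in candidates if abs(len(c) - len(query)) <= max_diff]


def shape_results(scored, enabled: bool = True):
    """scored is a list of (distance, name) pairs. With shaping on,
    return only the closest tier; with it off, return every hit."""
    if not scored:
        return []
    ranked = sorted(scored)
    if not enabled:
        return [name for _, name in ranked]
    best = ranked[0][0]
    return [name for dist, name in ranked if dist == best]


print(normalize("Soleneiscus cf. stolonifer?"))
print(length_prefilter("carchias", ["carcharias", "capricornensis", "ianthina"]))
print(shape_results([(2, "b"), (1, "a"), (1, "c")]))
```

The length check is safe in the sense that it never discards a candidate the edit distance test would have accepted at the same threshold; the much tighter figures quoted on the slide (~2% of genera tested) imply additional criteria that are not described here.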

TAXAMATCH block diagram (developer’s view)
[Diagram. Labelled elements, in approximate flow order:]
- Input genus + species (+ auth.) -> normalized input genus, normalized input species, normalized input authority
- Available genus names -> (genus pre-filter) -> genus names tested -> (genus test) -> (genus post-filter) -> genus near matches
- Available species -> (species pre-filter) -> species tested -> (species test) -> species near matches -> (species post-filter)
- Species authorities -> (auth. comparator)
- (ranking + result shaping) -> species near matches displayed
- Reference data: available genus + species names (+ auth’s)

TAXAMATCH block diagram (user’s / deployer’s view)
[Same diagram, with the internals covered over:] Input name -> “magic stuff” -> what you actually wanted

Does it work?
- Testbed is the author’s “IRMNG” database, mainly for genera, but it also holds 1.45m species names from a range of (generally) “reliable” sources.
- Web access point (TAXAMATCH-enabled) is at: [link not preserved]

Sample TAXAMATCH performance (via IRMNG web interface)
Type 1a error (= 1-character mismatch) [screenshot]
(NB: initial access can be slow while data loads into memory; subsequent accesses are fast.)


Sample TAXAMATCH performance (via IRMNG web interface)
Type 2 error (= 2-character mismatch) [screenshot]


Sample TAXAMATCH performance (via IRMNG web interface)
Type 3 error (= 3+ character mismatch) [screenshot]


Sample TAXAMATCH performance (via IRMNG web interface)
Type 4 error (= error in both genus and species) [screenshot]


Indicative performance…
- Finds 99.7% of known errors in “normal” mode, 100% with result shaping disabled (where multiple near matches exist)
- False hits <20% of total, <5% with result shaping on (for genuine misspellings)
(These figures are for binomens; values for genera alone are considerably higher, as genus-level results are only lightly filtered in the present configuration.)

cf…
- True phonetic algorithms: <40% of known errors detected
- Soundex (sloppy phonetic algorithm): more true hits found, but many more false ones too; performs worst with complex and/or non-phonetic errors
- Off-the-shelf Levenshtein Distance, n-gram tests: tradeoff between recall and precision (high recall -> low precision and vice versa)
- Google API: 50% of true hits at best, no concept of taxonomic names / dependencies, no control over the reference database consulted (or term frequency therein)

Use as a “taxonomic spell checker”??
- Need to deploy over an “authoritative, complete” reference database, ideally covering all groups / habitats / extant taxa + fossils
- Currently using the IRMNG database (= Cat. of Life + more); could deploy over other DBs as desired
- Potential to offer the result as a web service if a suitable interchange format is designed
(Need to be aware, however, that there will always be taxa not in the reference database, unless this is locally or thematically complete.)

Range of use cases…
- Misspelled user web input
    548 ways to spell “Britney Spears”
- Query expansion for distributed queries (potential variants and misspellings in provider DBs) – already a fact of life for GBIF, OBIS, etc.
- Review before data aggregation / ingestion
    assign data held under misspelled names to the desired “correct” home (avoid creating near-duplicate rows, e.g. with relevant content split / replicated)
- Review and deduplication of names after data aggregation
    a.k.a. “merge-purge” (common in other domains, e.g. customer databases, business names + street addresses, etc.)
    Another parallel is “record linkage” in the medical domain: find all records of 1 patient through time (names, addresses, date of birth, social security numbers can be variously represented, and some can change as well)
…Deduplication example shown with the IRMNG database (species table, 1.4m names)…
(NB: an extra clause in the genus pre-filter reduces processing time from ~400 to ~100 hrs.)
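A deduplication run compares names within one list against each other, so a cheap check before the expensive test matters a great deal. The Python sketch below is a toy illustration under stated assumptions: it uses plain Levenshtein distance rather than the TAXAMATCH tests, the length-difference check stands in for the real (undescribed) pre-filter clauses, and `candidate_pairs` and `max_dist` are hypothetical names.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in for the talk's custom tests)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def candidate_pairs(names, max_dist: int = 2):
    """Return pairs of distinct names within max_dist edits of each other.
    The length check skips the expensive distance computation for pairs
    that cannot possibly qualify; pair enumeration is still quadratic."""
    unique = sorted(set(names))
    pairs = []
    for a, b in combinations(unique, 2):
        if abs(len(a) - len(b)) > max_dist:
            continue
        if levenshtein(a, b) <= max_dist:
            pairs.append((a, b))
    return pairs


# Epithets taken from the error-type slide.
names = ["abrohlensis", "abrolhensis", "carchias", "carcharias", "capricornicus"]
print(candidate_pairs(names))

# Scale of the unfiltered problem for a 1.4m-name table:
n = 1_400_000
print(n * (n - 1) // 2)  # number of distinct name pairs
```

For 1.4m names the unfiltered pair count is on the order of 10^12, which is why the pre-filter clauses have such a large effect on run time. Any flagged pair is only a candidate and still needs review.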



Real-world deduplication example
[Screenshot of candidate name pairs, annotated: true, false, ?, false]
NB: candidate name pairs do not always sort together (e.g. when a genus error or a leading-character error is involved).

Summary
- Fuzzy matching for taxonomic databases needs to cope satisfactorily with errors of a range of complexity
- Phonetic errors comprise only about half of all errors encountered
- Cannot presume that the initial letter is always correct, or that there will not be errors in both genus and species epithet
- Need to assess algorithm performance on recall (are all “true” near matches retrieved?), precision (minimize false hits), and efficiency (time taken to test any one name), against multiple error types
- TAXAMATCH seems to be the best solution developed to date, although speed is a potential area for further improvement (e.g. ~100 hours (+) to deduplicate very large existing systems)
- Manual review of offered suggestions is still required (not all false hits are eliminated, although most are)
- Use as a “spell checker” is a promising option, contingent on the availability of adequate reference database/s.

TAXAMATCH on test (versus 8 other algorithms)
[Chart] effectiveness = harmonic mean of recall and precision, on a 0-1 scale
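The effectiveness measure is the harmonic mean of recall and precision (often called the F1 score). A minimal Python sketch follows; the function name is made up, and the sample inputs are only illustrative readings of the "Indicative performance" slide (99.7% recall, and precision of 0.95 taken from the "<5% false hits" bound), not the values plotted in the chart.

```python
def effectiveness(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision, on a 0-1 scale."""
    if recall + precision == 0:
        return 0.0  # avoid division by zero when nothing is found
    return 2 * recall * precision / (recall + precision)


# Illustrative inputs only (see lead-in): recall 0.997, precision 0.95.
print(round(effectiveness(0.997, 0.95), 3))

# The harmonic mean penalizes imbalance: perfect precision with
# 50% recall scores well below the arithmetic mean of 0.75.
print(round(effectiveness(0.5, 1.0), 3))
```

Because the harmonic mean is dominated by the smaller of the two values, an algorithm cannot score well by maximizing recall at the expense of precision, or the reverse.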

Thank you

Tony Rees, Manager, Divisional Data Centre
CSIRO Marine and Atmospheric Research, Hobart, Tasmania, Australia