Indexing Methods for Faster and More Effective Person Name Search Mark Arehart MITRE Corporation


Goals
Not about NER per se. Assume NER is already done.
Make output useful to users:
– Searchable with approximate matching
– Not an offline process: fast response time
Balance search effectiveness and speed.

Context: DARPA TIGR system

Person Names in TIGR
Entered by soldiers in reports.
Users lack linguistic expertise.
Spelling/transliteration variation.
Data entry errors.
Generic text search provided by the IR system does not compensate.
Name index created by NER (Miller et al 10).

Approximate Name Matching
Research community:
– phonetic keys
– n-gram matching
– edit-based measures (with fixed, variable, or learned edit costs)
– frequency-based measures
– string-based and token-based
– Refs: Winkler 90, Zobel and Dart 95, Ristad and Yianilos 98, Bilenko and Mooney 03, Cohen et al 03, Christen 06.
Commercial systems (expensive)

Performance Problem
Fuzzy matching is slow.
2,000 comps/sec (0.5 ms per comparison) sounds fast, right?
Match query to every database name: query_time = size_db * avg_match_time
0.5 ms times a db size of 100,000 = 50 seconds per query. Not fast.

Solution Part 1
Make the comparison function faster.
Say you more than double the speed through code optimization.
0.18 ms * 100,000 records = 18 seconds.
Much better, but…
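
The arithmetic on the last two slides follows from the linear-scan cost model query_time = size_db * avg_match_time. A minimal sketch that reproduces both figures (the function name is illustrative, not from the talk):

```python
def brute_force_query_seconds(db_size: int, avg_match_ms: float) -> float:
    """Seconds needed to compare one query against every name in the database."""
    return db_size * avg_match_ms / 1000.0


# Figures quoted on the slides: 100,000 names, 0.5 ms and 0.18 ms per comparison.
baseline = brute_force_query_seconds(100_000, 0.5)
optimized = brute_force_query_seconds(100_000, 0.18)
print(f"baseline: {baseline:.0f} s per query, optimized: {optimized:.0f} s per query")
```

Because the cost grows linearly with database size, speeding up the comparison function only changes the constant; the number of comparisons per query has to drop as well.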

Solution Part 2
Pass 1: blocking
– developed in record linkage (see Winkler 06 for an overview)
– quick (dumb) retrieval of candidates
Pass 2: matching
– slow (smart) comparison function
Blocking function must:
– Retrieve a small subset of the db.
– Do so quickly.
– Include all the true matches.

Two-Pass Matching
Create a text index of database names.
Each name is indexed by one or more keys.
At query time, generate keys for the query name.
Retrieve candidates using direct key lookup.
Apply the comparison function to the candidates.
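
The five steps on this slide can be sketched as below. This is an illustration under assumptions, not the system described in the talk: `prefix_keys` is a hypothetical key function, `similarity` uses Python's `difflib` ratio as a stand-in for the real comparison function, and the 0.6 threshold is arbitrary.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def prefix_keys(name, length=4):
    """Hypothetical key function: the first `length` letters of each name part."""
    return {part[:length] for part in name.upper().split()}


def similarity(a, b):
    """Stand-in comparison function (difflib ratio), not the talk's matcher."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def build_index(names, key_fn=prefix_keys):
    """Index time: map every key to the set of names that produce it."""
    index = defaultdict(set)
    for name in names:
        for key in key_fn(name):
            index[key].add(name)
    return index


def search(query, index, key_fn=prefix_keys, compare=similarity, threshold=0.6):
    """Query time: pass 1 is direct key lookup, pass 2 scores only the candidates."""
    candidates = set()
    for key in key_fn(query):
        candidates |= index.get(key, set())
    scored = [(name, compare(query, name)) for name in candidates]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


names = ["Saddam Hussein Al Tikriti", "Saddam Husein",
         "Uday Hussein Al Tikriti", "Ahmed Hassan"]
index = build_index(names)
results = search("Saddam Hussein", index)
for name, score in results:
    print(f"{score:.2f}  {name}")
```

Only names that share at least one key with the query reach the comparison function, so "Ahmed Hassan" is never scored for this query.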

Ways to Make Keys
Original name = Saddam Hussein Al Tikriti
Exact → [SADDAM, HUSSEIN, (AL), TIKRITI]
Substring → [SADD, HUSS, (AL), TIKR]
Phonetic → [STM, HSN, (AL), TKRT]
Better not to index particles like AL, ABU, BIN.
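
A small sketch of the three key types, reproducing the keys shown for this name. The particle list comes from the slide; the 4-letter prefix length is read off the example; `toy_phonetic_key` is a hypothetical rule set (drop vowels, collapse doubled letters, map D to T) chosen only because it yields the slide's keys. It is not the custom phonetic key or Metaphone used in the talk.

```python
PARTICLES = {"AL", "ABU", "BIN"}  # particles named on the slide
VOWELS = set("AEIOU")


def name_parts(name, skip_particles=True):
    """Upper-case name parts, optionally without particles."""
    parts = name.upper().split()
    return [p for p in parts if not (skip_particles and p in PARTICLES)]


def exact_keys(name):
    """Exact keys: each name part as-is."""
    return name_parts(name)


def substring_keys(name, length=4):
    """Substring keys: a fixed-length prefix of each name part."""
    return [part[:length] for part in name_parts(name)]


def toy_phonetic_key(part):
    """Hypothetical phonetic key: consonant skeleton with D folded into T."""
    out = []
    prev = ""
    for ch in part:
        if ch == prev:          # collapse doubled letters (DD, SS)
            continue
        prev = ch
        if not ch.isalpha() or ch in VOWELS:
            continue
        out.append("T" if ch == "D" else ch)
    return "".join(out)


def phonetic_keys(name):
    return [toy_phonetic_key(part) for part in name_parts(name)]


name = "Saddam Hussein Al Tikriti"
print(exact_keys(name))      # ['SADDAM', 'HUSSEIN', 'TIKRITI']
print(substring_keys(name))  # ['SADD', 'HUSS', 'TIKR']
print(phonetic_keys(name))   # ['STM', 'HSN', 'TKRT']
```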

Key-based Index
STM → [Saddam Hussein Al Tikriti, Saddam Husein, …]
HSN → [Saddam Hussein Al Tikriti, Hosein Mohamed, Ahmed Hassan, …]
TKRT → [Saddam Hussein Al Tikriti, Uday Hussein Al Tikriti, …]

Retrieval Using Keys
Generate keys from the query name.
– Refinement: don’t index particles (using a stoplist).
Return names associated with each key.
– Refinement: for longer names, require more than one key match.
Do fuzzy matching on the retrieved candidates.
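
The two refinements on this slide can be sketched as follows. The stoplist entries come from the earlier slide; the prefix keys and the specific thresholds (a name counts as "longer" at three or more keys and must then match on at least two) are assumptions made for illustration, since the slides do not state them.

```python
from collections import Counter, defaultdict

STOPLIST = {"AL", "ABU", "BIN"}  # particles that are not indexed


def keys(name, length=4):
    """Prefix keys for every name part that is not on the stoplist."""
    return {part[:length] for part in name.upper().split() if part not in STOPLIST}


def build_index(names):
    """Map each key to the set of names that produce it."""
    index = defaultdict(set)
    for name in names:
        for key in keys(name):
            index[key].add(name)
    return index


def retrieve(query, index, long_name_keys=3, min_hits_long=2):
    """Return candidate names; longer queries must match on several keys."""
    query_keys = keys(query)
    required = min_hits_long if len(query_keys) >= long_name_keys else 1
    hits = Counter()
    for key in query_keys:
        for name in index.get(key, ()):
            hits[name] += 1
    return {name for name, count in hits.items() if count >= required}


db = ["Saddam Hussein Al Tikriti", "Uday Hussein Al Tikriti",
      "Hosein Mohamed", "Ahmed Hassan", "Saddam Husein"]
index = build_index(db)
long_query = retrieve("Saddam Hussein Al Tikriti", index)
short_query = retrieve("Hussein", index)
print(sorted(long_query))
print(sorted(short_query))
```

Requiring several key matches for long names shrinks the candidate set further, at the risk of dropping a true match that shares only one key with the query.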

Evaluation
Existing datasets not appropriate.
– String matching research: too small or not the right kinds of variations (Pfeifer 95, Zobel and Dart 95, Cohen et al 03, Bilenko and Mooney 03)
– Record linkage: multiple data fields (Winkler 06)
Our test set (previously developed): approx 700 queries run against 70,000 names.
– Test data is noisy and multicultural.
– Contains many kinds of Arabic name variants.
Runs evaluated for accuracy and speed.
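
For reference, accuracy in this kind of evaluation is usually reported as precision (p), recall (r), and F-measure (f). A minimal sketch of the standard per-query definitions follows; the slides do not say how scores were aggregated over the query set, so no averaging scheme is assumed here.

```python
def precision_recall_f(returned, relevant):
    """Standard set-based precision, recall, and balanced F-measure for one query."""
    true_positives = len(returned & relevant)
    p = true_positives / len(returned) if returned else 0.0
    r = true_positives / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


# Hypothetical query: 4 names returned, 3 true matches in the database, 2 found.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(f"p={p:.3f} r={r:.3f} f={f:.3f}")
```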

Matching Functions
JaroWinkler: generic string matching baseline
Level 2 JaroWinkler: tokenized
Romarabic: custom algorithm (Freeman 06)
– dictionary of common variants
– name part similarity backs off to edit distance
– aware of multi-segment name parts
– finds optimal alignment
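
To make the two baselines concrete, here is a sketch of the textbook Jaro-Winkler similarity and a tokenized variant. The slide only says that Level 2 is "tokenized"; the version below assumes one common formulation (for each token of the first name, take its best Jaro-Winkler score against any token of the second, then average), so it may differ from the implementation that was evaluated. Romarabic is not sketched because the slides give no algorithmic detail.

```python
def jaro(s, t):
    """Jaro similarity between two strings, in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_seq = [c for c, hit in zip(s, s_hit) if hit]
    t_seq = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, scaling=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to `max_prefix` characters."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:max_prefix], t[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1.0 - base)


def level2_jaro_winkler(name_a, name_b):
    """Assumed tokenized variant: average best-token score (asymmetric)."""
    tokens_a = name_a.upper().split()
    tokens_b = name_b.upper().split()
    if not tokens_a or not tokens_b:
        return 0.0
    best = [max(jaro_winkler(a, b) for b in tokens_b) for a in tokens_a]
    return sum(best) / len(best)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(level2_jaro_winkler("Hussein Saddam", "Saddam Hussein"))
```

The tokenized variant is insensitive to the order of name parts, which plain Jaro-Winkler over the whole string is not.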

JaroWinkler
Columns: Indexing, Stopwords, ms per query, p, r, f
Rows: None (stopwords n/a); Substring (no, yes); Custom phon (no, yes); Exact (no, yes); Metaphone (no, yes)

Level 2 JaroWinkler
Columns: Indexing, Stopwords, ms per query, p, r, f
Rows: None (stopwords n/a); Substring (no, yes); Custom phon (no, yes); Exact (no, yes); Metaphone (no, yes)

Romarabic
Columns: Indexing, Stopwords, ms per query, p, r, f
Rows: None (stopwords n/a; ms per query 13,); Substring (no, yes); Custom phon (no, yes); Exact (no, yes); Metaphone (no, yes)

Conclusion
For NER to be useful, system performance must be considered.
– Most accurate matcher may be impractical.
Multiple-pass algorithm
– Speed/accuracy not a tradeoff here.
Very simple methods are often the best.
– Custom phonetic key did worse than prefix.
Important to use a large and realistic test set.