eHS AI component roadmap: Step I: prototype with fuzzy matching

Presentation transcript:

eHS AI component roadmap: Step I: prototype with fuzzy matching
Maria BIRYUKOV
University of Luxembourg

Premises
The fuzzy matching step is intended to help the user find controlled vocabulary (CV) terms that correspond to his/her own terminology.
Input: a user-provided term (a one-word or multiword expression).
Output: the N best-matching terms from the standard vocabulary, along with their unique identifiers. N can be specified by the user.

Overall view and Timeline
AI mechanisms for standardizing vocabulary based on eMDR
User input [unmapped variable name or value]
Fuzzy string matching, contextual disambiguation, semantic inference
eMDR [synonyms mapping dictionary]
User output [ranked list of candidate controlled terms]
Timeline:
15:10:2015 Prototype with fuzzy matching
15:12:2015 Contextual disambiguation
31:01:2016 Semantic inference

Achievements
Timeline:
15:10:2015 Prototype with fuzzy matching: DONE
15:12:2015 Contextual disambiguation
31:01:2016 Semantic inference

Resources
Thematic dictionaries, ontologies, databases. In the absence of eMDR, the Entrez Gene database is used as a resource to test and demonstrate the functionality of the prototype.
A locally created and regularly updated Question/Answers Data Base (QADB), which stores user queries along with the answers that users selected as appropriate. It will eventually be stored in eMDR, once eMDR is ready.

Procedure
The user is prompted to:
1. Enter his/her query
2. Specify the method he/she would like to use to find corresponding standard terms
3. Specify the maximum number of candidates to return (matches to show)
4. Specify the similarity threshold
Step 1 is obligatory. Steps 2-4 are optional; if they are not provided, default parameters are used.
Steps 2 to 4 constitute the "regular search".

Procedure
Once the user has typed in the query (Q), a QADB lookup is performed.
If Q is in QADB:
The answers to Q are displayed, from most to least popular.
The user is prompted to select the answer that corresponds to his/her intention, if there is one.
If Q is answered, the user may either enter a new query or quit.
If the displayed answers do not satisfy the user, he/she may either proceed to the regular DB search (go to step 2, see previous slide) or quit.
If Q is not in QADB, the procedure continues from step 2 (see previous slide).
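The lookup-then-fallback flow described above can be sketched roughly as follows. This is only an illustration, not the prototype's code: the in-memory dictionary standing in for QADB, the function name `lookup_query` and the `regular_search` callback are all assumptions. The interactive part (the user rejecting the QADB answers and asking for the regular search anyway) is left out.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for QADB: query -> {term identifier: votes}.
QADB = Dict[str, Dict[str, int]]


def lookup_query(
    query: str,
    qadb: QADB,
    regular_search: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Return where the candidates came from and the candidates themselves."""
    answers = qadb.get(query)
    if answers:
        # Q is in QADB: show stored answers, most popular (most votes) first.
        ranked = sorted(answers, key=lambda term_id: answers[term_id], reverse=True)
        return "qadb", ranked
    # Q is not in QADB: continue directly with the regular search (step 2).
    return "regular", regular_search(query)
```

Returning the source label alongside the candidates lets a caller decide whether to offer the regular search as a second option.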

Local Resource Maintenance
All queries are stored, together with their frequency, in the "All-Queries Data Base" (AQDB).
Queries for which no answer was found in the resources can be processed off-line and used for resource enrichment.
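A minimal sketch of such a query log, assuming an in-memory structure: the class name `AllQueriesDB`, its attributes, and the rule that a later successful answer removes a query from the unanswered set are illustrative choices, not details given in the slides.

```python
from collections import Counter
from typing import List, Set


class AllQueriesDB:
    """Hypothetical in-memory AQDB: counts every query and tracks unanswered ones."""

    def __init__(self) -> None:
        self.frequency: Counter = Counter()
        self.unanswered: Set[str] = set()

    def record(self, query: str, answered: bool) -> None:
        # Every query is counted, whether or not an answer was found.
        self.frequency[query] += 1
        if answered:
            self.unanswered.discard(query)
        else:
            self.unanswered.add(query)

    def enrichment_candidates(self) -> List[str]:
        # Unanswered queries, most frequent first, for off-line processing.
        return sorted(self.unanswered, key=lambda q: self.frequency[q], reverse=True)
```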

Fuzzy String Matching Methods
Three methods for fuzzy string matching are implemented:
(1) 'Gestalt pattern matching'
(2) N-gram-based cosine similarity
(3) Word-based cosine similarity
All three are appropriate for fuzzy string matching and often produce similar results. However, (1) and (2) handle spelling mistakes better, while (3) is more robust to word-order changes and word omission.
We will test the methods with real data and, depending on the results, keep, remove or add methods.
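The three measures can be sketched in a few lines of Python. This is not the prototype's code: it assumes that Python's `difflib.SequenceMatcher` (which the standard library documents as a variant of gestalt pattern matching) is an acceptable stand-in for method (1), that method (2) uses character trigrams, and that lower-casing is the only normalization. All function names are illustrative.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    # (1) Gestalt pattern matching via the standard library.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def _cosine(u: Counter, v: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values()) * sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def char_ngrams(text: str, n: int = 3) -> Counter:
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    # (2) Cosine similarity over character n-gram counts (trigrams assumed).
    return _cosine(char_ngrams(a, n), char_ngrams(b, n))


def word_cosine(a: str, b: str) -> float:
    # (3) Cosine similarity over whitespace-separated word counts.
    return _cosine(Counter(a.lower().split()), Counter(b.lower().split()))
```

With these definitions, a misspelling such as "recepter" still shares trigrams with "receptor", so (2) gives a non-zero score where (3) gives zero. Reordering the words of a query leaves (3) unchanged but lowers (1).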

Example
For illustration purposes, let's assume that the user's queries are protein names that he/she would like to map to standard Entrez Gene names and identifiers.
Query 1: 'steroid hormone receptor'.

Search and Results
The DB was already searched for this term earlier.
The candidate answers are proposed in order of their "popularity": the highest number of votes first.
"Other aliases/designations" are alternative spellings of the standard name.
"Organism" illustrates the ambiguity.
The user may select the best option, or 'Nothing' if none suits his/her needs.
In this example, the user opts for answer number 2.
The user's choice is recorded and the QA database is updated.

QA Database Update
The user's choice is taken into account immediately, as shown by the order of the candidate answers (compare with the previous slide).
If the user does not like any of the proposed answers, his/her query can be processed against the whole system database, i.e. the 'regular search'.
Note: when the user's query is not found in the QADB, the procedure follows the "regular search" path directly.

Searching Process
Regular search elements: query, string comparison method, max number of suggestions to display, similarity threshold.
The user may select one, two or three methods.
A string similarity threshold of 0.00 means that a threshold value will be applied internally, depending on the method.

Results
Methods are applied one after another.
String similarity score
'Show more' option
Answers are displayed from the highest to the lowest string similarity score.
The N top-ranked items are shown, where N = 'max matches to display' (see previous slide).
If the 'show more' option is chosen, the answers are displayed in batches.
The 'best answer' can be selected from any or all batches.
This is not contradictory, as the query accumulates 'votes'.
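The ranking and batching behaviour can be sketched as below. Several details are assumptions made for illustration: the scoring function defaults to `difflib`'s ratio, a threshold of 0.0 is replaced by a placeholder default (the slides do not give the internal per-method values), and the names `rank` and `batches` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterator, List, Tuple

DEFAULT_THRESHOLD = 0.5  # placeholder; the real internal defaults are not given


def gestalt(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank(
    query: str,
    terms: List[str],
    threshold: float = 0.0,
    score: Callable[[str, str], float] = gestalt,
) -> List[Tuple[str, float]]:
    """Score every term, drop those below the threshold, sort best first."""
    if threshold == 0.0:
        # 0.00 means "let the system choose a threshold for this method".
        threshold = DEFAULT_THRESHOLD
    scored = [(term, score(query, term)) for term in terms]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def batches(
    ranked: List[Tuple[str, float]], max_matches: int
) -> Iterator[List[Tuple[str, float]]]:
    """Yield the ranked answers in chunks of 'max matches to display'."""
    for start in range(0, len(ranked), max_matches):
        yield ranked[start:start + max_matches]
```

The first batch corresponds to the N top-ranked items, and each further batch to one use of the 'show more' option.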

End of the session
Continue with the same query
2nd method from the user's method selection

Local resources after user session(s)
Fragment of the "All-Queries Data Base" (AQDB)
Query | Query frequency
Queries are systematically stored and their overall counter is maintained. This allows for:
Local resource enrichment
Grouping of queries by projects

Local resources after user session(s)
Fragment of the "Question/Answers Data Base" (QADB)
Query | Term ID : Votes
QADB stores queries and the answers that users have selected as most suitable.
A "query" is a one-word or multi-word expression; an "answer" is the standard term name and its unique identifier.
Because of (potential) ambiguity, several standard IDs may correspond to the same query.
Votes = how many times users have selected a specific ID for the term. Votes are accumulated across all sessions.
In the example above, "steroid hormone receptor" was most often selected as "ESRRA" of human (5 times), and equally often as "Esrra" of Norway rat and "esrra" of zebrafish.
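One possible shape for this vote bookkeeping, as a sketch only: the class `QuestionAnswersDB`, its method names and the in-memory storage are assumptions, and term identifiers are treated as opaque strings.

```python
from collections import Counter
from typing import Dict, List, Tuple


class QuestionAnswersDB:
    """Hypothetical in-memory QADB: query -> votes per standard term ID."""

    def __init__(self) -> None:
        self._votes: Dict[str, Counter] = {}

    def record_choice(self, query: str, term_id: str) -> None:
        # One more user selected this term ID as the answer to the query.
        self._votes.setdefault(query, Counter())[term_id] += 1

    def suggestions(self, query: str) -> List[Tuple[str, int]]:
        # Stored answers with their votes, most-voted first; empty if unseen.
        return self._votes.get(query, Counter()).most_common()
```

Because votes are updated on every selection, the next lookup of the same query already reflects the new order.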

Next session: already seen query
If this term is searched again, the QADB suggestions will be displayed in the following order:

Challenges
Real data and eMDR are needed in order to test the implemented fuzzy string matching step.
Adjust the step according to the results and with respect to the data.
Implement the next two steps; these also need real data and eMDR.
The command-line demo will later be provided as an API, once the form of the API is agreed with the other WPs.

Thanks
To Reinhard Schneider, Wei Gu, Venkata Satagopam @uni.lu for fruitful discussions and advice
To Fabien Richard @eismb, Chris Marshall @biosciconsulting.com for valuable comments
Thank you for your attention