eHS AI component roadmap: Step I: prototype with fuzzy matching

Maria BIRYUKOV University of Luxembourg

Premises
The fuzzy matching step is intended to help users find controlled vocabulary (CV) terms that correspond to their own terminology.
Input: a user-provided term (a single-word or multiword expression).
Output: the N best-matching terms from the standard vocabulary, along with their unique identifiers. N can be specified by the user.

Overall view and Timeline
AI mechanisms for standardizing vocabulary based on eMDR
User input: unmapped variable name or value
Processing steps: fuzzy string matching, contextual disambiguation, semantic inference
Resource: eMDR (synonyms mapping dictionary)
User output: ranked list of candidate controlled terms
Timeline: 15:10:2015 prototype with fuzzy matching; 15:12:2015 contextual disambiguation; 31:01:2016 semantic inference

Achievements
15:10:2015 Prototype with fuzzy matching: DONE
15:12:2015 Contextual disambiguation
31:01:2016 Semantic inference

Resources
Thematic dictionaries, ontologies, databases
In the absence of eMDR, the Entrez Gene database is used as a resource to test and demonstrate the functionality of the prototype.
A locally created and regularly updated Question/Answers Data Base (QADB) stores user queries along with the answers that users selected as appropriate. Its content will eventually be stored in eMDR, once eMDR is ready.

Procedure
The user is prompted to:
1. Enter a query
2. Specify the method(s) to use for finding the corresponding standard terms
3. Specify the maximum number of candidates to return (matches to show)
4. Specify the similarity threshold
Step 1 is obligatory. Steps 2-4 are optional; if they are not provided, default parameters are used.
Steps 2-4 together constitute the “regular search”.

Procedure
Once the user has typed in the query (Q), a QADB lookup is performed.
If Q is in QADB:
Answers to Q are displayed, from most to least popular.
The user is prompted to select the answer that corresponds to his/her intention, if there is one.
If Q is answered, the user may either enter a new query or quit.
If the displayed answers do not satisfy the user, he/she may either proceed to the regular DB search (go to step 2, see previous slide) or quit.
If Q is not in QADB, the procedure continues from step 2 (see previous slide).
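The lookup-then-fallback flow above can be sketched as follows. This is a minimal illustration, not the prototype's code: the in-memory dictionary standing in for QADB, the function name `lookup_or_search`, and the `regular_search` callback are all assumptions, and the term IDs and vote counts are placeholders.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory stand-in for QADB: query -> {term ID: votes}.
Qadb = Dict[str, Dict[str, int]]


def lookup_or_search(
    query: str,
    qadb: Qadb,
    regular_search: Callable[[str], List[str]],
) -> Tuple[str, List[str]]:
    """Return ("qadb", answers by popularity) or ("regular", search results)."""
    votes = qadb.get(query)
    if votes:
        # Q is in QADB: show the stored answers, most popular first.
        return "qadb", sorted(votes, key=votes.get, reverse=True)
    # Q is not in QADB: continue with the regular search (step 2 onwards).
    return "regular", regular_search(query)


# Illustrative data only; the term IDs and vote counts are placeholders.
qadb = {"steroid hormone receptor": {"ID_A": 1, "ID_B": 3}}
source, answers = lookup_or_search("steroid hormone receptor", qadb, lambda q: [])
print(source, answers)
```

The interactive part (selecting an answer, quitting, or asking for a regular search when no stored answer fits) is left out of the sketch; it would sit in the caller.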

Local Resource Maintenance
All queries are stored, together with their frequency, in the “All-Queries Data Base” (AQDB).
Queries for which no answer was found in the resources can be processed off-line and used to enrich the resources.
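A minimal sketch of this bookkeeping, assuming a simple in-memory structure: the class `AllQueriesLog` and its method names are hypothetical, and the slides do not say how AQDB is actually stored.

```python
from collections import Counter
from typing import List, Set


class AllQueriesLog:
    """Hypothetical in-memory stand-in for AQDB."""

    def __init__(self) -> None:
        self.frequency: Counter = Counter()  # query -> number of times seen
        self.unanswered: Set[str] = set()    # queries with no answer found

    def record(self, query: str, answered: bool) -> None:
        """Count the query and remember it if no answer was found."""
        self.frequency[query] += 1
        if not answered:
            self.unanswered.add(query)

    def offline_worklist(self) -> List[str]:
        """Unanswered queries, most frequent first, for off-line curation."""
        return sorted(self.unanswered, key=lambda q: -self.frequency[q])
```

Sorting the unanswered queries by frequency is a design assumption: it puts the terms users ask for most often at the top of the enrichment worklist.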

Fuzzy String Matching Methods
Three methods for fuzzy string matching are implemented:
1. ‘Gestalt pattern matching’
2. N-gram-based cosine similarity
3. Word-based cosine similarity
All three are appropriate for fuzzy string matching and often produce similar results. However, methods 1 and 2 handle spelling mistakes better, while method 3 is more robust to word order changes and word omissions.
We will test the methods with real data and, depending on the results, keep, remove or add methods.
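The three similarity measures can be illustrated with the Python standard library. This is a sketch under assumptions, not the prototype's implementation: `difflib.SequenceMatcher` is used as a Gestalt-style (Ratcliff/Obershelp-like) matcher, the n-gram size of 3, the lowercasing, and the function names are all choices made here for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def gestalt_similarity(a: str, b: str) -> float:
    """Gestalt-style similarity in [0, 1], computed by difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def _cosine(x: Counter, y: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(x[k] * y[k] for k in x.keys() & y.keys())
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity over character n-grams (n=3 is an assumed default)."""
    def grams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + n] for i in range(max(len(s) - n + 1, 1)))
    return _cosine(grams(a), grams(b))


def word_cosine(a: str, b: str) -> float:
    """Cosine similarity over whitespace-separated word counts."""
    return _cosine(Counter(a.lower().split()), Counter(b.lower().split()))


# A misspelling keeps some character overlap but shares no whole word.
print(ngram_cosine("recptor", "receptor"), word_cosine("recptor", "receptor"))
# Reordered words leave the word-based score near 1, while the Gestalt score drops.
print(word_cosine("steroid hormone receptor", "receptor steroid hormone"),
      gestalt_similarity("steroid hormone receptor", "receptor steroid hormone"))
```

The two printed comparisons mirror the trade-off stated above: character-level methods tolerate spelling mistakes, while the word-level method tolerates changes in word order.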

Example
For illustration purposes, let us assume that the user’s queries are protein names which he/she would like to map to standard Entrez Gene names and identifiers.
Query 1: ‘steroid hormone receptor’.

Search and Results
The DB was already searched for that term earlier.
The candidate answers are proposed in order of their “popularity”: the highest number of votes first.
“Other aliases/designations” are alternative spellings of the standard name.
“Organism” illustrates the ambiguity.
The user may select the best option, or ‘Nothing’ if none suits his/her needs.
In this example, the user opts for answer number 2.
The user’s choice is recorded and the QA database is updated.

QA Database Update
The user’s choice is taken into account immediately, as the new order of the candidate answers shows (compare with the previous slide).
If the user does not like any of the proposed answers, his/her query can be processed against the whole system database, i.e. a ‘regular search’.
Note that when the user’s query is not found in the QADB, the procedure follows the “regular search” path directly.

Searching Process
Regular search elements: query, string comparison method, max number of suggestions to display, similarity threshold.
The user may select one, two or three methods.
A string similarity threshold of 0.00 means that a threshold value will be applied internally, depending on the method.
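The regular search elements and the "0.00 means internal default" rule can be sketched as a small parameter object. Everything named here is an assumption for illustration: the class `SearchParams`, the method keys, the default of 5 matches, and in particular the per-method threshold values, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical per-method defaults; the real internal values are not given.
DEFAULT_THRESHOLDS = {"gestalt": 0.6, "ngram_cosine": 0.5, "word_cosine": 0.5}


@dataclass
class SearchParams:
    query: str                                # obligatory
    methods: Tuple[str, ...] = ("gestalt",)   # one, two or three methods
    max_matches: int = 5                      # assumed default
    threshold: float = 0.0                    # 0.00 = use the method's default

    def effective_threshold(self, method: str) -> float:
        """Return the user's threshold, or the method default when it is 0.00."""
        if self.threshold > 0.0:
            return self.threshold
        return DEFAULT_THRESHOLDS[method]
```

With these assumed values, `SearchParams(query="steroid hormone receptor").effective_threshold("gestalt")` falls back to the method default, while an explicit `threshold=0.8` is used for every selected method.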

Results
Methods are applied one after another.
The results screen shows the string similarity score of each answer and a ‘Show more’ option.
Answers are displayed from the highest to the lowest string similarity score.
The N top-ranked items are shown, where N = ‘max matches to display’ (see previous slide).
If the ‘Show more’ option is chosen, further answers are displayed in batches.
The ‘best answer’ can be selected from any or all batches. This is not contradictory, as the query accumulates ‘votes’.
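The ranking and ‘Show more’ batching described above can be sketched as follows. The sketch assumes a single method (a Gestalt-style score from `difflib`) and a tiny made-up vocabulary; the function names `ranked_matches` and `batches` and the threshold of 0.5 are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterator, List, Tuple

Scored = Tuple[float, str]  # (similarity score, term)


def ranked_matches(query: str, vocabulary: List[str], threshold: float) -> List[Scored]:
    """Score every term, drop those below the threshold, sort best first."""
    scored = [
        (SequenceMatcher(None, query.lower(), term.lower()).ratio(), term)
        for term in vocabulary
    ]
    kept = [item for item in scored if item[0] >= threshold]
    return sorted(kept, key=lambda item: item[0], reverse=True)


def batches(ranked: List[Scored], size: int) -> Iterator[List[Scored]]:
    """Yield the ranked answers 'size' at a time, as 'Show more' would."""
    for start in range(0, len(ranked), size):
        yield ranked[start:start + size]


# Made-up vocabulary for illustration; not taken from Entrez Gene.
vocabulary = ["steroid hormone receptor", "steroid receptor", "kinase"]
ranked = ranked_matches("steroid hormone receptor", vocabulary, threshold=0.5)
for batch in batches(ranked, size=1):  # size plays the role of 'max matches to display'
    print(batch)
```

Using a generator for the batches means later batches are only produced if the user actually asks for more.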

End of the session Continue with the same query
2nd method from the user’s method selection

Local resources after user session(s)
Fragment of the “All-Queries Data Base” (AQDB)
Columns: Query, Query frequency
Queries are systematically stored and their overall counter is maintained. This allows for local resource enrichment and for grouping of queries by project.

Local resources after user session(s)
Fragment of the “Question/Answers Data Base” (QADB)
Columns: Query, Term ID : Votes
QADB stores queries and the answers that users have selected as most suitable.
A “query” is a single-word or multi-word expression; an “answer” is the standard term name and its unique identifier.
Because queries may be ambiguous, several standard IDs may correspond to the same query.
Votes = how many times users have selected a specific ID for the term. Votes are accumulated across all sessions.
In the example above, “steroid hormone receptor” was most often selected as “ESRRA” of human (5 times), and equally often as “Esrra” of Norway rat and “esrra” of zebrafish.
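The vote bookkeeping can be sketched with nested counters. This is an illustrative stand-in, not the actual QADB: the class name `VoteStore`, its methods, and the placeholder IDs `ID_A` and `ID_B` are assumptions.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class VoteStore:
    """Hypothetical in-memory stand-in for QADB: query -> term ID -> votes."""

    def __init__(self) -> None:
        self._votes: DefaultDict[str, DefaultDict[str, int]] = defaultdict(
            lambda: defaultdict(int)
        )

    def record_choice(self, query: str, term_id: str) -> None:
        """Add one vote each time a user selects term_id for query."""
        self._votes[query][term_id] += 1

    def suggestions(self, query: str) -> List[Tuple[str, int]]:
        """Stored answers for query, highest number of votes first."""
        votes = self._votes.get(query, {})
        return sorted(votes.items(), key=lambda item: item[1], reverse=True)


store = VoteStore()
store.record_choice("steroid hormone receptor", "ID_A")  # placeholder IDs
store.record_choice("steroid hormone receptor", "ID_B")
store.record_choice("steroid hormone receptor", "ID_B")
print(store.suggestions("steroid hormone receptor"))
```

Because one query maps to several IDs, ambiguity is preserved rather than resolved: every ID a user has ever chosen remains available, and the votes only determine the display order.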

Next session: already seen query
If this term is searched again, the QADB suggestions will be displayed in the following order:

Challenges
Real data and eMDR are needed in order to test the implemented fuzzy string matching step.
Adjust the step according to the results and with respect to the data.
Implement the next two steps; these also need real data and eMDR.
The command-line demo will later be provided as an API, once the form of the API is agreed with the other WPs.

Thanks
To Reinhard Schneider, Wei Gu, Venkata for fruitful discussions and advice
To Fabien Chris for valuable comments
Thank you for your attention
