Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.


Outline
– Goals
– Unified Medical Language System (UMLS)
– Apache Lucene
– Get to work!

Goals
Build a dictionary lookup module for NLP pipelines
– Input: string (e.g. “diabetes”, “breast cancer”, “warfarin”)
– Output: list of concepts (e.g. “C083562”)
Application examples:
– Unstructured clinical document coding
– (Semi)automated literature indexing
Pre-processing necessary for free text (not covered today):
– Tokenization
– Sentence detection
– Part-of-speech tagging (e.g. to look up only noun phrases)
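To make the input/output contract above concrete, here is a minimal sketch of a dictionary lookup using only the Java standard library. It is not the Lucene-based module built later in the talk: the class name ConceptDictionary and the CUI values are placeholders invented for illustration, and only exact matches (after simple normalization) are supported.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy dictionary lookup: maps a normalized term to the concept IDs that carry it.
// Hypothetical class; the CUIs used in main are placeholders, not real UMLS data.
final class ConceptDictionary {
    private final Map<String, List<String>> cuisByTerm = new HashMap<>();

    // Normalize so that "Breast  Cancer" and "breast cancer" share one key.
    static String normalize(String term) {
        return term.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Register a term for a concept; duplicates are ignored.
    void add(String term, String cui) {
        List<String> cuis = cuisByTerm.computeIfAbsent(normalize(term), k -> new ArrayList<>());
        if (!cuis.contains(cui)) {
            cuis.add(cui);
        }
    }

    // Input: string. Output: list of concept IDs (empty if the term is unknown).
    List<String> lookup(String term) {
        List<String> cuis = cuisByTerm.get(normalize(term));
        return cuis == null ? Collections.<String>emptyList() : cuis;
    }

    public static void main(String[] args) {
        ConceptDictionary dictionary = new ConceptDictionary();
        dictionary.add("Breast Cancer", "C0000001"); // placeholder CUI
        dictionary.add("Warfarin", "C0000002");      // placeholder CUI
        System.out.println(dictionary.lookup("breast  cancer"));
    }
}
```

An exact-match map like this cannot handle partial matches, phrases, or ranking, which is the gap a text search library is meant to fill.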

UMLS
Unified Medical Language System (NLM)
– Millions of organized biomedical concepts
– Over 150 sources (e.g. SNOMED-CT, LOINC, NCI, MeSH)
– Good source for indexing biomedical concepts!
– UMLS Terminology Services:
Content
– Concepts, synonymous names, relationships
– Semantic network (high-level classification): organism, anatomical structure, biologic function, chemical, …
Distribution
– Files with concept and relationship description data
– Loadable into a database for querying
– Files/columns:

UMLS schema
19 files to describe:
– Concepts
– Relationships
– The files (columns and content)
MRCONSO – Concept names and sources
MRSTY – Concept semantic types
Terminology (source) codes
– ch/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

Concept table (MRCONSO)
CUI: concept unique ID; LAT: language of term; LUI: term unique ID; SAB: source; STR: string
MySQL database
– mysql -u [user] -h [host] -D [database] -p
– Replace with provided info (thanks Kristina!!)
Query example:
select * from MRCONSO where STR like 'my favorite disease';

CUI   LAT   LUI   SAB        STR                                   …
C…    ENG   L…    MSH        Acquired Immunodeficiency Syndromes   …
C…    ENG   L…    SNOMEDCT   AIDS                                  …
C…    FRE   L…    SNOMEDCT   SIDA                                  …
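If you work from the distribution files rather than the MySQL database, the same five columns can be read straight from a MRCONSO row. The sketch below assumes the standard pipe-delimited RRF layout (CUI, LAT, TS, LUI, STT, SUI, ISPREF, AUI, SAUI, SCUI, SDUI, SAB, TTY, CODE, STR, SRL, SUPPRESS, CVF); check the column order against your own release. The class name MrconsoRow and the sample row are hypothetical, and the identifiers in the row are placeholders.

```java
// Minimal holder for the MRCONSO columns discussed on this slide.
// Hypothetical class; assumes the standard 18-column pipe-delimited RRF layout.
final class MrconsoRow {
    final String cui; // concept unique ID
    final String lat; // language of term
    final String lui; // term unique ID
    final String sab; // source vocabulary
    final String str; // the term string itself

    private MrconsoRow(String cui, String lat, String lui, String sab, String str) {
        this.cui = cui;
        this.lat = lat;
        this.lui = lui;
        this.sab = sab;
        this.str = str;
    }

    // Parse one pipe-delimited line; limit -1 keeps empty columns.
    static MrconsoRow parse(String line) {
        String[] columns = line.split("\\|", -1);
        if (columns.length < 15) {
            throw new IllegalArgumentException("Expected at least 15 columns, got " + columns.length);
        }
        // Assumed positions: CUI=0, LAT=1, LUI=3, SAB=11, STR=14.
        return new MrconsoRow(columns[0], columns[1], columns[3], columns[11], columns[14]);
    }

    public static void main(String[] args) {
        // Placeholder row, not real UMLS content.
        String line = "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||M0000001|D000001|MSH|MH|D000001|Example Disease|0|N||";
        MrconsoRow row = MrconsoRow.parse(line);
        System.out.println(row.cui + " " + row.lat + " " + row.sab + " " + row.str);
    }
}
```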

Apache Lucene
Relational databases are not optimized for string search (e.g. partial matches, phrases)
Apache Lucene
– High-performance text search engine library
   Ranked searching (score)
   Phrase queries, wildcard queries, proximity queries…
– Java API to:
   build indexes
   perform lookups
– Integrates nicely into UIMA

Apache Lucene index
Indexes are stored on disk and loaded at runtime
Documents
– Index entries with indexable fields
– The set of fields does not need to be the same for each document
– Searches target one field at a time and return the whole matching document
Default match scoring
– Higher rank = good overlap with the query, infrequent words, short fields

Example (each column is a field, each row is a document):

CUI   LAT   SAB        STR                                   EXTRA
C…          MSH        Acquired Immunodeficiency Syndromes   -
C…    ENG   SNOMEDCT   AIDS                                  genial
C…    FRE   SNOMEDCT   SIDA                                  -
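The document/field model can be illustrated with a toy inverted index. This is a sketch of the idea only, not how Lucene stores or scores anything: the class ToyIndex is hypothetical, tokens are split on whitespace and lower-cased, there is no ranking, and the CUI values in main are placeholders.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index: documents are field maps, and each field has its own postings.
final class ToyIndex {
    private final List<Map<String, String>> documents = new ArrayList<>();
    // field name -> token -> ids of the documents whose field contains that token
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Documents may carry different sets of fields.
    int addDocument(Map<String, String> fields) {
        int docId = documents.size();
        documents.add(new LinkedHashMap<>(fields));
        for (Map.Entry<String, String> field : fields.entrySet()) {
            Map<String, Set<Integer>> fieldPostings =
                    postings.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    fieldPostings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
                }
            }
        }
        return docId;
    }

    // A search targets one field and returns the whole matching documents.
    List<Map<String, String>> search(String field, String token) {
        List<Map<String, String>> hits = new ArrayList<>();
        Map<String, Set<Integer>> fieldPostings = postings.get(field);
        if (fieldPostings == null) {
            return hits;
        }
        Set<Integer> ids = fieldPostings.get(token.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return hits;
        }
        for (int id : ids) {
            hits.add(documents.get(id));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("CUI", "C0000001"); // placeholder CUI
        doc.put("SAB", "SNOMEDCT");
        doc.put("STR", "AIDS");
        doc.put("EXTRA", "genial");
        index.addDocument(doc);
        System.out.println(index.search("STR", "aids"));
    }
}
```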

Apache Lucene Analyzer
Defines the pre-processing step applied to
– Strings indexed by Lucene
– Strings that are looked up in the index
Components
– Tokenizer: creates the token stream (e.g. based on white spaces)
– Filter: applied to the token stream (e.g. lower case, stop words)
This is a good place to customize the matching algorithm, but see also:
– Language-specific analyzers (e.g. Arabic, Chinese, Catalan)
– CustomScoreQuery (to customize the scoring function)
– WildcardQuery, FuzzyQuery, RegexpQuery
– KeywordAnalyzer (no tokenization)
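The tokenizer-plus-filters idea can be sketched without Lucene. The code below is an illustration under simplifying assumptions, not Lucene's StandardAnalyzer: the class ToyAnalyzer, its splitting rule (anything that is not a letter or digit) and its short stop-word list are all made up for this example. The key point is that the same analyze function is applied both to indexed strings and to lookup strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: tokenizer -> lower-case filter -> stop-word filter.
final class ToyAnalyzer {
    // Illustrative stop-word list, much shorter than a real one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the"));

    // Tokenizer: split on any run of characters that are not letters or digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Filter 1: lower-case every token.
    static List<String> lowerCase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    // Filter 2: drop stop words (expects lower-cased tokens).
    static List<String> removeStopWords(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    // Full chain, applied identically at index time and at query time.
    static List<String> analyze(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Cancer of the Breast"));
    }
}
```

Because both sides go through the same chain, an indexed string and a query that differ only in case, punctuation or stop words end up with the same tokens.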

Building an index

    import java.io.File;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    //create reference to Lucene index to be stored on disk
    Directory dir = FSDirectory.open(new File(indexPath));
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); //tokenizer, filters
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc); //get index writer
    …
    Document doc = new Document(); //create new entry (i.e. document)
    Field myfield = new TextField("term", term, Field.Store.YES); //create field
    doc.add(myfield); //add field to document
    …
    writer.addDocument(doc); //add document to index
    …
    writer.close(); //save updated index

StandardAnalyzer = StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
Other analyzer examples: WhitespaceAnalyzer, KeywordAnalyzer.
Field.Store.YES = the original field value is stored in the index, so it can be retrieved from a matching document (a TextField is indexed and tokenized either way).

Creating index queries

    import java.io.File;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    //create reference to existing Lucene index stored on disk
    IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
    //prepare search
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
    //create query on the "term" field
    QueryParser parser = new QueryParser(Version.LUCENE_40, "term", analyzer);
    Query query = parser.parse("hello*"); //search for terms that start with 'hello'
    //search
    TopDocs results = searcher.search(query, 5); //search for top 5 matches

    //collect results
    ScoreDoc[] hits = results.scoreDocs; //collect matches
    int numTotalHits = results.totalHits; //count number of results
    …
    Document doc = searcher.doc(hits[0].doc); //retrieve first matching entry
    float score = hits[0].score; //retrieve score of first matching entry (a float, not an int)
    String term = doc.get("term"); //retrieve value of field "term"

Let's get to work!
Download necessary files
– Apache Lucene Core API
– MySQL Java connector
– Files for this tutorial
Create Eclipse project
– Add necessary JAR files to build path
– Copy source files to project src folder
Complete code to:
– Build index from MySQL query (don’t use all concepts!!)
– Create search function that returns the CUIs of matching terms

Merci! [C ] Thank you (NCI)
Julien Thibault
University of Utah, Department of Biomedical Informatics