Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics.

1 Indexing UMLS concepts with Apache Lucene Julien Thibault jcv.thibault@gmail.com University of Utah Department of Biomedical Informatics

2 Outline
  – Goals
  – Unified Medical Language System (UMLS)
  – Apache Lucene
  – Get to work!

3 Goals
Build a dictionary lookup module for NLP pipelines
  – Input: string (e.g. “diabetes”, “breast cancer”, “warfarin”)
  – Output: list of concept identifiers (e.g. “C083562”)
Application examples:
  – Unstructured clinical document coding
  – (Semi)automated literature indexing
Pre-processing necessary for free text (not covered today):
  – Tokenization
  – Sentence detection
  – Part-of-speech tagging (e.g. to look up only noun phrases)
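
The input/output contract of the lookup module can be sketched with a plain in-memory map. This is only an illustration of the interface, not the Lucene-based module built later in the deck; the class name `ConceptDictionary` and its methods are hypothetical, and the normalization (lower-casing, whitespace collapsing) is an assumption about what counts as a match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for the dictionary lookup module.
class ConceptDictionary {
    // normalized term -> concept identifiers (CUIs) that carry this term
    private final Map<String, List<String>> cuisByTerm = new HashMap<>();

    // Assumed normalization: trim, lower-case, collapse runs of whitespace.
    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
    }

    // Register one term for a concept; duplicate (term, CUI) pairs are ignored.
    void add(String term, String cui) {
        List<String> cuis = cuisByTerm.computeIfAbsent(normalize(term), k -> new ArrayList<>());
        if (!cuis.contains(cui)) {
            cuis.add(cui);
        }
    }

    // Input: a string. Output: the list of CUIs whose terms match exactly after normalization.
    List<String> lookup(String input) {
        return cuisByTerm.getOrDefault(normalize(input), Collections.emptyList());
    }
}
```

An exact-match map like this cannot handle partial matches or ranking, which is the gap a text search engine fills.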

4 UMLS
Unified Medical Language System (NLM)
  – Millions of organized biomedical concepts
  – Over 150 sources (e.g. SNOMED-CT, LOINC, NCI, MeSH)
  – Good source for indexing biomedical concepts!
  – UMLS Terminology Services: https://uts.nlm.nih.gov/home.html
Content
  – Concepts, synonymous names, relationships
  – Semantic network (high-level classification): organism, anatomical structure, biologic function, chemical, …
Distribution
  – Files with concept and relationship description data
  – Loadable into a database for querying
  – Files/columns: http://www.ncbi.nlm.nih.gov/books/NBK9685/

5 UMLS schema
19 files to describe:
  – Concepts
  – Relationships
  – The files (columns and content)
MRCONSO: concept names and sources
MRSTY: concept semantic types
Terminology (source) codes
  – http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html

6 Concept table (MRCONSO)
CUI: concept unique ID; LAT: language of term; LUI: term unique ID; SAB: source; STR: string
MySQL database
  – mysql -u [user] -h [host] -D [database] -p
  – Replace with provided info (thanks Kristina!!)
Query example:
  select * from MRCONSO where STR like 'my favorite disease';
Result rows:
  CUI       LAT  LUI       SAB       STR                                  …
  C0001175  ENG  L0001175  MSH       Acquired Immunodeficiency Syndromes  …
  C0001175  ENG  L0001842  SNOMEDCT  AIDS                                 …
  C0001175  FRE  L0162173  SNOMEDCT  SIDA                                 …
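
When working from the distribution files rather than a database, the same columns can be read straight from MRCONSO.RRF, which is pipe-delimited. The sketch below assumes the standard MRCONSO column order documented by NLM (CUI at position 0, LAT at 1, LUI at 3, SAB at 11, STR at 14); verify the positions against the column reference for your release. The class name `MrconsoRow` is hypothetical.

```java
// Hypothetical holder for the five MRCONSO columns discussed on the slide.
class MrconsoRow {
    final String cui; // concept unique ID
    final String lat; // language of term
    final String lui; // term unique ID
    final String sab; // source abbreviation
    final String str; // the term string itself

    MrconsoRow(String cui, String lat, String lui, String sab, String str) {
        this.cui = cui;
        this.lat = lat;
        this.lui = lui;
        this.sab = sab;
        this.str = str;
    }

    // Parse one pipe-delimited line. Column positions are an assumption
    // based on the documented MRCONSO layout.
    static MrconsoRow parse(String line) {
        // A limit of -1 keeps empty fields, which RRF files contain frequently.
        String[] f = line.split("\\|", -1);
        if (f.length < 15) {
            throw new IllegalArgumentException("Expected at least 15 fields, got " + f.length);
        }
        return new MrconsoRow(f[0], f[1], f[3], f[11], f[14]);
    }
}
```

For example, `MrconsoRow.parse("C0001175|ENG|P|L0001842|PF|S0000000|Y|A0000000||||SNOMEDCT|PT|0000|AIDS|0|N||")` yields a row with `cui` "C0001175" and `str` "AIDS". Only the CUI, LAT, LUI, SAB and STR values in that line come from the slide; the other fields are made-up placeholders.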

7 Apache Lucene
Relational databases are not optimized for string search (e.g. partial matches, phrases)
Apache Lucene: http://lucene.apache.org/
  – High-performance text search engine library
      Ranked searching (score)
      Phrase queries, wildcard queries, proximity queries…
  – Java API to build indexes and perform lookups
  – Integrates nicely into UIMA

8 Apache Lucene index
Indexes are stored on disk and loaded at runtime
Documents
  – Index entries with indexable fields
  – The set of fields does not need to be the same for each document
  – Searches target one field at a time and return the whole matching document
Default match scoring
  – Higher ranks = good overlap, infrequent words, short fields
Example (each row is a document, each column a field):
  CUI       LAT  SAB       STR                                  EXTRA
  C0001175  -    MSH       Acquired Immunodeficiency Syndromes  -
  C0001175  ENG  SNOMEDCT  AIDS                                 genial
  C0001175  FRE  SNOMEDCT  SIDA                                 -
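
The document/field model can be illustrated with a toy inverted index in plain Java. This is not how Lucene is implemented and it does no scoring; it only shows that documents may carry different sets of fields, that a search targets one field, and that a hit returns the whole document. The class name `ToyIndex` and the whitespace/lower-case tokenization are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy illustration of documents, fields and per-field search (no ranking).
class ToyIndex {
    // Every document is a free-form set of field name -> value pairs.
    private final List<Map<String, String>> docs = new ArrayList<>();
    // field name -> token -> ids of the documents containing that token in that field
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    void addDocument(Map<String, String> doc) {
        int id = docs.size();
        docs.add(doc);
        for (Map.Entry<String, String> field : doc.entrySet()) {
            Map<String, Set<Integer>> byToken =
                    postings.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            for (String token : field.getValue().toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    byToken.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
                }
            }
        }
    }

    // Search a single field for a single token; return the whole matching documents.
    List<Map<String, String>> search(String field, String token) {
        List<Map<String, String>> hits = new ArrayList<>();
        Map<String, Set<Integer>> byToken = postings.get(field);
        if (byToken == null) {
            return hits;
        }
        Set<Integer> ids = byToken.get(token.toLowerCase(Locale.ROOT));
        if (ids == null) {
            return hits;
        }
        for (int id : ids) {
            hits.add(docs.get(id));
        }
        return hits;
    }
}
```

Searching the STR field for "aids" returns the full document, so the caller can read its CUI field from the hit even though CUI was not the field searched.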

9 Apache Lucene Analyzer
Defines the pre-processing step applied to
  – Strings indexed by Lucene
  – Strings that are looked up in the index
Components
  – Tokenizer: creates token stream (e.g. based on white spaces)
  – Filter: applied to token stream (e.g. lower case, stop words)
This is a good place to customize the matching algorithm, but see also:
  – Language-specific analyzers (e.g. Arabic, Chinese, Catalan)
  – CustomScoreQuery (to customize scoring function)
  – WildcardQuery, FuzzyQuery, RegexpQuery
  – KeywordAnalyzer (no tokenization)
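
The tokenizer-plus-filters idea can be sketched without Lucene. The class below is a hypothetical stand-in, not a Lucene `Analyzer`: it splits on whitespace, lower-cases each token, and drops a small stop-word list chosen for illustration. The point it demonstrates is that the same function must be applied both to indexed strings and to looked-up strings for them to match.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: whitespace tokenizer -> lower-case filter -> stop-word filter.
class ToyAnalyzer {
    // Illustrative stop-word list (an assumption, not Lucene's default list).
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "of", "the"));

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {            // tokenizer: split on whitespace
            String token = raw.toLowerCase(Locale.ROOT);   // filter 1: lower case
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;                                  // filter 2: drop stop words
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

With this chain, the indexed string "Cancer of the Breast" and the lookup string "cancer breast" both reduce to the tokens `cancer`, `breast`.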

10 Building an index
  import java.io.File;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.*;
  import org.apache.lucene.index.*;
  import org.apache.lucene.store.*;
  import org.apache.lucene.util.Version;
  …
  //create reference to Lucene index to be stored on disk
  Directory dir = FSDirectory.open(new File(indexPath));
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); //tokenizer, filters
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
  IndexWriter writer = new IndexWriter(dir, iwc); //get index writer
  …
  Document doc = new Document(); //create new entry (i.e. document)
  Field myfield = new TextField("term", term, Field.Store.YES); //create field
  doc.add(myfield); //add field to document
  …
  writer.addDocument(doc); //add document to index
  …
  writer.close(); //commit changes and close the index
http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/IndexFiles.html
StandardAnalyzer = StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. Other analyzer examples: WhitespaceAnalyzer, KeywordAnalyzer.
Field.Store.YES = the original field value is stored in the index so that it can be read back from search hits (a TextField is indexed and tokenized regardless of this setting).

11 Creating index queries
  import java.io.File;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.*;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.*;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;
  …
  //create reference to existing Lucene index stored on disk
  IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
  //prepare search
  IndexSearcher searcher = new IndexSearcher(reader);
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
  //create query on the "term" field
  QueryParser parser = new QueryParser(Version.LUCENE_40, "term", analyzer);
  Query query = parser.parse("hello*"); //search for terms that start with 'hello'
  //search
  TopDocs results = searcher.search(query, 5); //search for top 5 matches
  //collect results
  ScoreDoc[] hits = results.scoreDocs; //collect matches
  int numTotalHits = results.totalHits; //count number of results
  …
  Document doc = searcher.doc(hits[0].doc); //retrieve first matching entry (assumes at least one hit)
  float score = hits[0].score; //retrieve score of first matching entry (scores are floats)
  String term = doc.get("term"); //retrieve value of field "term"
http://lucene.apache.org/core/4_6_0/demo/src-html/org/apache/lucene/demo/SearchFiles.html

12 Let's get to work!
Download necessary files
  – Apache Lucene Core API: http://lucene.apache.org/core/mirrors-core-latest-redir.html
  – MySQL Java connector: http://dev.mysql.com/downloads/connector/j/
  – Files for this tutorial
Create Eclipse project
  – Add necessary JAR files to build path
  – Copy source files to project src folder
Complete code to:
  – Build index from MySQL query (don’t use all concepts!!)
  – Create search function that returns the CUIs of matching terms

13 Merci! [C2986674] Thank you (NCI) Julien Thibault jcv.thibault@gmail.com University of Utah Department of Biomedical Informatics

