Indexing UMLS concepts with Apache Lucene Julien Thibault University of Utah Department of Biomedical Informatics
Outline Goals Unified Medical Language System (UMLS) Apache Lucene Get to work!
Goals
Build a dictionary lookup module for NLP pipelines
  – Input: string (e.g. “diabetes”, “breast cancer”, “warfarin”)
  – Output: list of concepts (e.g. “C083562”)
Application examples:
  – Unstructured clinical document coding
  – (Semi)automated literature indexing
Pre-processing necessary for free text (not covered today):
  – Tokenization
  – Sentence detection
  – Part-of-speech tagging (e.g. to look up only noun phrases)
UMLS
Unified Medical Language System (NLM)
  – Millions of organized biomedical concepts
  – Over 150 sources (e.g. SNOMED-CT, LOINC, NCI, MESH)
  – Good source for indexing biomedical concepts!
  – UMLS Terminology Services:
Content
  – Concepts, synonymous names, relationships
  – Semantic network (high-level classification): organism, anatomical structure, biologic function, chemical, …
Distribution
  – Files with concept and relationship description data
  – Loadable into a database for querying
  – Files/columns:
UMLS schema
19 files to describe:
  – Concepts
  – Relationships
  – The files (columns and content)
MRCONSO
  – Concept names and sources
MRSTY
  – Concept semantic types
Terminology (source) codes
  – ch/umls/knowledge_sources/metathesaurus/release/source_vocabularies.html
Concept table (MRCONSO)
CUI: concept unique ID; LAT: language of term; LUI: term unique ID; SAB: source; STR: string

CUI   LAT   LUI   SAB        STR                                   …
C…    ENG   L…    MSH        Acquired Immunodeficiency Syndromes   …
C…    ENG   L…    SNOMEDCT   AIDS                                  …
C…    FRE   L…    SNOMEDCT   SIDA                                  …

MySQL database
  – mysql -u [user] -h [host] -D [database] -p
  – Replace with provided info (thanks Kristina!!)
Query example:
  select * from MRCONSO where STR like 'my favorite disease';
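Note that `like` without wildcards behaves as an exact match. The same lookup can be run from Java through JDBC; the following is a minimal sketch rather than the tutorial's own code. The class name `MrconsoLookup`, both method names, and the `limit` parameter are hypothetical; only the `MRCONSO` table and its `CUI`, `LAT` and `STR` columns come from the slide. It assumes a MySQL connection has already been opened with the MySQL Java connector, and that backslash is the `LIKE` escape character (the MySQL default).

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; not part of the tutorial files.
final class MrconsoLookup {

    // Escape the LIKE metacharacters (\, %, _) in the term,
    // then wrap it in % so the query matches substrings.
    static String containsPattern(String term) {
        String escaped = term.replace("\\", "\\\\")
                             .replace("%", "\\%")
                             .replace("_", "\\_");
        return "%" + escaped + "%";
    }

    // Return the distinct CUIs of English terms whose STR contains `term`,
    // reading at most `limit` rows.
    static List<String> findCuis(Connection conn, String term, int limit) throws SQLException {
        String sql = "SELECT DISTINCT CUI FROM MRCONSO WHERE LAT = 'ENG' AND STR LIKE ? LIMIT ?";
        List<String> cuis = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, containsPattern(term)); // bound parameter, no string concatenation
            ps.setInt(2, limit);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    cuis.add(rs.getString("CUI"));
                }
            }
        }
        return cuis;
    }
}
```

Binding the term as a parameter avoids SQL injection, and the row limit keeps the result set small while experimenting.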
Apache Lucene
Relational databases are not optimized for string search (e.g. partial matches, phrases)
Apache Lucene
  – High-performance text search engine library: ranked searching (score); phrase queries, wildcard queries, proximity queries…
  – Java API to build indexes and perform lookups
  – Integrates nicely with UIMA
Apache Lucene index
Indexes are stored on disk and loaded at runtime
Documents
  – Index entries with indexable fields
  – The set of fields does not need to be the same for each document
  – Searches target one field at a time and return the whole matching document
Default match scoring
  – Higher scores go to documents with good overlap with the query, infrequent words, and short fields

Each row is a document; each column is a field:

CUI   LAT   SAB        STR                                   EXTRA
C…          MSH        Acquired Immunodeficiency Syndromes   -
C…    ENG   SNOMEDCT   AIDS                                  genial
C…    FRE   SNOMEDCT   SIDA                                  -
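To make the document/field model concrete, here is a toy inverted index in plain Java. It is a teaching sketch under stated assumptions, not Lucene's implementation: the class `ToyIndex`, its methods, and the scoring rule (one point per matching query token, divided by the square root of the field length so that short fields rank higher) are all invented for illustration, and the real Lucene scoring formula is more involved.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class; it only mimics the ideas on the slide.
final class ToyIndex {
    // A document is a set of named fields; documents may have different fields.
    private final List<Map<String, String>> docs = new ArrayList<>();
    // field name -> token -> ids of the documents whose field contains that token
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();

    // Simplistic analysis: lower-case, then split on whitespace.
    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    // Index every field of the document under its own field name.
    void add(Map<String, String> doc) {
        int id = docs.size();
        docs.add(doc);
        for (Map.Entry<String, String> field : doc.entrySet()) {
            Map<String, Set<Integer>> byToken =
                postings.computeIfAbsent(field.getKey(), k -> new HashMap<>());
            for (String token : tokenize(field.getValue())) {
                byToken.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
    }

    // Search ONE field; return the WHOLE matching documents, best score first.
    List<Map<String, String>> search(String field, String query) {
        Map<String, Set<Integer>> byToken = postings.getOrDefault(field, Map.of());
        Map<Integer, Double> scores = new HashMap<>();
        for (String token : tokenize(query)) {
            for (int id : byToken.getOrDefault(token, Set.of())) {
                int fieldLength = tokenize(docs.get(id).get(field)).length;
                // Toy score: more matching tokens and shorter fields score higher.
                scores.merge(id, 1.0 / Math.sqrt(fieldLength), Double::sum);
            }
        }
        List<Integer> ids = new ArrayList<>(scores.keySet());
        ids.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b); // stable tie-break
        });
        List<Map<String, String>> hits = new ArrayList<>();
        for (int id : ids) {
            hits.add(docs.get(id));
        }
        return hits;
    }
}
```

For example, after adding `Map.of("SAB", "SNOMEDCT", "STR", "AIDS", "EXTRA", "genial")`, a call to `search("STR", "aids")` returns that whole document, including its `EXTRA` field, even though only `STR` was searched.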
Apache Lucene Analyzer
Defines the pre-processing step applied to
  – Strings indexed by Lucene
  – Strings that are looked up in the index
Components
  – Tokenizer: creates the token stream (e.g. based on white spaces)
  – Filter: applied to the token stream (e.g. lower case, stop words)
This is a good place to customize the matching algorithm, but see also:
  – Language-specific analyzers (e.g. Arabic, Chinese, Catalan)
  – CustomScoreQuery (to customize the scoring function)
  – WildcardQuery, FuzzyQuery, RegexpQuery
  – KeywordAnalyzer (no tokenization)
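The tokenizer-plus-filters pipeline can be sketched in a few lines of plain Java. This is an illustration of the idea only, not Lucene's `Analyzer` API: the class `ToyAnalyzer`, the method `analyze`, and the short stop-word list are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy class; it imitates a whitespace tokenizer followed by two filters.
final class ToyAnalyzer {
    // Illustrative stop-word list (an assumption, not Lucene's default set).
    private static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("\\s+")) {             // tokenizer: split on whitespace
            String token = raw.toLowerCase(Locale.ROOT);     // filter 1: lower-case
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;                                    // filter 2: drop stop words
            }
            tokens.add(token);
        }
        return tokens;
    }
}
```

Because the same function is applied to indexed strings and to lookup strings, `analyze("Cancer of the Breast")` and `analyze("cancer breast")` both yield `[cancer, breast]`, so the two forms share the same tokens.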
Building an index

  import java.io.File;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  //create reference to Lucene index to be stored on disk
  Directory dir = FSDirectory.open(new File(indexPath));
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40); //tokenizer, filters
  IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_40, analyzer);
  IndexWriter writer = new IndexWriter(dir, iwc); //get index writer
  …
  Document doc = new Document(); //create new entry (i.e. document)
  Field myfield = new TextField("term", term, Field.Store.YES); //create field
  doc.add(myfield); //add field to document
  …
  writer.addDocument(doc); //add document to index
  …
  writer.close(); //save updated index

StandardAnalyzer = StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. Other analyzer examples: WhitespaceAnalyzer, KeywordAnalyzer.
Field.Store.YES = the original value of this field is stored in the index, so it can be retrieved from search results (a TextField is always tokenized and indexed).
Creating index queries

  import java.io.File;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.queryparser.classic.QueryParser;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.store.FSDirectory;
  import org.apache.lucene.util.Version;

  //create reference to existing Lucene index stored on disk
  IndexReader reader = DirectoryReader.open(FSDirectory.open(new File(index)));
  //prepare search
  IndexSearcher searcher = new IndexSearcher(reader);
  Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
  //create query on the "term" field
  QueryParser parser = new QueryParser(Version.LUCENE_40, "term", analyzer);
  Query query = parser.parse("hello*"); //search for terms that start with 'hello'
  //search
  TopDocs results = searcher.search(query, 5); //search for top 5 matches
  //collect results
  ScoreDoc[] hits = results.scoreDocs; //collect matches
  int numTotalHits = results.totalHits; //count number of results
  …
  Document doc = searcher.doc(hits[0].doc); //retrieve first matching entry
  float score = hits[0].score; //retrieve score of first matching entry (a float)
  String term = doc.get("term"); //retrieve value of field "term"
Let's get to work!
Download necessary files
  – Apache Lucene Core API
  – MySQL Java connector
  – Files for this tutorial
Create Eclipse project
  – Add necessary JAR files to build path
  – Copy source files to project src folder
Complete code to:
  – Build index from MySQL query (don’t use all concepts!!)
  – Create search function that returns the CUIs of matching terms
Merci! [C ] Thank you (NCI) Julien Thibault University of Utah Department of Biomedical Informatics