Applications of Text Mining Ewan Klein School of Informatics & NeSC
Text Mining Goals
- Extract useful information from large bodies of unstructured or semi-structured documents
- Look for patterns in natural language text
- Driven by application needs
Three Areas:
- Adding Metadata: e.g., identify Dublin Core elements from document headers
- Information Extraction: identify nuggets of text data and marshal them into a fixed format
- Assisting Curation
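As a rough illustration of the "Adding Metadata" area, the sketch below pulls `Field: value` lines out of a document header and marshals them into a fixed record keyed by Dublin Core element names. The header layout, the `HEADER_TO_DC` mapping, and the function name `extract_dublin_core` are assumptions made for this example; they are not taken from the talk.

```python
import re

# Hypothetical mapping from header field names to Dublin Core elements.
HEADER_TO_DC = {
    "title": "dc:title",
    "author": "dc:creator",
    "date": "dc:date",
    "subject": "dc:subject",
}

# Matches simple "Field: value" header lines (an assumed header layout).
HEADER_LINE = re.compile(r"^(?P<field>[A-Za-z]+)\s*:\s*(?P<value>.+?)\s*$")


def extract_dublin_core(header_text):
    """Return a dict of Dublin Core elements found in the header lines."""
    record = {}
    for line in header_text.splitlines():
        match = HEADER_LINE.match(line)
        if not match:
            continue
        element = HEADER_TO_DC.get(match.group("field").lower())
        if element is not None:
            # Keep the first value seen for each element.
            record.setdefault(element, match.group("value"))
    return record


sample_header = """Title: Applications of Text Mining
Author: Ewan Klein
Keywords: skipped because the field is not in the mapping
"""

print(extract_dublin_core(sample_header))
```

Real document headers are far less regular than this, which is one reason the slides point toward learned extractors rather than hand-written patterns.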
Text Mining and Curation
Example workflow:
- Make an observation
- Search the research literature for knowledge
- Incorporate relevant information into a database
Challenges:
- Current Information Retrieval (IR) techniques are often too imprecise
  - Example query: "Which enzymes act as catalysts in the glycolysis pathway?"
  - We want to identify a relation between two entities
- Move to augmenting IR with more knowledge of text structure
  - Mostly supervised machine learning techniques
  - Still need training data for each domain
- Need to integrate text mining into Grid applications
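To make "identify a relation between two entities" concrete, here is a minimal sketch that reports an (enzyme, trigger, pathway) triple whenever an enzyme name, a catalysis verb, and a pathway name occur in the same sentence. The gazetteers `ENZYMES`, `PATHWAYS`, and `TRIGGERS`, the sample text, and the co-occurrence heuristic are all assumptions for illustration. The slides describe supervised machine learning for this task; this hand-written rule only shows the shape of the output.

```python
import re

# Hypothetical gazetteers; a real system would learn or load these.
ENZYMES = {"hexokinase", "phosphofructokinase", "enolase"}
PATHWAYS = {"glycolysis"}
TRIGGERS = {"catalyses", "catalyzes", "catalyse", "catalyze"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def find_relations(text):
    """Return (enzyme, trigger, pathway) triples found within one sentence."""
    relations = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
        triggers = [t for t in tokens if t in TRIGGERS]
        if not triggers:
            continue  # no catalysis verb, so no relation is asserted
        for enzyme in (t for t in tokens if t in ENZYMES):
            for pathway in (t for t in tokens if t in PATHWAYS):
                relations.append((enzyme, triggers[0], pathway))
    return relations


sample_text = (
    "Hexokinase catalyses the first step of glycolysis. "
    "Glycolysis takes place in the cytosol."
)

print(find_relations(sample_text))
```

A keyword search for "glycolysis" would return both sentences; requiring the enzyme, the verb, and the pathway together keeps only the first, which is the kind of precision gain the slide is asking for.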
BlueDwarf for Text Mining
BioCreative Competition:
- Joint entry with Stanford
- Recognition of drug names, chemical names, and protein names in MEDLINE abstracts
Java maximum entropy tagger:
- Used roughly 700,000 features in the early stages
- Java memory size of 1950 MB
- Died on the available Informatics and Stanford machines
BlueDwarf:
- Arrived at 1,247,77 features, memory: 2560 MB
- Several experiments running in parallel
- Provisional results: we obtained top-scoring results
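The feature counts above are easier to appreciate with a small sketch of how a maximum entropy tagger typically derives features from each token. The feature templates below (word identity, word shape, neighbouring words, prefixes, and suffixes) are common choices assumed for illustration, not the templates used in the BioCreative entry, and the sketch is written in Python rather than the Java of the original tagger. Every distinct feature string becomes one model parameter per label, so the feature set grows quickly with the vocabulary.

```python
import re


def word_shape(token):
    """Collapse characters into classes: 'IL-2' becomes 'XX-d'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens, i):
    """Return the set of feature strings for the token at position i."""
    token = tokens[i]
    features = {
        "word=" + token.lower(),
        "shape=" + word_shape(token),
        "prev=" + (tokens[i - 1].lower() if i > 0 else "<s>"),
        "next=" + (tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>"),
    }
    # Short prefixes and suffixes help with unseen chemical and protein names.
    for n in (2, 3):
        if len(token) >= n:
            features.add("prefix%d=%s" % (n, token[:n].lower()))
            features.add("suffix%d=%s" % (n, token[-n:].lower()))
    return features


def count_distinct_features(sentences):
    """Count distinct feature strings over a list of tokenised sentences."""
    seen = set()
    for tokens in sentences:
        for i in range(len(tokens)):
            seen |= token_features(tokens, i)
    return len(seen)


# Hypothetical two-token sentence, for illustration only.
sample = [["IL-2", "binds"]]
print(sorted(token_features(sample[0], 0)))
print(count_distinct_features(sample))
```

Even these few templates produce several new features per previously unseen word, which suggests how a tagger trained on MEDLINE abstracts can reach hundreds of thousands of features and outgrow a 2 GB Java heap.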