Published by Derick Miller. Modified over 9 years ago.
1 Flexible Text Mining using Interactive Information Extraction
David Milward
david.milward@linguamatics.com
2 Text mining vs. Data Mining
Text mining
– getting nuggets of information from text
– extracting relationships
– structured results to feed into data mining, visualisation or databases
Data mining
– getting new knowledge from databases
– suggesting new relationships, trends, patterns
3 Text Data Mining
Emphasizes finding new knowledge from text
Typically knowledge that is implicit within multiple documents
4 What is the relationship to IR?
IR finds the most relevant documents
Text mining finds information from within documents, or across documents
– What drugs are used for psoriasis treatment?
– Who is associated, directly or indirectly, with the Board of Exxon?
There is overlap …
– we often search to answer a question, not to find a document
5 Traditional Information Extraction
Uses natural language processing to distinguish
– Sanofi bid for Aventis
– Aventis bid for Sanofi
Provides structured results for easy review and analysis
Uses normalised terminology to allow integration with databases, e.g.
– Preferred term: Sanofi
– Synonyms: Sanofi Pasteur, Sanofi Synthelabo, Sanofi Synthélabo …
But:
– typically limited to patterns within a single sentence
– constructing, testing and running queries can take days
Appropriate if you always have the same question, e.g. you want to run it over a newsfeed every night
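The two points on this slide (direction-sensitive patterns and normalised terminology) can be illustrated with a minimal Python sketch. It is not how any particular extraction product works: the `TERMINOLOGY` table, the `extract_bids` helper and the single "X bid for Y" regular expression are assumptions made for illustration, using only the company names shown on the slide.

```python
import re

# Hypothetical mini-terminology: preferred term -> synonyms (names taken from the slide).
TERMINOLOGY = {
    "Sanofi": ["Sanofi", "Sanofi Pasteur", "Sanofi Synthelabo", "Sanofi Synthélabo"],
    "Aventis": ["Aventis"],
}

# Map every synonym (lower-cased) to its preferred term.
SYNONYM_TO_PREFERRED = {
    synonym.lower(): preferred
    for preferred, synonyms in TERMINOLOGY.items()
    for synonym in synonyms
}

# Try longer synonyms first so "Sanofi Pasteur" is preferred over plain "Sanofi".
_COMPANY = "|".join(
    re.escape(s) for s in sorted(SYNONYM_TO_PREFERRED, key=len, reverse=True)
)
BID_PATTERN = re.compile(rf"({_COMPANY})\s+bid\s+for\s+({_COMPANY})", re.IGNORECASE)


def extract_bids(sentence):
    """Return (bidder, target) pairs, each normalised to its preferred term."""
    return [
        (
            SYNONYM_TO_PREFERRED[match.group(1).lower()],
            SYNONYM_TO_PREFERRED[match.group(2).lower()],
        )
        for match in BID_PATTERN.finditer(sentence)
    ]
```

A keyword search would treat "Sanofi bid for Aventis" and "Aventis bid for Sanofi" as the same hit; here the order of the tuple records who bid for whom, and any synonym of Sanofi is reported under the preferred term.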
6 I2E: Interactive Information Extraction
A new concept
Encompasses
– keywords → documents
– patterns → relationships (structured output)
Queries ranging from:
– General Motors
– General Motors & acquisition in the same document
– Automotive companies & acquisitions in the same sentence
– What companies is General Motors associated with?
Not limited to patterns within sentences, e.g.
– Merger and acquisition activity in documents mentioning Japan
Fast, scalable, versatile
[Diagram labels: I2E, Information Extraction, NLP, Taxonomies/Ontologies, Text Search, Structured Output]
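The difference between "in the same document" and "in the same sentence" in the query range above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names are hypothetical, terms are matched as case-insensitive substrings, and the sentence splitter is deliberately naive.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def contains_all(text, terms):
    """True if every term occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def cooccur_in_document(document, terms):
    """Document-level co-occurrence: all terms appear somewhere in the document."""
    return contains_all(document, terms)


def cooccur_in_sentence(document, terms):
    """Sentence-level co-occurrence: return the sentences that contain all terms."""
    return [s for s in split_sentences(document) if contains_all(s, terms)]
```

A document that mentions General Motors in one sentence and an unrelated acquisition in another satisfies the document-level query but returns no sentences from the sentence-level one, which is why the narrower query tends to give fewer but more relevant hits.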
7 Linguistic Processing
Example sentences:
– We find that p42mapk phosphorylates c-Myb on serine and threonine.
– Purified recombinant p42 MAPK was found to phosphorylate Wee1.
Groups words into meaningful units
– noun phrases match entities
– verb groups match actions
Morphology allows search for different forms of words
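As a rough illustration of the morphology point, the Python sketch below expands a base verb into its regular inflected forms and uses them to find the word that follows the verb in the two example sentences. The helpers `verb_forms` and `object_after_verb` are hypothetical; a real system would use proper morphological analysis and noun-phrase chunking rather than suffix rules and a single following token.

```python
import re


def verb_forms(base):
    """Enumerate regular inflections of an English verb (ignores irregular verbs)."""
    if base.endswith("e"):
        stem = base[:-1]
        return [base, base + "s", stem + "ed", stem + "ing"]
    return [base, base + "s", base + "ed", base + "ing"]


def object_after_verb(sentence, base):
    """Return the token directly after any form of the verb, or None if absent."""
    # Longest forms first so "phosphorylates" is tried before "phosphorylate".
    forms = sorted(verb_forms(base), key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in forms)
    pattern = re.compile(rf"\b(?:{alternatives})\s+([\w-]+)", re.IGNORECASE)
    match = pattern.search(sentence)
    return match.group(1) if match else None
```

Searching on the base form "phosphorylate" then matches both "phosphorylates c-Myb" and "to phosphorylate Wee1", which a search for one exact word form would miss.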
8 Monitoring Merger and Acquisition Activity
9 Company Positions
10 Using I2E in the Life Sciences
Good resources
– Scientific abstracts are readily available in XML
– Large number of existing taxonomies/terminologies
Very large scale
– 16 million abstracts relevant to life sciences. Growing ???? a year
– Large numbers of internal reports and full-text articles
– Internal documents often > 1000 pages, may be PDF images
– Taxonomies/terminologies are large, often deeply structured, e.g. 350K nodes, ??? synonyms
– Still need to augment terminology for specific areas
Relatively large scale
– 17 million abstracts
– Large numbers of internal reports and full-text articles
– Internal documents can be > 1000 pages, may be PDF images
– Taxonomies/terminologies are large, often deeply structured: > 100K concepts, > 400K synonyms
– Still need to augment terminology for specific areas
11 Examples of Pharma Questions
R&D
– Which proteins interact with metabolite X?
– What are the reaction kinetics for canonical pathway Y?
– What attributes are common to sets of biomarker genes?
– What are the known associations between expressed genes and environmental factors?
– What dosages of compound B cause adverse reactions?
Competitive Intelligence
– Which companies are working on technology C?
– What compounds are available for in-licensing in a disease area?
– Which research groups are my competitors collaborating with?
12 Linking Drugs to Adverse Events
13 Measurements
Extraction of numerical parameters
– e.g. amounts, dosages, concentrations
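One simple way to approximate this kind of numerical extraction is a regular expression over a value followed by a unit. The Python sketch below is an assumption-laden illustration, not the method shown in the presentation: the unit list is a small hypothetical sample, and `extract_measurements` is an invented helper name.

```python
import re

# Hypothetical, deliberately short unit list; longer units come first so that
# "mg/kg" is tried before "mg" and "mg" before "g".
UNITS = ["mg/kg", "mg", "ug", "g", "ml", "nM", "uM", "mM", "%"]

MEASUREMENT = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>"
    + "|".join(re.escape(unit) for unit in UNITS)
    + r")(?![A-Za-z])"  # do not treat the "g" of "5 genes" as grams
)


def extract_measurements(text):
    """Return (value, unit) pairs for every number followed by a known unit."""
    return [
        (float(match.group("value")), match.group("unit"))
        for match in MEASUREMENT.finditer(text)
    ]
```

Returning the value as a float and the unit as a separate field gives structured output that can be filtered or sorted numerically, for example to find all dosages above a threshold.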
14 Benefits of Flexible Text Mining
The ideal final query may use
– co-occurrence of terms within a document or sentence
– a precise linguistic pattern
– a mixture of both
It depends on
– the nature of the task
– the availability of terminologies
– the kind of documents (news vs. science, abstract vs. full text)
– the time available to check results
Flexibility to mix different techniques is also critical for fast development of queries
– e.g. start with broad queries to explore the “results space”, then home in
15 I2E: Better Results, Faster
Fast query creation
Fast return of results
Fast review and analysis
16 Impact of I2E
Significant reduction in time spent searching/reading the literature
– weeks reduced to days or hours
Structure the unstructured to
– provide systematic and comprehensive review of information content
– enable integration with traditional structured data
– allow complex analysis of literature-derived information
– generate hypotheses, gain insight