Flexible Text Mining using Interactive Information Extraction David Milward


2 Text mining vs. Data Mining
Text mining
–getting nuggets of information from text
–extracting relationships
–structured results to feed into data mining, visualisation or databases
Data mining
–getting new knowledge from databases
–suggesting new relationships, trends, patterns

3 Text Data Mining
Emphasizes finding new knowledge from text
Typically knowledge that is implicit within multiple documents

4 What is the relationship to IR?
IR finds the most relevant documents
Text mining finds information from within documents, or across documents
–What drugs are used for psoriasis treatment?
–Who are associated directly or indirectly with the Board of Exxon?
There is overlap …
–we often search to answer a question, not to find a document

5 Traditional Information Extraction
Uses natural language processing to distinguish
–Sanofi bid for Aventis
–Aventis bid for Sanofi
Provides structured results for easy review and analysis
Uses normalised terminology to allow integration with databases e.g.
–Preferred term: Sanofi
–Synonyms: Sanofi Pasteur, Sanofi Synthelabo, Sanofi Synthélabo …
But:
–typically limited to patterns on a single sentence
–constructing, testing and running queries can take days
Appropriate if you always have the same question e.g. want to run over a newsfeed every night
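The two ideas on this slide, role-sensitive patterns and normalised terminology, can be illustrated with a minimal Python sketch. It is not how any particular product works: the regular expression stands in for real natural language processing, and the names `SYNONYMS`, `normalise` and `extract_bid` are hypothetical. Only the Sanofi synonyms come from the slide.

```python
import re

# Hypothetical terminology entry, using the synonyms listed on the slide.
SYNONYMS = {
    "Sanofi": ["Sanofi Pasteur", "Sanofi Synthelabo", "Sanofi Synthélabo"],
}

# Map every surface form (lower-cased) to its preferred term.
PREFERRED = {}
for preferred, variants in SYNONYMS.items():
    for form in [preferred, *variants]:
        PREFERRED[form.lower()] = preferred


def normalise(name):
    """Return the preferred term for a name, or the name itself if unknown."""
    cleaned = name.strip()
    return PREFERRED.get(cleaned.lower(), cleaned)


# Toy directional pattern "<bidder> bid for <target>" (an assumption:
# a real system would use parsing rather than a single regex).
BID = re.compile(r"(?P<bidder>[\w\s-]+?) bid for (?P<target>[\w\s-]+)")


def extract_bid(sentence):
    """Return a structured record that keeps the two roles distinct, or None."""
    match = BID.search(sentence.rstrip("."))
    if match is None:
        return None
    return {
        "bidder": normalise(match.group("bidder")),
        "target": normalise(match.group("target")),
    }


print(extract_bid("Sanofi Synthélabo bid for Aventis."))
print(extract_bid("Aventis bid for Sanofi."))
```

The two calls return records with the bidder and target swapped, which is the distinction a plain keyword search cannot make. Because names are mapped to a preferred term, the records could be joined against a database keyed on that term.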

6 I2E: Interactive Information Extraction
A new concept
Encompasses
–keywords → documents
–patterns → relationships (structured output)
Queries ranging from:
–General Motors
–General Motors & acquisition in the same document
–Automotive companies & acquisitions in the same sentence
–What companies is General Motors associated with?
Not limited to patterns within sentences e.g.
–Merger and acquisition activity in documents mentioning Japan
Fast, scalable, versatile
[Diagram labels: I2E, Information Extraction, NLP, Taxonomies/Ontologies, Text Search, Structured Output]
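The range of query scopes listed above (keyword, same document, same sentence, sentence match restricted by a document-level condition) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents are invented, matching is plain substring search rather than linguistic analysis, and all function names are hypothetical.

```python
import re

# Invented example documents, for illustration only.
DOCS = {
    "d1": "General Motors reported results. The acquisition of a supplier closed in Japan.",
    "d2": "General Motors announced the acquisition of a parts maker.",
    "d3": "A bank opened a branch in Japan.",
}


def sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def contains_all(text, terms):
    """True if every term occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return all(term.lower() in lowered for term in terms)


def search_documents(terms):
    """Document scope: all terms somewhere in the same document."""
    return sorted(doc_id for doc_id, text in DOCS.items() if contains_all(text, terms))


def search_sentences(terms, doc_terms=()):
    """Sentence scope, optionally limited to documents mentioning doc_terms."""
    return [
        (doc_id, sentence)
        for doc_id, text in sorted(DOCS.items())
        if contains_all(text, doc_terms)
        for sentence in sentences(text)
        if contains_all(sentence, terms)
    ]


print(search_documents(["General Motors", "acquisition"]))
print(search_sentences(["General Motors", "acquisition"]))
print(search_sentences(["acquisition"], doc_terms=["Japan"]))
```

On this toy data the document-scope query returns both `d1` and `d2`, the sentence-scope query keeps only `d2`, and the last call finds the acquisition sentence in `d1` because that document also mentions Japan.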

7 Linguistic Processing
Example sentences
–We find that p42mapk phosphorylates c-Myb on serine and threonine.
–Purified recombinant p42 MAPK was found to phosphorylate Wee1.
Groups words into meaningful units
–noun phrases match entities
–verb groups match actions
Morphology allows search for different forms of words
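As a rough illustration of searching for different forms of a word, the Python sketch below matches several inflections of one verb in the two example sentences. It is a crude stand-in for real morphological analysis: the suffix list and the function name `find_forms` are assumptions made for this example.

```python
import re

# The two example sentences from the slide.
SENTENCE_1 = "We find that p42mapk phosphorylates c-Myb on serine and threonine."
SENTENCE_2 = "Purified recombinant p42 MAPK was found to phosphorylate Wee1."


def find_forms(stem, text):
    """Return every word in text made of the stem plus a common suffix.

    The suffix list is a hand-picked assumption, not a morphological model.
    """
    pattern = re.compile(
        rf"\b{re.escape(stem)}(?:e|es|ed|ing|ion)?\b", re.IGNORECASE
    )
    return pattern.findall(text)


print(find_forms("phosphorylat", SENTENCE_1))
print(find_forms("phosphorylat", SENTENCE_2))
```

A single query on the stem finds "phosphorylates" in the first sentence and "phosphorylate" in the second, so both sentences are retrieved for the same action.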

8 Monitoring Merger and Acquisition Activity

9 Company Positions

10 Using I2E in the Life Sciences
Good resources
–Scientific abstracts are readily available in XML
–Large number of existing taxonomies/terminologies
Very large scale
–16 million abstracts relevant to life sciences. Growing ???? a year
–Large numbers of internal reports and full-text articles
–Internal documents often > 1000 pages, may be PDF images
–Taxonomies/terminologies are large, often deeply structured e.g. 350K nodes, ??? synonyms
–Still need to augment terminology for specific areas
Relatively large scale
–17 million abstracts
–Large numbers of internal reports and full-text articles
–Internal documents can be > 1000 pages, may be PDF images
–Taxonomies/terminologies are large, often deeply structured: > 100K concepts, > 400K synonyms
–Still need to augment terminology for specific areas

11 Examples of Pharma Questions
R&D
–Which proteins interact with metabolite X?
–What are the reaction kinetics for canonical pathway Y?
–What attributes are common to sets of biomarker genes?
–What are the known associations between expressed genes and environmental factors?
–What dosages of compound B cause adverse reactions?
Competitive Intelligence
–Which companies are working on technology C?
–What compounds are available for in-licensing in a disease area?
–Which research groups are my competitors collaborating with?

12 Linking Drugs to Adverse Events

13 Measurements
Extraction of numerical parameters
–e.g. amounts, dosages, concentrations
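Extraction of numerical parameters can be approximated with a pattern that pairs a number with a unit. The Python sketch below is a minimal illustration: the unit list, the example sentence and the name `extract_measurements` are assumptions, and a real system would need a much larger unit vocabulary and links from each value to the compound it describes.

```python
import re

# A number followed by one of a few units. Longer units come first so that
# "mg/kg" is not cut short to "mg". The lookbehind stops the pattern from
# starting inside identifiers such as "p42".
MEASUREMENT = re.compile(
    r"(?<![\w.])(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mg/kg|mg|g|ml|mM|nM)\b"
)


def extract_measurements(text):
    """Return (value, unit) pairs in the order they appear in the text."""
    return [
        (float(m.group("value")), m.group("unit"))
        for m in MEASUREMENT.finditer(text)
    ]


# Invented sentence, for illustration only.
print(extract_measurements("Patients received 10 mg/kg daily, then 2.5 mg twice a day."))
```

The output is a list of structured value and unit pairs, which is the kind of result that can be loaded into a table for comparison across documents.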

14 Benefits of Flexible Text Mining
The ideal final query may use
–co-occurrence of terms within a document or sentence
–a precise linguistic pattern
–a mixture of both
It depends on
–the nature of the task
–the availability of terminologies
–the kind of documents (news vs. science, abstract vs. full text)
–the time available to check results
Flexibility to mix different techniques is also critical for fast development of queries
–e.g. start with broad queries to explore the “results space”, then home in

15 I2E: Better Results, Faster
–Fast query creation
–Fast return of results
–Fast review and analysis

16 Impact of I2E
Significant reduction in time spent searching/reading the literature
–weeks reduced to days or hours
Structure the unstructured to
–provide systematic and comprehensive review of information content
–enable integration with traditional structured data
–allow complex analysis of literature-derived information
–generate hypotheses, gain insight