© 2006 The MITRE Corporation. ALL RIGHTS RESERVED.
Text Mining for Biology
Lynette Hirschman, The MITRE Corporation, Bedford, MA, USA
RegCreative Jamboree, Nov 29-Dec 1, 2006

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Outline
Overview of text mining
- Retrieval and extraction
- Where are we?
How text mining can help
- Database consistency assessment
- Tools to aid curators
Conclusions

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Text Mining Overview
[Diagram: text mining narrows the data from whole collections (MEDLINE, PIR, Genbank) down to individual phrases]
- Collections: Gigabytes
- Documents: Megabytes
- Lists, Tables: Kilobytes
- Phrases: Bytes (e.g., "Protease-resistant prion protein interacts with...")
Information Retrieval: retrieve & classify documents via key words
Information Extraction: identify, extract & normalize entities, relations
Question Answering: from question to answer

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. The MOD Curation Pipeline and Text Mining
Pipeline (starting from MEDLINE), with the evaluations that target each step:
1. Select papers: KDD 2002 Task 1; TREC Genomics 2004 Task 2; BioCreAtIvE II PPI article selection
2. List genes for curation: BioCreAtIvE Gene Normalization (extract gene names & normalize; 20 participants)
3. Curate genes from paper: BioCreAtIvE II protein annotation (find relations & supporting evidence in text; 28 participants)

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. ORegAnno Curation Pipeline & Text Mining
Pipeline (starting from MEDLINE):
1. Select papers
2. List TFBS for curation
3. Curate genes from paper
Text mining opportunities:
- Gene & TF Normalization: extract gene and protein names & normalize to a standard ID
- Extract evidence passages and map them to evidence types/sub-types
- Curation queue management

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. State of the Art: Document Retrieval
Input: query words
Output: ranked list of documents
Approach
- Speed, scalability, domain independence, and robustness are critical for access to large collections of documents
Techniques
- Shallow processing provides a coarse-grained result (entire documents or passages)
- Query is transformed to a collection of words, but grammatical relations between words are lost
- Documents are indexed by word occurrences
- Search matches the query bag-of-words against indexed documents using a Boolean combination of terms, a vector of word occurrences, or a language model (a minimal sketch of the vector approach follows)
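To make the bag-of-words technique above concrete, here is a minimal retrieval sketch added for this transcript (not from the original slides): documents are indexed by word counts and ranked against the query by cosine similarity over the word-occurrence vectors. The toy documents, IDs, and tokenizer are hypothetical placeholders.

```python
# Minimal bag-of-words retrieval sketch (illustrative only, not any real system).
# Documents are indexed by word counts; the query is matched as a vector of word occurrences.
import math
import re
from collections import Counter

def tokenize(text):
    # Crude tokenizer: lowercase alphanumeric tokens only.
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b.get(t, 0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy collection standing in for MEDLINE abstracts.
docs = {
    "pmid_1": "esterase 6 allozymes in Drosophila melanogaster",
    "pmid_2": "insulin receptor gene and tumorigenesis",
}
index = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

query = Counter(tokenize("esterase 6 Drosophila"))
ranked = sorted(index, key=lambda d: cosine(query, index[d]), reverse=True)
print(ranked)  # document IDs ordered by similarity to the query
```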

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. State of the Art: Extraction
For news, automated systems exist now that can:
- Identify entities (90-95% F-measure*)
- Extract relations among entities (70-80% F) (information extraction)
- Answer simple factual questions using large document collections at 75-85% accuracy (question answering)
How good is text mining applied to biology?
- Is biology easier, because it has structured resources (ontologies, synonym lists)?
- Is it harder because of specialized biological language and complex biological reasoning?
* F-measure is the harmonic mean of precision and recall: F = 2*P*R/(P+R), where Precision = TP/(TP+FP) and Recall = TP/(TP+FN)
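The starred footnote's formulas can be checked with a small worked example; the counts below are invented purely for illustration.

```python
# Worked example of the evaluation metrics from the footnote (invented counts).
def precision_recall_f(tp, fp, fn):
    p = tp / (tp + fp)          # Precision = TP / (TP + FP)
    r = tp / (tp + fn)          # Recall    = TP / (TP + FN)
    f = 2 * p * r / (p + r)     # F-measure = harmonic mean of P and R
    return p, r, f

# E.g., a system that finds 80 true entities, adds 10 spurious ones, and misses 20:
p, r, f = precision_recall_f(tp=80, fp=10, fn=20)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.89 R=0.80 F=0.84
```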

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Assessments: Document Classification
TREC Genomics track focused on retrieval
- Part of the Text REtrieval Conference, run by the National Institute of Standards and Technology
- Tasks have included retrieval of:
  Documents to identify gene function
  Documents for the MGI curation pipeline
  Documents and passages to answer queries, e.g., "what effect does the insulin receptor gene have on tumorigenesis?"
- 40+ groups participating, starting 2004
KDD Challenge Cup task (2002)
- Yeh et al., MITRE; Gelbart, Mathew et al., FlyBase

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. KDD Challenge Cup
Task: automate part of FlyBase curation
- Determine which papers need to be curated for Drosophila gene expression information
- Curate only those papers containing experimental results on gene products (RNA transcripts and proteins)
Teamed with FlyBase, who provided
- Data annotation plus biological expertise
- Input on the task formulation
Venue: ACM conference on Knowledge Discovery and Data Mining (KDD)
- Alex Yeh (MITRE) ran the Challenge Cup task

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. FlyBase: Evidence for Gene Products

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Results
18 teams submitted results (32 entries)
Winner: a team from ClearForest and Celera
- Used manually generated rules and patterns to perform information extraction
Subtask results (Best / Median):
- Ranked list for curation: 84% / 69%
- Yes/No curate paper: 78% / 58%
- Yes/No gene products: 67% / 35%
Conclusion: ranking papers for curation is promising; open question: would this help curators?
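As an illustration of how a ranked list of papers for curation can be scored, here is a minimal sketch using the area under the ROC curve, a standard ranking metric; the labels are invented, and the actual KDD Cup scoring details may have differed.

```python
# Sketch: scoring a ranked list of papers for curation with ROC area
# (a common ranking metric; not necessarily the exact KDD Cup formula).
def roc_area(ranked_labels):
    # ranked_labels: 1 = paper should be curated, 0 = not, in system rank order (best first).
    pos = ranked_labels.count(1)
    neg = ranked_labels.count(0)
    # Count (curatable, non-curatable) pairs ranked in the correct order.
    correct_pairs = 0
    negatives_seen = 0
    for label in reversed(ranked_labels):   # walk from worst rank to best
        if label == 0:
            negatives_seen += 1
        else:
            correct_pairs += negatives_seen
    return correct_pairs / (pos * neg)

# Invented example: 3 curatable papers ranked among 3 non-curatable ones.
print(roc_area([1, 1, 0, 1, 0, 0]))  # ~0.89: most curatable papers near the top
```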

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. BioCreAtIvE I: Workshop, March 2004
Tasks (Participation)
- Gene Mention (15)
- Gene Normalization: Fly, Mouse, Yeast (8)
- Functional Annotation (8)
BioCreAtIvE II: Workshop, April 2007
Tasks (Participation)
- Gene Mention (21)
- Gene Normalization: Human (20)
- Protein-Protein Interaction (28)

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Gene Normalization
Task: list unique gene IDs for Fly, Mouse, Yeast abstracts
Example abstract: "A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to on the standard genetic map (Est-6 is at ). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated."
Expected output (Abstract ID / Organism / Gene ID):
- fly_00035_training / Fly / FBgn...
- fly_00035_training / Fly / FBgn...
Sample Gene ID and synonyms: FBgn...: Est-6, Esterase 6, CG6917, Est-D, EST6, est-6, Est6, Est, EST-6, Esterase-6, est6, Est-5, Carboxyl ester hydrolase
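A minimal sketch of the dictionary-lookup flavor of gene normalization (illustrative only, not any participant's system): gene mentions in an abstract are matched against a synonym list like the Est-6 entry above and mapped to a unique gene identifier. The FBgn identifier below is a placeholder, since the real ID is truncated in this transcript.

```python
# Sketch of dictionary-based gene normalization (illustrative; real systems also
# handle tokenization variants, abbreviations, and ambiguity between genes).
# The FBgn identifier is a placeholder, not the real FlyBase ID from the slide.
synonyms = {
    "FBgn_PLACEHOLDER": ["Est-6", "Esterase 6", "CG6917", "EST6", "Est6", "Esterase-6"],
}

# Invert to a lookup table: lowercased synonym -> gene ID.
lookup = {name.lower(): gene_id for gene_id, names in synonyms.items() for name in names}

def normalize(abstract_text):
    """Return the set of unique gene IDs whose synonyms appear in the abstract."""
    text = abstract_text.lower()
    return {gene_id for name, gene_id in lookup.items() if name in text}

print(normalize("... a modification of some allozymes of the enzyme esterase 6 ..."))
# {'FBgn_PLACEHOLDER'}
```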

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. BioCreAtIvE I Results: Gene Normalization
- Yeast results good: high of 0.93 F (smallest vocabulary, short names, little ambiguity)
- Fly: 0.82 F (high ambiguity)
- Mouse: 0.79 F (large vocabulary, long names)
- Human: ~80% (BioCreAtIvE II)

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Impact of BioCreAtIvE I
BioCreAtIvE showed the state of the art:
- Gene name mentions: F =
- Normalized gene IDs: F =
- Functional annotation: F ~ 0.3
BioCreAtIvE II
- Participation 2-3x higher!
- Results and workshop April 23-25, Madrid
What next?
- New model of curator/text mining cooperation:
  Have biological curators contribute data (training and test sets)
  Text mining developers work on real biological problems
- RegCreative is an instance of this model

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. How Text Mining Can Help
Quality & Consistency
- Assess consistency of annotation
- First step is to determine consistency of human performance on classification or annotation tasks
- Use agreement studies to improve annotation guidelines and resources (training materials, annotated data)
Coverage
- Text mining can speed up curation to achieve better coverage
Currency
- Faster curation improves currency of annotations

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Inter-Annotator Agreement
Thesis: if people cannot do a task consistently, it will be hard to automate the task
- Also, the data will be less valuable
Method
- Two humans perform the same classification task on a "blind" data set, using classification guidelines (after some designated training)
- Results are compared via a scoring metric (one possible metric is sketched below)
Outcome: determine whether the guidelines are sufficient to ensure consistent classification
Study can be informal
- Used to flag places that need improvement
- Or more formal, to measure progress over time
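The slide leaves the scoring metric open; one common choice for a two-annotator classification study is Cohen's kappa, which corrects raw agreement for chance. The sketch below uses invented labels for a hypothetical curate/skip decision.

```python
# Sketch: comparing two annotators on the same "blind" document set using
# Cohen's kappa (one possible scoring metric; labels are invented).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label by chance.
    expected = sum((counts_a[c] / n) * (counts_b[c] / n) for c in counts_a | counts_b)
    return (observed - expected) / (1 - expected)

# E.g., two curators deciding "curate" vs "skip" for ten papers:
a = ["curate", "curate", "skip", "curate", "skip", "skip", "curate", "skip", "curate", "skip"]
b = ["curate", "skip",   "skip", "curate", "skip", "skip", "curate", "skip", "curate", "curate"]
print(round(cohens_kappa(a, b), 2))  # 0.6: moderate agreement beyond chance
```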

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Checking Inter-Annotator Agreement: An Experiment from BioCreAtIvE I
Camon et al. did the first inter-curator agreement experiment*
- 3 EBI GOA annotators annotated 12 overlapping documents for GO terms (4 docs per pair of curators)
- Results after developing a consensus gold standard:
  Avg precision (% annotations correct): ~95%
  Avg recall (% correct annotations found): ~72%
Lessons learned
- Very few wrong annotations, but some were missed
- Annotators differed on the specificity of annotation, depending on their biological knowledge
- Annotation by paper meant the evidence standard was less clear (normal annotation is by protein)
- Annotation is a complex task for people!
* Camon et al., BMC Bioinformatics 2005, 6(Suppl 1):S17

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Conclusions
Text mining can provide a methodology to assess consistency of annotation
Text mining can provide tools
- To manage the curation queue
- To assist curators, particularly in normalization & mapping into ontologies
Next steps
- Define intended uses of RegCreative data
- Establish curator training materials
- Identify key bottlenecks in curation
- Provide data and user input to develop tools
Major stumbling block for text mining
- Handling of PDF documents!

© 2006 The MITRE Corporation. ALL RIGHTS RESERVED. Acknowledgements
US National Science Foundation for funding of BioCreAtIvE I and BioCreAtIvE II*
MITRE colleagues who worked on BioCreAtIvE
- Alex Morgan (now at Stanford)
- Marc Colosimo
- Jeff Colombe
- Alex Yeh (also KDD Challenge Cup)
Collaborators at CNB and CNIO
- Alfonso Valencia
- Christian Blaschke (now at bioalma)
- Martin Krallinger
* Contract numbers EIA and IIS