 Text mining for biology and medicine: Glasgow, Feb. 21-22, 2008 Biomedical information extraction at the University of Pennsylvania Mark Liberman

Presentation transcript:

Text mining for biology and medicine: Glasgow, Feb. 21-22, 2008
Biomedical information extraction at the University of Pennsylvania
Mark Liberman, Linguistic Data Consortium

Outline
- The PennBioIE project: background, accomplishments, future
- Public service announcement: publishing data via the LDC
- The parable of Yang Jin
- Annotation as “common law semantics”
  - a serviceable technology that will improve
  - are there better long-term alternatives?

PennBioIE Project
- Goals:
  - Learn to strip-mine the bibliome: better NLP tools for text data mining
  - Publish biomedical text annotation: treebanks, entities, relations
- Participants:
  - Penn NLP researchers
  - Biomedical researchers (Penn, GSK, CHoP)

PennBioIE Project
Domains:
- CYP: inhibition of cytochrome P-450 enzymes
  - 1100 abstracts
  - collaboration with GSK
- Onco: genomic variations associated with cancer
  - 1158 abstracts
  - collaboration with Children’s Hospital of Philadelphia

Annotation sequence
1. pretagging (document segmentation etc.)
2. named entities
3. POS
4. treebanking
5. relations

PennBioIE Project
Results:
- Some improved techniques
- Some published data
  - get rel. 0.9 from http://bioie.ldc.upenn.edu
  - rel. 1.0 soon to be published by LDC
- Some applications -- e.g. FABLE
- Some questions
  - How to break the F-measure ceiling?
  - How to decrease annotation burden?
  - How to increase semantic coverage?


A note on the LDC

The Linguistic Data Consortium is
- an open consortium of universities, companies, and government laboratories;
- founded in 1992 with seed money from DARPA;
- run by the University of Pennsylvania with 45 full-time staff in Philadelphia.

But really, the LDC is…
a specialized digital publisher, which has distributed >50,000 copies of >750 corpora and other resources to ~2,500 research organizations in 62 countries.
… and might want to publish your data.

Why publish with LDC?
- It’s a publication!
- LDC pubs have:
  - authors
  - ISBN numbers
  - standard bibliographic citation formats
  - editions
- IPR and licensing are handled your way (from “all rights reserved” to open access)
- LDC deals with the hassle of reproduction, distribution, maintenance

The parable of Yang Jin

The annotation conundrum
- “Natural” annotation is inconsistent
  - poor agreement for entities, worse for relations
  - task-internal metrics are noisy
- “Top down” specification is even worse (e.g. existing elaborate ontologies)
- Solution: iterative refinement of rules via interaction with annotation practice
  - result: complex accretion of “common law”
  - slow to develop, hard to learn
  - more consistent -- but is it correct?
  - complexity may re-create inconsistency
    new types and sub-types -> ambiguity, confusion

 Text mining for biology and medicine: Glasgow, Feb , 2008  1P vs. 1P independent first passes by junior annotator, no QC  ADJ vs. ADJ output of two parallel, independent dual first pass annotations are adjudicated by two independent senior annotators ACE 2005 consistency

Iterative improvement
From ACE 2005 (Ralph Weischedel):
Repeat until criteria met or until time has expired:
1. Analyze performance of previous task & guidelines
   Scores, confusion matrices, etc.
2. Hypothesize & implement changes to tasks/guidelines
3. Update infrastructure as needed
   DTD, annotation tool, and scorer
4. Annotate texts
5. Evaluate inter-annotator agreement
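Step 5 of the loop above (and the 1P vs. 1P comparison in the ACE 2005 consistency slide) comes down to scoring one annotator's output against another's. A minimal sketch of one common way to do that, exact-match F-measure over entity spans, is given below. It is an illustration only, not the ACE scorer or the PennBioIE tooling: the `Span` layout, the function name `span_f1`, the entity type labels, and the offsets are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical span representation: (start offset, end offset, entity type).
Span = Tuple[int, int, str]


def span_f1(annotator_a: Set[Span], annotator_b: Set[Span]) -> float:
    """Exact-match F1 between two annotators, treating A as the reference."""
    if not annotator_a and not annotator_b:
        return 1.0  # nothing to annotate, trivially in agreement
    matched = len(annotator_a & annotator_b)
    if matched == 0:
        return 0.0
    precision = matched / len(annotator_b)
    recall = matched / len(annotator_a)
    return 2 * precision * recall / (precision + recall)


# Made-up first-pass annotations of the same abstract.
first_pass_a = {(0, 6, "GENE"), (17, 27, "VARIATION"), (40, 52, "MALIGNANCY")}
first_pass_b = {
    (0, 6, "GENE"),
    (17, 27, "VARIATION"),
    (40, 48, "MALIGNANCY"),  # boundary disagreement counts as a miss
    (60, 65, "GENE"),        # extra span only B marked
}

agreement = span_f1(first_pass_a, first_pass_b)
print(f"pairwise F1: {agreement:.3f}")
```

Because F1 is symmetric in precision and recall, swapping the two annotators gives the same score, which is why it is convenient for 1P vs. 1P comparisons where neither pass is a gold standard. Exact matching is a deliberately strict choice here; partial-overlap credit would give higher numbers.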

NLP as Law School
Many complex rules
- Plus Wiki
- Plus Listserv
Rules, Notes, Fiats and Exceptions
Task       #Pages  #Rules
Entity         34      20
Value          10       5
TIMEX
Relations      36      25
Events         77      50
Total
Example Decision Rule (Event p33)
Note: For Events where a single common trigger is ambiguous between the types LIFE (i.e. INJURE and DIE) and CONFLICT (i.e. ATTACK), we will only annotate the Event as a LIFE Event in case the relevant resulting state is clearly indicated by the construction. The above rule will not apply when there are independent triggers.

BioIE case law
Guidelines for oncology tagging (local)

Discussion
- How to make it better
  - integrating multiple information sources
    text, bioinformatic databases, microarray data, …
  - less-supervised learning
    inferring useful features from untagged text
    active learning, information markets, etc.
  - create a “basis set” of ready-made entity types
- How to make it different
  - the analogy to translation
  - the lure of systematic semantics
  - (machine) learning: who is learning what?
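One of the "less-supervised" options named above is active learning, which aims to cut the annotation burden by sending annotators only the items a model is least sure about. The sketch below shows the simplest variant, uncertainty sampling for a binary decision. It is an assumption-laden illustration, not something described in the talk: the function name `select_for_annotation`, the document ids, and the confidence scores are all hypothetical.

```python
from typing import Dict, List


def select_for_annotation(confidence: Dict[str, float], batch_size: int) -> List[str]:
    """Pick the documents whose predicted probability is closest to 0.5.

    `confidence` maps a document id to a model's probability that the
    document contains a relation of interest (hypothetical scores).
    """
    ranked = sorted(confidence, key=lambda doc_id: abs(confidence[doc_id] - 0.5))
    return ranked[:batch_size]


# Made-up model scores for four unlabeled abstracts.
scores = {"abs-001": 0.97, "abs-002": 0.52, "abs-003": 0.10, "abs-004": 0.61}

batch = select_for_annotation(scores, 2)
print(batch)
```

In a full loop the selected batch would be annotated, added to the training data, and the model retrained before the next selection round.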