5/6/04Biolink1 Integrated Annotation for Biomedical IE Mining the Bibliome: Information Extraction from the Biomedical Literature NSF ITR grant EIA-0205448.

Slides:



Advertisements
Similar presentations
School of something FACULTY OF OTHER School of Computing FACULTY OF ENGINEERING Chunking: Shallow Parsing Eric Atwell, Language Research Group.
Advertisements

CS460/IT632 Natural Language Processing/Language Technology for the Web Lecture 2 (06/01/06) Prof. Pushpak Bhattacharyya IIT Bombay Part of Speech (PoS)
Sequence Classification: Chunking Shallow Processing Techniques for NLP Ling570 November 28, 2011.
1 Relational Learning of Pattern-Match Rules for Information Extraction Presentation by Tim Chartrand of A paper bypaper Mary Elaine Califf and Raymond.
BioContrasts: Extracting and Exploiting Protein-protein Contrastive Relations from Biomedical Literature Jung-jae Kim 1, Zhuo Zhang 2, Jong C. Park 1 and.
Learning with Probabilistic Features for Improved Pipeline Models Razvan C. Bunescu Electrical Engineering and Computer Science Ohio University Athens,
LingPipe Does a variety of tasks  Tokenization  Part of Speech Tagging  Named Entity Detection  Clustering  Identifies.
Recognizing Implicit Discourse Relations in the Penn Discourse Treebank Ziheng Lin, Min-Yen Kan, and Hwee Tou Ng Department of Computer Science National.
Shallow Parsing CS 4705 Julia Hirschberg 1. Shallow or Partial Parsing Sometimes we don’t need a complete parse tree –Information extraction –Question.
An Introduction to Machine Learning and Natural Language Processing Tools Vivek Srikumar, Mark Sammons (Some slides from Nick Rizzolo)
Sunita Sarawagi.  Enables richer forms of queries  Facilitates source integration and queries spanning sources “Information Extraction refers to the.
6/29/051 New Frontiers in Corpus Annotation Workshop, 6/29/05 Ann Bies – Linguistic Data Consortium* Seth Kulick – Institute for Research in Cognitive.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 20, 2004.
Are Linguists Dinosaurs? 1.Statistical language processors seem to be doing away with the need for linguists. –Why do we need linguists when a machine.
1 CSC 594 Topics in AI – Applied Natural Language Processing Fall 2009/ Shallow Parsing.
Biomedical Information Extraction. Outline Intro to biomedical information extraction PASTA [Demetriou and Gaizauskas] Biomedical named entities Name.
1 SIMS 290-2: Applied Natural Language Processing Marti Hearst Sept 22, 2004.
Empirical Methods in Information Extraction - Claire Cardie 자연어처리연구실 한 경 수
1 I256: Applied Natural Language Processing Marti Hearst Sept 25, 2006.
Workshop on Treebanks, Rochester NY, April 26, 2007 The Penn Treebank: Lessons Learned and Current Methodology Ann Bies Linguistic Data Consortium, University.
 Text mining for biology and medicine: Glasgow, Feb , 2008 Biomedical information extraction at the University of Pennsylvania Mark Liberman
BIOI 7791 Projects in bioinformatics Spring 2005 March 22 © Kevin B. Cohen.
NATURAL LANGUAGE TOOLKIT(NLTK) April Corbet. Overview 1. What is NLTK? 2. NLTK Basic Functionalities 3. Part of Speech Tagging 4. Chunking and Trees 5.
1 iProLINK: An integrated protein resource for literature mining and literature-based curation 1. Bibliography mapping - UniProt mapped citations 2. Annotation.
ELN – Natural Language Processing Giuseppe Attardi
April 2005CSA2050:NLTK1 CSA2050: Introduction to Computational Linguistics NLTK.
BTANT 129 w5 Introduction to corpus linguistics. BTANT 129 w5 Corpus The old school concept – A collection of texts especially if complete and self-contained:
Empirical Methods in Information Extraction Claire Cardie Appeared in AI Magazine, 18:4, Summarized by Seong-Bae Park.
Some Thoughts on HPC in Natural Language Engineering Steven Bird University of Melbourne & University of Pennsylvania.
A Survey of NLP Toolkits Jing Jiang Mar 8, /08/20072 Outline WordNet Statistics-based phrases POS taggers Parsers Chunkers (syntax-based phrases)
Authors: Ting Wang, Yaoyong Li, Kalina Bontcheva, Hamish Cunningham, Ji Wang Presented by: Khalifeh Al-Jadda Automatic Extraction of Hierarchical Relations.
GALE Banks 11/9/06 1 Parsing Arabic: Key Aspects of Treebank Annotation Seth Kulick Ryan Gabbard Mitch Marcus.
Information Extraction From Medical Records by Alexander Barsky.
GLOSSARY COMPILATION Alex Kotov (akotov2) Hanna Zhong (hzhong) Hoa Nguyen (hnguyen4) Zhenyu Yang (zyang2)
Syntactically annotated corpora of Estonian Heli Uibo Institute of Computer Science University of Tartu
Ling 570 Day 17: Named Entity Recognition Chunking.
10/12/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 10 Giuseppe Carenini.
Methods for the Automatic Construction of Topic Maps Eric Freese, Senior Consultant ISOGEN International.
1 Automated recognition of malignancy mentions in biomedical literature BMC Bioinformatics 2006, 7:492 Speaker: Yu-Ching Fang Advisors: Hsueh-Fen Juan.
Lars Juhl Jensen Biomedical text mining. exponential growth.
Recognizing Names in Biomedical Texts: a Machine Learning Approach GuoDong Zhou 1,*, Jie Zhang 1,2, Jian Su 1, Dan Shen 1,2 and ChewLim Tan 2 1 Institute.
A Cascaded Finite-State Parser for German Michael Schiehlen Institut für Maschinelle Sprachverarbeitung Universität Stuttgart
Information extraction 2 Day 37 LING Computational Linguistics Harry Howard Tulane University.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
CSKGOI'08 Commonsense Knowledge and Goal Oriented Interfaces.
CSA2050 Introduction to Computational Linguistics Parsing I.
Effective Reranking for Extracting Protein-protein Interactions from Biomedical Literature Deyu Zhou, Yulan He and Chee Keong Kwoh School of Computer Engineering.
CPSC 503 Computational Linguistics
MedKAT Medical Knowledge Analysis Tool December 2009.
Supertagging CMSC Natural Language Processing January 31, 2006.
Automatic Grammar Induction and Parsing Free Text - Eric Brill Thur. POSTECH Dept. of Computer Science 심 준 혁.
1 Semantic Relations for Interpreting DNA Microarray Data and for Novel Hypotheses Generation Dimitar Hristovski, 1 PhD, Andrej Kastrin, 2 Borut Peterlin,
Opportunities for Text Mining in Bioinformatics (CS591-CXZ Text Data Mining Seminar) Dec. 8, 2004 ChengXiang Zhai Department of Computer Science University.
CPSC 422, Lecture 27Slide 1 Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 27 Nov, 16, 2015.
Support Vector Machines and Kernel Methods for Co-Reference Resolution 2007 Summer Workshop on Human Language Technology Center for Language and Speech.
6/27/031 Integrating Syntactic and Semantic Annotation of Biomedical Text Seth Kulick, Mark Liberman, Martha Palmer and Andrew Schein The University of.
Exploiting Named Entity Taggers in a Second Language Thamar Solorio Computer Science Department National Institute of Astrophysics, Optics and Electronics.
8 December 1997Industry Day Applications of SuperTagging Raman Chandrasekar.
Overview of Statistical NLP IR Group Meeting March 7, 2006.
Annotating and measuring Temporal relations in texts Philippe Muller and Xavier Tannier IRIT,Université Paul Sabatier COLING 2004.
Literature Mining and Database Annotation of Protein Phosphorylation Using a Rule-based System Z. Z. Hu 1, M. Narayanaswamy 2, K. E. Ravikumar 2, K. Vijay-Shanker.
Effect of Alcohol on Brain Development NormalFetal Alcohol Syndrome.
Integrating linguistic knowledge in passage retrieval for question answering J¨org Tiedemann Alfa Informatica, University of Groningen HLT/EMNLP 2005.
Natural Language Processing Information Extraction Jim Martin (slightly modified by Jason Baldridge)
University of Sheffield, NLP Introduction to Text Mining Module 4: Development Lifecycle (Part 1)
Language Identification and Part-of-Speech Tagging
Natural Language Processing (NLP)
CIS Term Project Proposal November 1, 2002 Sharon Diskin
Natural Language Processing (NLP)
Natural Language Processing (NLP)
Presentation transcript:

5/6/04Biolink1 Integrated Annotation for Biomedical IE Mining the Bibliome: Information Extraction from the Biomedical Literature NSF ITR grant EIA –5-year grant, now 1.5 years from start –University of Pennsylvania Institute for Research in Cognitive Science (IRCS) – subcontract to Children’s Hospital of Philadelphia (CHOP) – cooperation with GlaxoSmithKline (GSK)

5/6/04Biolink2 Two Areas of Exploration 1.Genetic variation in malignancy (CHOP) Genomic entity X is varied by process Y in malignancy Z Ki-ras mutations were detected in 17.2% of the adenomas. Entities: Gene, Variation*, Malignancy* (*relations among sub-components) 2.Cytochrome P450 inhibition (GSK) Compound X inhibits CYP450 protein Y to degree Z Amiodarone weakly inhibited CYP3A4-mediated activities with Ki = 45.1 μM Entities: Cyp450, Substance, quant-name, quant-value, quant-units

5/6/04Biolink3 Approach Build hand-annotated corpora in order to train automated analyzers Mutual constraint of form and content: –parsing helps overcome diversity and complexity of relational expressions –entity types and relations help constrain parsing Shallow semantics integrated with syntax –entity types, standardized reference, co-reference –predicate-argument relations Requires significant changes in both syntactic and semantic annotation Benefits: –automated analysis works better –patterns for “fact extraction” are simpler

5/6/04Biolink4 Project Goals Create and publish corpora integrating different kinds of annotation: –Part of Speech tags –Treebanking (labelled constituent structure) –Entities and relations (relevant to oncology and enzyme inhibition projects) –Predicate/argument relations, co-reference –Integration: textual entity-mentions ≈ syntactic constituents Develop IE tools using the corpus Integrate IE with existing bioinformatics databases

5/6/04Biolink5 Project Workflow Tokenization and POS Annotation Entity Annotation Treebanking Merged Represent ation (recently revised to a flat pipeline) TaskStartedabstractswordsSoftwaretagger Tok + POS8/22/ KWordfreakyes Entity9/12/ KWordfreakstarting Treebanking1/8/ KTreeEditorretraining

5/6/04Biolink6 Integration Issues (1) Modifications to Penn Treeebank guidelines (for tokenization, POS tagging, treebanking) –to deal with biomedical text –to allow for syntactic/semantic integration –to be correct! Example: Prenominal Modifiers old way: the breast cancer-associated autoimmune antigen DT NN JJ JJ NN (NP ) new way: the breast cancer - associated autoimmune antigen DT NN NN - VBN JJ NN (NML ) (ADJP ) (NML )* (NP ) *implicit

5/6/04Biolink7 Integration Issues (2) Coordinated entities –“point mutations at codons 12, 13 or 61 of the human K-, H- and N-ras genes” –Wordfreak allows for discontinous entities –Treebank guidelines modified, e.g.: (NP (NOM-1 codons) 12), (NP (NOM-1 *P* ) 13) or (NP (NOM-1 *P* ) 61) –Modification works recursively

5/6/04Biolink8 Entity Annotation

5/6/04Biolink9 Treebanking

5/6/04Biolink10 Tagger Development (1) POS tagger retrained 2/10: TaggerTraining MaterialTokens OldPTB sections New315 abstracts TaggerOverall Accuracy #Unseen Instances Accuracy Unseen Accuracy Seen Old88.53% %95.53% New97.33% %98.02% (Tokenizer also retrained -- new tokenizer used in both cases)

5/6/04Biolink11 Tagger Development (2) entityPrecisionRecallF Variation type Variation loc Variation state-init Variation state-sub Variation overall Chemical tagger Gene tagger (Precision & recall from 10-fold cross-validation, exact string match) Taggers are being integrated into the annotation process.

5/6/04Biolink12 References Project homepage: Annotation info: Wordfreak: Taggers: Integration analysis (entities and treebanking): LAW