New Frontiers in Corpus Annotation Workshop, 6/29/05 Ann Bies – Linguistic Data Consortium* Seth Kulick – Institute for Research in Cognitive.


Slide 1. Parallel Entity and Treebank Annotation
New Frontiers in Corpus Annotation Workshop, 6/29/05
Ann Bies – Linguistic Data Consortium*
Seth Kulick – Institute for Research in Cognitive Science*
Mark Mandel – Linguistic Data Consortium*
* University of Pennsylvania

Slide 2. Mining the Bibliome: Information Extraction from the Biomedical Literature
NSF ITR grant EIA
Collaboration with Division of Oncology, Children’s Hospital of Philadelphia
PubMed abstracts – mining cancer literature for associations that link variations in genes with malignancies
Release 0.9 available: 1157 abstracts entity annotated, 318 also treebanked
http://bioie.ldc.upenn.edu

Slide 3. Outline
Entity Annotation
Treebank Annotation – Modifications from Penn Treebank guidelines
Annotation Process and Merged Format
Entity-Constituent Mapping – How successful?

Slide 4. Entity Annotation
"Gene X with genomic Variation event Y is correlated with Malignancy Z"
Gene – composite entity, can refer to gene or protein: Gene-generic, Gene-protein, Gene-RNA
(Malignancy – under development, not included in release 0.9)
Variation Event – relation between entities representing different aspects of a variation

Slide 5. Entity Annotation - Variations
Variation – a relation between variation component entities
“a single nucleotide substitution at codon 249, predicting a serine to cysteine amino acid substitution”
  Var-type – substitution
  Var-location – codon 249
  Var-state-orig – serine
  Var-state-altered – cysteine
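One way to picture a variation as a relation over its component entities is a small record type. This is only an illustrative sketch: the class name `Variation` and its field names are assumptions made for the example, not part of the release format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variation:
    """A variation relation; each field holds the text of one component entity.

    Field names are hypothetical and mirror the Var-* entity types on the slide.
    """
    var_type: Optional[str] = None
    location: Optional[str] = None
    state_orig: Optional[str] = None
    state_altered: Optional[str] = None


# The example from the slide, filled in component by component.
example = Variation(
    var_type="substitution",
    location="codon 249",
    state_orig="serine",
    state_altered="cysteine",
)
print(example)
```

Any component may be absent, which is why every field is optional in this sketch.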

Slide 6. A Change in Tokenization
Tokenization – many hyphenated words treated as separate tokens
“New York-based”
  Old (Penn Treebank) tokenization: [New] [York-based]
  New tokenization: [New] [York] [-] [based]
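The new tokenization can be approximated with a short regular expression. This is a rough sketch, not the project's tokenizer: it splits at every hyphen, whereas the slide says only that many hyphenated words are split, and the pattern `TOKEN_RE` is an assumption made for illustration.

```python
import re

# Runs of letters/digits, or any single non-space symbol such as "-".
# Assumption: every hyphen becomes its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Split text into word tokens and single-symbol tokens."""
    return TOKEN_RE.findall(text)


print(tokenize("New York-based"))  # ['New', 'York', '-', 'based']
print(tokenize("K- and N-ras"))    # ['K', '-', 'and', 'N', '-', 'ras']
```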

Slide 7. Discontinuous Entities
E.g.: “K- and N-ras”
Tokenization: [K] [-] [and] [N] [-] [ras]
Entity annotation:
  [K] [-] … [ras] – “chain” of discontinuous tokens
  [N] [-] [ras] – contiguous tokens
Splitting up not always done, depends on coordination
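With identical tokenization, an entity can be stored as the indices of the tokens it covers, and a chain is then an entity whose indices are not one contiguous run. The sketch below assumes that representation; the names `is_chain` and `surface` are hypothetical.

```python
tokens = ["K", "-", "and", "N", "-", "ras"]


def is_chain(indices):
    """True when the token indices do not form one contiguous run."""
    return any(b - a != 1 for a, b in zip(indices, indices[1:]))


def surface(indices):
    """The tokens an entity covers, in order."""
    return [tokens[i] for i in indices]


k_ras = (0, 1, 5)  # [K] [-] ... [ras]
n_ras = (3, 4, 5)  # [N] [-] [ras]

print(surface(k_ras), is_chain(k_ras))  # ['K', '-', 'ras'] True
print(surface(n_ras), is_chain(n_ras))  # ['N', '-', 'ras'] False
```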

Slide 8. Treebank Annotation
Default NP right-branching structure
  (NP (JJ primary) (NN liver) (NN cancer))
Simplifies multi-token nominal annotation
Allows recovery of implicit constituents:
  (NP (JJ primary) (newnode (NN liver) (NN cancer)))
Entities sometimes map to such implicit constituents
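Under the right-branching default, the implicit constituents of a flat NP are its proper suffixes of two or more children. A minimal sketch of that reading follows; `implicit_constituents` is a hypothetical helper, and children are shown as plain strings for brevity.

```python
def implicit_constituents(children):
    """Return the child sequences that a right-branching analysis would bracket.

    For children [a, b, c, d] the implied structure is (a (b (c d))),
    so the implicit nodes are [b, c, d] and [c, d].
    """
    return [children[i:] for i in range(1, len(children) - 1)]


print(implicit_constituents(["primary", "liver", "cancer"]))
# [['liver', 'cancer']]
```

A two-child NP has no implicit constituents, since its only multi-token suffix is the NP itself.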

Slide 9. Treebank Annotation
Exceptions to right-branching are marked by NML
So: any two or more non-final elements that form a constituent are a NML
  (ADJP (NML (NNP New) (NNP York)) (HYPH -) (VBN based))
  (ADJP (NML (NN breast) (NN cancer)) (HYPH -) (VBN associated))
  (NP (NML (NN human) (NN liver) (NN tumor)) (NN analysis))

Slide 10. Treebank Annotation
Placeholder *P* for distributed material in coordinated nominal structures
“K- and N-ras”
(NP (NP (NN K)
        (HYPH -)
        (NML-1 (-NONE- *P*)))
    (CC and)
    (NP (NN N)
        (HYPH -)
        (NML-1 (NN ras))))

Slide 11. Treebank Annotation
The placeholder can point to the left or to the right
“codon 12 or 13”
(NP (NP (NML-1 (NN codon))
        (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*))
        (CD 13)))

Slide 12. First Release
Goal – let users choose how to handle the integration of entity and treebank levels
Standoff annotation for entity and treebank
  Identical tokenization
Merged representation
  Penn Treebank style: (POSTag:[from..to] terminal)
  Entity listing before each tree

Slide 13. Merged Output Example
sentence 4 Span:
;In the present study, we screened for
;the K-ras exon 2 point mutations in a
;group of 87 gynecological neoplasms
;[ ]:gene-rna:"K-ras"
;[ ]:variation-location:"exon 2"
;[ ]:variation-type: "point mutations"

Slide 14. Merged Output Example
[…]
((VP (VBD:[ ] screened)
     (PP-CLR (IN:[ ] for)
             (NP (DT:[ ] the)
                 (NN:[ ] K-ras)
                 (NML (NN:[ ] exon) (CD:[ ] 2))
                 (NN:[ ] point)
                 (NNS:[ ] mutations)))
[…]

Slide 15. Merged Output Example
((VP (VBD:[ ] screened)
     (PP-CLR (IN:[ ] for)
             (NP (DT:[ ] the)
                 (NN:[ ] K-ras)
                 (NML (NN:[ ] exon) (CD:[ ] 2))
                 (NN:[ ] point)
                 (NNS:[ ] mutations)))
;[ ]:gene-rna:"K-ras"
;[ ]:variation-location:"exon 2"
;[ ]:variation-type: "point mutations"
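The entity listing lines of the merged format lend themselves to line-by-line parsing. The sketch below assumes the listing uses the same `[from..to]` span notation as the terminals; the offsets in the transcript are blank, so the numbers in the example are made up, and `parse_entity_line` is a hypothetical helper rather than a tool shipped with the release.

```python
import re

# Assumed shape: ;[from..to]:entity-type:"text" (optional space before the quote).
ENTITY_RE = re.compile(r'^;\[(\d+)\.\.(\d+)\]:([\w-]+):\s*"(.*)"$')


def parse_entity_line(line):
    """Return (start, end, entity_type, text), or None for non-entity lines."""
    match = ENTITY_RE.match(line.strip())
    if match is None:
        return None
    start, end, entity_type, text = match.groups()
    return int(start), int(end), entity_type, text


# Offsets below are invented for illustration only.
print(parse_entity_line(';[40..45]:gene-rna:"K-ras"'))
print(parse_entity_line(';[53..68]:variation-type: "point mutations"'))
print(parse_entity_line(';In the present study, we screened for'))
```

Lines that carry sentence text rather than an entity simply return `None`, so a caller can filter them out.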

Slide 16. Entity-Constituent Mapping: Exact Match
Exact Match: a node in the tree yields exactly the entity
;[ ]:variation-location:"exon 2"
(NP (DT:[ ] the)
    (NN:[ ] K-ras)
    (NML (NN:[ ] exon) (CD:[ ] 2))
    (NN:[ ] point)
    (NNS:[ ] mutations))

Slide 17. Entity-Constituent Mapping: Missing Node
Missing Node – possible to add a node to yield exactly the entity
;[ ]:variation-type: "point mutations"
(NP (DT:[ ] the)
    (NN:[ ] K-ras)
    (NML (NN:[ ] exon) (CD:[ ] 2))
    (NN:[ ] point)
    (NNS:[ ] mutations))

Slide 18. Entity-Constituent Mapping: Missing Node
Adding the node (implicit constituents) was done for internal research purposes, not in release
NML (explicit constituents) is already in release
(NP (DT:[ ] the)
    (NN:[ ] K-ras)
    (NML (NN:[ ] exon) (CD:[ ] 2))
    (newnode (NN:[ ] point) (NNS:[ ] mutations)))

Slide 19. Entity-Constituent Mapping: Crossing
Crossing: the entity cuts across constituent boundaries, so no node can even be added to yield it
Typical case: entity containing text corresponding to a prepositional phrase
“One ER showed a G-to-T mutation in the second position of codon 12”
[ ]: variation-location: “second position of codon 12”

Slide 20. Entity-Constituent Mapping: Crossing
Crossing – the determiner is in the NP but not in the entity
We could have relaxed the matching, or modified the entity or treebank annotation, but did not
(NP (NP (DT:[ ] the) (JJ:[ ] second) (NN:[ ] position))
    (PP (IN:[ ] of)
        (NP (NN:[ ] codon) (CD:[ ] 12))))
[ ]: variation-location: “second position of codon 12”
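The three mapping outcomes for contiguous entities can be sketched as a comparison of token spans. This is an illustration under assumptions, not the authors' implementation: trees are written as nested `(label, children)` tuples, entities as half-open token index pairs, and `collect_spans` and `classify` are hypothetical names.

```python
def collect_spans(tree, start, spans):
    """Add the (start, end) token span of every node to spans; return end."""
    label, body = tree
    if isinstance(body, str):  # preterminal covering one token
        end = start + 1
    else:
        end = start
        for child in body:
            end = collect_spans(child, end, spans)
    spans.add((start, end))
    return end


def classify(tree, entity):
    """Classify an entity span as exact match, missing node, or crossing."""
    spans = set()
    collect_spans(tree, 0, spans)
    s, e = entity
    if (s, e) in spans:
        return "exact match"
    for c, d in spans:
        # Partial overlap with a constituent: no node can be added.
        if s < c < e < d or c < s < d < e:
            return "crossing"
    return "missing node"


k_ras_np = ("NP", [("DT", "the"), ("NN", "K-ras"),
                   ("NML", [("NN", "exon"), ("CD", "2")]),
                   ("NN", "point"), ("NNS", "mutations")])
codon_np = ("NP", [("NP", [("DT", "the"), ("JJ", "second"), ("NN", "position")]),
                   ("PP", [("IN", "of"),
                           ("NP", [("NN", "codon"), ("CD", "12")])])])

print(classify(k_ras_np, (2, 4)))  # "exon 2"
print(classify(k_ras_np, (4, 6)))  # "point mutations"
print(classify(codon_np, (1, 6)))  # "second position of codon 12"
```

When no constituent partially overlaps the entity, its tokens are a run of complete sibling nodes, which is why a single new node can be added to cover them.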

Slide 21. Entity-Constituent Mapping – Chain Exact Match
“codon 12 or 13”
Entities: “codon 12”, “codon..13”
(NP (NP (NML-1 (NN codon))
        (CD 12))
    (CC or)
    (NP (NML-1 (-NONE- *P*))
        (CD 13)))

Slide 22. Entity-Constituent Mapping – Chain Not an Exact Match
“specific codons (12, 13, and 61)”
Entities: “codons..12”, “codons..13”, “codons..61”
(NP (JJ specific)
    (NNS codons)
    (PRN (-LRB- -LRB-)
         (NP (NP (CD 12))
             (, ,)
             (NP (CD 13))
             (, ,)
             (CC and)
             (NP (CD 61)))
         (-RRB- -RRB-)))
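One possible reading of the chain match is that a *P* placeholder stands in for the tokens of its coindexed NML, so a chained entity matches when some node yields exactly its tokens once placeholders are filled. The sketch below follows that reading. The bracketing of “codon 12 or 13” is an assumed reconstruction, and the helper names are hypothetical.

```python
PLACEHOLDER = ("-NONE-", "*P*")


def words(tree, fillers):
    """Yield of a node, with *P* placeholders replaced by their filler tokens."""
    label, body = tree
    if isinstance(body, str):
        return [] if label == "-NONE-" else [body]
    if body == [PLACEHOLDER]:
        return list(fillers.get(label, []))
    out = []
    for child in body:
        out += words(child, fillers)
    return out


def find_fillers(tree, fillers):
    """Record the tokens of each indexed node (e.g. NML-1) that has real content."""
    label, body = tree
    if isinstance(body, str):
        return fillers
    if label.rsplit("-", 1)[-1].isdigit() and body != [PLACEHOLDER]:
        fillers[label] = words(tree, {})
    for child in body:
        find_fillers(child, fillers)
    return fillers


def resolved_yields(tree, fillers):
    """Resolved yield of every node in the tree."""
    out = [words(tree, fillers)]
    if not isinstance(tree[1], str):
        for child in tree[1]:
            out += resolved_yields(child, fillers)
    return out


codon_12_or_13 = ("NP", [("NP", [("NML-1", [("NN", "codon")]), ("CD", "12")]),
                         ("CC", "or"),
                         ("NP", [("NML-1", [PLACEHOLDER]), ("CD", "13")])])
specific_codons = ("NP", [("JJ", "specific"), ("NNS", "codons"),
                          ("PRN", [("-LRB-", "-LRB-"),
                                   ("NP", [("NP", [("CD", "12")]), (",", ","),
                                           ("NP", [("CD", "13")]), (",", ","),
                                           ("CC", "and"), ("NP", [("CD", "61")])]),
                                   ("-RRB-", "-RRB-")])])

yields_a = resolved_yields(codon_12_or_13, find_fillers(codon_12_or_13, {}))
yields_b = resolved_yields(specific_codons, find_fillers(specific_codons, {}))
print(["codon", "13"] in yields_a)   # chain "codon..13"
print(["codons", "12"] in yields_b)  # chain "codons..12"
```

In the second tree there is no placeholder, so no node's yield contains both “codons” and a single number, and the chained entities cannot match exactly.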

Slide 23. Multiple Token Entities (Non-Chained)
Entity Type         Total   Exact Match   Missing Node   Crossing
Gene-generic            6             4              1          1
Gene-protein
Gene-RNA
Var-location
Var-state-orig          5             3              1          1
Var-state-altered      10             8              0          2
Var-type
Total                                                      (4.4%)

Slide 24. Multiple Token Entities (Chained)
Entity Type         Total   Exact Match   Not Exact Match
Gene-generic            0             0                 0
Gene-protein            6             4                 2
Gene-RNA               36            29                 7
Var-location
Var-state-orig          0             0                 0
Var-state-altered       0             0                 0
Var-type                1             0                 1
Total                                               (19%)

Slide 25. Conclusion
Annotation of entities and treebank done together
Identical tokenization for entities and trees, with standoff annotation
Allows flexibility in use of integrated annotation
Only 6.2% of the entities cannot be mapped to an implicit or explicit constituent node
  Changes in Treebank guidelines
  Use of Relations for potentially large entities
Next: Relation annotation and integrated taggers

Slide 26. References
Ryan’s tagger
Dan’s parser
Web page again

Slide 27. Entity Annotation - Variations
“(S249C)”
  Var-type – none
  Var-location – 249
  Var-state-orig – S
  Var-state-altered – C
Gene-{RNA,generic,protein} disambiguates gene metonymy
Var-{type,location,state-orig,state-altered} are different kinds of entities

Slide 28. Entities
Entity Type         Single Tokens   Multiple Tokens: Non-chains   Multiple Tokens: Chains
Gene-generic                  104                             6                         0
Gene-protein
Gene-RNA
Var-location
Var-state-orig                151                             5                         0
Var-state-altered
Var-type

Slide 29. Introduction
Corpus for biomedical IE with several levels of annotation:
  Entity
  Syntactic Structure (Treebank)
  Relations (McDonald et al, ACL 2005)
Ideal - entities mapped to treebank constituents
Allow users to choose how to integrate the levels

Slide 30. Annotation Process
Tokenization → Entity → POS → Treebanking → Merged Representation
Minimal requirement: identical tokenization for entity and treebank annotation
Did not require an entity/constituent correspondence – but how did it work out?