Parallel Entity and Treebank Annotation
  New Frontiers in Corpus Annotation Workshop, 6/29/05
  Ann Bies – Linguistic Data Consortium*
  Seth Kulick – Institute for Research in Cognitive Science*
  Mark Mandel – Linguistic Data Consortium*
  * University of Pennsylvania

Mining the Bibliome: Information Extraction from the Biomedical Literature
  NSF ITR grant EIA
  Collaboration with Division of Oncology, Children’s Hospital of Philadelphia
  PubMed abstracts – mining cancer literature for associations that link variations in genes with malignancies
  Release 0.9 available: 1157 abstracts entity annotated, 318 also treebanked
  http://bioie.ldc.upenn.edu

Outline
  Entity Annotation
  Treebank Annotation – Modifications from Penn Treebank guidelines
  Annotation Process and Merged Format
  Entity-Constituent Mapping – How successful?

Entity Annotation
  Gene X with genomic Variation event Y is correlated with Malignancy Z
  Gene – composite entity, can refer to gene or protein: Gene-generic, Gene-protein, Gene-RNA
  (Malignancy – under development, not included in release 0.9)
  Variation Event – Relation between entities representing different aspects of a variation

Entity Annotation - Variations
  Variation – A relation between variation component entities
  “a single nucleotide substitution at codon 249, predicting a serine to cysteine amino acid substitution”
    Var-type – substitution
    Var-location – codon 249
    Var-state-orig – serine
    Var-state-altered – cysteine

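One way to picture a Variation relation is as a record whose fields are its component entities. The sketch below is illustrative only: the class name `Variation`, its field names, and the `components` helper are hypothetical and are not the project's actual data format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variation:
    """Hypothetical container for one variation relation.

    Each field holds the text of a component entity, or None when the
    source text does not mention that component.
    """
    var_type: Optional[str] = None
    var_location: Optional[str] = None
    var_state_orig: Optional[str] = None
    var_state_altered: Optional[str] = None

    def components(self) -> dict:
        """Return only the components that are actually present."""
        return {k: v for k, v in vars(self).items() if v is not None}


# The codon 249 example from the slide above.
substitution = Variation(
    var_type="substitution",
    var_location="codon 249",
    var_state_orig="serine",
    var_state_altered="cysteine",
)
```

Keeping every component optional reflects that a mention need not express all four parts of a variation.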
A Change in Tokenization
  Many hyphenated words are treated as separate tokens
  “New York-based”
    Old (Penn Treebank) tokenization: [New] [York-based]
    New tokenization: [New] [York] [-] [based]

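The new tokenization can be approximated with a short regular expression. This is a minimal sketch, not the project's tokenizer: it splits at every hyphen, whereas the slide says only that many hyphenated words are split, and the function name `tokenize` is an assumption.

```python
import re

# Assumption: a token is a run of word characters, a hyphen, or any
# other single non-space character. The real tokenizer is not shown in
# the slides and may leave some hyphenated words unsplit.
_TOKEN = re.compile(r"\w+|-|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text so that every hyphen becomes a separate token."""
    return _TOKEN.findall(text)


print(tokenize("New York-based"))  # ['New', 'York', '-', 'based']
```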
Discontinuous Entities
  E.g.: “K- and N-ras”
  Tokenization: [K] [-] [and] [N] [-] [ras]
  Entity annotation:
    [K] [-] … [ras] – “chain” of discontinuous tokens
    [N] [-] [ras] – contiguous tokens
  Splitting up is not always done; it depends on the coordination

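A chain can be represented as the list of token positions that make up the entity. The sketch below assumes that representation (the slides do not specify the storage format); `is_chain` is a hypothetical helper that reports whether an entity's tokens are discontinuous.

```python
def is_chain(token_indices: list[int]) -> bool:
    """Return True if the entity's tokens are not all adjacent."""
    return any(b - a != 1 for a, b in zip(token_indices, token_indices[1:]))


tokens = ["K", "-", "and", "N", "-", "ras"]
k_ras = [0, 1, 5]  # [K] [-] ... [ras]: skips "and N -", so it is a chain
n_ras = [3, 4, 5]  # [N] [-] [ras]: contiguous tokens

print(is_chain(k_ras), is_chain(n_ras))  # True False
```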
Treebank Annotation
  Default NP structure is right-branching:
    (NP (JJ primary) (NN liver) (NN cancer))
  Simplifies multi-token nominal annotation
  Allows recovery of implicit constituents:
    (NP (JJ primary) (newnode (NN liver) (NN cancer)))
  Entities sometimes map to such implicit constituents

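Because the default is right-branching, the implicit constituents of a flat NP are its proper suffixes of two or more children. The sketch below enumerates them; the function name `implicit_constituents` and the list-of-strings input are assumptions made for illustration.

```python
def implicit_constituents(children: list[str]) -> list[list[str]]:
    """Return the implicit nodes of a flat, right-branching NP.

    For children c0 c1 ... cn, default right-branching implies the
    nested structure (c0 (c1 (... cn))), so every proper suffix with
    at least two elements is an implicit constituent.
    """
    return [children[i:] for i in range(1, len(children) - 1)]


print(implicit_constituents(["primary", "liver", "cancer"]))
# [['liver', 'cancer']]
```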
Treebank Annotation
  Exceptions to right-branching are marked by NML
  So: any two or more non-final elements that form a constituent are a NML
    (ADJP (NML (NNP New) (NNP York)) (HYPH -) (VBN based))
    (ADJP (NML (NN breast) (NN cancer)) (HYPH -) (VBN associated))
    (NP (NML (NN human) (NN liver) (NN tumor)) (NN analysis))

Treebank Annotation
  Placeholder *P* for distributed material in coordinated nominal structures
  “K- and N-ras”
    (NP (NP (NN K)
            (HYPH -)
            (NML-1 (-NONE- *P*)))
        (CC and)
        (NP (NN N)
            (HYPH -)
            (NML-1 (NN ras))))

Treebank Annotation
  Distributed material can be to the left or to the right
  “codon 12 or 13”
    (NP (NP (NML-1 (NN codon))
            (CD 12))
        (CC or)
        (NP (NML-1 (-NONE- *P*))
            (CD 13)))

First Release
  Goal – let users choose how to handle the integration of entity and treebank levels
  Standoff annotation for entity and treebank
    Identical tokenization
  Merged representation
    Penn Treebank style: (POSTag:[from..to] terminal)
    Entity listing before each tree

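Terminals in the merged representation can be pulled out with a regular expression that follows the (POSTag:[from..to] terminal) pattern. This is a rough sketch under assumptions: `parse_terminals` is a hypothetical helper, the offsets in the usage line are made up for illustration, and whether from..to counts characters or tokens is not stated in the slides.

```python
import re

# Matches one terminal of the form (POSTag:[from..to] token).
# Non-terminal nodes such as (NML ...) carry no offsets and are skipped.
_TERMINAL = re.compile(r"\(([^\s():]+):\[(\d+)\.\.(\d+)\]\s+([^\s()]+)\)")


def parse_terminals(tree_text: str) -> list[tuple[str, int, int, str]]:
    """Return (pos_tag, start, end, token) for each terminal in the text."""
    return [
        (tag, int(start), int(end), token)
        for tag, start, end, token in _TERMINAL.findall(tree_text)
    ]


# Offsets below are invented; the slides elide the real ones.
print(parse_terminals("(NML (NN:[0..4] exon) (CD:[5..6] 2))"))
# [('NN', 0, 4, 'exon'), ('CD', 5, 6, '2')]
```

Since the entities and the trees share one tokenization, the same offsets can then be used to look up which terminals an entity covers.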
Merged Output Example
  sentence 4 Span:
  ;In the present study, we screened for
  ;the K-ras exon 2 point mutations in a
  ;group of 87 gynecological neoplasms
  ;[ ]:gene-rna:"K-ras"
  ;[ ]:variation-location:"exon 2"
  ;[ ]:variation-type: "point mutations"

Merged Output Example
  […]
  ((VP (VBD:[ ] screened)
       (PP-CLR (IN:[ ] for)
               (NP (DT:[ ] the)
                   (NN:[ ] K-ras)
                   (NML (NN:[ ] exon) (CD:[ ] 2))
                   (NN:[ ] point)
                   (NNS:[ ] mutations)))
  […]

Merged Output Example
  ((VP (VBD:[ ] screened)
       (PP-CLR (IN:[ ] for)
               (NP (DT:[ ] the)
                   (NN:[ ] K-ras)
                   (NML (NN:[ ] exon) (CD:[ ] 2))
                   (NN:[ ] point)
                   (NNS:[ ] mutations)))
  ;[ ]:gene-rna:"K-ras"
  ;[ ]:variation-location:"exon 2"
  ;[ ]:variation-type: "point mutations"

Entity-Constituent Mapping: Exact Match
  Exact Match: a node in the tree yields exactly the entity
  ;[ ]:variation-location:"exon 2"
    (NP (DT:[ ] the)
        (NN:[ ] K-ras)
        (NML (NN:[ ] exon) (CD:[ ] 2))
        (NN:[ ] point)
        (NNS:[ ] mutations)))

Entity-Constituent Mapping: Missing Node
  Missing Node: it is possible to add a node that yields exactly the entity
  ;[ ]:variation-type: "point mutations"
    (NP (DT:[ ] the)
        (NN:[ ] K-ras)
        (NML (NN:[ ] exon) (CD:[ ] 2))
        (NN:[ ] point)
        (NNS:[ ] mutations)))

Entity-Constituent Mapping: Missing Node
  Adding nodes was done for internal research purposes and is not in the release (implicit constituents)
  NML is already in the release (explicit constituents)
    (NP (DT:[ ] the)
        (NN:[ ] K-ras)
        (NML (NN:[ ] exon) (CD:[ ] 2))
        (newnode (NN:[ ] point)
                 (NNS:[ ] mutations))))

Entity-Constituent Mapping: Crossing
  Crossing: the entity cuts across constituent boundaries, so no node can be added that yields the entity
  Typical case: entity containing text corresponding to a prepositional phrase
  One ER showed a G-to-T mutation in the second position of codon 12
  [ ]: variation-location: “second position of codon 12”

Entity-Constituent Mapping: Crossing
  Crossing: the determiner is in the NP but not in the entity
  Could relax matching, or modify the entity or treebank annotation; we did not do so
    (NP (NP (DT:[ ] the)
            (JJ:[ ] second)
            (NN:[ ] position))
        (PP (IN:[ ] of)
            (NP (NN:[ ] codon)
                (CD:[ ] 12)))))
  [ ]: variation-location: “second position of codon 12”

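The three mapping categories can be checked mechanically once each node's token span is known. The sketch below is one possible reading, not the procedure used for the corpus: trees are nested tuples, entity spans are (start, end) token offsets with end exclusive, and "missing node" is taken to mean that the entity covers a contiguous run of sibling nodes. The names `collect` and `classify` are hypothetical.

```python
def collect(node, start, node_spans, sibling_bounds):
    """Record the token span of every node; return the end offset."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        end = start + 1  # preterminal: exactly one token
    else:
        bounds = [start]  # offsets where each child starts or ends
        end = start
        for child in children:
            end = collect(child, end, node_spans, sibling_bounds)
            bounds.append(end)
        sibling_bounds.append(bounds)
    node_spans.add((start, end))
    return end


def classify(tree, entity):
    """Classify an entity span (start, end) against a tree."""
    node_spans, sibling_bounds = set(), []
    collect(tree, 0, node_spans, sibling_bounds)
    if entity in node_spans:
        return "exact match"
    start, end = entity
    # A new node is possible if both edges fall on child boundaries
    # of one and the same parent.
    if any(start in b and end in b for b in sibling_bounds):
        return "missing node"
    return "crossing"


kras_np = ("NP", ("DT", "the"), ("NN", "K-ras"),
           ("NML", ("NN", "exon"), ("CD", "2")),
           ("NN", "point"), ("NNS", "mutations"))
codon_np = ("NP",
            ("NP", ("DT", "the"), ("JJ", "second"), ("NN", "position")),
            ("PP", ("IN", "of"), ("NP", ("NN", "codon"), ("CD", "12"))))

print(classify(kras_np, (2, 4)))   # "exon 2" -> exact match
print(classify(kras_np, (4, 6)))   # "point mutations" -> missing node
print(classify(codon_np, (1, 6)))  # "second position of codon 12" -> crossing
```

The last case is crossing because the entity starts inside the inner NP, after the determiner, but ends at the edge of the PP.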
Entity-Constituent Mapping: Chain Exact Match
  “codon 12 or 13”
  Entities: “codon 12”, “codon..13”
    (NP (NP (NML-1 (NN codon))
            (CD 12))
        (CC or)
        (NP (NML-1 (-NONE- *P*))
            (CD 13)))

Entity-Constituent Mapping: Chain Not an Exact Match
  “specific codons (12, 13, and 61)”
  Entities: “codons..12”, “codons..13”, “codons..61”
    (NP (JJ specific)
        (NNS codons)
        (PRN (-LRB- -LRB-)
             (NP (NP (CD 12))
                 (, ,)
                 (NP (CD 13))
                 (, ,)
                 (CC and)
                 (NP (CD 61)))
             (-RRB- -RRB-)))

Multiple Token Entities (Non-Chained)
  Entity Type       | Total | Exact Match | Missing Node | Crossing
  Gene-generic      | 6     | 4           | 1            | 1
  Gene-protein      |       |             |              |
  Gene-RNA          |       |             |              |
  Var-location      |       |             |              |
  Var-state-orig    | 5     | 3           | 1            | 1
  Var-state-altered | 10    | 8           | 0            | 2
  Var-type          |       |             |              |
  Total             |       |             |              | (4.4%)

Multiple Token Entities (Chained)
  Entity Type       | Total | Exact Match | Not Exact Match
  Gene-generic      | 0     | 0           | 0
  Gene-protein      | 6     | 4           | 2
  Gene-RNA          | 36    | 29          | 7
  Var-location      |       |             |
  Var-state-orig    | 0     | 0           | 0
  Var-state-altered | 0     | 0           | 0
  Var-type          | 1     | 0           | 1
  Total             |       |             | (19%)

Conclusion
  Annotation of entities and treebank done together
  Identical tokenization for entities and trees, with standoff annotation
    Allows flexibility in use of integrated annotation
  Only 6.2% of the entities cannot be mapped to an implicit or explicit constituent node
    Changes in Treebank guidelines
    Use of Relations for potentially large entities
  Next: Relation annotation and integrated taggers

Entity Annotation - Variations
  “(S249C)”
    Var-type – none
    Var-location – 249
    Var-state-orig – S
    Var-state-altered – C
  Gene-{RNA,generic,protein} disambiguates gene metonymy
  Var-{type,location,state-orig,state-altered} are different kinds of entities

Entities
  Entity Type       | Single Tokens | Multiple Tokens: Non-chains | Multiple Tokens: Chains
  Gene-generic      | 104           | 6                           | 0
  Gene-protein      |               |                             |
  Gene-RNA          |               |                             |
  Var-location      |               |                             |
  Var-state-orig    | 151           | 5                           | 0
  Var-state-altered |               |                             |
  Var-type          |               |                             |

Introduction
  Corpus for biomedical IE with several levels of annotation:
    Entity
    Syntactic Structure (Treebank)
    Relations (McDonald et al, ACL 2005)
  Ideal: entities mapped to treebank constituents
  Allow users to choose how to integrate the levels

Annotation Process
  Tokenization, Entity, POS, Treebanking, Merged Representation
  Minimal requirement: identical tokenization for entity and treebank annotation
  Did not require an entity/constituent correspondence, but how did it work out?