Supporting Annotation Layers for Natural Language Processing Marti Hearst, Preslav Nakov, Ariel Schwartz, Brian Wolf, Rowena Luk UC Berkeley Stanford InfoSeminar.


1 Supporting Annotation Layers for Natural Language Processing Marti Hearst, Preslav Nakov, Ariel Schwartz, Brian Wolf, Rowena Luk UC Berkeley Stanford InfoSeminar March 17, 2006 Supported by NSF DBI-0317510 And a gift from Genentech

2 UC Berkeley Biotext Project Outline Motivation: NLP tasks System Description Annotation architecture Sample queries Database Design and Evaluation Related Work Future Work

3 UC Berkeley Biotext Project Double Exponential Growth in Bioscience Journal Articles From Hunter & Cohen, Molecular Cell 21, 2006

4 UC Berkeley Biotext Project BioText Project Goals Provide flexible, intelligent access to information for use in biosciences applications. Focus on Textual Information from Journal Articles Tightly integrated with other resources  Ontologies  Record-based databases

5 UC Berkeley Biotext Project Project Team Project Leaders: PI: Marti Hearst Co-PI: Adam Arkin Computational Linguistics and Databases Preslav Nakov Ariel Schwartz Brian Wolf Barbara Rosario (alum) Gaurav Bhalotia (alum) User Interface / IR Rowena Luk Dr. Emilia Stoica Bioscience Janice Hamerja Dr. TingTing Zhang (alum)

6 UC Berkeley Biotext Project BioText Architecture Sophisticated Text Analysis Annotations in Database Improved Search Interface

7 UC Berkeley Biotext Project Sample Sentence “Recent research, in proliferating cells, has demonstrated that interaction of E2F1 with the p53 pathway could involve transcriptional up-regulation of E2F1 target genes such as p14/p19ARF, which affect p53 accumulation [67,68], E2F1-induced phosphorylation of p53 [69], or direct E2F1- p53 complex formation [70].”

8 UC Berkeley Biotext Project Motivation Most natural language processing (NLP) algorithms make use of the results of previous processing steps: Tokenizer Part-of-speech tagger Phrase boundary recognizer Syntactic parser Semantic tagger No standard way to represent, store and retrieve text annotations efficiently. MEDLINE has close to 13 million abstracts. Full text has started to become available as well.

9 UC Berkeley Biotext Project System overview A system for flexible querying of text that has been annotated with the results of NLP processing. Supports self-overlapping and parallel layers, integration of syntactic and ontological hierarchies, and tight integration with SQL. Designed to scale to very large corpora. Most NLP annotation systems assume in-memory usage We’ve evaluated indexing architectures

10 UC Berkeley Biotext Project Text Annotation Framework Annotations are stored independently of text in an RDBMS. Declarative query language for annotation retrieval. Indexing structure designed for efficient query processing.

11 UC Berkeley Biotext Project Key Contributions Support for hierarchical and overlapping layers of annotation. Querying multiple levels of annotations simultaneously. First to evaluate different physical database designs for NLP annotation architecture.

12 UC Berkeley Biotext Project Layers of Annotations Each annotation represents an interval spanning a sequence of characters absolute start and end positions Each layer corresponds to a conceptually different kind of annotation Protein, MESH label, Noun Phrase Layers can be Sequential Overlapping  two multiple-word concepts sharing a word Hierarchical (two different ways)  spanning, when the intervals are nested as in a parse tree, or  ontologically, when the token itself is derived from a hierarchical ontology
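The interval model on this slide can be sketched in a few lines of Python. This is only an illustration: the `Annotation` class, its field names, and the choice of an exclusive end position are assumptions, not the BioText implementation. The phrases are the "blood cell" / "cell death" examples used on the next slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One annotation: a labeled character interval in one layer (hypothetical model)."""
    layer: str
    start: int  # absolute start character position (inclusive)
    end: int    # absolute end character position (exclusive, an assumption)
    label: str

    def contains(self, other: "Annotation") -> bool:
        # Spanning hierarchy: the other interval is nested inside this one.
        return self.start <= other.start and other.end <= self.end

    def overlaps(self, other: "Annotation") -> bool:
        # Overlap: the two intervals share at least one character.
        return self.start < other.end and other.start < self.end


text = "blood cell death"
blood_cell = Annotation("mesh", 0, 10, "blood cell")
cell_death = Annotation("mesh", 6, 16, "cell death")
cell = Annotation("mesh", 6, 10, "cell")

print(blood_cell.overlaps(cell_death))  # the two concepts share the word "cell"
print(blood_cell.contains(cell))        # "cell" is nested inside "blood cell"
```

Two multi-word concepts that share a word overlap without either containing the other, which is why a single sequential layer cannot represent them.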

13 UC Berkeley Biotext Project Layer Type Properties One-to-one correspondence between the Word and the Part-of-speech (POS) layers. The Word, POS and Shallow parse layers are sequential The Full parse layer is spanning hierarchical The Gene/protein layer assigns IDs from the LocusLink database of gene names many-to-one in the case of multiple species The Ontology layer assigns terms from the hierarchical medical ontology MeSH (Medical Subject Headings) Overlapping (share the word cell) and hierarchical:  both spanning, since blood cell (with MeSH ID D001773) spans cell (which is also in MeSH), and  ontologically, since blood cell is a kind of cell and cell death (D016923) is a type of Biological Phenomena.

14 UC Berkeley Biotext Project Layers of Annotations

15 UC Berkeley Biotext Project Layers of Annotations

16 UC Berkeley Biotext Project Layers of Annotations

17 UC Berkeley Biotext Project Layers of Annotations Full parse, sentence and section layers are not shown.

18 UC Berkeley Biotext Project Example: Query for Noun Compound Extraction Goal: find noun phrases consisting ONLY of 3 nouns. Of the phrases "plastic water bottle", "blue water bottle" and "big plastic water bottle", only the first qualifies.
    FROM [layer='shallow_parse' && tag_name='NP'
            ^ [layer='pos' && tag_name="noun"]
              [layer='pos' && tag_name="noun"]
              [layer='pos' && tag_name="noun"] $
         ] AS compound
    SELECT compound.content
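As a rough illustration of what the query above asks for, the following Python sketch checks whether a noun-phrase interval is covered by exactly three noun tokens. It reads `^` and `$` as anchors to the start and end of the enclosing NP, which is an interpretation of the slide rather than a documented LQL rule; the function name, the tuple layout, and the POS tags assigned to the toy phrases are all hypothetical.

```python
def is_n_noun_compound(np_span, pos_annotations, n=3):
    """Return True if the NP interval consists of exactly n noun tokens.

    np_span: (start, end) of the noun phrase.
    pos_annotations: iterable of (start, end, tag_group) tuples.
    """
    np_start, np_end = np_span
    # Keep only the POS annotations nested inside the NP, in text order.
    inside = sorted(
        (start, end, tag)
        for start, end, tag in pos_annotations
        if np_start <= start and end <= np_end
    )
    if len(inside) != n or any(tag != "noun" for _, _, tag in inside):
        return False
    # '^' and '$': the first token starts the NP and the last token ends it.
    return inside[0][0] == np_start and inside[-1][1] == np_end


# Toy POS layers with hand-assigned tags (assumptions for illustration).
plastic = [(0, 7, "noun"), (8, 13, "noun"), (14, 20, "noun")]          # plastic water bottle
blue = [(0, 4, "adjective"), (5, 10, "noun"), (11, 17, "noun")]        # blue water bottle
big = [(0, 3, "adjective"), (4, 11, "noun"), (12, 17, "noun"),
       (18, 24, "noun")]                                                # big plastic water bottle

print(is_n_noun_compound((0, 20), plastic))
print(is_n_noun_compound((0, 17), blue))
print(is_n_noun_compound((0, 24), big))
```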

19 UC Berkeley Biotext Project Query for Noun Compound Extraction (SQL wrapping)
    SELECT LOWER(lql.content) AS content_lower, COUNT(*) AS freq
    FROM (
      BEGIN_LQL
        [layer='shallow_parse' && tag_name='NP'
           ^ [layer='pos' && tag_name="noun"]
             [layer='pos' && tag_name="noun"]
             [layer='pos' && tag_name="noun"] $
        ] AS compound
        SELECT compound.content
      END_LQL
    ) AS lql
    GROUP BY LOWER(lql.content)
    ORDER BY freq DESC

20 UC Berkeley Biotext Project Query for Noun Compound Extraction (using artificial layers) Goal: find noun phrases which have EXACTLY two nouns at the end, but no nouns before those two. Of "big blue water bottle" and "plastic water bottle", only the first qualifies.
    FROM [layer='shallow_parse' && tag_name='NP'
            ^ ( { ALLOW GAPS } ![layer='pos' && tag_name="noun"]
                ( [layer='pos' && tag_name="noun"]
                  [layer='pos' && tag_name="noun"] ) $
              ) $
         ] AS compound
    SELECT compound.content

21 UC Berkeley Biotext Project Example: Paraphrases Want to find phrases with certain variations: Immunodeficiency virus(?es) in ?the human(?s), e.g. immunodeficiency virus in humans; immunodeficiency viruses in humans; immunodeficiency virus in the human; immunodeficiency virus in a human

22 UC Berkeley Biotext Project Query for Paraphrases (optional layers and disjunction)
    [layer='sentence'
      [layer='pos' && tag_name="noun" && content = "immunodeficiency"]
      [layer='pos' && tag_name="noun" && content IN ("virus","viruses")]
      [layer='pos' && tag_name='IN'] AS prep
      ?[layer='pos' && tag_name='DT' && content IN ("the","a","an")]
      [layer='pos' && tag_name="noun" && content IN ("human", "humans")]
    ]
    SELECT prep.content
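For comparison, the same surface variation can be approximated with a plain regular expression. This sketch is not equivalent to the LQL query: it knows nothing about part-of-speech layers, so it hard-codes the preposition "in" where the query accepts any token tagged IN. The pattern and variable names are illustrative assumptions.

```python
import re

# Optional plural on "virus", optional determiner, optional plural on "human".
PARAPHRASE = re.compile(
    r"immunodeficiency virus(?:es)? in (?:(?:the|a|an) )?humans?"
)

examples = [
    "immunodeficiency virus in humans",
    "immunodeficiency viruses in humans",
    "immunodeficiency virus in the human",
    "immunodeficiency virus in a human",
]

matched = [bool(PARAPHRASE.fullmatch(phrase)) for phrase in examples]
print(matched)
```

The layered query generalizes where the regex cannot, because the preposition and determiner slots are constrained by annotation rather than by spelling.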

23 UC Berkeley Biotext Project Example: Protein-Protein Interactions Find all sentences that consist of: an NP containing a gene, followed by a morphological variant of the verb "activate", "inhibit", or "bind", followed by another NP containing a gene. Pattern within a sentence: protein, then activate(d, ing) / inhibit(ed, ing) / bind(s, ing), then protein.

24 UC Berkeley Biotext Project Query for Protein-Protein Interactions
    SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
    FROM (
      BEGIN_LQL
        [layer='sentence' { ALLOW GAPS }
          [layer='shallow_parse' && tag_name='NP' [layer='gene'] $ ] AS p1
          [layer='pos' && tag_name="verb" &&
            (content ~ "activate%" || content ~ "inhibit%" || content ~ "bind%") ] AS verb
          [layer='shallow_parse' && tag_name='NP' [layer='gene'] $ ] AS p2
        ]
        SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
      END_LQL
    ) lql
    GROUP BY p1_text, verb_content, p2_text
    ORDER BY count(*) DESC

25 UC Berkeley Biotext Project Protein-Protein Interactions Sample Output
PROTEIN 1 | INTERACTION VERB | PROTEIN 2 | FREQUENCY
Ca2 | activates | protein kinase | 312
Cln3 | activate | protein kinase | 234
TAP | binds | transcription factor | 192
TNF | activates | protein tyrosine kinase | 133
serine/threonine kinase | binding | RhoA GTPase | 132
Phospholamban | inhibits | ATPase | 114
PRL | activated | transcription factor | 108
Interleukin 2 | activates | transcription factor | 84
Prolactin | activates | transcription factor | 84
AMPA | activated | protein kinase | 78
Nerve growth factor | activates | protein kinase | 78
LPS | inhibited | MHC class II | 75
Heat shock protein | Binding | p59 | 72
EPO | activated | STAT5 | 63
EGF | activated | PP2A | 60
cis | binds | Sp1 | 50

26 UC Berkeley Biotext Project Example: Chemical-Disease Interactions “A new approach to the respiratory problems of cystic fibrosis is dornase alpha, a mucolytic enzyme given by inhalation.” Goal: extract the relation that dornase alpha (potentially) prevents cystic fibrosis. The MeSH C06.689 subtree contains pancreatic diseases. MeSH supplementary concepts represent chemicals.

27 UC Berkeley Biotext Project Query on Disease-Chemical Interactions

28 UC Berkeley Biotext Project Query on Disease-Chemical Interactions
    [layer='sentence' { NO ORDER, ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='chemicals'] AS chemical $
      ]
      [layer='shallow_parse' && tag_name='NP'
        [layer='mesh' && tree_number BELOW 'C06.689%'] AS disease $
      ]
    ] AS sent
    SELECT chemical.text, disease.text, sent.text
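The `BELOW` constraint selects terms in a subtree of the MeSH hierarchy. Because MeSH tree numbers encode the path from the root, one simple way to approximate it is a prefix test, much like SQL `LIKE 'C06.689%'`. The sketch below assumes that reading; the function name is hypothetical, the tree numbers passed in are illustrative values, and whether LQL treats a node as being below itself is not stated on the slides.

```python
def is_below(tree_number: str, pattern: str) -> bool:
    """Prefix test on MeSH tree numbers; a trailing '%' wildcard is ignored."""
    prefix = pattern.rstrip("%")
    return tree_number.startswith(prefix)


# Illustrative tree numbers (assumed values, not taken from the slides).
print(is_below("C06.689.202", "C06.689%"))  # a node inside the C06.689 subtree
print(is_below("C04.588", "C06.689%"))      # a node in a different subtree
print(is_below("D12.776", "D"))             # top-level category, as in workload (c)
```

A prefix test keeps hierarchy descent inside an ordinary index range scan, which is one plausible reason for storing tree numbers as strings.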

29 UC Berkeley Biotext Project Results: Chemical-Disease

30 UC Berkeley Biotext Project Query Translation

31 Database Design & Evaluation

32 UC Berkeley Biotext Project Database Design Evaluated 5 different logical and physical database designs. The basic model is similar to the one of TIPSTER (Grishman, 1996). Each annotation is stored as a record in a relation. Architecture 1 contains the following columns: 1. docid: document ID; 2. section: title, abstract or body text; 3. layer_id: a unique identifier of the annotation layer; 4. start_char_pos: starting character position, relative to particular section and docid; 5. end_char_pos: end character position, relative to particular section and docid; 6. tag_type: a layer-specific token unique identifier.  There is a separate table mapping token IDs to entities (the string in case of a word, the MeSH label(s) in case of a MeSH term etc.)
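A minimal sketch of Architecture 1 using Python's built-in `sqlite3` module (the talk used IBM DB2; SQLite is swapped in here only so the example runs anywhere). The column names follow the slide; the table names `annotation` and `word`, the layer number 0 for words, and the sample rows for "Kinase inhibits RAG-1." are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (
    docid          INTEGER NOT NULL,  -- document ID
    section        TEXT    NOT NULL,  -- title, abstract or body text
    layer_id       INTEGER NOT NULL,  -- annotation layer
    start_char_pos INTEGER NOT NULL,  -- relative to section and docid
    end_char_pos   INTEGER NOT NULL,
    tag_type       INTEGER NOT NULL   -- layer-specific token ID
);
-- Separate table mapping token IDs to entities (here: word strings).
CREATE TABLE word (word_id INTEGER PRIMARY KEY, word TEXT NOT NULL);
""")

WORD_LAYER = 0  # hypothetical layer number for the word layer
conn.executemany("INSERT INTO word VALUES (?, ?)", [
    (59571, "Kinase"), (55608, "inhibits"), (89985, "RAG-1"),
])
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?)", [
    (3345, "b", WORD_LAYER, 34, 40, 59571),
    (3345, "b", WORD_LAYER, 41, 49, 55608),
    (3345, "b", WORD_LAYER, 50, 55, 89985),
])

# Recover the words of one document section in text order.
words = [row[0] for row in conn.execute("""
    SELECT w.word
    FROM annotation AS a
    JOIN word AS w ON w.word_id = a.tag_type
    WHERE a.docid = ? AND a.section = ? AND a.layer_id = ?
    ORDER BY a.start_char_pos
""", (3345, "b", WORD_LAYER))]
print(words)
```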

33 UC Berkeley Biotext Project Database Design (cont.) Architecture 2 introduces one additional column, sequence_pos, thus defining an ordering for each layer. Simplifies some SQL queries as there is no need for “NOT EXISTS” self joins, which are required under Architecture 1 in cases where tokens from the same layer must follow each other immediately. Architecture 3 adds sentence_id, which is the number of the current sentence and redefines sequence_pos as relative to both layer_id and sentence_id. Simplifies most queries since they are often limited to the same sentence.
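The difference between the two adjacency formulations can be sketched with `sqlite3`. Under Architecture 1 only character positions are available, so "immediately follows" has to be expressed as "follows, and no other token of the same layer lies in between" (a NOT EXISTS self join). With `sequence_pos` it becomes a simple equality. The table name, the textual tag values, and the sample rows are assumptions; one table is reused for both queries, and the Architecture 1 query simply ignores `sequence_pos`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        docid INTEGER, section TEXT, layer_id INTEGER,
        start_char_pos INTEGER, end_char_pos INTEGER,
        tag_type TEXT, sequence_pos INTEGER
    )""")
POS_LAYER = 1  # hypothetical layer number
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (3345, "b", POS_LAYER, 34, 40, "NN", 1),  # Kinase
    (3345, "b", POS_LAYER, 41, 49, "VB", 2),  # inhibits
    (3345, "b", POS_LAYER, 50, 55, "NN", 3),  # RAG-1
])

# Architecture 1 style: adjacency via a NOT EXISTS self join.
ARCH1_ADJACENT = """
SELECT a.start_char_pos, b.start_char_pos
FROM annotation AS a
JOIN annotation AS b
  ON b.docid = a.docid AND b.section = a.section
 AND b.layer_id = a.layer_id
 AND b.start_char_pos >= a.end_char_pos
WHERE a.layer_id = ? AND a.tag_type = ? AND b.tag_type = ?
  AND NOT EXISTS (
      SELECT 1 FROM annotation AS c
      WHERE c.docid = a.docid AND c.section = a.section
        AND c.layer_id = a.layer_id
        AND c.start_char_pos >= a.end_char_pos
        AND c.end_char_pos <= b.start_char_pos)
"""

# Architecture 2 style: adjacency via sequence_pos.
ARCH2_ADJACENT = """
SELECT a.start_char_pos, b.start_char_pos
FROM annotation AS a
JOIN annotation AS b
  ON b.docid = a.docid AND b.section = a.section
 AND b.layer_id = a.layer_id
 AND b.sequence_pos = a.sequence_pos + 1
WHERE a.layer_id = ? AND a.tag_type = ? AND b.tag_type = ?
"""


def adjacent(sql, first_tag, second_tag):
    """Start positions of token pairs where second_tag directly follows first_tag."""
    return conn.execute(sql, (POS_LAYER, first_tag, second_tag)).fetchall()


print(adjacent(ARCH1_ADJACENT, "NN", "VB"), adjacent(ARCH2_ADJACENT, "NN", "VB"))
print(adjacent(ARCH1_ADJACENT, "NN", "NN"), adjacent(ARCH2_ADJACENT, "NN", "NN"))
```

Both formulations return the same pairs on this data; the second needs one join fewer and no correlated subquery.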

34 UC Berkeley Biotext Project Database Design (cont.) Architecture 4 merges the word and POS layers, and adds word_id assuming a one-to-one correspondence between them. Reduces the number of stored annotations and the number of joins in queries with both word and POS constraints. Architecture 5 replaces sequence_pos with first_word_pos and last_word_pos, which correspond to the sequence_pos of the first/last word covered by the annotation. Requires all annotation boundaries to coincide with word boundaries. Copes naturally with adjacency constraints between different layers. Allows for a simpler indexing structure.
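With word positions on every annotation, adjacency between different layers reduces to arithmetic on `first_word_pos` and `last_word_pos`. The `sqlite3` sketch below looks for a gene annotation directly followed by a verb directly followed by another gene. It assumes `last_word_pos` is inclusive, as the slide's definition suggests; the table name, layer numbers, textual tag values, and rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (
        docid INTEGER, section TEXT, layer_id INTEGER, sentence INTEGER,
        first_word_pos INTEGER, last_word_pos INTEGER, tag_type TEXT
    )""")
POS_LAYER, GENE_LAYER = 1, 5  # hypothetical layer numbers
conn.executemany("INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (3345, "b", POS_LAYER, 2, 1, 1, "NN"),     # Kinase
    (3345, "b", POS_LAYER, 2, 2, 2, "VB"),     # inhibits
    (3345, "b", POS_LAYER, 2, 3, 3, "NN"),     # RAG-1
    (3345, "b", GENE_LAYER, 2, 1, 1, "gene"),  # Kinase
    (3345, "b", GENE_LAYER, 2, 3, 3, "gene"),  # RAG-1
])

# gene, verb, gene: each item starts one word after the previous one ends.
QUERY = """
SELECT g1.first_word_pos, v.first_word_pos, g2.first_word_pos
FROM annotation AS g1
JOIN annotation AS v
  ON v.docid = g1.docid AND v.section = g1.section
 AND v.sentence = g1.sentence
 AND v.first_word_pos = g1.last_word_pos + 1
JOIN annotation AS g2
  ON g2.docid = v.docid AND g2.section = v.section
 AND g2.sentence = v.sentence
 AND g2.first_word_pos = v.last_word_pos + 1
WHERE g1.layer_id = ? AND v.layer_id = ? AND v.tag_type = 'VB'
  AND g2.layer_id = ?
"""
matches = conn.execute(QUERY, (GENE_LAYER, POS_LAYER, GENE_LAYER)).fetchall()
print(matches)
```

The same join condition works whether the neighbors are single words or multi-word phrases, which is what "copes naturally with adjacency constraints between different layers" amounts to in SQL terms.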

35 UC Berkeley Biotext Project Data Layout for all 5 Architectures Example: “Kinase inhibits RAG-1.”
PMID | SECTION | LAYER ID | START CHAR POS | END CHAR POS | TAG TYPE | SEQUENCE POS | SENTENCE | WORD ID
3345 | b (body) | 0 (word) | 34 | 40 | | 1 | 2 | 59571
3345 | b | 0 | 41 | 49 | | 2 | 2 | 55608
3345 | b | 0 | 50 | 55 | | 3 | 2 | 89985
3345 | b | 1 (POS) | 34 | 40 | 27 (NN) | 1 | 2 | 59571
3345 | b | 1 | 41 | 49 | 53 (VB) | 2 | 2 | 55608
3345 | b | 1 | 50 | 55 | 27 | 3 | 2 | 89985
3345 | b | 5 (gene) | 34 | 40 | 39 (prt) | 1 | 2 |
3345 | b | 5 | 50 | 55 | 39 | 2 | 2 |
3345 | b | 6 (mesh) | 34 | 40 | 10770 | 1 | 2 |
3345 | b | 6 | 50 | 55 | 16654 | 2 | 2 |
3345 | b | 3 (s.parse) | 34 | 40 | 31 (NP) | 1 | 2 |
3345 | b | 3 | 41 | 49 | 59 (VP) | 2 | 2 |
3345 | b | 3 | 50 | 55 | 31 | 3 | 2 |
Columns PMID through TAG TYPE form the basic architecture. SEQUENCE POS is added in Architecture 2, SENTENCE in Architecture 3, WORD ID in Architecture 4, and FIRST WORD POS / LAST WORD POS in Architecture 5 (the per-row values of these last two columns are garbled in this transcript and omitted).

36 UC Berkeley Biotext Project Indexing Structure Two types of composite indexes: forward and inverted. An index lookup can be performed on any column combination that corresponds to an index prefix. The forward indexes support lookup based on position in a given document. The inverted indexes support lookup based on annotation values (i.e., tag type and word id). Most query plans involve both forward and inverted indexes. Join statistics would have been useful. Detailed statistics are essential; the standard statistics in DB2 are insufficient. Records are clustered on their primary key.

37 UC Berkeley Biotext Project Indexing Structure (cont.) (Type: F = forward, I = inverted)
Architecture | Type | Columns
Arch 1-4 | F* | DOCID + SECTION + LAYER_ID + START_CHAR_POS + END_CHAR_POS + TAG_TYPE
Arch 1-4 | I | LAYER_ID + TAG_TYPE + DOCID + SECTION + START_CHAR_POS + END_CHAR_POS
Arch 2 | F | DOCID + SECTION + LAYER_ID + SEQUENCE POS + TAG_TYPE + START_CHAR_POS + END_CHAR_POS
Arch 2 | I | LAYER_ID + TAG_TYPE + DOCID + SECTION + SEQUENCE POS + START_CHAR_POS + END_CHAR_POS
Arch 3-4 | F | DOCID + SECTION + LAYER_ID + SENTENCE + SEQUENCE POS + TAG_TYPE + START_CHAR_POS + END_CHAR_POS
Arch 3-4 | I | LAYER_ID + TAG_TYPE + DOCID + SECTION + SENTENCE + SEQUENCE POS + START_CHAR_POS + END_CHAR_POS
Arch 4 | I | WORD ID + LAYER_ID + TAG_TYPE + DOCID + SECTION + START_CHAR_POS + END_CHAR_POS + SENTENCE + SEQUENCE POS
Arch 5 | F* | DOCID + SECTION + LAYER_ID + SENTENCE + FIRST_WORD_POS + LAST_WORD_POS + TAG_TYPE
Arch 5 | I | LAYER_ID + TAG_TYPE + DOCID + SECTION + SENTENCE + FIRST_WORD_POS + LAST_WORD_POS
Arch 5 | I | WORD ID + LAYER_ID + TAG_TYPE + DOCID + SECTION + SENTENCE + FIRST_WORD_POS
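The "index prefix" rule can be made concrete with a small helper that reports how many leading columns of a composite index a set of equality constraints can use. This is a simplified model of B-tree index matching in general, not a description of the DB2 optimizer; the helper name is hypothetical, and the column orders are the Architecture 1-4 forward and inverted indexes from the table above.

```python
FORWARD = ("docid", "section", "layer_id",
           "start_char_pos", "end_char_pos", "tag_type")
INVERTED = ("layer_id", "tag_type", "docid", "section",
            "start_char_pos", "end_char_pos")


def usable_prefix(index_columns, constrained):
    """Leading index columns that are all constrained by the query."""
    prefix = []
    for column in index_columns:
        if column not in constrained:
            break  # a gap ends the usable prefix
        prefix.append(column)
    return prefix


# "All annotations of layer L in section S of document D": forward index.
print(usable_prefix(FORWARD, {"docid", "section", "layer_id"}))
# "All occurrences of tag T in layer L": only the inverted index helps.
print(usable_prefix(FORWARD, {"layer_id", "tag_type"}))
print(usable_prefix(INVERTED, {"layer_id", "tag_type"}))
```

A query that starts from an annotation value and then walks to its neighbors in the document would typically use the inverted index first and the forward index second, consistent with the remark that most query plans involve both.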

38 UC Berkeley Biotext Project Experimental Setup Annotated 13,504 MEDLINE abstracts Stanford Lexicalized Parser (Klein and Manning, 2003) for sentence splitting, word tokenization, POS tagging and parsing. We wrote a shallow parser and tools for gene and MeSH term recognition. This resulted in 10,910,243 records stored in an IBM DB2 Universal Database Server. Defined 4 workloads based on variants of queries.

39 UC Berkeley Biotext Project Experimental Setup: 4 Workloads
(a) Protein-Protein Interaction (Blaschke et al., 1999)
    [layer='sentence' {ALLOW GAPS}
      [layer='gene'] AS gene1
      [layer='pos' && tag_name="verb" && content="binds"] AS verb
      [layer='gene'] AS gene2
    ]
    SELECT gene1.content, verb.content, gene2.content
(b) Protein-Protein Interaction (Thomas et al., 2000)
    [layer='sentence'
      [layer='shallow_parse' && tag_name="NP"] AS np1
      [layer='pos' && tag_name="verb" && content='binds'] AS verb
      [layer='pos' && tag_name="prep" && content='to']
      [layer='shallow_parse' && tag_name="NP"] AS np2
    ]
    SELECT np1.content, verb.content, np2.content
(c) Descent of Hierarchy (Rosario et al., 2002). Slide illustration: A01 A07 (limb:vein, shoulder:artery)
    [layer='shallow_parse' && tag_name="NP"
      [layer='pos' && tag_name="noun"
        ^ [layer='mesh' && tree_number BELOW "G07.553"] AS m1 $ ]
      [layer='pos' && tag_name="noun"
        ^ [layer='mesh' && tree_number BELOW "D"] AS m2 $ ]
    ]
    SELECT m1.content, m2.content
(d) Acronym-Meaning Extraction (Pustejovsky et al., 2001)
    [layer='shallow_parse' && tag_name="NP"] AS np1
    [layer='pos' && content='(']
    [layer='shallow_parse' && tag_name="NP"] AS np2
    [layer='pos' && content=')']

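40 UC Berkeley Biotext Project Results
Workload (a) | Arch 1 | Arch 2 | Arch 3 | Arch 4 | Arch 5
SQL lines | 37 | | 34 | 29 |
# Joins | 6 | 6 | 6 | 5 | 5
Time (sec) | 3.98 | 4.35 | 3.59 | 1.69 | 1.94
Workload (b) | Arch 1 | Arch 2 | Arch 3 | Arch 4 | Arch 5
SQL lines | 91 | 77 | 75 | 65 | 50
# Joins | 12 | 11 | | 9 | 7
Time (sec) | 3.88 | 5.68 | 5.41 | 3.85 | 3.55
Workload (c) | Arch 1 | Arch 2 | Arch 3 | Arch 4 | Arch 5
SQL lines | 45 | 38 | | 39 | 41
# Joins | 7 | 6 | 6 | 6 | 6
Time (sec) | 17.92 | 3.42 | 21.49 | 30.07 | 4.06
Workload (d) | Arch 1 | Arch 2 | Arch 3 | Arch 4 | Arch 5
SQL lines | 59 | 50 | 53 | | 35
# Joins | 7 | 7 | 7 | 7 | 4
Time (sec) | 1,879 | 1,700 | 2,182 | 1,682 | 1,582
Workload | (a) | (b) | (c) | (d)
#Queries | 54 | 11 | 50 | 1
#Results/query | 303.4 | 77.5 | 1.6 | 16,701
LQL lines | 8 | 6 | 5 | 4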

41 UC Berkeley Biotext Project Results
Space (MB) | Arch 1 | Arch 2 | Arch 3 | Arch 4 | Arch 5
Data Storage | 168.5 | 168.5 | 168.5 | 132.5 | 136.5
Index Storage | 617.0 | 1,397.0 | 1,441.0 | 1,182.0 | 673.5
Total Storage | 785.5 | 1,565.5 | 1,609.5 | 1,314.5 | 810.0
Architecture 5 performs well (if not best) on all query types, while the other architectures perform poorly on at least one query type. The storage requirement of Architecture 5 is comparable to that of Architecture 1. Architecture 5 results in much simpler queries. Conclusion: We recommend Architecture 5 in most cases, or Architecture 1 if an atomic annotation layer cannot be defined.

42 UC Berkeley Biotext Project Scalability Analysis Combined workload of 3 query types Varying buffer pool sizes

43 UC Berkeley Biotext Project Scalability Analysis
Buffer Pool Size (MB) | Elapsed Time (ms) | Buffer Read Time (ms)
1000 | 2300 | 1050
100 | 2900 | 1670
10 | 4600 | 3340
1 | 8300 | 6250
Suggests that query execution time grows sub-linearly as the buffer pool shrinks: a 1000-fold reduction in memory increases elapsed time by a factor of about 3.6. We believe a similar ratio will be observed when increasing the database size and keeping the memory size fixed. Parallel query execution can be enabled after partitioning the annotations on document_id.
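The sub-linear claim can be checked with simple arithmetic on the table above, reading its rows as (buffer pool size in MB, elapsed time in ms). The variable names are illustrative; the numbers are the four measurements from the slide.

```python
# (buffer pool size in MB, elapsed time in ms), largest pool first.
measurements = [(1000, 2300), (100, 2900), (10, 4600), (1, 8300)]

# Slowdown for each 10-fold reduction of the buffer pool.
step_slowdowns = [
    slower / faster
    for (_, faster), (_, slower) in zip(measurements, measurements[1:])
]
# Slowdown across the whole 1000-fold reduction.
overall_slowdown = measurements[-1][1] / measurements[0][1]

print([round(ratio, 2) for ratio in step_slowdowns])
print(round(overall_slowdown, 2))
```

Every 10-fold cut in memory costs well under a 10-fold increase in elapsed time, which is the sense in which the growth is sub-linear.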

44 UC Berkeley Biotext Project Study on a larger dataset Annotated 1.4 million MEDLINE abstracts: 10 million sentences, 320 million annotations, 70 GB total database size.
Workload | (a) | (b) | (c) | (d) | Random (a, b, c)
#Queries | 54 | 11 | 50 | 1 | 115
#Results/query | 32,295 | 5,420 | 481 | 13,483 | 15,686
Time/query | 0:50 | 55:44 | 1:35 | 3:33:57 | 6:26

45 UC Berkeley Biotext Project Related Work
Annotation graphs (AG): directed acyclic graph; nodes can have time stamps or are constrained via paths to labeled parents and children. (Bird and Liberman, 2001)
    Example: find arcs labeled as words, whose phonetic transcription starts with "hv":
    SELECT I WHERE X.[id:I].Y <- db/wrd X.[:hv].[]*.Y <- db/phn;
Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined for each pair. (Cassidy and Harrington, 2001)
    Example: find sequences of phonetic "A" followed by "p", both dominated by an "S" syllable:
    [[Phonetic=A -> Phonetic=p] ^ Syllable=S]
The Q4M query language for MATE: directed graph; constraints and ordering of the annotated components. Stored in XML. (McKelvie et al., 2001)
    Example: find nouns followed by the word "lesser":
    ($a word) ($b word); ($a pos ~ "NN") && ($a <> $b) && ($b # ~ "lesser")
TIQL (TIMS system): queries consist of manipulating intervals of text, indicated by XML tags; supports set operations. (Nenadic et al., 2002)
    Example: find sentences containing the noun phrase "COUP-TF II" and the verb "inhibit" (the XML-tagged query text is not preserved in this transcript).

46 UC Berkeley Biotext Project What about XQuery/XPath?

47 UC Berkeley Biotext Project

48 Main Advantages of LQL System Stand-off annotation Flexible and modular Multi-layered, including overlaps LQL – simple yet powerful Support for hierarchies Optimized for cross-layer queries Much more expressive than standard text search engines Seamless integration with SQL and RDBMS Easy integration with additional data sources Simple parallelism Full text support Caption search Formatting-aware queries Flexible support for document structure

49 UC Berkeley Biotext Project On the Horizon Full text documents support Really complex in bioscience text  Caption search  Formatting-aware annotation layers  Flexible support for document structure Query simplification Shorthand syntax GUI helper

50 UC Berkeley Biotext Project Syntax-Helper Interface

51 Thank you! biotext.berkeley.edu/lql

52 UC Berkeley Biotext Project Overlap Example

53 UC Berkeley Biotext Project Meta-data tables BIOTEXT_ANNOTATION_LAYER
LAYER_ID | LAYER_NAME | OWNER | LAST_UPDATED
1 | pos | hearst | 6/12/2005
2 | full_parse | hearst | 6/12/2005
3 | shallow_parse | hearst | 6/12/2005
4 | sentence | hearst | 6/12/2005
5 | gene | hearst | 6/12/2005
6 | mesh | hearst | 6/12/2005
7 | chemicals | hearst | 6/12/2005

54 UC Berkeley Biotext Project Meta-data tables BIOTEXT_ANNOTATION_ATTRIBUTES
LAYER_ID | ATTRIBUTE | ATTRIBUTE_FIELD | TABLE_NAME | ATTRIBUTE_ID | ATTRIBUTE_TEXT | DBL_QUOTE_ALIAS | TREE_TABLE | TREE_DESC | TREE_NUM
 | layer | layer_id | biotext_annotation_layers | layer_id | layer_name | layer | None | |
 | tag_name | tag_type | biotext_annotation_tag_types | tag_type_id | tag_name | tag_group | None | |
 | tag_group | tag_type | biotext_annotation_tag_types | tag_type_id | tag_group | | None | |
1 | content | word_id | biotext_annotation_word | word_id | word | content_lower | None | |
1 | content_lower | word_id | biotext_annotation_word | word_id | word_lower | content_lower | None | |
5 | name | tag_type | locuslink_aliases | locus_id | name | | None | |
6 | tree_number | tag_type | biotext_annotation_mesh_tree | descriptor_ui | tree_number | | biotext_annotation_mesh_tree | descriptor_ui | tree_number
6 | mesh_term | tag_type | biotext_annotation_mesh_terms | descriptor_ui | mesh_term | mesh_term_lower | biotext_annotation_mesh_tree | descriptor_ui | tree_number
6 | mesh_term_lower | tag_type | biotext_annotation_mesh_terms | descriptor_ui | mesh_term_lower | | biotext_annotation_mesh_tree | descriptor_ui | tree_number

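55 UC Berkeley Biotext Project Meta-data tables BIOTEXT_ANNOTATION_TAG_TYPES
# | LAYER_ID | TAG_TYPE_ID | TAG_NAME | TAG_GROUP
21 | 2 | 1019 | IN |
22 | 2 | 1020 | INTJ |
23 | 2 | 1021 | JJ | adjective
24 | 2 | 1022 | JJR | adjective
25 | 2 | 1023 | JJS | adjective
26 | 2 | 1025 | LS |
27 | 2 | 1069 | LST |
28 | 2 | 1026 | MD |
29 | 2 | 1070 | NAC |
30 | 2 | 1027 | NN | noun
31 | 2 | 1028 | NNP | noun
32 | 2 | 1029 | NNPS | noun
33 | 2 | 1030 | NNS | noun
34 | 2 | 1031 | NP |
35 | 2 | 1032 | NX |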

56 UC Berkeley Biotext Project Meta-data tables BIOTEXT_ANNOTATION_WORD
# | WORD_ID | WORD | WORD_LOWER
1 | 1212952 | BCl | bcl
2 | 1212953 | 2,2'-disulfonic |
3 | 1212954 | 1762-1860 |
4 | 1212955 | Premkumar | premkumar
5 | 1212956 | 329:265-285 |
6 | 1212957 | EVPROC | evproc
7 | 1212958 | fascinae |
8 | 1212959 | fascines |
9 | 1212960 | Cox-Stuart | cox-stuart
10 | 1212961 | epidydimo-orchitis |
11 | 1212962 | 10-20-min |
12 | 1212963 | 0.05-10-ng/ml |
13 | 1212964 | 1.016x |
14 | 1212965 | Goldberg-Lindblom | goldberg-lindblom
15 | 1212966 | Lundborg | lundborg
16 | 1212967 | graft-loss |

57 UC Berkeley Biotext Project References
Steven Bird and Mark Liberman. 2001. A formal framework for linguistic annotation. Speech Communication, 33(1–2):23–60.
Steve Cassidy and Jonathan Harrington. 2001. Multi-level annotation in the Emu speech database management system. Speech Communication, 33(1–2):61–77.
David McKelvie, Amy Isard, Andreas Mengel, Morten B. Moller, Michael Grosse and Marion Klein. 2001. The MATE workbench: an annotation tool for XML coded speech corpora. Speech Communication, 33(1–2):97–112.
Goran Nenadic, Hideki Mima, Irena Spasic, Sophia Ananiadou and Jun-ichi Tsujii. 2002. Terminology-Driven Literature Mining and Knowledge Acquisition in Biomedicine. International Journal of Medical Informatics, 67:33–48.
Ralph Grishman. 1996. Building an Architecture: a CAWG Saga. Advances in Text Processing: Tipster Program Phase II, Morgan Kaufmann.
Steve Cassidy. 1999. Compiling Multi-tiered Speech Databases into the Relational Model: Experiments with the Emu System. 6th European Conference on Speech Communication and Technology (Eurospeech 99), 2127–2130, Budapest, Hungary.
Xiaoyi Ma, Haejoong Lee, Steven Bird and Kazuaki Maeda. 2002. Models and Tools for Collaborative Annotation. Third International Conference on Language Resources and Evaluation, 2066–2073.

58 UC Berkeley Biotext Project Acquiring Labeled Data using Citances

59 UC Berkeley Biotext Project A discovery is made … A paper is written …

60 UC Berkeley Biotext Project That paper is cited … and cited … … as the evidence for some fact(s) F.

61 UC Berkeley Biotext Project Each of these in turn is cited for some fact(s) … … until it is the case that all important facts in the field can be found in citation sentences alone!

62 UC Berkeley Biotext Project Citances Nearly every statement in a bioscience journal article is backed up with a cite. It is quite common for papers to be cited 30-100 times. The text around the citation tends to state biological facts. (Call these citances.) Different citances will state the same facts in different ways … … so can we use these for creating models of language expressing semantic relations?

