JHU WORKSHOP July 30th, 2003
Semantic Annotation – Week 3
Team: Louise Guthrie, Roberto Basili, Fabio Zanzotto, Hamish Cunningham, Kalina Boncheva, Jia Cui, Klaus Macherey, David Guthrie, Martin Holub, Marco Cammisa, Cassia Martin, Jerry Liu, Kris Haralambiev, Fred Jelinek
Our Hypotheses
● Transforming a corpus by replacing words and phrases with coarse semantic categories will help overcome the data sparseness problem encountered in language modeling
● Semantic category information will also help improve machine translation
● An initially noun-centric approach will allow bootstrapping to other syntactic categories
An Example
● Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday
● Humans aboard space_vehicle dodge satellite timeref
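A minimal sketch of this kind of category substitution, using a hypothetical head-word-to-category mapping; the lexicon and names below are illustrative only, while the actual system draws categories from LDOCE codes and GATE noun-phrase / named-entity annotations:

```python
# Minimal sketch: replace head words with coarse semantic categories.
# The mapping below is illustrative only; the actual system derives categories
# from LDOCE codes and from GATE noun-phrase / named-entity annotations.

CATEGORY_OF = {
    "astronauts": "Humans",
    "endeavor": "space_vehicle",   # would come from named-entity recognition in practice
    "friday": "timeref",
}

def categorize(tokens):
    """Replace tokens found in the category lexicon; leave the rest unchanged."""
    return [CATEGORY_OF.get(tok.lower(), tok) for tok in tokens]

sentence = "Astronauts aboard the space shuttle Endeavor dodge a derelict satellite Friday"
print(" ".join(categorize(sentence.split())))
# -> Humans aboard the space shuttle space_vehicle dodge a derelict satellite timeref
```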
Our Progress – Preparing the Data (Pre-Workshop)
● Identify a tag set
● Create a human-annotated corpus
● Create a doubly annotated corpus
● Process all data for named entity and noun phrase recognition using GATE tools
● Develop algorithms for mapping target categories to Wordnet synsets to support the tag set assessment
The Semantic Classes for Annotators
● A subset of the classes available in the electronic version of the Longman Dictionary of Contemporary English (LDOCE)
● Rationale:
  The number of semantic classes is small
  The classes are somewhat reliable, since a team of lexicographers used them to code
    Noun senses
    Adjective preferences
    Verb preferences
Semantic Classes
[Diagram: hierarchy of LDOCE semantic class codes – Abstract (T), Concrete (C), Animate (Q), Inanimate (I), Human (H), Animal (A), Plant (P), Solid (S), Liquid (L), Gas (G), Movable (N), Non-movable (J), plus the codes B, D, F and M – shown with the annotated evidence for the target classes, including PhysQuant (4) and Organic (5)]
More Categories
● U: Collective
● K: Male
● R: Female
● W: Not animate
● X: Not concrete or animal
● Z: Unmarked
● We allowed annotators to choose “none of the above” (shown as ? in the slides that follow)
Our Progress – Data Preparation
● Assess the annotation format, define uniform descriptions for irregular phenomena, and normalize them
● Determine the distribution of the tag set in the training corpus
● Analyze inter-annotator agreement
● Determine a reliable set of tags – T
● Parse all training data
Doubly Annotated Data
● Instances (headwords):
  8,950 instances without question marks
  8,446 of those are marked the same
● Inter-annotator agreement is 94% (83% when instances with question marks are included)
● Recall – these are non-named-entity noun phrases
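As a worked check of these figures, raw agreement is just the fraction of doubly annotated instances that received the same tag from both annotators: 8,446 / 8,950 ≈ 94.4%. A minimal sketch of that computation, assuming the two annotations are available as parallel lists of tags (the data layout is an assumption, not the workshop's actual file format):

```python
# Minimal sketch: raw inter-annotator agreement over doubly annotated instances.
# `tags_a` and `tags_b` are assumed to be parallel lists of category labels
# ("H", "T", "?", ...); this is not the workshop's actual file format.

def raw_agreement(tags_a, tags_b, ignore_question_marks=True):
    pairs = list(zip(tags_a, tags_b))
    if ignore_question_marks:
        pairs = [(a, b) for a, b in pairs if a != "?" and b != "?"]
    agreed = sum(1 for a, b in pairs if a == b)
    return agreed / len(pairs)

# With the counts reported above: 8,446 agreed out of 8,950 instances -> ~0.944
```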
Distribution of the Doubly Annotated Data
[Chart: distribution of categories over the doubly annotated data]

Agreement on Doubly Marked Instances
[Chart: agreement on the doubly marked instances]

Inter-annotator Agreement for Each Category
[Chart: inter-annotator agreement per category]

Category Distribution Among the Agreed Part
[Chart: category distribution over the agreed instances; the largest share is 69%]
A Few Statistics on the Human-Annotated Data
● Total annotated: 262,230 instances, of which 48,175 carry a ?
● 214,055 instances received a category; of those, Z accounts for .5%, W and X for .5%, and categories 4 and 5 for 1.6%
Our Progress – Baselines
● Determine baselines for automatic tagging of noun phrases
● Baselines for tagging observed words in new contexts (new instances of known words)
● Baselines for tagging unobserved words:
  Unseen words – not in the training material, but in the dictionary
  Novel words – in neither the training material nor the dictionary/Wordnet
Overlap of Dictionary and Head Nouns (in the BNC)
● 85% of NPs covered
● Only 33% of the vocabulary (words in both LDOCE and Wordnet) occurs in the NPs covered
Preparation of the Test Environment
● Selected the blind portion of the human-annotated data for late evaluation
● Divided the remaining corpus into training and held-out portions:
  Random division of files
  Unambiguous words for training – ambiguous words for testing
Baselines Using Only (Target) Words
Error Rate | Unseen words marked with | Method | Valid training instances | Blame
15.1% | the first class | MaxEntropy | count 3 | Klaus
12.6% | most frequent class | MaxEntropy | count 3 | Jerry
16% | most frequent class | VFI | all | Fabio
13% | most frequent class | NaiveBayes | all | Fabio
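A minimal sketch in the spirit of these word-only baselines: tag each head word with the class it most often received in training and mark unseen words with a default class. The data structures and the default tag are assumptions; the learners actually used were MaxEntropy, VFI, and Naive Bayes over the annotated corpus:

```python
# Minimal sketch of a word-only baseline: tag each head word with the class it
# most often received in training; unseen words get a default class ("T",
# Abstract, assumed here to be the most frequent class overall).
from collections import Counter, defaultdict

def train_most_frequent(training_pairs, default_tag="T"):
    """training_pairs: iterable of (headword, semantic_tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in training_pairs:
        counts[word.lower()][tag] += 1
    model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda word: model.get(word.lower(), default_tag)

tagger = train_most_frequent([("satellite", "N"), ("satellite", "N"), ("idea", "T")])
print(tagger("satellite"), tagger("spacecraft"))   # known word vs. unseen word -> N T
```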
Baselines Using Only (Target) Words and Preceding Adjectives
Error Rate | Unseen words marked with | Method | Valid training instances | Blame
13% | most frequent class | MaxEntropy | count 3 | Jerry
13.2% | most frequent class | MaxEntropy | all | Jerry
12.7% | most frequent class | MaxEntropy | count 3 | Jerry
Baselines Using Multiple Knowledge Sources
● Experiments in Sheffield
● Unambiguous tagger (assigns the only available semantic category)
● Bag-of-words tagger (IR-inspired; a sketch follows below):
  window size of 50 words
  nouns and verbs only
● Frequency-based tagger (assigns the most frequent semantic category)
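A minimal sketch of an IR-inspired bag-of-words tagger along these lines, assuming per-category bags of context words built from training instances; the scoring function and data layout are illustrative assumptions:

```python
# Minimal sketch of an IR-inspired bag-of-words semantic tagger: for each
# category keep counts of nouns/verbs seen in a +/-50-word window around
# training instances, then score candidate categories for a test instance by
# the overlap between their bags and the instance's context window.
from collections import Counter, defaultdict

WINDOW = 50  # window size from the slide; window extraction and POS filtering
             # are assumed to happen upstream

def build_bags(training_instances):
    """training_instances: iterable of (context_words, tag); the context is
    assumed already restricted to nouns and verbs within the window."""
    bags = defaultdict(Counter)
    for context, tag in training_instances:
        bags[tag].update(w.lower() for w in context)
    return bags

def tag_instance(context_words, candidate_tags, bags):
    """Pick the candidate category whose bag best overlaps the context."""
    ctx = Counter(w.lower() for w in context_words)
    def score(t):
        return sum(min(ctx[w], bags[t][w]) for w in ctx)
    return max(candidate_tags, key=score)
```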
Baselines Using Multiple Knowledge Sources (cont’d)
● Frequency-based tagger: 16–18% error rate
● Bag-of-words tagger: 17% error rate
● Combined architecture: % error rate
Bootstrapping to Unseen Words
● Problem: automatically identify the semantic class of words in LDOCE whose behavior was not observed in the training data
● Basic idea: use the unambiguous words (unambiguous with respect to our semantic tag set) to learn contexts for tagging unseen words
Bootstrapping: Statistics
● 6,656 different unambiguous lemmas in the (visible) human-tagged corpus...
● ...these contribute 166,249 instances of data...
● ...134,777 of those instances were considered correct by the annotators!
● Observation: unambiguous words can be used in the corpus in an “unforeseen” way
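A minimal sketch of how unambiguous lemmas could be harvested as training material, assuming a lexicon that maps each lemma to its set of possible semantic tags and a POS-tagged corpus; both formats are assumptions, not the workshop's actual resources:

```python
# Minimal sketch: harvest training contexts from unambiguous words.
# `lexicon` maps a lemma to the set of semantic tags it can take; a lemma with
# exactly one tag is unambiguous, so every corpus occurrence of it yields a
# (context, tag) training instance for the contextual classifier.

def harvest_training_instances(tagged_sentences, lexicon, window=1):
    """tagged_sentences: lists of (lemma, pos) pairs; yields (context, tag) pairs."""
    for sent in tagged_sentences:
        for i, (lemma, pos) in enumerate(sent):
            tags = lexicon.get(lemma)
            if pos.startswith("N") and tags is not None and len(tags) == 1:
                lo, hi = max(0, i - window), i + window + 1
                context = [w for j, (w, _) in enumerate(sent[lo:hi], lo) if j != i]
                yield context, next(iter(tags))
```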
Bootstrapping Baselines
Method | % correctly labelled instances
Assigning the most frequent semantic tag (i.e. Abstract) | 52%
One previous word (adjective, noun, or verb), Naive Bayes classifier | 45% (reliably tagged instances) / 44.3% (all instances)
One previous and one following word (adjective, noun, or verb), Naive Bayes classifier | 46.8% (reliably tagged instances) / 44.5% (all instances)
● Test instances (instances of ambiguous words): 62,853
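A minimal sketch of the one-previous / one-following content-word Naive Bayes baseline, trained on instances like those harvested above; the feature encoding and add-one smoothing are assumptions, not necessarily the configuration used in the workshop experiments:

```python
# Minimal sketch of a Naive Bayes tagger using the previous and the following
# content word as features, with add-one smoothing (an assumption).
import math
from collections import Counter, defaultdict

class ContextNaiveBayes:
    def __init__(self):
        self.tag_counts = Counter()
        self.feat_counts = defaultdict(Counter)   # tag -> feature counts
        self.vocab = set()

    def train(self, instances):
        """instances: iterable of (features, tag), e.g. (["prev=aboard", "next=dodge"], "H")."""
        for feats, tag in instances:
            self.tag_counts[tag] += 1
            for f in feats:
                self.feat_counts[tag][f] += 1
                self.vocab.add(f)

    def classify(self, feats):
        total = sum(self.tag_counts.values())
        vocab_size = len(self.vocab) or 1
        def log_prob(tag):
            lp = math.log(self.tag_counts[tag] / total)
            denom = sum(self.feat_counts[tag].values()) + vocab_size
            for f in feats:
                lp += math.log((self.feat_counts[tag][f] + 1) / denom)
            return lp
        return max(self.tag_counts, key=log_prob)

nb = ContextNaiveBayes()
nb.train([(["prev=derelict", "next=friday"], "N")])
print(nb.classify(["prev=derelict", "next=friday"]))   # -> N
```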
Metrics for Intrinsic Evaluation
● Need to take into account the hierarchical structure of the target semantic categories
● Two fuzzy measures, based on:
  dominance between categories
  edge distance in the category tree / graph
● Results with respect to inter-annotator agreement are almost identical to exact match
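A minimal sketch of an edge-distance-based fuzzy credit over a category tree; the toy hierarchy and the credit function below are illustrative assumptions, not the workshop's actual measures:

```python
# Minimal sketch of a fuzzy evaluation score based on edge distance in a
# category tree. The tiny hierarchy below is illustrative; credit decays
# with the number of edges between the gold and the predicted category.
PARENT = {                      # child -> parent in a toy LDOCE-style hierarchy
    "H": "Q", "A": "Q",         # Human, Animal -> Animate
    "Q": "C", "I": "C",         # Animate, Inanimate -> Concrete
    "C": None, "T": None,       # Concrete, Abstract are roots here
}

def path_to_root(tag):
    path = [tag]
    while PARENT.get(path[-1]) is not None:
        path.append(PARENT[path[-1]])
    return path

def edge_distance(gold, pred):
    gp, pp = path_to_root(gold), path_to_root(pred)
    common = set(gp) & set(pp)
    if not common:
        return len(gp) + len(pp)        # no shared ancestor in this toy tree
    lca = min(common, key=gp.index)     # lowest common ancestor
    return gp.index(lca) + pp.index(lca)

def fuzzy_credit(gold, pred):
    """1.0 for an exact match, decaying with tree distance."""
    return 1.0 / (1.0 + edge_distance(gold, pred))

print(fuzzy_credit("H", "H"), fuzzy_credit("H", "A"), fuzzy_credit("H", "T"))
```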
What’s Next
● Investigate the respective contribution of (independent) features
● Incorporate syntactic information
● Refine some coarse categories:
  Using subject codes
  Using genus terms
  Re-mapping via Wordnet
What’s Next (cont’d)
● Reduce the number of features/values via external resources:
  lexical vs. semantic models of the context
  use of selectional preferences
● Concentrate on complex cases (e.g. unseen words)
● Prepare test data for the extrinsic evaluation (MT)