Published by Juliet Washington. Modified over 9 years ago.
1
A Bootstrapping Method for Building Subjectivity Lexicons for Languages with Scarce Resources
Carmen Banea, Rada Mihalcea, University of North Texas (carmenb@unt.edu, rada@cs.unt.edu)
Janyce Wiebe, University of Pittsburgh (wiebe@cs.pitt.edu)
2
Subjectivity analysis
- Subjectivity analysis (opinions and sentiments) is used in a wide variety of applications:
  - Tracking sentiment timelines in news (Lloyd et al., 2005)
  - Review classification (Turney, 2002; Pang et al., 2002)
  - Mining opinions from product reviews (Hu and Liu, 2004)
  - Expressive text-to-speech synthesis (Alm et al., 2005)
  - Text semantic analysis (Wiebe and Mihalcea, 2006; Esuli and Sebastiani, 2006)
  - Question answering (Yu and Hatzivassiloglou, 2003)
- Much work on subjectivity analysis has focused on English
- Work on other languages: Japanese (Takamura et al., 2006), Chinese (Hu et al., 2005), German (Kim and Hovy, 2006)
3
Proportion of Languages on the Web
[Chart; source: internetworldstats.com, updated November 30, 2007]
4
Objective
Develop a method for subjectivity analysis that:
- Requires few electronic resources
- Can be easily ported to a new language
- Is applicable to the large number of languages that have scarce electronic resources
5
Related Work
- Tools that rely on manually or semi-automatically constructed lexicons (Yu and Hatzivassiloglou, 2003; Riloff and Wiebe, 2003; Kim and Hovy, 2006)
  - These lexicons enable efficient rule-based subjectivity and sentiment classifiers that rely on the presence of lexicon entries in text
- These tools assume the availability of:
  - Advanced language processing tools: syntactic parsers (Wiebe, 2000), information extraction (Riloff and Wiebe, 2003)
  - Broad-coverage, rich lexical resources: WordNet (Esuli and Sebastiani, 2006)
- Our approach relates most closely to the method of Turney (2002) for the construction of lexicons annotated for polarity
  - We address the task of acquiring a subjectivity lexicon
  - We rely on fewer, smaller-scale resources
6
Our Method
- Based on bootstrapping
- Requires:
  - A small seed set of subjective entries
  - One or more electronic dictionaries
  - A small training corpus (approx. 500,000 words)
- Experiments focused on Romanian
- Applicable to other languages as well
7
Bootstrapping Process
[Flowchart. Node labels: seeds; query; online dictionary; candidate synonyms; fixed filtering; selected synonyms; "Max. no. of iterations?" (yes/no); variable filtering]
8
Seed Set

Category    Sample entries (with their English translation)
Noun        blestem (curse), despot (tyrant), furie (fury), idiot (idiot), fericire (happiness)
Verb        iubi (love), aprecia (appreciate), spera (hope), dori (wish), uri (hate)
Adjective   frumos (beautiful), dulce (sweet), urat (ugly), fericit (happy), fascinant (fascinating)
Adverb      posibil (possibly), probabil (probably), desigur (of course), enervant (unnerving)

- 60 seeds, evenly sampled from verbs, nouns, adjectives and adverbs
- Manually selected
- Seed sources:
  - 11th-grade curriculum for Romanian Language and Literature
  - Translations of instances appearing in the OpinionFinder strong subjective lexicon (Wiebe and Riloff, 2005)
9
Expansion
- Romanian dictionary: http://www.dexonline.ro
- Dictionaries for other languages are also available, or can be obtained from paper dictionaries through OCR
- For each seed, candidate synonyms are collected from its dictionary definition:
  - All open-class words that have a definition in the dictionary and are longer than 3 letters
  - Diacritics are removed
[Diagram labels: Seed; Definition; Candidate synonyms]
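
The candidate-extraction step described on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the definition text is assumed to be already fetched from the dictionary, and the open-class check is omitted because it needs part-of-speech information that the slide does not specify.

```python
import re
import unicodedata
from typing import List


def strip_diacritics(text: str) -> str:
    """Remove combining marks after Unicode decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def candidates_from_definition(definition: str, min_length: int = 4) -> List[str]:
    """Return unique words of a definition that are longer than 3 letters.

    Assumption: only the diacritics and length filters are applied here;
    the open-class filter from the slide is not implemented.
    """
    normalized = strip_diacritics(definition.lower())
    tokens = re.findall(r"[a-z]+", normalized)
    seen = set()
    result = []
    for token in tokens:
        if len(token) >= min_length and token not in seen:
            seen.add(token)
            result.append(token)
    return result
```

For an invented definition such as "Care are gust plăcut, dulce", the sketch keeps care, gust, placut and dulce, and drops the three-letter word are.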
10
Filtering
- Candidates are filtered based on a measure of similarity with the original seeds
- We use Latent Semantic Analysis (LSA) (Dumais et al., 1988) trained on the SemCor corpus (Miller et al., 1993)
- After each iteration, only candidates with an LSA score higher than a given threshold are selected for further expansion
- Example:
  - Seed: dulce (sweet)
  - Candidate synonyms: cu gust dulce (sweet-tasting), placut (pleasant), dulceag (quasi-sweet)
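
A threshold filter of this kind could look roughly like the sketch below. It assumes that each word already has a vector in some LSA space, and that the similarity to "the original seeds" is the cosine between a candidate and the centroid of the seed vectors. The slide does not say how the seed set is represented, so the centroid is an assumption, and all names here are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def filter_candidates(
    candidates: List[str],
    seeds: List[str],
    vectors: Dict[str, Sequence[float]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep candidates whose similarity to the seed centroid exceeds the threshold.

    Assumption: words without a vector are silently dropped.
    """
    seed_centroid = centroid([vectors[s] for s in seeds if s in vectors])
    kept = []
    for word in candidates:
        if word in vectors:
            score = cosine(vectors[word], seed_centroid)
            if score > threshold:
                kept.append((word, score))
    # Highest similarity first; ties broken alphabetically.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))
```

In practice the vectors would come from an LSA model trained on a corpus; the toy dictionary of vectors is only a stand-in for that model.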
11
Filtering
- Several iterations of the bootstrapping process result in a subjectivity lexicon: a ranked list of candidates in decreasing order of similarity to the original seeds
- A variable filtering threshold can be used to further restrict the similarity, yielding a purer lexicon
- Filtering parameters:
  - Similarity threshold
  - Number of iterations
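
Putting expansion and filtering together, the overall loop might be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `lookup` stands in for a dictionary query that returns candidate words, `similarity` stands in for the LSA score against the seeds, seeds are given a score of 1.0, and every name is hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def bootstrap(
    seeds: Iterable[str],
    lookup: Callable[[str], Iterable[str]],
    similarity: Callable[[str], float],
    threshold: float = 0.5,
    max_iterations: int = 5,
) -> List[Tuple[str, float]]:
    """Expand seeds through dictionary lookups, keeping candidates above a fixed threshold."""
    lexicon: Dict[str, float] = {seed: 1.0 for seed in seeds}
    frontier = set(lexicon)
    for _ in range(max_iterations):
        next_frontier = set()
        for word in frontier:
            for candidate in lookup(word):
                if candidate in lexicon:
                    continue  # already accepted in an earlier step
                score = similarity(candidate)
                if score > threshold:  # fixed filtering
                    lexicon[candidate] = score
                    next_frontier.add(candidate)
        if not next_frontier:
            break  # nothing new to expand
        frontier = next_frontier
    # Ranked list, most similar first; ties broken alphabetically.
    return sorted(lexicon.items(), key=lambda item: (-item[1], item[0]))


def apply_variable_threshold(ranked: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Variable filtering: keep only entries at or above a stricter threshold."""
    return [word for word, score in ranked if score >= threshold]
```

Separating the two thresholds means the expensive expansion runs once, and stricter lexicons can then be cut from the same ranked list.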
12
Lexicon Acquisition
13
Evaluation
- Rule-based classifier of subjectivity (Riloff and Wiebe, 2003)
  - Subjective sentence: three or more subjective entries
  - Objective sentence: two or fewer subjective entries
- Gold standard data set (Mihalcea, Banea and Wiebe, 2007)
  - 504 sentences from five SemCor documents (manually translated into Romanian)
  - Labeled by two annotators
  - Agreement (all): 83% (κ = 0.67)
  - Agreement (uncertain removed): 89% (κ = 0.77)
- Baseline: 54% (all subjective)
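
The classification rule on this slide is simple enough to sketch directly. The version below assumes that lexicon entries are single lowercase words and that every matching token counts, including repeats. The slide does not settle either point, and the function names are hypothetical.

```python
import re
from typing import Set


def count_subjective_entries(sentence: str, lexicon: Set[str]) -> int:
    """Count tokens of the sentence that appear in the subjectivity lexicon."""
    tokens = re.findall(r"\w+", sentence.lower())
    return sum(1 for token in tokens if token in lexicon)


def classify(sentence: str, lexicon: Set[str], min_entries: int = 3) -> str:
    """Label a sentence subjective if it has at least `min_entries` lexicon hits."""
    hits = count_subjective_entries(sentence, lexicon)
    return "subjective" if hits >= min_entries else "objective"
```

With a toy lexicon of {frumos, dulce, fericit}, a sentence containing all three words is labeled subjective, and a sentence containing only one of them is labeled objective.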
14
Number of Iterations
[Figure: F-measure for the bootstrapping subjectivity lexicon over 5 iterations, at an LSA threshold of 0.5]
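
For reference, the F-measure is conventionally the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts in the usage line are invented for illustration and are not results from this work.

```python
def f_measure(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Balanced F-measure (F1) from raw counts; 0.0 when undefined."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 6 correct subjective labels, 2 false alarms, 4 misses.
example_score = f_measure(6, 2, 4)
```

Here precision is 0.75 and recall is 0.6, so the example score is 2/3.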
15
Similarity Threshold
[Figure: F-measure for the fifth bootstrapping iteration for varying LSA scores]
16
Comparison
Bootstrapping rule-based classifier: uses a 3,913-entry subjectivity lexicon obtained through 5 iterations and a similarity threshold of 0.5
17
Conclusions
- Our bootstrapping method uses few electronic resources:
  - A small seed set
  - One or more dictionaries
  - A small corpus of half a million words
- A large subjectivity lexicon of approx. 4,000 entries was extracted
- Using an unsupervised rule-based classifier, a subjectivity F-measure of 66.20% and an overall F-measure of 61.69% can be achieved
18
Questions?