
1 A Novel Lexicalized HMM-based Learning Framework for Web Opinion Mining Wei Jin Department of Computer Science, North Dakota State University, USA Hung Hay Ho Department of Computer Science & Engineering, State University of New York at Buffalo, USA

2 Outline Motivation Related Work A Lexicalized HMM-based Learning Framework Experimental Results Conclusion

3 Introduction
Two main types of information on the Web:
 Facts (what Google searches for)
 Opinions (which Google does not search for)
Opinions are hard to express with a few keywords, and current search ranking strategies are not appropriate for opinion search.
E-commerce is increasingly popular:
 More and more products are sold on the web.
 More and more people buy products online.
 Customers share their opinions and hands-on experiences with products.
Reading through all customer reviews is difficult; for popular items, the number of reviews can reach hundreds or even thousands.

4 Introduction – Applications
Businesses and organizations: product and service benchmarking; market intelligence.
 Businesses spend a huge amount of money to learn consumer sentiments and opinions.
 Consultants, surveys, focus groups, etc.
Ad placement: placing ads in user-generated content.
 Place an ad when someone praises a product.
 Place an ad from a competitor when someone criticizes a product.

5 Design Goal
Design a framework capable of
 extracting, learning and classifying product entities and opinion expressions automatically from product reviews.
 E.g., "Battery life is quite impressive."
Three main tasks
Task 1: Identifying and extracting product entities that have been commented on in each review.
Task 2: Identifying and extracting opinion sentences describing the opinions for each recognized product entity.
Task 3: Determining whether the opinions on the product entities are positive or negative.

6 Related Work
Zhuang et al., 2006
 Classified and summarized movie reviews by extracting high-frequency feature keywords and high-frequency opinion keywords.
 Feature-opinion pairs were identified using a dependency grammar graph.
Popescu and Etzioni, 2005
 Proposed a relaxation labeling approach to find the semantic orientation of words.
 However, their approach only extracted feature words with frequency greater than an experimentally set threshold and ignored low-frequency feature words.
Hu and Liu, 2004
 Proposed a statistical approach capturing high-frequency feature words using association rules.
 Infrequent feature words are captured by extracting noun phrases adjacent to known opinion words.
 A summary is generated using high-frequency feature words, ignoring infrequent features.
 However, the ability to recognize phrase features is limited by the accuracy of recognizing noun-group boundaries, and the approach lacks an effective way to address infrequent features.
Our approach
 Proposes a framework that naturally integrates multiple linguistic features into automatic learning.
 Identifies complex product-specific features and infrequently mentioned features (which may be low-frequency phrases in the reviews).
 Self-learns and predicts new features based on the patterns seen in the training data.

7 The Proposed Technique
Lexicalized HMMs
 Previously used for Part-of-Speech (POS) tagging and the Named Entity Recognition (NER) problem.
 Correlating the web opinion mining task with POS tagging and NER is itself a significant contribution of this work.

8 Task 1: Entity Categories and Tag Sets

9 Task 1: Basic Tag Set and Pattern Tag Set

10 An entity can be a single word or a phrase.
 A word w in an entity may take one of the following four patterns:
w is an independent entity;
w is the beginning component of an entity;
w is in the middle of an entity;
w is at the end of an entity.
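The four positional patterns above resemble BIO-style chunk tagging. A minimal sketch of assigning them to an entity's words (the tag names here are illustrative, not the paper's exact symbols):

```python
def pattern_tags(entity_words):
    """Assign one of the four positional patterns to each word of an entity.
    Tag names (INDEPENDENT/BEGIN/MIDDLE/END) are our illustrative labels."""
    n = len(entity_words)
    if n == 1:
        return ["INDEPENDENT"]  # single-word entity stands alone
    # first word begins the entity, last word ends it, the rest are middle
    return ["BEGIN"] + ["MIDDLE"] * (n - 2) + ["END"]
```

For example, the phrase entity "auto focus mode" would receive BEGIN, MIDDLE, END.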

11 An Example of Hybrid Tag and Basic Tag Representation
I love the ease of transferring the pictures to my computer.
Hybrid tags:
 I love the ease of transferring the pictures to my computer
Basic tags:
 I love the ease of transferring the pictures to my computer

12 Task 1: Lexicalized HMMs
Integrate linguistic features
 Part-of-Speech (POS), lexical patterns
 An observable state: a pair (word_i, POS(word_i))
Given a sequence of words W = w_1 w_2 w_3 … w_n and corresponding parts-of-speech S = s_1 s_2 s_3 … s_n, find an appropriate sequence of hybrid tags that maximizes the conditional probability P(T|W,S).

13 Task 1: Approximations to the General Model
First approximation: first-order HMMs based on the independence hypothesis
P(t_i | t_{i-K} … t_{i-1}) ≈ P(t_i | t_{i-1})
Second approximation: combines the POS information with the lexicalization technique:
1. The assignment of the current tag t_i is supposed to depend not only on its previous tag t_{i-1} but also on the previous J (1 ≤ J ≤ i-1) words w_{i-J} … w_{i-1}.
2. The appearance of the current POS s_i is supposed to depend both on the current tag t_i and on the previous L (1 ≤ L ≤ i-1) words w_{i-L} … w_{i-1}.
3. The appearance of the current word w_i is assumed to depend not only on the current tag t_i and current POS s_i, but also on the previous K (1 ≤ K ≤ i-1) words w_{i-K} … w_{i-1}.
Maximum Likelihood Estimation: estimate the parameters.
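Putting the three dependency assumptions together, one plausible factorization of the model (our reconstruction from the bullets above, not necessarily the paper's exact equation) is:

```latex
\hat{T} = \arg\max_{T} \prod_{i=1}^{n}
  P\!\left(t_i \mid t_{i-1},\, w_{i-J} \dots w_{i-1}\right)\cdot
  P\!\left(s_i \mid t_i,\, w_{i-L} \dots w_{i-1}\right)\cdot
  P\!\left(w_i \mid t_i,\, s_i,\, w_{i-K} \dots w_{i-1}\right)
```

Each factor mirrors one numbered assumption: the tag transition, the POS emission, and the word emission, all conditioned on the preceding lexical context.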

14 Task 1: Smoothing Technique
Solves the zero probabilities that MLE assigns to cases not observed in the training data.
 Linear interpolation smoothing: smooth higher-order models with their relevant lower-order models, or smooth the lexicalized parameters with the related non-lexicalized probabilities, where λ, β and α denote the interpolation coefficients.
Decoding of the best hybrid tag sequence: the Viterbi algorithm.
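A minimal sketch of the decoding step: first-order Viterbi over hybrid tags, with the transition probability linearly interpolated against the tag unigram so unseen transitions do not zero out a path. The data structures and the single coefficient `lam` are our simplification of the slide's λ/β/α scheme, not the paper's implementation.

```python
def viterbi(obs, tags, trans, emit, unigram, lam=0.9):
    """First-order Viterbi with linearly interpolated transitions.
    trans[(t_prev, t)], emit[(t, o)] and unigram[t] hold MLE probabilities;
    lam is an illustrative interpolation coefficient."""
    def p_trans(tp, t):
        # smooth the bigram transition with the unigram tag probability
        return lam * trans.get((tp, t), 0.0) + (1 - lam) * unigram.get(t, 0.0)

    # initialise with the unigram prior times the first emission
    V = [{t: unigram.get(t, 0.0) * emit.get((t, obs[0]), 0.0) for t in tags}]
    back = []
    for o in obs[1:]:
        col, bp = {}, {}
        for t in tags:
            best_tp = max(tags, key=lambda tp: V[-1][tp] * p_trans(tp, t))
            col[t] = V[-1][best_tp] * p_trans(best_tp, t) * emit.get((t, o), 0.0)
            bp[t] = best_tp
        V.append(col)
        back.append(bp)
    # follow back-pointers from the best final tag
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for bp in reversed(back):
        path.append(bp[path[-1]])
    return path[::-1]
```

With probability tables estimated from tagged reviews, the returned path is the best hybrid-tag sequence for the observation sequence.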

15 Task 2: Identifying and Extracting Opinion Sentences
Opinion sentences: sentences that express an opinion on product entities.
Not considered effective opinion sentences:
 Sentences that describe product entities without expressing reviewers' opinions.
 Sentences that express opinions on another product model's entities (model numbers, such as DMC-LS70S, P5100, A570IS, are identified using the regular expression "[A-Z-]+\d+([A-Za-z]+)?").
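The model-number rule can be sketched directly with the slide's regular expression; the helper function and its name are ours, a simplified reading of the filtering rule:

```python
import re

# regular expression for model numbers, taken verbatim from the slide
MODEL_RE = re.compile(r"[A-Z-]+\d+([A-Za-z]+)?")

def mentions_other_model(sentence, review_model):
    """Return True if the sentence contains a model number different from
    the model the review is about (such sentences are filtered out)."""
    return any(m.group(0) != review_model
               for m in MODEL_RE.finditer(sentence))
```

So a review of the DMC-LS70S that praises the P5100 in passing would have that sentence excluded.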

16 Task 3: Opinion Orientation Classification
Opinion orientation: not simply equal to the orientation of the opinion entity (word/phrase).
 E.g., "I can tell you right now that the auto mode and the program modes are not that good." (negative comment on both "auto mode" and "program modes")
Natural language rules reflecting sentence context
 For each recognized product entity, search for its matching opinion entity (defined as the nearest opinion word/phrase identified by the tagger).
 The orientation of this matching opinion entity becomes the initial opinion orientation for the corresponding product entity.
 Natural language rules then handle specific language constructs that may change the opinion orientation, such as the presence of negation words (e.g., not).

17 Task 3: Opinion Orientation Classification
Lines 8 to 23: search for any negation words (e.g., not, didn't, don't) within a five-word distance in front of an opinion entity, and change the opinion orientation accordingly, except when
 a negation word appears in front of a coordinating conjunction (e.g., and, or, but) (lines 10-13), or
 a negation word appears after a product entity during the backward search within the five-word window (lines 14-17).
Lines 27 to 32: handle the coordinating conjunction "but" and prepositions such as "except" and "apart from".
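The negation-window rule above can be sketched as follows. This is our simplification: orientation is ±1, the word lists are illustrative, and only the coordinating-conjunction exception is implemented (the product-entity exception would need entity positions as extra input).

```python
# illustrative word lists, not the paper's exact lexicons
NEGATIONS = {"not", "didn't", "don't", "no", "never"}
CONJUNCTIONS = {"and", "or", "but"}

def flip_orientation(words, opinion_idx, orientation, window=5):
    """Flip the opinion orientation (+1/-1) if a negation word occurs within
    `window` words before the opinion entity, scanning backwards and stopping
    at a coordinating conjunction (the slide's first exception)."""
    start = max(0, opinion_idx - window)
    for i in range(opinion_idx - 1, start - 1, -1):
        w = words[i].lower()
        if w in CONJUNCTIONS:
            break  # a negation before the conjunction does not scope over this opinion
        if w in NEGATIONS:
            return -orientation
    return orientation
```

On the slide's example, "not" within five words of "good" flips the initially positive orientation to negative.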

18 Experiments
Corpus: Amazon's digital camera reviews.
Experimental design
 The reviews of the first 16 unique cameras listed on Amazon.com during November 2007 were crawled.
 Each individual review's content, model number and manufacturer name were extracted.
 Sentence segmentation
 POS parsing
Training design: 1728 review documents, in two sets
 One set (293 documents for 6 cameras): manually tagged by experts.
 The remaining documents (1435 documents for 10 cameras): used in the bootstrapping process.
Two challenges
 Inconsistent terminologies describing product entities, e.g., battery cover, battery/SD door, battery/SD cover, battery/SD card cover.
 Recognizing rarely mentioned entities (i.e., infrequent entities), e.g., only two reviewers mentioned "poor design on battery door".
Bootstrapping
 Reduces the effort of manually labeling a large set of training documents.
 Enables self-directed learning in situations where collecting a large training set could be expensive and difficult to accomplish.

19 Bootstrapping
1. The parent process (Master) coordinates the bootstrapping process, and extracts and distributes high-confidence data to each worker; the rest are workers (two child processes).
2. Split the training documents into two halves, t1 and t2, used as seeds for each worker's HMM.
3. Each worker trains its own HMM classifier and tags the bootstrap document set.
4. Master inspects each sentence tagged by each HMM classifier and extracts the opinion sentences agreed upon by both classifiers; each newly discovered sentence is stored in the database.
5. Master randomly splits the newly discovered data into two halves t1' and t2', and adds t1' and t2' to the training sets of the two workers respectively.
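The five steps above can be sketched as a single-process loop. This is our simplification: `train_fn` and `tag_fn` stand in for HMM training and tagging, agreement between the two classifiers plays the role of Master's high-confidence filter, and a fixed round limit replaces whatever stopping rule the paper used.

```python
def bootstrap(seed_docs, unlabeled, train_fn, tag_fn, rounds=5):
    """Two-worker bootstrapping sketch.
    train_fn(docs) -> classifier; tag_fn(clf, doc) -> list of tagged sentences.
    Returns the high-confidence sentences both workers agreed on."""
    half = len(seed_docs) // 2
    t1, t2 = seed_docs[:half], seed_docs[half:]   # step 2: split seeds
    discovered = []
    for _ in range(rounds):
        c1, c2 = train_fn(t1), train_fn(t2)       # step 3: train both workers
        new = []
        for doc in unlabeled:
            s1, s2 = tag_fn(c1, doc), tag_fn(c2, doc)
            # step 4: keep only sentences both classifiers agree on
            new.extend(s for s in s1 if s in s2 and s not in discovered)
        if not new:
            break
        discovered.extend(new)
        # step 5: split the new data between the two workers' training sets
        mid = len(new) // 2
        t1 = t1 + new[:mid]
        t2 = t2 + new[mid:]
    return discovered
```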

20 Bootstrapping results

21 Evaluation
The review documents for 6 cameras were manually labeled by experts.
 The largest four data sets (containing 270 documents): a 4-fold cross-validation.
 The remaining review documents for 2 cameras (containing 23 documents): for training only.
 The bootstrap document set (containing 1435 documents for 10 cameras): used to extract high-confidence data through self-learning.
Evaluation
 Recall, precision and F-score.
 The results tagged by the system are compared with the manually tagged truth data; only an exact match is considered a correct recognition.
 Entity recognition: the exact same word/phrase is identified and classified correctly; each identified entity occurs in the same sentence, same position and same document as in the truth data.
 Opinion sentence extraction: the exact same sentence from the same document.
 Opinion orientation classification: the exact same entity and entity category are identified with the correct orientation (positive or negative).
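Under the exact-match criterion, the metrics reduce to set intersection over extraction tuples. A minimal sketch (the tuple layout is our illustration of the matching criteria):

```python
def prf(predicted, truth):
    """Precision, recall and F-score under exact-match scoring: a prediction
    counts only if the identical tuple (e.g. document, sentence, entity)
    appears in the truth data."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                      # exact matches only
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(truth) if truth else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```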

22 Baseline
A rule-based baseline system
 Motivated by the approaches of (Turney 2002) and (Hu and Liu 2004).
 Uses a number of rules to identify opinion-bearing words.
Matching nouns: product entities.
Matching adjectives: opinion words.
 Adjectives' semantic orientations are determined using the bootstrapping technique proposed in (Hu and Liu 2004):
 Twenty-five commonly used positive adjectives and twenty-five commonly used negative adjectives as seeds.
 Expand these two seed lists by searching synonyms and antonyms for each seed word.
 Newly discovered words are added into their corresponding seed lists.
 The orientations of extracted adjectives: determined by checking the existence of these words in the lists.
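The seed-list expansion can be sketched as a fixed-point loop: synonyms inherit a seed's orientation, antonyms take the opposite. The `syn`/`ant` lookup dicts here stand in for a real lexical resource such as WordNet (an assumption; Hu and Liu used WordNet, but the exact iteration scheme below is ours).

```python
def expand_orientation_lists(pos_seeds, neg_seeds, syn, ant, rounds=3):
    """Expand positive/negative adjective lists from seeds.
    syn[w] / ant[w] map a word to its synonyms / antonyms."""
    pos, neg = set(pos_seeds), set(neg_seeds)
    for _ in range(rounds):
        # synonyms keep the orientation, antonyms of the other list flip it
        new_pos = ({s for w in pos for s in syn.get(w, [])} |
                   {a for w in neg for a in ant.get(w, [])})
        new_neg = ({s for w in neg for s in syn.get(w, [])} |
                   {a for w in pos for a in ant.get(w, [])})
        if new_pos <= pos and new_neg <= neg:
            break  # fixed point reached, nothing new discovered
        pos |= new_pos
        neg |= new_neg
    return pos, neg
```

Classifying an extracted adjective then amounts to a membership check against the two expanded lists.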

23 Evaluation Results

24

25

26

27 Conclusion
Proposes a new and robust machine learning approach for web opinion mining and extraction.
 Self-learns new vocabularies based on observed patterns (not supported by rule-based approaches).
 Identifies complex product entities and opinion expressions, as well as infrequently mentioned entities.
 Offers a bootstrapping approach for situations where collecting a large training set could be expensive and difficult to accomplish.
Future directions
 Expansion of the datasets.
 Researching the role of pronoun resolution in improving the mining results.

