Managing Shared Information: Statistical Methods for Validating, Integrating, and Analyzing Text Data from Multiple Sources ChengXiang (“Cheng”) Zhai Department.

Slides:



Advertisements
Similar presentations
( · ). Unit 3 Welcome to the unit A world of connections.
Advertisements

KDD 2011 Summary of Text Mining sessions Hongbo Deng.
Search Engine Optimisation (SEO) by Graham Sowerby (28 th November 2013)
Introduction to ReviewMiner Hongning Wang Department of Computer Science University of Illinois at Urbana-Champaign
1 Opinion Integration and Summarization ChengXiang (“Cheng”) Zhai Department of Computer Science Graduate School of Library & Information Science Institute.
Exploiting Structured Ontology to Organize Scattered Online Opinions Yue Lu, Huizhong Duan, Hongning Wang, ChengXiang Zhai University of Illinois at Urbana-Champaign.
1.Accuracy of Agree/Disagree relation classification. 2.Accuracy of user opinion prediction. 1.Task extraction performance on Bing web search log with.
COMP423 Intelligent Agents. Recommender systems Two approaches – Collaborative Filtering Based on feedback from other users who have rated a similar set.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, Introduction to IR Research ChengXiang Zhai Department of Computer.
Master the MULTI-SCREEN WORLD. AGENDA  What is a multi-screen website  The growing importance of multi-screen sites  What Google recommends  Turning.
Ao-Jan Su † Y. Charlie Hu ‡ Aleksandar Kuzmanovic † Cheng-Kok Koh ‡ † Northwestern University ‡ Purdue University How to Improve Your Google Ranking: Myths.
KDD 2011 Research Poster Content - Driven Trust Propagation Framwork V. G. Vinod Vydiswaran, ChengXiang Zhai, and Dan Roth University of Illinois at Urbana-Champaign.
Latent Aspect Rating Analysis without Aspect Keyword Supervision Hongning Wang, Yue Lu, ChengXiang Zhai Department of.
 Jan. 27, 2010 Apple CEO Steve Jobs, announced the release of the much anticipated iPad. As with most Apple products, the iPad was kept in secrecy until.
An Overview of Text Mining Rebecca Hwa 4/25/2002 References M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37 th Annual Meeting of.
2008 © ChengXiang Zhai 1 Contextual Text Analysis with Probabilistic Topic Models ChengXiang Zhai Department of Computer Science Graduate School of Library.
2010 © University of Michigan 1 Text Retrieval and Data Mining in SI - An Introduction Qiaozhu Mei School of Information Computer Science and Engineering.
Statistical Topic Models for Integrating and Analyzing Opinions in Blog articles Yue Lu Qiaozhu Mei ChengXiang Zhai.
In Situ Evaluation of Entity Ranking and Opinion Summarization using Kavita Ganesan & ChengXiang Zhai University of Urbana Champaign
Statistical Methods for Integration and Analysis of Online Opinionated Text Data ChengXiang (“Cheng”) Zhai Department of Computer Science University of.
Mining and Summarizing Customer Reviews
More than words: Social networks’ text mining for consumer brand sentiments A Case on Text Mining Key words: Sentiment analysis, SNS Mining Opinion Mining,
1 CADE Finance and HR Reports Administrative Staff Leadership Conference Presenter: Mary Jo Kuffner, Assistant Director Administration.
Temporal Event Map Construction For Event Search Qing Li Department of Computer Science City University of Hong Kong.
Attention and Event Detection Identifying, attributing and describing spatial bursts Early online identification of attention items in social media Louis.
SOCIAL MEDIA FOR BUSINESS reqSmart. Some Facts about Social Media - I Years to reach 50 million users. Radio – 38 years Television – 13 years Internet.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, Pick a Good IR Research Problem ChengXiang Zhai Department of Computer.
Technology Tools. Tablets A tablet computer is a general use machine that utilizes a touch screen as opposed to a keyboard or mouse. Tablets are easy.
MINING MULTI-FACETED OVERVIEWS OF ARBITRARY TOPICS IN A TEXT COLLECTION Xu Ling, Qiaozhu Mei, ChengXiang Zhai, Bruce Schatz Presented by: Qiaozhu Mei,
The Five Step Sales Process The Five Step Sales Process Step One: Plan and Prepare May 11, 2011.
2008 © ChengXiang Zhai Dragon Star Lecture at Beijing University, June 21-30, Prepare Yourself for IR Research ChengXiang Zhai Department of Computer.
Utilizing Electronic Communication and the Digital World to Replace Traditional Instruction at the College and Implications for K – 12 Education.
 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.
Master the MULTI-SCREEN WORLD. AGENDA What is a multi-screen website? The growing importance of multi-screen sites What Google recommends What Google.
Proposal for Term Project J. H. Wang Mar. 2, 2015.
Real World IR Challenges (CS598-CXZ Advanced Topics in IR Presentation) Jan. 20, 2005 ChengXiang Zhai Department of Computer Science University of Illinois,
1 Rated Aspect Summarization of Short Comments Yue Lu, ChengXiang Zhai, and Neel Sundaresan Presented by: Sapan Shah.
MODEL ADAPTATION FOR PERSONALIZED OPINION ANALYSIS MOHAMMAD AL BONI KEIRA ZHOU.
There are many reasons why I hate my apartment. I moved into my apartment last year. Last year was a terrible year because I lost my job. I’ve been looking.
WEB 2.0 PATTERNS Carolina Marin. Content  Introduction  The Participation-Collaboration Pattern  The Collaborative Tagging Pattern.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
KDD 2011 Doctoral Session Modeling Trustworthiness of Online Content V. G. Vinod Vydiswaran Advisors: Prof.ChengXiang Zhai, Prof.Dan Roth University of.
By Abanob Khalil CSC 490.  What is a smart phone?  Top Smart Phones 2015  Features, More features  Comparing between the Iphone Vs Galaxy  Negative.
1 Generating Comparative Summaries of Contradictory Opinions in Text (CIKM09’)Hyun Duk Kim, ChengXiang Zhai 2010/05/24 Yu-wen,Hsu.
Automating Readers’ Advisory to Make Book Recommendations for K-12 Readers by Alicia Wood.
Automatic Labeling of Multinomial Topic Models
Extracting and Ranking Product Features in Opinion Documents Lei Zhang #, Bing Liu #, Suk Hwan Lim *, Eamonn O’Brien-Strain * # University of Illinois.
User Modeling and Recommender Systems: recommendation algorithms
Automatic Labeling of Multinomial Topic Models Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai DAIS The Database and Information Systems Laboratory.
Text Information Management ChengXiang Zhai, Tao Tao, Xuehua Shen, Hui Fang, Azadeh Shakery, Jing Jiang.
Grade 4 Zinck PowerPoint overview of Grade Four Overview of class website Question and Answer Tonight ’ s Tasks…
2014 Lexicon-Based Sentiment Analysis Using the Most-Mentioned Word Tree Oct 10 th, 2014 Bo-Hyun Kim, Sr. Software Engineer With Lina Chen, Sr. Software.
Big Data Javad Azimi May First of All… Sorry about the language  Feel free to ask any question Please share similar experiences.
温州市实验中学 陈玫月. Give opinions in different ways. I think students should be allowed to …. I don’t think students should …. I agree / disagree that … I think.
Research Progress Kieu Que Anh School of Knowledge, JAIST.
Lecture-6 Bscshelp.com. Todays Lecture  Which Kinds of Applications Are Targeted?  Business intelligence  Search engines.
Unlock your Big Data with Analytics and BI on Office365 Brian Culver ● SharePoint Fest Seattle● BI102 ● August 18-20, 2015.
COMP423 Intelligent Agents. Recommender systems Two approaches – Collaborative Filtering Based on feedback from other users who have rated a similar set.
Master the MULTI-SCREEN WORLD.
Proposal for Term Project
Statistical Learning Methods for Natural Language Processing on the Internet 徐丹云.
Memory Standardization
Master the MULTI-SCREEN WORLD.
Text Retrieval and Data Mining in SI - An Introduction
Aspect-based sentiment analysis
Latent Aspect Rating Analysis
Weakly Learning to Match Experts in Online Community
How to read FOR 8th grade AND BEYOND
iSRD Spam Review Detection with Imbalanced Data Distributions
Understanding User Intentions by Computational Techniques
Presentation transcript:

Managing Shared Information: Statistical Methods for Validating, Integrating, and Analyzing Text Data from Multiple Sources ChengXiang (“Cheng”) Zhai Department of Computer Science University of Illinois at Urbana-Champaign 1 AFOSR Information Operations and Security Annual PI Meeting, Arlington, VA, Aug. 7, 2013

Managing Shared Information 2 Information Sharing Network Decision=? Information Need Information Nuggets How to manage shared information? - validate textual claims? - integrate scattered opinions? - summarize & analyze opinions?

Opinionated text data are very useful for decision support “Which cell phone should I buy?” “What are the winning features of iPhone over blackberry?” “How do people like this new drug?” “How is Obama’s health care policy received?” “Which presidential candidate should I vote for?” … Opinionated Text DataDecision Making & Analytics 3

How can I digest them all? However, a user will be overwhelmed with lots of text data… How can I integrate and organize all opinions? What can I believe ? Any pattern? 4

Research Questions How can we integrate scattered opinions? How can we validate and summarize opinions? How can we analyze online opinions to discover patterns? How can we do all these in a general way with no or minimum human effort? –Must work for all topics –Must work for different natural languages 5 Solutions: Statistical Methods for Text Data Mining

Rest of the talk: general methods for 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis 6

Outline 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis 7

How to digest all scattered opinions? 190,451 posts 4,773,658 results Need tools to automatically integrate all scattered opinions 8

4,773,658 results Observation: two kinds of opinions Expert opinions CNET editor’s review Wikipedia article Well-structured Easy to access Maybe biased Outdated soon 190,451 posts Ordinary opinions Forum discussions Blog articles Represent the majority Up to date Hard to access fragmented Can we combine them? 9

Opinion Integration Strategy 1 [Lu & Zhai WWW 2008] Align scattered opinions with well-structured expert reviews Yue Lu, ChengXiang Zhai. Opinion Integration Through Semi-supervised Topic Modeling, Proceedings of the World Wide Conference 2008 ( WWW'08), pages

11 Review-Based Opinion Integration cute… tiny…..thicker.. last many hrs die out soon could afford it still expensive Design Battery Price.. Design Battery Price.. Topic: iPod Expert review with aspects Text collection of ordinary opinions, e.g. Weblogs Integrated Summary Design Battery Price Design Battery Price iTunes … easy to use… warranty …better to extend.. Review Aspects Extra Aspects Similar opinions Supplementary opinions Input Output

Results: Product (iPhone) Opinion Integration with review aspects Review articleSimilar opinionsSupplementary opinions You can make emergency calls, but you can't use any other functions… N/A… methods for unlocking the iPhone have emerged on the Internet in the past few weeks, although they involve tinkering with the iPhone hardware… rated battery life of 8 hours talk time, 24 hours of music playback, 7 hours of video playback, and 6 hours on Internet use. iPhone will Feature Up to 8 Hours of Talk Time, 6 Hours of Internet Use, 7 Hours of Video Playback or 24 Hours of Audio Playback Playing relatively high bitrate VGA H.264 videos, our iPhone lasted almost exactly 9 freaking hours of continuous playback with cell and WiFi on (but Bluetooth off). Unlock/hack iPhone Activation Battery Confirm the opinions from the review Additional info under real usage 12

Results: Product (iPhone) Opinions on extra aspects supportSupplementary opinions on extra aspects 15You may have heard of iASign … an iPhone Dev Wiki tool that allows you to activate your phone without going through the iTunes rigamarole. 13Cisco has owned the trademark on the name "iPhone" since 2000, when it acquired InfoGear Technology Corp., which originally registered the name. 13With the imminent availability of Apple's uber cool iPhone, a look at 10 things current smartphones like the Nokia N95 have been able to do for a while and that the iPhone can't currently match... Another way to activate iPhone iPhone trademark originally owned by Cisco A better choice for smart phones? 13

As a result of integration… What matters most to people? Price Bluetooth & Wireless Activation 14

Sample Results of “Obama” 15 Wikipedia article about “Barack Obama” Relevant discussions in blog

4,773,658 results What if we don’t have expert reviews? Expert opinions CNET editor’s review Wikipedia article Well-structured Easy to access Maybe biased Outdated soon 190,451 posts Ordinary opinions Forum discussions Blog articles Represent the majority Up to date Hard to access fragmented How can we organize scattered opinions? Exploit online ontology! 16

Opinion Integration Strategy 2 [Lu et al. COLING 2010] Organize scattered opinions using an ontology Yue Lu, Huizhong Duan, Hongning Wang and ChengXiang Zhai. Exploiting Structured Ontology to Organize Scattered Online Opinions, Proceedings of COLING 2010 (COLING 10), pages

Sample Ontology: 18

Ontology-Based Opinion Integration Topic = “Abraham Lincoln” (Exists in ontology) Aspects from Ontology (more than 50) Scattered Opinion Sentences Professions Quotations Parents … … Date of Birth Place of Death Professions Quotations Subset of Aspects Matching Opinions Ordered to optimize readability 19

Sample Results: FreeBase + Blog 20 Ronald Reagan Sony Cybershot DSC-W200

More opinion integration results are available at:

Opinion Validation: What to Believe? [Vydiswaran et al. KDD 2011] A General Framework for Textual Claim Validation 22 V.G.Vinod Vydiswaran, ChengXiang Zhai, Dan Roth, Content-driven Trust Propagation Framework, Proc ACM SIGKDD Int. Conf.. Knowledge Discovery and Data Mining (KDD'11), Aug

Motivation How does a user know what to believe and what to be trusted? Task: annotate textual claims and their sources with trustworthiness scores Solution: a general content-based trust framework, capturing the following intuition: –Not all claims are equally trustable –Not all sources are equally trustable –Trustable sources tend to provide trustable claims 23

Content-Based Trust Propagation Framework First trust framework to model trust of textual claims directly Combines multiple evidences (source trust, evidence trust, and claim trust) in a unified framework to enable mutual reinforcement of different scores General and thus applicable to multiple domains Claim 1 Claim n Claim Textual Support Evidence Textual Claims (e.g. blog articles) Different Sources (e.g., news, Web) 24

Sample Results of the New Framework Claim 1 Claim n Claim EvidenceClaims Sources Web sources Evidence passages Claim sentences Advantages over traditional models Claim 1 Claim n Claim 2 … [Yin, et al., 2007; Pasternack & Roth, 2010] Which news source should you trust? It depends on the genres! Incorporating the trust model improves retrieval accuracy for news 25

Outline 1. Opinion Integration & validation 2. Opinion Summarization 3. Opinion Analysis 26

Need for opinion summarization 1,432 customer reviews How can we help users digest these opinions? How can we help users digest these opinions? 27

Nice to have…. Can we do this in a general way? 28

Opinion Summarization 1: [Mei et al. WWW 07] Multi-Aspect Topic Sentiment Summarization Qiaozhu Mei, Xu Ling, Matthew Wondra, Hang Su, ChengXiang Zhai, Topic Sentiment Mixture: Modeling Facets and Opinions in Weblogs, Proceedings of the World Wide Conference 2007 ( WWW'07), pages

30 Multi-Faceted Sentiment Summary (query=“Da Vinci Code”) NeutralPositiveNegative Facet 1: Movie... Ron Howards selection of Tom Hanks to play Robert Langdon. Tom Hanks stars in the movie,who can be mad at that? But the movie might get delayed, and even killed off if he loses. Directed by: Ron Howard Writing credits: Akiva Goldsman... Tom Hanks, who is my favorite movie star act the leading role. protesting... will lose your faith by... watching the movie. After watching the movie I went online and some research on... Anybody is interested in it?... so sick of people making such a big deal about a FICTION book and movie. Facet 2: Book I remembered when i first read the book, I finished the book in two days. Awesome book.... so sick of people making such a big deal about a FICTION book and movie. I’m reading “Da Vinci Code” now. … So still a good book to past time. This controversy book cause lots conflict in west society.

Separate Theme Sentiment Dynamics “book” “religious beliefs” 31

32 Can we make the summary more concise? NeutralPositiveNegative Facet 1: Movie... Ron Howards selection of Tom Hanks to play Robert Langdon. Tom Hanks stars in the movie,who can be mad at that? But the movie might get delayed, and even killed off if he loses. Directed by: Ron Howard Writing credits: Akiva Goldsman... Tom Hanks, who is my favorite movie star act the leading role. protesting... will lose your faith by... watching the movie. After watching the movie I went online and some research on... Anybody is interested in it?... so sick of people making such a big deal about a FICTION book and movie. Facet 2: Book I remembered when i first read the book, I finished the book in two days. Awesome book.... so sick of people making such a big deal about a FICTION book and movie. I’m reading “Da Vinci Code” now. … So still a good book to past time. This controversy book cause lots conflict in west society. What if the user is using a smart phone?

Opinion Summarization 2: [Ganesan et al. WWW 12] “Micro” Opinion Summarization Kavita Ganesan, Chengxiang Zhai and Evelyne Viegas, Micropinion Generation: An Unsupervised Approach to Generating Ultra-Concise Summaries of Opinions, Proceedings of the World Wide Conference 2012 ( WWW'12), pages ,

Micro Opinion Summarization “Good service” “Delicious soup dishes” “Good service” “Delicious soup dishes” “Room is large” “Room is clean” “large clean room” 34

Overview of summarization algorithm Text to be summarized …. very nice place clean problem dirty room … …. very nice place clean problem dirty room … Step 1: Shortlist high freq unigrams (count > median) Unigrams Step 2: Form seed bigrams by pairing unigrams. Shortlist by S rep. (S rep > σ rep ) very + nice very + clean very + dirty clean + place clean + room dirty + place … very + nice very + clean very + dirty clean + place clean + room dirty + place … Srep > σ rep Seed BigramsInput 35

Overview of summarization algorithm Step 3: Generate higher order n-grams. Concatenate existing candidates + seed bigrams Prune non-promising candidates (S rep & S read ) Eliminate redundancies (sim(mi,mj)) Repeat process on shortlisted candidates (until no possbility of expansion) Higher order n-grams very clean very dirty very nice Candidates Seed Bi-grams clean rooms clean bed dirty room dirty pool nice place nice room Step 4: Final summary. Sort by objective function value. Add phrases until |M|< σ ss 0.9 very clean rooms 0.8 friendly service 0.7 dirty lobby and pool 0.5 nice and polite staff … very clean rooms 0.8 friendly service 0.7 dirty lobby and pool 0.5 nice and polite staff ….. Sorted Candidates Summary = = = very clean rooms very clean bed very dirty room very dirty pool very nice place very nice room = New Candidates S rep <σ rep ; S read <σ read 36

Performance comparisons (reviews of 330 products) Proposed method works the best 37

The program can generate meaningful novel phrases Example: “wide screen lcd monitor is bright” readability : representativeness: 4.25 “wide screen lcd monitor is bright” readability : representativeness: 4.25 Unseen N-Gram (Acer AL2216 Monitor) “…plus the monitor is very bright…” “…it is a wide screen, great color, great quality…” “…this lcd monitor is quite bright and clear…” “…plus the monitor is very bright…” “…it is a wide screen, great color, great quality…” “…this lcd monitor is quite bright and clear…” Related snippets in original text 38

A Sample Summary Canon Powershot SX120 IS Easy to use Good picture quality Crisp and clear Good video quality Easy to use Good picture quality Crisp and clear Good video quality E-reader/ Tablet Smart Phones Cell Phones Useful for pushing opinions to devices where the screen is small 39

Outline 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis 40

Motivation How to infer aspect ratings? Value Location Service … How to infer aspect weights? Value Location Service … 41

Opinion Analysis: [Wang et al. KDD 2010] & [Wang et al. KDD 2011] Latent Aspect Rating Analysis Hongning Wang, Yue Lu, ChengXiang Zhai. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'10), pages , Hongning Wang, Yue Lu, ChengXiang Zhai, Latent Aspect Rating Analysis without Aspect Keyword Supervision, Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11), 2011, pages

Latent Aspect Rating Analysis Given a set of review articles about a topic with overall ratings Output –Major aspects commented on in the reviews –Ratings on each aspect –Relative weights placed on different aspects by reviewers Many applications –Opinion-based entity ranking –Aspect-level opinion summarization –Reviewer preference analysis –Personalized recommendation of products –… 43

Solution: Latent Rating Regression Model Reviews + overall ratingsAspect segments location:1 amazing:1 walk:1 anywhere: nice:1 accommodating:1 smile:1 friendliness:1 attentiveness:1 Term weightsAspect Rating room:1 nicely:1 appointed:1 comfortable: Aspect SegmentationLatent Rating Regression Aspect Weight Topic model for aspect discovery + 44

Sample Result 1:Aspect-Specific Sentiment Lexicon Uncover sentimental information directly from the data 45 ValueRoomsLocationCleanliness resort 22.80view 28.05restaurant 24.47clean value 19.64comfortable 23.15walk 18.89smell excellent 19.54modern 15.82bus 14.32linen worth 19.20quiet 15.37beach 14.11maintain bad carpet -9.88wall smelly money smell -8.83bad -5.40urine terrible dirty -7.85road -2.90filthy overprice -9.06stain -5.85website -1.67dingy -0.38

Sample Result 2: Rated Aspect Summarization 46 AspectSummaryRating Value Truly unique character and a great location at a reasonable price Hotel Max was an excellent choice for our recent three night stay in Seattle. 3.1 Overall not a negative experience, however considering that the hotel industry is very much in the impressing business there was a lot of room for improvement. 1.7 Location The location, a short walk to downtown and Pike Place market, made the hotel a good choice. 3.7 When you visit a big metropolitan city, be prepared to hear a little traffic outside!1.2 Business Service You can pay for wireless by the day or use the complimentary Internet in the business center behind the lobby though. 2.7 My only complaint is the daily charge for internet access when you can pretty much connect to wireless on the streets anymore. 0.9 (Hotel Max in Seattle)

Sample Result 3: Comparative analysis of opinion holders 47 CityAvgPriceGroupVal/LocVal/RmVal/Ser Amsterdam241.6 top bot Barcelona280.8 top bot San Francisco261.3 top bot Florence272.1 top bot Reviewers emphasizing more on ‘value’ tend to prefer cheaper hotels

Sample Result 4: Analysis of Latent Preferences 48 Expensive HotelCheap Hotel 5 Stars3 Stars5 Stars1 Star Value Room Location Cleanliness Service People like expensive hotels because of good service People like cheap hotels because of good value

Sample Result 5: Personalized Ranking of Entities Query: 0.9 value 0.1 others Non-Personalized Personalized 49

Sample Result 6: Discover consumer preferences 50 battery life accessory service file format volume video

Summary 1. Opinion Integration & Validation 2. Opinion Summarization 3. Opinion Analysis -Leverage expert reviews [WWW 08] -Leverage ontology [COLING 10] -Content-based trust framework [KDD 11] - Aspect sentiment summary [WWW 07] - Micro opinion summary [WWW 12] - Two-stage rating analysis [KDD 10] - Unified rating analysis [KDD 11] Shared opinionated text data are very useful for decision making Users face significant challenges in integrating, validating, and digesting opinions 51

Future Work: Putting All Together 52 Information Sharing Network Decision=? Information Need Preliminary Result: FindiLike SystemFindiLike

Acknowledgments Collaborators: Yue Lu, Qiaozhu Mei, Kavita Ganesan, Hongning Wang, Vinod Vydiswaran, and many others Funding 53 Thanks to MURI!

Thank You! Questions/Comments? 54