Research on Intelligent Text Information Management (ChengXiang Zhai)


1 Research on Intelligent Text Information Management
ChengXiang Zhai
Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics
University of Illinois, Urbana-Champaign
http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
Contains joint work with Xuehua Shen, Bin Tan, Qiaozhu Mei, Yue Lu, and other members of the TIMan group
© 2009 ChengXiang Zhai

2 Research Roadmap
[Figure labels]
Foundation: Text; Natural Language Content Analysis
Functions: Search, Filtering, Categorization, Summarization, Clustering, Extraction, Mining, Visualization
Goals: Information Access, Information Organization, Knowledge Acquisition
Search Applications, current focus: Personalized; Retrieval models; Topic map; Recommender
Mining Applications, current focus: Comparative text mining; Opinion integration; Controversy discovery
Domains: Web, Email, and Bioinformatics
Also shown: Entity/Relation Extraction

3 Sample Projects
- User-Centered Adaptive Information Retrieval
- Multi-Resolution Topic Map for Browsing
- Comparative Text Mining
- Opinion Integration and Summarization

4 Project 1: User-Centered Adaptive IR (UCAIR)
A novel retrieval strategy emphasizing
- user modeling (“user-centered”)
- search context modeling (“adaptive”)
- interactive retrieval
Implemented as a personalized search agent that
- sits on the client side (owned by the user)
- integrates information around a user (1 user vs. N sources, as opposed to 1 source vs. N users)
- collaborates with other agents
- goes beyond search toward task support

5 Non-Optimality of Document-Centered Search Engines
Query = Jaguar (results as of Oct. 17, 2005)
Result senses: Car, Software, Animal
Mixed results, unlikely to be optimal for any particular user

6 The UCAIR Project (NSF CAREER)
[Figure: the query “jaguar” sent directly to a search engine, compared with the same query sent through a personalized search agent. Labels: Web search engine, email search engine, desktop files, viewed Web pages, query history]

7 Potential Benefit of Personalization
Candidate senses: Car, Software, Animal
Suppose we know:
1. Previous query = “racing cars” vs. “Apple OS”
2. “car” occurs far more frequently than “Apple” in pages browsed by the user in the last 20 days
3. User just viewed an “Apple OS” document

8 Intelligent Re-ranking of Unseen Results
When a user clicks the “back” button after viewing a document, UCAIR re-ranks the unseen results to pull up documents similar to the one the user has viewed.
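The re-ranking idea can be sketched in a few lines. This is a minimal illustration and not UCAIR's actual algorithm: it assumes documents are plain strings, uses raw term counts with cosine similarity, and the names `cosine` and `rerank_unseen` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_unseen(viewed: str, unseen: list[str]) -> list[str]:
    """Order unseen results by similarity to the document just viewed.

    Assumption: whitespace tokenization and raw term counts; a real system
    would likely use a stronger retrieval model.
    """
    viewed_vec = Counter(viewed.lower().split())
    scored = [(cosine(viewed_vec, Counter(d.lower().split())), d) for d in unseen]
    # sorted() is stable, so ties keep the search engine's original order.
    return [d for _, d in sorted(scored, key=lambda p: p[0], reverse=True)]


# Hypothetical result snippets for the query "jaguar".
unseen_results = [
    "jaguar cat habitat in the rainforest",
    "jaguar car dealer prices",
    "apple mac os jaguar release notes",
]
reranked = rerank_unseen("apple os x jaguar software update", unseen_results)
```

After the user views a software-related page, the software-related snippet moves to the top of the unseen list, which is the behavior the slide describes.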

9 UCAIR Outperforms Google [Shen et al. 05]
Precision at N documents:
Ranking Method | prec@5 | prec@10 | prec@20 | prec@30
Google | 0.538 | 0.472 | 0.377 | 0.308
UCAIR | 0.581 | 0.556 | 0.453 | 0.375
Improvement | 8.0% | 17.8% | 20.2% | 21.8%
UCAIR toolbar available at http://sifaka.cs.uiuc.edu/ir/ucair/
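For reference, precision at N is the fraction of the top N results that are relevant, and the improvement figures are relative gains of UCAIR over Google. A small sketch: the function names and the example relevance judgments are illustrative, and the only numbers taken from the slide are the prec@5 values.

```python
def precision_at_k(relevance: list[bool], k: int) -> float:
    """Fraction of the top-k results judged relevant."""
    return sum(relevance[:k]) / k


def relative_improvement(baseline: float, system: float) -> float:
    """Relative gain of `system` over `baseline`, in percent."""
    return (system - baseline) / baseline * 100.0


# Hypothetical relevance judgments for one ranked list (not from the study).
judged = [True, False, True, True, False]
p5 = precision_at_k(judged, 5)

# prec@5 values reported on the slide: Google 0.538, UCAIR 0.581.
gain = relative_improvement(0.538, 0.581)
```

Rounded to one decimal place, `gain` matches the 8.0% reported for prec@5.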

10 Future: Personal Information Agent
Sources: Email, WWW, E-COM, Blog, Sports, Literature, IM, Desktop, Intranet, …
Agent components: User Profile, Personal Content Index, Active Info Service, Frequently Accessed Info, Security Handler, Task Support, …

11 Ongoing Work
- UCAIR system
- Recommendation and advertising on social networks

12 Project 2: Multi-Resolution Topic Map for Browsing
Promoting browsing as a “first-class citizen”
Multi-resolution topic map for browsing
- Enables a user to find information through navigation
- Very useful when a user can’t formulate effective queries or uses a small-screen device
Search log as information footprints
- Organize the search log into a topic map
- Allow a user to follow the information footprints of previous users
- Enable social surfing

13 Querying vs. Browsing

14 Information Seeking as Sightseeing
Know the address of an attraction site?
- Yes: take a taxi and go directly to the site
- No: walk around, or take a taxi to a nearby place and then walk around
Know exactly what you want to find?
- Yes: use the right keywords as a query and find the information directly
- No: browse the information space, or start with a rough query and then browse
When querying fails, browsing comes to the rescue…

15 Current Support for Browsing is Limited
Hyperlinks
- Only page-to-page
- Mostly manually constructed
- Browsing step is very small
Web directories (ODP)
- Manually constructed
- Fixed categories
- Only support vertical navigation
Beyond hyperlinks? Beyond fixed categories?
How to promote browsing as a “first-class citizen”?

16 Sightseeing Analogy Continues…

17 Topic Map for Touring Information Space
[Figure: topic regions at multiple resolutions, with zoom in, zoom out, and horizontal navigation; edge values shown include 0.05, 0.03, 0.02, 0.01]

18 Topic-Map based Browsing Demo

19 How can we construct such a multi-resolution topic map?
Multiple possibilities…

20 Search Logs as Information Footprints
User 2722 searched for "national car rental" [!] at 2006-03-09 11:24:29
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.valoans.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://benefits.military.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.avis.com)
User 2722 searched for "enterprise rent a car" [!] at 2006-04-05 23:37:42 (found http://www.enterprise.com)
User 2722 searched for "meineke car care center" [!] at 2006-05-02 09:12:49 (found http://www.meineke.com)
User 2722 searched for "car rental" [!] at 2006-05-25 15:54:36
User 2722 searched for "autosave car rental" [!] at 2006-05-25 23:26:54 (found http://eautosave.com)
User 2722 searched for "budget car rental" [!] at 2006-05-25 23:29:53
User 2722 searched for "alamo car rental" [!] at 2006-05-25 23:56:13
……
Footprints in information space

21 Information Footprints → Topic Map
Challenges
- How to define/construct a topic region
- How to control granularities/resolutions of topic regions
- How to connect topic regions to support effective browsing
Two approaches
- Multi-granularity clustering
- Query editing

22 Collaborative Surfing
- Clickthroughs become new footprints
- Navigation trace enriches map structures
- New queries become new footprints
- Browse logs offer more opportunities to understand user interests and intents

23 Project 3: Comparative Text Mining
Documents are often associated with context (metadata)
- Direct context: time, location, source, authors, …
- Indirect context: events, policies, …
Many applications require “contextual text analysis”:
- Discovering topics from text in a context-sensitive way
- Analyzing variations of topics over different contexts
- Revealing interesting patterns (e.g., topic evolution, topic variations, topic communities)

24 Example 1: Comparing News Articles
Collections that can be compared:
- Vietnam War vs. Afghan War vs. Iraq War
- CNN vs. Fox vs. Blog
- Before 9/11 vs. During Iraq war vs. Current
- US blog vs. European blog vs. Others
What’s in common? What’s unique?
Common Themes | “Vietnam” specific | “Afghan” specific | “Iraq” specific
United nations | … | … | …
Death of people | … | … | …
… | … | … | …

25 More Contextual Analysis Questions
- What positive/negative aspects did people mention about X (e.g., a person, an event)? Trends?
- How does an opinion/topic evolve over time?
- What are emerging topics? What topics are fading away?
- How can we characterize a social network?

26 Research Questions
- Can we model all these problems generally?
- Can we solve these problems with a unified approach?
- How can we bring humans into the loop?

27 Contextual Probabilistic Latent Semantic Analysis ([KDD 2006]…)
Document context: Time = July 2005; Location = Texas; Author = xxx; Occup. = Sociologist; Age Group = 45+; …
Views: View1, View2, View3 (Texas, July 2005, sociologist)
Themes:
- government: government 0.3, response 0.2, ..
- donation: donate 0.1, relief 0.05, help 0.02, ..
- New Orleans: city 0.2, new 0.1, orleans 0.05, ..
Theme coverages: Texas, July 2005, document, ……
Generating each word: choose a view, choose a coverage, choose a theme, then draw a word from θ_i (sample words: government, response; donate, aid, help; new, Orleans)
Example text:
- Criticism of government response to the hurricane primarily consisted of criticism of its response to …
- The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production …
- Over seventy countries pledged monetary donations or other assistance. …
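The four generation steps named on this slide (choose a view, choose a coverage, choose a theme, draw a word) can be sketched as a sampler. This is an illustration under stated assumptions, not the model's reference implementation: the view and coverage probabilities are hypothetical, all views share one set of theme word distributions for brevity, and the word probabilities from the slide are truncated, so they are treated as unnormalized weights.

```python
import random

# Theme word distributions. Probabilities are the truncated values shown on
# the slide; they do not sum to 1, and random.choices normalizes the weights.
THEMES = {
    "government": {"government": 0.3, "response": 0.2},
    "donation": {"donate": 0.1, "relief": 0.05, "help": 0.02},
    "New Orleans": {"city": 0.2, "new": 0.1, "orleans": 0.05},
}

# Simplification: every view reuses the same theme word distributions.
VIEWS = {"Texas": THEMES, "July 2005": THEMES, "sociologist": THEMES}

# Hypothetical theme-coverage distributions, one per context.
COVERAGES = {
    "Texas": {"government": 0.5, "donation": 0.3, "New Orleans": 0.2},
    "July 2005": {"government": 0.2, "donation": 0.2, "New Orleans": 0.6},
}


def pick(dist: dict, rng: random.Random):
    """Sample one key of `dist` with probability proportional to its value."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]


def generate_word(view_probs: dict, coverage_probs: dict, rng: random.Random) -> str:
    view = pick(view_probs, rng)            # 1. choose a view
    coverage = pick(coverage_probs, rng)    # 2. choose a coverage
    theme = pick(COVERAGES[coverage], rng)  # 3. choose a theme
    return pick(VIEWS[view][theme], rng)    # 4. draw a word from that theme


rng = random.Random(0)
words = [
    generate_word(
        {"Texas": 0.6, "July 2005": 0.3, "sociologist": 0.1},  # hypothetical
        {"Texas": 0.5, "July 2005": 0.5},                      # hypothetical
        rng,
    )
    for _ in range(10)
]
```

In the full model each view would carry its own version of every theme, which is what makes the word distributions context-sensitive.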

28 Comparing News Articles
Iraq War (30 articles) vs. Afghan War (26 articles)
Theme | Cluster 1 | Cluster 2 | Cluster 3
Common Theme | united 0.042, nations 0.04, … | killed 0.035, month 0.032, deaths 0.023, … | …
Iraq Theme | n 0.03, Weapons 0.024, Inspections 0.023, … | troops 0.016, hoon 0.015, sanches 0.012, … | …
Afghan Theme | Northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, … | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, … | …
The common theme indicates that the “United Nations” is involved in both wars
Collection-specific themes indicate different roles of the “United Nations” in the two wars

29 Spatiotemporal Patterns in Blog Articles
Query = “Hurricane Katrina”
Topics in the results; spatiotemporal patterns

30 Theme Life Cycles (“Hurricane Katrina”)
- New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, …
- Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, …

31 Theme Snapshots (“Hurricane Katrina”)
- Week 1: The theme is the strongest along the Gulf of Mexico
- Week 2: The discussion moves towards the north and west
- Week 3: The theme distributes more uniformly over the states
- Week 4: The theme is again strong along the east coast and the Gulf of Mexico
- Week 5: The theme fades out in most states

32 Theme Life Cycles (KDD Papers)
- gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …
- marketing 0.0087, customer 0.0086, model 0.0079, business 0.0048, …
- rules 0.0142, association 0.0064, support 0.0053, …

33 Theme Evolution Graph: KDD
Timeline: 1999, 2000, 2001, 2002, 2003, 2004
Themes shown (top words):
- SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
- decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
- classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
- information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
- web 0.009, classification 0.007, features 0.006, topic 0.005, …
- mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
- topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …

34 Multi-Faceted Sentiment Summary (query = “Da Vinci Code”)
Facet 1: Movie
- Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."
- Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"
- Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."
Facet 2: Book
- Neutral: "I remembered when i first read the book, I finished the book in two days." / "I’m reading “Da Vinci Code” now. …"
- Positive: "Awesome book." / "So still a good book to past time."
- Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."

35 Separate Theme Sentiment Dynamics
Themes plotted: “book”, “religious beliefs”

36 Event Impact Analysis: IR Research
Theme: retrieval models (SIGIR papers)
- term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, independence 0.0311, model 0.0310, frequent 0.0233, probabilistic 0.0188, document 0.0173, …
Events:
- 1992: start of the TREC conferences
- 1998: publication of the paper “A language modeling approach to information retrieval”
Theme variations shown:
- vector 0.0514, concept 0.0298, extend 0.0297, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, feedback 0.0077, …
- xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, subtopic 0.0079, …
- probabilist 0.0778, model 0.0432, logic 0.0404, ir 0.0338, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
- model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, probable 0.0205, smooth 0.0198, markov 0.0137, likelihood 0.0059, …

37 Topic Modeling + Social Networks
Authors writing about the same topic form a community
Comparison: Topic Model Only vs. Topic Model + Social Network
Separation of 3 research communities: IR, ML, Web

38 Ongoing Work
- Combining contextual text analysis with visualization
- More detailed semantic modeling (entities, relations, …)
- Integration of search and contextual text analysis to develop an analyst’s workbench:
  - Interactive semantic navigation and probing
  - Synthesis of information/knowledge
  - Personalized/customized service

39 Project 4: Opinion Integration and Summarization
Increasing popularity of Web 2.0 applications
- more people express opinions on the Web
Example volumes: 190,451 posts; 4,773,658 results
How to digest all?

40 Motivation: Two Kinds of Opinions
Expert opinions (CNET editor’s review, Wikipedia article)
- Well-structured
- Easy to access
- Maybe biased
- Outdated soon
Ordinary opinions (forum discussions, blog articles; 190,451 posts, 4,773,658 results)
- Represent the majority
- Up to date
- Hard to access
- Fragmented
How to benefit from both?

41 Problem Definition
Topic: iPod
Input:
- Expert review with aspects: Design, Battery, Price, ..
- Text collection of ordinary opinions, e.g., Weblogs
Output: integrated summary
- Review aspects (Design, Battery, Price), each with similar opinions and supplementary opinions
- Extra aspects: iTunes (… easy to use …), warranty (… better to extend ..)
Example opinion snippets: cute…, tiny…, ..thicker..; last many hrs, die out soon; could afford it, still expensive

42 Methods
Semi-Supervised Probabilistic Latent Semantic Analysis (PLSA)
- The aspects extracted from expert reviews serve as clues to define a conjugate prior on topics
- Maximum a Posteriori (MAP) estimation
- Repeated applications of PLSA to integrate and align opinions in blog articles to expert review
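One common way to write the MAP update for the word distribution of aspect j under a conjugate (Dirichlet) prior is sketched below. The notation is an assumption and is not taken from the slide: c(w,d) is the count of word w in document d, z_{d,w} is the latent aspect assignment computed in the E-step, p(w | r_j) is the word distribution estimated from the j-th expert review aspect, and μ controls the strength of the prior.

```latex
p^{(n+1)}(w \mid \theta_j) =
  \frac{\sum_{d} c(w,d)\, p(z_{d,w} = j) + \mu\, p(w \mid r_j)}
       {\sum_{w'} \sum_{d} c(w',d)\, p(z_{d,w'} = j) + \mu}
```

With μ = 0 this reduces to the standard PLSA M-step; as μ grows, the estimated aspect stays closer to the expert review aspect.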

43 Results: Product (iPhone)
Opinion integration with review aspects
Aspect | Review article | Similar opinions | Supplementary opinions
Activation | You can make emergency calls, but you can't use any other functions… | N/A | … methods for unlocking the iPhone have emerged on the Internet in the past few weeks, although they involve tinkering with the iPhone hardware…
Battery | rated battery life of 8 hours talk time, 24 hours of music playback, 7 hours of video playback, and 6 hours on Internet use. | iPhone will Feature Up to 8 Hours of Talk Time, 6 Hours of Internet Use, 7 Hours of Video Playback or 24 Hours of Audio Playback | Playing relatively high bitrate VGA H.264 videos, our iPhone lasted almost exactly 9 freaking hours of continuous playback with cell and WiFi on (but Bluetooth off).
Annotations: Unlock/hack iPhone; Confirm the opinions from the review; Additional info under real usage

44 Results: Product (iPhone)
Opinions on extra aspects
Support | Supplementary opinions on extra aspects | Annotation
15 | You may have heard of iASign … an iPhone Dev Wiki tool that allows you to activate your phone without going through the iTunes rigamarole. | Another way to activate iPhone
13 | Cisco has owned the trademark on the name "iPhone" since 2000, when it acquired InfoGear Technology Corp., which originally registered the name. | iPhone trademark originally owned by Cisco
13 | With the imminent availability of Apple's uber cool iPhone, a look at 10 things current smartphones like the Nokia N95 have been able to do for a while and that the iPhone can't currently match... | A better choice for smart phones?

45 Results: Product (iPhone)
Support statistics for review aspects
- People care about price
- People comment a lot about the unique wi-fi feature
- Controversy: activation requires contract with AT&T

46 Summarization of Contradictory Opinions [Kim & Zhai CIKM 09]
[Repeats the multi-faceted sentiment summary table for “Da Vinci Code” from slide 34]
How can we help analysts digest and interpret contradictory opinions?

47 Contrastive Opinion Summarization
[Figure: two sets of opinion sentences, X = {x1, x2, x3, x4, x5, …, xn} and Y = {y1, y2, y3, y4, …, ym}]

48 Contrastive Opinion Summarization
[Figure: two sets of opinion sentences, X = {x1, x2, x3, x4, x5, …, xn} and Y = {y1, y2, y3, y4, …, ym}, and the contrastive opinion summary made of sentence pairs (u1, v1), (u2, v2), …, (uk, vk)]

49 Problem Formulation
[Figure: input sets X = {x1, …, xn} and Y = {y1, …, ym}; summary sets U = {u1, …, uk} and V = {v1, …, vk}. Representativeness relates U to X and V to Y; contrastiveness relates U to V]

51 Summarization as Optimization
1. Define an appropriate content similarity function φ
2. Define an appropriate contrastive similarity function ψ
3. Solve the optimization problem efficiently
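The three steps can be illustrated with a toy implementation. Everything here is an assumption made for illustration: the slide does not give the objective, so the sketch scores a summary as a weighted sum of representativeness and contrastiveness with a hypothetical trade-off `lam`; `phi` and `psi` are simple Jaccard overlaps; `SENTIMENT_WORDS` is a made-up lexicon; and the search is exhaustive, which only works for tiny inputs.

```python
from itertools import combinations, permutations


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0


SENTIMENT_WORDS = {"fast", "easy", "slow", "hard", "good", "bad"}  # hypothetical


def phi(s: str, t: str) -> float:
    """Content similarity between two sentences of the same polarity."""
    return jaccard(set(s.split()), set(t.split()))


def psi(s: str, t: str) -> float:
    """Contrastive similarity: word overlap after dropping sentiment words."""
    return jaccard(set(s.split()) - SENTIMENT_WORDS,
                   set(t.split()) - SENTIMENT_WORDS)


def objective(X, Y, pairs, lam=0.5):
    """Weighted sum of representativeness and contrastiveness (assumed form)."""
    U = [u for u, _ in pairs]
    V = [v for _, v in pairs]
    rep = (sum(max(phi(x, u) for u in U) for x in X) / len(X)
           + sum(max(phi(y, v) for v in V) for y in Y) / len(Y))
    con = sum(psi(u, v) for u, v in pairs) / len(pairs)
    return lam * rep + (1 - lam) * con


def best_summary(X, Y, k=1, lam=0.5):
    """Exhaustive search over k aligned pairs; feasible only for tiny inputs."""
    candidates = [list(zip(U, V))
                  for U in combinations(X, k)
                  for V in permutations(Y, k)]
    return max(candidates, key=lambda pairs: objective(X, Y, pairs, lam))


X = ["file transfer is fast", "battery life is good"]   # positive sentences
Y = ["file transfer is slow", "screen is bad"]          # negative sentences
summary = best_summary(X, Y, k=1)
```

On this toy input the selected pair is the two file-transfer sentences, which agree on the topic but disagree in polarity. Step 3 on the slide asks for an efficient solver; a greedy or approximate search would replace the exhaustive loop in practice.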

52 Sample Results
No | Positive | Negative
1 | oh... and file transfers are fast & easy. | you need the software to actually transfer files
2 | i noticed that the micro adjustment knob and collet are well made and work well too. | the adjustment knob seemed ok, but when lowering the router, i have to practically pull it down while turning the knob.
3 | the navigation is nice enough, but scrolling and searching through thousands of tracks, hundreds of albums or artists, or even dozens of genres is not conducive to save driving | difficult navigation - i won’t necessarily say "difficult," but i don’t enjoy the scrollwheel to navigate.
4 | i imagine if i left my player untouched (no backlight) it could play for considerably more than 12 hours at a low volume level. | there are 2 things that need fixing first is the battery life. it will run for 6 hrs without problems with medium usage of the buttons.

53 Sample Result
[Same table as slide 52]
Annotation: Different polarities of opinions made from different perspectives.

54 Sample Result
[Same table as slide 52]
Annotation: Positive vs. negative; not much disagreement

55 Sample Result
[Same table as slide 52]
Annotation: Judgments revealing detailed conditions

56 Latent Aspect Rating Analysis
Motivation:
- Many online reviews have text content as well as numerical ratings
- How can we decompose an overall rating into ratings on different aspects?
- How can we infer a reviewer’s relative weights on different aspects (rating factors)?
- More generally, how can we analyze latent factors related to numerical values associated with text?
Solution
- Latent Rating Regression Model

57 Problem 1: Different reviewers give the same overall rating for different reasons
How do we decompose overall ratings into aspect ratings?
Need to analyze opinions at the fine-grained level of topical aspects!

58 Problem 2: The same rating means different things to different reviewers
How do we infer the aspect weights that reviewers have put on the ratings?
Need to further analyze the aspect emphasis of each reviewer!

59 Our Solution
Pipeline: Reviews + overall ratings → Aspect Segmentation (bootstrapping method) → Aspect segments → Latent Rating Regression → Term weights, Aspect Rating, Aspect Weight
Example aspect segments (term:count):
- location:1 amazing:1 walk:1 anywhere:1
- nice:1 accommodating:1 smile:1 friendliness:1 attentiveness:1
- room:1 nicely:1 appointed:1 comfortable:1
Term weights shown: 0.1 0.7 0.1 0.9; 0.0 0.9 0.1 0.3; 0.6 0.8 0.7 0.8 0.9
Aspect ratings shown: 1.3, 1.8, 3.8
Aspect weights shown: 0.2, 0.6
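The numbers on this slide are consistent with each aspect rating being the sum of term weights times term counts within a segment. The sketch below checks that arithmetic. Several things are assumptions: the pairing of weight lists with segments and with individual terms is inferred only because the sums match the aspect ratings shown, the aspect label `service` is hypothetical, and the third aspect weight (0.2) is assumed because only 0.2 and 0.6 are recoverable from the slide.

```python
# Term counts per aspect segment, as listed on the slide (term:count).
segments = {
    "location": {"location": 1, "amazing": 1, "walk": 1, "anywhere": 1},
    "room": {"room": 1, "nicely": 1, "appointed": 1, "comfortable": 1},
    "service": {"nice": 1, "accommodating": 1, "smile": 1,
                "friendliness": 1, "attentiveness": 1},
}

# Term weights; the assignment of each weight to a term is inferred, not given.
term_weights = {
    "location": {"location": 0.0, "amazing": 0.9, "walk": 0.1, "anywhere": 0.3},
    "room": {"room": 0.1, "nicely": 0.7, "appointed": 0.1, "comfortable": 0.9},
    "service": {"nice": 0.6, "accommodating": 0.8, "smile": 0.7,
                "friendliness": 0.8, "attentiveness": 0.9},
}

# Aspect weights; only 0.2 and 0.6 appear on the slide, the second 0.2 is assumed.
aspect_weights = {"location": 0.2, "room": 0.2, "service": 0.6}


def aspect_rating(counts: dict, weights: dict) -> float:
    """Aspect rating as the weighted sum of term counts in the segment."""
    return sum(counts[t] * weights[t] for t in counts)


ratings = {a: aspect_rating(segments[a], term_weights[a]) for a in segments}
overall = sum(aspect_weights[a] * ratings[a] for a in ratings)
```

Under these assumptions the aspect ratings come out as 1.3, 1.8, and 3.8, matching the slide, and the weighted overall rating is 2.9.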

60 Latent Rating Regression (LRR)
Bayesian regression
[Equations not preserved in this transcript; labeled parts: aspect rating, aspect weight, overall rating, joint probability]
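The slide's equations did not survive extraction. As a hedged sketch of how a latent rating regression of this kind is usually written (the symbols are assumptions, not recovered from the slide): W_{diw} is the count of word w in the segment of review d assigned to aspect i, β_{iw} is a term weight, s_{di} is the latent aspect rating, α_d is the reviewer's aspect weight vector, and r_d is the observed overall rating.

```latex
s_{di} = \sum_{w} \beta_{iw}\, W_{diw}, \qquad
r_d \sim \mathcal{N}\!\Big(\sum_{i=1}^{k} \alpha_{di}\, s_{di},\ \delta^2\Big), \qquad
\alpha_d \sim \mathcal{N}(\mu, \Sigma)

p(r_d, \alpha_d \mid \mu, \Sigma, \delta^2, \beta, W_d)
  = p(\alpha_d \mid \mu, \Sigma)\;
    p\Big(r_d \,\Big|\, \sum_{i=1}^{k} \alpha_{di}\, s_{di},\ \delta^2\Big)
```

Under this form, the aspect weights of a given review are obtained by maximizing the product of the prior on α_d and the likelihood of r_d, which is a maximum a posteriori estimate.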

61 Inference in LRR
- Aspect rating
- Aspect weight: maximum a posteriori estimation, combining a prior and a likelihood
[Equations not preserved in this transcript]

62 Qualitative Evaluation 1: Aspect-level Hotel Analysis
- Hotels with the same overall rating but different aspect ratings
- A better understanding at a finer level of granularity

63 Qualitative Evaluation 2: Reviewer-level Hotel Analysis
- Different reviewers’ ratings of the same hotel
- Detailed analysis of each reviewer’s opinion

64 Quantitative Evaluation Results Analysis

65 Application 1: User Rating Behavior Analysis
- Rating factor analysis
- Aspect emphasis analysis

66 Application 2: Aspect-based Summarization

67 Ongoing Work
- Discovery of controversy and contrastive summarization
- Information trustworthiness

68 The End
Thank You!
More information about our research can be found at http://timan.cs.uiuc.edu/

