Text Retrieval and Mining + Software Engineering =?

Presentation transcript:

Text Retrieval and Mining + Software Engineering =? ChengXiang (“Cheng”) Zhai Department of Computer Science University of Illinois at Urbana-Champaign The Sixth International Workshop on Software Mining, Oct. 30, 2017, Urbana, IL

Text data cover all kinds of topics on the Web: People, Events, Products, Services, … Sources: Blogs, Microblogs, Forums, Reviews, … (45M reviews, 53M blogs, 1307M posts, 65M msgs/day, 115M users, 10M groups, …)

Text Data in Software Engineering. Throughout the life cycle of a software product, we encounter all kinds of text data, e.g., requirements and specifications; software documentation and comments; communications among team members. After a product is put in use, we further encounter other kinds of text data, e.g., bug reports; software reviews; forum discussions. Also, literature in software engineering … How can we manage and exploit all these data to improve software productivity, software quality, and user experience?

Main Techniques for Harnessing Big Text Data: Text Retrieval + Text Analysis. [Figure: Big Text Data → Small Relevant Data → Knowledge → Many Applications; label: Text Mining]

Conceptual Framework for Text Retrieval & Mining: Text Information Systems (TIS). [Figure labels: Users; Retrieval Applications; Analytics Applications; Summarization; Visualization; Filtering; Topic Analysis; Information Organization; Information Access; Text Mining; Search; Extraction; Categorization; Clustering; Natural Language Content Analysis; Text Acquisition; Text]

Elements of TIS: Natural Language Content Analysis Natural Language Processing (NLP) is the foundation of TIS Enable understanding of meaning of text Provide semantic representation of text for TIS Current NLP techniques mostly rely on statistical machine learning enhanced with limited linguistic knowledge Shallow techniques are robust, but deeper semantic analysis is only feasible for very limited domain Some TIS capabilities require deeper NLP than others Most text information systems use very shallow NLP (“bag of words” representation)
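As a rough illustration of the “bag of words” representation mentioned above, the sketch below reduces a sentence to unordered word counts. The tokenizer (lowercasing plus a simple regular expression) is an assumption made for this example; the slides do not specify one.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return unordered word counts; word order and syntax are discarded."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

a = bag_of_words("A dog is chasing a boy on the playground")
b = bag_of_words("A boy is chasing a dog on the playground")
print(a["a"])   # 2
print(a == b)   # True: who is chasing whom is lost
```

The two sentences mean different things but map to the same bag, which is one way to see why this representation counts as very shallow NLP.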

Elements of TIS: Text Access Search: take a user’s query and return relevant documents Filtering/Recommendation: monitor an incoming stream and recommend to users relevant items (or discard non-relevant ones) Categorization: classify a text object into one of the predefined categories Summarization: take one or multiple text documents, and generate a concise summary of the essential content

Elements of TIS: Text Mining Topic Analysis: take a set of documents, extract and analyze topics in them Information Extraction: extract entities, relations of entities or other “knowledge nuggets” from text Clustering: discover groups of similar text objects (terms, sentences, documents, …) Visualization: visually display patterns in text data

Outline Overview of Natural Language Processing (NLP) Main Techniques for Text Retrieval and Mining Selected Projects in Text Retrieval and Mining Applications of Text Retrieval and Mining in Software Engineering Summary

An Example of NLP. Sentence: “A dog is chasing a boy on the playground”. Lexical analysis (part-of-speech tagging): Det Noun Aux Verb Det Noun Prep Det Noun. Syntactic analysis (parsing): Noun Phrase, Complex Verb, Prep Phrase, Verb Phrase, Sentence. Semantic analysis: Dog(d1). Boy(b1). Playground(p1). Chasing(d1,b1,p1). Inference: the rule Scared(x) if Chasing(_,x,_), combined with the facts above, yields Scared(b1). Pragmatic analysis (speech act): A person saying this may be reminding another person to get the dog back…

NLP Is Difficult! Natural language is designed to make human communication efficient. As a result, we omit a lot of “common sense” knowledge, which we assume the hearer/reader possesses we keep a lot of ambiguities, which we assume the hearer/reader knows how to resolve This makes EVERY step in NLP hard Ambiguity is a “killer”! Common sense reasoning is pre-required

Examples of Challenges Word-level ambiguity: E.g., “design” can be a noun or a verb (Ambiguous POS) “root” has multiple meanings (Ambiguous sense) Syntactic ambiguity: E.g., “natural language processing” (Modification) “A man saw a boy with a telescope.” (PP Attachment) Anaphora resolution: “John persuaded Bill to buy a TV for himself.” (himself = John or Bill?) Presupposition: “He has quit smoking.” implies that he smoked before.

The State of the Art. Sentence: “A dog is chasing a boy on the playground”. POS tagging: 97% (Det Noun Aux Verb Det Noun Prep Det Noun). Parsing: partial, >90%(?) (Noun Phrase, Complex Verb, Noun Phrase, Prep Phrase, Verb Phrase, Sentence). Semantics: some aspects (entity/relation extraction, word sense disambiguation, sentiment analysis). Speech act analysis: ??? Inference: ???

What We Can’t Do. 100% POS tagging: “He turned off the highway.” vs “He turned off the fan.” General complete parsing: “A man saw a boy with a telescope.” Precise deep semantic analysis: What’s the meaning of “own” in “John owns a restaurant.”? Robust & general NLP tends to be “shallow” while “deep” understanding doesn’t scale up.

How Much NLP is Really Needed? “Easier” & More “workarounds” Tasks Dependency on NLP Classification/ Retrieval Summarization/ Extraction/ Topic Mining Translation/ Dialogue Question Answering

Workaround Example I: ELIZA. ELIZA plays the role of a therapist. How does it work? Heuristic rules based on keyword matching! E.g.,
I remember X => Do you often think of X?
always => Can you think of a specific example?

Person: Men are all alike.
ELIZA: In what way?
Person: They're always bugging us about something or other.
ELIZA: Can you think of a specific example?
Person: Well, my boyfriend made me come here.
ELIZA: Your boyfriend made you come here?
Person: He says I'm depressed much of the time.
ELIZA: I am sorry to hear that you are depressed.
Person: It's true. I am unhappy.
ELIZA: Do you think coming here will help you not to be unhappy?
Person: I need some help, that much seems certain.
ELIZA: What would it mean to you if you got some help?
Person: Perhaps I could learn to get along with my mother.
ELIZA: Tell me more about your family

It keeps the dialogue going, but NEVER makes any claim! A breakdown example:
Person: Failure is the mother of success.
ELIZA: Tell me more about your family

No NLP, but useful. Perhaps we should call this NLP? Statistical NLP often has a similar flavor, with “SOFT” rules LEARNED from data.
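The keyword-matching idea can be sketched in a few lines. This is only an illustration, not ELIZA's actual rule set: the first two rules come from the slide, the “mother” rule is inferred from the dialogue, and the fallback reply and the `respond` function name are assumptions made for this example. Pronoun swapping (“my” to “your”) is left out.

```python
import re

# Hypothetical rule table: (keyword pattern, response template).
RULES = [
    (re.compile(r"\bI remember (.+)", re.IGNORECASE), "Do you often think of {0}?"),
    (re.compile(r"\balways\b", re.IGNORECASE), "Can you think of a specific example?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family"),
]
DEFAULT = "In what way?"  # assumed fallback when no keyword matches

def respond(utterance):
    """Return the response of the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Strip trailing punctuation from captured text before reuse.
            groups = [g.rstrip(".!?") for g in match.groups()]
            return template.format(*groups)
    return DEFAULT

print(respond("They're always bugging us about something or other."))
print(respond("I remember the old build system."))
print(respond("Failure is the mother of success."))
```

The last call reproduces the breakdown case: the rule fires on the keyword “mother” with no understanding of the sentence.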

Workaround Example II: Statistical Translation. Learn how to translate Chinese to English from many example translations. Intuitions: If we have seen all possible translations, then we simply look them up. If we have seen a similar translation, then we can adapt it. If we haven’t seen any example that’s similar, we try to generalize what we’ve seen. All these intuitions are captured through a probabilistic model. [Figure: noisy channel. An English speaker produces English words (E) with probability P(E); the noisy channel turns them into Chinese words (C) with probability P(C|E); the translator must recover E from C: P(E|C)=?]
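The question P(E|C)=? in the noisy-channel picture is answered with Bayes' rule. Written out, with E an English sentence and C the observed Chinese sentence (the hat notation for the chosen translation is added here for illustration):

```latex
\hat{E} = \arg\max_{E} P(E \mid C)
        = \arg\max_{E} \frac{P(E)\, P(C \mid E)}{P(C)}
        = \arg\max_{E} P(E)\, P(C \mid E)
```

P(C) can be dropped because it does not depend on E. P(E) scores how plausible the English output is, and P(C|E) scores how well it explains the Chinese input.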

NLP for Text Retrieval and Mining. Must be general, robust & efficient → Shallow NLP. “Bag of words” representation tends to be sufficient for most search and mining tasks (but not all!). Some text retrieval techniques can naturally address NLP problems (e.g., word sense disambiguation based on other words in query). However, deeper NLP is needed for complex search tasks. Most useful NLP techniques are usually based on statistical language models (to capture patterns in text data) and supervised machine learning (to leverage human expertise).

Outline Overview of Natural Language Processing (NLP) Main Techniques for Text Retrieval and Mining Selected Projects in Text Retrieval and Mining Applications of Text Retrieval and Mining in Software Engineering Summary

Access to Relevant Text Data: Text Retrieval Big Text Data Small Relevant Data Search Engine Recommender System 1. Natural Language Content Analysis Text Access Querying + Browsing Push Pull User

Two Modes of Text Access: Pull vs. Push Pull Mode (search engines) Users take initiative Ad hoc information need Push Mode (recommender systems) Systems take initiative Stable information need or system has good knowledge about a user’s need

Pull Mode: Querying vs. Browsing. Querying: User enters a (keyword) query. System returns relevant documents. Works well when the user knows what keywords to use. Browsing: User navigates into relevant information by following a path enabled by the structures on the documents. Works well when the user wants to explore information, doesn’t know what keywords to use, or can’t conveniently enter a query.

Information Seeking as Sightseeing Sightseeing: Know address of an attraction? Yes: take a taxi and go directly to the site No: walk around or take a taxi to a nearby place then walk Information seeking: Know exactly what you want to find? Yes: use the right keywords as a query and find the information directly No: browse the information space or start with a rough query and then browse

Text Mining and Analytics. Text mining ≈ Text analytics. Turn text data into high-quality information or actionable knowledge. Minimizes human effort (on consuming text data). Supplies knowledge for optimal decision making. Related to text retrieval, which is an essential component in any text mining system. Text retrieval can be a preprocessor for text mining. Text retrieval is needed for knowledge provenance.

Humans as Subjective & Intelligent “Sensors”. A sensor senses the real world and reports data: Weather → Thermometer → 3C, 15F, …; Locations → Geo Sensor → 41°N and 120°W, …; Networks → Network Sensor → 01000100011100. A “Human Sensor” perceives the real world and expresses what it perceives.

Unique Value of Text Data. Useful to all big data applications. Especially useful for mining knowledge about people’s behavior, attitude, and opinions. Directly express knowledge about our world: small text data are also useful! Data → Information → Knowledge (Text Data)

Opportunities of Text Mining Applications. 1. Mining knowledge about language. 2. Mining content of text data. 3. Mining knowledge about the observer. 4. Infer other real-world variables (predictive analytics). [Figure: the Real World is perceived (perspective) as the Observed World and expressed (English) as Text Data; further inputs: + Non-Text Data, + Context]

TextScope to enhance human perception Microscope Telescope Intelligent Interactive Retrieval & Text Analysis for Task Support and Decision Making

TextScope in Action: intelligent interactive decision support Interactive information retrieval Natural language processing Interactive text analysis Text + Non-Text Learning to interact Prediction Domain Knowledge Predicted Values of Real World Variables Optimal Decision Making Predictive Model Multiple Predictors (Features) … Sensor 1 Sensor k … Non-Text Data Text Joint Mining of Non-Text and Text Real World

TextScope = Intelligent & Interactive Information Retrieval + Text Mining Task Panel TextScope … Topic Analyzer Opinion Prediction Event Radar Search Box Microsoft (MSFT,) Google, IBM (IBM) and other cloud-computing rivals of Amazon Web Services are bracing for an AWS "partnership" announcement with VMware expected to be announced Thursday. … MyFilter1 MyFilter2 … Select Time Select Region My WorkSpace Project 1 Alert A Alert B ...

Outline Overview of Natural Language Processing (NLP) Main Techniques for Text Retrieval and Mining Selected Projects in Text Retrieval and Mining Applications of Text Retrieval and Mining in Software Engineering Summary

Sample Project 1: User-Centered Adaptive IR (UCAIR) A novel retrieval strategy emphasizing user modeling (“user-centered”) search context modeling (“adaptive”) interactive retrieval Implemented as a personalized search agent that sits on the client-side (owned by the user) integrates information around a user (1 user vs. N sources as opposed to 1 source vs. N users) collaborates with each other goes beyond search toward task support

Non-Optimality of Document-Centered Search Engines As of Oct. 17, 2005 Query = Jaguar Car Software Animal Mixed results, unlikely optimal for any particular user

... The UCAIR Project WEB Email Desktop Files Personalized Viewed Web pages Query History Search Engine Search Engine Personalized search agent Search Engine “jaguar” Personalized search agent Desktop Files “jaguar”

Potential Benefit of Personalization. Suppose we know: 1. Previous query = “racing cars” vs. “Apple OS”. 2. “car” occurs far more frequently than “Apple” in pages browsed by the user in the last 20 days. 3. User just viewed an “Apple OS” document. [Figure labels: Car, Car, Software, Car, Animal, Car]

Intelligent Re-ranking of Unseen Results When a user clicks on the “back” button after viewing a document, UCAIR reranks unseen results to pull up documents similar to the one the user has viewed
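One simple way to picture this re-ranking step is sketched below. The slides do not say how UCAIR measures similarity, so the bag-of-words cosine similarity, the function names, and the toy result snippets are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts (assumed representation)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def rerank_unseen(viewed_doc, unseen_docs):
    """Order unseen results by similarity to the document the user just viewed."""
    viewed = vectorize(viewed_doc)
    return sorted(unseen_docs, key=lambda d: cosine(viewed, vectorize(d)), reverse=True)

# Toy, made-up result snippets for the query "jaguar".
unseen = [
    "jaguar habitat in the rainforest",
    "jaguar car dealer and racing cars",
    "jaguar release of apple os",
]
print(rerank_unseen("apple os update guide", unseen)[0])
```

A real system would probably combine this similarity with the original retrieval score rather than replace it; the sketch only shows the pull-up effect.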

UCAIR Outperforms Google [Shen et al. 05]
Precision at N documents:
Ranking Method   prec@5   prec@10   prec@20   prec@30
Google           0.538    0.472     0.377     0.308
UCAIR            0.581    0.556     0.453     0.375
Improvement      8.0%     17.8%     20.2%     21.8%
UCAIR toolbar available at http://sifaka.cs.uiuc.edu/ir/ucair/
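The improvement figures above are relative gains over the Google baseline. A minimal sketch of how precision at N and the relative improvement are computed follows; the function names are illustrative, and the precision values are the ones reported on the slide.

```python
def precision_at_n(relevance, n):
    """Fraction of the top-n results judged relevant (relevance is a list of 0/1)."""
    return sum(relevance[:n]) / n

def relative_improvement(baseline, system):
    """Relative gain of system over baseline, in percent."""
    return 100.0 * (system - baseline) / baseline

# Values reported on the slide.
google = {5: 0.538, 10: 0.472, 20: 0.377, 30: 0.308}
ucair = {5: 0.581, 10: 0.556, 20: 0.453, 30: 0.375}

for n in sorted(google):
    print(f"prec@{n}: {relative_improvement(google[n], ucair[n]):.1f}%")
# prec@5: 8.0%
# prec@10: 17.8%
# prec@20: 20.2%
# prec@30: 21.8%
```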

Future: Personal Information Agent Desktop WWW Intranet Email User Profile Active Info Service E-COM IM … Task Support Security Handler Personal Content Index Sports Blog Frequently Accessed Info … Literature

Sample Project 2: Multi-Resolution Topic Map for Browsing Promoting browsing as a “first-class citizen” Multi-resolution topic map for browsing Enable a user to find information through navigation Very useful when a user can’t formulate effective queries or uses a small screen device Search log as information footprints Organize search log into a topic map Allow a user to follow information footprints of previous users Enable social surfing

Querying vs. Browsing

Information Seeking as Sightseeing. Sightseeing: Know the address of an attraction site? Yes: take a taxi and go directly to the site. No: walk around or take a taxi to a nearby place then walk around. Information seeking: Know exactly what you want to find? Yes: use the right keywords as a query and find the information directly. No: browse the information space or start with a rough query and then browse. When a query fails, browsing comes to the rescue…

Current Support for Browsing is Limited Hyperlinks Only page-to-page Mostly manually constructed Browsing step is very small Web directories Manually constructed Fixed categories Only support vertical navigation Beyond hyperlinks? Beyond fixed categories? How to promote browsing as a “first-class citizen”? ODP

Sightseeing Analogy Continues… Horizontal navigation Region Zoom in Zoom out

Topic Map for Touring Information Space Topic regions Multiple resolutions Zoom in 0.03 0.05 0.03 0.02 0.01 Zoom out Horizontal navigation

Topic-Map based Browsing Demo

How can we construct such a multi-resolution topic map? Multiple possibilities…

Search Logs as Information Footprints. Footprints in information space:
User 2722 searched for "national car rental" [!] at 2006-03-09 11:24:29
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.valoans.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://benefits.military.com)
User 2722 searched for "military car rental benefits" [!] at 2006-03-10 09:33:37 (found http://www.avis.com)
User 2722 searched for "enterprise rent a car" [!] at 2006-04-05 23:37:42 (found http://www.enterprise.com)
User 2722 searched for "meineke car care center" [!] at 2006-05-02 09:12:49 (found http://www.meineke.com)
User 2722 searched for "car rental" [!] at 2006-05-25 15:54:36
User 2722 searched for "autosave car rental" [!] at 2006-05-25 23:26:54 (found http://eautosave.com)
User 2722 searched for "budget car rental" [!] at 2006-05-25 23:29:53
User 2722 searched for "alamo car rental" [!] at 2006-05-25 23:56:13
……

Information Footprints → Topic Map. Challenges: How to define/construct a topic region. How to control granularities/resolutions of topic regions. How to connect topic regions to support effective browsing. Two approaches: Multi-granularity clustering. Query editing.

Collaborative Surfing New queries become new footprints Navigation trace enriches map structures Clickthroughs become new footprints Browse logs offer more opportunities to understand user interests and intents

Sample Project 3: Contextual Text Mining. Documents are often associated with context (meta-data). Direct context: time, location, source, authors, … Indirect context: events, policies, … Many applications require “contextual text analysis”: Discovering topics from text in a context-sensitive way. Analyzing variations of topics over different contexts. Revealing interesting patterns (e.g., topic evolution, topic variations, topic communities).

Example 1: Comparing News Articles Vietnam War Afghan War Iraq War CNN Fox Blog Before 9/11 During Iraq war Current US blog European blog Others Common Themes “Vietnam” specific “Afghan” specific “Iraq” specific United nations … Death of people What’s in common? What’s unique?

More Contextual Analysis Questions What positive/negative aspects did people say about X (e.g., a person, an event)? Trends? How does an opinion/topic evolve over time? What are emerging topics? What topics are fading away? How can we characterize a social network?

Research Questions Can we model all these problems generally? Can we solve these problems with a unified approach? How can we bring human into the loop?

Contextual Probabilistic Latent Semantics Analysis ([KDD 2006]…). Generative process: choose a view; choose a coverage; choose a theme; draw a word from θi. Views: View1, View2, View3. Themes (example word distributions): government (government 0.3, response 0.2, ..); donation (donate 0.1, relief 0.05, help 0.02, ..); New Orleans (city 0.2, new 0.1, orleans 0.05, ..). Document context: Time = July 2005; Location = Texas; Author = xxx; Occup. = Sociologist; Age Group = 45+; … Theme coverages: Texas, July 2005, document, …… Example document text: “Criticism of government response to the hurricane primarily consisted of criticism of its response to …”; “The total shut-in oil production from the Gulf of Mexico … approximately 24% of the annual production and the shut-in gas production …”; “Over seventy countries pledged monetary donations or other assistance. …” Words drawn: government, response, donate, help, aid, Orleans, new. Context labels: Texas, July 2005, sociologist.

Comparing News Articles: Iraq War (30 articles) vs. Afghan War (26 articles). The common theme indicates that “United Nations” is involved in both wars. Cluster 1 Cluster 2 Cluster 3 Common Theme united 0.042 nations 0.04 … killed 0.035 month 0.032 deaths 0.023 Iraq n 0.03 Weapons 0.024 Inspections 0.023 troops 0.016 hoon 0.015 sanches 0.012 Afghan Northern 0.04 alliance 0.04 kabul 0.03 taleban 0.025 aid 0.02 taleban 0.026 rumsfeld 0.02 hotel 0.012 front 0.011 Collection-specific themes indicate different roles of “United Nations” in the two wars.

Spatiotemporal Patterns in Blog Articles Query= “Hurricane Katrina” Topics in the results: Spatiotemporal patterns

Theme Life Cycles (“Hurricane Katrina”). Oil Price: price 0.0772, oil 0.0643, gas 0.0454, increase 0.0210, product 0.0203, fuel 0.0188, company 0.0182, … New Orleans: city 0.0634, orleans 0.0541, new 0.0342, louisiana 0.0235, flood 0.0227, evacuate 0.0211, storm 0.0177, … The upper figure shows the life cycles of different themes in Texas. The red line is a theme whose top-probability words are price, oil, gas, increase, etc., so it is about “oil price”. The blue line covers events in the city of New Orleans. Both themes grew hot during the first two weeks and weakened around mid-September. The New Orleans theme became strong again around the last week of September, while the other theme dropped monotonically. The bottom figure shows the life cycles of the same theme, “New Orleans”, in different states. This theme reaches its highest probability first in Florida and Louisiana, followed by Washington and then Texas. During early September, it drops significantly in Louisiana while remaining strong in other states; we suppose this is because of the evacuation in Louisiana. Surprisingly, around late September the theme rises again in most states, most significantly in Louisiana. Since this is when Hurricane Rita arrived, we guess that Hurricane Rita affected the discussion of Hurricane Katrina. This is reasonable, since people are likely to mention the two hurricanes together or compare them. The Hurricane Rita data set may offer more clues for this hypothesis.

Theme Snapshots (“Hurricane Katrina”). Week 1: The theme is the strongest along the Gulf of Mexico. Week 2: The discussion moves towards the north and west. Week 3: The theme distributes more uniformly over the states. Week 4: The theme is again strong along the east coast and the Gulf of Mexico. Week 5: The theme fades out in most states. This slide shows snapshots of the theme “Government Response” over the first five weeks of Hurricane Katrina. The darker the color, the hotter the discussion of this theme. In the first week, the theme is strongest in the southeastern states, especially those along the Gulf of Mexico. In week 2, the theme spreads towards the northern and western states, as the northern states get darker. In week 3, the theme is distributed even more uniformly, meaning that it has spread all over the states. In week 4, however, the theme converges on the eastern states and the southeast coast again. Interestingly, this week overlaps with the first week of Hurricane Rita, which may have raised public concern about government response again in those areas. In week 5, the theme becomes weak in most inland states, and most of the remaining discussion is along the coasts. Another interesting observation is that this theme is originally very strong in Louisiana (the state to the right of Texas), weakens dramatically there during weeks 2 and 3, and becomes strong again from the fourth week. Weeks 2 and 3 coincide with the evacuation in Louisiana.

Theme Life Cycles (KDD Papers) gene 0.0173 expressions 0.0096 probability 0.0081 microarray 0.0038 … marketing 0.0087 customer 0.0086 model 0.0079 business 0.0048 … rules 0.0142 association 0.0064 support 0.0053 …

Theme Evolution Graph: KDD. Time axis (T): 1999, 2000, 2001, 2002, 2003, 2004. Themes: (web 0.009, classification 0.007, features 0.006, topic 0.005, …); (SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …); (mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …); (topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …); (decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …); (classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …); (information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …); …

Multi-Faceted Sentiment Summary (query=“Da Vinci Code”) Neutral Positive Negative Facet 1: Movie ... Ron Howards selection of Tom Hanks to play Robert Langdon. Tom Hanks stars in the movie,who can be mad at that? But the movie might get delayed, and even killed off if he loses. Directed by: Ron Howard Writing credits: Akiva Goldsman ... Tom Hanks, who is my favorite movie star act the leading role. protesting ... will lose your faith by ... watching the movie. After watching the movie I went online and some research on ... Anybody is interested in it? ... so sick of people making such a big deal about a FICTION book and movie. Facet 2: Book I remembered when i first read the book, I finished the book in two days. Awesome book. I’m reading “Da Vinci Code” now. … So still a good book to past time. This controversy book cause lots conflict in west society.

Separate Theme Sentiment Dynamics “book” “religious beliefs”

Event Impact Analysis: IR Research xml 0.0678 email 0.0197 model 0.0191 collect 0.0187 judgment 0.0102 rank 0.0097 subtopic 0.0079 … vector 0.0514 concept 0.0298 extend 0.0297 model 0.0291 space 0.0236 boolean 0.0151 function 0.0123 feedback 0.0077 … Publication of the paper “A language modeling approach to information retrieval” Starting of the TREC conferences year 1992 term 0.1599 relevance 0.0752 weight 0.0660 feedback 0.0372 independence 0.0311 model 0.0310 frequent 0.0233 probabilistic 0.0188 document 0.0173 … Theme: retrieval models SIGIR papers 1998 model 0.1687 language 0.0753 estimate 0.0520 parameter 0.0281 distribution 0.0268 probable 0.0205 smooth 0.0198 markov 0.0137 likelihood 0.0059 … probabilist 0.0778 model 0.0432 logic 0.0404 ir 0.0338 boolean 0.0281 algebra 0.0200 estimate 0.0119 weight 0.0111 …

Applications of TextScope Interactive information retrieval Natural language processing Interactive text analysis Text + Non-Text Learning to interact Prediction Domain Knowledge Predicted Values of Real World Variables Optimal Decision Making Predictive Model Multiple Predictors (Features) … Sensor 1 Sensor k … Non-Text Data Text Joint Mining of Non-Text and Text Real World

Application Example 1: Medical & Health Predicted Values of Real World Variables Optimal Decision Making Predictive Model Multiple Predictors (Features) … Diagnosis, optimal treatment Side effects of drugs, … Doctors, Nurses, Patients… Sensor 1 Sensor k … Non-Text Data Text Joint Mining of Non-Text and Text Medical & Health Real World

Discovery of Adverse Drug Reactions from Forums [Wang et al. 14] Green: Disease symptoms Blue: Side effect symptoms Red: Drug Drug: Cefalexin ADR: panic attack faint …. TextScope Sheng Wang et al. 2014. SideEffectPTM: an unsupervised topic model to mine adverse drug reactions from health forums. In ACM BCB 2014.

Sample ADRs Discovered [Wang et al. 14]
Drug (Freq) | Drug Use | Symptoms in Descending Order
Zoloft (84) | antidepressant | weigh gain, weight, depression, side effects, mgs, gain weight, anxiety, nausea, head, brain, pregnancy, pregnant, headaches, depressed, tired
Ativan (33) | anxiety disorders | Ativan, sleep, Seroquel, doc prescribed seroqual, raising blood sugar levels, anti-psychotic drug, diabetic, constipation, diabetes, 10mg, benzo, addicted
Topamax (20) | anticonvulsant | Topmax, liver, side effects, migraines, headaches, weight, Topamax, pdoc, neurologist, supplement, sleep, fatigue, seizures, liver problems, kidney stones
Ephedrine (2) | stimulant | dizziness, stomach, Benadryl, dizzy, tired, lethargic, tapering, tremors, panic attach, head
[Slide annotation: Unreported to FDA]
Sheng Wang et al. 2014. SideEffectPTM: an unsupervised topic model to mine adverse drug reactions from health forums. In ACM BCB 2014.

Application Example 2: Business intelligence Predicted Values of Real World Variables Optimal Decision Making Predictive Model Multiple Predictors (Features) … Business intelligence Consumer trends… Business analysts, Market researcher… Sensor 1 Sensor k … Non-Text Data Text Joint Mining of Non-Text and Text Products Real World

Latent Aspect Rating Analysis (LARA) [Wang et al. 10] How to infer aspect ratings? TextScope Value Location Service … How to infer aspect weights? Hongning Wang, Yue Lu, ChengXiang Zhai. Latent Aspect Rating Analysis on Review Text Data: A Rating Regression Approach, Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'10), pages 115-124, 2010.

Solving LARA in two stages: Aspect Segmentation + Rating Regression
Observed: reviews + overall ratings. Latent: aspect ratings and aspect weights.
Stage 1, Aspect Segmentation: reviews are split into aspect segments.
Stage 2, Latent Rating Regression: term weights, aspect rating, and aspect weight for each segment.
Aspect segments | Term Weights, Aspect Rating, Aspect Weight
location:1 amazing:1 walk:1 anywhere:1 | 0.0 2.9 0.1 0.9 3.9 0.2
room:1 nicely:1 appointed:1 comfortable:1 | 0.1 1.7 3.9 4.8 0.2
nice:1 accommodating:1 smile:1 friendliness:1 attentiveness:1 | 2.1 1.2 1.7 2.2 0.6 5.8 0.6

Latent Rating Regression
Aspect segments | Term Weights, Aspect Rating, Aspect Weight
location:1 amazing:1 walk:1 anywhere:1 | 0.0 0.9 0.1 0.3 1.3 0.2
room:1 nicely:1 appointed:1 comfortable:1 | 0.1 0.7 0.9 1.8 0.2
nice:1 accommodating:1 smile:1 friendliness:1 attentiveness:1 | 0.6 0.8 0.7 0.9 3.8 0.6
Conditional likelihood
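The regression step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes an aspect rating is the sum of term weights over the words in an aspect segment, and that the overall rating is the aspect-weighted sum of aspect ratings. The pairing of the location term weights with words follows slide order (an assumption), the aspect names "room" and "service" are labels chosen here, and the function names and the Gaussian noise variance `sigma2` are hypothetical.

```python
import math

def aspect_rating(term_counts, term_weights):
    """Sum of term weights over the words of one aspect segment."""
    return sum(count * term_weights.get(term, 0.0)
               for term, count in term_counts.items())

def overall_rating(aspect_ratings, aspect_weights):
    """Overall rating as the aspect-weighted sum of aspect ratings."""
    return sum(aspect_weights[a] * r for a, r in aspect_ratings.items())

def log_likelihood(observed, predicted, sigma2=1.0):
    """Gaussian log-density of the observed overall rating (assumed noise model)."""
    return (-0.5 * math.log(2 * math.pi * sigma2)
            - (observed - predicted) ** 2 / (2 * sigma2))

# Location segment from the slide; the word-to-weight pairing is assumed.
location_counts = {"location": 1, "amazing": 1, "walk": 1, "anywhere": 1}
location_weights = {"location": 0.0, "amazing": 0.9, "walk": 0.1, "anywhere": 0.3}

ratings = {
    "location": aspect_rating(location_counts, location_weights),
    "room": 1.8,     # aspect rating shown on the slide
    "service": 3.8,  # aspect rating shown on the slide
}
weights = {"location": 0.2, "room": 0.2, "service": 0.6}  # aspect weights on the slide

predicted = overall_rating(ratings, weights)
print(round(ratings["location"], 2), round(predicted, 2))
```

Under these assumptions the location term weights add up to the aspect rating of 1.3 shown on the slide, and the three aspect ratings combine into an overall rating of about 2.9. Fitting the model would mean choosing term weights and aspect weights that maximize `log_likelihood` over all reviews.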

A Unified Generative Model for LARA
Entity, Review, Aspects, Aspect Rating, Aspect Weight
Review: Excellent location in walking distance to Tiananmen Square and shopping streets. That’s the best part of this hotel! The rooms are getting really old. Bathroom was nasty. The fixtures were falling off, lots of cracks and everything looked dirty. I don’t think it worth the price. Service was the most disappointing part, especially the door men. this is not how you treat guests, this is not hospitality.
Location: location amazing walk anywhere; 0.86 0.04 0.10
Room: room dirty appointed smelly
Service: terrible front-desk smile unhelpful

Sample Result 1: Rating Decomposition
Hotels with the same overall rating but different aspect ratings
Reveal detailed opinions at the aspect level
(All 5 Stars hotels, ground-truth in parenthesis.)
Hotel | Value | Room | Location | Cleanliness
Grand Mirage Resort | 4.2(4.7) | 3.8(3.1) | 4.0(4.2) | 4.1(4.2)
Gold Coast Hotel | 4.3(4.0) 3.9(3.3) 3.7(3.1) [one value missing]
Eurostars Grand Marina Hotel | 3.7(3.8) | 4.4(3.8) | 4.1(4.9) | 4.5(4.8)

Sample Result 2: Comparison of reviewers
Reviewer-level Hotel Analysis
Different reviewers’ ratings on the same hotel
Reveal differences in opinions of different reviewers
Reviewer | Value | Room | Location | Cleanliness
Mr.Saturday | 3.7(4.0) 3.5(4.0) 5.8(5.0) [one value missing]
Salsrug | 5.0(5.0) 3.0(3.0) 5.0(4.0) [one value missing]
(Hotel Riu Palace Punta Cana)

Sample Result 3: Aspect-Specific Sentiment Lexicon
Uncover sentiment information directly from the data
Value | Rooms | Location | Cleanliness
resort 22.80 | view 28.05 | restaurant 24.47 | clean 55.35
value 19.64 | comfortable 23.15 | walk 18.89 | smell 14.38
excellent 19.54 | modern 15.82 | bus 14.32 | linen 14.25
worth 19.20 | quiet 15.37 | beach 14.11 | maintain 13.51
bad -24.09 | carpet -9.88 | wall -11.70 | smelly -0.53
money -11.02 | smell -8.83 | bad -5.40 | urine -0.43
terrible -10.01 | dirty -7.85 | road -2.90 | filthy -0.42
overprice -9.06 | stain -5.85 | website -1.67 | dingy -0.38

Sample Result 4: User Rating Behavior Analysis
Column groups: Expensive Hotel, Cheap Hotel. Rating levels: 5 Stars, 3 Stars, 1 Star.
Value | 0.134 | 0.148 | 0.171 | 0.093
Room | 0.098 | 0.162 | 0.126 | 0.121
Location | 0.074 0.161 0.082 [one value missing]
Cleanliness | 0.081 | 0.163 | 0.116 | 0.294
Service | 0.251 0.101 0.049 [one value missing]
People like expensive hotels because of good service
People like cheap hotels because of good value

Sample Result 5: Personalized Recommendation of Entities Query: 0.9 value 0.1 others Non-Personalized Personalized
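One way to read the query "0.9 value 0.1 others" is as a set of aspect weights used to re-rank entities. The sketch below is a simplified stand-in, not the method behind the slide: it scores each hotel by a weighted sum of its inferred aspect ratings, using the two rows of Sample Result 1 that have all four values, and it assumes the remaining 0.1 is split evenly over the other aspects. The function name `rank_hotels` is hypothetical.

```python
def rank_hotels(hotels, weights):
    """Return hotel names sorted by weighted aspect rating, best first."""
    def score(ratings):
        return sum(weights[aspect] * rating for aspect, rating in ratings.items())
    return sorted(hotels, key=lambda name: score(hotels[name]), reverse=True)

# Inferred aspect ratings from Sample Result 1 (rows with all four values).
hotels = {
    "Grand Mirage Resort": {
        "value": 4.2, "room": 3.8, "location": 4.0, "cleanliness": 4.1},
    "Eurostars Grand Marina Hotel": {
        "value": 3.7, "room": 4.4, "location": 4.1, "cleanliness": 4.5},
}

aspects = ["value", "room", "location", "cleanliness"]

# Non-personalized: every aspect counts equally.
uniform = {a: 1 / len(aspects) for a in aspects}

# Personalized: "0.9 value 0.1 others", with 0.1 split evenly (assumption).
others = [a for a in aspects if a != "value"]
value_focused = {"value": 0.9, **{a: 0.1 / len(others) for a in others}}

non_personalized = rank_hotels(hotels, uniform)
personalized = rank_hotels(hotels, value_focused)
print(non_personalized[0], "|", personalized[0])
```

With uniform weights the Eurostars Grand Marina Hotel ranks first; with the value-focused weights the Grand Mirage Resort moves to the top because its inferred value rating is higher.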

Application Example 3: Prediction of Stock Market Predicted Values of Real World Variables Optimal Decision Making Market volatility Stock trends, … Predictive Model Multiple Predictors (Features) … Stock traders Sensor 1 Sensor k … Non-Text Data Text Joint Mining of Non-Text and Text Real World Events in Real World

Text Mining for Understanding Time Series [Kim et al. CIKM’13]
Figure labels: TextScope; Dow Jones Industrial Average [Source: Yahoo Finance]; Time
What might have caused the stock market crash? Sept 11 attack!
Any clues in the companion news stream?
Considering external factors brings additional challenges. For example, the Dow Jones Industrial Average, an average of stock prices, crashed in September 2001, when the September 11 terrorist attack happened. The need to analyze text data jointly with external temporal factors like this is increasing. If we could automatically find the reasons why such a change happened, it would be very useful.
H. Kim, M. Castellanos, M. Hsu, C. Zhai, T. A. Rietz, D. Diermeier. Mining causal topics in text data: iterative topic modeling with time series feedback, Proceedings of ACM CIKM 2013, pp. 885-890, 2013.

A General Framework for Causal Topic Modeling [Kim et al. CIKM’13]
Inputs: Text Stream (Sep 2001, Oct 2001, …) and Non-text Time Series
Topic Modeling: Topic 1, Topic 2, Topic 3, Topic 4
Causal Topics: Topic 1, Topic 2, Topic 3, Topic 4
Zoom into Word Level (Causal Words): Topic 1: W1 +, W2 --, W3 +, W4 --, W5 …
Split Words: Topic 1-1: W1 +, W3 +; Topic 1-2: W2 --, W4 --
Feedback as Prior
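The topic-level and word-level steps of this framework can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: Pearson correlation stands in for whatever causality measure the framework uses, the threshold `min_abs_corr` and all function names are hypothetical, and the series are toy data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def select_causal_topics(topic_series, external, min_abs_corr=0.8):
    """Keep topics whose coverage over time tracks the external series."""
    return [topic for topic, series in topic_series.items()
            if abs(pearson(series, external)) >= min_abs_corr]

def split_words(word_series, external, min_abs_corr=0.8):
    """Split a topic's words into positively and negatively correlated groups."""
    positive, negative = [], []
    for word, series in word_series.items():
        r = pearson(series, external)
        if r >= min_abs_corr:
            positive.append(word)
        elif r <= -min_abs_corr:
            negative.append(word)
    return positive, negative

external = [1.0, 2.0, 3.0, 4.0]  # toy non-text time series
topics = {
    "Topic 1": [2.0, 4.0, 6.0, 8.0],  # rises with the external series
    "Topic 2": [1.0, 3.0, 1.0, 3.0],  # only weakly related
}
words = {
    "W1": [1.0, 2.0, 3.0, 5.0],  # moves with the external series
    "W2": [4.0, 3.0, 2.0, 1.0],  # moves against it
    "W5": [1.0, 3.0, 1.0, 3.0],  # weakly related, left out of both groups
}

causal = select_causal_topics(topics, external)
positive, negative = split_words(words, external)
print(causal, positive, negative)
```

The two word groups play the role of Topic 1-1 and Topic 1-2 on the slide; in the framework they are fed back as a prior for the next round of topic modeling.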

Heuristic Optimization of Causality + Coherence

Stock-Correlated Topics in New York Times: June 2000 ~ Dec. 2011
Time series: AAMRQ (American Airlines), AAPL (Apple)
Topic words: russia russian putin europe european germany bush gore presidential police court judge airlines airport air united trade terrorism food foods cheese nets scott basketball tennis williams open awards gay boy moss minnesota chechnya paid notice st russia russian europe olympic games olympics she her ms oil ford prices black fashion blacks computer technology software internet com web football giants jets japan japanese plane …
Topics are biased toward each time series
Hyun Duk Kim, Malu Castellanos, Meichun Hsu, ChengXiang Zhai, Thomas A. Rietz, Daniel Diermeier. Mining causal topics in text data: iterative topic modeling with time series feedback, Proceedings of the 22nd ACM international conference on Information and knowledge management (CIKM ’13), pp. 885-890, 2013.

“Causal Topics” in 2000 Presidential Election
Text: NY Times (May 2000 - Oct. 2000)
Time Series: Iowa Electronic Market http://tippie.uiowa.edu/iem/
Top Three Words in Significant Topics from NY Times:
tax cut 1
screen pataki guiliani
enthusiasm door symbolic
oil energy prices
news w top
pres al vice
love tucker presented
partial abortion privatization
court supreme abortion
gun control nra
Issues known to be important in the 2000 presidential election

Retrieval with Time Series Query [Kim et al. ICTIR’13]
News 2000 2001 …
RANK | DATE | EXCERPT
1 | 9/29/2000 | Expect earning will be far below
2 | 12/8/2000 | $4 billion cash in company
3 | 10/19/2000 | Disappointing earning report
4 | 4/19/2001 | Dow and Nasdaq soar after rate cut by Federal Reserve
5 | 7/20/2001 | Apple's new retail store
…
Hyun Duk Kim, Danila Nikitin, ChengXiang Zhai, Malu Castellanos, and Meichun Hsu. 2013. Information Retrieval with Time Series Query. In Proceedings of the 2013 Conference on the Theory of Information Retrieval (ICTIR '13),

Outlook & Challenges: A General TextScope to Support Many Different Applications Task Panel TextScope … Topic Analyzer Opinion Prediction Event Radar Search Box Microsoft (MSFT), Google, IBM (IBM) and other cloud-computing rivals of Amazon Web Services are bracing for an AWS "partnership" announcement with VMware expected to be announced Thursday. … Medical & Health MyFilter1 MyFilter2 … E-COM Select Time Stocks Select Region My WorkSpace Project 1 Many other users, including Chatbots… Alert A Alert B ...

Beyond TextScope: Intelligent Task Agent Interactive information retrieval Natural language processing Interactive text analysis Text + Non-Text Learning to interact Prediction Domain Knowledge ... Intelligent Task Agents Learning to explore Learning to collaborate … Predicted Values of Real World Variables Optimal Decision Making Predictive Model Multiple Predictors (Features) … Sensor 1 Sensor k … Non-Text Data Text Joint Mining of Non-Text and Text Real World

Future Intelligent Information Systems Task Support Full-Fledged Text Info. Management Mining Access Search Current Search Engine Keyword Queries Bag of words Search History Entities-Relations Personalization (User Modeling) Large-Scale Semantic Analysis (Vertical Search Engines) Complete User Model Knowledge Representation

Outline Overview of Natural Language Processing (NLP) Main Techniques for Text Retrieval and Mining Selected Projects in Text Retrieval and Mining Applications of Text Retrieval and Mining in Software Engineering Summary

The Data-User-Service (DUS) Triangle Software Developers Managers Software Users … Users Data Search Browsing Mining Task support, … Requirements Documentations Comments Literature Email … Services

Many Ways to Connect the DUS Triangle … Software Developers Project Managers Software Users Software Researcher Researcher’s workbench Requirement Search Productivity analyzer Just-in-time Advisor Documentation service agent Requirements Specification Documentation Team Communications Bug Reports Forums Literature … … Task/Decision support Search Browsing Mining

Summary
Text retrieval and text mining are two general techniques for building text data applications, usually based on statistical language models and machine learning.
There are great opportunities for applying text retrieval and mining to software engineering, in support of:
  text data access: push and/or pull (querying + browsing)
  text data analysis: lexical analysis, text clustering, text categorization, topic analysis, text-based prediction
Humans generally should be in the loop for text data applications; optimization of human-machine collaboration is important.

Acknowledgments Collaborators: Many! Funding

Thank You! Questions/Comments? Looking forward to opportunities for collaboration! More information can be found at http://timan.cs.uiuc.edu/