
Mining Multi-Faceted Overviews of Arbitrary Topics in a Text Collection Xu Ling, Qiaozhu Mei, ChengXiang Zhai, Bruce Schatz (KDD'08) Speaker: Hsu, Yi Ling Advisor: Dr. Koh, Jia-Ling Date: 07/21/2009

Outline Introduction Problem Formulation Method Experiment and Result Conclusion

Introduction Mining and extracting information from a text collection to satisfy ad hoc information needs is a common task in many applications. –The gene summarization task in biomedical literature –The car review mining task for online customer reviews

Definition Definition 1 (Document): We define a document d in a text collection C as a sequence of words d = {w_1, w_2, ..., w_{|d|}}. We use c(w, d) to denote the number of occurrences of word w in d.

Definition Definition 2 (Facet Model): A facet model Θ in a text collection C is a multinomial distribution over words {p(w|Θ)}_{w∈V}, which represents a semantic facet.

Definition Definition 3 (Multi-faceted Overview): A multi-faceted overview of a topic is a semi-structured summary of all information about the queried topic.

Definition Multi-Faceted Overview Mining (MuFOM): given an ad hoc topic and a few user-specified keywords about the facets they are interested in, the goal is to generate a semi-structured overview of the query topic and present it with the user-specified facets.

The problem as defined above is challenging in many ways. First, we cannot rely on training examples to mine multi-faceted overviews of arbitrary topics. The reasons are twofold: –1) it is impossible to create training examples for all ad hoc topics and facets; –2) it usually places too much burden on a user to label training examples for the facets they are interested in. Therefore, a reasonable solution should not rely on any supervised-learning method.

Second, we instead assume that a user can specify facets by providing a few keywords. How to model and discriminate among different facets using this limited information is not straightforward.

MINING MULTIPLE FACETS OF ARBITRARY TOPICS Facet Initialization –a facet initialization module that initializes the representation of the user-specified facets Facet Modeling –a facet modeling module that extracts and models the facets via statistical topic modeling.

Facet Initialization A facet can often be described in many different ways. –In biomedical literature, the facet about genetic interaction can be described using different verbs such as “regulate”, “inhibit”, “promote” and “enhance.” –When writing a review about the interior design features of a car, the reviewer might describe it with different terms such as “air condition”, “seat”.

Facet Initialization [cont.] We build a term graph G = (V, E). V: each node v in V is a term. E: an edge e = (v_i, v_j) ∈ E indicates a similarity relationship between two terms. We weight e with the pointwise mutual information MI(i, j) = log( p(v_i, v_j) / (p(v_i) p(v_j)) ), for i ≠ j.

Facet Initialization [cont.] Deg(w) = Σ_{w'∈V} MI(w, w'); N(w) denotes the maximum-weight neighbor of node w.
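A minimal sketch of how such an MI-weighted term graph and the resulting facet keyword expansion could be implemented. The toy corpus, the document-level co-occurrence counting, and the expand_facet helper are illustrative assumptions, not the paper's exact procedure.

```python
import math
from collections import Counter
from itertools import combinations

def build_mi_graph(docs):
    """Term graph weighted by pointwise mutual information.

    docs: list of documents, each a list of tokens.
    Returns a dict mapping (term_i, term_j) -> MI(i, j).
    Co-occurrence is counted at the document level (an assumption;
    sentence- or window-level counts would work the same way).
    """
    n_docs = len(docs)
    term_df = Counter()   # number of documents containing each term
    pair_df = Counter()   # number of documents containing each term pair
    for doc in docs:
        terms = set(doc)
        term_df.update(terms)
        pair_df.update(combinations(sorted(terms), 2))

    mi = {}
    for (vi, vj), df_ij in pair_df.items():
        p_ij = df_ij / n_docs
        p_i = term_df[vi] / n_docs
        p_j = term_df[vj] / n_docs
        mi[(vi, vj)] = math.log(p_ij / (p_i * p_j))
    return mi

def expand_facet(keywords, mi, top_k=50):
    """Expand a facet's keyword set with its top-k MI neighbors."""
    scores = Counter()
    for (vi, vj), weight in mi.items():
        if vi in keywords and vj not in keywords:
            scores[vj] += weight
        elif vj in keywords and vi not in keywords:
            scores[vi] += weight
    return list(keywords) + [t for t, _ in scores.most_common(top_k)]

# Toy usage: expand the seed keyword of the Genetical Interaction facet.
docs = [["gene", "a", "regulates", "gene", "b"],
        ["protein", "product", "of", "gene", "a"],
        ["mutation", "causes", "lethal", "phenotype"]]
mi_graph = build_mi_graph(docs)
print(expand_facet({"regulates"}, mi_graph, top_k=5))
```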

Facet Modeling Statistical topic modeling is quite effective for mining topics in a text collection. Probabilistic Latent Semantic Analysis (PLSA) is a commonly used topic model. In this model, the log-likelihood of a collection C is defined as:
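The equation image from this slide is not preserved in the transcript. Below is a hedged reconstruction following the standard PLSA-style mixture with a background model used in this line of work, where θ_1, ..., θ_k are the facet models, θ_B is a background model with fixed weight λ_B, and π_{d,j} are document-specific facet mixing weights; the exact form in the paper may differ.

```latex
\log p(C) = \sum_{d \in C} \sum_{w \in V} c(w,d)\,
  \log \Big[ \lambda_B \, p(w \mid \theta_B)
  + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \, p(w \mid \theta_j) \Big]
```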

Facet Modeling [cont.] The facet models are estimated with the EM algorithm, alternating an E-step and an M-step.
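The E-step and M-step formulas were images in the original slides; the following is a hedged reconstruction consistent with the mixture log-likelihood above, where p(z_{d,w} = j) is the posterior probability that word w in document d was generated by facet j.

```latex
\text{E-step:}\quad
p(z_{d,w} = j) =
  \frac{(1-\lambda_B)\, \pi_{d,j}\, p(w \mid \theta_j)}
       {\lambda_B\, p(w \mid \theta_B)
        + (1-\lambda_B) \sum_{j'} \pi_{d,j'}\, p(w \mid \theta_{j'})}

\text{M-step:}\quad
\pi_{d,j} \propto \sum_{w \in V} c(w,d)\, p(z_{d,w} = j),
\qquad
p(w \mid \theta_j) \propto \sum_{d \in C} c(w,d)\, p(z_{d,w} = j)
```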

Facet Modeling [cont.] Prior-based approach: Note that a user does not have to give keywords for every facet; indeed, if a user has not given keywords for the j-th facet, we can set μ_j = 0, and the j-th facet will be discovered as an additional facet beyond those specified by the user.
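A hedged sketch of how such a keyword prior is typically incorporated: a conjugate (Dirichlet) prior built from the user keywords, with strength μ_j, turns the M-step for facet j into a MAP update. Here θ'_j denotes the prior word distribution constructed from the keywords of facet j; this is the usual recipe in this family of models, not necessarily the paper's exact equation.

```latex
p^{(\mathrm{new})}(w \mid \theta_j) =
  \frac{\sum_{d \in C} c(w,d)\, p(z_{d,w} = j) \;+\; \mu_j\, p(w \mid \theta'_j)}
       {\sum_{w' \in V} \sum_{d \in C} c(w',d)\, p(z_{d,w'} = j) \;+\; \mu_j}
```

With μ_j = 0 this reduces to the plain M-step, which matches the remark above that facets without user keywords are discovered freely.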

Facet Modeling [cont.] Regularizer-based approach: L(C) is the log-likelihood of the collection C defined by Eqn. 2.

Facet Modeling [cont.] Regularizer-based approach: R(C) is a regularizer defined on the collection C according to document similarity; R_T(C_T) is the regularizer defined on the "training" document sets C_T based on the facet distributions (u and v denote documents).
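The combined objective shown on the original slide is not preserved in this transcript. A hedged sketch of a typical graph-regularized objective of this form, where w(u,v) is a similarity weight between documents u and v, f_j(d) is the proportion of facet j in document d, and α, β trade the regularizers off against the likelihood (assumed form, not necessarily the paper's exact equation):

```latex
O(C, C_T) = L(C)
  \;-\; \alpha \sum_{u,v \in C} w(u,v) \sum_{j} \big( f_j(u) - f_j(v) \big)^2
  \;-\; \beta \, R_T(C_T)
```

Maximizing O encourages similar documents to have similar facet distributions while staying faithful to the data likelihood.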


EXPERIMENTS Gene summarization Overview of consumer reviews of cars

EXPERIMENTS Gene summarization –The experiment was done on 19 randomly selected fruit fly genes. –We retrieved 463 sentences relevant to these testing genes from our fruit fly document collection. –We asked an insect biologist to annotate these sentences with the six predefined facets in Table 1 to construct a gold standard.

Table 1. The six predefined facets (name, keywords, definition): –Sequence Information (SI): keywords "sequence", "similarity"; describing the sequence information of the target gene and its product. –Genetical Interaction (GI): keyword "interaction"; describing the genetical interactions of the target gene with other molecules. –Gene Product (GP): keywords "protein", "product"; describing the product (protein, rRNA, etc.) of the target gene. –Expression Location (EL): keyword "expression"; describing where the target gene is mainly expressed. –Mutant Phenotype (MP): keywords "mutation", "phenotype"; describing the information about the mutant phenotypes of the target gene. –Wild-type Function & Phenotypic Information (WFPI): keywords "wild-type", "function"; describing the information about the target gene and its product.

Experiments We evaluate our system in two completely different domains: –Gene summarization –Overview of consumer reviews of cars Overall, the strategy of facet expansion is quite effective, and the regularized topic model approach performs the best among almost all compared methods.

Gene summarization Basically, for each retrieved sentence of the query gene, we rank all the facets based on the sentence-facet relevance score and assign the sentence to the two most relevant facets. Then each facet is summarized by a ranked list of its assigned sentences based on that score.
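A minimal sketch of this assignment step, assuming each facet is already represented as a word distribution and that the sentence-facet relevance score is a simple smoothed log-likelihood; the scoring function and the helper names here are illustrative assumptions, not the paper's exact formulation.

```python
import math
from collections import Counter

def relevance(sentence_tokens, facet_model, smoothing=1e-6):
    """Log-likelihood of a sentence under a facet word distribution.
    facet_model: dict word -> p(w | facet). Illustrative scoring only."""
    counts = Counter(sentence_tokens)
    return sum(c * math.log(facet_model.get(w, smoothing))
               for w, c in counts.items())

def build_facet_summaries(sentences, facet_models, top_facets=2):
    """Assign each sentence to its top_facets most relevant facets,
    then rank the sentences within each facet by the same score."""
    summaries = {name: [] for name in facet_models}
    for sent in sentences:
        tokens = sent.lower().split()
        scored = sorted(((relevance(tokens, model), name)
                         for name, model in facet_models.items()),
                        reverse=True)
        for score, name in scored[:top_facets]:
            summaries[name].append((score, sent))
    return {name: [s for _, s in sorted(pairs, reverse=True)]
            for name, pairs in summaries.items()}

# Toy usage with two hypothetical facet models.
facets = {"GP": {"protein": 0.3, "product": 0.3, "encode": 0.2},
          "MP": {"mutation": 0.3, "phenotype": 0.3, "lethal": 0.2}}
sents = ["the gene encodes a membrane protein product",
         "a lethal mutation produces a wing phenotype"]
print(build_facet_summaries(sents, facets))
```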

Gene summarization We retrieved PubMed abstracts about the fruit fly as our document collection by matching the keyword "Drosophila melanogaster" in the MeSH field. We used the Lemur Toolkit to implement the system. Adopting the six facets, our system started from the facet keywords and estimated facet models based on the entire collection. Different methods are evaluated based on their final generated gene summaries. The experiment was done on 19 randomly selected fruit fly genes. We retrieved 463 sentences relevant to these testing genes from our fruit fly document collection and asked an insect biologist to annotate these sentences with the predefined six facets to construct a gold standard.

Gene summarization A sentence is assigned a facet label if and only if it contains information on this facet, regardless of whether it contains any extra information. To study how different methods affect the final generated summary, we evaluated them based on the precision of the best five sentences for each facet separately. The results are shown in Tables 2 and 4.
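Concretely, the reported number for a facet f is precision at 5: the fraction of the top 5 sentences returned for f whose gold-standard labels include f (a hedged formalization of the description above).

```latex
P@5(f) = \frac{1}{5} \sum_{i=1}^{5} \mathbf{1}\big[\, f \in \mathrm{labels}(s_i) \,\big]
```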

Gene summarization We fixed the facet model estimation step to the regularized approach (α = 0.1, β = 0.5, and the top 5 "training" documents per facet) and measured, using the human-annotated gold standard, the results for 5 different levels of facet expansion.

Gene summarization Column UpBd indicates the upper-bound scores, as some testing genes with relatively few references do not have 5 sentences per facet in our gold-standard annotations. With more expansion of the facet representation, the generated summary achieves a better score.

Gene summarization We show the top 10 words of each facet model after 10 iterations of expansion with 50 MI neighbors. (Due to space limits, the probabilities of these terms are not shown.) In facet GP, we see terms like "encode" now ranked very high; this is actually a very informative term indicating gene product information. In facet MP, terms like "allele," "defect," and "lethal," which carry important information, are now expanded into the original facet.

Gene summarization We evaluated the effectiveness of different facet modeling approaches in Table 4. Pri and Reg are our prior-based and regularizer-based approaches with the most extensive facet expansion; Sup represents the result of the system by [11] using a supervised approach; MQR casts this task as a multi-query retrieval problem, treating the original query and the keywords of each facet as independent queries to generate the final summary; MQR+FB is a variation of MQR with pseudo feedback. [11] X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai, and B. Schatz. Automatically generating gene summaries from biomedical literature. In Proceedings of PSB '06, pages 41–50.


Overview of consumer reviews of cars We crawled consumers' reviews from edmunds.com on 15 car models such as "chevrolet, malibu, 2006" and "honda, accord, 2006" as our document collection (1156 reviews in total). Among these queries, 12 have comprehensive editor reviews, which are used as our gold-standard overviews for evaluation. We applied different methods to generating overviews of the consumer reviews and evaluated the final performance using ROUGE. The generated overviews consist of the ten best sentences per facet. We evaluated different methods with all the metrics provided by ROUGE and report the ROUGE-1 average recall (R) score, averaged over all 12 test queries, in Table 7 (performance on the other metrics is consistent and not presented here).
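For reference, ROUGE-1 recall (the "Average R" score) measures unigram overlap between a generated overview and the reference editor review:

```latex
\mathrm{ROUGE\text{-}1}_{R} =
  \frac{\sum_{w} \min\big(\mathrm{count}_{\mathrm{gen}}(w),\ \mathrm{count}_{\mathrm{ref}}(w)\big)}
       {\sum_{w} \mathrm{count}_{\mathrm{ref}}(w)}
```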

Overview of consumer reviews of cars The top words of each facet model, expanded by one iteration of adjustment through 10 MI neighbors, are displayed in Table 6. In the facet Powertrains & Performance (PP), we see that a term like "economy" is ranked very high; it is actually a very informative term indicating the fuel economy of the car. In the facet Driving Impressions (DI), terms like "drive" and "seat," indicating driving experience, are now expanded into the original model.

Overview of consumer reviews of cars The regularizer-based method performed best for all five facets. Consistent with the above experiments on the gene summarization task, both of our proposed methods (Pri and Reg) performed better than the multi-query retrieval methods.

Overview of consumer reviews of cars We also present one example of our generated overview (with the top 2 sentences per facet) for the query "honda, accord, 2006" and its corresponding editor's review in Table 8. Our extracted sentences for the interior design facet match the editor's review very well.

Overview of consumer reviews of cars In this case, the user does not want to discriminate between interior and exterior design features. The returned sentences about the exterior and interior are all assigned to the facet "Design." In the "Finance" facet, the sentences about price are extracted.

Conclusion The proposed approach is general and unsupervised. –It does not consider the removal of redundant information in the generated overviews. –The user-specified facets might lie at different granularities. –There are many types of online resources that could be utilized to improve the facet modeling.