WIMS 2014, Thessaloniki, June 2014
A soft frequent pattern mining approach for textual topic detection
Georgios Petkos, Symeon Papadopoulos, Yiannis Kompatsiaris (CERTH-ITI), Luca Aiello (Yahoo Labs), Ryan Skraba (Alcatel-Lucent Bell Labs)

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #2 Overview
Motivation: classes of topic detection methods, degree of co-occurrence patterns considered.
Beyond pairwise co-occurrence analysis:
– Frequent Pattern Mining.
– Soft Frequent Pattern Mining.
Experimental evaluation.
Conclusions and future work.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #3 Classes of textual topic detection methods
Probabilistic topic models:
– Learn the joint probability distribution of topics and terms and perform inference on it (e.g. LDA).
Document-pivot methods:
– Cluster together documents; each group of documents is a topic (e.g. incremental clustering based on cosine similarity over a tf-idf representation; see the sketch after this list).
Feature-pivot methods:
– Cluster together terms based on their co-occurrence patterns; each group of terms is a topic (e.g. the graph-based feature-pivot method).
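For illustration only, a toy sketch of the document-pivot idea named above (incremental clustering over a tf-idf/cosine representation), assuming scikit-learn is available; the similarity threshold is an arbitrary choice, not a value from any of the cited methods:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def doc_pivot_clusters(docs, threshold=0.5):
    """Assign each document to the best-matching existing cluster, or start
    a new one, based on tf-idf cosine similarity to cluster centroids."""
    X = TfidfVectorizer().fit_transform(docs).toarray()
    clusters, centroids = [], []  # centroids[j]: mean vector of cluster j
    for i, vec in enumerate(X):
        vec = vec[None, :]  # shape (1, n_features)
        sims = [cosine_similarity(vec, c)[0, 0] for c in centroids]
        if sims and max(sims) >= threshold:
            j = int(np.argmax(sims))
            clusters[j].append(i)
            centroids[j] = X[clusters[j]].mean(axis=0, keepdims=True)
        else:
            clusters.append([i])
            centroids.append(vec)
    return clusters  # each cluster of document indices is read as one topic
```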

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #4 Feature-pivot methods: degree of co-occurrence (1/2)
We focus on feature-pivot methods and examine the effect of the “degree” of the examined co-occurrence patterns on the term clustering procedure.
Let us consider the following topics:
– Romney wins Virginia
– Romney appears on TV and thanks Virginia
– Obama wins Vermont
– Romney congratulates Obama
Key terms (e.g. Romney, Obama, Virginia, Vermont, wins) co-occur with more than one other key term.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #5 Feature-pivot methods: degree of co-occurrence (2/2)
Previous approaches typically examined only “pairwise” co-occurrence patterns.
In the case of closely related topics, such as the above, examining only pairwise co-occurrence patterns may lead to mixed topics.
We propose to examine the simultaneous co-occurrence patterns of a larger number of terms.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #6 Beyond pairwise co-occurrence analysis: FPM
Frequent Pattern Mining: a family of algorithms (e.g. Apriori, FP-Growth) that can be used to find groups of items that co-occur frequently.
Not a new approach for textual topic detection.
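To make the FPM idea concrete, here is a brute-force toy example of frequent itemset mining over tokenized documents (a sketch of the concept only; Apriori and FP-Growth compute the same result far more efficiently, and the documents and support threshold below are invented):

```python
from itertools import combinations
from collections import Counter

docs = [
    {"romney", "wins", "virginia"},
    {"romney", "thanks", "virginia"},
    {"obama", "wins", "vermont"},
    {"romney", "congratulates", "obama"},
]
min_support = 2  # a pattern must appear in at least 2 documents

# Count the support of every term pair and triple (brute force; real
# FPM algorithms prune the search space instead of enumerating it).
support = Counter()
for doc in docs:
    for size in (2, 3):
        for itemset in combinations(sorted(doc), size):
            support[itemset] += 1

frequent = {s: c for s, c in support.items() if c >= min_support}
print(frequent)  # {('romney', 'virginia'): 2}
```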

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #7 Beyond pairwise co-occurrence analysis: SFPM
Frequent Pattern Mining is strict, in that it looks for sets of terms all of which frequently co-occur at the same time.
It may therefore be able to surface only topics with a very small number of terms, i.e. very coarse topics.
Can we formulate an algorithm that lies between the two ends of the spectrum, i.e. one that looks at co-occurrence patterns of degree higher than two but is not as strict as a typical FPM algorithm?

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #8 SFPM
1. Term selection.
2. Co-occurrence vector formation.
3. Post-processing.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #9 SFPM–Step 1: Term selection
Select the top K terms that will enter the clustering procedure. There are different options for doing this, for instance:
– Select the most frequent terms.
– Select the terms that exhibit the most “bursty” behaviour (if we are considering temporal processing).
In our experiments we select the terms that are most “unusually frequent”, i.e. those whose observed frequency is much higher than expected (e.g. with respect to a reference corpus).
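A minimal sketch of one such selection rule, assuming a simple smoothed frequency ratio between the examined corpus and a reference corpus (the scoring function and the smoothing constant are illustrative, not the paper's exact formula):

```python
from collections import Counter

def top_k_unusual_terms(current_docs, reference_docs, k=10, smooth=1.0):
    """Rank terms by how much more frequent they are in the current corpus
    than in a reference corpus; both arguments are lists of token lists."""
    cur = Counter(t for doc in current_docs for t in doc)
    ref = Counter(t for doc in reference_docs for t in doc)
    n_cur, n_ref = sum(cur.values()), sum(ref.values())

    def score(term):
        p_cur = (cur[term] + smooth) / (n_cur + smooth)
        p_ref = (ref[term] + smooth) / (n_ref + smooth)
        return p_cur / p_ref  # > 1 means "unusually frequent" right now

    return sorted(cur, key=score, reverse=True)[:k]
```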

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #10 SFPM–Step 2: Co-occurrence vector formation (1/4)
The heart of the SFPM approach. Notation:
– n: the number of documents in the collection.
– S: a set of terms, representing a topic.
– D_S: a vector of length n whose i-th element indicates how many of the terms in S occur in the i-th document.
– D_t: a binary vector of length n whose i-th element indicates whether the term t occurs in the i-th document.
The vector D_t of a term t that frequently co-occurs with the terms in S will have high cosine similarity with D_S.
Idea of the algorithm: greedily expand S with the term t whose D_t best matches D_S.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #11 SFPM–Step 2: Co-occurrence vector formation (2/4)

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #12 SFPM–Step 2: Co-occurrence vector formation (3/4)
We need a stopping criterion for the expansion procedure. If it is not set properly, the set of terms may end up too small (i.e. the topic may be too coarse) or it may end up being a mixture of topics.
We use a threshold on the cosine similarity of D_S and D_t, and we adapt this threshold dynamically to the size of S. In particular, we use a threshold that is a sigmoid function of |S|.
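One standard sigmoid family that matches this description, written with an assumed midpoint parameter b and slope parameter c (not the paper's exact parameterization), is:

```latex
\theta(|S|) = \frac{1}{1 + e^{-(|S| - b)/c}}
```

Under this parameterization the threshold rises towards 1 as |S| grows, so adding further terms to an already large set requires increasingly strong co-occurrence evidence.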

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #13 SFPM–Step 2: Co-occurrence vector formation (4/4)
We run the expansion procedure many times, each time starting from a different term.
Additionally, to avoid having less important terms dominate D_S and the cosine similarity, at the end of each expansion step we zero out the entries of D_S with a value smaller than |S|/2.
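Putting slides #10–#13 together, here is a minimal runnable sketch of the expansion step, assuming numpy; the sigmoid parameters, the maximum set size and the other concrete values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def sigmoid_threshold(set_size, b=4.0, c=1.0):
    """Size-adaptive cosine threshold: a sigmoid of |S| (b, c illustrative)."""
    return 1.0 / (1.0 + np.exp(-(set_size - b) / c))

def expand_topic(seed_term, terms, doc_term, max_size=10):
    """Greedily grow a topic S around seed_term.

    terms:    list of the K selected candidate terms.
    doc_term: (n_docs x n_terms) binary matrix; column j is the occurrence
              vector D_t of terms[j].
    """
    idx = {t: j for j, t in enumerate(terms)}
    S = {seed_term}
    D_S = doc_term[:, idx[seed_term]].astype(float)
    while len(S) < max_size:
        best_term, best_sim = None, -1.0
        for t in terms:
            if t in S:
                continue
            D_t = doc_term[:, idx[t]]
            sim = float(D_S @ D_t) / (
                np.linalg.norm(D_S) * np.linalg.norm(D_t) + 1e-12)
            if sim > best_sim:
                best_term, best_sim = t, sim
        # Stop expanding once the best candidate falls below the
        # size-adaptive threshold.
        if best_term is None or best_sim < sigmoid_threshold(len(S)):
            break
        S.add(best_term)
        D_S += doc_term[:, idx[best_term]]
        # Zero out entries of D_S below |S|/2, so that documents containing
        # only a few of the terms in S do not dominate the cosine similarity.
        D_S[D_S < len(S) / 2.0] = 0.0
    return S
```

Running expand_topic once per selected term and then de-duplicating the resulting sets (next slide) gives the complete pipeline.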

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #14 SFPM–Step 3: Post-processing
Because we run the expansion procedure many times, each time starting from a different term, we may end up with a large number of duplicate topics.
In the final step of the algorithm we therefore filter out duplicate topics, by considering the Jaccard similarity between the sets of keywords of the produced topics.
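A minimal sketch of such a filtering pass; the 0.5 similarity cut-off is an assumed value, not one taken from the paper:

```python
def jaccard(a, b):
    """Jaccard similarity between two keyword sets."""
    return len(a & b) / len(a | b)

def deduplicate(topics, max_sim=0.5):
    """Keep a topic only if it is not too similar to one already kept."""
    kept = []
    for topic in topics:
        if all(jaccard(topic, other) <= max_sim for other in kept):
            kept.append(topic)
    return kept
```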

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #15 SFPM – Overview

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #16 Evaluation – Datasets and evaluation metrics
Three datasets collected from Twitter:
– Super Tuesday: 474,109 documents.
– F.A. Cup final: 148,652 documents.
– U.S. presidential elections: 1,247,483 documents.
For each of them, a number of ground-truth topics were determined by examining the relevant stories that appeared in the mainstream media. Each topic is represented by a set of mandatory terms, a set of forbidden terms (so that we make sure that closely related topics are not merged) and a set of optional terms.
We evaluate:
– Topic recall.
– Keyword recall.
– Keyword precision.
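Under the natural reading of the slide, a detected topic matches a ground-truth topic when it contains all mandatory terms and none of the forbidden ones; a sketch of topic recall computed this way (the matching rule is inferred from the slide, not quoted from the paper):

```python
def matches(detected, ground_truth):
    """detected: set of keywords; ground_truth: dict with 'mandatory'
    and 'forbidden' term sets."""
    return (ground_truth["mandatory"] <= detected
            and not (ground_truth["forbidden"] & detected))

def topic_recall(detected_topics, ground_truth_topics):
    # Fraction of ground-truth topics matched by at least one detected topic.
    hit = sum(any(matches(d, g) for d in detected_topics)
              for g in ground_truth_topics)
    return hit / len(ground_truth_topics)
```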

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #17 Evaluation – Competing methods
– A classic probabilistic method: LDA.
– A graph-based feature-pivot approach that examines only pairwise co-occurrence patterns.
– FPM, using the FP-Growth algorithm.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #18 Evaluation – Results: topic recall
Evaluated for different numbers of returned topics.
SFPM achieves the highest topic recall in all three datasets.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #19 Evaluation – Results: keyword recall
SFPM achieves the highest keyword recall in all three datasets.
That is, SFPM not only retrieves more of the target topics than the other methods, but also provides a quite complete representation of those topics.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #20 Evaluation – Results: keyword precision
SFPM nevertheless achieves somewhat lower keyword precision, indicating that some spurious keywords are also included in the detected topics.
FPM, as the strictest method, achieves the highest keyword precision.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #21 Evaluation – Example topics produced

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #22 Conclusions
We started from the observation that, in order to detect closely related topics, a feature-pivot topic detection method should examine co-occurrence patterns of degree larger than two.
We presented an approach, SFPM, that does this, albeit in a soft and controllable manner. It is based on a greedy set-expansion procedure.
We experimentally showed that the proposed approach can indeed improve performance when dealing with corpora containing closely interrelated topics.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #23 Future work
– Experiment with different types of documents.
– Consider the problem of synonyms.
– Examine alternative, more efficient search strategies: e.g. index the D_t vectors (using, for instance, LSH) in order to rapidly retrieve the term that best matches a given set S.

WIMS 2014, Thessaloniki, June 2014 Georgios Petkos #24 Thank you!
Open-source implementation (including a set of other topic detection methods) available at:
Dataset and evaluation resources available at:
Relevant topic detection dataset on which SFPM will be tested:
Questions, comments, suggestions?