Generating Semantic Annotations for Frequent Patterns Using Context Analysis.

Topics Covered
 Need
 Evolution
 What exactly are we looking for?
 Terms and definitions
 Modeling context patterns
 Semantic analysis and pattern annotation
 Experiments and results
 Related work
 Conclusions
 Acknowledgements

Need
 Pattern discovery is a fundamental focus of the data mining task.
 Little information accompanies the generated frequent pattern sets to help the user interpret them.
 A semantic annotation plays a role analogous to the abstract of a paper.

Evolution of Pattern Generation
 Research has moved toward better presentation and interpretation of the discovered frequent patterns.
 Concepts like closed frequent patterns and maximal frequent patterns shrink the size of the frequent pattern set and provide more information than just "support".
 Other parameters, namely "transaction coverage" and "pattern profiles", have been used to summarize frequent patterns.
 Despite the added information, the user still cannot interpret the hidden semantics of a frequent pattern, and still has to go through the entire dataset to check whether a pattern is worth exploring.

What exactly is a Semantic Annotation?
 We take our cue from natural language processing.
 We reason by analogy: what a dictionary entry does for a word, an annotation should do for a pattern.

Example
 Example of a dictionary entry.

Example (cont'd)
 Dictionary entry: pronunciation; definition; examples; synonyms and thesaurus.
 Frequent pattern annotation: context indicators; example transactions; semantically similar patterns.

Example (cont'd)
 What we are finally looking for.

Problem Formulation
 Consider the following notation:
  D: the database
  t: a transaction
  p_alpha: a pattern
  P_D: the set of all patterns in D
 Hence we have D = {t_1, t_2, t_3, ..., t_n}, P_D = {p_1, p_2, ..., p_i}, and
 D_alpha = {t_i | p_alpha ⊆ t_i, t_i ∈ D}.
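The notation above can be sketched in code for itemset patterns. A minimal sketch; the function names are illustrative, not from the paper:

```python
def supporting_transactions(pattern, transactions):
    """D_alpha: the transactions of D that contain pattern p_alpha."""
    p = set(pattern)
    return [t for t in transactions if p.issubset(t)]

def support(pattern, transactions):
    """s(p_alpha) = |D_alpha| / |D|."""
    return len(supporting_transactions(pattern, transactions)) / len(transactions)

D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}]
print(support({"a", "b"}, D))  # 2 of 4 transactions contain {a, b} -> 0.5
```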

Terms and Definitions
 Using this notation we define the following terms:
  Frequent pattern
  Context unit
  Pattern context
  Semantic annotation
  Context modeling
  Transaction extraction
  Semantically Similar Pattern (SSP)
  Semantic Pattern Annotation (SPA)

 Frequent pattern: A pattern p_alpha is frequent in a dataset D if |D_alpha|/|D| >= min_sup, where min_sup is a user-specified threshold; |D_alpha|/|D| is called the support of p_alpha, usually denoted s(p_alpha).
 Context unit: Given a dataset D and the set of frequent patterns P_D, a context unit is a basic object in D which carries semantic information and co-occurs with at least one frequent pattern in at least one transaction.
 Pattern context: Given a dataset D and a frequent pattern p_alpha, the context of p_alpha is represented by a selected set of context units, each of which co-occurs with p_alpha. Each such context unit is also called a context indicator.
 Semantic annotation: Let p_alpha be a frequent pattern in a dataset D, U be the set of context indicators of p_alpha, and P be a set of patterns in D. A semantic annotation of p_alpha consists of: a set of context indicators of p_alpha; a set of representative transactions; and a set of semantically similar patterns.
 Context modeling: Given a dataset D and a set of possible context units U, the problem of context modeling is to select a subset of U, define a strength measure w(.) for context indicators, and construct a context model c(p_alpha) for each given pattern p_alpha.

 Transaction extraction: Given a dataset D, the problem of transaction extraction is to define a similarity measure between a transaction and a pattern context, and to extract a set of key transactions for a frequent pattern.
 Semantically Similar Pattern (SSP) extraction: Given a dataset D and a set of candidate patterns, the problem of SSP extraction is to define a similarity measure between the contexts of two patterns and to extract the k most similar patterns for any frequent pattern.
 Semantic Pattern Annotation (SPA): SPA consists of: selecting context units and defining a weight function for them; designing similarity measures; and extracting significant context indicators, representative transactions, and semantically similar patterns.

Challenges associated with Semantic Pattern Annotation (SPA)
 We have no prior knowledge of how to construct a context model.
 It is unclear how to select context units when the set of possible context units is huge.
 It is not clear how to analyze pattern semantics, so the design of weighting functions and similarity measures is non-trivial.
 Since no training data is available, the learning is totally unsupervised.
 These challenges, however, give SPA flexibility: it does not depend on any specific domain knowledge of the dataset.

Context Modeling
 Vector Space Model (VSM).
 Defining context modeling.
 Generality of context modeling.
 Context unit selection.
 Strength weighting for context units.

Vector Space Model (VSM)
 Widely used in natural language processing.
 Uses "term vectors" and "weight vectors".
 Why use the Vector Space Model?

Context Model definition
 Given a dataset D and a selected set of context units {u_1, ..., u_K}, we represent the context c(p_alpha) of a frequent pattern p_alpha as a vector <w_1, w_2, ..., w_K>, where w_i = w(u_i, p_alpha) and w(., p_alpha) is the weighting function. A transaction t is likewise represented as a vector <v_1, ..., v_K>, where v_i = 1 iff u_i appears in t, and v_i = 0 otherwise.
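A minimal sketch of this vector-space representation, assuming set-valued transactions (the function names are illustrative):

```python
def transaction_vector(transaction, units):
    """Binary transaction vector: v_i = 1 iff context unit u_i appears in t."""
    return [1 if u in transaction else 0 for u in units]

def context_vector(pattern, units, weight):
    """Context model c(p_alpha) = <w_1, ..., w_K> with w_i = weight(u_i, pattern)."""
    return [weight(u, pattern) for u in units]

units = ["u1", "u2", "u3"]
print(transaction_vector({"u1", "u3", "x"}, units))  # [1, 0, 1]
```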

Generality of context modeling
 Cf. "Summarization of Itemset Patterns using Probabilistic Models", Chao Wang & Srinivasan Parthasarathy.

Context Unit Selection
 Definition: a context unit is a minimal unit which holds semantic information in a dataset.
 The choice of a particular unit is task dependent.

Granularity & Redundancy Removal
 Definition: granularity is the level of detail captured within a particular dataset.
 Varying the granularity introduces redundancy.
 Redundancy removal techniques: existing techniques; closed frequent pattern removal; micro-clustering.

Existing Techniques
 Pattern summarization and dimension reduction.
1) Latent semantic analysis, introduced by S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman, "Indexing by latent semantic analysis", Journal of the American Society for Information Science, 41(6):391-407, 1990. However, this is not efficient here: the technique was meant for dimension reduction of high-dimensional datasets, so the dimensions are reduced but the redundancy remains the same.
2) Pattern profiles, introduced by X. Yan, H. Cheng, J. Han, and D. Xin, "Summarizing itemset patterns: a profile-based approach". Too specialized to be of generalized use in our case.

Closed Frequent Patterns
 Looking at closed and maximal frequent patterns.
 Drawbacks of maximal frequent patterns: they lose the support information of their subpatterns.
 Why closed frequent patterns? They are lossless: the support of every frequent pattern can be recovered from them.
 How?
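The closedness check can be sketched directly from the definition: a frequent itemset is closed if no proper superset has the same support. A naive O(n^2) sketch over precomputed supports; names are illustrative:

```python
def closed_patterns(patterns):
    """patterns: dict mapping frozenset -> support count.
    Keep a pattern only if no proper superset has the same support."""
    closed = {}
    for p, s in patterns.items():
        if not any(p < q and s == sq for q, sq in patterns.items()):
            closed[p] = s
    return closed

freq = {
    frozenset("a"): 3, frozenset("b"): 3,
    frozenset("ab"): 3, frozenset("abc"): 2,
}
# {a} and {b} are absorbed by {a,b} (same support); {a,b} and {a,b,c} remain.
print(sorted(("".join(sorted(p)), s) for p, s in closed_patterns(freq).items()))
```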

Micro-clustering
 Need for micro-clustering.
 Jaccard distance.
 Types of micro-clustering: hierarchical micro-clustering; one-pass micro-clustering.

Hierarchical micro-clustering

One-pass micro-clustering

Micro-clustering
 Both algorithms output clusters of frequent itemsets.
 Both further ensure that the distance between any two retained patterns is above a certain threshold.
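One plausible reading of the one-pass variant, sketched with Jaccard distance. The join-first-close-cluster logic is an assumption of this sketch, not necessarily the paper's exact algorithm:

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two itemsets."""
    a, b = set(a), set(b)
    return 1 - len(a & b) / len(a | b)

def one_pass_cluster(patterns, max_dist):
    """Each pattern joins the first cluster whose representative (its
    first member) is within max_dist; otherwise it starts a new cluster."""
    clusters = []  # list of (representative, members)
    for p in patterns:
        for rep, members in clusters:
            if jaccard_distance(p, rep) <= max_dist:
                members.append(p)
                break
        else:
            clusters.append((p, [p]))
    return clusters
```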

Strength Weighting for Context Units
 Weighting functions and the constraints on them.
 Let u_i be a context indicator and p_alpha a frequent pattern. A strength weighting function w(., p_alpha) is good if:
  w(u_i, p_alpha) <= w(p_alpha, p_alpha): the best semantic indicator of p_alpha is p_alpha itself.
  w(u_i, p_alpha) = w(p_alpha, u_i): two patterns are equally strong at indicating the meaning of each other.
  w(u_i, p_alpha) = 0 if u_i and p_alpha appear independently: then u_i cannot indicate the semantics of p_alpha.
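One weighting that satisfies all three constraints is pointwise mutual information: it is symmetric, zero under independence, and maximal when the indicator is the pattern itself. Using it here is an assumption of this sketch, not necessarily the paper's exact measure:

```python
import math

def pmi(u, p, transactions):
    """Pointwise mutual information between the occurrences of u and p."""
    n = len(transactions)
    n_u = sum(1 for t in transactions if set(u) <= t)
    n_p = sum(1 for t in transactions if set(p) <= t)
    n_up = sum(1 for t in transactions if set(u) <= t and set(p) <= t)
    if n_up == 0:
        return float("-inf")  # u and p never co-occur
    return math.log((n_up / n) / ((n_u / n) * (n_p / n)))

# Independent occurrences give weight 0:
D = [{"a"}, {"b"}, {"a", "b"}, {"c"}]
print(round(pmi({"a"}, {"b"}, D), 6))  # P(ab) = 0.25 = P(a) * P(b) -> 0.0
```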

Semantic analysis and Pattern annotation
 Semantic similarity: earlier we introduced the notion that frequent patterns are semantically similar if their contexts are similar. Formally: let p_alpha, p_beta, p_gamma be three frequent patterns in P with context models c(alpha), c(beta), c(gamma), and let sim(c(.), c(.)): V^k x V^k -> R+ be a similarity function over two context vectors. If sim(c(alpha), c(beta)) > sim(c(alpha), c(gamma)), we say that p_alpha is semantically more similar to p_beta than to p_gamma w.r.t. sim(c(.), c(.)).
 The cosine function is widely used to measure the similarity of two vectors. For c(alpha) = <w_1(alpha), ..., w_k(alpha)> and c(beta) = <w_1(beta), ..., w_k(beta)>:
 sim(c(alpha), c(beta)) = (sum_i w_i(alpha) w_i(beta)) / (sqrt(sum_i w_i(alpha)^2) * sqrt(sum_i w_i(beta)^2)).
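The cosine measure above in minimal form:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

print(round(cosine([1, 0, 1], [1, 0, 1]), 6))  # 1.0 (identical directions)
print(cosine([1, 0], [0, 1]))                  # 0.0 (orthogonal)
```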

Extracting Strongest Context Indicators
 Let p_alpha be a frequent pattern and c(p_alpha) its context model, a context vector over a set of context units U = {u_1, u_2, ..., u_K}. As defined earlier, w_i is the weight of u_i, stating how well u_i indicates the semantics of p_alpha.
 The goal of extracting the strongest context indicators is therefore to extract a subset U' of k' context units such that for any u_i in U' and u_j not in U', we have w_i >= w_j.

Extracting Representative Transactions
 Let p_alpha be a frequent pattern, c(p_alpha) its context model, and D = {t_1, ..., t_l} a set of transactions. Our goal is to select the k_t transactions closest to the context under a similarity function:
  represent each transaction as a vector over the same context units;
  use the cosine function to compute its similarity to c(p_alpha);
  rank the transactions by this score in descending order;
  select the top k_t transactions.
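The four steps above can be sketched as follows (a naive implementation that sorts all transactions rather than keeping a top-k heap; names and the example context weights are illustrative):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def representative_transactions(context, transactions, units, k_t):
    """Rank transactions by cosine similarity to the context model c(p_alpha)
    and keep the top k_t."""
    def vec(t):
        return [1 if u in t else 0 for u in units]
    ranked = sorted(transactions, key=lambda t: cosine(context, vec(t)), reverse=True)
    return ranked[:k_t]

units = ["u1", "u2", "u3"]
context = [2.0, 0.0, 1.0]           # c(p_alpha) over the units
D = [{"u2"}, {"u1", "u3"}, {"u1"}]
print(representative_transactions(context, D, units, 1))  # best match: {u1, u3}
```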

Experiments and Results
 To test the proposed framework, we apply the proposed methods and algorithms to three datasets from completely different domains:
  the DBLP dataset;
  matching protein motifs to GO terms;
  matching gene synonyms.

DBLP Dataset
 A subset of the DBLP dataset is used, containing papers from the proceedings of 12 major conferences in databases and data mining.
 Each transaction has two parts: the authors' names and the title of the corresponding paper.
 Two kinds of patterns are considered: frequent co-authorships and frequent title terms.
 The goal of this experiment is to demonstrate the effectiveness of SPA at generating a dictionary-like annotation for frequent patterns.

Experiments
1. On the basis of authors/co-authors.
2. On the basis of the titles of the papers presented.
 For both experiments, closed frequent itemset and closed sequential pattern mining methods were used to generate a set of closed frequent itemsets of authors/co-authors and a set of closed sequential patterns of title terms.
 Krovetz stemming is used to convert the title terms into their root forms.

Details
 We set the minimum support to 10 for frequent itemsets and 4 for sequential patterns, which yields 9926 closed patterns.
 We then use the one-pass micro-clustering algorithm to reduce these to a smaller set of 3443.

Example annotation (I = context indicators, T = representative transactions, S = semantically similar patterns; SSP set = co-authors):

Pattern: {xifeng_yan, jiawei_han}
  I: graph; philip_s_yu; mine close; mine close frequent; index approach; graph pattern; sequential pattern
  T: gspan graph_base substructure pattern mine; mine close relational graph connect constraint; clospan mine close sequential pattern large database
  S: jiawei_han & philip_s_yu; jian_pei & jiawei_han; jianyong_wang; jiong_yang & philip_s_yu & wei_wang

Matching motifs and GO terms
 Task: predicting the functionality of newly discovered protein motifs.
 Gene Ontology (GO).
 The goal is to match each individual motif to the GO terms which best represent its functions.
 Formally: given a set of transactions D (protein sequences with motifs), a set P of frequent patterns in D to be annotated (motifs), and a set of candidate patterns P_C with explicit semantics (GO terms), our goal is, for every p_alpha in P, to find the subset P'_C of P_C which best indicates the semantics of p_alpha.

Details
 We use the same dataset and judgments as T. Tao, C. Zhai, X. Lu, and H. Fang, "A study of statistical methods for function prediction of protein motifs".
 The dataset contains protein sequences with 1097 motifs and 3761 GO terms.
 We use the same performance measure as the above paper (a variant of MRR, Mean Reciprocal Rank) to evaluate the effectiveness of SPA on the motif-GO matching problem.
 We formulate the problem as follows: let G = {g_1, g_2, ..., g_n} be the set of GO terms; given a motif pattern, GO' = {g_1', g_2', ...} ⊆ G is the set of correct GO terms for that pattern. We rank G with the SPA system and pick the top-ranked terms, where each GO term is treated either as a context unit or as a semantically similar pattern to the motif.
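The reciprocal-rank evaluation can be sketched as follows. This is plain MRR; the paper uses a variant, so treat the exact formula as an assumption of the sketch:

```python
def mean_reciprocal_rank(ranked_lists, correct_sets):
    """For each query, take 1/rank of the first correct item in its ranked
    list (0 if none appears), then average over all queries."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, correct_sets):
        rr = 0.0
        for i, item in enumerate(ranked, start=1):
            if item in correct:
                rr = 1.0 / i
                break
        total += rr
    return total / len(ranked_lists)

print(mean_reciprocal_rank([["g2", "g1"], ["g3"]], [{"g1"}, {"g3"}]))  # (0.5 + 1.0) / 2 = 0.75
```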

Matching Gene Synonyms
 In the biomedical literature it is common to refer to the same gene by different names, called gene synonyms.
 Synonyms rarely appear together, but are interchangeable with one another. In this experiment we use SPA to extract SSPs (semantically similar patterns).
 We construct a list of 100 synonyms, randomly selected from the BioCreAtIvE Task 1B corpus, a collection of paper abstracts. We extract all sentences which contain at least one synonym from the list; keeping the support above 3 leaves a list of 41 synonyms.
 We then mix together synonyms belonging to different genes and use the algorithm to extract the matching synonyms.
 As in the previous case, we use MRR to measure the effectiveness of the algorithm.

Results

Related Work
 To our knowledge, the problem of semantic pattern annotation has not been well studied.
 Most frequent pattern mining work focuses on discovering frequent patterns and does not address the problem of post-processing them.
 The works proposed to shrink the size of the pattern set are not efficient at removing redundancy among the discovered patterns.
 None of these works provides more than basic statistical information.
 Recent research develops techniques to approximate and summarize frequent patterns. Although these explore some context information, none of them provides in-depth semantic information.
 Context analysis is quite common in natural language processing, but it focuses on non-redundant word-based contexts, which differ from pattern contexts.
 Although not optimal for those settings, the general methods proposed here can be applied to such tasks as well.

Conclusions
 Existing mining work generates large sets of frequent patterns without providing the information needed to interpret them.
 We propose the novel problem of semantic pattern annotation (SPA): generating semantic annotations for frequent patterns.
 We propose algorithms that exploit context modeling and semantic analysis to generate semantic annotations automatically. The proposed methods are quite general and can handle any type of frequent pattern with context information.
 We evaluated our approach on three different datasets; the results show that our methods can generate semantic annotations effectively.
 As shown, the proposed methods can be applied to many interesting real-world tasks by selecting different context units.
 A major goal for future research is to fully develop the potential of the proposed framework by studying alternative instantiations.

Acknowledgements