1
Generating Semantic Annotations for Frequent Patterns Using Context Analysis
2
Topics Covered
Need
Evolution
What exactly are we looking for?
Terms and definitions
Modeling context patterns
Semantic analysis and pattern annotation
Experiments and results
Related work
Conclusions
Acknowledgements
3
Need
Interpreting discovered patterns is a fundamental part of the data mining task.
The generated frequent pattern sets carry little information beyond the patterns themselves.
A semantic annotation is analogous to the abstract of a paper.
4
Evolution of Pattern Generation
Research has moved towards the presentation and interpretation of discovered frequent patterns.
Concepts like closed frequent patterns and maximal frequent patterns shrink the size of the frequent pattern set and provide more information than just "support".
Other parameters, namely "transaction coverage" and "pattern profiles", have been used to summarize frequent patterns.
In spite of the added information, the user still cannot interpret the hidden semantics of a frequent pattern, and still has to go through the entire dataset to check whether a pattern is worth exploring.
5
What exactly is a Semantic Annotation?
A cue from natural language processing.
Thinking by analogy: compare a pattern annotation to a dictionary entry.
6
Example
An example dictionary entry.
7
Example (cont'd)
Dictionary entry: pronunciation, definition, examples, synonyms and thesaurus.
Frequent pattern annotation: context indicators, example transactions, semantically similar patterns.
8
Example (cont’d) What we are finally looking for
9
Problem Formulation
Consider the following notation:
D: the database
t: a transaction
p_α: a pattern
P_D: the set of all patterns in D
Hence we have:
D = {t_1, t_2, t_3, ..., t_n} and P_D = {p_1, p_2, ..., p_i}
D_α = {t_i | p_α ⊆ t_i, t_i ∈ D}
10
Terms and Definitions
Using this notation we define the following terms:
Frequent pattern
Context unit
Pattern context
Semantic annotation
Context modeling
Transaction extraction
Semantically similar pattern (SSP)
Semantic pattern annotation (SPA)
11
Frequent pattern: A pattern p_α is frequent in a dataset D if |D_α| / |D| >= min_sup, where min_sup is a user-specified threshold; |D_α| / |D| is called the support of p_α, usually denoted s(p_α).
Context unit: Given a dataset D and the set of frequent patterns P_D, a context unit is a basic object in D which carries semantic information and co-occurs with at least one p_α ∈ P_D in at least one transaction.
Pattern context: Given a dataset D and a frequent pattern p_α, the context of p_α is represented by a selected set of context units, each of which co-occurs with p_α. Each such context unit is also called a context indicator.
Semantic annotation: Let p_α be a frequent pattern in a dataset D, U be the set of context indicators of p_α, and P be a set of patterns in D. A semantic annotation of p_α consists of: a set of context indicators of p_α; a set of representative transactions; a set of semantically similar patterns.
Context modeling: Given a dataset D and a set of possible context units U, the problem of context modeling is to select a subset of U, define a strength measure w(·, p_α) for context indicators, and construct a context model c(p_α) for each given pattern p_α.
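To make the notation concrete, here is a minimal Python sketch (all names illustrative, not from the paper's implementation) that computes D_α and the support s(p_α) over a toy database of item sets:

```python
def supporting_transactions(pattern, database):
    """D_alpha = {t in D | pattern is contained in t}."""
    return [t for t in database if pattern <= t]  # '<=' tests subset on sets

def support(pattern, database):
    """s(p_alpha) = |D_alpha| / |D|."""
    return len(supporting_transactions(pattern, database)) / len(database)

# Toy database: each transaction is a set of items.
D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "b", "d"}]
p_alpha = {"a", "b"}
print(support(p_alpha, D))  # 0.75: p_alpha is frequent for min_sup <= 0.75
```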
12
Transaction extraction: Given a dataset D, the problem of transaction extraction is to define a similarity measure between a transaction and the pattern context, and to extract a set of key transactions for a frequent pattern p_α.
Semantically similar pattern (SSP): Given a dataset D and a set of candidate patterns, the problem of SSP extraction is to define a similarity measure between the contexts of two patterns and to extract a set of k patterns most similar to any given frequent pattern.
Semantic pattern annotation (SPA): SPA consists of:
Selecting context units and defining a weight function for them.
Designing similarity measures.
Extracting significant context indicators.
13
Challenges associated with Semantic Pattern Annotation (SPA)
We have no prior knowledge of how to construct a context model.
It is unclear how to select context units when the set of possible context units is huge.
It is not clear how to analyze pattern semantics, so the design of weighting functions and similarity measures is non-trivial.
Since no training data is available, the learning is totally unsupervised.
These challenges, however, give SPA its flexibility: it does not depend on any domain-specific knowledge of the dataset.
14
Context Modeling
Vector Space Model (VSM)
Defining context modeling
Generality of context modeling
Context unit selection
Strength weighting for context units
15
Vector Space Model (VSM)
Used in natural language processing.
Uses "term vectors" and "weight vectors".
Why use the Vector Space Model?
16
Context Model Definition
Given a dataset D and a selected set of context units U = {u_1, ..., u_k}, we represent the context c(p_α) of a frequent pattern p_α as a vector <w_1, w_2, ..., w_k>, where w_i = w(u_i, p_α) and w(·, p_α) is the weighting function.
Likewise, a transaction t is represented as a vector <v_1, v_2, ..., v_k>, where v_i = 1 iff u_i appears in t, and v_i = 0 otherwise.
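A minimal sketch of this vector-space model follows; the co-occurrence-count weighting used here is a placeholder, not the paper's actual w(·, p_α):

```python
def transaction_vector(transaction, units):
    """v_i = 1 iff context unit u_i occurs in the transaction."""
    return [1 if u <= transaction else 0 for u in units]

def context_vector(pattern, database, units):
    """c(p_alpha) = <w_1, ..., w_k>, here with w_i = number of
    transactions containing both the pattern and unit u_i."""
    supporting = [t for t in database if pattern <= t]
    return [sum(1 for t in supporting if u <= t) for u in units]

D = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}]
units = [frozenset({"a"}), frozenset({"b"}), frozenset({"c"})]
print(context_vector({"a", "b"}, D, units))   # [2, 2, 1]
print(transaction_vector({"a", "b"}, units))  # [1, 1, 0]
```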
17
Generality of Context Modeling
Cf. "Summarization of Itemset Patterns Using Probabilistic Models", Chao Wang & Srinivasan Parthasarathy.
18
Context Unit Selection
Definition: A context unit is a minimal unit which holds semantic information in a dataset. The choice of a particular unit is task-dependent.
19
Granularity & Redundancy Removal
Definition: Granularity is the level of detail represented within a particular dataset.
Varied granularity leads to redundancy.
Redundancy removal techniques:
Closed frequent patterns.
Micro-clustering.
20
Existing Techniques
Techniques such as pattern summarization and dimension reduction have been used.
1) Introduced by S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society of Information Science, 41(6):391–407, 1990. These methods are not efficient here: they were designed for dimension reduction on high-dimensional datasets, so the dimensions shrink but the redundancy remains.
2) Introduced by X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: a profile-based approach. Too specialized to be of general use in our setting.
21
Closed Frequent Patterns
Looking at closed and maximal frequent patterns.
Drawbacks of maximal frequent patterns.
Why closed frequent patterns?
How?
22
Micro-clustering
The need for micro-clustering.
Jaccard distance (see the sketch after this list).
Types of micro-clustering:
Hierarchical micro-clustering.
One-pass micro-clustering.
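Micro-clustering needs a distance between patterns. A common choice is the Jaccard distance over the patterns' sets of supporting transactions; a minimal sketch, where the transaction-set representation is an assumption:

```python
def jaccard_distance(trans_a, trans_b):
    """Jaccard distance between two patterns, represented here by the
    sets of transaction ids that support them (an assumed encoding)."""
    return 1.0 - len(trans_a & trans_b) / len(trans_a | trans_b)

# Patterns supported by overlapping transaction sets are "close".
print(jaccard_distance({1, 2, 3}, {2, 3, 4}))  # 1 - 2/4 = 0.5
```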
23
Hierarchical micro-clustering
24
One-pass micro-clustering
25
Micro-clustering
Both algorithms produce an output composed of frequent itemsets.
Both further ensure that the distance between any two patterns is above a certain threshold.
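A hedged sketch of what a greedy one-pass micro-clustering step could look like: scan the patterns once, assign each to the first cluster whose representative lies within a distance threshold, otherwise open a new cluster. The assignment rule and the choice of the first member as representative are assumptions, not the paper's exact algorithm.

```python
def one_pass_microcluster(patterns, trans_sets, max_dist):
    """patterns: pattern ids; trans_sets: pattern id -> set of
    supporting transaction ids; max_dist: Jaccard distance threshold."""
    clusters = []  # each cluster keeps its first pattern as representative
    for p in patterns:
        for cluster in clusters:
            rep = cluster[0]
            dist = 1.0 - (len(trans_sets[p] & trans_sets[rep])
                          / len(trans_sets[p] | trans_sets[rep]))
            if dist <= max_dist:
                cluster.append(p)
                break
        else:  # no existing cluster is close enough: start a new one
            clusters.append([p])
    return clusters
```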
26
Strength Weighting for Context Units
Weighting functions and the concept of constraints.
Let u_i be a context indicator and p_α a frequent pattern. A strength weighting function w(·, p_α) is good if:
w(u_i, p_α) <= w(p_α, p_α): the best semantic indicator of p_α is p_α itself.
w(u_i, p_α) = w(p_α, u_i): two patterns are equally strong at indicating the meaning of each other.
w(u_i, p_α) = 0 if the appearance of u_i and p_α is independent: in that case u_i cannot indicate the semantics of p_α.
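One weighting function that satisfies all three constraints is the mutual information between the binary occurrence variables of u and p_α; the sketch below is illustrative, not the paper's exact implementation.

```python
import math

def mutual_information_weight(unit_trans, pat_trans, n):
    """w(u, p_alpha) as mutual information between binary occurrence
    variables. unit_trans / pat_trans: sets of ids of transactions
    containing the unit / the pattern; n: total transaction count."""
    all_trans = set(range(n))
    mi = 0.0
    for u_set in (unit_trans, all_trans - unit_trans):
        for p_set in (pat_trans, all_trans - pat_trans):
            p_joint = len(u_set & p_set) / n
            p_indep = (len(u_set) / n) * (len(p_set) / n)
            if p_joint > 0:  # 0 * log(...) contributes nothing
                mi += p_joint * math.log(p_joint / p_indep)
    return mi

print(mutual_information_weight({0, 1}, {0, 1}, 4))  # w(p, p): maximal
print(mutual_information_weight({0, 1}, {0, 2}, 4))  # independent: 0.0
```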
27
Semantic Analysis and Pattern Annotation
Semantic similarity: Earlier we introduced the notion that frequent patterns are semantically similar if their contexts are similar. Formally:
Let p_α, p_β, p_γ be three frequent patterns in P and c(p_α), c(p_β), c(p_γ) their context models. Let sim(c(·), c(·)): V^k × V^k → R+ be a similarity function over two context vectors. If sim(c(p_α), c(p_β)) > sim(c(p_α), c(p_γ)), we say that p_α is semantically more similar to p_β than to p_γ w.r.t. sim(c(·), c(·)).
The cosine function is widely used to measure the similarity of two vectors:
sim(c(p_α), c(p_β)) = (Σ_{i=1..k} w_i w'_i) / (sqrt(Σ_{i=1..k} w_i^2) · sqrt(Σ_{i=1..k} (w'_i)^2)),
where c(p_α) = <w_1, ..., w_k> and c(p_β) = <w'_1, ..., w'_k>.
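A direct implementation of this cosine measure:

```python
import math

def cosine_similarity(c_a, c_b):
    """sim(c(p_alpha), c(p_beta)): dot product over product of norms."""
    dot = sum(x * y for x, y in zip(c_a, c_b))
    norms = (math.sqrt(sum(x * x for x in c_a))
             * math.sqrt(sum(y * y for y in c_b)))
    return dot / norms if norms else 0.0  # define sim = 0 for a zero vector

print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 1.0, 1.0]))  # ~0.730
```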
28
Extracting the Strongest Context Indicators
Let p_α be a frequent pattern and c(p_α) be its context model, defined as a context vector over a set of context units U = {u_1, u_2, ..., u_K}. As defined earlier, w_i is the weight for u_i, which states how well u_i indicates the semantics of p_α. The goal of extracting the strongest context indicators is therefore to extract a subset U' ⊆ U of k' context units such that for every u_i ∈ U' and u_j ∉ U' we have w_i >= w_j.
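Extraction then reduces to a top-k' selection by weight; a minimal sketch with illustrative data:

```python
def strongest_indicators(units, weights, k):
    """Return the k context units with the largest weights w_i."""
    ranked = sorted(zip(units, weights), key=lambda uw: uw[1], reverse=True)
    return [u for u, _ in ranked[:k]]

# Illustrative data, not taken from the experiments.
units = ["graph pattern", "sequential pattern", "index approach"]
print(strongest_indicators(units, [0.9, 0.4, 0.7], 2))
# ['graph pattern', 'index approach']
```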
29
Extracting Representative Transactions
Let p_α be a frequent pattern, c(p_α) its context model, and D = {t_1, ..., t_l} a set of transactions. Our goal is to select k_t transactions using a similarity function:
Represent each transaction as a vector.
Use the cosine function to measure its similarity to c(p_α).
Compute this similarity for each transaction and rank the transactions in descending order.
Select the top k_t transactions.
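A minimal sketch of this ranking step, assuming transactions have already been encoded as 0/1 vectors over the context units as defined earlier:

```python
import math

def representative_transactions(context, trans_vectors, k_t):
    """Rank transactions by cosine similarity to c(p_alpha) and
    return the indices of the top k_t."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norms = (math.sqrt(sum(x * x for x in a))
                 * math.sqrt(sum(y * y for y in b)))
        return dot / norms if norms else 0.0

    ranked = sorted(range(len(trans_vectors)),
                    key=lambda i: cos(context, trans_vectors[i]),
                    reverse=True)
    return ranked[:k_t]

print(representative_transactions([2.0, 1.0, 0.0],
                                  [[1, 0, 1], [1, 1, 0], [0, 0, 1]], 2))
# [1, 0]: the second transaction matches the context best
```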
30
Experiments and Results
To test the proposed framework, we apply the proposed methods and algorithms to three datasets from completely different domains:
The DBLP dataset.
Matching protein motifs to GO terms.
Matching gene synonyms.
31
DBLP Dataset
A subset of the DBLP dataset is considered. It contains papers from the proceedings of 12 major conferences in databases and data mining.
The data is stored as transactions with two parts:
The authors' names.
The title of the corresponding paper.
Two pattern types are considered:
Frequent co-authorships.
Frequent title terms.
The goal of this experiment is to show the effectiveness of SPA in generating a dictionary-like annotation for frequent patterns.
32
Experiments
1. On the basis of authors/co-authors.
2. On the basis of the titles of the papers presented.
For both experiments, closed frequent itemset mining and CloSpan were used to generate a set of closed frequent itemsets of authors/co-authors and a set of closed sequential patterns of title terms.
A technique called Krovetz stemming is used to convert the title terms into their root forms.
33
Details
We set the minimum support to 10 for frequent itemsets and 4 for sequential patterns, which yields 9,926 closed patterns.
We use the one-pass micro-clustering algorithm to reduce these to a smaller set of 3,443 patterns.
35
Pattern: xifeng_yan & jiawei_han (itemset; candidate SSP set: co-authors)
Context indicators (I): graph; philip_s_yu; mine close; mine close frequent; index approach; graph pattern; sequential pattern
Representative transactions (T): gspan graph_base substructure pattern mine; mine close relational graph connect constraint; clospan mine close sequential pattern large database
Semantically similar patterns (S): jiawei_han & philip_s_yu; jian_pei & jiawei_han; jianyong_wang; jiong_yang & philip_s_yu & wei_wang
36
Matching Motifs and GO Terms
Prediction of the functionality of newly discovered protein motifs, using the Gene Ontology (GO).
The goal is to match each individual motif to the GO terms which best represent its functions. The problem may be formulated as:
Given a set of transactions D (protein sequences with motifs), a set P of frequent patterns in D to be annotated (motifs), and a set of candidate patterns P_C with explicit semantics (GO terms), our goal is, for all p_α ∈ P, to find P'_C ⊆ P_C which best indicates the semantics of p_α.
37
Details
We use the same dataset and judgments as in T. Tao, C. Zhai, X. Lu, and H. Fang. A study of statistical methods for function prediction of protein motifs.
We have 12,181 sequences, 1,097 motifs, and 3,761 GO terms.
We use the same performance measure as in the above paper (a variant of MRR, Mean Reciprocal Rank) to evaluate the effectiveness of SPA on the motif-GO matching problem.
Formally, let G = {g_1, g_2, ..., g_n} be the set of GO terms; given a motif pattern p_α, GO' = {g'_1, g'_2, ...} ⊆ G is the set of correct GO terms for the pattern. We rank G with the SPA system and pick the top-ranked terms, where a GO term is treated either as a context unit or as a semantically similar pattern to p_α.
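For reference, the standard form of MRR (the paper uses a variant of it) can be sketched as:

```python
def mean_reciprocal_rank(ranked_lists, correct_sets):
    """Standard MRR: average of 1/rank of the first correct answer
    in each ranked list (0 when no correct answer appears)."""
    total = 0.0
    for ranked, correct in zip(ranked_lists, correct_sets):
        rr = 0.0
        for rank, item in enumerate(ranked, start=1):
            if item in correct:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_lists)

# One motif whose first correct GO term appears at rank 2 -> MRR = 0.5
print(mean_reciprocal_rank([["g3", "g1", "g7"]], [{"g1"}]))
```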
38
Matching Gene Synonyms
In the biomedical literature, it is common to refer to the same gene by different names, called gene synonyms. These synonyms do not appear together, but are replaceable with one another.
In this experiment we use SPA to extract SSPs (semantically similar patterns).
We construct a list of 100 synonyms, randomly selected from BioCreAtIvE Task 1B, which is a collection of abstracts from different papers.
We extract all sentences which contain at least one synonym from the list; keeping the support above 3, we get a list of 41 synonyms.
We then mix the synonyms belonging to different genes and use the algorithm to extract the matching synonyms.
As in the previous case, we use MRR to measure the effectiveness of the algorithm.
39
Results
41
Related Work
To our knowledge, the problem of semantic pattern annotation has not been well studied.
Most frequent pattern mining work focuses on discovering frequent patterns and does not address the problem of post-processing them.
The works proposed to shrink the size of the pattern set are not effective at removing redundancy among the discovered patterns.
None of these works provide more than basic statistical information.
Recent research develops techniques to approximate and summarize frequent patterns. Although these explore some context information, none of them provide in-depth semantic information.
Context analysis is quite common in natural language processing, but it focuses on non-redundant word-based contexts, which differ from pattern contexts.
Although not optimal, the general methods proposed here can be well applied to those tasks.
42
Conclusions
Existing mining works generate a large set of frequent patterns without providing information to interpret them.
We propose the novel problem of semantic pattern annotation (SPA): generating semantic annotations for frequent patterns.
We propose algorithms that exploit context modeling and semantic analysis to generate semantic annotations automatically.
The proposed methods are quite general and can deal with any type of frequent pattern with context information.
We evaluated our approach on three different datasets; the results show that our methods generate semantic annotations effectively.
As shown, the proposed methods can be applied to many interesting real-world tasks by selecting different context units.
A major goal for future research is to fully develop the potential of the proposed framework by studying alternative instantiations.
43
Acknowledgements