Published by Kory Owens. Modified over 9 years ago.
1 Generating Semantic Annotations for Frequent Patterns with Context Analysis
Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai
University of Illinois at Urbana-Champaign
November 3, 2015
2 Frequent Pattern Mining ([Agrawal & Srikant 94] and many others)
Database: ABE; CDE; ABF; CDEF; ABEF; ……
Frequent Patterns: AB, ABE, ABF, C, CD, CDE, EF, DE, CE, D, AE, BE, BF, AF
Types of patterns:
– Itemsets: {diaper, milk}; {camera, film}; …
– Sequential Patterns: ... Mining Closed Frequent Graph Patterns …; … Mining Graph and Structured Patterns in ...
– Subgraph Patterns: …
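The toy database on this slide is small enough to mine by brute force. The sketch below is only an illustration and is not the algorithm of [Agrawal & Srikant 94]; the function names and the minimum support count of 2 are assumptions made for this example.

```python
from itertools import combinations

# Toy database from the slide: each transaction is a set of items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def support_count(itemset, database):
    """Number of transactions that contain every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in database)

def frequent_itemsets(database, min_count):
    """Enumerate all itemsets contained in at least `min_count` transactions."""
    items = sorted(set().union(*database))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = support_count(candidate, database)
            if count >= min_count:
                frequent["".join(candidate)] = count
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so no larger one can be frequent.
            break
    return frequent

# Assumed threshold: a pattern must appear in at least 2 of the 5 transactions.
patterns = frequent_itemsets(DATABASE, min_count=2)
```

With this threshold the frequent pairs are AB, AE, AF, BE, BF, CD, CE, DE and EF, the frequent triples are ABE, ABF and CDE, and AB has support 3/5 = 60%.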
3 Toward Understanding the Patterns -- Find Canonical Patterns
Database: ABE; CDE; ABF; CDEF; ABEF; ……
Frequent Patterns: AB, ABE, ABF, C, CD, CDE, EF, DE, CE, D, AE, BE, BF, AF
[Figure: canonical patterns (e.g., CDEF) selected from the frequent patterns, with values 1.0, 0.9, 0.8] (Yan et al. ’05; Xin et al. ’05)
4 Toward Understanding the Patterns -- How to Interpret Patterns?
Example patterns: {diaper, beer}; “female sterile (2) tekele”
Do they all make sense? What do they mean? How are they useful?
Not all frequent patterns are useful, only those with meanings…
Contrast: morphological info. and simple statistics vs. semantic information
Our goal: annotate patterns with semantic information
5 Challenges
How can we represent the semantics of a frequent pattern? (Annotate a pattern with what?)
How can we infer pattern semantics? (How to annotate?)
How can we do it in a general way? (Do it for all kinds of patterns)
Once such annotations are generated, what can we use them for? (Applications)
6 A Dictionary Analogy
Word: “pattern” (from Merriam-Webster)
A dictionary entry contains:
– Non-semantic info.
– Definitions indicating semantics
– Examples of usage
– Synonyms
– Related words
7 What about a “Pattern Dictionary”? -- Semantic Pattern Annotation (SPA)
Word: “pattern”
– Non-Semantic: function; pronunciation; date; etc.
– Definitions: A form or model proposed for …
– Examples: a dressmaker’s pattern; a pattern of dissent
– Synonyms: design, device, motif, motive, …
– Related words: original, constellation, …
Pattern: “latent semantic analysis”
– Non-Semantic: sequential; closed; sup = 0.1%
– Context Indicators (CI): “indexing”, “semantic”, “S. Dumais”, “singular value decomposition”, …
– Representative Transactions: index by latent semantic analysis; probabilistic latent semantic analysis
– Semantically Similar Patterns (SSP): “latent semantic indexing”, “LSA”, “PLSA”
8 How Can We Generate Such an Entry?
[Figure: Database (ABE; CDE; ABF; CDEF; ABEF) → Frequent Patterns (P1: AB; P2: CD; P3; …; Pn) → ? → Semantic Annotations]
Annotation for pattern AB:
– Non-semantic: Sup = 60%
– CI: AB, E, F, EF, …
– Trans.: ABE; ABEF
– SSPs: CD; …
Annotation for pattern CD: ……
How to infer the semantics of a frequent pattern?
9 Continue the Analogy…
“You shall know a word by the company it keeps.” - Firth 1957
Word context:
– Data … association … pattern … MINE … algorithm …
– mountain … Africa … diamond … MINE … weight …
You’ll know the meaning of a pattern by its context.
Pattern context:
– {A,B}: { … Baby, Milk, Diaper, Toy, Soymilk … }
– {C,D}: { … Printer, Film, Camera, Lens, … }
10 Our Approach: Model the Context
[Figure: Database (ABE; CDE; ABF; CDEF; ABEF) → Frequent Patterns (P1: AB; P2: CD; …; Pn) → Context Units → Semantic Annotations]
Context Units = Objects co-occurring with p
Annotation for pattern AB: Non-semantic: Sup = 60%; CI: AB, E, F, EF; Trans.: ABE; ABEF; SSPs: CD; …
Annotation for pattern CD: ……
11 Semantic Analysis with Context Models
Task1: Model the context of a frequent pattern
Based on the Context Model…
Task2: Extract strongest context indicators
Task3: Extract representative transactions
Task4: Extract semantically similar patterns
12 Task1: Context Modeling - A Vector Space Model
[Figure: Database (ABE; CDE; ABF; CDEF; ABEF) → Frequent Patterns (P1: AB; P2: CD; …; Pn) → Context Units → Semantic Annotations]
Context Unit Weight: Co-occurrence; Mutual Information; ……
Context Similarity: Cosine Similarity; Pearson Coefficient; ……
Annotation for pattern AB: Non-semantic: Sup = 60%; CI: AB, E, F, EF; Trans.: ABE; ABEF; SSPs: CD; …
Annotation for pattern CD: ……
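A vector space context model can be sketched in a few lines. The example below assumes co-occurrence counts as context unit weights and cosine similarity as the context similarity, two of the options named on the slide; the list of context units and all function names are hypothetical choices for this illustration.

```python
import math

DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]
# Hypothetical context units chosen for this example.
CONTEXT_UNITS = ["AB", "CD", "E", "F", "EF"]

def cooccurrence(pattern, unit, database):
    """Number of transactions that contain both the pattern and the context unit."""
    both = set(pattern) | set(unit)
    return sum(both <= t for t in database)

def context_vector(pattern, units, database):
    """One co-occurrence weight per context unit."""
    return [cooccurrence(pattern, u, database) for u in units]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vec_ab = context_vector("AB", CONTEXT_UNITS, DATABASE)  # [3, 0, 2, 2, 1]
vec_cd = context_vector("CD", CONTEXT_UNITS, DATABASE)  # [0, 2, 2, 1, 1]
similarity = cosine(vec_ab, vec_cd)
```

Swapping in mutual information weights or the Pearson coefficient changes only `cooccurrence` or `cosine`; the rest of the model stays the same.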
13 Context Unit Selection
Transactions:
– t1: diaper, milk, babywear, lotion
– t2: camera, memory stick, printer
Valid Context Units:
– Single items: diaper, milk, printer, …
– Itemsets: {milk, lotion}, {camera, …}
– Transactions: t1, t2
In general, Context Units are frequent patterns
14 Context Unit Selection: Redundancy Removal
Problem: too many valid context units, most are redundant
– { Diaper, milk, babywear }: “diaper”, “diaper, milk”, “milk, babywear”, “milk, lotion”, …
Solution:
– use closed patterns
– micro-clustering (hierarchical, one-pass)
Jaccard Distance (γ: threshold to stop clustering): [formula not preserved in this transcript]
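The distance formula did not survive on this slide, so the sketch below assumes the standard Jaccard distance between the sets of transactions that support two patterns, 1 - |T(p) ∩ T(q)| / |T(p) ∪ T(q)|. The one-pass procedure shown (assign each pattern to the first cluster whose representative is within γ) is a simplified assumption and not necessarily the authors' exact algorithm; all names and the value γ = 0.3 are hypothetical.

```python
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def supporting_transactions(pattern, database):
    """Indices of the transactions that contain the pattern."""
    items = set(pattern)
    return {i for i, t in enumerate(database) if items <= t}

def jaccard_distance(p, q, database):
    """1 - |T(p) & T(q)| / |T(p) | T(q)| over supporting transaction sets."""
    tp = supporting_transactions(p, database)
    tq = supporting_transactions(q, database)
    union = tp | tq
    return 1.0 - len(tp & tq) / len(union) if union else 0.0

def one_pass_clusters(patterns, database, gamma):
    """Single scan: join the first cluster whose representative is within gamma."""
    clusters = []  # each cluster is a list; its first element is the representative
    for p in patterns:
        for cluster in clusters:
            if jaccard_distance(p, cluster[0], database) <= gamma:
                cluster.append(p)
                break
        else:
            clusters.append([p])  # no cluster is close enough: start a new one
    return clusters

clusters = one_pass_clusters(["A", "B", "AB", "C", "D", "CD", "E"], DATABASE, gamma=0.3)
```

Here A, B and AB are supported by exactly the same transactions, so they collapse into one cluster, and one representative per cluster can serve as the context unit.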
15 Task2: Extract Context Indicators
[Figure: Database (ABE; CDE; ABF; CDEF; ABEF) → Frequent Patterns (P1: AB; P2: CD; …; Pn) → Context Units → Semantic Annotations]
Context Unit Weighting: AB 3.0; EF 2.0; ABE 1.0; …
Annotation for pattern AB: Non-semantic: Sup = 60%; CI: AB, EF, ABE, …; Trans.: ABE; ABEF; SSPs: CD; …
Annotation for pattern CD: ……
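Extracting context indicators amounts to ranking the context units by weight and keeping the top few. The sketch below assumes mutual information between the presence of the pattern and the presence of a unit as the weight, computed without smoothing; the candidate units, function names and `k` are hypothetical. Only units that co-occur with the pattern at least once are ranked, because mutual information is also high for units that systematically avoid the pattern.

```python
import math

DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]

def occurs(itemset, transaction):
    return set(itemset) <= transaction

def mutual_information(pattern, unit, database):
    """Mutual information (in bits) between presence of `pattern` and of `unit`."""
    n = len(database)
    joint = {(x, y): 0 for x in (0, 1) for y in (0, 1)}
    for t in database:
        joint[(int(occurs(pattern, t)), int(occurs(unit, t)))] += 1
    mi = 0.0
    for (x, y), count in joint.items():
        if count == 0:
            continue  # an empty cell contributes nothing to the sum
        p_xy = count / n
        p_x = (joint[(x, 0)] + joint[(x, 1)]) / n
        p_y = (joint[(0, y)] + joint[(1, y)]) / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi

def top_context_indicators(pattern, units, database, k):
    """Rank co-occurring units by mutual information and keep the k strongest."""
    cooccurring = [u for u in units
                   if any(occurs(pattern, t) and occurs(u, t) for t in database)]
    ranked = sorted(cooccurring,
                    key=lambda u: mutual_information(pattern, u, database),
                    reverse=True)
    return ranked[:k]

top = top_context_indicators("AB", ["AB", "ABE", "E", "F", "C"], DATABASE, k=3)
```

The weights produced here are not meant to reproduce the illustrative values 3.0, 2.0 and 1.0 shown on the slide.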
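16 Task3: Extract Representative Transactions
[Figure: Database (ABE; CDE; ABF; CDEF; ABEF) → Frequent Patterns (P1: AB; …) → Context Units → Semantic Annotations]
Context vectors shown over the context units: (3.0, 0, …, 2.0, …, 1.0) and (1.0, 0, …, 1.0, …, 1.0), for the pattern and for transactions such as T1 and T5
Semantic Similarity: T5 0.8; T1 0.6; T3 0.6; …
Annotation for pattern AB: Non-semantic: Sup = 60%; CI: AB, E, F, EF; Trans.: ABEF; ABE; SSPs: CD; …
Annotation for pattern CD: ……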
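One way to read this task: represent each transaction in the same context unit space as the pattern and rank transactions by their similarity to the pattern's context vector. The sketch below assumes co-occurrence weights for the pattern, binary presence vectors for transactions, and cosine similarity; the context units, transaction labels and function names are hypothetical.

```python
import math

DATABASE = {"T1": set("ABE"), "T2": set("CDE"), "T3": set("ABF"),
            "T4": set("CDEF"), "T5": set("ABEF")}
UNITS = ["AB", "CD", "E", "F", "EF"]  # hypothetical context units

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pattern_vector(pattern, units, database):
    """Co-occurrence count of the pattern with each context unit."""
    items = set(pattern)
    return [sum((items | set(u)) <= t for t in database.values()) for u in units]

def transaction_vector(transaction, units):
    """1 if the context unit occurs in the transaction, else 0."""
    return [1 if set(u) <= transaction else 0 for u in units]

def representative_transactions(pattern, units, database, k):
    """Ids of the k transactions whose vectors are most similar to the pattern's."""
    p_vec = pattern_vector(pattern, units, database)
    scores = {tid: cosine(p_vec, transaction_vector(t, units))
              for tid, t in database.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

best = representative_transactions("AB", UNITS, DATABASE, k=3)
```

In this toy setup ABEF (T5) scores highest for AB, followed by ABE (T1) and ABF (T3); the scores themselves are not meant to reproduce the numbers on the slide.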
17 Task4: Extract Semantically Similar Patterns
[Figure: Database (ABE; CDE; ABF; CDEF; ABEF) → Frequent Patterns (P1: AB; P2: CD; …; Pk: EF) → Context Units → Semantic Annotations]
Context vectors shown over the context units: AB: (3.0, 0, …, 2.0, …, 1.0); another pattern: (0, 3.0, …, 2.0, …, 0.5)
Semantic Similarity: CD 0.7; BF 0.5; EF 0.3; …
Annotation for pattern AB: Non-semantic: Sup = 60%; CI: AB, E, F, EF; Trans.: ABEF; ABE; SSPs: CD; …
Annotation for pattern CD: ……
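Semantically similar patterns can be found by comparing context vectors of patterns with each other. The sketch below assumes single items as context units, co-occurrence weights and cosine similarity; these choices and all names are assumptions for illustration, so the resulting ranking differs from the illustrative scores on the slide.

```python
import math

DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]
UNITS = ["A", "B", "C", "D", "E", "F"]  # assumption: single items as context units

def context_vector(pattern, units, database):
    """Co-occurrence count of the pattern with each context unit."""
    items = set(pattern)
    return [sum((items | set(u)) <= t for t in database) for u in units]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_patterns(pattern, candidates, units, database):
    """Candidates ordered by decreasing context similarity to `pattern`."""
    target = context_vector(pattern, units, database)
    scored = [(cosine(target, context_vector(c, units, database)), c)
              for c in candidates if c != pattern]
    return [c for _, c in sorted(scored, reverse=True)]

ranked = similar_patterns("AB", ["CD", "BF", "EF"], UNITS, DATABASE)
```

With these assumptions BF ranks first for AB because the two patterns share most of their supporting transactions; a different choice of context units or weights can change the order.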
18 Experiments
Three different real world applications
– Annotating DBLP title/authors patterns
– Motif/Gene-Ontology (GO) matching
– Gene synonym extraction
Study the effectiveness of the proposed SPA methods
Explore applications of SPA to different real world tasks
19 Annotating DBLP Co-authorship and Title Patterns
Database (columns: Title, Authors): Substructure Similarity Search in Graph Databases; X. Yan, P. Yu, J. Han; ……
Frequent Patterns:
– P1: { x_yan, j_han } (frequent itemset)
– P2: “substructure search” (frequent sequential pattern)
Semantic Annotations (via Context Units):
Pattern: { x_yan, j_han }
– Non-semantic: Sup = …
– CI: {p_yu}, graph pattern, …
– Trans.: gSpan: graph-base……
– SSPs: { j_wang }, {j_han, p_yu}, …
20 DBLP Results: Frequent Itemset
Pattern = {xifeng_yan, jiawei_han}
Annotations:
Context Indicator (CI): graph; {philip_yu}; mine close; graph pattern; index approach; sequential pattern; …
Representative Transactions (Trans):
> gSpan: graph-base substructure pattern mining;
> mining close relational graph connect constraint; …
Semantically Similar Patterns (SSP): {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; …
21 DBLP Results: Freq. Seq. Pattern
Pattern = “Information … retrieval”
Annotations:
Context Indicator (CI): {w_bruce_croft}; web information; full text; {monika_rauch_henzinger}; {james_p_callan}; …
Representative Transactions (Trans):
> web information retrieval
> language model information retrieval
Semantically Similar Patterns (SSP): information use; web information; probabilistic information; information filter; text information; …
22 Motif-GO Matching
[Figure: Sequence 1, Sequence 2, Sequence 3 containing motif1 … motif5 and annotated with GO term 1 … GO term 5; question: which GO terms match motif2?]
Motif: a subsequence pattern in the sequences
Gene Ontology (GO) terms: annotate the functionality of sequences and motifs
23 Motif-GO Matching (Cont.)
Database (columns: GO terms, Protein Sequence): rows such as “GOTerm1; GOTerm2; GOTerm3” and “GOTerm3”, ……
Frequent Patterns:
– P1: Motif1 (sequential pattern)
– P2: GOTerm2 (single item pattern)
Semantic Annotations (via Context Units):
Pattern: Motif1
– Non-semantic:
– CI: GOTerm1, GOTerm3, …
– Trans.:
– SSPs: GOTerm1, GOTerm2, …
Motif-GO matching: Motif1 is matched to GOTerm1 and GOTerm2
24 Motif/GO Matching: Evaluation
Gold standard generated by human experts
Measure: Mean reciprocal rank (MRR)
– Reflects ranking accuracy (the higher the better)
– 1/Rank (0.5 means the correct answer is ranked 2nd)
Results (MRR):
Ranking Strategy      Weights for Context Units
                      Mutual Information    Co-occurrence
Random Selection      0.0023 (single value reported)
Context Indicators    0.5877                0.6064
SSPs                  0.4017                0.4681
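Mean reciprocal rank can be computed as follows. This is a minimal sketch of the standard measure; the function name, the data layout and the motif and term identifiers are made up for illustration and do not come from the experiments.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the first correct answer over all queries.

    rankings: query -> ranked list of candidates (best first)
    gold:     query -> set of correct answers
    A query with no correct answer in its ranking contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked in rankings.items():
        correct = gold.get(query, set())
        total += next(
            (1.0 / rank for rank, c in enumerate(ranked, start=1) if c in correct),
            0.0,
        )
    return total / len(rankings)

# Hypothetical example: the correct term is ranked 2nd for motif1 and 1st for motif2.
rankings = {"motif1": ["term_b", "term_a", "term_c"], "motif2": ["term_d", "term_e"]}
gold = {"motif1": {"term_a"}, "motif2": {"term_d"}}
mrr = mean_reciprocal_rank(rankings, gold)  # (1/2 + 1/1) / 2 = 0.75
```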
25 Gene Synonym Extraction
Gene Synonyms:
– A sequential pattern in the textual database
– Matching gene synonyms: a challenging and important new problem in mining biology data
– Analogy: thesaurus or synonyms in a dictionary
Gene_id: FBgn0001000
Gene Synonyms: female sterile 2 tekele; fs 2 sz 10; tek; fs 2 tek; tekele; …
26 Gene Synonym Extraction (Cont.)
Database: Biomedical Sentences
– … D. melanogaster gene Female sterile (2) Tekele …
– … Female sterile (2) Tekele, abbreviated as Fs(2)Tek …
Frequent Patterns:
– P1: female sterile (2) tekele (sequential pattern)
– P2: Fs(2)Tek (sequential pattern)
Context Units: context units can be single words or sequential patterns
Semantic Annotations:
Pattern: female sterile (2) tekele
– Non-semantic:
– CI:
– Trans.:
– SSPs: Fs(2)Tek, female sterile, fs 2 sz 10, …
Matched Synonyms of female sterile (2) tekele: Fs(2)Tek; fs 2 sz 10; female sterile; …
27 Gene Synonym Extraction: Results
Effective! MRR > 0.5
Frequent patterns >> single words
Micro-clustering is useful
[Figure: running time and MRR for hierarchical and one-pass micro-clustering]
28 Conclusions
A novel problem: semantic pattern annotation
A structured annotation for frequent patterns
A general method based on context modeling
A general post-processing procedure for frequent pattern mining, applicable to any type of pattern
Applicable to and effective for quite different tasks
Future work:
– Tune for specific tasks
– Better context unit weights, redundancy removal, etc.
29 Thanks and Questions