Generating Semantic Annotations for Frequent Patterns with Context Analysis. Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai, University of Illinois at Urbana-Champaign.

Presentation transcript:

1 Generating Semantic Annotations for Frequent Patterns with Context Analysis Qiaozhu Mei, Dong Xin, Hong Cheng, Jiawei Han, ChengXiang Zhai University of Illinois at Urbana-Champaign November 3, 2015

2 Frequent Patterns
Frequent pattern mining ([Agrawal & Srikant 94] and many others) turns a database of transactions into a set of frequent patterns.
- Database: ABE, CDE, ABF, CDEF, ABEF, ……
- Frequent patterns: AB, ABE, ABF, C, CD, CDE, EF, DE, CE, D, AE, BE, BF, AF, …
- Itemsets: {diaper, milk}; {camera, film}; …
- Sequential patterns: ... Mining Closed Frequent Graph Patterns …; … Mining Graph and Structured Patterns in ...
- Subgraph patterns: …
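As a rough illustration of what "frequent pattern" means here, the sketch below enumerates frequent itemsets over the five-transaction toy database from the slide. The brute-force enumeration, the function names and the 40% support threshold are assumptions made for this example; it is not the Apriori algorithm cited on the slide.

```python
from itertools import combinations

# Toy database from the slide: five transactions over the items A-F.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]


def support(itemset, database):
    """Fraction of transactions containing every item of `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in database) / len(database)


def frequent_itemsets(database, min_sup):
    """Brute-force enumeration of all itemsets with support >= min_sup.

    Exponential in the number of items; a toy stand-in, not Apriori.
    """
    items = sorted(set().union(*database))
    found = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            sup = support(combo, database)
            if sup >= min_sup:
                found["".join(combo)] = sup
    return found


# min_sup=0.4 is a hypothetical threshold chosen for the toy data.
patterns = frequent_itemsets(DATABASE, min_sup=0.4)
print(patterns["AB"], patterns["CD"])
```

AB occurs in three of the five transactions (ABE, ABF, ABEF), which matches the "Sup = 60%" shown for pattern AB on later slides.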

3 Toward Understanding the Patterns -- Find Canonical Patterns
[Figure: the database (ABE, CDE, ABF, CDEF, ABEF, ……) and its frequent patterns, reduced to canonical patterns such as CDEF] (Yan et al. '05) (Xin et al. '05)

4 Toward Understanding the Patterns -- How to Interpret Patterns?
- Example patterns: {diaper, beer}; female sterile (2) tekele
- Do they all make sense? What do they mean? How are they useful?
- Two kinds of information about a pattern: morphological info. and simple statistics; semantic information
- Not all frequent patterns are useful, only those with meanings…
- Our goal: annotate patterns with semantic information

5 Challenges
- How can we represent the semantics of a frequent pattern? (Annotate a pattern with what?)
- How can we infer pattern semantics? (How to annotate?)
- How can we do it in a general way? (Do it for all kinds of patterns)
- Once such annotations are generated, what can we use them for? (Applications)

6 A Dictionary Analogy
Word: “pattern” (from Merriam-Webster). The dictionary entry contains:
- Non-semantic info.
- Definitions indicating semantics
- Examples of usage
- Synonyms
- Related words

7 What about a “Pattern Dictionary”? -- Semantic Pattern Annotation (SPA)
Dictionary entry for a word:
  Word: pattern
  Non-Semantic: function; pronunciation; date; etc.
  Definitions: A form or model proposed for …
  Examples: a dressmaker’s pattern; a pattern of dissent
  Synonyms: design, device, motif, motive…
  Related words: original, constellation …
Annotation entry for a frequent pattern:
  Pattern: “latent semantic analysis”
  Non-Semantic: sequential; closed; sup = 0.1%
  Context Indicators (CI): “indexing”, “semantic”, “S. Dumais”, “singular value decomposition”, …
  Representative Transactions: index by latent semantic analysis; probablist latent semantic analysis
  Semantically Similar Patterns (SSP): “latent semantic indexing”, “LSA”, “PLSA”

8 How Can We Generate Such an Entry?
[Figure: Database (ABE, CDE, ABF, CDEF, ABEF) → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Semantic Annotations (one entry per pattern: AB, CD, ……)]
Annotation entry for pattern AB:
  Non-Semantic: Sup = 60%
  CI: AB, E, F, EF …
  Trans.: ABE; ABEF
  SSPs: CD; …
How to infer the semantics of a frequent pattern?

9 Continue the Analogy…
“You shall know a word by the company it keeps.” - Firth 1957
You’ll know the meaning of a pattern by its context.
- Word context: Data … association … pattern … MINE … algorithm … versus … mountain … Africa … diamond … MINE … weight …
- Pattern context: {A,B}: { … Baby, Milk, Diaper, Toy, Soymilk … }; {C,D}: { … Printer, Film, Camera, Lens, … }

10 Our Approach: Model the Context
Context Units = objects co-occurring with p
[Figure: Database (ABE, CDE, ABF, CDEF, ABEF) → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Context Units → Semantic Annotations (Pattern AB: Sup = 60%; CI: AB, E, F, EF; Trans.: ABE; ABEF; SSPs: CD; …)]

11 Semantic Analysis with Context Models
- Task1: Model the context of a frequent pattern
Based on the Context Model…
- Task2: Extract strongest context indicators
- Task3: Extract representative transactions
- Task4: Extract semantically similar patterns

12 Task1: Context Modeling - A Vector Space Model
- Context Unit Weight: Co-occurrence, Mutual Information, ……
- Context Similarity: Cosine Similarity, Pearson Coefficient, ……
[Figure: Database → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Context Units → Semantic Annotations (Pattern AB: Sup = 60%; CI: AB, E, F, EF; Trans.: ABE; ABEF; SSPs: CD; …)]
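The vector space model named on this slide can be sketched as follows. The example assumes single items as context units and raw co-occurrence counts as weights (one of the two weightings listed), and compares two patterns with cosine similarity and the Pearson coefficient. All function names are hypothetical.

```python
import math

# Toy database from the slides; context units are assumed to be single items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]
CONTEXT_UNITS = ["A", "B", "C", "D", "E", "F"]


def context_vector(pattern, database, units=CONTEXT_UNITS):
    """Co-occurrence weight per unit: transactions containing pattern and unit."""
    p = set(pattern)
    return [sum(p <= t and u in t for t in database) for u in units]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(u, v):
    """Pearson correlation coefficient (0.0 if either vector is constant)."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


ab = context_vector("AB", DATABASE)
cd = context_vector("CD", DATABASE)
print(ab, cd)
print(round(cosine(ab, cd), 3), round(pearson(ab, cd), 3))
```

On this toy data AB and CD share only the context units E and F, so their cosine similarity is low and their Pearson coefficient is negative.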

13 Context Unit Selection
Example transactions: t1 = {diaper, milk, babywear, lotion}; t2 = {camera, memory stick, printer}
Valid Context Units:
- Single items: diaper, milk, printer, …
- Transactions: t1, t2
- Itemsets: {milk, lotion}, …
In general, Context Units are frequent patterns.

14 Context Unit Selection: Redundancy Removal
Problem: too many valid context units, most are redundant
- { Diaper, milk, babywear }: “diaper”, “diaper, milk”, “milk, babywear”, “milk, lotion”, …
Solution:
- use closed patterns
- micro-clustering (hierarchical or one-pass) on the Jaccard distance between patterns, with γ as the threshold to stop clustering
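The distance formula itself did not survive in the transcript. A common definition, assumed here, takes the Jaccard distance between two patterns over the sets of transactions that contain them: D(p, q) = 1 - |T(p) ∩ T(q)| / |T(p) ∪ T(q)|. The sketch below applies it in a simple one-pass micro-clustering; the joining rule (a pattern joins the first cluster whose members all lie within γ) and the function names are assumptions for illustration.

```python
# Toy database from the slides.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]


def tidset(pattern, database):
    """Indices of the transactions that contain the pattern."""
    p = set(pattern)
    return {i for i, t in enumerate(database) if p <= t}


def jaccard_distance(p, q, database):
    """1 - |T(p) & T(q)| / |T(p) | T(q)| over supporting transactions."""
    tp, tq = tidset(p, database), tidset(q, database)
    union = tp | tq
    return 1.0 - len(tp & tq) / len(union) if union else 0.0


def one_pass_clusters(patterns, database, gamma):
    """Single scan: join the first cluster whose members are all within gamma."""
    clusters = []
    for p in patterns:
        for cluster in clusters:
            if all(jaccard_distance(p, q, database) <= gamma for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])  # no close cluster found: start a new one
    return clusters


# gamma=0.1 is a hypothetical threshold for the toy data.
clusters = one_pass_clusters(["A", "B", "AB", "C", "D", "CD", "E"], DATABASE, gamma=0.1)
print(clusters)
```

Patterns that occur in exactly the same transactions, such as A, B and AB, fall into one micro-cluster, so only one representative per cluster needs to be kept as a context unit.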

15 Task2: Extract Context Indicators
Context Unit Weighting for pattern AB: AB 3.0; EF 2.0; ABE 1.0; …
Resulting annotation of AB: Sup = 60%; CI: AB, EF, ABE, …; Trans.: ABE; ABEF; SSPs: CD; …
[Figure: Database → Frequent Patterns (P1: AB, P2: CD, …, Pn) → Context Units → Semantic Annotations]
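Extracting context indicators amounts to ranking the context units of a pattern by weight and keeping the strongest ones. A minimal sketch, assuming co-occurrence counts as weights and a hand-picked list of candidate units; the resulting weights are therefore not the illustrative values shown on the slide.

```python
# Toy database from the slides.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]


def cooccurrence(pattern, unit, database):
    """Number of transactions containing both the pattern and the unit."""
    p, u = set(pattern), set(unit)
    return sum(p <= t and u <= t for t in database)


def context_indicators(pattern, units, database, k=3):
    """Top-k context units by weight; ties keep the input order of `units`."""
    scored = [(u, cooccurrence(pattern, u, database)) for u in units]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical candidate context units (frequent patterns) for the toy data.
UNITS = ["AB", "E", "F", "EF", "CD"]
indicators = context_indicators("AB", UNITS, DATABASE, k=3)
print(indicators)
```

Swapping `cooccurrence` for a mutual information score would give the other weighting listed on the context modeling slide without changing the ranking step.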

16 Task3: Extract Representative Transactions
[Figure: the context vector of pattern AB (3.0, 0, …, 2.0, …) is compared by semantic similarity with the vector of each transaction T1, …, T5 over the same context units]
Resulting annotation of AB: Sup = 60%; CI: AB, E, F, EF; Trans.: ABEF; ABE; SSPs: CD; …
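One way to read this step: represent each transaction as a vector over the same context units as the pattern, then rank transactions by similarity to the pattern's context vector. The sketch below assumes single-item context units, co-occurrence weights, binary transaction vectors and cosine similarity; all names are hypothetical.

```python
import math

# Toy database from the slides; context units are assumed to be single items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]
UNITS = ["A", "B", "C", "D", "E", "F"]


def context_vector(pattern, database):
    """Co-occurrence count of the pattern with each context unit."""
    p = set(pattern)
    return [sum(p <= t and u in t for t in database) for u in UNITS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def representative_transactions(pattern, database, k=2):
    """Top-k transactions whose unit vectors are closest to the pattern's context."""
    p_vec = context_vector(pattern, database)
    scored = []
    for t in database:
        t_vec = [1 if u in t else 0 for u in UNITS]  # binary transaction vector
        scored.append(("".join(sorted(t)), cosine(p_vec, t_vec)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in scored[:k]]


top = representative_transactions("AB", DATABASE, k=2)
print(top)
```

Under these assumptions the two best transactions for AB are ABEF and ABE, the same pair the slide lists under "Trans.".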

17 Task4: Extract Semantically Similar Patterns
[Figure: the context vector of AB (3.0, 0, …, 2.0, …, 1.0) is compared by semantic similarity with the context vectors of the other frequent patterns (P2: CD, …, Pk: EF), e.g. (0, 3.0, …, 2.0, …, 0.5)]
Semantic similarity to AB: CD 0.7; BF 0.5; EF 0.3; …
Resulting annotation of AB: Sup = 60%; CI: AB, E, F, EF; Trans.: ABEF; ABE; SSPs: CD; …
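Semantically similar patterns can be found the same way: compute a context vector for every candidate pattern and rank the candidates by similarity to the target. The sketch assumes single-item context units, co-occurrence weights and cosine similarity. Because of these choices the scores on the toy data differ from the illustrative values on the slide.

```python
import math

# Toy database from the slides; context units are assumed to be single items.
DATABASE = [set("ABE"), set("CDE"), set("ABF"), set("CDEF"), set("ABEF")]
UNITS = ["A", "B", "C", "D", "E", "F"]


def context_vector(pattern, database):
    """Co-occurrence count of the pattern with each context unit."""
    p = set(pattern)
    return [sum(p <= t and u in t for t in database) for u in UNITS]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similar_patterns(target, candidates, database):
    """Candidates ranked by cosine similarity of their context vectors."""
    t_vec = context_vector(target, database)
    scored = [(c, cosine(t_vec, context_vector(c, database))) for c in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


ranked = similar_patterns("AB", ["CD", "BF", "EF"], DATABASE)
for name, score in ranked:
    print(name, round(score, 3))
```

Note that two patterns can be ranked as similar without ever co-occurring: what matters is that they appear in similar contexts.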

18 Experiments
Three different real-world applications:
- Annotating DBLP title/author patterns
- Motif/Gene Ontology (GO) matching
- Gene synonym extraction
Goals:
- Study the effectiveness of the proposed SPA methods
- Explore applications of SPA to different real-world tasks

19 Annotating DBLP Co-authorship and Title Pattern
Database (Title | Authors): Substructure Similarity Search in Graph Databases | X. Yan, P. Yu, J. Han; ……
Frequent Patterns:
- P1: { x_yan, j_han } (frequent itemset)
- P2: “substructure search” (frequent sequential pattern)
Semantic annotation of pattern { x_yan, j_han }:
  Non-Semantic: Sup = …
  CI: {p_yu}, graph pattern, …
  Trans.: gSpan: graph-base ……
  SSPs: { j_wang }, {j_han, p_yu}, …

20 DBLP Results: Frequent Itemset
Pattern = {xifeng_yan, jiawei_han}
Annotations:
- Context Indicator (CI): graph; {philip_yu}; mine close; graph pattern; index approach; sequential pattern; …
- Representative Transactions (Trans): > gSpan: graph-base substructure pattern mining; > mining close relational graph connect constraint; …
- Semantically Similar Patterns (SSP): {jiawei_han, philip_yu}; {jian_pei, jiawei_han}; {jiong_yang, philip_yu, wei_wang}; …

21 DBLP Results: Freq. Seq. Pattern
Pattern = “Information … retrieval”
Annotations:
- Context Indicator (CI): {w_bruce_croft}; web information; full text; {monika_rauch_hezinger}; {james_p_callan}; …
- Representative Transactions (Trans): > web information retrieval; > language model information retrieval
- Semantically Similar Patterns (SSP): information use; web information; probabilistic information; information filter; text information; …

22 Motif-GO Matching
- Motif: a subsequence pattern in the sequences
- Gene Ontology (GO) terms: annotate the functionality of sequences and motifs
[Figure: Sequences 1 to 3 contain motif1 to motif5 and are annotated with GO terms 1 to 5; which GO terms match a given motif (e.g. motif2) is the open question]

23 Motif-GO Matching (Cont.)
Database (GO terms | Protein Sequence): GOTerm1; GOTerm2; GOTerm3 | ……
Frequent Patterns:
- P1: Motif1 (sequential pattern)
- P2: GOTerm2 (single item pattern)
Semantic annotation of pattern Motif1:
  CI: GOTerm1, GOTerm3, …
  SSPs: GOTerm1, GOTerm2, …
Motif-GO matching: Motif1 is matched to GOTerm1, GOTerm2

24 Motif/GO Matching: Evaluation
- Gold standard generated by human experts
- Measure: mean reciprocal rank (MRR)
  - Reflects ranking accuracy (the higher the better)
  - 1/Rank (0.5 means the correct answer is ranked 2nd)
[Table: MRR results by ranking strategy (random selection, context indicators, SSPs) and by weights for context units (mutual information, co-occurrence); the values are not included in the transcript]
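The MRR measure can be computed as below. The motif and GO term names in the example data are hypothetical and are not results from the presentation.

```python
def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the first correct answer per query (0 if none found)."""
    total = 0.0
    for query, ranked in rankings.items():
        correct = gold.get(query, set())
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in correct:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical rankings of GO terms per motif and a hypothetical gold standard.
rankings = {
    "motif1": ["GOTerm2", "GOTerm1", "GOTerm3"],  # correct answer at rank 2
    "motif2": ["GOTerm3", "GOTerm1"],             # correct answer at rank 1
}
gold = {"motif1": {"GOTerm1"}, "motif2": {"GOTerm3"}}

mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)
```

Here the reciprocal ranks are 0.5 and 1.0, so the MRR is 0.75.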

25 Gene Synonym Extraction
Gene synonyms:
- A sequential pattern in the textual database
- Matching gene synonyms: a challenging and important new problem in mining biology data
- Analogy: thesaurus or synonyms in a dictionary
Example (Gene_id | Gene Synonyms): FBgn | female sterile 2 tekele; fs 2 sz 10; tek; fs 2 tek; tekele; …

26 Gene Synonym Extraction (Cont.)
Database (biomedical sentences): … D. melanogaster gene Female sterile (2) Tekele … ; … Female sterile (2) Tekele, abbreviated as Fs(2)Tek …
Frequent Patterns:
- P1: female sterile (2) tekele (sequential pattern)
- P2: Fs(2)Tek (sequential pattern)
Context units can be single words or sequential patterns.
Semantic annotation of pattern female sterile (2) tekele:
  SSPs: Fs(2)Tek, female sterile, fs 2 sz 10, …
Matched synonyms of female sterile (2) tekele: Fs(2)Tek; fs 2 sz 10; female sterile; …

27 Gene Synonym Extraction: Results
- Effective! MRR > 0.5
- frequent pattern >> single words
- Micro-clustering is useful
[Figure: running time and MRR for hierarchical and one-pass micro-clustering]

28 Conclusions
- A novel problem: semantic pattern annotation
- A structured annotation for frequent patterns
- A general method based on context modeling
- A general post-processing procedure for frequent pattern mining, usable with any type of pattern
- Applicable to and effective for quite different tasks
Future work:
- Tune for specific tasks
- Better context unit weights, redundancy removal, etc.

29 Thanks and Questions