Published by Jane Morrison. Modified over 9 years ago.
1
Current Research in Data Mining Research Group
Jiawei Han, Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, ARO, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo! Labs, LinkedIn, HP Lab & Boeing
September 11, 2015
2
Outline
– An Introduction to Data Mining Research Group
– Pattern Discovery Methods
– Mining Heterogeneous Information Networks
– Construction of Heterogeneous Information Networks from Unstructured Data
– TextCube and OLAP heterogeneous networks
– Mining Cyber-Physical Systems and Networks
– Conclusions
3
Data Mining and Data Warehousing: Jiawei Han's Group at CS, UIUC
– Research themes: mining patterns and knowledge discovery from massive data; data mining in heterogeneous information networks; exploring broad applications of data mining
– Developed popular data mining algorithms: FPgrowth, gSpan, PrefixSpan, RankingCube, TruthFinder, NetClus, RankClass, …
– 600+ research papers; most cited author/group in data mining
– Honors: ACM Fellow, IEEE Fellow, ACM SIGKDD Innovation Award, W. McDowell Award; students: ACM KDD Dissertation Awards (2008, 2013), …
– Textbook "Data Mining: Concepts and Techniques," adopted worldwide
– Funded as NSCTA (Network Science Collaborative Technology Alliance) by ARL [09-14, 15-19], ARO, NIH KnowEnG, NSF, Boeing, MSR, Google, Yahoo!, HP Labs, …
– Graduated 40+ Ph.D.s, who joined Google, Microsoft Research, Yahoo! Labs, Facebook, and Twitter, or became professors (14)
– Supervising 17 Ph.D. students, 4 M.S. students, and 5 visitors/postdocs
4
Data Mining Research Group in CS, Univ. Illinois
Student Prominent Awards:
– SIGKDD or SIGMOD Ph.D. Dissertation Awards/Runner-Ups
– 10-year impact paper awards
– Best student paper awards, best papers, best posters, …
– KDDCUP 2013 Runner Up Award
– IBM/Microsoft/NSF/NDSEG Ph.D. Fellowships
Graduation:
– Professors at UVA, UCSB, PSU, U. Buffalo, Northeastern, FSU, MSU, Notre Dame, CUHK, …
– Researchers at IBM, MSR, Google Research, Yahoo! Labs, Facebook, Twitter, NEC, etc.
6
Mining Sequential Patterns from Shopping Sequences
– Sequential pattern mining: given a set of (shopping) sequences, find the complete set of frequent subsequences
– Example: a sequence database with four sequences (SIDs 10, 20, 30, 40); given support threshold min_sup = 2, a subsequence contained in at least two of them is a sequential pattern [the example sequences were lost in extraction]
– Our innovation: (1) PrefixSpan (TKDE'04): 1598 citations; (2) CloSpan (SDM'03): 568 citations (reduces redundancy); (3) FPgrowth (SIGMOD'00): 4956 citations
– [Figure: idea of PrefixSpan and idea of CloSpan, shown with projected databases; details lost in extraction]
– Open difficulty: generalizing to biosequence mining, which involves approximate patterns and noise
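The support test behind the min_sup threshold can be sketched as follows. This is a minimal illustration of counting how many sequences contain a candidate subsequence, not PrefixSpan itself; the toy database and the names `is_subsequence` and `support` are assumptions made for the example, since the slide's own sequences did not survive extraction.

```python
from typing import List, Set

Sequence = List[Set[str]]  # a sequence is an ordered list of itemsets

def is_subsequence(pattern: Sequence, sequence: Sequence) -> bool:
    """True if every itemset of `pattern` is contained in a distinct itemset
    of `sequence`, in the same order (greedy left-to-right matching)."""
    pos = 0
    for itemset in pattern:
        # Advance to the first remaining itemset that contains this one.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(pattern: Sequence, database: List[Sequence]) -> int:
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in database)

# Hypothetical toy database (not the one shown on the slide).
toy_db: List[Sequence] = [
    [{"a"}, {"b", "c"}, {"d"}],
    [{"a", "d"}, {"c"}, {"b", "c"}],
    [{"e"}, {"a", "b"}, {"c"}],
    [{"b"}, {"d"}],
]
min_sup = 2
pattern: Sequence = [{"a"}, {"c"}]
is_frequent = support(pattern, toy_db) >= min_sup
```

Here the pattern "a followed by c" is contained in three of the four toy sequences, so it meets min_sup = 2. Pattern-growth methods such as PrefixSpan avoid running this test on every candidate by working on projected databases.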
7
Mining Frequent Subgraph Patterns from Graph DBs
– Graph pattern mining: given a set of graphs, find the complete set of frequent subgraphs
– [Figure: a graph dataset (e.g., a chemical compound database) and its frequent patterns, with min support = 2]
– Our innovation: (1) gSpan (ICDM'02): 1319 citations; (2) CloseGraph (KDD'03): 520 citations (does not mine subgraphs covered by their super-patterns)
– Idea of gSpan: graph pattern growth plus completeness of right-most extension, growing a k-edge pattern G into (k+1)-edge patterns G1, G2, …, Gn
– CloseGraph: under what condition can we stop searching a pattern's children, i.e., terminate early?
– Experiment: NCI/NIH AIDS antiviral screen compound data, minsup = 5%
– Extension: mining structures in large single networks (VLDB'11)
8
Graph Indexing and Graph Similarity Search
– Graph search: given a query graph Q, find all graphs in the graph DB that contain Q; a graph index helps the search
– Our innovation: gIndex (SIGMOD'04): 419 citations; grafil (SIGMOD'05): similarity search
– gIndex key idea: index on frequent and discriminative substructures (mined)
– grafil key idea: explore feature similarity between the query Q and each graph G, using approximate features
– [Figure: plots of # candidates against query size and # indices against DB size]
9
CoDense: Mining Frequent Coherent Dense Subgraphs across Multiple Microarray Datasets
– [Figure: several microarray datasets (genes g1, g2, … by conditions c1, c2, …, cm) and the corresponding graphs over nodes a to k; values lost in extraction]
– Frequency: all edges occur in ≥ k graphs
– Coherency: correlated edge occurrences
– Density: subgraph density ≥ threshold
– Our innovation: CoDense (ISMB'05): mining noisy microarray data to derive interesting dense subgraphs (collaboration with USC: Jasmine Zhou)
– Experiment and discovery: coherent dense graphs, e.g., (red) PHB1, ATP17, MRPL51, MRPL39, MRPL49, MRPL51, PET100; GO:0006091 (generation of precursor metabolites and energy; p-value = 0.001339)
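Two of the three criteria above, frequency and density, can be sketched directly. This is only an illustration of the definitions, not the CoDense algorithm: the coherency test is omitted, and the toy graphs and the helper names `edge_set`, `frequent_edges`, and `density` are assumptions for the example.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List, Set, Tuple

Edge = FrozenSet[str]

def edge_set(*pairs: Tuple[str, str]) -> Set[Edge]:
    """Build an undirected edge set from (u, v) pairs."""
    return {frozenset(p) for p in pairs}

def frequent_edges(graphs: List[Set[Edge]], k: int) -> Set[Edge]:
    """Edges that occur in at least k of the input graphs."""
    counts = Counter(edge for g in graphs for edge in g)
    return {edge for edge, c in counts.items() if c >= k}

def density(nodes: Iterable[str], edges: Set[Edge]) -> float:
    """Fraction of possible node pairs within `nodes` that are linked."""
    node_set = set(nodes)
    n = len(node_set)
    if n < 2:
        return 0.0
    inside = sum(1 for e in edges if e <= node_set)
    return inside / (n * (n - 1) // 2)

# Hypothetical graphs over shared gene labels (one graph per dataset).
g1 = edge_set(("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"))
g2 = edge_set(("a", "b"), ("b", "c"), ("a", "c"))
g3 = edge_set(("a", "b"), ("a", "c"), ("d", "e"))

summary = frequent_edges([g1, g2, g3], k=2)
triangle_density = density({"a", "b", "c"}, summary)
```

In this toy input only the triangle a, b, c survives the frequency filter, and it is fully dense, while adding node d halves the density.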
10
Data Mining Process of CoDense
12
Mining Heterogeneous Information Networks
– Heterogeneous networks: multiple object types and/or multiple link types
– Examples: the DBLP bibliographic network (Venue, Paper, Author); the IMDB movie network (Actor, Movie, Director, Studio); the Facebook network
– Homogeneous networks are information-loss projections of heterogeneous networks!
– Our approach: directly mine the information-richer heterogeneous networks
– Current work: mining DBLP (CS bibliographic DB), PubMed, news, tweets, data.gov, …
13
Structured Heterogeneous Network Modeling Leads to the New Power of Data Mining!
– DBLP: a computer science bibliographic database (>2 M papers, >0.7 M authors, >10 K venues, …)
– [Figure: a sample publication record in DBLP]
– Power of heterogeneous network modeling: treat Author, Venue, Term, and Paper all as first-class citizens!
14
RankClus: Rank-Based Clustering
– RankClus (EDBT'09)/NetClus (KDD'09): integrate ranking and clustering for mining heterogeneous information networks
– RankCompete: organize your photo album automatically!
– Application: rank treatments for AIDS from MEDLINE
– [Figure: DBLP schema]
15
RankClass: Integration of Ranking and Classification
– Knowledge propagation via multi-typed heterogeneous networks
– Our innovation (ECMLPKDD'10/KDD'11): integrate ranking and classification; small training set; knowledge propagation across typed links; efficient and scalable
– Data: DBLP, a 4-field data set (DB, DM, AI, IR) forming a heterogeneous information network
– Rank objects within each class (with extremely limited label information)
– Obtain high classification accuracy and excellent rankings within each class
– Potential applications: biological network mining

Top-5 ranked conferences:
Database    Data Mining    AI       IR
VLDB        KDD            IJCAI    SIGIR
SIGMOD      SDM            AAAI     ECIR
ICDE        ICDM           ICML     CIKM
PODS        PKDD           CVPR     WWW
EDBT        PAKDD          ECML     WSDM

Top-5 ranked terms:
Database    Data Mining       AI           IR
data        mining            learning     retrieval
database    data              knowledge    information
query       clustering        reasoning    web
system      classification    logic        search
xml         frequent          cognition    text
16
Meta-Path Guided Similarity Search in Networks
– Similarity search: find similar objects in networks, e.g., who are most similar to AnHai Doan?
– Meta-path: a meta-level description of a path between two objects; different meta-paths carry rather different semantics
– Our innovation: PathSim (VLDB'11): similarity search in heterogeneous networks; a balanced similarity measure; user guidance by selecting different meta-paths
– Application in the biomedical domain (IBM): search for close relationships among diseases, drugs, treatments, side effects, and explanations
– [Figure: DBLP network schema]

Example under the meta-path Author-Paper-Venue-Paper-Author (APVPA):
Name              Affiliation      Area        PhD
AnHai Doan        CS, Wisconsin    Database    2002
Jignesh Patel     CS, Wisconsin    Database    1998
Amol Deshpande    CS, Maryland     Database    2004
Jun Yang          CS, Duke         Database    2001
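The "balanced" aspect of the measure can be illustrated with a small sketch. For a symmetric meta-path, PathSim scores two objects as twice the number of path instances between them, divided by the sum of each object's path instances to itself. The author-venue counts below are hypothetical, and the helper names `path_count` and `pathsim` are chosen for this example only.

```python
from typing import Dict

# Hypothetical A-P-V counts: number of papers each author has in each venue.
author_venue_counts: Dict[str, Dict[str, int]] = {
    "x": {"V1": 2, "V2": 1},
    "y": {"V1": 2, "V2": 1},
    "z": {"V1": 10, "V2": 5},
    "w": {"V3": 4},
}

def path_count(a: str, b: str, counts: Dict[str, Dict[str, int]]) -> int:
    """Number of APVPA path instances between authors a and b."""
    venues_a, venues_b = counts[a], counts[b]
    return sum(n * venues_b.get(venue, 0) for venue, n in venues_a.items())

def pathsim(a: str, b: str, counts: Dict[str, Dict[str, int]]) -> float:
    """PathSim under a symmetric meta-path: 2 * |a~b| / (|a~a| + |b~b|)."""
    denom = path_count(a, a, counts) + path_count(b, b, counts)
    return 2.0 * path_count(a, b, counts) / denom if denom else 0.0

sim_xy = pathsim("x", "y", author_venue_counts)  # same venues, same volume
sim_xz = pathsim("x", "z", author_venue_counts)  # same venues, z far more prolific
sim_xw = pathsim("x", "w", author_venue_counts)  # no shared venue
```

The normalization is what keeps a highly prolific author such as the hypothetical `z` from dominating every ranking: `x` and `y` publish in the same venues at the same volume and score 1.0, while `z` scores lower against `x` despite sharing more raw path instances.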
17
PathPredict: Meta-Path Based Relationship Prediction
– Meta path-guided prediction: infer or predict new relationships among multi-typed links
– Our contribution: PathPredict (ASONAM'11): co-author prediction (A-P-A) using topological features encoded by meta-paths, e.g., (A-P→P-A)
– Which meta-path is more important? Different meta-paths have different prediction power: p-values obtained from the DBLP data
– Co-author prediction for Jian Pei: only 42 among 4809 candidates are true first-time co-authors (trained on data collected in [1996, 2002]; testing period: [2003, 2009])
– Applications: who will be your new coauthors?
– [Figure: network schema with object types paper, topic, venue, and author, linked by publish/publish^-1, mention/mention^-1, write/write^-1, contain/contain^-1, and cite/cite^-1]
18
Truth Analysis: Enhancing the Quality of Heterogeneous Information Networks
– Motivation: information provided can be untrustworthy, error-prone, missing, …
– Application: handling conflicting claims on biomedical properties
– Our contribution: TruthFinder (TKDE'08): mutual enhancement of trustworthiness of info providers and claims; Latent Truth Model (VLDB'12): modeling two-sided truth
– [Figure: info providers w1 to w4 make claims f1 to f4 about objects o1 and o2]
– [Figure: multiple facts with two-sided claims (positive and negative, correct and incorrect), e.g., IMDB, Netflix, and BadSource on Harry Potter]
– Experimental datasets (large and real): Book Authors from abebooks.com (1263 books, 879 sources, 48153 claims, 2420 book-author, 100 labeled); Movie Directors from Bing (15073 movies, 12 sources, 108873 claims, 33526 movie-director, 100 labeled)
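The mutual enhancement between providers and claims can be sketched as a fixed-point iteration. This is a deliberately simplified version in the spirit of TruthFinder, assuming independent sources and leaving out the influence between conflicting claims and any dampening; the function name `truth_sketch`, the initial trust of 0.8, and the toy input are all assumptions.

```python
from typing import Dict, List, Tuple

def truth_sketch(
    claims_by_source: Dict[str, List[str]],
    iterations: int = 5,
    initial_trust: float = 0.8,
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Alternate between claim confidence and source trustworthiness.

    Each source must provide at least one claim.
    """
    trust = {w: initial_trust for w in claims_by_source}
    sources_by_claim: Dict[str, List[str]] = {}
    for w, claims in claims_by_source.items():
        for f in claims:
            sources_by_claim.setdefault(f, []).append(w)

    confidence: Dict[str, float] = {}
    for _ in range(iterations):
        # A claim is wrong only if every source asserting it is wrong.
        for f, sources in sources_by_claim.items():
            disbelief = 1.0
            for w in sources:
                disbelief *= 1.0 - trust[w]
            confidence[f] = 1.0 - disbelief
        # A source is as trustworthy as its claims are, on average.
        for w, claims in claims_by_source.items():
            trust[w] = sum(confidence[f] for f in claims) / len(claims)
    return trust, confidence

# Hypothetical input: three sources agree on f1, one source asserts f2.
trust, confidence = truth_sketch(
    {"w1": ["f1"], "w2": ["f1"], "w3": ["f1"], "w4": ["f2"]}
)
```

In this toy run the claim backed by three providers ends with higher confidence than the claim backed by one, and its providers end up more trusted.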
20
Hierarchical Relationship Discovery
– From partially ordered objects to a hierarchy (tree)
– Based on NLP or other techniques to extract partially ordered objects
– Using constraints to discover relationships

Singleton potential (type: cognitive description; the potential-definition formulas were lost in extraction):
– Homophile: parent and child are similar
– Polarity: parent is superior to child
– Support pattern: patterns frequently occurring with child-parent pairs
– Forbidden pattern: patterns rarely occurring with child-parent pairs

Pairwise potential (type: cognitive description; the potential-definition formulas were lost in extraction):
– Attribute augment: use inherited attributes from parents or children
– Label propagate: similar nodes share similar parents (or children)
– Reciprocity: patterns altering in child-parent & parent-child pairs
– Constraints: restrict certain patterns

Cases: discovery of the Kenny family tree
21
Recursive Construction of a Topical Hierarchy by Phrase Mining
The framework of CATHY (Constructing A Topical HierarchY):
– Topic discovery
– Topical phrase mining and ranking
– Recursive construction
– Term co-occurrence network
22
Growing Parallel Paths (WWW 2011): Result
23
WinaCS: Web Information Network Analysis for Computer Science
Database records can be found on link paths!
24
Research-Insight [SIGMOD'13 Demo]
– Advisor-advisee result for "Kevin Chang"
– Potential collaborators for "Jiawei Han"
– Query on "Jim Gray"
– Query on "Machine Learning"
26
Event Cube: An Overview
– Event Cube: an organized approach for mining and understanding anomalous aviation events; funded by NASA (2008-2010)
– Multidimensional text database with dimensions Time (e.g., 98.01, 98.02, 99.01, 99.02; higher level 1998, 1999), Location (e.g., LAX, SJC, MIA, AUS; higher level CA, FL, TX), and Topic (e.g., overshoot, undershoot, birds, turbulence; higher level Deviation, Encounter)
– Event cube representation with drill-down and roll-up
– Analysis support for the analyst: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison, …
27
Text/Topic Cube: General Idea
– The data are heterogeneous: categorical attributes plus unstructured text. How to combine them?
– Our solution: build the cube on the categorical attributes of each event report (ACN, Time, Location, Place, Environment, …) and use a text/topic model over the unstructured text as the measure (term/topic weights: T1 with W1, T2 with W2, T3 with W3, …)
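The combination can be sketched with a tiny cube whose cells are keyed by categorical attributes and whose measure is a bag of term counts. This is a minimal sketch under assumptions: raw term counts stand in for the text/topic model, the reports are invented, and the names `build_cube`, `roll_up`, and `AIRPORT_TO_STATE` (plus reading "98.01" as a period within 1998) are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Cell = Tuple[str, str]  # (time period, airport)

# Assumed location hierarchy: airport -> state.
AIRPORT_TO_STATE = {"LAX": "CA", "SJC": "CA", "MIA": "FL", "AUS": "TX"}

# Hypothetical event reports: (time period, airport, report text).
reports: List[Tuple[str, str, str]] = [
    ("98.01", "LAX", "runway overshoot"),
    ("98.02", "SJC", "overshoot after turbulence"),
    ("99.01", "MIA", "birds near runway"),
    ("99.02", "AUS", "turbulence on descent"),
]

def build_cube(rows: List[Tuple[str, str, str]]) -> Dict[Cell, Counter]:
    """Base cuboid: one term-count measure per (period, airport) cell."""
    cube: Dict[Cell, Counter] = defaultdict(Counter)
    for period, airport, text in rows:
        cube[(period, airport)].update(text.lower().split())
    return cube

def roll_up(
    cube: Dict[Cell, Counter], key_fn: Callable[[Cell], Hashable]
) -> Dict[Hashable, Counter]:
    """Aggregate cell measures to a coarser level chosen by `key_fn`."""
    rolled: Dict[Hashable, Counter] = defaultdict(Counter)
    for cell, terms in cube.items():
        rolled[key_fn(cell)].update(terms)
    return rolled

cube = build_cube(reports)
by_state = roll_up(cube, lambda cell: AIRPORT_TO_STATE[cell[1]])
by_year = roll_up(cube, lambda cell: "19" + cell[0][:2])
```

Because term counts add up, the text measure aggregates under roll-up just like a numeric one; a real topic model would need its own aggregation rule.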
28
Effective OLAP Exploration
– TopCells (ICDE'10): ranking aggregated cells (objects) in TextCube
– TEXplorer (CIKM'11): integrating keyword-based ranking and OLAP exploration
– [Figure: example exploration for the query "Healthcare Reform"]
29
EventCube Snapshot: Query Result
31
MoveMine: Mining Moving Object Databases
A system that mines moving object patterns: Z. Li, et al., "MoveMine: Mining Moving Object Databases", SIGMOD'10 (system demo)
32
Mining Spatiotemporal and Mobility Data
– [Figure: raw movement data (time series view, longitude and latitude against time in hours) and a density map with four spots]
– Spot #1: office
– Spot #2: commuting city
– Spot #3: home
– Spot #4: vacation place
33
Mining Periodicity in Sparse Data [KDD12]
– Example: an event has a period of 20
– Occurrences of the event happen between 20k+5 and 20k+10
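The example can be turned into a small experiment: fold sparse observation times by a candidate period and check how tightly they cluster. The scoring rule below is an illustrative heuristic, not the measure from the KDD'12 paper; the function names, the window width of 6 offsets, and the observation times are all assumptions.

```python
from typing import Iterable, List

def window_fraction(times: List[int], period: int, width: int) -> float:
    """Largest fraction of observations whose offsets (t mod period)
    fall inside one circular window of `width` consecutive offsets."""
    residues = [t % period for t in times]
    best = max(
        sum(1 for r in residues if (r - start) % period < width)
        for start in range(period)
    )
    return best / len(times)

def period_score(times: List[int], period: int, width: int) -> float:
    """Observed concentration minus what a uniform spread would give."""
    return window_fraction(times, period, width) - min(1.0, width / period)

def detect_period(times: List[int], candidates: Iterable[int], width: int) -> int:
    """Candidate period with the highest score."""
    return max(candidates, key=lambda p: period_score(times, p, width))

# Hypothetical sparse observations: only 7 of the cycles k = 0..12 are seen,
# each at a time between 20k+5 and 20k+10.
observations = [5, 27, 70, 86, 148, 185, 249]
best_period = detect_period(observations, range(2, 41), width=6)
```

Subtracting `width / period` keeps short periods from winning trivially, since a window of 6 offsets covers most of a short cycle no matter when the events occur.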
34
GeoTopic Discovery: Mining Spatial Text
– Methods compared: LDM, TDM, GeoFolk, LGTA
– Data: geo-tagged photos with landscape topics (coast vs. desert vs. mountain)
– Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11
35
LPTA (Latent Periodic Topic Analysis): Discovery of Temporal Patterns of Topics
– Periodic topic: repeats at regular intervals
– Background topic: covered uniformly over the entire period
– Bursty topic: a transient topic that is intensively covered only in a certain time period
– Integration of both text and time in the analysis
– [Figure: time distribution of topics]
36
Social Relationship Mining from Sensor Trace Data
– T-Motif: a time interval [S, T] during which many positive pairs meet and few negative pairs meet
– Example: MIT Reality Mining dataset, 94 people tracked for 10 months
– Uses only spatiotemporal information
– Algorithms for efficient mining of T-motifs and effective classification
37
Mining RFID Data to Explore Trajectories
– Warehousing and mining RFID data
– Example trajectory: (Factory, T1, T2) → (Shipping, T3, T4) → (Warehouse, T5, T6) → (Shelf, T7, T8) → (Checkout, T9, T10)
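A trajectory in the (location, time-in, time-out) form above can be derived from raw tag reads by merging consecutive reads at the same location. The slide does not describe this step, so the sketch below is an assumption about how such stay records might be produced; the function name `reads_to_stays` and the integer timestamps are hypothetical.

```python
from typing import List, Tuple

Read = Tuple[str, int]       # (location, timestamp) for a single tag
Stay = Tuple[str, int, int]  # (location, time in, time out)

def reads_to_stays(reads: List[Read]) -> List[Stay]:
    """Collapse time-ordered reads of one tag into stay records."""
    stays: List[Stay] = []
    for location, t in reads:
        if stays and stays[-1][0] == location:
            # Same location as the previous read: extend the current stay.
            stays[-1] = (location, stays[-1][1], t)
        else:
            stays.append((location, t, t))
    return stays

# Hypothetical reads for one item moving through a supply chain.
reads = [
    ("Factory", 1), ("Factory", 2),
    ("Shipping", 3), ("Shipping", 4),
    ("Warehouse", 5), ("Warehouse", 6),
]
trajectory = reads_to_stays(reads)
```

Storing stays instead of individual reads shrinks the data considerably when readers report the same tag many times at one location.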
39
Conclusions
– An Introduction to Data Mining Research Group
– Pattern Discovery Methods
– Mining Heterogeneous Information Networks
– Construction of Heterogeneous Information Networks from Unstructured Data
– TextCube and OLAP heterogeneous networks
– Mining Cyber-Physical Systems and Networks
Lots to be done in this promising research frontier!