Current Research in Data Mining Research Group
Jiawei Han
Data Mining Research Group, Department of Computer Science
University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
June 3, 2016
Outline
- An Introduction to Data Mining Research Group
- Mining and OLAPing Information Networks
  - Mining Heterogeneous Information Networks
  - Mining Text-Rich Information Networks
  - OLAPing (multi-dimensional analysis) of information networks: TextCube, OLAP heterogeneous networks
- Taming the Web: WINACS (integrated mining of Web structures and contents)
- Mining Cyber-Physical Systems and Networks
- Conclusions
Data Mining and Data Warehousing
Jiawei Han’s Group at CS, UIUC
- Mining patterns and discovering knowledge from massive data
- Data mining in heterogeneous information networks
- Exploring broad applications of data mining
- Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
- 600+ research papers in conferences and journals
- Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award
- Textbook, “Data Mining: Concepts and Techniques,” adopted worldwide
- Project lead for NASA EventCube for Aviation Safety
- Director of the Information Network Academic Research Center, funded by the Army Research Lab (ARL)
Data Mining Research Group at CS, UIUC
New Books on Data Mining & Link Mining
- Han, Kamber and Pei, Data Mining, 3rd ed.
- Yu, Han and Faloutsos (eds.), Link Mining, 2010
- Sun and Han, Mining Heterogeneous Information Networks, 2012
Mining Heterogeneous Information Networks
- RankClus/NetClus
- RankCompete: A Competing Random Walk Model for Rank-Based Clustering

                              Database   Data Mining      AI          IR
Top-5 ranked conferences      VLDB       KDD              IJCAI       SIGIR
                              SIGMOD     SDM              AAAI        ECIR
                              ICDE       ICDM             ICML        CIKM
                              PODS       PKDD             CVPR        WWW
                              EDBT       PAKDD            ECML        WSDM
Top-5 ranked terms            data       mining           learning    retrieval
                              database   data             knowledge   information
                              query      clustering       reasoning   web
                              system     classification   logic       search
                              xml        frequent         cognition   text

- RankClass [KDD11]: Knowledge Propagation in Heterogeneous Network
Similarity Search and Role Discovery in Information Networks
- PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
  - Which images are most similar to me in Flickr?
  - Path: ITI; Path: ITIGITI
- Role Discovery in Information Networks [KDD’10]
  - A “dirty” information network (imaginary) → automatically infer → cleaned/inferred adversarial network (Chief, Insurgent, Cell Lead)

Advisee          Top Ranked Advisor       Time     Note
David M. Blei    1. Michael I. Jordan              PhD advisor
                 John D. Lafferty                  Postdoc, 2006
Hong Cheng       1. Qiang Yang            02-03    MS advisor
                 Jiawei Han               04-08    PhD advisor, 2008
Sergey Brin      1. Rajeev Motwani                 Unofficial advisor
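As a rough illustration of meta path-guided similarity search, the sketch below computes a PathSim-style score for the symmetric meta path ITI (image-tag-image). The score 2·|paths x~y| / (|paths x~x| + |paths y~y|) follows the definition in the PathSim paper; the toy image/tag data and the function names `path_count` and `pathsim` are invented for illustration and are not the authors' implementation.

```python
from collections import Counter

def path_count(adj, x, y):
    """Number of I-T-I path instances between images x and y.

    adj maps each image to the list of tags it is linked to
    (hypothetical toy representation of the network).
    """
    cx, cy = Counter(adj[x]), Counter(adj[y])
    # Each shared tag contributes (links from x) * (links from y) paths.
    return sum(cx[tag] * cy[tag] for tag in cx)

def pathsim(adj, x, y):
    """PathSim score of x and y under the symmetric meta path I-T-I."""
    denom = path_count(adj, x, x) + path_count(adj, y, y)
    return 2 * path_count(adj, x, y) / denom if denom else 0.0

# Toy data (assumption): three images and their tags.
images = {
    "img1": ["beach", "sunset"],
    "img2": ["beach", "sunset"],
    "img3": ["beach", "city", "night", "street"],
}

if __name__ == "__main__":
    for other in ("img2", "img3"):
        print("img1 vs", other, round(pathsim(images, "img1", other), 3))
```

The normalization by self-path counts is what keeps a heavily tagged image such as `img3` from looking similar to everything merely because it has many links.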
Meta-Paths & Their Prediction Power
- List all the meta-paths in the bibliographic network up to length 4
- Investigate their respective power for coauthor relationship prediction
  - Which meta-path has more prediction power?
  - How to combine them to achieve the best quality of prediction?
Relationship Prediction in Heterogeneous Info Networks
- Why prediction of co-author relationship in DBLP?
  - Prediction of relationships between different types of nodes in heterogeneous networks
  - E.g., what papers should Faloutsos write?
- Traditional link prediction: homogeneous networks
  - Co-author networks in DBLP, friendship networks in Facebook
- Relationship prediction
  - Study the roles of topological features in heterogeneous networks in predicting the building of co-author relationships
- Meta-path guided prediction!
- Y. Sun, et al., "Co-Author Relationship Prediction in Heterog. Bibliographic Networks", ASONAM'11, July
Guidance: Meta Path in Bibliographic Network
- Relationship prediction: meta path-guided prediction
- Meta path relationships among similar typed links share similar semantics and are comparable and inferable
- Network schema (figure): node types paper, topic, venue, author; relations publish/publish^-1, mention/mention^-1, write/write^-1, contain/contain^-1, cite/cite^-1
- Co-author prediction (A-P-A) using topological features also encoded by meta paths, e.g., citation relations between authors (A-P→P-A)
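To make "topological features encoded by meta paths" concrete, here is a minimal sketch that counts path instances from one author along A-P-A and A-P→P-A. The relation names mirror the schema above (write, cite, and the inverse of write); the toy authors and papers, and the helpers `count_paths` and `invert`, are assumptions made only for this example.

```python
def invert(rel):
    """Build the inverse relation, e.g. write -> write^-1."""
    inv = {}
    for src, dsts in rel.items():
        for dst in dsts:
            inv.setdefault(dst, []).append(src)
    return inv

def count_paths(start, relations):
    """Follow the relations in order from `start`.

    Returns {end_node: number of path instances}, i.e. the path count
    feature between `start` and every reachable end node.
    """
    frontier = {start: 1}
    for rel in relations:
        nxt = {}
        for node, n in frontier.items():
            for nb in rel.get(node, ()):
                nxt[nb] = nxt.get(nb, 0) + n
        frontier = nxt
    return frontier

# Toy bibliographic network (assumption).
write = {"alice": ["p1", "p2"], "bob": ["p2", "p3"], "carol": ["p4"]}
cite = {"p1": ["p3", "p4"], "p2": ["p4"]}
written_by = invert(write)

# A-P-A: co-authorship paths starting at alice.
apa = count_paths("alice", [write, written_by])
# A-P->P-A: authors whose papers alice's papers cite.
ap_cites_pa = count_paths("alice", [write, cite, written_by])

if __name__ == "__main__":
    print(apa)
    print(ap_cites_pa)
```

Each meta path yields one numeric feature per author pair; a prediction model can then weight these features against each other.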
Case Study in CS Bibliographic Network
- The learned significance for each meta path under the measure “normalized path count” for the HP-3hop dataset
Case Study: Predicting Concrete Co-Authors
- High quality predictive power for such a difficult task
- Using data in T0 = [1989; 1995] and T1 = [1996; 2002]
- Predict new coauthor relationships in T2 = [2003; 2009]
iTopicModel: Model Set-Up & Objective Function
- Graphical model
  - θ_i = (θ_i1, θ_i2, …, θ_iT): topic distribution for document x_i
  - Structural layer: follows the same topology as the document network
  - Text layer: follows PLSA, i.e., for each word, pick a topic z ~ multi(θ_i), then pick a word w ~ multi(β_z)
- Objective function: joint probability, consisting of a structure part and a text part (they can be modeled separately!)
  - X: observed text information
  - G: document network
  - Parameters: θ (topic distribution), β (word distribution)
  - θ is the most critical: it needs to be consistent with the text as well as the network structure
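The text layer's generative step can be sketched in a few lines. This is only an illustration of the PLSA-style process stated above (topic z ~ multi(θ_i), then word w ~ multi(β_z)) together with the matching text-part log-likelihood; it does not cover the structural layer, and the vocabulary, parameter values, and function names are made up for the example.

```python
import math
import random

def sample_document(theta_i, beta, vocab, n_words, rng):
    """Generate n_words for one document: z ~ multi(theta_i), w ~ multi(beta[z])."""
    topics = list(range(len(theta_i)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta_i, k=1)[0]
        w = rng.choices(vocab, weights=beta[z], k=1)[0]
        words.append(w)
    return words

def text_log_likelihood(words, theta_i, beta, vocab):
    """Text part only: sum over words of log sum_z theta_i[z] * beta[z][w]."""
    index = {w: j for j, w in enumerate(vocab)}
    total = 0.0
    for w in words:
        p = sum(theta_i[z] * beta[z][index[w]] for z in range(len(theta_i)))
        total += math.log(p)
    return total

# Hypothetical toy parameters: 2 topics over a 3-word vocabulary.
vocab = ["query", "index", "kernel"]
beta = [[0.5, 0.5, 0.0],   # topic 0: database-like words
        [0.0, 0.0, 1.0]]   # topic 1: learning-like word
theta_doc = [1.0, 0.0]     # this document is entirely topic 0

rng = random.Random(0)
doc = sample_document(theta_doc, beta, vocab, n_words=10, rng=rng)
ll = text_log_likelihood(doc, theta_doc, beta, vocab)
```

In the full model θ would additionally be tied to the document network G through the structure part, which this sketch leaves out.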
Case Study: Topic Hierarchy Building for DBLP
Probabilistic Topic Models with Network-Based Biased Propagation
- Text-rich heterogeneous information network
  - Ubiquitous textual documents (news, papers)
  - Connected with users and other objects: topic propagation
- How to discover latent topics and identify clusters of multi-typed objects simultaneously?
- How can text data and heterogeneous information networks mutually enhance each other in topic modeling and other text mining tasks?
- Deng, Han et al., “Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks”, KDD’11
Biased Topic Propagation
- Intuition: InfoNet provides valuable information
  - Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
  - Documents with rich text and other objects without explicit text should be treated differently
  - Topic(D) ← inherent text + connected U
  - Topic(U) ← connected D
- Basic criterion (biased topic propagation)
  - The topic of an object without explicit text depends on the topics of the documents it connects to
  - The topic of a document is correlated with its connected objects to some extent, but should be principally determined by the inherent content of its text
- A simple, unbiased topic propagation does not make much sense
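One way to picture the basic criterion is a single propagation step in which objects without text simply average their documents' topics, while each document keeps most of its own text-based topic and mixes in only a small share from its neighbors. The sketch below is a simplified assumption for illustration, not the model from the KDD'11 paper: the mixing weight `bias`, the averaging rule, and all names and data are hypothetical.

```python
def mean_dist(dists):
    """Component-wise average of a non-empty list of topic distributions."""
    k = len(dists[0])
    return [sum(d[t] for d in dists) / len(dists) for t in range(k)]

def propagate(doc_text_topics, links, bias=0.2):
    """One biased propagation step.

    doc_text_topics: doc -> topic distribution estimated from its text
    links:           user -> list of docs the user is connected to
    bias:            hypothetical weight of neighbor topics for documents
    """
    # Topic(U) <- connected D: users have no text, so they only inherit.
    user_topics = {u: mean_dist([doc_text_topics[d] for d in docs])
                   for u, docs in links.items()}

    doc_users = {}
    for u, docs in links.items():
        for d in docs:
            doc_users.setdefault(d, []).append(u)

    # Topic(D) <- inherent text (dominant) + connected U (minor share).
    doc_topics = {}
    for d, text in doc_text_topics.items():
        if d in doc_users:
            nb = mean_dist([user_topics[u] for u in doc_users[d]])
            doc_topics[d] = [(1 - bias) * t + bias * n for t, n in zip(text, nb)]
        else:
            doc_topics[d] = list(text)
    return doc_topics, user_topics

# Toy example (assumption): two documents, two users, two topics.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
links = {"u1": ["d1", "d2"], "u2": ["d1"]}
doc_topics, user_topics = propagate(docs, links, bias=0.2)
```

Setting `bias` to 0.5 or higher would approximate the "simple and unbiased" propagation that the slide argues against, since a document's own text would no longer dominate.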
Incorporating Heterogeneous Info. Network
- L(C): topic model
- R(G): biased propagation
Experiments: DBLP & NSF Awards
- Data collection: DBLP, NSF-Awards
- Metrics: accuracy (AC), normalized mutual information (NMI)
- Results
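For readers unfamiliar with the NMI metric, a small sketch of how it can be computed from two label assignments follows. Several normalizations of mutual information are in use; this version divides by the geometric mean of the two entropies, which is an assumption and may differ from the variant used in the experiments. Accuracy (AC) additionally requires matching cluster ids to classes and is not shown.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information between two labelings.

    Normalized by sqrt(H(true) * H(pred)); other variants exist.
    """
    n = len(labels_true)
    ct = Counter(labels_true)
    cp = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the joint and marginal label frequencies.
    mi = sum((c / n) * math.log((c * n) / (ct[a] * cp[b]))
             for (a, b), c in joint.items())
    h_true = -sum((c / n) * math.log(c / n) for c in ct.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in cp.values())

    denom = math.sqrt(h_true * h_pred)
    # Degenerate case (one side has a single cluster): report 0 here.
    return mi / denom if denom > 0 else 0.0

if __name__ == "__main__":
    # Same grouping with swapped cluster ids scores as a perfect match.
    print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
    # A grouping that is independent of the classes scores zero.
    print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because NMI only compares groupings, it does not depend on how the cluster ids happen to be numbered.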
Event Cube: An Overview
Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events (funded by NASA)
[Figure: multidimensional text database → Event Cube representation → analysis support for the analyst]
- Dimensions: Time, Location, Topic
- Location hierarchy: airports (LAX, SJC, MIA, AUS) roll up to states (CA, FL, TX)
- Topic hierarchy: Deviation (overshoot, undershoot), Encounter (birds, turbulence)
- Operations: drill-down, roll-up
- Analysis support: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison, …
Text/Topic Cube: General Idea
- Heterogeneous: categorical attributes + unstructured text. How to combine?
- Our solution:
  - Cube: categorical attributes of each event report (ACN, Time, Location, Place, Environment, …, text data)
  - Text/topic model: the unstructured text becomes the measure, stored as term/topic weights (T1: W1, T2: W2, T3: W3, …)
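The combination of categorical dimensions with a term-weight measure can be sketched as a group-by that sums term weights per cell; rolling up is the same aggregation over fewer dimensions. The report records, attribute names, and the choice of summed weights as the measure are assumptions for illustration only, not the actual Text/Topic Cube design.

```python
from collections import defaultdict

def build_cells(reports, dims):
    """Aggregate term weights into cube cells keyed by the chosen dimensions.

    Each report is a dict of categorical attributes plus a "terms"
    dict of term -> weight (hypothetical layout).
    """
    cells = defaultdict(lambda: defaultdict(float))
    for r in reports:
        key = tuple(r[d] for d in dims)
        for term, weight in r["terms"].items():
            cells[key][term] += weight
    return {k: dict(v) for k, v in cells.items()}

# Toy event reports (assumption).
reports = [
    {"time": "2009", "location": "LAX", "terms": {"overshoot": 2.0, "birds": 1.0}},
    {"time": "2009", "location": "SJC", "terms": {"birds": 3.0}},
    {"time": "2010", "location": "LAX", "terms": {"turbulence": 1.0}},
]

# Finer cuboid: (time, location). Rolled-up cuboid: (time,) only.
by_time_location = build_cells(reports, ["time", "location"])
by_time = build_cells(reports, ["time"])

if __name__ == "__main__":
    print(by_time_location[("2009", "LAX")])
    print(by_time[("2009",)])
```

Drill-down is the reverse direction: moving from the `(time,)` cells back to the finer `(time, location)` cells.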
Effective Keyword Search
- TopCells (ICDE’10): ranking aggregated cells (objects) in TextCube
- Example query: Healthcare Reform
Effective OLAP Exploration
- TEXplorer (submitted): integrating keyword-based ranking and OLAP exploration
- Example query: Healthcare Reform
Effective Event Tracking
- PET (KDD’10): tracking popularity and textual representation of events in social communities (Twitter)
- Example event: Healthcare Reform, with representative terms shifting over time:
  - debate, cost, senate, …
  - pass, success, law, …
  - benefit, profit, effective, …
Growing Parallel Paths (WWW 2011)
Result: [figure]
Mapping Pages to Records (CIKM’10)
- Database records can be found on link paths!
WinaCS: Web Information Network Analysis for Computer Science
- Integration of Web structure mining and information network analysis
- Tim Weninger, Marina Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD'11 (system demo), Athens, Greece, June 2011.
Discovery of Swarms and Periodic Patterns in Moving Object Data
- A system that mines moving object patterns: Z. Li, et al., “MoveMine: Mining Moving Object Databases”, SIGMOD’10 (system demo)
- Z. Li, B. Ding, J. Han, and R. Kays, “Mining Hidden Periodic Behaviors for Moving Objects”, KDD’10 (sub)
- Z. Li, B. Ding, J. Han, and R. Kays, “Swarm: Mining Relaxed Temporal Moving Object Clusters”, VLDB’10 (sub)
- Figures: bird flying paths shown on Google Earth (left) and periodic patterns mined by our new method (right); Convoy discovers only restricted patterns (left) while Swarm discovers more patterns (right)
GeoTopic Discovery: Mining Spatial Text
- Geo-tagged photos with landscape topics (coast vs. desert vs. mountain)
- Figure panels: LDM, TDM, GeoFolk, LGTA
- Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11
Conclusions: Towards Mining Data Semantics in Integrated Heterog. Networks
- Most data objects are linked, forming heterogeneous information networks
- Most datasets can be “organized” or “transformed” into “structured” multi-typed heterogeneous info. networks
  - Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
- Structures can be progressively mined from less organized data sets by info. network analysis
- Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks
  - Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, …
- It is promising to mine data semantics from rich info. networks!
References for the Talk
- J. Han, Y. Sun, X. Yan, and P. S. Yu, “Mining Heterogeneous Information Networks” (tutorial), KDD'10.
- M. Ji, J. Han, and M. Danilevsky, “Ranking-Based Classification of Heterogeneous Information Networks”, KDD'11.
- Y. Sun, J. Han, et al., “RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis”, EDBT'09.
- Y. Sun, Y. Yu, and J. Han, “Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema”, KDD'09.
- Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, “PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks”, VLDB'11.
- Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, “Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks”, ASONAM'11.
- C. Wang, J. Han, et al., “Mining Advisor-Advisee Relationships from Research Publication Networks”, KDD'10.
- T. Weninger, M. Danilevsky, et al., “WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks”, ACM SIGMOD'11 (system demo).
- X. Yin, J. Han, and P. S. Yu, “Truth Discovery with Multiple Conflicting Information Providers on the Web”, IEEE TKDE, 20(6),