Current Research in Data Mining Research Group (Jiawei Han, Data Mining Research Group)

Presentation transcript:

Current Research in Data Mining Research Group
Jiawei Han, Data Mining Research Group, Department of Computer Science, University of Illinois at Urbana-Champaign
Acknowledgements: NSF, ARL, NASA, AFOSR (MURI), DHS, Microsoft, IBM, Yahoo!, HP Lab & Boeing
June 3, 2016

Outline
• An Introduction to Data Mining Research Group
• Mining and OLAPing Information Networks
• Mining Heterogeneous Information Networks
• Mining Text-Rich Information Networks
• OLAPing (multi-dimensional analysis) of information networks: TextCube, OLAP heterogeneous networks
• Taming the Web: WinaCS (integrated mining of Web structures and contents)
• Mining Cyber-Physical Systems and Networks
• Conclusions

Data Mining and Data Warehousing: Jiawei Han's Group at CS, UIUC
• Mining patterns and knowledge discovery from massive data
• Data mining in heterogeneous information networks
• Exploring broad applications of data mining
Developed many effective data mining algorithms, e.g., FPgrowth, PrefixSpan, gSpan, StarCubing, CrossMine, RankingCube, CrossClus, RankClus, and NetClus
600+ research papers in conferences and journals
Fellow of ACM, Fellow of IEEE, ACM SIGKDD Innovation Award, W. McDowell Award, Daniel Drucker Eminent Faculty Award
Textbook, "Data Mining: Concepts and Techniques," adopted worldwide
Project lead for NASA EventCube for Aviation Safety [ ]
Director of Information Network Academic Research Center, funded by Army Research Lab (ARL) [ ]

Data Mining Research Group at CS, UIUC

New Books on Data Mining & Link Mining
• Han, Kamber and Pei, Data Mining, 3rd ed.
• Yu, Han and Faloutsos (eds.), Link Mining, 2010
• Sun and Han, Mining Heterogeneous Information Networks, 2012


Mining Heterogeneous Information Networks
RankClus/NetClus
RankCompete: A Competing Random Walk Model for Rank-Based Clustering

                            Database    Data Mining       AI           IR
Top-5 ranked conferences    VLDB        KDD               IJCAI        SIGIR
                            SIGMOD      SDM               AAAI         ECIR
                            ICDE        ICDM              ICML         CIKM
                            PODS        PKDD              CVPR         WWW
                            EDBT        PAKDD             ECML         WSDM
Top-5 ranked terms          data        mining            learning     retrieval
                            database    data              knowledge    information
                            query       clustering        reasoning    web
                            system      classification    logic        search
                            xml         frequent          cognition    text

RankClass [KDD11]: Knowledge Propagation in Heterogeneous Network

Similarity Search and Role Discovery in Information Networks

PathSim [VLDB11]: Meta Path-Guided Similarity Search in Networks
Which images are most similar to me in Flickr?
• Path: ITI
• Path: ITIGITI

Role Discovery in Information Networks [KDD'10]
[Figure: a "dirty" information network (imaginary) is automatically turned into a cleaned/inferred adversarial network with roles such as Chief, Insurgent, and Cell Lead]

Advisee          Top Ranked Advisor     Time     Note
David M. Blei    1. Michael I. Jordan            PhD advisor
                 John D. Lafferty                Postdoc, 2006
Hong Cheng       1. Qiang Yang          02-03    MS advisor
                 Jiawei Han             04-08    PhD advisor, 2008
Sergey Brin      1. Rajeev Motwani               Unofficial advisor
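The PathSim measure named on this slide can be illustrated with a small sketch. It uses the published PathSim definition for a symmetric meta path: twice the number of path instances between x and y, divided by the number of path instances from x to itself plus those from y to itself. The image-tag matrix and the helper names `commuting_matrix` and `pathsim` are made up for illustration and are not taken from the talk.

```python
def commuting_matrix(w):
    """Count path instances for the symmetric meta path X-Y-X, i.e. W * W^T."""
    n = len(w)
    width = len(w[0])
    return [[sum(w[i][k] * w[j][k] for k in range(width)) for j in range(n)]
            for i in range(n)]


def pathsim(m, x, y):
    """PathSim: 2 * paths(x, y) / (paths(x, x) + paths(y, y))."""
    denom = m[x][x] + m[y][y]
    if denom == 0:
        return 0.0
    return 2.0 * m[x][y] / denom


# Hypothetical data: image_tag[i][t] = 1 if image i carries tag t (path I-T-I).
image_tag = [
    [1, 1, 0, 0],  # image 0
    [1, 1, 1, 1],  # image 1: shares both tags with image 0 but has many more
    [1, 1, 0, 0],  # image 2: same tag profile as image 0
]

M = commuting_matrix(image_tag)
score_same_profile = pathsim(M, 0, 2)
score_heavier_peer = pathsim(M, 0, 1)
```

The normalization by self-paths is what makes image 2 rank above image 1 for query image 0, even though both share the same two tags with it.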

Meta-Paths & Their Prediction Power
• List all the meta-paths in the bibliographic network up to length 4
• Investigate their respective power for coauthor relationship prediction
• Which meta-path has more prediction power?
• How to combine them to achieve the best quality of prediction?
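As a rough sketch of the first step above, meta-paths between two author nodes can be enumerated by walking the network schema. The schema below (author A, paper P, venue V, topic T) is a simplified assumption that ignores relation direction, so a citation and its inverse collapse into one P-P step; `SCHEMA` and `enumerate_meta_paths` are illustrative names only.

```python
# Assumed, simplified schema: which node types are directly related.
SCHEMA = {
    "A": ["P"],                  # author writes paper
    "P": ["A", "V", "T", "P"],   # paper: written by, published in, mentions, cites
    "V": ["P"],
    "T": ["P"],
}


def enumerate_meta_paths(schema, start, end, max_len):
    """Return all type sequences from start to end using at most max_len links."""
    results = []

    def extend(path):
        hops = len(path) - 1
        if hops >= 1 and path[-1] == end:
            results.append("-".join(path))
        if hops == max_len:
            return
        for nxt in schema[path[-1]]:
            extend(path + [nxt])

    extend([start])
    return results


meta_paths = enumerate_meta_paths(SCHEMA, "A", "A", 4)
for mp in meta_paths:
    print(mp)
```

Under this simplified schema the enumeration yields six author-to-author meta-paths; a directed schema would split the citation-based ones further.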

Relationship Prediction in Heterogeneous Info Networks
• Why prediction of co-author relationships in DBLP?
• Prediction of relationships between different types of nodes in heterogeneous networks
• E.g., what papers should Faloutsos write?
• Traditional link prediction: homogeneous networks
• Co-author networks in DBLP, friendship networks in Facebook
• Relationship prediction
• Study the roles of topological features in heterogeneous networks in predicting the building of co-author relationships
• Meta-path guided prediction!
• Y. Sun, et al., "Co-Author Relationship Prediction in Heterog. Bibliographic Networks", ASONAM'11, July

Guidance: Meta Path in Bibliographic Network
• Relationship prediction: meta path-guided prediction
• Meta path relationships among similar typed links share similar semantics and are comparable and inferable
[Figure: network schema with node types paper, topic, venue, author and relations publish/publish^-1, mention/mention^-1, write/write^-1, contain/contain^-1, cite/cite^-1]
• Co-author prediction (A-P-A) using topological features also encoded by meta paths, e.g., citation relations between authors (A-P→P-A)
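To make the two meta paths on this slide concrete, the sketch below counts path instances by multiplying adjacency matrices: A-P-A counts shared papers, and A-P→P-A counts citations from one author's papers to another's. The matrices are toy data and the helper names are assumptions for illustration.

```python
def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


# Toy data: writes[a][p] = 1 if author a wrote paper p.
writes = [
    [1, 1, 0],  # author 0 wrote papers 0 and 1
    [0, 1, 1],  # author 1 wrote papers 1 and 2
    [0, 0, 1],  # author 2 wrote paper 2
]
# Toy data: cites[p][q] = 1 if paper p cites paper q.
cites = [
    [0, 0, 1],  # paper 0 cites paper 2
    [0, 0, 0],
    [0, 0, 0],
]

# Path counts for A-P-A (co-authorship) and A-P->P-A (author cites author).
apa = matmul(writes, transpose(writes))
ap_cites_pa = matmul(matmul(writes, cites), transpose(writes))
```

Here authors 0 and 2 have never co-authored (`apa[0][2]` is 0), yet the citation meta path connects them (`ap_cites_pa[0][2]` is 1), which is the kind of topological feature the slide proposes to use for prediction.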

Case Study in CS Bibliographic Network
• The learned significance for each meta path under measure "normalized path count" for HP-3hop dataset

Case Study: Predicting Concrete Co-Authors
• High quality predictive power for such a difficult task
• Using data in T0 = [1989; 1995] and T1 = [1996; 2002]
• Predict new coauthor relationship in T2 = [2003; 2009]
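One plausible reading of the interval setup above is sketched below: a co-author pair counts as "new" in a later interval if it did not occur in the preceding one. The slide does not spell out the exact protocol, so the pairing of intervals (T0 to T1 for training labels, T1 to T2 for test labels), the toy papers, and the function names are all assumptions.

```python
from itertools import combinations

T0 = (1989, 1995)
T1 = (1996, 2002)
T2 = (2003, 2009)

# Made-up publication records: (year, author list).
papers = [
    (1990, ["ann", "bob"]),
    (1998, ["ann", "bob"]),
    (1999, ["ann", "cat"]),
    (2005, ["bob", "cat"]),
    (2006, ["ann", "cat"]),
]


def coauthor_pairs(records, interval):
    """All unordered author pairs that co-wrote a paper within the interval."""
    start, end = interval
    pairs = set()
    for year, authors in records:
        if start <= year <= end:
            pairs.update(combinations(sorted(authors), 2))
    return pairs


def new_coauthor_pairs(records, past, future):
    """Pairs that collaborate in `future` but did not in `past`."""
    return coauthor_pairs(records, future) - coauthor_pairs(records, past)


train_positives = new_coauthor_pairs(papers, T0, T1)
test_positives = new_coauthor_pairs(papers, T1, T2)
```

A fuller setup would also exclude pairs that collaborated in any earlier interval, not only the immediately preceding one.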


iTopicModel: Model Set-Up & Objective Function
• Graphical model:
  Structural layer: follows the same topology as the document network
  $\theta_i = (\theta_{i1}, \theta_{i2}, \ldots, \theta_{iT})$: topic distribution for document $x_i$
  Text layer: follows PLSA, i.e., for each word, pick a topic $z \sim \mathrm{multi}(\theta_i)$, then pick a word $w \sim \mathrm{multi}(\beta_z)$
• Objective function: joint probability
  X: observed text information
  G: document network
  Parameters: $\theta$: topic distribution; $\beta$: word distribution
  $\theta$ is the most critical; it needs to be consistent with the text as well as the network structure
  The joint probability has a structure part and a text part. Can model them separately!
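The text layer described above (pick a topic z from the document's topic distribution, then pick a word from that topic's word distribution) can be sketched as a small sampler. This covers only the PLSA-style text part, not the structural layer or the joint objective; the vocabulary, distributions, and the name `sample_document` are illustrative assumptions.

```python
import random


def sample_document(theta_i, beta, vocab, n_words, rng):
    """Generate (topic, word) pairs: z ~ multi(theta_i), then w ~ multi(beta[z])."""
    topics = list(range(len(theta_i)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta_i, k=1)[0]
        w = rng.choices(vocab, weights=beta[z], k=1)[0]
        words.append((z, w))
    return words


# Hypothetical two-topic model over a four-word vocabulary.
vocab = ["query", "database", "learning", "logic"]
beta = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0: database-flavored words
    [0.0, 0.0, 0.5, 0.5],  # topic 1: AI-flavored words
]

mixed_doc = sample_document([0.7, 0.3], beta, vocab, 10, random.Random(0))
pure_doc = sample_document([1.0, 0.0], beta, vocab, 20, random.Random(1))
```

Inference runs this process in reverse: given the observed words X and the network G, it estimates the topic distributions and word distributions that make the data most probable.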

Case Study: Topic Hierarchy Building for DBLP

Probabilistic Topic Models with Network-Based Biased Propagation
• Text-rich heterogeneous information network
• Ubiquitous textual documents (news, papers)
• Connect with users and other objects: topic propagation
• Deng, Han et al., "Probabilistic Topic Models with Biased Propagation on Heterogeneous Information Networks", KDD'11
• How to discover latent topics and identify clusters of multi-typed objects simultaneously?
• How can text data and heterogeneous information network mutually enhance each other in topic modeling and other text mining tasks?

Biased Topic Propagation
Intuition:
• InfoNet provides valuable information
• Different objects have their own inherent information (e.g., D with rich text and U without explicit text)
• Treat documents with rich text and other objects without explicit text differently
Topic(D): inherent text + connected U
Topic(U): connected D
Basic Criterion (Biased Topic Propagation):
• The topic of an object without explicit text depends on the topics of the documents it connects to
• The topic of a document is correlated with its objects to some extent, and should be principally determined by the inherent content of its text
• A simple and unbiased topic propagation does not make much sense
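The criterion above can be sketched as a single propagation step. This is a minimal illustration, not the update rule from the KDD'11 paper: it assumes an object without text takes the average topic distribution of its documents, and a document mixes its own text-based topics with those of its connected objects using a hypothetical bias weight `xi` close to 1.

```python
def average(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def propagate_topics(doc_topics, user_docs, doc_users, xi):
    """One biased propagation step (assumed form, for illustration only)."""
    # Topic(U): determined entirely by the connected documents.
    user_topics = {u: average([doc_topics[d] for d in docs])
                   for u, docs in user_docs.items()}
    # Topic(D): mostly the document's own text, nudged by connected objects.
    new_doc_topics = {}
    for d, theta in doc_topics.items():
        users = doc_users.get(d, [])
        if not users:
            new_doc_topics[d] = list(theta)
            continue
        neighbor = average([user_topics[u] for u in users])
        new_doc_topics[d] = [xi * t + (1.0 - xi) * n
                             for t, n in zip(theta, neighbor)]
    return user_topics, new_doc_topics


# Toy network: one user connected to two documents with opposite topics.
doc_topics = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
user_docs = {"u1": ["d1", "d2"]}
doc_users = {"d1": ["u1"], "d2": ["u1"]}

user_topics, updated_docs = propagate_topics(doc_topics, user_docs, doc_users, xi=0.8)
```

Setting `xi` to 0.5 or lower would approximate the unbiased propagation that the slide argues against, since documents would then lose most of their own text signal.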

Incorporating Heterogeneous Info. Network
L(C): Topic model
R(G): Biased propagation

Experiments: DBLP & NSF Awards
• Data Collection
• DBLP
• NSF-Awards
• Metrics
• Accuracy (AC)
• Normalized mutual information (NMI)
• Results
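The two metrics listed above can be computed as in the following sketch. The slide does not say which NMI normalization or which label-matching scheme was used, so this version assumes the geometric-mean normalization, I(A;B) / sqrt(H(A) H(B)), and a brute-force search for the best mapping from predicted clusters to true classes (practical only for a handful of clusters).

```python
from collections import Counter
from itertools import permutations
from math import log, sqrt


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    n = len(a)
    ca, cb, cab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((nij / n) * log(nij * n / (ca[i] * cb[j]))
               for (i, j), nij in cab.items())


def nmi(a, b):
    """Normalized mutual information with geometric-mean normalization."""
    ha, hb = entropy(a), entropy(b)
    if ha == 0 or hb == 0:
        return 0.0  # simplification: a single-cluster labeling carries no information
    return mutual_information(a, b) / sqrt(ha * hb)


def clustering_accuracy(truth, pred):
    """Accuracy under the best one-to-one mapping of predicted to true labels.

    Assumes the number of predicted clusters does not exceed the number of classes.
    """
    true_labels = sorted(set(truth))
    pred_labels = sorted(set(pred))
    best = 0
    for perm in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(1 for t, p in zip(truth, pred) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)


truth = [0, 0, 1, 1]
relabeled = [1, 1, 0, 0]    # same partition, different cluster ids
unrelated = [0, 1, 0, 1]    # partition independent of the truth
```

Both metrics are invariant to how cluster ids are named, which is why the relabeled partition scores as a perfect match.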


Event Cube: An Overview
Event Cube: An Organized Approach for Mining and Understanding Anomalous Aviation Events. Funded by NASA ( )
[Figure: a multidimensional text database with dimensions Time, Location, and Topic (locations LAX, SJC, MIA, AUS roll up to CA, FL, TX; topics overshoot, undershoot, birds, turbulence roll up to Deviation and Encounter) feeds the Event Cube representation with drill-down and roll-up, which gives the analyst analysis support: multidimensional OLAP, ranking, cause analysis, topic summarization/comparison, …]

Text/Topic Cube: General Idea
Heterogeneous: categorical attributes + unstructured text. How to combine?
Our solution:
• Cube: categorical attributes of each event report (ACN, Time, Location, Place, Environment, …)
• Text/topic model: unstructured text data
• Measure: term/topic weights (T1: W1, T2: W2, T3: W3, …)
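A toy version of this idea is sketched below: reports are grouped into cells by a categorical dimension, and each cell's measure is a bag of term counts. Grouping by airport versus by state corresponds to drill-down versus roll-up along the location hierarchy. The report texts are invented, and raw term counts stand in for the term/topic weights, which the slide does not define further.

```python
from collections import Counter, defaultdict

# Invented event reports with a two-level location hierarchy.
reports = [
    {"airport": "LAX", "state": "CA", "text": "overshoot on landing"},
    {"airport": "SJC", "state": "CA", "text": "birds near runway"},
    {"airport": "MIA", "state": "FL", "text": "turbulence and birds"},
]


def build_cells(records, dimension):
    """Group reports by one dimension; the cell measure is a term-count bag."""
    cells = defaultdict(Counter)
    for record in records:
        cells[record[dimension]].update(record["text"].split())
    return cells


by_airport = build_cells(reports, "airport")  # finer level (drill-down)
by_state = build_cells(reports, "state")      # coarser level (roll-up)

print(by_state["CA"].most_common(3))
```

Because term counts are additive, a rolled-up cell can be computed from its child cells without re-reading the text, which is the property that makes text measures fit an OLAP cube.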

Effective Keyword Search
• TopCells (ICDE'10): Ranking aggregated cells (objects) in TextCube
[Figure: example query "Healthcare Reform" …]
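Ranking aggregated cells for a keyword query can be sketched as follows. The scoring here (average keyword-match count of the documents in a cell) is a stand-in chosen for illustration; the actual TopCells relevance function is not given on the slide. The documents and the name `rank_cells` are hypothetical.

```python
def relevance(text, query_terms):
    """Crude relevance: how many times the query terms occur in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in query_terms)


def rank_cells(docs, dimension, query, top_k):
    """Rank cells of one dimension by the average relevance of their documents."""
    terms = query.lower().split()
    groups = {}
    for doc in docs:
        groups.setdefault(doc[dimension], []).append(relevance(doc["text"], terms))
    scored = [(sum(scores) / len(scores), cell) for cell, scores in groups.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(cell, score) for score, cell in scored[:top_k]]


# Invented documents tagged with a categorical attribute.
docs = [
    {"state": "IL", "text": "healthcare reform debate"},
    {"state": "IL", "text": "healthcare cost"},
    {"state": "TX", "text": "football season"},
    {"state": "CA", "text": "reform bill"},
]

top_cells = rank_cells(docs, "state", "Healthcare Reform", top_k=2)
```

The result is a ranked list of cells rather than of individual documents, so the answer to a keyword query is an aggregate object such as a state or a time period.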

Effective OLAP Exploration
• TEXplorer (submitted): Integrating keyword-based ranking and OLAP exploration
[Figure: example query "Healthcare Reform"]

Effective Event Tracking
• PET (KDD'10): tracking popularity and textual representation of events in social communities (Twitter)
[Figure: event "Healthcare Reform" with changing term sets: "debate, cost, senate, …"; "pass, success, law, …"; "benefit, profit, effective, …"]


Growing Parallel Paths (WWW 2011)
Result:

Mapping Pages to Records (CIKM'10)
Database records can be found on link paths!

WinaCS: Web Information Network Analysis for Computer Science
Integration of Web structure mining and information network analysis
Tim Weninger, Marina Danilevsky, et al., "WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", ACM SIGMOD'11 (system demo), Athens, Greece, June 2011.


Discovery of Swarms and Periodic Patterns in Moving Object Data
• A system that mines moving object patterns: Z. Li, et al., "MoveMine: Mining Moving Object Databases", SIGMOD'10 (system demo)
• Z. Li, B. Ding, J. Han, and R. Kays, "Mining Hidden Periodic Behaviors for Moving Objects", KDD'10 (sub)
• Z. Li, B. Ding, J. Han, and R. Kays, "Swarm: Mining Relaxed Temporal Moving Object Clusters", VLDB'10 (sub)
[Figure: bird flying paths shown on Google Earth (left); periodic patterns mined by our new method (right)]
[Figure: Convoy discovers only restricted patterns (left); Swarm discovers more patterns (right)]
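The contrast between convoys and swarms in the figure can be illustrated with a membership check. It assumes the usual definitions: a swarm is a set of at least `min_o` objects that share a cluster in at least `min_t` timestamps, which need not be consecutive, while a convoy requires `k` consecutive timestamps. The per-timestamp clusters are toy data, and the sketch only tests a given object set; it does not implement the mining algorithm from the paper.

```python
def together_timestamps(clusters_by_time, objects):
    """Sorted timestamps at which all given objects fall in one cluster."""
    group = set(objects)
    return [t for t, clusters in sorted(clusters_by_time.items())
            if any(group <= cluster for cluster in clusters)]


def is_swarm(clusters_by_time, objects, min_o, min_t):
    """Relaxed temporal cluster: enough shared timestamps, gaps allowed."""
    shared = together_timestamps(clusters_by_time, objects)
    return len(set(objects)) >= min_o and len(shared) >= min_t


def is_convoy(clusters_by_time, objects, min_o, k):
    """Strict temporal cluster: needs k consecutive shared timestamps."""
    longest = run = 0
    prev = None
    for t in together_timestamps(clusters_by_time, objects):
        run = run + 1 if prev is not None and t == prev + 1 else 1
        longest = max(longest, run)
        prev = t
    return len(set(objects)) >= min_o and longest >= k


# Toy snapshot clusters for objects a, b, c at timestamps 1..5.
clusters_by_time = {
    1: [{"a", "b", "c"}],
    2: [{"a", "b"}, {"c"}],
    3: [{"a", "b", "c"}],
    4: [{"a"}, {"b", "c"}],
    5: [{"a", "b", "c"}],
}

abc_is_swarm = is_swarm(clusters_by_time, ["a", "b", "c"], min_o=2, min_t=3)
abc_is_convoy = is_convoy(clusters_by_time, ["a", "b", "c"], min_o=2, k=2)
```

Objects a, b, and c travel together at timestamps 1, 3, and 5, so they form a swarm but never a convoy of two consecutive timestamps, which mirrors the point that relaxing the consecutiveness requirement reveals more patterns.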

GeoTopic Discovery: Mining Spatial Text
[Figure: geo-tagged photos with landscape (coast vs. desert vs. mountain); panels: LDM, TDM, GeoFolk, LGTA]
Z. Yin, et al., GeoTopic Discovery and Comparison, WWW'11


Conclusions: Towards Mining Data Semantics in Integrated Heterog. Networks
• Most data objects are linked, forming heterogeneous information networks
• Most datasets can be "organized" or "transformed" into "structured" multi-typed heterogeneous info. networks
• Examples: DBLP, IMDB, Flickr, Google News, Wikipedia, …
• Structures can be progressively mined from less organized data sets by info. network analysis
• Surprisingly rich knowledge can be mined from such structured heterogeneous info. networks
• Clustering, ranking, classification, data cleaning, trust analysis, role discovery, similarity search, relationship prediction, …
• It is promising to mine data semantics from rich info. networks!

References for the Talk
• J. Han, Y. Sun, X. Yan, and P. S. Yu, "Mining Heterogeneous Information Networks" (tutorial), KDD'10.
• M. Ji, J. Han, and M. Danilevsky, "Ranking-Based Classification of Heterogeneous Information Networks", KDD'11.
• Y. Sun, J. Han, et al., "RankClus: Integrating Clustering with Ranking for Heterogeneous Information Network Analysis", EDBT'09.
• Y. Sun, Y. Yu, and J. Han, "Ranking-Based Clustering of Heterogeneous Information Networks with Star Network Schema", KDD'09.
• Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu, "PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks", VLDB'11.
• Y. Sun, R. Barber, M. Gupta, C. Aggarwal, and J. Han, "Co-Author Relationship Prediction in Heterogeneous Bibliographic Networks", ASONAM'11.
• C. Wang, J. Han, et al., "Mining Advisor-Advisee Relationships from Research Publication Networks", KDD'10.
• T. Weninger, M. Danilevsky, et al., "WinaCS: Construction and Analysis of Web-Based Computer Science Information Networks", ACM SIGMOD'11 (system demo).
• X. Yin, J. Han, and P. S. Yu, "Truth Discovery with Multiple Conflicting Information Providers on the Web", IEEE TKDE, 20(6),