Multi-Modal Bayesian Embeddings for Learning Social Knowledge Graphs
Zhilin Yang¹ ², Jie Tang¹, William W. Cohen²
¹ Tsinghua University  ² Carnegie Mellon University
AMiner: an academic social network
[Screenshot: a researcher profile highlighting research interests]
Text-Based Approach
- Infer research interests from the list of publications
Text-Based Approach
- Term frequency => "challenging problem"
- TF-IDF => "line drawing"
- Simple term statistics surface phrases that are not research interests (sketch below)
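To make the failure concrete, here is a minimal sketch (toy titles, plain Python, not the AMiner pipeline) of the text-based baseline: ranking bigrams from publication titles by term frequency. Generic phrases recur across titles and rise to the top; TF-IDF behaves similarly, promoting distinctive but non-interest phrases such as "line drawing".

```python
# Rank bigrams from a researcher's publication titles by raw frequency.
from collections import Counter

titles = [
    "a challenging problem in line drawing interpretation",
    "line drawing analysis as a challenging problem",
    "towards solving a challenging problem in vision",
]
STOP = {"a", "as", "in", "of", "the", "towards"}

tf = Counter()
for title in titles:
    words = [w for w in title.split() if w not in STOP]
    tf.update(" ".join(pair) for pair in zip(words, words[1:]))

print(tf.most_common(2))
# [('challenging problem', 3), ('line drawing', 2)]
```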
Knowledge-Driven Approach
- Infer research interests from the list of publications, guided by a knowledge base
- e.g., Artificial Intelligence, Data Mining, Machine Learning, Clustering, Association Rules
Problem: Learning Social Knowledge Graphs
[Figure: a social network (Mike, Jane, Kevin, Jing) with social text, linked to knowledge-base concepts]
- Inputs: social network structure, social text, knowledge base
- Output: infer a ranked list of concepts for each user (I/O sketched below)
  - Kevin: Deep Learning, Natural Language Processing
  - Jing: Recurrent Networks, Named Entity Recognition
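A minimal sketch (illustrative Python types; names such as SocialKGInstance and rank_concepts are ours, not the paper's) of the problem's inputs and output:

```python
# Hypothetical data layout for the problem: the social graph, one text
# document per user, and the knowledge base's concept vocabulary go in;
# a ranked concept list per user comes out.
from dataclasses import dataclass

@dataclass
class SocialKGInstance:
    edges: list[tuple[str, str]]     # social network structure
    documents: dict[str, str]        # social text, one document per user
    concepts: set[str]               # concepts from the knowledge base

def rank_concepts(instance: SocialKGInstance) -> dict[str, list[str]]:
    """Placeholder for the model: a ranked concept list per user."""
    raise NotImplementedError

instance = SocialKGInstance(
    edges=[("Kevin", "Jing"), ("Kevin", "Jane"), ("Jane", "Mike")],
    documents={"Kevin": "Deep Learning for NLP",
               "Jing": "Recurrent networks for NER"},
    concepts={"Deep Learning", "Natural Language Processing",
              "Recurrent Networks", "Named Entity Recognition"},
)
```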
Challenges
- Two modalities: users and concepts
- How to leverage information from both modalities?
- How to connect the two modalities?
Approach
- Learn user embeddings (from the social network and social text)
- Learn concept embeddings (from the knowledge base)
- Feed both into the Social KG Model
User and Concept Embeddings
- A Gaussian distribution for user embeddings
- A Gaussian distribution for concept embeddings
- Align users and concepts (see the sketch below)
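In math, the assumption can be sketched as follows (notation is ours, not verbatim from the slides): each topic owns one Gaussian per embedding space, so users and concepts that share a topic are aligned through those Gaussians.

```latex
% Sketch of the per-topic Gaussians (notation assumed): f_u and g_c are
% the user and concept embeddings, z their topic assignments.
\begin{align*}
  f_u \mid z_u = k &\sim \mathcal{N}\big(\mu^{(u)}_k,\ (\lambda^{(u)}_k)^{-1}\big)
      && \text{user embedding} \\
  g_c \mid z_c = k &\sim \mathcal{N}\big(\mu^{(c)}_k,\ (\lambda^{(c)}_k)^{-1}\big)
      && \text{concept embedding} \\
  (\mu_k, \lambda_k) &\sim \mathrm{NormalGamma}(\mu_0, \kappa_0, \alpha_0, \beta_0)
      && \text{per-topic prior}
\end{align*}
```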
Inference and Learning
- Collapsed Gibbs sampling
- Iterate between (skeleton below):
  1. Sample latent variables
  2. Update parameters
  3. Update embeddings
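A runnable toy skeleton of this loop (ours, not the paper's code; the Gaussian conditional below is a simplified stand-in for the collapsed posterior):

```python
# Toy version of the three-step loop: (1) resample topic assignments,
# (2) refit per-topic parameters, (3) update embeddings.
import numpy as np

rng = np.random.default_rng(0)
K, D, N, ITERS = 3, 8, 20, 10

emb = rng.normal(size=(N, D))     # embeddings (users or concepts)
z = rng.integers(K, size=N)       # latent topic assignments
mu = rng.normal(size=(K, D))      # per-topic Gaussian means

for _ in range(ITERS):
    # 1. Sample latent variables from a per-item conditional over topics.
    for i in range(N):
        logp = -0.5 * ((emb[i] - mu) ** 2).sum(axis=1)
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
    # 2. Update parameters: refit each topic's mean to its members.
    for k in range(K):
        if (z == k).any():
            mu[k] = emb[z == k].mean(axis=0)
    # 3. Update embeddings: nudge each one toward its topic's mean.
    emb += 0.1 * (mu[z] - emb)
```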
AMiner Research Interest Dataset
- 644,985 researchers
- Terms from these researchers' publications, filtered with Wikipedia

Evaluation (scoring sketched below)
- Homepage matching: 1,874 researchers, using homepages as ground truth
- LinkedIn matching: 113 researchers, using LinkedIn skills as ground truth
- Code and data available:
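A minimal sketch (data layout assumed, not the paper's exact protocol) of how such a matching evaluation can be scored: average precision@k of each method's top-k concepts against a researcher's ground-truth set.

```python
# Score predicted interests against ground truth (homepage terms or
# LinkedIn skills), averaged over researchers.
def precision_at_k(predicted, truth, k=5):
    return sum(c in truth for c in predicted[:k]) / k

def evaluate(predictions, ground_truth, k=5):
    scores = [precision_at_k(predictions[r], ground_truth[r], k)
              for r in ground_truth]
    return sum(scores) / len(scores)

predictions = {"r1": ["deep learning", "nlp", "line drawing"]}
ground_truth = {"r1": {"deep learning", "nlp"}}
print(evaluate(predictions, ground_truth, k=3))  # 0.666...
```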
Homepage Matching
Using homepages as ground truth.
[Bar chart: matching precision (%) of GenVector, GenVector-E, Sys-Base, Author-Topic, NTN, and CountKG]
- GenVector: our model
- GenVector-E: our model w/o embedding update
- Sys-Base: AMiner baseline, key term extraction
- CountKG: rank by frequency
- Author-Topic: classic topic models
- NTN: neural tensor networks
LinkedIn Matching
Using LinkedIn skills as ground truth.
[Bar chart: matching precision (%) of GenVector, GenVector-E, Author-Topic, NTN, and CountKG]
- GenVector: our model
- GenVector-E: our model w/o embedding update
- CountKG: rank by frequency
- Author-Topic: classic topic models
- NTN: neural tensor networks
Error Rate of Irrelevant Cases
Manually label terms that are clearly NOT research interests, e.g., "challenging problem".

Method        Error rate
GenVector     1.2%
Sys-Base      18.8%
Author-Topic  1.6%
NTN           7.2%

- GenVector: our model
- Sys-Base: AMiner baseline, key term extraction
- Author-Topic: classic topic models
- NTN: neural tensor networks
Qualitative Study: Top Concepts within Topics (* = irrelevant to the topic)

GenVector: query expansion, concept mining, language modeling, information extraction, knowledge extraction, entity linking, language models, named entity recognition, document clustering, latent semantic indexing

Author-Topic: speech recognition, natural language, *integrated circuits, document retrieval, language models, language model, *microphone array, computational linguistics, *semidefinite programming, active learning
Qualitative Study: Top Concepts within Topics (* = irrelevant to the topic)

GenVector: image processing, face recognition, feature extraction, computer vision, image segmentation, image analysis, feature detection, digital image processing, machine learning algorithms, machine vision

Author-Topic: face recognition, *food intake, face detection, image recognition, *atmospheric chemistry, feature extraction, statistical learning, discriminant analysis, object tracking, *human factors
Qualitative Study: Research Interests (* = not a research interest)

GenVector: feature extraction, image segmentation, image matching, image classification, face recognition

Sys-Base: face recognition, face image, *novel approach, *line drawing, discriminant analysis
Qualitative Study: Research Interests (* = not a research interest)

GenVector: unsupervised learning, feature learning, Bayesian networks, reinforcement learning, dimensionality reduction

Sys-Base: *challenging problem, reinforcement learning, *autonomous helicopter, *autonomous helicopter flight, near-optimal planning
Online Test
- A/B test with live users
- Results mixed with those of Sys-Base

Method     Error rate
GenVector  3.33%
Sys-Base   10.00%
Other Social Networks?
[Figure: the same setup (social network structure, social text, knowledge base) applied beyond AMiner]
Conclusion
- Study a novel problem: learning social knowledge graphs
- Propose a model: multi-modal Bayesian embedding
  - Integrates embeddings into graphical models
- AMiner research interest dataset
  - 644,985 researchers
  - Homepage and LinkedIn matching as ground truth
- Online deployment on AMiner
Thanks!
Code and data:
Social Networks
[Figure: a social graph of users (Mike, Jane, Kevin, Jing)]
- AMiner, Facebook, Twitter…
- Huge amounts of information
Knowledge Bases
[Figure: a concept taxonomy: Computer Science, Artificial Intelligence, System, Deep Learning, Natural Language Processing]
- Wikipedia, Freebase, Yago, NELL…
- Huge amounts of knowledge
Bridge the Gap
[Figure: social network users linked to knowledge-base concepts]
- Better user understanding
- e.g., mine research interests on AMiner
Approach
[Diagram: social network + social text → user embeddings; knowledge base → concept embeddings; both feed the Social KG Model]
Model
[Plate diagram: documents (one per user), concepts for each user, parameters for topics]
Generative process (sampled in the sketch below):
1. Generate a topic distribution for each document (from a Dirichlet)
2. Generate a Gaussian distribution for each embedding space (from a Normal-Gamma)
3. Generate the topic for each concept (from a multinomial)
4. Generate the topic for each user (from a uniform)
5. Generate embeddings for users and concepts (from the topic's Gaussian)
The model is a general multi-modal Bayesian embedding framework.
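A minimal sketch (shapes and hyperparameters assumed, not the paper's code) that draws one sample per step of the generative process above:

```python
# One pass through the five generative steps, for a single document.
import numpy as np

rng = np.random.default_rng(0)
K, D = 5, 50                       # topics, embedding dimension

# 1. Topic distribution for the document (one document per user).
theta = rng.dirichlet(np.ones(K))

# 2. Per-topic Gaussian for an embedding space via a Normal-Gamma prior:
#    draw a precision from a Gamma, then a mean from a Normal.
lam = rng.gamma(shape=1.0, scale=1.0, size=K)              # precisions
mu = rng.normal(0.0, 1.0 / np.sqrt(lam)[:, None], size=(K, D))

# 3. Topic for one concept in the document (multinomial over theta).
z_concept = rng.choice(K, p=theta)

# 4. Topic for the user (uniform over topics, per the slide).
z_user = rng.integers(K)

# 5. Embeddings for the concept and the user from their topics' Gaussians.
concept_emb = rng.normal(mu[z_concept], 1.0 / np.sqrt(lam[z_concept]), size=D)
user_emb = rng.normal(mu[z_user], 1.0 / np.sqrt(lam[z_user]), size=D)
```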
Inference and Learning
- Collapsed Gibbs sampling for inference
- Update the embeddings during learning, unlike LDA-style models whose observed variables are discrete
- Loop: sample latent variables, update parameters, update embeddings
Methods for Comparison

Method        Description
GenVector     Our model
GenVector-E   Our model w/o embedding update
Sys-Base      AMiner baseline: key term extraction
CountKG       Rank by frequency
Author-Topic  Classic topic models
NTN           Neural tensor networks