LINE: Large-scale Information Network Embedding
Jian Tang, Microsoft Research Asia
Acknowledgements: Meng Qu, Mingzhe Wang, Qiaozhu Mei, Ming Zhang
Ubiquitous Large-scale Information Networks
- Social networks, the World Wide Web, the Internet of Things (IoT), citation networks, ...
- Real-world networks are very large, e.g.:
  - Facebook social network: ~1 billion users
  - WWW: ~50 billion webpages
  - Internet of Things
- Analyzing large-scale networks is very challenging: the data are sparse and high-dimensional
Deep Learning Is Very Successful in Many Domains
- Images, speech, natural language ... and networks?
Deep Learning for Network Embedding
- Maps a sparse, high-dimensional network into dense, low-dimensional vectors
- Potentially useful in many domains and applications:
  - Text embedding
  - Link prediction
  - Ranking and recommendation
  - Node classification
  - Network visualization
Natural Language/Text
- Unsupervised embedding of text (e.g., words and documents) for text representation
- Free text is first converted into text networks:
  - Word co-occurrence network: words linked by co-occurrence within a context window
  - Word-document network: bipartite graph linking words to the documents containing them
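To make the construction concrete, here is a minimal sketch (illustrative, not the paper's pipeline) of building both networks from free text; the window size of 5 and the whitespace tokenizer are assumptions:

```python
from collections import Counter

docs = [
    "deep learning has been attracting increasing attention",
    "the skip-gram model is quite effective and efficient",
]

window = 5            # assumed co-occurrence window size
ww_edges = Counter()  # (word, word) -> co-occurrence weight
wd_edges = Counter()  # (word, doc_id) -> term frequency

for doc_id, text in enumerate(docs):
    tokens = text.split()  # naive whitespace tokenizer
    for i, w in enumerate(tokens):
        wd_edges[(w, doc_id)] += 1
        # link w to every word within the next (window - 1) positions
        for c in tokens[i + 1 : i + window]:
            ww_edges[tuple(sorted((w, c)))] += 1

print(ww_edges.most_common(3))
print(wd_edges.most_common(3))
```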
Natural Language/Text
- Predictive embedding of text (e.g., words and documents), exploiting labeled documents
- In addition to the word co-occurrence and word-document networks, document labels yield a word-label network: a bipartite graph linking words to class labels
Social Network
- User embedding
- Friend recommendation
- User classification
Academic Network
- Author, paper, and venue embedding
- Recommend related authors, papers, and venues
- Author, paper, and venue classification
Enterprise Network
- People, document, and project embedding
- Recommend related people, documents, and projects
- People, document, and project classification
Related Work
- Classical graph embedding algorithms: MDS, IsoMap, LLE, Laplacian Eigenmaps, etc.
  - Hard to scale up
- Graph factorization (Ahmed et al., 2013)
  - Not specifically designed for network embedding
  - Usually limited to undirected graphs
- DeepWalk (Perozzi et al., 2014)
  - Lacks a clear objective function
  - Only designed for networks with binary edges
Our Approach: LINE
- Applicable to various types of networks: directed, undirected, and/or weighted
- Has a clear objective function: preserves both the first-order and second-order proximity between vertices
- Very scalable: effective and efficient optimization through asynchronous stochastic gradient descent
  - Takes only a couple of hours on a single machine to embed a network with millions of nodes and billions of edges
What LINE Has Done so Far
- Unsupervised text embedding (Tang et al., WWW'15)
  - Outperforms SkipGram by embedding the word co-occurrence network
  - Outperforms ParagraphVec by embedding the word-document network
- Predictive text embedding (Tang et al., KDD'15)
  - Outperforms CNNs on long documents; comparable on short documents
  - More scalable than CNNs, with few parameters to tune
- Social and citation network embedding (Tang et al., WWW'15)
  - Outperforms DeepWalk and graph factorization

Tang et al. LINE: Large-scale Information Network Embedding. WWW'15
Tang et al. PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks. KDD'15
First-order Proximity
- The local pairwise proximity between vertices, determined by the observed links
- In the example network (vertices 1-10), vertices 6 and 7 have a large first-order proximity because they are directly connected
- However, many links between vertices are missing, so first-order proximity alone is not sufficient to preserve the entire network structure
Second-order Proximity
- The proximity between the neighborhood structures of the vertices
- Mathematically, the second-order proximity between a pair of vertices $(u, v)$ is determined by their neighbor-weight vectors
  $p_u = (w_{u,1}, w_{u,2}, \dots, w_{u,|V|})$ and $p_v = (w_{v,1}, w_{v,2}, \dots, w_{v,|V|})$
- In the example network, vertices 5 and 6 have a large second-order proximity:
  $p_5 = (1, 1, 1, 1, 0, 0, 0, 0, 0, 0)$, $p_6 = (1, 1, 1, 1, 0, 0, 5, 0, 0, 0)$
- "The degree of overlap of two people's friendship networks correlates with the strength of ties between them" (Mark Granovetter)
- "You shall know a word by the company it keeps" (John Rupert Firth)
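As a hedged illustration, one natural way to quantify this definition (not spelled out on the slide) is the cosine similarity between the neighbor-weight vectors; the vectors below reproduce the slide's example for vertices 5 and 6:

```python
import numpy as np

# Neighbor-weight vectors for vertices 5 and 6 from the example network
p5 = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], dtype=float)
p6 = np.array([1, 1, 1, 1, 0, 0, 5, 0, 0, 0], dtype=float)

# Cosine similarity of the neighborhoods: positive because 5 and 6 share
# neighbors 1-4, even though there is no direct 5-6 edge
cosine = p5 @ p6 / (np.linalg.norm(p5) * np.linalg.norm(p6))
print(f"second-order proximity (cosine): {cosine:.3f}")
```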
Preserving the First-order Proximity
- Given an undirected edge $(v_i, v_j)$, the model's joint probability of $v_i, v_j$:
  $p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^T \cdot \vec{u}_j)}$
  where $\vec{u}_i$ is the embedding of vertex $v_i$
- The empirical distribution defined by the weighted edges:
  $\hat{p}_1(v_i, v_j) = \frac{w_{ij}}{\sum_{(i',j')} w_{i'j'}}$
- Objective: KL-divergence between the two distributions
  $O_1 = d(\hat{p}_1(\cdot,\cdot),\, p_1(\cdot,\cdot)) \propto -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)$
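A minimal sketch of these two quantities, assuming a toy weighted edge list and randomly initialized embeddings (all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 1.0)]  # (i, j, w_ij)
u = rng.normal(scale=0.1, size=(3, 4))           # |V| = 3 vertices, d = 4

def p1(i, j):
    # model joint probability of an undirected edge (v_i, v_j)
    return 1.0 / (1.0 + np.exp(-u[i] @ u[j]))

# O1, up to the constant normalizer of the empirical distribution
O1 = -sum(w * np.log(p1(i, j)) for i, j, w in edges)
print(O1)
```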
Preserving the Second-order Proximity
- Given a directed edge $(v_i, v_j)$, the conditional probability of $v_j$ given $v_i$:
  $p_2(v_j \mid v_i) = \frac{\exp(\vec{u}_j'^T \cdot \vec{u}_i)}{\sum_{k=1}^{|V|} \exp(\vec{u}_k'^T \cdot \vec{u}_i)}$
  where $\vec{u}_i$ is the embedding of vertex $i$ as a source node and $\vec{u}_i'$ is its embedding as a target node
- The empirical distribution:
  $\hat{p}_2(v_j \mid v_i) = \frac{w_{ij}}{\sum_{k \in V} w_{ik}}$
- Objective:
  $O_2 = \sum_{i \in V} \lambda_i \, d(\hat{p}_2(\cdot \mid v_i),\, p_2(\cdot \mid v_i)) \propto -\sum_{(i,j) \in E} w_{ij} \log p_2(v_j \mid v_i)$
  where $\lambda_i = \sum_j w_{ij}$ is the prestige of vertex $i$ in the network
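The matching sketch for the second-order model, with separate source and target ("context") embedding tables; again a toy illustration rather than the released implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
edges = [(0, 1, 1.0), (1, 2, 2.0), (0, 2, 1.0)]  # directed (i, j, w_ij)
u = rng.normal(scale=0.1, size=(3, 4))      # source embeddings u_i
u_ctx = rng.normal(scale=0.1, size=(3, 4))  # target embeddings u'_i

def p2(j, i):
    # softmax over all target embeddings, conditioned on source v_i
    scores = u_ctx @ u[i]
    scores -= scores.max()  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[j]

O2 = -sum(w * np.log(p2(j, i)) for i, j, w in edges)
print(O2)
```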
Preserving Both Proximities
- Concatenate the embeddings individually learned for the two proximities (first-order and second-order)
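The combination step itself is a per-vertex concatenation; a short sketch with placeholder embeddings:

```python
import numpy as np

emb_first = np.random.randn(1000, 64)   # placeholder first-order embeddings
emb_second = np.random.randn(1000, 64)  # placeholder second-order embeddings
emb = np.concatenate([emb_first, emb_second], axis=1)  # final shape: (1000, 128)
```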
Optimization
- Stochastic gradient descent + negative sampling: randomly sample an edge and multiple negative edges
- The gradient w.r.t. the embedding for edge $(i, j)$ is multiplied by the edge weight $w_{ij}$:
  $\frac{\partial O_2}{\partial \vec{u}_i} = w_{ij} \cdot \frac{\partial \log p_2(v_j \mid v_i)}{\partial \vec{u}_i}$
- Problematic when the edge weights diverge: the scale of the gradients across different edges diverges
- Solution: edge sampling — sample edges with probability proportional to their weights, then treat the sampled edges as binary
- Complexity: $O(dK|E|)$ — linear in the dimension $d$, the number of negative samples $K$, and the number of edges $|E|$
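A minimal sketch of one training step under edge sampling with negative sampling, for the second-order objective. This is not the released implementation: the alias-table sampler from the paper is replaced by numpy's weighted choice for brevity, and the uniform negative distribution is a simplifying assumption (the paper draws negatives from a noise distribution proportional to degree^0.75):

```python
import numpy as np

rng = np.random.default_rng(0)
edges = np.array([[0, 1], [1, 2], [0, 2]])
weights = np.array([1.0, 5.0, 1.0])
V, d, K, lr = 3, 4, 2, 0.025

u = rng.normal(scale=0.1, size=(V, d))  # source embeddings
u_ctx = np.zeros((V, d))                # target embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(1000):
    # edge sampling: draw an edge with probability proportional to its
    # weight, then treat it as binary (gradient no longer scaled by w_ij)
    i, j = edges[rng.choice(len(edges), p=weights / weights.sum())]
    grad_i = np.zeros(d)
    # one positive target (label 1) plus K negative targets (label 0)
    targets = [(j, 1.0)] + [(rng.integers(V), 0.0) for _ in range(K)]
    for t, label in targets:
        g = label - sigmoid(u[i] @ u_ctx[t])  # gradient of the log-sigmoid loss
        grad_i += g * u_ctx[t]
        u_ctx[t] += lr * g * u[i]
    u[i] += lr * grad_i
```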
Discussion
- Embedding vertices of small degrees
  - Sparse information in the neighborhood
  - Solution: expand the neighborhood by adding higher-order neighbors, e.g., neighbors of neighbors (breadth-first search); in practice only second-order neighbors are considered
- Embedding new vertices
  - Fix the existing embeddings and optimize w.r.t. the new ones (see the sketch below)
  - Objective: $-\sum_{j \in N(i)} w_{ji} \log p_1(v_j, v_i)$ or $-\sum_{j \in N(i)} w_{ji} \log p_2(v_j \mid v_i)$
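A hedged sketch of the new-vertex case: existing embeddings stay fixed and only the new vertex's embedding is optimized, here by plain gradient ascent on the first-order log-likelihood (names, sizes, and step count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
existing = rng.normal(scale=0.1, size=(5, 4))  # fixed embeddings of old vertices
neighbors = [(0, 1.0), (2, 3.0)]               # (j, w_ji) edges to the new vertex
u_new = rng.normal(scale=0.1, size=4)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):
    grad = np.zeros(4)
    for j, w in neighbors:
        # d/du_new of w_ji * log p1(v_j, v_new); existing[j] is held fixed
        grad += w * (1.0 - sigmoid(existing[j] @ u_new)) * existing[j]
    u_new += 0.05 * grad
```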
Unsupervised Text Embedding: Word Analogy
- Entire Wikipedia corpus => word co-occurrence network (~2M words, ~1B edges)

| Algorithm | Semantic (%) | Syntactic (%) | Overall (%) | Running time |
|-----------|--------------|---------------|-------------|--------------|
| GF        | 61.38 | 44.08 | 51.93 | 2.96h |
| DeepWalk  | 50.79 | 37.70 | 43.65 | 16.64h |
| SkipGram  | 69.14 | 57.94 | 63.02 | 2.82h |
| LINE(1st) | 58.08 | 49.42 | 53.35 | 2.44h |
| LINE(2nd) | 73.79 | 59.72 | 66.10 | 2.55h |

- Effectiveness: LINE(2nd) > LINE(1st) > GF > DeepWalk; LINE(2nd) > SkipGram!
- Efficiency: LINE(1st) > LINE(2nd) > SkipGram > GF > DeepWalk
Unsupervised Text Embedding: Nearest Words

| Word | Proximity Type | Top Similar Words |
|------|----------------|-------------------|
| good | 1st | luck, bad, faith, assume, nice |
| good | 2nd | decent, bad, excellent, lousy, reasonable |
| information | 1st | provide, provides, detailed, facts, verifiable |
| information | 2nd | information, ifnormaiton, informations, nonspammy, animecons |
| graph | 1st | graphs, algebraic, finite, symmetric, topology |
| graph | 2nd | graphs, subgraph, matroid, hypergraph, undirected |
| learn | 1st | teach, learned, inform, educate, how |
| learn | 2nd | learned, teach, relearn, learnt, understand |

(Misspellings such as "ifnormaiton" are retrieved neighbors, shown verbatim.)
Unsupervised Text Embedding: Text Classification (Long Documents)
- Word co-occurrence network (w-w) and word-document network (w-d) are used to learn word embeddings
- A document embedding is the average of the word embeddings in the document
- Results on long documents: 20 Newsgroups (20NG), Wikipedia articles, IMDB; all methods are unsupervised embeddings

| Algorithm | 20NG Micro-F1 | 20NG Macro-F1 | Wikipedia Micro-F1 | Wikipedia Macro-F1 | IMDB |
|-----------|---------------|---------------|--------------------|--------------------|------|
| SkipGram | 70.62 | 68.99 | 75.80 | 75.77 | 85.34 |
| PV | 75.13 | 73.48 | 76.68 | 76.75 | 86.76 |
| LINE(w-w) | 72.78 | 70.95 | 77.72 |  | 86.16 |
| LINE(w-d) | 79.73 | 78.40 | 80.14 | 80.13 | 89.14 |
| LINE(w-w + w-d) | 78.74 | 77.39 | 79.91 | 79.94 | 89.07 |

- LINE(w-w) > SkipGram (Google); LINE(w-d) > PV (Google); LINE(w-d) > LINE(w-w)
Unsupervised Text Embedding: Text Classification (Short Documents)
- Same setup: word embeddings learned from the w-w and w-d networks; document embedding as the average word embedding (see the sketch below)
- Results on short documents: DBLP paper titles (DBLP), movie reviews (MR), tweets (Twitter)

| Algorithm | DBLP Micro-F1 | DBLP Macro-F1 | MR Micro-F1 | MR Macro-F1 | Twitter Micro-F1 | Twitter Macro-F1 |
|-----------|---------------|---------------|-------------|-------------|------------------|------------------|
| SkipGram | 73.08 | 68.92 | 67.05 |  | 73.02 | 73.00 |
| PV | 67.19 | 62.46 | 67.78 |  | 71.29 | 71.18 |
| LINE(w-w) | 73.98 | 69.92 | 71.07 | 71.06 | 73.19 | 73.18 |
| LINE(w-d) | 71.50 | 67.23 | 69.25 | 69.24 |  |  |
| LINE(w-w + w-d) | 74.22 | 70.12 | 71.13 | 71.12 | 73.84 |  |

- LINE(w-w) > SkipGram; LINE(w-d) > PV; LINE(w-w) > LINE(w-d)
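The document representation used on the last two slides is just the average of the learned word embeddings; a minimal sketch with a hypothetical embedding lookup:

```python
import numpy as np

# hypothetical lookup from words to learned LINE embeddings
word_emb = {"deep": np.array([0.1, 0.2]), "learning": np.array([0.3, 0.1])}

def doc_embedding(tokens, emb, dim=2):
    vecs = [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(doc_embedding("deep learning".split(), word_emb))
```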
Predictive Text Embedding: Long Documents
- Predictive text embedding through embedding a heterogeneous text network: word co-occurrence (w-w), word-document (w-d), and word-label (w-l) networks

| Type | Algorithm | 20NG Micro-F1 | 20NG Macro-F1 | Wikipedia Micro-F1 | Wikipedia Macro-F1 | IMDB |
|------|-----------|---------------|---------------|--------------------|--------------------|------|
| Unsupervised | LINE(w-d) | 79.73 | 78.40 | 80.14 | 80.13 | 89.14 |
| Predictive | CNN | 80.15 | 79.43 | 79.25 | 79.32 | 89.00 |
| Predictive | LINE(w-l) | 82.70 | 81.97 | 79.00 | 79.02 | 85.98 |
| Predictive | LINE(ALL) | 84.20 | 83.39 | 82.51 | 82.49 | 89.80 |

- LINE(ALL) > CNN
Predictive Text Embedding: Short Documents
- Same heterogeneous text network setup (w-w, w-d, w-l); results on short documents

| Type | Algorithm | DBLP Micro-F1 | DBLP Macro-F1 | MR Micro-F1 | MR Macro-F1 | Twitter Micro-F1 | Twitter Macro-F1 |
|------|-----------|---------------|---------------|-------------|-------------|------------------|------------------|
| Unsupervised | LINE(w-w + w-d) | 74.22 | 70.12 | 71.13 | 71.12 | 73.84 |  |
| Predictive | CNN | 76.16 | 73.08 | 72.71 | 72.69 | 75.97 | 75.96 |
| Predictive | LINE(w-l) | 76.45 | 72.74 | 73.44 | 73.42 | 73.92 | 73.91 |
| Predictive | LINE(ALL) | 77.15 | 73.61 | 73.58 | 73.57 | 75.21 |  |

- LINE(ALL) ≈ CNN
Document Visualization
- [Figure: 2D visualizations of document embeddings; panels: Train(LINE(l-w)), Train(LINE(d-w)), Test(LINE(l-w)), Test(LINE(d-w))]
Social Network Embedding: Node Classification
- Communities as the ground truth; columns = percentage of labeled training data; ** marks significant improvements

| Algorithm | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
|-----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| GF | 53.23 | 53.68 | 53.98 | 54.14 | 54.32 | 54.38 | 54.43 | 54.50 | 54.48 |
| DeepWalk | 60.38 | 60.77 | 60.90 | 61.05 | 61.13 | 61.18 | 61.19 | 61.29 | 61.22 |
| DeepWalk(256 dim) | 60.41 | 61.09 | 61.35 | 61.52 | 61.69 | 61.76 | 61.80 | 61.91 | 61.83 |
| LINE(1st) | 63.27 | 63.69 | 63.82 | 63.92 | 63.96 | 64.03 | 64.06 | 64.17 | 64.10 |
| LINE(2nd) | 62.83 | 63.24 | 63.34 | 63.44 | 63.55 | 63.59 | 63.66 |  |  |
| LINE(1st+2nd) | 63.20** | 63.97** | 64.25** | 64.39** | 64.53** | 64.55** | 64.61** | 64.75** | 64.74** |

- LINE(1st+2nd) > LINE(1st) > LINE(2nd) > DeepWalk > GF
Author Citation Network: Author Classification
- Columns = percentage of labeled training data; values in parentheses were obtained on the network reconstructed with second-order neighbors (cf. the Discussion slide)

| Algorithm | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
|-----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| DeepWalk | 63.98 | 64.51 | 64.75 | 64.81 | 64.92 | 64.99 | 65.00 | 64.90 |  |
| LINE-SGD(2nd) | 56.64 | 58.95 | 59.89 | 60.20 | 60.44 | 60.61 | 60.58 | 60.73 | 60.59 |
| LINE(2nd) | 62.49 (64.69**) | 63.30 (65.47**) | 63.63 (65.85**) | 63.77 (66.04**) | 63.84 (66.19**) | 63.94 (66.25**) | 63.96 (66.30**) | 64.00 (66.12**) | (66.06**) |

- LINE(2nd) > DeepWalk > LINE-SGD(2nd)
Paper Citation Network: Paper Classification
- Parenthesized values as on the previous slide

| Algorithm | 10% | 20% | 30% | 40% | 50% | 60% | 70% | 80% | 90% |
|-----------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| DeepWalk | 52.83 | 53.80 | 54.34 | 54.75 | 55.07 | 55.13 | 55.48 | 55.42 | 55.90 |
| LINE(2nd) | 58.42 (60.10**) | 59.58 (61.06**) | 60.29 (61.46**) | 60.78 (61.73**) | 60.94 (61.85**) | 61.20 (62.10**) | 61.39 (62.21**) | (62.25**) | 61.79 (62.80**) |

- LINE(2nd) > DeepWalk
Network Layouts
- Coauthor network: 18,561 authors and 207,074 edges; communities: "data mining", "machine learning", "computer vision"
- [Figure: layouts produced by (a) graph factorization, (b) DeepWalk, (c) LINE(2nd)]
Scalability
- [Figure: (a) speed-up vs. #threads; (b) Micro-F1 vs. #threads]
Take-aways
- Deep learning for networks!
- LINE: a large-scale network embedding model
  - Preserves both the first-order and second-order proximity
  - General and scalable
- Useful in many applications:
  - Outperforms the unsupervised word embedding algorithm SkipGram
  - Outperforms the unsupervised document embedding algorithm ParagraphVec
  - Outperforms the supervised document embedding approach (CNN) on long documents
  - State-of-the-art performance in social and citation network embedding
Thanks!
- Open source: https://github.com/tangjianpku/LINE
- Jian Tang, jiatang@microsoft.com
- Thanks for your attention!