1
Apache Spark & Complex Network
Rui Wu
2
Big Data is not far from us
Computers, phones, cars (snapshot)
90 percent of all the data we had in 2013 was generated between 2011 and 2013 (IBM 2013).
Big Data brings new challenges.
3
5 Vs of Big Data
Volume: The data size is huge. Example: Facebook stores 300 PB. Network question: who are the most important people, based on betweenness centrality?
Velocity: Data input and output speeds are very fast. Example: 300 hours of video per minute are uploaded to YouTube (YouTube 2014). Network question: in the subscriber network, who is the most influential, based on eigenvector centrality?
Variety: The data can be of various types. Example: Twitter holds videos, plain text, and images. It is hard to predict who a user will follow next based on node (user) similarity.
4
5 Vs of Big Data
Veracity: The data can be messy and mixed with noise. Example: temperature relations between different locations, where a sensor may record wrong data, such as -999 °C.
Value: If the data cannot be turned into value, it is useless. Example: finding the most popular star on Twitter with a network model based on the number of followers (in-degree). It is hard to answer the question from the raw data.
5
Why Apache Spark, not Apache Hadoop
In-memory computation, so it runs faster.
Uses the Resilient Distributed Dataset (RDD): a read-only dataset stored across the machines of an Apache Spark cluster.
Overcomes the linear dataflow structure that Hadoop MapReduce imposes on distributed programs.
6
One interesting thing about Spark
You only have three choices (officially): Java, Scala, and Python.
A Spark process is a JVM process.
Guess: why the Java Virtual Machine?
7
GraphX
There are many tools and libraries, but most are no longer maintained or only work with old versions of Hadoop and Spark.
GraphX: the Apache Spark component for graph computations.
8
Let’s talk more based on parameters
Degree distribution
Clustering coefficient
Shortest path
Random walk
Community detection
10
Degree distribution
Based on my survey, nobody has done it with Apache Spark. Too easy? GraphX can return every node's in-degree and out-degree, plus the number of nodes. Bingo!
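To make the computation concrete, here is a minimal sketch in plain Python (standard library only, not GraphX). It assumes the graph is given as a list of directed (source, target) pairs; the function name `degree_distribution` and the sample edges are hypothetical. In GraphX the same inputs would come from `inDegrees`, `outDegrees`, and `numVertices`.

```python
from collections import Counter

def degree_distribution(edges):
    """Return (in_dist, out_dist): the fraction of nodes having each degree."""
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    n = len(nodes)
    # Counter returns 0 for nodes that never appear as a source or target.
    in_hist = Counter(in_deg[v] for v in nodes)
    out_hist = Counter(out_deg[v] for v in nodes)
    in_dist = {k: c / n for k, c in sorted(in_hist.items())}
    out_dist = {k: c / n for k, c in sorted(out_hist.items())}
    return in_dist, out_dist

# Hypothetical toy graph: a -> b, a -> c, b -> c, d -> c
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_dist, out_dist = degree_distribution(edges)
print(in_dist)   # {0: 0.5, 1: 0.25, 3: 0.25}
print(out_dist)  # {0: 0.25, 1: 0.5, 2: 0.25}
```

Dividing each histogram count by the node number turns the raw degree counts into a distribution.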
12
Modularity (enumerative)
13
Assortative coefficient (enumerative)
14
Assortative Coefficient
Based on my survey, no Apache Spark tool (including GraphX) calculates the assortative coefficient directly.
It is easy to calculate this parameter using Apache Spark matrices and the formula introduced in the complex network lecture.
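The lecture formula is not reproduced on this slide, so the sketch below assumes it is the usual degree assortativity: the Pearson correlation between the degrees at the two ends of every edge, with each undirected edge counted in both directions. It is plain Python rather than Spark, and the function name is hypothetical.

```python
from collections import Counter
from math import sqrt

def degree_assortativity(edges):
    """Pearson correlation of the degrees at both ends of each undirected edge."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Count every edge in both directions so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")  # undefined when every node has the same degree
    return cov / sqrt(vx * vy)

# Hypothetical star graph: the hub only touches leaves, so it is disassortative.
star = [("hub", 1), ("hub", 2), ("hub", 3)]
print(degree_assortativity(star))  # -1.0
```

A value near +1 means high-degree nodes tend to link to other high-degree nodes; a value near -1 means hubs mostly link to low-degree nodes.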
16
Clustering coefficient
GraphX has officially declined to solve this problem. Someone tried to contribute a solution on GitHub, and it was still rejected.
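For reference, the parameter itself is simple to state. The sketch below is plain Python, not the rejected GitHub contribution, and assumes an undirected graph without self-loops. For each node it computes the fraction of neighbor pairs that are themselves connected; the function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def local_clustering(edges):
    """Return {node: local clustering coefficient} for an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    coeff = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeff[node] = 0.0  # fewer than two neighbors: no pairs to close
            continue
        # Count neighbor pairs that are directly linked (triangles through node).
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeff[node] = 2.0 * links / (k * (k - 1))
    return coeff

# Hypothetical graph: a triangle a-b-c with a pendant node d attached to c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
coeff = local_clustering(edges)
print(coeff)
print(sum(coeff.values()) / len(coeff))  # average clustering coefficient
```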
17
Clustering coefficient
What you know and what you can do are two different things. Even if your program works, you still need to modify it or persuade others to use it.
18
Shortest path
Common question: what is the shortest path between two nodes?
GraphX has the function: programming-guide.html
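For an unweighted graph, the idea can be sketched with a breadth-first search. This is plain Python on a single machine, not the distributed GraphX routine, and it assumes an undirected, unweighted edge list; the function name and sample graph are hypothetical.

```python
from collections import defaultdict, deque

def shortest_path(edges, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None

edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("E", "C")]
print(shortest_path(edges, "A", "C"))  # ['A', 'B', 'C']
```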
19
Hop Distribution
Hop distribution of all the paths between two nodes.
A is the source: 1 hop: 2, 2 hops: 3...
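The slide's figure is not available here, so the sketch below takes one possible reading: for a source node, count how many nodes are reached at each shortest-path hop count. That interpretation, the function name, and the sample graph are all assumptions; the graph was chosen only so that the output echoes the "1 hop: 2, 2 hops: 3" pattern above.

```python
from collections import Counter, defaultdict, deque

def hop_distribution(edges, source):
    """Return {hops: number of nodes at that shortest-path distance from source}."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    # Leave out the source itself (0 hops).
    counts = Counter(d for d in dist.values() if d > 0)
    return dict(sorted(counts.items()))

# Hypothetical graph: A has 2 neighbors, which lead to 3 further nodes.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("C", "F")]
print(hop_distribution(edges, "A"))  # {1: 2, 2: 3}
```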
20
Hop Distribution
There are few papers about this. Here is one [1].
Their approach is effective but complex. However, their work only focuses on wireless networks.
[1] Kuo, J.C. and Liao, W., Hop count distribution of multihop paths in wireless networks with arbitrary node density: Modeling and its applications. IEEE Transactions on Vehicular Technology, 56(4), pp
21
Node Centrality
GraphX officially provides no centrality functions.
Third-party libraries: packages.org/?q=tags%3A%22Graph%22
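As an illustration of what such a function has to compute, here is a minimal single-machine sketch of betweenness centrality using Brandes' algorithm. It is plain Python and not taken from any of the third-party packages; it assumes an undirected, unweighted graph, and the function name is hypothetical.

```python
from collections import defaultdict, deque

def betweenness_centrality(edges):
    """Unnormalized betweenness for an undirected, unweighted graph (Brandes)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    nodes = list(adj)
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        sigma = dict.fromkeys(nodes, 0)   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in nodes}    # predecessors on shortest paths
        order = []                        # nodes in order of discovery
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical star graph: every leaf-to-leaf path passes through the hub.
print(betweenness_centrality([("hub", "x"), ("hub", "y"), ("hub", "z")]))
```

On a large graph this costs one BFS per node, which is one reason a distributed version is less trivial than the degree-based parameters.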
22
Random Walk
PageRank and path length are already implemented in GraphX.
I think we still need to take a look at PageRank, because it is our next assignment.
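As a starting point for that look, here is a minimal power-iteration sketch in plain Python. It is not GraphX's implementation and may normalize ranks differently; the damping factor 0.85, the fixed iteration count, and the way dangling nodes spread their rank evenly are assumptions of this sketch.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Return {node: rank} for a directed edge list; ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iterations):
        # Rank held by nodes without out-links is shared by everyone.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict.fromkeys(nodes, base)
        for v in nodes:
            for w in out_links[v]:
                new_rank[w] += damping * rank[v] / len(out_links[v])
        rank = new_rank
    return rank

# Hypothetical graph: a cycle a -> b -> c -> a, plus d -> a.
ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")])
print(max(ranks, key=ranks.get))  # a
```

The random-walk reading: with probability `damping` the walker follows an out-link, otherwise it jumps to a random node, and the rank is the long-run share of visits.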
24
K-core
k-core: a substructure in which each node connects to at least k other members of the group.
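One common way to find a k-core is to peel the graph: repeatedly remove nodes with fewer than k remaining neighbors until none are left. The sketch below is plain Python for an undirected graph; the function name and sample graph are hypothetical.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the set of nodes in the k-core of an undirected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    alive = set(adj)
    stack = [v for v in alive if len(adj[v]) < k]
    while stack:
        v = stack.pop()
        if v not in alive:
            continue  # already removed via an earlier entry
        alive.discard(v)
        for w in adj[v]:
            adj[w].discard(v)
            # Removing v may push a neighbor below the threshold.
            if w in alive and len(adj[w]) < k:
                stack.append(w)
    return alive

# Hypothetical graph: a 4-clique a, b, c, d with a tail d - e - f.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"),
         ("c", "d"), ("d", "e"), ("e", "f")]
print(sorted(k_core(edges, 3)))  # ['a', 'b', 'c', 'd']
```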
25
Community detection
Leskovec and his colleagues (Stanford + Yahoo) surveyed different algorithms on large graphs.
Conductance-based evaluation is simple and effective.
Conductance: the ratio between the number of edges within the community and the number of edges leaving the community (higher is better).
Code, Data, and Papers:
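Separately from the authors' own materials, the edge-ratio idea can be sketched in a few lines of plain Python. The first function follows the ratio as stated on this slide (internal edges over leaving edges, higher is better). The second computes the commonly used normalized form, leaving / (2 * internal + leaving), for which lower is better; which of the two the cited survey uses should be checked against the papers. Function names and the sample graph are hypothetical.

```python
def edge_counts(edges, community):
    """Return (internal, leaving) edge counts for a set of community nodes."""
    members = set(community)
    internal = leaving = 0
    for u, v in edges:
        inside = (u in members) + (v in members)
        if inside == 2:
            internal += 1
        elif inside == 1:
            leaving += 1
    return internal, leaving

def internal_to_leaving_ratio(edges, community):
    """Ratio as stated on the slide: internal edges / leaving edges."""
    internal, leaving = edge_counts(edges, community)
    return float("inf") if leaving == 0 else internal / leaving

def normalized_conductance(edges, community):
    """Common normalized form: leaving / (2 * internal + leaving)."""
    internal, leaving = edge_counts(edges, community)
    volume = 2 * internal + leaving
    return 0.0 if volume == 0 else leaving / volume

# Hypothetical graph: two triangles joined by the single edge c - d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
print(internal_to_leaving_ratio(edges, {"a", "b", "c"}))  # 3.0
print(normalized_conductance(edges, {"a", "b", "c"}))
```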
26
Thank you!