Apache Spark & Complex Network

Apache Spark & Complex Network Rui Wu

Big Data is not far from us Computers, phones, cars (snapshot) 90 percent of all the data we had in 2013 was generated between 2011 and 2013 (IBM 2013). Big Data, new challenges

5Vs of Big Data Volume: The data size is huge. Example: Facebook, 300 PB; finding the most important people, based on betweenness centrality. Velocity: Data input and output speeds are very fast. Example: 300 hours of video per minute uploaded to YouTube (YouTube 2014); in a network of subscribers, finding the most influential ones by eigenvector centrality. Variety: The data can be of various types. Example: Twitter has videos, plain text, and images; it is hard to predict whom a user will follow next based on node (user) similarity.

5Vs of Big Data Veracity: The data can be messy and mixed with noise. Example: temperature relations between different spots, where a sensor records wrong data such as -999 degrees Celsius. Value: If the data cannot be turned into value, it is useless. Example: finding the most popular star on Twitter with a network model based on the number of followers (in-degree); this is hard to answer from the raw data.

Why Apache Spark and not Apache Hadoop? In-memory computation, so it is faster. Uses the Resilient Distributed Dataset (RDD), a read-only dataset stored across the machines of an Apache Spark cluster. Overcomes the linear dataflow structure on distributed systems.

One interesting thing about Spark You only have three choices (officially): Java, Scala, Python. A Spark process is a JVM process. Guess: why the Java Virtual Machine? https://0x0fff.com/spark-architecture/

GraphX There are many tools and libraries, but many are no longer maintained or only target old versions of Hadoop and Spark. GraphX: the Apache Spark component for graph computations http://spark.apache.org/graphx/

Let’s talk more, parameter by parameter: degree distribution, clustering coefficient, shortest path, random walk, community detection

Degree distribution Based on my survey, nobody has done it with Apache Spark. Too easy? GraphX can return every node's in-degree and out-degree, plus the number of nodes. Bingo!
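The step from per-node degrees and the node count to a degree distribution can be sketched as follows. This is a minimal illustration in plain Python on a toy directed edge list, not GraphX code; the function name degree_distribution and the sample edges are assumptions made for the example.

```python
from collections import Counter


def degree_distribution(edges):
    """Return (in_degree, out_degree, in_degree_distribution) for a directed edge list.

    Hypothetical helper for illustration; the distribution maps each in-degree
    value to the fraction of nodes that have it.
    """
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    n = len(nodes)
    # Nodes without incoming edges count as in-degree 0 (Counter returns 0 for them).
    hist = Counter(in_deg[v] for v in nodes)
    dist = {k: c / n for k, c in sorted(hist.items())}
    return dict(in_deg), dict(out_deg), dist


# Toy graph: a -> b, a -> c, b -> c
in_deg, out_deg, dist = degree_distribution([("a", "b"), ("a", "c"), ("b", "c")])
print(in_deg, out_deg, dist)
```

On a cluster the same idea would apply to the degree pairs that GraphX returns: count nodes per degree value, then divide by the node count.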

Modularity (enumerative)

Assortativity coefficient (enumerative)

Assortativity Coefficient Based on my survey, there is no Apache Spark tool (including GraphX) that calculates the assortativity coefficient directly. It is easy to calculate this parameter using Apache Spark matrices and the formula introduced in the complex network lecture.
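The lecture formula is not reproduced in these slides, so the sketch below assumes the common degree assortativity definition: the Pearson correlation between the degrees at the two ends of each edge of an undirected graph. It is plain Python on a toy edge list rather than a Spark matrix computation, and the function name is hypothetical.

```python
from collections import Counter


def degree_assortativity(edges):
    """Pearson correlation of end-point degrees over an undirected edge list.

    Assumed definition (degree assortativity); not taken from the slides.
    """
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    # Each undirected edge contributes both orientations, so x and y share one mean.
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    n = len(xs)
    mean = sum(xs) / n
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("undefined when all end-point degrees are equal")
    return cov / var


# A star graph is maximally disassortative: the hub only links to leaves.
print(degree_assortativity([("c", 1), ("c", 2), ("c", 3)]))
```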

Clustering coefficient GraphX has officially declined to solve this problem. Someone tried to solve it on GitHub, and the contribution was still rejected: https://github.com/amplab/graphx/pull/148

Clustering coefficient What you know and what you can do are two different things. Even if your program works, you still need to modify it or persuade others to use it.
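For reference, the quantity itself is small to state. The sketch below assumes the local clustering coefficient (the fraction of a node's neighbor pairs that are themselves connected) on an undirected graph. It is plain Python for illustration, not the rejected GraphX contribution, and the function name is hypothetical.

```python
from itertools import combinations


def local_clustering(edges):
    """Local clustering coefficient per node of an undirected edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    coeff = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            coeff[node] = 0.0  # fewer than two neighbors: no pairs to close
            continue
        # Count neighbor pairs that are directly linked (triangles through node).
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeff[node] = 2.0 * links / (k * (k - 1))
    return coeff


# Triangle a-b-c with a pendant node d attached to c.
print(local_clustering([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
```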

Shortest path Common question: what is the shortest path between two nodes? GraphX has a function for this: https://spark.apache.org/docs/0.9.1/graphx-programming-guide.html
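As a reminder of what the question asks, here is a minimal breadth-first search sketch for an unweighted, undirected graph. It is plain Python and does not reproduce how GraphX implements its function; the function name and the sample graph are assumptions.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Return one shortest path (list of nodes) from source to target, or None."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # target not reachable


edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E"), ("E", "C")]
print(shortest_path(edges, "A", "C"))
```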

Hop Distribution Hop distribution of all the paths between two nodes. A is the source: 1 hop: 2, 2 hops: 3...

Hop Distribution There are few papers about this. Here is one [1]: effective, but complex. However, the work focuses only on wireless networks. [1] Kuo, J.C. and Liao, W., 2007. Hop count distribution of multihop paths in wireless networks with arbitrary node density: Modeling and its applications. IEEE Transactions on Vehicular Technology, 56(4), pp.2321-2331.
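One simple reading of the hop distribution is the number of nodes first reached at each hop count from a source, which a breadth-first search yields directly. The sketch below assumes that reading; it is plain Python, unrelated to the method in [1], and the sample graph is hypothetical, built only so that source A has 2 nodes at 1 hop and 3 nodes at 2 hops.

```python
from collections import Counter, deque


def hop_distribution(edges, source):
    """Map hop count -> number of nodes whose shortest distance from source equals it."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    # Skip the source itself (0 hops).
    hops = Counter(d for d in dist.values() if d > 0)
    return dict(sorted(hops.items()))


edges = [("A", "B"), ("A", "C"), ("B", "D"), ("B", "E"), ("C", "F")]
print(hop_distribution(edges, "A"))
```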

Node Centrality GraphX officially has no centrality functions. Third-party libraries: https://spark-packages.org/?q=tags%3A%22Graph%22

Random Walk PageRank and path length are already provided by GraphX: http://spark.apache.org/docs/latest/graphx-programming-guide.html#pagerank I think we still need to take a look at PageRank, because it is our next assignment.
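As a preview of the idea, here is a minimal sketch of textbook PageRank by power iteration, in plain Python. It is not the GraphX implementation, whose normalization may differ; the damping factor 0.85, the fixed iteration count, and the even redistribution of rank from dangling nodes are assumptions of this example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Textbook PageRank on a directed edge list; ranks sum to 1."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for dst in out[v]:
                    new[dst] += share
        rank = new
    return rank


# Both a and b point to c, so the random walker ends up at c most often.
print(pagerank([("a", "c"), ("b", "c")]))
```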

K-core k-core: a substructure in which each node connects to at least k other members of the group
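The definition suggests a simple peeling procedure: repeatedly delete nodes with fewer than k remaining neighbors until none are left. The sketch below shows this in plain Python on an undirected toy graph; the function name is hypothetical and this is not a Spark or GraphX routine.

```python
def k_core(edges, k):
    """Return the set of nodes in the k-core of an undirected edge list."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    changed = True
    while changed:
        changed = False
        # Snapshot the nodes that currently violate the degree requirement.
        for node in [n for n, nbrs in adj.items() if len(nbrs) < k]:
            for nbr in adj.pop(node):
                adj[nbr].discard(node)  # removal may push neighbors below k
            changed = True
    return set(adj)


# Triangle a-b-c with a tail c-d-e: only the triangle survives for k = 2.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("d", "e")]
print(k_core(edges, 2))
```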

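Community detection Leskovec and his colleagues (Stanford + Yahoo) surveyed different algorithms on large graphs. Based on conductance: simple and effective. Conductance as used here: the ratio between the number of edges within the community and the number of edges leaving the community (higher is better). Note that the standard conductance is defined the other way around, as the number of edges leaving the community divided by the community's volume (the sum of its members' degrees), so lower is better. Code, Data, and Papers: http://snap.stanford.edu/ncp/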
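To make the score concrete, the sketch below counts internal and leaving (cut) edges for a candidate community and reports both the inside-to-leaving ratio described on the slide and the standard conductance, cut / min(vol(S), vol(rest)), where lower is better. It is plain Python on a toy graph; the function name and return layout are assumptions, and this is not the code from the SNAP page.

```python
def community_scores(edges, community):
    """Return (internal_edges, cut_edges, standard_conductance) for a node set."""
    community = set(community)
    internal = cut = 0
    vol_in = vol_out = 0  # sums of degrees inside and outside the community
    for u, v in edges:
        u_in, v_in = u in community, v in community
        vol_in += u_in + v_in
        vol_out += (not u_in) + (not v_in)
        if u_in and v_in:
            internal += 1
        elif u_in != v_in:
            cut += 1
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("conductance is undefined for an empty side")
    return internal, cut, cut / denom


# Two triangles joined by a single bridge edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"),
         ("d", "e"), ("e", "f"), ("d", "f")]
internal, cut, phi = community_scores(edges, {"a", "b", "c"})
print(internal / cut, phi)  # slide-style ratio, standard conductance
```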

Thank you!