
1 Adaptive Dynamic Bipartite Graph Matching: A Reinforcement Learning Approach
Yansheng Wang 1, Yongxin Tong 1, Cheng Long 2, Pan Xu 3, Ke Xu 1, Weifeng Lv 1 (1 Beihang University; 2 Nanyang Technological University; 3 University of Maryland, College Park). Good afternoon everyone. I am Yansheng Wang from Beihang University. Today I'll present our work, Adaptive Dynamic Bipartite Graph Matching: A Reinforcement Learning Approach. This is collaborative work with Yongxin, Ke and Weifeng from Beihang University, Cheng from Nanyang Technological University, and Pan from the University of Maryland, College Park.

2 Outline: Background and Motivation; Problem Statement; Our Solutions; Experiments; Conclusion. This is the outline.

3 Outline: Background and Motivation; Problem Statement; Our Solutions; Experiments; Conclusion. I will first introduce the background and the motivation.

4 Background
Bipartite graph matching. Traditional applications: the assignment problem, the vehicle scheduling problem, etc.; these methods perform well in offline scenarios. The problem can be solved by the Hungarian method in polynomial time. [Figure: a weighted bipartite graph with nodes 1-6; photo of Harold W. Kuhn.] I think everyone is quite familiar with the bipartite graph matching problem. This is a weighted bipartite graph, and we want to match the nodes in order to maximize the total weight. The problem can be solved by the Hungarian algorithm in polynomial time, which was first developed by Harold Kuhn in the 1950s. It has been applied to various problems like the ones above, and its good performance in offline scenarios has been validated.
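For reference, here is a minimal sketch of maximum-weight bipartite matching using SciPy's polynomial-time assignment solver (a Hungarian-style algorithm); the weight matrix is illustrative, not taken from the slides:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative weights: rows are left nodes, columns are right nodes.
weights = np.array([
    [2, 0, 1],
    [3, 2, 0],
    [0, 1, 2],
])

# linear_sum_assignment solves the assignment problem in polynomial time;
# maximize=True turns cost minimization into weight maximization.
rows, cols = linear_sum_assignment(weights, maximize=True)
print(list(zip(rows, cols)), weights[rows, cols].sum())  # matched pairs, total weight
```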

5 Background
Emergence of online scenarios. Transportation: taxi-hailing, ride sharing, …; Medical: mutual blood donation, kidney exchange, …; Economic: two-sided markets, crowdsourcing, …. Online matching is more and more important. However, with the emergence of the mobile Internet and the sharing economy, online scenarios are becoming more and more common for matching problems, where supply and demand come into the system dynamically. For example, xxx, xxx in xxx area, …; they all have real-world apps, so online matching is becoming more and more important.

6 Existing Research
Problem model: online bipartite matching. Solution: online algorithms under the instantaneous constraint. Objective: maximize the sum of weights. Nodes arrive and leave dynamically, and each node must be matched as soon as it arrives (the instantaneous constraint). [Figure: nodes arriving over time on a weighted bipartite graph.] There is some existing research on this problem, which often models it as online bipartite matching, where nodes arrive and leave dynamically and the objective is to maximize the sum of weights over matched pairs. It also assumes that each node must be matched as soon as it arrives, known as the instantaneous constraint, and existing solutions are often online algorithms designed under this constraint. Y. Tong et al. Online mobile microtask allocation in spatial crowdsourcing. In ICDE 2016.

7 Motivation
The instantaneous constraint is too strong sometimes. If nodes can wait (i.e., match in a batch manner), more information can be gathered, and a node is likely to meet better candidates in the future. [Figure: a crowdsourcing requester with a task of labeling 500 pictures waits from 17:00 to 17:30 and the available worker accuracy rises from 70% to 95%; a taxi-hailing app waiting for a response shows "2 drivers nearby have received your order".] Passengers can wait for a short time before being served; requesters are willing to wait for more reliable workers. However, from our point of view, the instantaneous constraint is too strong sometimes. For example, in a taxi-hailing app, passengers can wait a short time before being served. So if nodes can wait, more information can be gathered. There are also existing works using fixed batch-based methods. L. Kazemi et al. GeoCrowd: enabling query answering with spatial crowdsourcing. In GIS 2012.

8 Limitations of existing work
Strong assumptions: the instantaneous constraint. Batch manner: fixed batch sizes, lacking a global theoretical guarantee. Let's summarize the limitations of existing work. Y. Tong et al. Online mobile microtask allocation in spatial crowdsourcing. In ICDE 2016. Y. Tong et al. Flexible dynamic task assignment in real-time spatial data. In VLDB 2017. P. Cheng et al. An experimental evaluation of task assignment in spatial crowdsourcing. In VLDB 2018. L. Kazemi et al. GeoCrowd: enabling query answering with spatial crowdsourcing. In GIS 2012. L. Kazemi et al. GeoTruCrowd: trustworthy query answering with spatial crowdsourcing. In GIS 2013.

9 Contributions
Devise a novel adaptive batch-based framework. Analyze its global theoretical guarantee. Propose an effective and efficient reinforcement learning based solution. We make the following contributions in this paper.

10 Outline: Background and Motivation; Problem Statement; Our Solutions; Experiments; Conclusion. Next, we will state our problem formally.

11 Problem Statement
Dynamic bipartite graph 𝑩=(𝑳,𝑹,𝑬), where 𝑳={𝒊∈ℕ*} is the left node set and 𝑹={𝒋∈ℕ*} is the right node set. [Figure: nodes 1-6 placed along an arrival-time axis.] We call such a graph 𝑩=(𝑳,𝑹,𝑬) a dynamic bipartite graph.

12 Problem Statement
Dynamic bipartite graph: 𝑬⊆𝑳×𝑹 is the edge set, with a weight on each edge, e.g., 𝒘(𝟑,𝟏)=𝟐 and 𝒘(𝟒,𝟓)=𝟎. [Figure: the same graph with edge weights shown along the arrival timeline.]

13 Problem Statement
Dynamic bipartite graph: duration of nodes. Every node has a duration (lifetime) after its arrival; e.g., node 3 will vanish at time step 6. [Figure: the lifetime of node 3 marked on the arrival timeline.]

14 Problem Statement
Dynamic bipartite graph: matching allocation 𝑴={(𝟑,𝟏),(𝟒,𝟐),(𝟔,𝟓)}; utility score 𝑼(𝑩,𝑴)=𝟐+𝟑+𝟐=𝟖. [Figure: the matched pairs on the arrival timeline.] In the online scenario, the graph changes dynamically as follows.
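As a quick check, the utility score is just the sum of matched edge weights; a tiny sketch in Python, with the weights taken from the slide's example:

```python
w = {(3, 1): 2, (4, 2): 3, (6, 5): 2}  # edge weights from the example above
M = [(3, 1), (4, 2), (6, 5)]           # matching allocation M
U = sum(w[e] for e in M)               # U(B, M) = 2 + 3 + 2 = 8
```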

15 Problem Statement
Dynamic Bipartite Graph Matching (DBGM) Problem. Given a dynamic bipartite graph 𝑩, where each node appears in an online manner, find a matching allocation 𝑴 that maximizes the total utility, i.e., max_𝑴 𝑼(𝑩,𝑴). Decisions can be made freely (without the instantaneous constraint assumption). Now we can define the DBGM problem as follows.

16 Problem Statement
Evaluation metric: the competitive ratio in the adversarial model, 𝑪𝑹 = min_𝑩 𝑼(𝑩,𝑴)/𝑶𝒑𝒕(𝑩), where 𝑶𝒑𝒕(𝑩) is the offline optimum. The competitive ratio is the theoretical guarantee of an online algorithm in the worst case. We use the competitive ratio in the adversarial model as the evaluation metric.

17 Outline: Background and Motivation; Problem Statement; Our Solutions; Experiments; Conclusion. Next we will introduce our solution to the problem, namely the adaptive batch-based solution.

18 Our framework
We propose an adaptive batch-based framework with three steps: (1) accumulate the dynamically arriving nodes into a batch; (2) match all the nodes in the batch; (3) adaptively adjust the size of the batch. [Figure: nodes on the arrival timeline grouped into batches.] First, we propose the adaptive batch-based framework, which consists of the steps above; a sketch of the loop follows.
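A minimal sketch of that loop, assuming a splitting policy `should_split` (e.g., the learned policy introduced later) and an in-batch matcher `match_batch` such as the Hungarian method; both names are ours, for illustration only:

```python
def adaptive_batch_matching(stream, should_split, match_batch):
    # stream yields the list of nodes arriving at each time step.
    batch, total_utility = [], 0
    for t, arrivals in enumerate(stream):
        batch.extend(arrivals)                   # step 1: accumulate nodes
        if should_split(batch, t):               # step 3: adaptive split decision
            total_utility += match_batch(batch)  # step 2: match within the batch
            batch = []                           # unmatched nodes expire here;
                                                 # a variant keeps them in the batch
    return total_utility
```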

19 Our framework
We propose an adaptive batch-based framework: accumulate the dynamically arriving nodes into a batch; match all the nodes in the batch; adaptively adjust the size of the batch. In this framework, two challenges need to be addressed. Challenge 1: How optimal is an adaptive batch-based framework in theory? Challenge 2: How to implement an optimal strategy to split batches adaptively? [Figure: the same batching illustration.]

20 Solution to the 1st Challenge
We use the Hungarian algorithm as the in-batch algorithm. A batch splitting strategy is a binary sequence, e.g., 𝝈=(𝟏,𝟎,𝟎,𝟎,𝟎,𝟏,𝟏)∈𝑺_𝑩, where 𝑺_𝑩 is the strategy space with |𝑺_𝑩|=𝟐^(|𝑩|+𝟏), and 𝑼(𝑩,𝑺_𝑩) = max_{𝝈∈𝑺_𝑩} 𝑼(𝑩,𝝈). [Figure: a strategy splitting the arrival timeline into batches.] We will discuss the solution to the first challenge first. We formalize batch splitting as choosing a strategy, and consider two cases for unmatched nodes. We use the Hungarian algorithm in each batch.

21 Solution to the 1st Challenge
Theoretical result: our framework can achieve a constant competitive ratio of 𝟏/(𝑪−𝟏), where 𝑪 is a constant, and we also prove that this competitive ratio is tight. Thus we answer the open question: can a batch-based solution achieve a global theoretical guarantee? This is the theoretical result.

22 Solution to the 2nd Challenge
Observation: batch splitting is a sequential decision making problem. For example, at step 1 the utility gain of matching would be 𝟎, so 𝝈_𝟏=𝟎; at step 4 the utility gain is 𝟔, so 𝝈_𝟒=𝟏. It can be modeled by a Markov decision process (MDP). [Figure: the example timeline with the two decision points marked.] Now let me introduce our solution to the second challenge, that is, how to effectively implement a batch splitting strategy. We have the above observations about the problem.

23 Solution to the 2nd Challenge
MDP modeling. State: the numbers of left and right nodes (for simplification). Action: whether to split the batch now (𝟎 or 𝟏). Reward: the sum of weights of the matching (𝟎 if we do not match). Baseline: Q-learning. Purpose: learning a mapping from states (𝒔) to actions (𝒂) in order to maximize the sum of rewards. Basic idea: a function 𝑸(𝒔,𝒂) records the score of taking action 𝒂 in state 𝒔. Here is how we model the problem as an MDP; we first apply the classical Q-learning algorithm as the baseline method.
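As a hedged illustration of inference with a learned Q-table (the dictionary layout and helper are our assumptions, not from the paper):

```python
def act(Q, state):
    # Greedy inference: pick the action (0 = keep waiting, 1 = split now)
    # with the higher learned score; unseen pairs default to 0.0.
    return max((0, 1), key=lambda a: Q.get((state, a), 0.0))

# Two rows of the learned table shown on the next slide:
Q = {((0, 0), 0): 0.97, ((0, 0), 1): 0.02,
     ((1, 1), 0): 0.66, ((1, 1), 1): 0.71}
print(act(Q, (0, 0)))  # 0: keep waiting while the batch is empty
print(act(Q, (1, 1)))  # 1: split and match one left node with one right node
```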

24-30 Solution to the 2nd Challenge
Baseline: Q-learning, an example of inference (with learned parameters). The learned Q-table:

State   a=0   a=1
(0,0)   0.97  0.02
(0,1)   0.85  0.59
(1,0)   0.82  0.45
(1,1)   0.66  0.71
(0,2)   0.88  0.53
(2,0)   0.79  0.44
(1,2)   0.61  -
(2,1)   0.65  -
(2,2)   0.42  -

[Animation: nodes 1-6 arrive one per time step; at each step the current state is looked up in the learned Q-table to decide whether to split and match.] Here is a running example of how Q-learning makes decisions.

31 Solution to the 2nd Challenge
Baseline: Q-learning. The Q-table is updated by Bellman backups:

𝑸(𝒔,𝒂) ← 𝑸(𝒔,𝒂) + 𝜶[𝒓𝒆𝒘𝒂𝒓𝒅 + max_{𝒂′} 𝑸(𝒔′,𝒂′) − 𝑸(𝒔,𝒂)]

The updating process of the Q-table follows this equation. The Q-learning algorithm converges to the optimum given sufficient training data. Watkins C. J., Dayan P. Q-learning. Machine Learning, 1992.
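A direct transcription of that backup into code; a minimal sketch assuming a dictionary-based Q-table (the slide's rule has no discount factor, i.e., γ = 1):

```python
def q_update(Q, s, a, reward, s_next, alpha=0.1):
    # One Bellman backup: move Q(s, a) toward reward + max_a' Q(s', a').
    old = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, b), 0.0) for b in (0, 1))
    Q[(s, a)] = old + alpha * (reward + best_next - old)
```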

32 Solution to the 2nd Challenge
Observation: the search space of general Q-learning is too large, leading to inefficiency. If the batch size is too small, some checks are unnecessary; if it is too large, many nodes will vanish. [Figure: candidate split points along the arrival timeline.] However, classical Q-learning cannot capture these unique properties of our problem, so the search space can be restricted.

33 Solution to the 2nd Challenge
Restricted Q-learning (RQL). Only consider batches with size in [𝒍_𝒎𝒊𝒏, 𝒍_𝒎𝒂𝒙], and reformulate states and actions: state 𝒔 → (𝒔,𝒍), where 𝒍 is how many rounds we have waited; action 𝒂∈{𝟎,𝟏} → 𝒂∈[𝒍_𝒎𝒊𝒏, 𝒍_𝒎𝒂𝒙]. We devise another RL-based solution, namely Restricted Q-learning; the idea is to restrict the decision space as above (see the sketch below).
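A minimal sketch of RQL inference under this reformulation; the bounds, key layout, and tie-breaking are illustrative assumptions of ours:

```python
L_MIN, L_MAX = 3, 5  # illustrative batch-size bounds, as in the example below

def rql_split_now(Q, left, right, l):
    # Before L_MIN rounds, always skip; at L_MAX, a split is forced.
    if l < L_MIN:
        return False
    if l >= L_MAX:
        return True
    # Otherwise look up the best target split round among the remaining ones
    # and split now only if the current round scores best.
    best = max(range(l, L_MAX + 1),
               key=lambda a: Q.get(((left, right, l), a), 0.0))
    return best == l
```

With the table on the next slides, state (1,2) at 𝒍=𝟑 prefers 𝒂=𝟒 (0.62), so we wait; at 𝒍=𝟒 the best remaining action is 𝒂=𝟒 (0.51), so we split now.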

34-43 Solution to the 2nd Challenge
An example of RQL. Suppose 𝒍_𝒎𝒊𝒏=𝟑 and 𝒍_𝒎𝒂𝒙=𝟓. The learned Q-table:

State     a=3   a=4   a=5
(1,2,3)   0.59  0.62  0.55
(1,2,4)   0.0   0.51  0.47
(1,2,5)   -     0.56  0.44
(2,1,3)   0.37  0.29  0.12
(2,2,4)   -     0.58  -
(2,2,5)   -     -     0.41
(2,3,5)   -     -     0.42

[Animation: nodes arrive one per time step.] Let's see a running example of RQL. At 𝒍=𝟏 and 𝒍=𝟐 we skip without checking. At 𝒍=𝟑 we check the table: it is better to split at 𝒍=𝟒, so we wait. At 𝒍=𝟒 we check again: better to split now, so the batch is matched. For the next batch, 𝒍=𝟏 and 𝒍=𝟐 are skipped again, and at 𝒍=𝟑 the table says it is better to split now.

44 Outline: Background and Motivation; Problem Statement; Our Solutions; Experiments; Conclusion. Next, we will show some experimental results.

45 Experiments
Datasets. Synthetic data, varying the distribution of edge weights, graph sparsity, duration of nodes, arrival density of nodes, cardinality, and scalability. Real data from Didi Chuxing, with about 10K nodes per hour; 400K nodes are used for training and 10K for testing. Compared methods: the greedy algorithm (GR), TGOA from ICDE 2016, and the fixed-batch algorithm (FB). We use a synthetic dataset and a real dataset.

46 Experiments
Impact of the edge weight distribution: RQL is the most effective, RQL is efficient in running time, and the memory cost is not high (about 23 MB). Here are the results on the impact of edge weight distributions. [Figure: utility, running time, and memory under different edge weight distributions.]

47 Experiments
Impact of sparsity and arrival density: varying the graph sparsity, and varying the arrival density of the nodes. [Figure: utility under varying sparsity and arrival density.]

48 Experiments
Results on real data from Didi Chuxing, varying the maximal duration of tasks/workers. Here are the results on the real dataset. [Figure: utility and running time on the Didi data.]

49 Outline: Background and Motivation; Problem Statement; Our Solutions; Experiments; Conclusion. Finally, we conclude our work.

50 Conclusion
Propose a novel adaptive batch-based framework that guarantees a constant competitive ratio. Devise effective and efficient RL-based solutions to learn how to split the batches adaptively. Extensive experiments on both real and synthetic datasets show that our solutions outperform the state-of-the-art methods.

51 Q & A Thank You

52 Solution to the 1st Challenge
Theoretical analysis. Assumption: the duration of nodes has an upper bound 𝑪≥𝟐. Theorem 1: 𝑪𝑹_𝒆𝒙𝒑𝒊𝒓𝒆 = 𝟏/(𝑪−𝟏), for the case where unmatched nodes expire. Theorem 2: 𝟏/(𝑪−𝟏) ≤ 𝑪𝑹_𝒓𝒆𝒎𝒂𝒊𝒏 < 𝟐/(𝑪−𝟐) for 𝑪≥𝟑, for the case where unmatched nodes remain in the batch.

53 Solution framework
Main idea of the proof: construct a good enough strategy using the offline optimum 𝑴* as a guide. Pick the edge with the largest weight in 𝑴* and put it in a batch; delete the edges in 𝑴* that could not be matched in that batch; pick and delete repeatedly until 𝑴*=∅. We then have to bound the number of deleted edges in each round. There are two cases of deleted edges: at most (𝒋−𝒊−𝟏) edges plus at most 𝑪−𝟏−(𝒋−𝒊+𝟏) edges, i.e., at most 𝑪−𝟐 edges in total.

54 Solution to the 2nd Challenge
Problem: the Q-table is memory consuming. Optimization: quantization techniques, which merge neighboring entries into one slot, e.g., one slot 𝑸((𝟏𝟎~𝟏𝟏,𝟏𝟎~𝟏𝟏),𝟐~𝟑) and one slot 𝑸((𝟏𝟎~𝟏𝟏,𝟏𝟎~𝟏𝟏),𝟒~𝟓) replace the fine-grained entries below:

State       a=2   a=3   a=4   a=5
(10,10,2)   0.45  0.46  0.48  0.49
(10,11,2)   0.50  -     -     -
(11,10,2)   0.44  -     -     -
(11,11,2)   -     -     -     -
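A hedged sketch of the quantization idea: bucket each state dimension and the action by a bin width, so that, for example, 2×2 neighboring states times 2 actions share a single slot (the key layout is an assumption of ours):

```python
BIN = 2  # illustrative bin width for states and actions

def slot(left, right, a):
    # Quantized key: e.g. states (10,10)..(11,11) with actions 2~3 all map
    # to one slot, so eight fine-grained cells cost one table entry.
    return (left // BIN, right // BIN, a // BIN)
```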

55 Experiments
Impact of quantization: the utility score is not damaged, and is even better in some cases, while the memory cost is largely decreased and remains stable. [Figure: utility and memory with and without quantization.]

56 Instantaneous vs. Batch-based
[Figure: the same arrival sequence matched instantaneously versus in batches.] Now let's see an example of an instantaneous decision versus a batch-based decision. With instantaneous matching the total utility is 5; with batch-based matching the total utility is 8.

57 Problem Statement
Dynamic bipartite graph, online scenario. [Animation: at each time step t = 1, …, 6 a new node arrives.] In the online scenario, the graph changes dynamically as follows. Matching result: 𝑴={(𝟑,𝟏),(𝟒,𝟐),(𝟔,𝟓)}. Utility score: 𝑼(𝑩,𝑴)=𝟐+𝟑+𝟐=𝟖.

