Network-Wide Bike Availability Clustering Using the College Admission Algorithm: A Case Study of San Francisco Bay Area
Hesham Rakha, Ph.D., P.Eng.
Samuel Reynolds Pritchard Professor of Engineering, Charles E. Via, Jr. Dept. of Civil & Environmental Engineering
Director, Center for Sustainable Mobility, Virginia Tech Transportation Institute
Courtesy Professor, Bradley Dept. of Electrical and Computer Engineering
Center for Sustainable Mobility
Presentation Outline: Introduction, Proposed algorithm, Application, Results, Conclusion
Research (n.d.). Retrieved November 11, 2016, from http://cep-probation.org/research-note/
Hesham Rakha Center for Sustainable Mobility
Introduction
Clustering is an unsupervised learning technique that identifies the underlying structure (natural grouping) of unlabeled data. What is a natural grouping among these objects? (Quoted from SORAC Fall meeting’s presentation for Dr. Mohammed Elhenawy.)
Introduction (cont.)
Finding a good clustering depends on the clustering criterion and the final aim of the clustering algorithm. (Quoted from SORAC Fall meeting’s presentation for Dr. Mohammed Elhenawy.)
[Figure: two clustering solutions for the same objects. One solution: blue cluster, red cluster. Another solution: rectangular cluster, circles cluster.]
Introduction (cont.)
Classical clustering techniques such as K-means, fuzzy clustering, and DBSCAN are therefore blind: they cluster the data based on only one parameter, distance. They maximize dispersion between clusters (minimize distortion inside clusters) and do not consider other attributes such as the shape or color of a point. What is the solution? Come up with a new algorithm that maximizes dispersion while considering the shape or color of each point (i.e., maximizes purity simultaneously).
[Figure labels: Cluster 1, Cluster 2]
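To make the contrast concrete, the following Python sketch scores two clusterings of the same toy data by distortion and by purity. The four points, their colors, and the function names `distortion` and `purity` are made up for illustration and do not come from the presentation; purity is taken here in its usual sense, the fraction of points that carry the majority label of their cluster.

```python
from collections import Counter


def distortion(points, assignment, centers):
    """Sum of squared Euclidean distances from each point to its cluster center."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centers[k]))
        for point, k in zip(points, assignment)
    )


def purity(labels, assignment):
    """Fraction of points whose label is the majority label of their cluster."""
    by_cluster = {}
    for label, k in zip(labels, assignment):
        by_cluster.setdefault(k, Counter())[label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


# Hypothetical toy data: two tight spatial groups whose colors alternate.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
labels = ["red", "blue", "red", "blue"]

by_distance = [0, 0, 1, 1]  # grouping a distance-only criterion prefers
by_color = [0, 1, 0, 1]     # grouping that follows the labels instead

centers_distance = {0: (0.5,), 1: (10.5,)}  # cluster means for by_distance
centers_color = {0: (5.0,), 1: (6.0,)}      # cluster means for by_color

print(distortion(points, by_distance, centers_distance), purity(labels, by_distance))
print(distortion(points, by_color, centers_color), purity(labels, by_color))
```

The distance-only grouping has low distortion but mixed colors in every cluster, while the color-based grouping is pure but spatially spread out. A multi-objective method has to trade these two scores off against each other.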
Proposed algorithm
The proposed algorithm is built from two well-known algorithms: the College Admission (CA) algorithm and the K-median algorithm. CA matches colleges and applicants through a series of iterations, with the goal of finding the optimal solution that satisfies both colleges and applicants. K-median is similar to K-means but uses the median instead of the mean. The proposed algorithm combines the advantages of both supervised and unsupervised algorithms. It is a multi-objective algorithm in which the impurity and the distance within each cluster are minimized simultaneously. It matches clusters (which minimize distance) with data points (which maximize purity) until it converges.
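The difference between the K-means and K-median center updates can be sketched in a few lines of Python. This is only an illustration of the coordinate-wise median update, not the authors' implementation; the helper name `centroid` and the toy cluster are hypothetical.

```python
from statistics import mean, median


def centroid(points, reducer):
    """Apply `reducer` (mean or median) to each coordinate of the cluster."""
    return tuple(reducer(coordinate) for coordinate in zip(*points))


# Hypothetical cluster with one outlier in the first coordinate.
cluster = [(1.0, 2.0), (2.0, 2.0), (3.0, 4.0), (100.0, 4.0)]

mean_center = centroid(cluster, mean)      # K-means style update
median_center = centroid(cluster, median)  # K-median style update

print(mean_center)
print(median_center)
```

The single outlier drags the mean-based center far from the bulk of the points, whereas the median-based center stays among them. That robustness is the usual reason for preferring the median update.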
Proposed algorithm - example
[Figure: data points 1, 2, ..., n rank Cluster 1, Cluster 2, and Cluster 3 so as to maximize purity; clusters rank data points so as to minimize distance. Preference list 3: Point 10, Point 3, Point 4, Point 2, Point 20, Point 13, Point 15.]
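One way to read the matching step is as a many-to-one deferred-acceptance (Gale-Shapley) procedure in which data points play the applicants and clusters play the colleges. The sketch below is only an illustration under that assumption: the quotas, the toy preference lists, the distances, and the function name `deferred_acceptance` are hypothetical, and the slides do not specify how the authors build the lists or handle cluster capacities.

```python
def deferred_acceptance(point_prefs, distance, quota):
    """Many-to-one matching with points proposing to clusters.

    point_prefs: {point: [cluster, ...]}, most preferred first (e.g. by purity).
    distance:    {(point, cluster): float}; clusters prefer closer points.
    quota:       {cluster: int}, maximum number of points a cluster may hold.
    """
    next_choice = {p: 0 for p in point_prefs}
    held = {c: [] for c in quota}
    free = list(point_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(point_prefs[p]):
            continue  # p was rejected by every cluster and stays unmatched
        c = point_prefs[p][next_choice[p]]
        next_choice[p] += 1
        held[c].append(p)
        held[c].sort(key=lambda q: distance[(q, c)])
        if len(held[c]) > quota[c]:
            free.append(held[c].pop())  # the cluster rejects its farthest point
    return held


# Hypothetical example: four points, two clusters, two seats per cluster.
point_prefs = {"a": [0, 1], "b": [0, 1], "c": [0, 1], "d": [1, 0]}
distance = {
    ("a", 0): 1.0, ("b", 0): 2.0, ("c", 0): 3.0, ("d", 0): 6.0,
    ("a", 1): 5.0, ("b", 1): 4.0, ("c", 1): 2.0, ("d", 1): 1.0,
}
quota = {0: 2, 1: 2}

matching = deferred_acceptance(point_prefs, distance, quota)
print(matching)
```

Point "c" prefers cluster 0, but that cluster is full of closer points, so "c" is rejected and ends up in its second choice. In a full clustering loop, the centers and the preference lists would be recomputed after each matching round until the assignment stops changing.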
How to find the optimal k? Consensus Clustering (CC)
Selecting the Optimal Number of Clusters
Consensus Clustering (CC) subsamples the data set and calculates the consensus rate between all pairs of samples. It creates a similarity matrix that records the number of times two data points are assigned to the same cluster centroid, 𝑘∈𝐾, which can be used to show the degree of stability for each K. One CC measure of cluster stability is the cumulative distribution function (CDF) of the consensus rate. Every curve represents one value of K, and the flatter the curve, the more stable that number of clusters is.
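The consensus rate and its CDF can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the 1-D k-means base clusterer, the 80% subsampling fraction, the number of resamples, the toy data, and all function names are assumptions chosen to keep the example short.

```python
import random
from itertools import combinations


def kmeans_1d(values, k, rng, iters=20):
    """Tiny 1-D k-means, used only as a stand-in base clusterer."""
    centers = rng.sample(sorted(set(values)), k)
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = sum(members) / len(members)
    return labels


def consensus_values(data, k, n_resamples=30, frac=0.8, seed=0):
    """Consensus rate per pair: times co-clustered / times sampled together."""
    rng = random.Random(seed)
    n = len(data)
    together = {pair: 0 for pair in combinations(range(n), 2)}
    sampled = dict(together)
    for _ in range(n_resamples):
        idx = sorted(rng.sample(range(n), int(frac * n)))
        labels = kmeans_1d([data[i] for i in idx], k, rng)
        for (a, i), (b, j) in combinations(enumerate(idx), 2):
            sampled[(i, j)] += 1
            if labels[a] == labels[b]:
                together[(i, j)] += 1
    return [together[p] / sampled[p] for p in together if sampled[p] > 0]


def empirical_cdf(values, grid):
    """Fraction of consensus rates that are <= each grid value."""
    return [sum(v <= t for v in values) / len(values) for t in grid]


# Hypothetical data: two well-separated groups of five points each.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
grid = [0.0, 0.25, 0.5, 0.75, 1.0]

cdf_k2 = empirical_cdf(consensus_values(data, k=2), grid)
cdf_k3 = empirical_cdf(consensus_values(data, k=3), grid)
print(cdf_k2)
print(cdf_k3)
```

For the matching number of clusters every pair is either always or never grouped together, so the CDF is flat between 0 and 1. For other values of K, intermediate consensus rates typically appear and the curve is no longer flat.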
A Case Study of San Francisco Bay Area, Bike Sharing System
Retrieved June 19, 2017, from http://www.cloud9living.com/san-francisco
Background
Bike Sharing System (BSS): last-mile transportation solution; sustainable urban transportation system; environmentally friendly; efficient and effective solution for traffic jams; affordable.
Retrieved June 19, 2017, from https://www.portlandpedalpower.com/blog/2014/03/bike-share-alternatives-developing-the-future-of-cycling/
Background
More than 37,000 stations in 50 countries! But we always have a balancing problem. A number of recent research studies have tried to rebalance bike stations by anticipating the bike availability at each station.
Retrieved June 19, 2017, from http://www.shareable.net/blog/what-can-we-learn-from-the-bike-sharing-world-map
Objective & Contribution
Objective: Find the network-wide availability patterns and how these patterns evolve temporally, with the goal of detecting imbalances in the BSS.
Contribution: We proposed a multi-objective clustering algorithm based on two algorithms. The proposed algorithm clusters 15-minute entries of bike availability across the network and finds the similarity between them according to day of week and time of day. This provides an expected pattern of bike usage for each cluster. We then identified when and where the system would be imbalanced.
Data Set
Docking station data collected from August 2013 to August 2015 in the San Francisco Bay Area. 70 stations (70 dimensions).
Retrieved June 19, 2017, from https://www.kaggle.com
Data Set (Cont.)
The data were reduced from one-minute to 15-minute resolution, giving 48,000 entries of bike availability. Using the proposed algorithm, we look for similarity between these entries and cluster them with regard to this similarity (bike availability) and the recorded time (time of day or day of week).
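The reduction from one-minute readings to 15-minute entries might look like the sketch below. The slides do not say how readings inside a window were combined, so averaging is an assumption here, as are the record layout `(timestamp, station_id, bikes_available)`, the function name `to_15_minute_entries`, and the synthetic readings.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def to_15_minute_entries(records, station_ids):
    """Average one-minute readings into one vector per 15-minute window.

    records: iterable of (timestamp, station_id, bikes_available).
    Returns {window_start: [mean bikes for each station in station_ids]},
    with None for a station that has no reading in that window.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for ts, station, bikes in records:
        window = ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)
        sums[window][station] += bikes
        counts[window][station] += 1
    entries = {}
    for window in sorted(sums):
        entries[window] = [
            sums[window][s] / counts[window][s] if counts[window][s] else None
            for s in station_ids
        ]
    return entries


# Synthetic readings for two hypothetical stations over 30 minutes.
start = datetime(2013, 8, 29, 12, 0)
records = []
for minute in range(30):
    ts = start + timedelta(minutes=minute)
    records.append((ts, 1, 10))      # station 1 always has 10 bikes
    records.append((ts, 2, minute))  # station 2 gains one bike per minute

entries = to_15_minute_entries(records, station_ids=[1, 2])
for window, vector in entries.items():
    # window.weekday() and window.hour give the day-of-week and time-of-day labels
    print(window, window.weekday(), window.hour, vector)
```

With 70 stations each entry would be a 70-dimensional availability vector, and the window start supplies the day-of-week and time-of-day labels used in the clustering.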
Results: Day of week
K=3 is the optimal k.
[Figure: cumulative distribution function against consensus index value for each candidate K (day of week).]
Results: Day of week (cont.)
[Figure: the probability of each day of the week belonging to one of the three clusters (k=3). Panel labels: Tuesdays, Wednesdays, Thursdays, Mondays, Fridays, Saturdays, Sundays.]
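The probabilities in such a figure can be estimated by counting, for each label, how its entries are spread over the clusters. The sketch below assumes that simple frequency estimate; the function name `label_cluster_probabilities` and the six toy entries are hypothetical and are not results from the study.

```python
from collections import Counter, defaultdict


def label_cluster_probabilities(labels, assignment):
    """Estimate P(cluster | label) from per-entry labels and cluster ids."""
    counts = defaultdict(Counter)
    for label, k in zip(labels, assignment):
        counts[label][k] += 1
    return {
        label: {k: n / sum(c.values()) for k, n in sorted(c.items())}
        for label, c in counts.items()
    }


# Made-up entries: the day-of-week label and assigned cluster of each one.
days = ["Mon", "Mon", "Mon", "Mon", "Sat", "Sat"]
clusters = [0, 0, 0, 1, 2, 2]

probabilities = label_cluster_probabilities(days, clusters)
print(probabilities)
```

The same counting applies to time of day by passing the hour of each entry as the label.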
Results: Day of week (cont.)
[Figure: available bikes of the three clusters for each station in the network.]
Results: Time of day
K=2 is the optimal k.
[Figure: cumulative distribution function against consensus index value for each candidate K (hours).]
Results: Time of day (Cont.)
Peak hours: 8 a.m. to 5 p.m. Non-peak hours: 6 p.m. to 7 a.m.
[Figure: the probability of each hour belonging to one of the two clusters (k=2).]
Results: Time of day (Cont.)
[Figure: available bikes of the two clusters for each station in the network.]
Conclusion
A new supervised clustering algorithm was proposed to overcome the limitations of classical clustering algorithms. It was tested on a BSS in the San Francisco Bay Area, where it was used to anticipate bike availability across the network with respect to time of day and day of week. The results show that the days of the week can be grouped into three clusters, each with an associated pattern of bike availability. The time of day was clustered into two groups, peak and non-peak hours. The exploratory spatial-temporal analysis shows the BSS can be balanced with minimum cost and effort.
Questions?