CS548 Fall 2016 Clustering Showcase


CS548 Fall 2016 Clustering Showcase
By Theresa Inzerillo, Xi Liu, Preston Mueller
Showcasing work by Patrick Vogel, Torsten Greiser, and Dirk Christian Mattfeld on "Understanding Bike-Sharing Systems using Data Mining: Exploring Activity Patterns"
Presenter: Preston

References
[1] P. Vogel, T. Greiser, and D. C. Mattfeld, "Understanding Bike-Sharing Systems using Data Mining: Exploring Activity Patterns," Procedia - Social and Behavioral Sciences, vol. 20, pp. 514–523, 2011.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson Education, 2013.
[3] N. Slonim, N. Friedman, and N. Tishby, "Unsupervised document classification using sequential information maximization," in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02, 2002.
[4] J. C. Dunn, "A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters," Journal of Cybernetics, vol. 3, no. 3, pp. 32–57, 1973.
[5] P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis," Journal of Computational and Applied Mathematics, vol. 20, pp. 53–65, 1987.
[6] D. L. Davies and D. W. Bouldin, "A cluster separation measure," IEEE Transactions on Pattern Analysis and Machine Intelligence, no. 2, pp. 224–227, 1979.

Background
Bike sharing systems and the challenges that come with them.
Presenter: Preston

Bike sharing systems
- Public transportation option for short trips
- Stations located around the city
- Challenge: balancing free bikes and free boxes
Image source: https://upload.wikimedia.org/wikipedia/commons/3/37/Place_de_la_R%C3%A9publique_%28Paris%29%2C_r%C3%A9am%C3%A9nagement%2C_2012-04-05_39.jpg
Presenter: Preston

Geo-BI process: an implementation of knowledge discovery in databases (KDD) [1]
Presenter: Preston

Pre-Processing
Removing instances and aggregating data to prepare the data set for clustering experiments.
Presenter: Xi

Data Set
- Two years of ride data from Vienna's bike-sharing system (BSS) "Citybike Wien"
- Approximately 760,000 rides from 2008 and 2009
- Over 60 stations
- Trip duration calculated from pickup and return time stamps [1]
Presenter: Xi

Pre-Processing
Rides removed from the data set:
- Rides that start or end at test stations
- Rides where bikes are reported as defective or stolen
- Rides that show negative trip durations
- Rides that last less than 60 seconds and start and end at the same station
- Rides involving stations with only a few pickups or returns
After removal, the number of instances decreased by about 2% to 743,000, and the number of stations dropped to 59.
Presenter: Xi
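
The removal rules above can be sketched as a simple filter. This is a minimal illustration, not the authors' implementation: the `Ride` fields, the `flagged` marker, and the `min_station_events` threshold are all hypothetical, since the slides do not show the data schema or the cutoff for "only a few pickups or returns".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Ride:
    # All field names are hypothetical; the real data schema is not shown.
    start_station: str
    end_station: str
    pickup: datetime
    returned: datetime
    flagged: bool = False  # bike reported as defective or stolen

def clean_rides(rides, test_stations, min_station_events):
    kept = []
    for r in rides:
        duration = (r.returned - r.pickup).total_seconds()
        if r.start_station in test_stations or r.end_station in test_stations:
            continue  # ride starts or ends at a test station
        if r.flagged:
            continue  # defective or stolen bike
        if duration < 0:
            continue  # negative trip duration
        if duration < 60 and r.start_station == r.end_station:
            continue  # under 60 seconds and returned to the same station
        kept.append(r)
    # Count pickups plus returns per station, then drop rides that involve
    # a station with fewer than min_station_events (an assumed threshold).
    events = {}
    for r in kept:
        events[r.start_station] = events.get(r.start_station, 0) + 1
        events[r.end_station] = events.get(r.end_station, 0) + 1
    return [r for r in kept
            if events[r.start_station] >= min_station_events
            and events[r.end_station] >= min_station_events]

t = datetime(2008, 5, 1, 8, 0)
rides = [
    Ride("A", "B", t, t + timedelta(minutes=10)),
    Ride("B", "A", t, t + timedelta(minutes=12)),
    Ride("A", "A", t, t + timedelta(seconds=30)),   # short, same station
    Ride("A", "B", t, t - timedelta(minutes=1)),    # negative duration
    Ride("T", "A", t, t + timedelta(minutes=5)),    # test station
    Ride("A", "B", t, t + timedelta(minutes=5), flagged=True),
    Ride("C", "A", t, t + timedelta(minutes=5)),    # station C is rarely used
]
cleaned = clean_rides(rides, test_stations={"T"}, min_station_events=2)
```

The rare-station rule runs last and in a single pass, so station counts are based on rides that already survived the other filters.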

(Figure from [1])
Presenter: Xi

Pre-Processing
- Ride data is aggregated into 24 hourly time windows per day of the week.
- Normalization: (formula shown as an image on the original slide; see [1])
Presenter: Xi
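
The aggregation step can be illustrated with a small sketch. The normalization formula itself did not survive in the transcript, so the version below is an assumption: each station's hourly counts are divided by that station's total, which turns every station into a 24-value activity profile that sums to 1. The function name and input layout are hypothetical, and the sketch handles a single day type only.

```python
def hourly_profile(pickup_hours_by_station):
    """Map each station to 24 normalized hourly pickup shares.

    Assumption: counts are divided by the station total. The formula
    used in [1] is not reproduced in the slides and may differ.
    """
    profiles = {}
    for station, hours in pickup_hours_by_station.items():
        counts = [0] * 24
        for h in hours:
            counts[h] += 1  # one time window per hour of the day
        total = sum(counts)
        profiles[station] = [c / total if total else 0.0 for c in counts]
    return profiles

# Pickup hours (0-23) per station; toy data for illustration only.
profiles = hourly_profile({"A": [8, 8, 9, 17], "B": [17, 18]})
```

Normalizing per station makes busy and quiet stations comparable by the shape of their daily activity rather than by raw volume, which is what a clustering of activity patterns needs.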

Data Mining
Using k-means, Expectation-Maximization, and the sequential information bottleneck algorithm to cluster the data.
Presenter: Xi

Clustering
- Hopkins statistic: compares nearest-neighbor distances measured from randomly generated points with those measured from actual data points, to check whether the data has any clustering tendency.
- Applied to the normalized pickups and returns at stations, the Hopkins statistic yields a value of 0.743.
Sources: http://www.vub.ac.be/fabi/multi/pcr/chaps/chap8.html, https://www.mathworks.com/help/stats/class_clustering.evaluation.daviesbouldinevaluation_plot4.png
Presenter: Xi
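
A minimal sketch of one common form of the Hopkins statistic follows. It is an assumption that this matches the variant used in [1]: conventions differ between sources (some swap the roles of the two sums, some raise distances to the power of the dimension). In the convention used here, values near 0.5 suggest randomly distributed data and values near 1 suggest clustered data. The function names and the toy data are illustrative.

```python
import math
import random

def nearest_distance(p, points, skip_index=None):
    # Distance from p to its nearest neighbor in points, optionally skipping p itself.
    return min(math.dist(p, q) for i, q in enumerate(points) if i != skip_index)

def hopkins(data, m, rng):
    dims = len(data[0])
    lows = [min(p[d] for p in data) for d in range(dims)]
    highs = [max(p[d] for p in data) for d in range(dims)]
    # u: distances from m uniformly random points (inside the bounding box
    # of the data) to their nearest actual data point.
    u = [nearest_distance([rng.uniform(lows[d], highs[d]) for d in range(dims)], data)
         for _ in range(m)]
    # w: distances from m sampled actual points to their nearest other data point.
    w = [nearest_distance(data[i], data, skip_index=i)
         for i in rng.sample(range(len(data)), m)]
    return sum(u) / (sum(u) + sum(w))

rng = random.Random(0)
blob_a = [(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(50)]
blob_b = [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(50)]
h = hopkins(blob_a + blob_b, m=20, rng=rng)
```

For two tight, well-separated blobs such as these, random points are typically far from any data point while data points sit close to their neighbors, so `h` comes out well above 0.5.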

K-means algorithm
- Iterative, centroid-based process
- Assumes clusters are spherical [2]
Presenter: Xi

K-means algorithm [2]
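
The iterative, centroid-based process can be sketched in a few lines. This is a plain illustration of the standard assignment and update steps, not the tooling used in [1]. Initial centroids are commonly chosen at random; here they are passed in explicitly (an assumption made to keep the example deterministic), and all names are illustrative.

```python
import math

def kmeans(points, initial_centroids, max_iterations=100):
    centroids = list(initial_centroids)
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the process has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Two small, well-separated grids of points as toy data.
blob_a = [(0.1 * x, 0.1 * y) for x in range(5) for y in range(5)]
blob_b = [(10 + 0.1 * x, 10 + 0.1 * y) for x in range(5) for y in range(5)]
points = blob_a + blob_b
labels, centroids = kmeans(points, initial_centroids=[points[0], points[-1]])
```

Because points are assigned by Euclidean distance to a single center, the method favors compact, roughly spherical clusters, which is the assumption noted on the slide.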

Expectation-Maximization
- Similar to k-means
- Iteratively changes the model parameters to maximize the likelihood of the data, giving each data point a probability of belonging to each cluster [2]
Presenter: Theresa

Expectation-Maximization
Source: http://www.ohio.edu/people/yl079811/tutorials/
Presenter: Theresa
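
As a rough illustration of Expectation-Maximization, the sketch below fits a one-dimensional mixture of Gaussians. The choice of a Gaussian mixture, the starting values, and all names are assumptions made for the example; the slides do not state which model or settings were used in [1].

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, means, variances, weights, iterations=50):
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    resp = []
    for _ in range(iterations):
        # E-step: probability that each component generated each point.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the model parameters from those soft assignments.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[j] = nj / len(xs)
    return means, variances, weights, resp

xs = [0.8, 0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1, 5.2]
means, variances, weights, resp = em_gmm_1d(
    xs, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5])
# Hard labels, if needed, take the most probable component for each point.
hard_labels = [max(range(2), key=lambda j: r[j]) for r in resp]
```

Unlike k-means, every point keeps a probability for each cluster; the hard labels in the last line are only a convenience for comparison.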

Sequential information bottleneck algorithm (sIB) [3]
Presenter: Preston

Validation of clusters
Using the Dunn, Silhouette, and Davies-Bouldin indices to evaluate the clustering algorithms used.
Presenter: Preston

Dunn Index
- A measure of cluster compactness and separation
- Numerator: smallest distance between clusters
- Denominator: largest distance between data points within a cluster [4]
Presenter: Preston
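
A minimal sketch of the Dunn index, assuming the common formulation in which cluster separation is the smallest distance between points of different clusters and cluster size is the largest distance between two points of the same cluster. Other variants exist, and the slides do not say which one [1] used. Higher values indicate compact, well-separated clusters.

```python
import math

def dunn_index(clusters):
    """clusters: a list of clusters, each a list of coordinate tuples."""
    # Numerator: smallest distance between points in different clusters.
    separation = min(math.dist(p, q)
                     for i, a in enumerate(clusters)
                     for b in clusters[i + 1:]
                     for p in a for q in b)
    # Denominator: largest distance between two points in the same cluster.
    diameter = max(math.dist(p, q) for c in clusters for p in c for q in c)
    return separation / diameter

toy_clusters = [[(0.0, 0.0), (0.0, 1.0)], [(4.0, 0.0), (4.0, 1.0)]]
dunn = dunn_index(toy_clusters)
```

In the toy data the clusters are 4 units apart and each is 1 unit wide, so the ratio is 4.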

Silhouette Index
- A measure of cohesion and separation
- Goal: maximize [5]
Source: http://blog.data-miners.com/2011/03/cluster-silhouettes.html
Presenter: Preston
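
The silhouette value of a single point can be sketched as follows, using the standard definition from [5]: `a` is the mean distance to the other points in the same cluster (cohesion), `b` is the smallest mean distance to the points of any other cluster (separation), and the score is `(b - a) / max(a, b)`. The function name and toy data are illustrative.

```python
import math

def silhouette_scores(clusters):
    """Return one silhouette value per point, in cluster order."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            if len(cluster) == 1:
                scores.append(0.0)  # usual convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: mean distance to the nearest other cluster.
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for oi, other in enumerate(clusters) if oi != ci)
            scores.append((b - a) / max(a, b))
    return scores

toy_clusters = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
scores = silhouette_scores(toy_clusters)
mean_silhouette = sum(scores) / len(scores)
```

Scores lie between -1 and 1; averaging them over all points gives a single value for a whole clustering, which is the quantity to maximize.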

Davies-Bouldin Index
- Goal: minimize
- Centroid focused [6]
Image source: http://www.turingfinance.com/wp-content/uploads/2015/01/Davies-Bouldin-Index1.png
Presenter: Theresa
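
A minimal sketch of the Davies-Bouldin index, assuming the usual Euclidean formulation: each cluster's scatter is the mean distance of its points to the centroid, and for each cluster the worst ratio of summed scatters to centroid distance is averaged over all clusters. Names and toy data are illustrative.

```python
import math

def davies_bouldin(clusters):
    """clusters: a list of clusters, each a list of coordinate tuples."""
    centroids = [tuple(sum(axis) / len(c) for axis in zip(*c)) for c in clusters]
    # Scatter: mean distance of a cluster's points to its centroid.
    scatter = [sum(math.dist(p, m) for p in c) / len(c)
               for c, m in zip(clusters, centroids)]
    k = len(clusters)
    # For each cluster, take its worst (largest) similarity to any other cluster.
    worst = [max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                 for j in range(k) if j != i)
             for i in range(k)]
    return sum(worst) / k

toy_clusters = [[(0.0, 0.0), (0.0, 2.0)], [(6.0, 0.0), (6.0, 2.0)]]
db = davies_bouldin(toy_clusters)
```

Here each cluster has scatter 1 and the centroids are 6 apart, giving (1 + 1) / 6. Lower values mean tighter clusters that are farther apart, which is why the goal is to minimize.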

Dunn, Silhouette and Davies-Bouldin Indices [1]
Presenter: Theresa

Results & Discussion
Taking a look at the clusters identified and what we can learn from them.
Presenter: Theresa

Temporal Clusters Identified
- Return Morning Pickup Evening (RMPE)
- Pickup Morning Return Evening (PMRE)
- Active Night Pickup Morning (ANPM)
- Average (AVG)
- Active Daytime (AD)
Presenter: Theresa

Temporal Clustering [1]
Presenter: Theresa

Spatial Clusters [1]
Presenter: Theresa

Any questions?