Smarter Outlier Detection and Deeper Understanding of Large-Scale Taxi Trip Records: A Case Study of NYC Jianting Zhang Department of Computer Science The City College of New York
Outline Introduction Background and Related Work Method and Discussions Experiments and Results Summary
Introduction
Taxi trip records
– ~300 million trips in about two years
– ~170 million trips (300 million passengers) in /5 of that of subway riders and 1/3 of that of bus riders in NYC
– The dataset is not perfect...
– 13,000 Medallion taxi cabs
– License priced at $600,000 in 2007
– Car services and taxi services are separate
– Only taxis with a Medallion license can be hailed (the rule may be changing outside Manhattan...)
Introduction
Fields in each trip record:
– Medallion#, Shift#, Trip#
– Trip_Pickup_DateTime, Trip_Dropoff_DateTime
– Trip_Pickup_Location, Trip_Dropoff_Location
– Start_Lon, Start_Lat, End_Lon, End_Lat
– Payment_Type, Surcharge, Total_Amt, Rate_Code, Passenger_Count, Fare_Amt, Tolls_Amt, Tip_Amt
– Trip_Time, Trip_Distance
– vendor_name, date_loaded, store_and_forward
– time_between_service, distance_between_service
– Start_Zip_Code, End_Zip_Code
– start_x, start_y, end_x, end_y (local projection)
Messed up on purpose due to privacy concerns
Introduction
In addition:
– Some of the data fields are empty
– Pickup and drop-off locations can be in the Hudson River
– The recorded trip distance/duration can be unreasonable
– ...
Outlier detection is needed for data cleaning
Handling 170 million trips is easier with the help of U2SOD-DB
Background and Related Work
Existing approaches to outlier detection for urban computing:
– Thresholding: e.g. 200 m < dist < 30 km
– Locating values in unusual ranges of distributions
– Spatial analysis: within a region or a land use type
– Matching trajectories with road segments and treating unmatched ones as outliers
Some techniques require complete GPS traces, while we only have O-D locations
Large-scale shortest path computation has not been used for outlier detection
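The thresholding approach listed above can be illustrated with a minimal sketch. The `Trip` record and the function name are hypothetical; only the 200 m and 30 km bounds come from the slide's example.

```python
from dataclasses import dataclass

@dataclass
class Trip:
    trip_id: int
    distance_m: float  # recorded trip distance in meters (hypothetical field)

MIN_DIST_M = 200.0     # lower bound from the slide's example
MAX_DIST_M = 30_000.0  # upper bound from the slide's example (30 km)

def is_threshold_outlier(trip, lo=MIN_DIST_M, hi=MAX_DIST_M):
    """Flag a trip whose recorded distance is not strictly inside (lo, hi)."""
    return not (lo < trip.distance_m < hi)

# Three made-up trips: too short, plausible, too long.
trips = [Trip(1, 150.0), Trip(2, 4200.0), Trip(3, 45000.0)]
outlier_ids = [t.trip_id for t in trips if is_threshold_outlier(t)]
print(outlier_ids)
```

A fixed threshold is cheap to apply but ignores where the trip took place, which is one reason the network-based check is worth considering.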
Background and Related Work
Shortest path computation
– Dijkstra and A*
– New generation algorithms
– Contraction Hierarchy (CH) based
Open source implementations of CH: MoNav, OSRM
Much faster than the ArcGIS Network Analyst (NA) module
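For reference, the classic baseline named above (Dijkstra) can be sketched in a few lines. This is a textbook version on a made-up toy graph, not the CH-based routing used in the study; the graph, node names and weights are assumptions for illustration.

```python
import heapq

def dijkstra(graph, source, target):
    """Return (distance, path) from source to target, or (inf, []) if unreachable.

    graph maps each node to a list of (neighbor, non-negative weight) pairs.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue  # stale heap entry
        visited.add(u)
        if u == target:
            break  # the target's distance is final once it is popped
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    # Walk predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path

# Toy directed street graph with made-up edge lengths.
toy_graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0), ("D", 5.0)],
    "C": [("D", 1.0)],
    "D": [],
}
distance, path = dijkstra(toy_graph, "A", "D")
print(distance, path)
```

Contraction Hierarchies answer the same query but rely on a preprocessing step that adds shortcut edges, which is what makes millions of queries practical.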
Background and Related Work
Network Centrality (Brandes, 2008)
– Node based
– Edge based
Can be easily derived after shortest paths are computed
Mapping node/edge betweenness centrality can reveal the connection strengths among different parts of cities
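One way to read "easily derived after shortest paths are computed" is to count how many trips pass through each node and edge. The sketch below does exactly that on made-up paths; it is a trip-weighted usage count in the spirit of betweenness, not the exact algorithm from Brandes (2008), and it includes path endpoints in the node counts. The function name and the input layout are assumptions.

```python
from collections import Counter

def accumulate_centrality(paths_with_counts):
    """Sum trip counts over every node and directed edge on each path.

    paths_with_counts: iterable of (path, trips), where path is a list of
    node ids and trips is the number of trips that share this path.
    """
    node_score = Counter()
    edge_score = Counter()
    for path, trips in paths_with_counts:
        for node in path:
            node_score[node] += trips
        for u, v in zip(path, path[1:]):
            edge_score[(u, v)] += trips
    return node_score, edge_score

# Made-up shortest paths, each shared by a number of trips.
paths = [
    (["A", "B", "C", "D"], 3),
    (["B", "C"], 2),
    (["A", "B"], 1),
]
node_score, edge_score = accumulate_centrality(paths)
print(node_score["B"], edge_score[("B", "C")])
```

Weighting each path by its trip count means a path shared by many trips only has to be computed once.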
Method and Discussions
Processing flow:
– Input: raw taxi trip data
– Match pickup/drop-off point locations to street segments within distance D0
– Successful? If not: Type I outlier (spatial analysis)
– Assign pickup/drop-off nodes by picking the closer ones
– Aggregate unique (sid,tid) pairs
– Compute shortest path
– CD > D1 AND CD > W*RD? If so: Type II outlier (network analysis)
– Otherwise: update centrality measurements
CD: computed shortest distance; RD: recorded trip distance
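The two tests in this flow can be sketched as a single classification function. The function name, argument names and return labels are hypothetical, and the snapping and routing steps are assumed to have been done elsewhere; the default parameter values are the ones reported in the experiments (D0 = 200 feet, D1 = 3 miles, W = 2).

```python
D0_FEET = 200.0   # maximum snapping distance to a street segment
D1_MILES = 3.0    # minimum computed distance for the Type II test
W = 2.0           # ratio of computed to recorded distance

def classify_trip(pickup_snap_ft, dropoff_snap_ft, computed_mi, recorded_mi,
                  d0=D0_FEET, d1=D1_MILES, w=W):
    """Return 'type_I', 'type_II' or 'ok' for one trip.

    pickup_snap_ft / dropoff_snap_ft: distance from each point to its nearest
    street segment (assumed precomputed). computed_mi is CD and recorded_mi
    is RD in the notation of the slide.
    """
    # Type I: either end cannot be matched to a segment within D0.
    if pickup_snap_ft > d0 or dropoff_snap_ft > d0:
        return "type_I"
    # Type II: CD > D1 AND CD > W * RD.
    if computed_mi > d1 and computed_mi > w * recorded_mi:
        return "type_II"
    return "ok"

print(classify_trip(50.0, 300.0, 2.0, 2.0))   # drop-off too far from any street
print(classify_trip(50.0, 60.0, 8.0, 1.5))    # shortest path far exceeds the record
print(classify_trip(50.0, 60.0, 2.5, 1.0))    # ratio is high but CD is under D1
```

Requiring both conditions keeps short trips, where small absolute errors produce large ratios, from being flagged.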
Method and Discussions
The approach is approximate in nature
– Taxi drivers do not always follow the shortest path
– Especially for short trips and in heavily congested areas
– But we only care about aggregated centrality measurements, and the errors have a chance to cancel each other out
Increasing D0 will reduce the number of Type I outliers, but the locations might be mismatched with segments
Reducing D1 and/or W will increase the number of Type II outliers but may generate false positives
Experiments and Results Overall distributions of trip distance, time, speed and fare
Experiments and Results
Mapping of Computed Shortest Paths Overlaid with NYC Community Districts Map
– D0 = 200 feet, D1 = 3 miles, W = 2
– 166 million trips, 25 million unique
– ~2.5 million (1.5%) Type I outliers
– ~18,000 Type II outliers
– Shortest path computation completes in less than 2 hours (5,952 seconds) on a single CPU core (2.26 GHz)
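The reported figures can be cross-checked with a little arithmetic. The sketch below uses only the rounded numbers from the slide; reading "25 million unique" as unique (sid,tid) pairs, with one shortest-path query per pair, is an assumption.

```python
total_trips = 166_000_000
type1_outliers = 2_500_000     # approximate, as reported
type2_outliers = 18_000        # approximate, as reported
unique_pairs = 25_000_000      # assumed to be unique (sid,tid) pairs
runtime_s = 5_952

type1_share = type1_outliers / total_trips
type2_share = type2_outliers / total_trips
runtime_hours = runtime_s / 3600
pairs_per_second = unique_pairs / runtime_s  # assumes one query per pair

print(f"Type I share:  {type1_share:.2%}")
print(f"Type II share: {type2_share:.4%}")
print(f"Runtime:       {runtime_hours:.2f} hours")
print(f"Throughput:    {pairs_per_second:.0f} pairs per second")
```

Under that assumption, aggregating 166 million trips into 25 million pairs cuts the number of shortest-path queries by a factor of about 6.6.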
Experiments and Results Examples of Detected Type II Outliers
Experiments and Results Mapping Betweenness Centralities (All hours)
Experiments and Results Mapping Betweenness Centralities (bi-hourly) Panels: 00H, 02H, 04H, 06H, 08H, 10H, 12H, 14H, 16H, 18H, 20H, 22H
Summary
– Large-scale taxi trip records are error-prone due to a combination of device, human and information system induced errors, so outlier detection and data cleaning are important preprocessing steps.
– Our approach detects outliers that cannot be snapped to street segments (through spatial analysis) and/or have significant differences between computed shortest distances and recorded trip distances (through network analysis).
– The work is preliminary: a more comprehensive framework is needed (e.g., incorporating pickup and drop-off times, trip duration and fare information).
– It would be interesting to generate dynamic betweenness maps under different traffic conditions, e.g., peak/non-peak, morning/afternoon and weekdays/weekends, and to explore connection strengths among NYC regions.
Q&A