Securing Web Service by Automatic Robot Detection

KyoungSoo Park, Vivek S. Pai (Princeton University)
Kang-Won Lee, Seraphin Calo (IBM T.J. Watson Research Center)

KyoungSoo Park, USENIX

Web Robots
- Automatic agents
  - Web crawlers
  - URL link checkers
- Malicious robots are widespread
  - Password cracking
  - Referrer/blog spamming
  - Click fraud on Google search
  - Burning CPU with heavy CGI queries

Contributions
- Real-time robot detector
  - Fast detection: 80% at 20 requests, 95% at 57 requests
  - High accuracy: 2.4% max false positive rate
  - Low overhead: ~200 usec additional delay per page
  - Easy deployment

Operational Scenario
- Server-side: site web server, many-to-one
- Client-side: firewall/proxies at LAN, many-to-many
[Figure: clients, MON, and servers; labels "Server infrastructure" and "Client infrastructure"]

Design Goals
- Transparency: no human intervention
- Accuracy: minimal false positives
- Real-time proof
  - Periodic check should be possible
  - Authentication or CAPTCHA not enough
- Practicality

Observation & Intuition
- Robot behavior
  - Custom program
  - Goal-oriented
  - No embedded objects
  - No index file
  - Follow hidden links
  - No hardware events
- Human behavior
  - Standard browsers
  - Browsing purpose
  - Cascading style sheets
  - Images
  - Never follow hidden links
  - Mouse & keyboard
- Humans are easier to detect

Browser Detection
- "No standard browser" implies robot
- "User-Agent" HTTP header?
- Use behavioral artifacts (dynamic modifications)
  - Redundant embedded objects
    - Empty cascading style sheet (CSS)
    - Invisible images (1x1 JPEG) or mute sounds
  - Hidden links
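
The dynamic modifications listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `/__probe/` URL prefix, the function names, and the use of `display:none` for the hidden link are all assumptions made for the example. The idea is that a standard browser fetches the empty style sheet and the 1x1 image automatically, while only a link-following program requests the hidden link.

```python
import secrets

PROBE_PREFIX = "/__probe/"  # hypothetical URL namespace reserved for probe objects


def make_probes() -> dict:
    """Create one unguessable URL per probe kind for a single client session."""
    kinds = (("css", "css"), ("image", "jpg"), ("hidden_link", "html"))
    return {kind: f"{PROBE_PREFIX}{secrets.token_hex(8)}.{ext}" for kind, ext in kinds}


def instrument(html: str, probes: dict) -> str:
    """Insert the probe objects into an HTML page before it is served."""
    head = f'<link rel="stylesheet" href="{probes["css"]}">'
    body = (
        f'<img src="{probes["image"]}" width="1" height="1" alt="">'
        # Assumption: hiding the anchor with CSS is one possible way to make
        # a link that a person never sees or clicks.
        f'<a href="{probes["hidden_link"]}" style="display:none"></a>'
    )
    html = html.replace("</head>", head + "</head>", 1)
    return html.replace("</body>", body + "</body>", 1)


def probe_kind(path: str, probes: dict):
    """Return which probe a requested path corresponds to, or None."""
    for kind, url in probes.items():
        if path == url:
            return kind
    return None
```

A server or proxy would call `make_probes()` once per client, pass every outgoing page through `instrument()`, and record the result of `probe_kind()` for each incoming request. Random URLs keep a robot from hard-coding the probe names.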

Human Activity Detection
- Human activities imply human
- Mouse/keyboard event tracking
  - Most robots don't generate hardware events
- Dynamically embed JavaScript code
  - MouseMove triggers the event handler
  - Event handler fetches a fake image
  - Semantically & lexically obfuscated
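
A rough sketch of the embedded script, generated server-side, might look like the following. The `/__beacon/` path and both function names are hypothetical, and the obfuscation step mentioned above is left out entirely; the sketch only shows a mouse-move handler that fetches a fake image once.

```python
import secrets


def new_beacon_url() -> str:
    """Return a fresh, unguessable fake-image URL (hypothetical path layout)."""
    return f"/__beacon/{secrets.token_hex(8)}.jpg"


def mouse_beacon_script(beacon_url: str) -> str:
    """Return a <script> element whose mousemove handler fetches beacon_url once."""
    return (
        "<script>\n"
        "(function () {\n"
        "  function onMove() {\n"
        "    // Report only the first movement, then stop listening.\n"
        "    document.removeEventListener('mousemove', onMove);\n"
        f"    new Image().src = '{beacon_url}';\n"
        "  }\n"
        "  document.addEventListener('mousemove', onMove);\n"
        "})();\n"
        "</script>"
    )
```

When a request for the beacon URL arrives, the server can mark that session as showing human activity. A robot that merely downloads every URL it finds in the page source could trigger the beacon too, which is one reason the slide calls for obfuscating the script.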

Test with CoDeeN
- CoDeeN
  - Pull-based CDN on PlanetLab over 3 years
  - 25+ million requests from 50K clients per day
  - Malicious robots seeking abuse
- Results for a 1-week measurement
  - But the changes are now permanent

Main Result
- Robots: 71.1%
- CSS fetch: 28.9%
- JavaScript exec: 27.1%
- MouseMove: 22.3%
[Figure annotations: "Not sure, but human"; "Potential FP, 1.9%"; "JS but no MouseMove"; "Robots"]
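
One way to read these categories is as a per-session decision over the collected evidence. The sketch below is an assumption-laden illustration: the field names, the three labels, and the order in which the rules are checked are chosen for the example and are not taken from the talk. It encodes only what the slides state: humans never follow hidden links, mouse movement implies a human, and a client that does not fetch the CSS probe is not a standard browser.

```python
from dataclasses import dataclass


@dataclass
class SessionEvidence:
    """Signals observed for one client session (hypothetical field names)."""
    css_fetched: bool = False
    mouse_moved: bool = False
    hidden_link_followed: bool = False


def classify(ev: SessionEvidence) -> str:
    """Label a session as 'robot', 'human', or 'undetermined'."""
    if ev.hidden_link_followed:
        return "robot"         # humans never follow hidden links
    if ev.mouse_moved:
        return "human"         # human activity implies human
    if not ev.css_fetched:
        return "robot"         # no standard browser implies robot
    return "undetermined"      # looks like a browser, but no activity seen yet
```

For example, `classify(SessionEvidence(css_fetched=True))` yields `"undetermined"`, which corresponds to sessions that behave like a browser but have not yet produced a mouse event.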

Main Result
- Robots: 71.1%
- CSS fetch: 28.9%
- Max false positive rate = FP / negatives = [...] / Robots = 1.9 / 77.7 = 2.4%
- Only 9% passed the (optional) CAPTCHA
- Only 0.9% followed hidden links
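
The arithmetic behind these figures can be checked in a few lines. The variable names are made up for the example. Note that 77.7 equals 100 minus the 22.3% MouseMove share reported in the results; treating the denominator as "sessions without mouse movement" is an observation about the numbers, not a definition given on the slide.

```python
# Percentages reported on the result slides.
css_fetch = 28.9      # sessions that fetched the CSS probe
mouse_move = 22.3     # sessions that generated a MouseMove event
potential_fp = 1.9    # sessions flagged as potential false positives

# Sessions that never fetched the CSS probe: the "Robots" share.
robots = round(100 - css_fetch, 1)

# Assumption: the 77.7 denominator is the share without mouse movement.
no_mouse = round(100 - mouse_move, 1)

# Maximum false positive rate, in percent.
max_fp_rate = round(100 * potential_fp / no_mouse, 1)

print(robots, no_mouse, max_fp_rate)  # prints: 71.1 77.7 2.4
```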

How Fast Can We Detect?
- 80% at 20 requests
- 95% at 57 requests
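
Figures of this kind can be derived from per-session logs by taking a percentile over the request count at which each session was first detected. The helper below is a sketch of that calculation; the function name is hypothetical and the sample numbers are invented purely to exercise it, not measurements from CoDeeN.

```python
import math


def requests_needed(first_detection: list, fraction: float) -> int:
    """Smallest request count by which `fraction` of sessions were detected.

    first_detection[i] is the number of requests session i had made when
    it was first detected.
    """
    if not first_detection:
        raise ValueError("need at least one session")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ordered = sorted(first_detection)
    rank = math.ceil(fraction * len(ordered))  # 1-based rank of the percentile
    return ordered[rank - 1]


# Invented sample: five sessions detected after 2, 4, 6, 8 and 10 requests.
sample = [10, 2, 8, 4, 6]
print(requests_needed(sample, 0.8))  # prints: 8
```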

# of CoDeeN Complaints
[Figure: number of CoDeeN complaints, annotated with "Browser Detection" and "Human Activity Detection"]

Limitations
- Defeating browser detection
  - Behave exactly like a standard browser
- Human activity detection
  - Robots generating mouse/key events
  - Disable JavaScript: 4%
- Solution
  - Ensemble techniques

Machine Learning (AdaBoost)
- Three most effective attributes
  1. RESPONSE CODE 3XX %
  2. REFERRER %
  3. UNSEEN REFERRER %
- Drawbacks
  1. Heavy computation/memory
  2. Pattern may change
  3. Human intervention
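
To make the three attributes concrete, the sketch below computes them for one session. The definitions are assumptions made for illustration: each attribute is taken as a percentage of the session's requests, and an "unseen" referrer is taken to mean a referrer URL that the client has not itself requested earlier in the session. The slide does not spell these out, and the `Request` type is hypothetical. The AdaBoost training step is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """One logged request (hypothetical record layout)."""
    url: str
    status: int
    referrer: Optional[str] = None


def session_features(requests: list) -> dict:
    """Compute the three per-session attributes as percentages of requests."""
    if not requests:
        raise ValueError("session has no requests")
    seen_urls = set()
    redirects = with_referrer = unseen_referrer = 0
    for req in requests:
        if 300 <= req.status < 400:
            redirects += 1
        if req.referrer:
            with_referrer += 1
            # Assumed meaning of "unseen": not fetched earlier in this session.
            if req.referrer not in seen_urls:
                unseen_referrer += 1
        seen_urls.add(req.url)
    total = len(requests)
    return {
        "response_code_3xx_pct": 100.0 * redirects / total,
        "referrer_pct": 100.0 * with_referrer / total,
        "unseen_referrer_pct": 100.0 * unseen_referrer / total,
    }
```

Feature vectors like this one would be the input to a classifier such as AdaBoost. Keeping per-session URL sets for every client is part of the memory cost that the drawbacks list points to.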

Conclusions
- Practical robot detection tool
- Detect humans by
  - Standard browser behavior
  - Human activities
- "Arms race" in the end
  - Turing test
  - Most simple bots screened out
- Ensemble techniques promising