Horizontal Scaling and Reliability: Planning and Testing for Heavy Load
Steven Goeke, Bill Frikken
Outline
–Project Background
–Our Motivation
–Testing Tools, Techniques, and Methods
–Results
–Conclusions
Background on Georgia Tech
–Six colleges
–16,000 graduate and undergraduate students
–5,000 faculty and staff
–The NSF ranks Tech 2nd in engineering R&D and 4th in industry-sponsored R&D
–Four campuses
Background on WMU
–Carnegie Research Extensive institution
–Seven colleges
–Six regional campuses
–28,000 graduate and undergraduate students
–3,500 faculty and staff
–Business Technology Research Park
Motivation
–It started with Wireless Western: anytime, anywhere access to resources
–A better e-communication infrastructure: multi-platform, open source, replacing an end-of-life system
–Be innovative with the solutions
And Then Along Came SIS
–A much-needed replacement for the student information system
–Eliminate Social Security numbers
–Budget challenges: student records fee
–Take advantage of a portal solution
–GoWMU.wmich.edu: portal delivery
–Content development in 4 weeks!
–SSO (Single Sign-on) capabilities: seamless access to Banner Self-Serve, WebCT, ECS, …
We Want a Portal!
–Facilitate student/faculty communication
–Enhance the student experience
–Prestige
–uPortal or Luminis?
–Banner: 9 years
–WebCT: 4 years
Motivation
–BuzzPort is becoming mission critical
–Expanding user base
–Cost savings
Current GT Architecture
[Diagram: GT network behind firewall(s) and a load balancer; three Luminis 3.2 nodes with Calendar and Portal DBs, connecting to Banner, on a trusted network; a private network with production Banner Self-Service, WebCT, Banner test/development, and others]
GT FOS Architecture
[Diagram: GT network behind firewall(s) and a load balancer; three Resource/Calendar/Portal DB nodes, connecting to Banner, plus web servers (WS) on a trusted network; a private network with production Banner Self-Service, WebCT, Banner test/development, and others]
WMU Architecture: what technologies deliver these various services?
–Sun hardware
–Cisco load balancers
–StorageTek D280 Storage Area Network
–Single enterprise UserID: “Bronco NetID”
–Kerberos
–LDAP: Sun JES Directory
–“Legacy” provisioning services
–Multiple web-authentication schemes
WMU 3-tier architecture
Test and Production Hardware (WMU)
–Test environment
–3 Sun V210s (1.334GHz, 2GB)
–1 backend box: PDS
–2 front-end web servers
–Production environment
–2 backend boxes: Sun V480s (4 × 1.0GHz, 8GB)
–3 front-end: Sun V210s (2 × 1.34GHz, 8GB)
Performance and Growth
–Back-end services are clustered and highly redundant
–Veritas HA Cluster for JES
–Dual drive paths to the SAN
–Front-end services are load-balanced
–Horizontal scaling wherever possible
–Multiple SunFire V1xx and V2xx servers
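The load-balanced front end described above can be sketched as a simple round-robin policy: each new connection is handed to the next web server in the ring. This is a minimal illustration, not the actual Cisco load-balancer configuration; the hostnames are placeholders.

```python
from itertools import cycle

# Hypothetical sketch of round-robin distribution across three
# front-end web servers; hostnames are illustrative placeholders.
SERVERS = ["web1", "web2", "web3"]

class RoundRobin:
    def __init__(self, servers):
        self._ring = cycle(servers)

    def pick(self):
        # Each new connection goes to the next server in the ring.
        return next(self._ring)

lb = RoundRobin(SERVERS)
picks = [lb.pick() for _ in range(6)]
print(picks)  # cycles web1, web2, web3, web1, web2, web3
```

Adding a fourth V210 to the ring scales capacity horizontally without touching the back end, which is the point of the architecture above.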
Testing Tools and Techniques: Georgia Tech
–RadView WebLOAD
–200, 500, 1000 users
–Ramp up over 30 minutes to the target user count
–Sustain the load for 30 minutes
–Simple agenda: login, navigate to a group, post a message, logout
–Measure: login time, first-page time, average page time, and response time
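The ramp-up pattern above (reach the target user count over 30 minutes, then sustain) amounts to starting virtual clients at a constant rate. A minimal sketch of that schedule, assuming an even spacing of client start times:

```python
# Sketch of a linear ramp-up: virtual clients start at a constant
# rate until the target count is reached, then the full load is
# sustained. Numbers mirror the slide (1000 users over 30 minutes).
def ramp_schedule(target_users, ramp_seconds):
    """Return the start offset in seconds for each virtual client."""
    interval = ramp_seconds / target_users
    return [round(i * interval, 3) for i in range(target_users)]

starts = ramp_schedule(1000, 30 * 60)
print(starts[0], starts[1], starts[-1])  # 0.0 1.8 1798.2
```

At 1000 users over 1800 seconds, a new virtual client starts every 1.8 seconds; the actual tools (WebLOAD, JMeter) implement equivalent schedules internally.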
GT Load Test 1
–Date: 3/9/2005
–One web server (280R / 2×1.2GHz / 2GB mem)
–Time: 3:06PM – 3:44PM
–Duration: sec
–1000 sessions
GT Load Test 1 – Results
–Max time to first page: sec (1000VC)
–Max login time: sec (1000VC)
–Average time to 1st page: sec
–Average login time: sec
GT Load Test 2
–Date: 3/9/2005
–Three web servers
–Time: 4:04PM – 5:06PM
–Duration:
–500 sessions
GT Load Test 2 – Results
–Max time to first page: sec (1000VC)
–Max login time: sec (1000VC)
–Average time to 1st page: sec
–Average login time: sec
Test Tools
–JMeter: Apache tool for load testing and performance-based testing and evaluation
–Badboy: exports functional tests for JMeter load testing
–1000 users within 30 minutes
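JMeter writes one row per sample to a results (JTL) file, which can be post-processed to produce figures like the average login times reported below. A minimal sketch, assuming a CSV-format JTL and a sampler labeled "Login" (the label and the sample rows are illustrative, not from the actual test plan):

```python
import csv
import io

# Sketch: compute the average elapsed time for one sampler from a
# JMeter CSV results file. The "elapsed" column is in milliseconds;
# the sampler label "Login" and the rows below are assumptions.
SAMPLE_JTL = """timeStamp,elapsed,label,responseCode
1118246400000,3400,Login,200
1118246402000,950,Homepage,200
1118246405000,3600,Login,200
"""

def avg_elapsed(jtl_file, label):
    times = [int(row["elapsed"])
             for row in csv.DictReader(jtl_file)
             if row["label"] == label]
    return sum(times) / len(times)

avg_ms = avg_elapsed(io.StringIO(SAMPLE_JTL), "Login")
print(f"avg login: {avg_ms / 1000:.2f} s")  # avg login: 3.50 s
```

The same per-label aggregation yields the max and min login times quoted in the SCT-initiated results.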
Test Results – WMU Initiated
–Date: 6/8/
–Users over 30 minutes
–Avg login time: 3.5 seconds
–Avg page load: ~1 second – 2.4 seconds
–Max CPU utilization: 15% server 1, 13% server 2
–Avg session activity: 47 seconds
Test Results – SCT Initiated
–Date: 6/6/
–Users over 4 hours (20-minute ramp up)
–Avg login time: seconds
–Max login: 4.76 seconds
–Min login: seconds
–Avg page load: ~1 second – 2.4 seconds
–Max CPU utilization: 54%, single server
–Session activity over 4 hours
Test Results – Joint Evaluation
–The anticipated environment exceeded expectations
–Two sources provided validation
–Confidence moving ahead
Luminis FOS – Features & Limitations
–Limited failover capability: no session persistence
–Still have single points of failure
–Replicate the LDAP
–Replicate the DB
–Horizontal scalability at the web tier
–Phased patching
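Because sessions are not replicated, a node failure loses the sessions it held; a common mitigation (not something the slides confirm Luminis FOS provides) is sticky routing at the load balancer, pinning each session to one server so requests never migrate mid-session. A minimal sketch of hash-based pinning, with illustrative server names and session IDs:

```python
import hashlib

# Hypothetical sticky-session sketch: hash the session ID to pick a
# server deterministically, so the same session always routes to the
# same node. Server names and session IDs are illustrative.
SERVERS = ["web1", "web2", "web3"]

def pin(session_id, servers=SERVERS):
    digest = hashlib.sha1(session_id.encode()).digest()
    return servers[digest[0] % len(servers)]

# The same session always lands on the same server.
assert pin("JSESSIONID=abc123") == pin("JSESSIONID=abc123")
```

The trade-off is the one the slide names: sticky routing hides the lack of session persistence during normal operation, but a failed node still drops its pinned sessions.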
Conclusions
–Luminis FOS is a significant improvement
–More complex
–Machine allocation
–Will we be implementing it?
Next Steps
–Test result conclusions: a more stable testing environment
–Production considerations: the test needs to resemble production
–Horizontally scale before putting into production
–Remove single points of failure
Critical Success Factors
–Top-level support
–Good planning
–A flexible project plan
–Being “big picture” while still attending to details
–Solid infrastructure
–Relationships
Questions?
Contact Information
–Steven Goeke, Georgia Tech
–BuzzPort: buzzport.gatech.edu
–Bill Frikken, Western Michigan University, Office of Information Technology
–GoWMU portal: gowmu.wmich.edu
GT Load Test 3
–Date: 3/10/2005
–Three web servers
–Time: 1:21PM – 1:47PM
–Duration: sec
–1000 sessions
GT Load Test – Login/1st Page Times
GT Load Test 1 – Page/Connect/Response Time
GT Load Test – Login/1st Page Times
GT Load Test 2 – Page/Connect/Response Time
GT Load Test 3 – Results
–Max time to first page: sec (786VC)
–Max login time: sec (76VC)
–Average time to 1st page: sec
–Average login time: sec
GT Load Test – Login/1st Page Times
GT Load Test 3 – Page/Connect/Response Time
GT Load Test 4
–Date: 3/10/2005
–Three web servers
–Time: 2:18PM – 3:15PM
–Duration: sec
–200 sessions
GT Load Test 4 – Results
–Max time to first page: sec (34VC)
–Max login time: sec (150VC)
–Average time to 1st page: sec
–Average login time: sec
GT Load Test – Login/1st Page Times
GT Load Test 4 – Page/Connect/Response Time
Results (Acadia1, CPU) – 3:06PM-3:44PM (1000VC, 1 Tier); 4:04PM-5:06PM (500VC, 3 Tier)
Results (Acadia1, Free Memory) – 3:06PM-3:44PM (1000VC, 1 Tier); 4:04PM-5:06PM (500VC, 3 Tier)
Results (Acadia2, CPU) – 4:04PM-5:06PM (500VC, 3 Tier)
Results (Acadia2, Free Memory) – 4:04PM-5:06PM (500VC, 3 Tier)
Results (Acadia3, CPU) – 4:04PM-5:06PM (500VC, 3 Tier)
Results (Acadia3, Free Memory) – 4:04PM-5:06PM (500VC, 3 Tier)
Results (Biscayne, CPU) – 3:06PM-3:44PM (1000VC, 1 Tier); 4:04PM-5:06PM (500VC, 3 Tier)
Results (Biscayne, Free Memory) – 3:06PM-3:44PM (1000VC, 1 Tier); 4:04PM-5:06PM (500VC, 3 Tier)
Results (Acadia1, CPU) – 1:21PM-1:47PM (1000VC, 3 Tier)
Results (Acadia1, Free Memory) – 1:21PM-1:47PM (1000VC, 3 Tier)
Results (Acadia2, CPU) – 1:21PM-1:47PM (1000VC, 3 Tier)
Results (Acadia2, Free Memory) – 1:21PM-1:47PM (1000VC, 3 Tier)
Results (Acadia3, CPU) – 1:21PM-1:47PM (1000VC, 3 Tier)
Results (Acadia3, Free Memory) – 1:21PM-1:47PM (1000VC, 3 Tier)
Results (Biscayne, CPU) – 1:21PM-1:47PM (1000VC, 3 Tier)
Results (Biscayne, Free Memory) – 1:21PM-1:47PM (1000VC, 3 Tier)