NCBI Grid Presentation
NCBI Grid Structure
[Diagram: NetCache, NetSchedule, Load Balancer (LBSM), Worker Nodes, CGI Gateway]
NetCache
Problems:
1. HTTP/CGI is a stateless protocol
   Every CGI call is a new run with no memory of previous calls
   We need session state storage compatible with our Load Balancer
2. Storing information in files does not always work
   - File system overflow
   - Not protected against failures
   - Hard to load balance
   - No access log
   - Network access can be problematic (maintenance issues)
NetCache
Design objectives:
1. BLOB ID can be used in Web apps
   URLs, HTML, cookies
2. Universal temporary BLOB storage
   Can store session info, graphics, sequences, ASN.1, XML
3. Automatic removal of old, unused objects
   Garbage collection
4. Compatible with NCBI Load Balancing
5. Can work on off-the-shelf hardware
6. High availability, automatic recovery after failures
7. Easy to scale economically by adding components
   * No RDBMS license
   * Any Linux, Unix, Windows box can be a NetCache host
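Objectives 1 to 3 (an ID that fits in a URL or cookie, temporary BLOB storage, automatic removal of unused objects) can be illustrated with a small in-memory sketch. This is not the NetCache API: the class `TempBlobStore`, its methods, and the `blob_N` ID format are hypothetical names chosen for illustration, and a real service would be networked and persistent.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical in-memory stand-in for a temporary BLOB store.
// None of these names come from NetCache itself.
class TempBlobStore {
public:
    using Clock = std::chrono::steady_clock;

    explicit TempBlobStore(std::chrono::seconds ttl) : ttl_(ttl) {}

    // Store a BLOB and return a short text ID that fits in a URL or cookie.
    // Sequential IDs keep the sketch simple.
    std::string Put(std::string data, Clock::time_point now = Clock::now()) {
        std::string id = "blob_" + std::to_string(next_id_++);
        blobs_.emplace(id, Entry{std::move(data), now});
        return id;
    }

    // Fetch a BLOB by ID; a successful read refreshes its "last used" time.
    std::optional<std::string> Get(const std::string& id,
                                   Clock::time_point now = Clock::now()) {
        auto it = blobs_.find(id);
        if (it == blobs_.end()) return std::nullopt;
        it->second.last_used = now;
        return it->second.data;
    }

    // Garbage collection: drop BLOBs unused for longer than the TTL.
    // Returns the number of BLOBs removed.
    std::size_t Purge(Clock::time_point now = Clock::now()) {
        std::size_t removed = 0;
        for (auto it = blobs_.begin(); it != blobs_.end();) {
            if (now - it->second.last_used > ttl_) {
                it = blobs_.erase(it);
                ++removed;
            } else {
                ++it;
            }
        }
        return removed;
    }

private:
    struct Entry {
        std::string data;
        Clock::time_point last_used;
    };

    std::chrono::seconds ttl_;
    std::uint64_t next_id_ = 1;
    std::unordered_map<std::string, Entry> blobs_;
};
```

A CGI would call `Put` with its session data, place the returned ID in a cookie or URL, and a later CGI call would pass that ID to `Get`. The explicit `now` parameter exists only so that expiry can be exercised without waiting.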
NetCache
[Diagram, built up over several slides: CGI, Load Balancer, NetCache, BLOB, BLOB ID, and a second CGI]
NetCache
Typical use cases:
1. Store session info
2. Graphics generated by CGIs
3. Caching results of computational algorithms
4. Cache results of expensive DBMS or search system queries
5. Data exchange between programs
NetSchedule
Typical CGI web call scenario:
[Diagram, built up over several slides: a CGI running "#include for (int i = 0; i < 10000; ++i) { …. }" against a 30 sec timeout, ending with "Expired!"]
Reproduced with permission from Oleg O. Moiseyenko
Why do timeouts happen?
- Peak load hours. During peak hours the number of resource-hungry tasks exceeds the available CPU time.
- CGI is used as a platform to implement complex, computationally intensive algorithms
- Execution time depends on web user input
  - user can specify complex criteria
  - user can upload a lot of data
Worker Node
[Diagram, built up over several slides: CGI, NetSchedule, Worker Node; NetSchedule JOB ID; the Worker Node runs "#include for (int i = 0; i < 10000; ++i) { …. }"; Progress Report]
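The elements named on these slides (a CGI holding a NetSchedule JOB ID, a Worker Node running the long loop, a Progress Report) suggest a submit-then-poll flow. The sketch below is one possible reading of that flow, not the NetSchedule protocol: `JobBoard`, `JobInfo`, `JobState`, and every method name are hypothetical, and everything runs in one process.

```cpp
#include <deque>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical job states; the real service may track more.
enum class JobState { Pending, Running, Done };

struct JobInfo {
    JobState state = JobState::Pending;
    std::string input;
    std::string progress;  // last progress message from the worker
    std::string output;
};

// Hypothetical in-process stand-in for the scheduler's job bookkeeping.
class JobBoard {
public:
    // CGI side: submit work and return at once with a job ID,
    // so the web call never waits on the computation itself.
    std::string Submit(std::string input) {
        std::string id = "job_" + std::to_string(next_id_++);
        JobInfo info;
        info.input = std::move(input);
        jobs_.emplace(id, std::move(info));
        pending_.push_back(id);
        return id;
    }

    // Worker side: take the oldest pending job, if any.
    std::optional<std::string> TakeNext() {
        if (pending_.empty()) return std::nullopt;
        std::string id = pending_.front();
        pending_.pop_front();
        jobs_.at(id).state = JobState::Running;
        return id;
    }

    // Worker side: publish a progress message for a job.
    void ReportProgress(const std::string& id, std::string message) {
        jobs_.at(id).progress = std::move(message);
    }

    // Worker side: store the result and mark the job done.
    void Complete(const std::string& id, std::string output) {
        JobInfo& job = jobs_.at(id);
        job.output = std::move(output);
        job.state = JobState::Done;
    }

    // CGI side: poll by job ID; returns nullptr for an unknown ID.
    const JobInfo* Query(const std::string& id) const {
        auto it = jobs_.find(id);
        return it == jobs_.end() ? nullptr : &it->second;
    }

private:
    unsigned long next_id_ = 1;
    std::deque<std::string> pending_;
    std::unordered_map<std::string, JobInfo> jobs_;
};
```

With this shape, each CGI call stays short: one call submits and returns the job ID to the browser, and later calls use the ID to show the latest progress message or the finished output.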
NetSchedule
Push-Pull Model
NetSchedule server maintains several FIFO queues
- CGIs: Push Job
- Worker Nodes: Pull Job
[Diagram: Queue 1 and Queue 2 holding Job 1, Job 2, Job 3, …..]
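A minimal sketch of the push-pull model, assuming jobs are plain strings and queues are addressed by name. `QueueServer`, `Push`, and `Pull` are hypothetical names, not the NetSchedule interface; the mutex is there because several CGIs and worker nodes would touch the queues concurrently.

```cpp
#include <deque>
#include <map>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a server that keeps several named FIFO queues.
class QueueServer {
public:
    // Producer side (a CGI): append a job to the end of the named queue.
    // The queue is created on first use.
    void Push(const std::string& queue, std::string job) {
        std::lock_guard<std::mutex> lock(mutex_);
        queues_[queue].push_back(std::move(job));
    }

    // Consumer side (a worker node): take the oldest job from the named
    // queue, or nothing if the queue is empty or does not exist.
    std::optional<std::string> Pull(const std::string& queue) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = queues_.find(queue);
        if (it == queues_.end() || it->second.empty()) return std::nullopt;
        std::string job = std::move(it->second.front());
        it->second.pop_front();
        return job;
    }

private:
    std::mutex mutex_;
    std::map<std::string, std::deque<std::string>> queues_;
};
```

Because workers pull rather than being handed jobs, a busy worker simply does not ask for more, which is what lets the queue absorb peak load.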
NCBI Grid Structure
- NetCache: stores JOB input/output
- NetSchedule: general purpose queue management
- Load Balancer (LBSM)
- Worker Node API: distribution, logging, remote management
- CGI front end and migration toolkit, HTML templates
High availability
- All central components (queue and data storage) are duplicated
- All components are controlled by the NCBI load balancer
- Protection against back-end (remote CGI) failures, by timeout or via job re-scheduling
- Remote administration and statistics
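The slide does not say how timeout and re-scheduling interact. One common arrangement, shown here purely as an assumption, gives each pulled job a deadline and returns it to the pending queue if no completion arrives in time. `ReschedulingQueue` and its methods are hypothetical names.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <map>
#include <optional>
#include <string>
#include <utility>

// Hypothetical queue that re-schedules jobs whose worker went silent.
class ReschedulingQueue {
public:
    using Clock = std::chrono::steady_clock;

    explicit ReschedulingQueue(std::chrono::seconds run_timeout)
        : run_timeout_(run_timeout) {}

    void Push(std::string job_id) { pending_.push_back(std::move(job_id)); }

    // Hand the oldest pending job to a worker and record its deadline.
    std::optional<std::string> Pull(Clock::time_point now) {
        if (pending_.empty()) return std::nullopt;
        std::string id = std::move(pending_.front());
        pending_.pop_front();
        running_[id] = now + run_timeout_;
        return id;
    }

    // Worker reports success. Returns false if the job is not currently
    // running, for example because it was already re-scheduled.
    bool Complete(const std::string& id) { return running_.erase(id) == 1; }

    // Move every running job whose deadline has passed back to pending.
    // Returns the number of jobs re-scheduled.
    std::size_t RescheduleExpired(Clock::time_point now) {
        std::size_t moved = 0;
        for (auto it = running_.begin(); it != running_.end();) {
            if (now >= it->second) {
                pending_.push_back(it->first);
                it = running_.erase(it);
                ++moved;
            } else {
                ++it;
            }
        }
        return moved;
    }

private:
    std::chrono::seconds run_timeout_;
    std::deque<std::string> pending_;
    std::map<std::string, Clock::time_point> running_;
};
```

A job can run twice under this scheme if a slow worker finishes after its deadline, so the job itself should be safe to repeat.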
Worker node API
- High level design
  - standard C++ streams
  - ASN.1, XML serialization
- Support of SMP
  - thread based parallel jobs
- Remote administrative access to worker nodes
  - shutdown
  - availability checking
  - statistics
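To show what "standard C++ streams" can mean for a job, here is a sketch in which a job is any function that reads its input from a `std::istream` and writes its output to a `std::ostream`. The names `JobFn`, `JobResult`, `RunJob`, and `SumJob` are hypothetical and are not taken from the Grid API; threading and remote administration are left out.

```cpp
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical job signature: read input from a stream, write output to a
// stream, return 0 on success and non-zero on failure.
using JobFn = std::function<int(std::istream&, std::ostream&)>;

struct JobResult {
    int return_code;
    std::string output;
};

// Hypothetical runner: wraps the stored input in a stream, runs the job,
// and collects whatever the job wrote.
JobResult RunJob(const JobFn& job, const std::string& input) {
    std::istringstream in(input);
    std::ostringstream out;
    int rc = job(in, out);
    return {rc, out.str()};
}

// Example job: sum whitespace-separated integers.
// Fails (returns 1) if it stops on something that is not a number.
int SumJob(std::istream& in, std::ostream& out) {
    long total = 0;
    long value = 0;
    while (in >> value) total += value;
    out << total;
    return in.eof() ? 0 : 1;
}
```

Since the job sees only streams, the same function can be run against a string in a test, a file, or data fetched from storage, without changes to the job itself.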
Acknowledgements
C++ Group (Development)
Denis Vakatov - coordination, design
Anton Lavrentiev - communication libraries, load balancer
Aaron Ucko - threaded server
Anatoliy Kuznetsov - NetCache, NetSchedule
Maxim Didenko - Grid API, CGI migration framework
Other NCBI Groups
BLAST Group
Tom Madden, George Coulouris, Yuri Merezhuk, Yan Raytselis, Ron Edgar, Mike DiCuccio, Yuri Kapustin, Boris Fedorov
Mark Johnson - presentation rehearsal