Distributed Systems 2006 Virtual Synchrony II* *With material adapted from Ken Birman, Ben Wang, Bill Burke, Bela Ban
Distributed Systems Plan We skip Section 18.2 Tracking group membership: We’ll base it on 2PC and 3PC Fault-tolerant multicast: We’ll use membership Ordered multicast: We’ll base it on fault-tolerant multicast Tools for solving practical replication and availability problems: we’ll base them on ordered multicast Robust Web Services: We’ll build them with these tools 2PC and 3PC: Our first “tools” (lowest layer)
Distributed Systems Recap: Elements of Virtual Synchrony Support for process groups –Processes may join and leave dynamically –Excluded if (thought to) fail Various reliable, ordered multicast protocols –Strive to replace synchronous, totally-ordered, dynamically uniform protocols View-synchronous delivery –Two processes that are members of the same view receive the same set of multicasts during that view Identical process group views and rankings –Identical sequences of group membership lists Gap-freedom guarantees –If, after a failure, m1 has been delivered to a destination, then any message m0 that should be delivered prior to m1 is also delivered State transfer for joining processes –A new process may obtain group state from an existing member –(Will develop this shortly)
Distributed Systems Distributed Algorithms Election Consensus Consistent snapshot Replicated data and synchronization State transfer Load-balancing Fault tolerance –Primary-backup –Coordinator-cohort
Distributed Systems State transfer How do we transfer the current state of a process group to joining members? “State” is application-dependent –Creating the state should be done by the application itself –E.g., org.jgroups.MessageListener: byte[] getState(), void setState(byte[] state) Simple approach –Just make all members transfer state to the joining process Reasonable approach –Joining process pulls state from one existing member –Pulls from another member if the first one fails
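The "reasonable approach" above can be sketched as follows. This is a minimal illustration, not the JGroups API: `StateHolder` is a hypothetical interface that mirrors only the two method signatures named on the slide, and `CounterMember`, its `crashed` flag, and `StateTransfer.pullState` are assumptions made for the example. A real toolkit would also coordinate the transfer with view changes and message delivery, which this sketch ignores.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical interface mirroring the two state-transfer callbacks named above.
interface StateHolder {
    byte[] getState();
    void setState(byte[] state);
}

// Illustrative group member whose application state is a single counter.
class CounterMember implements StateHolder {
    private long counter;
    private final boolean crashed; // simulates a member that fails during the transfer

    CounterMember(long counter, boolean crashed) {
        this.counter = counter;
        this.crashed = crashed;
    }

    long counter() {
        return counter;
    }

    @Override
    public byte[] getState() {
        if (crashed) throw new IllegalStateException("member failed");
        // The application decides how its state is encoded.
        return ByteBuffer.allocate(Long.BYTES).putLong(counter).array();
    }

    @Override
    public void setState(byte[] state) {
        counter = ByteBuffer.wrap(state).getLong();
    }
}

class StateTransfer {
    // The joiner pulls state from one existing member and tries the next one if that member fails.
    static boolean pullState(StateHolder joiner, List<? extends StateHolder> members) {
        for (StateHolder member : members) {
            try {
                joiner.setState(member.getState());
                return true;
            } catch (IllegalStateException e) {
                // This provider failed; fall through to the next member.
            }
        }
        return false; // no member could supply the state
    }
}
```

Pulling from a single member keeps the transfer cost independent of group size, at the price of a retry when the chosen member fails.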
Distributed Systems Load-Balancing How do we coordinate the members of a group to share the workload and obtain a speed-up through parallelism? Use group communication to implement load-balancing of requests... Styles of algorithms –Group decides who should handle request –Client decides who should handle request
Distributed Systems Group Decides Multicast request from client to full membership –May require expensive transfer to all Need deterministic rule for deciding who should handle request –E.g., with abcast, requests may be numbered in a total order –A process may handle the ith request if the process rank is i mod n With cbcast?
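The "i mod n" rule can be sketched as below. The class and method names are hypothetical; the sketch assumes that every member sees the same view (and therefore the same group size and ranks) and the same abcast sequence number for each request, which is what makes the rule deterministic without extra communication.

```java
class GroupDecides {
    // Returns true if the member with the given rank should handle the request
    // that received sequence number seq in the total (abcast) order.
    static boolean shouldHandle(long seq, int rank, int groupSize) {
        return Math.floorMod(seq, (long) groupSize) == rank;
    }
}
```

For any sequence number exactly one rank in 0..n-1 satisfies the test, so each request is handled once as long as the view does not change between delivery and execution.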
Distributed Systems Client affinity schemes Group members provide clients with information used to select an appropriate server to send requests to Best choice depends on data size, fault-tolerance needed, queries/updates,... E.g., –Static assignment of client to specific server (+: enables caching; -: poor fit for very active clients) –Pick a random server –Base choice on (approximate) load information Could also be used with the previous approach
Distributed Systems Using approximate load information
Assume that processes send out load reports using abcast
–E.g., 0 for no load, 1 for currently handling 1 request, etc.
Represent the load on a group of n servers as a vector: $[l_0, \ldots, l_{n-1}]$
–$l_{\max} = \max(l_0, \ldots, l_{n-1}) + 1$
–$[l_0', \ldots, l_{n-1}'] = [l_{\max} - l_0, \ldots, l_{\max} - l_{n-1}]$
–$L' = l_0' + \cdots + l_{n-1}'$
Map incoming requests to processes
–Given a request, choose a random number r with $0 \le r < L'$
By applying a pseudo-random generator with the same seed at all processes
–Now choose process i if $l_0' + \cdots + l_{i-1}' \le r < l_0' + \cdots + l_i'$ (for i = 0 the left-hand sum is empty, i.e. 0)
(Think of $l_0', \ldots, l_{n-1}'$ as consecutive segments on a line of length $L'$; the algorithm selects the segment that r falls within)
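A minimal sketch of this selection rule, assuming non-negative integer loads. The class name, method names, and the way the shared seed is obtained are assumptions for illustration (one option would be to derive the seed from the request's position in the total order). Process i is chosen when r falls in the i-th segment, i.e. between the sum of the inverted loads before i and that sum plus the inverted load of i.

```java
import java.util.Random;

class LoadWeightedChoice {
    // Inverts the load vector: weight_i = lmax - load_i with lmax = max(load) + 1,
    // so lightly loaded servers get larger weights and every weight is at least 1.
    static int[] weights(int[] load) {
        int max = 0;
        for (int l : load) max = Math.max(max, l);
        int[] w = new int[load.length];
        for (int i = 0; i < load.length; i++) w[i] = (max + 1) - load[i];
        return w;
    }

    // Picks the index i whose segment [w_0+...+w_{i-1}, w_0+...+w_i) contains r.
    static int select(int[] w, int r) {
        int upper = 0;
        for (int i = 0; i < w.length; i++) {
            upper += w[i];
            if (r < upper) return i;
        }
        throw new IllegalArgumentException("r must be in [0, sum of weights)");
    }

    // Every member calls this with the same load vector and the same seed,
    // so all members arrive at the same choice without further communication.
    static int choose(int[] load, long sharedSeed) {
        int[] w = weights(load);
        int total = 0;
        for (int x : w) total += x;
        return select(w, new Random(sharedSeed).nextInt(total));
    }
}
```

With loads [0, 2, 1] the inverted vector is [3, 1, 2], so the idle server receives half of the line of length 6 and the busiest server one sixth.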
Distributed Systems Fault tolerance How do we offer clients “fault-tolerant request execution”? We can replace a traditional service with a group of members –Each request is assigned to a primary (ideally, spread the work around) and a backup –Primary sends a “cc” of the response to the request to the backup –Backup keeps a copy of the request and steps in only if the primary crashes before replying Sometimes called “coordinator/cohort” just to distinguish it from “primary/backup”
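The backup's side of this scheme can be sketched as follows. The `Cohort` class and its method names are hypothetical, and the sketch assumes the backup learns of the primary's failure through a membership (view) change; it leaves out how requests and reply copies are actually multicast.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class Cohort {
    // Requests the primary is responsible for and has not yet answered, in arrival order.
    private final Map<Integer, String> pending = new LinkedHashMap<>();

    // The backup keeps a copy of every request assigned to its primary.
    void onRequest(int requestId, String request) {
        pending.put(requestId, request);
    }

    // The primary's "cc" of the reply tells the backup the request is done.
    void onReplyCopy(int requestId) {
        pending.remove(requestId);
    }

    // On primary failure the backup steps in and executes whatever is still unanswered.
    Map<Integer, String> onPrimaryFailure(Function<String, String> service) {
        Map<Integer, String> replies = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : pending.entrySet()) {
            replies.put(e.getKey(), service.apply(e.getValue()));
        }
        pending.clear();
        return replies;
    }
}
```

In the common case the backup does no request processing at all; it only stores and discards copies, which is why the work can be spread by giving different requests different primaries.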
Distributed Systems Trade-offs Membership –Static Fixed membership, changing connectivity and availability –Dynamic Changing membership, fixed connectivity and availability Consistency –Internal Defined with respect to members of group observing messages –External Defined with respect to external observer (e.g., a database)
Distributed Systems Toolkits Isis Horus Ensemble JGroups Spread
Distributed Systems Features of major virtual synchrony platforms Isis: first and no longer widely used –But was perhaps the most successful; has major roles in NYSE, Swiss Exchange, French Air Traffic Control system (two major subsystems of it), US AEGIS Naval warship –Pioneered use of cbcast –Also was first to offer a publish-subscribe interface that mapped topics to groups
Distributed Systems Features of major virtual synchrony platforms Horus, JGroups and Ensemble –Successors to Isis –These focus on flexible protocol stack linked directly into application address space A stack is a pile of micro-protocols Can assemble an optimized solution fitted to specific needs of the application by plugging together “properties this application requires”, lego-style The system is optimized to reduce overheads of this compositional style of protocol stack Use –JGroups is very popular Used in, e.g., JBoss, OpenSymphony OSCache, Jetty, Tomcat –Ensemble is somewhat popular and supported by a user community –Horus works well but is not widely used.
Distributed Systems Horus/JGroups/Ensemble protocol stacks [Figure: an application belongs to several process groups, each with its own protocol stack assembled from micro-protocol layers such as comm, nak, frag, mbrshp, fc, parcld, merge, total]
Distributed Systems Spread Toolkit Focused on a sort of “RISC” approach –Very simple architecture and system –Fairly fast, easy to use, rather popular Supports one large group within which user sees many small “lightweight” subgroups that seem to be free-standing Protocols implemented by Spread “agents” that relay messages to apps
Distributed Systems Case: J2EE/JBoss Java 2 Enterprise Edition (J2EE) –Multi-tiered, distributed application model / reference architecture Tiered = physically layered architecture –Technologies to support this reference architecture Enterprise Java Beans (EJB) –Server-side component model –Component: “… a unit of composition with contractually specified interfaces and explicit context dependencies only. A software component can be deployed independently and is subject to composition by third parties” JBoss –Open source J2EE Application Server –Arguably the most popular Java application server
Distributed Systems J2EE Business Context Powerful workstations (and servers) make distributed computing viable –Programming abstractions for distributed computing needed The Web… –Internet-enabled business systems / enterprise information systems –Specific requirements for web applications Scalability –Support variations in load –e.g., Amazon before Christmas Availability –Very small downtime periods –e.g., eBay (400 million transactions/day) Security –Authenticate and authorize users Usability –Different users should have different contents in different forms Performance –Reasonable response times needed –Requests often arrive in bursts –Also, e.g., time-to-market...
Distributed Systems Architectural Solution
Distributed Systems Tiers Client tier –User interfaces Internet browser Standalone Java clients (COM applications) Middle tier –Web tier Web server for handling requests from browsers –Gets request from client tier –Forwards to business component tier –Renders result –Business component tier Core “business logic” –E.g., customers, accounts, …, relationships and rules among these in Web shop Realized by EJBs running in an EJB container “Application server” = middle-tier component server compatible with J2EE Enterprise information systems tier –Databases –Backend systems
Distributed Systems Example: EHR in Ribe Amt
Distributed Systems Specific J2EE-Related Technologies JavaServer Pages (JSP) –Creating dynamic web-based content Java Servlets –Extending functionality of web servers Java Message Service (JMS) –Asynchronous point-to-point and many-to-many messaging Java Naming and Directory Interface (JNDI) –Directory-based retrieval of user-defined objects J2EE Connector Architecture –Standard architecture for integrating external systems RMI over IIOP –RMI using OMG’s Internet Inter-ORB Protocol Java DataBase Connectivity (JDBC) –Uniform interface to relational databases Enterprise JavaBeans (EJB)
Distributed Systems EJB Deployment View EJB container –Manage execution of components –Expose platform services
Distributed Systems EJB Code View Need to enable remote clients to access bean –Remote interface –~ Proxy object Manage lifecycle of bean –Home interface –Possibly functionality to locate specific instances –~ Factory object Implement functionality –Bean class Clients use generated stubs
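The three roles above (remote interface as proxy, home interface as factory, bean class as implementation) can be illustrated with plain Java stand-ins. Nothing below uses the javax.ejb types: `Account`, `AccountHome`, `AccountBean`, and `MiniContainer` are hypothetical names, and the lambda merely stands in for the stub a real container would generate.

```java
// Stand-in for a remote interface: the only view of the bean that clients see.
interface Account {
    int deposit(int amount);
}

// Stand-in for a home interface: manages the lifecycle (here, only creation).
interface AccountHome {
    Account create();
}

// Stand-in for the bean class: implements the functionality.
class AccountBean {
    private int balance;

    public int deposit(int amount) {
        balance += amount;
        return balance;
    }
}

// Toy "container" that creates bean instances and hands out stubs to clients.
class MiniContainer implements AccountHome {
    @Override
    public Account create() {
        AccountBean bean = new AccountBean();
        // The lambda plays the role of a generated stub: it forwards calls to the bean,
        // which is where a real container would add remoting, transactions, or security.
        return amount -> bean.deposit(amount);
    }
}
```

Because clients hold only the interface, the container is free to interpose on every call, which is what later makes clustering features such as failover possible without changing client code.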
Distributed Systems Detailed Example
Distributed Systems EJB Types Entity beans –Representing business data objects –Data members map to data items stored in associated data base –Container-managed persistence Container loads and stores data No application code required –Bean-managed persistence Bean code responsible itself for persistence –“handcrafted” JDBC Session beans –Business logic and services –“Stateful“ (SFSBs) Can keep state on behalf of client Successive calls go to same component Container handles life-cycle –Passivation –Activation E.g., CommandBean –“Stateless” (SLSBs) Does not keep any state on behalf of client Each successive call delegated to stateless session bean as needed Easy scalability and load balancing E.g., PrintHandlerBean Message-driven beans –Stateless –Asynchronous listener style of invocation
Distributed Systems “Clustering” and J2EE Scalability –I want to handle x times the number of concurrent accesses that I have now High availability –Services are accessible with reasonable (and predictable) response times at any time –E.g., “five nines” (99.999% availability) in telecom Load balancing –A way to obtain high availability and better performance by dispatching incoming requests to different servers –Session affinity (or stickiness) –Checking heartbeat Failover –Processing can continue when it is redirected to a “backup” node because the original one fails –What is the policy? Round-robin? Fault tolerance –A service that guarantees strictly correct behavior despite system failure
Distributed Systems JGroups in JBoss? Serverless JMS Clustering –Replication of entity beans, SLSBs and SFSBs –HA-JNDI –Session replication (integrated Tomcat, Jetty) Cache –Replicated transactional clustered cache
Distributed Systems The real world is complex...
Distributed Systems Default JBoss JGroups Configuration
<UDP mcast_addr="${jboss.partition.udpGroup: }" mcast_port="45566" ip_ttl="8" ip_mcast="true"
     mcast_send_buf_size="800000" mcast_recv_buf_size="150000"
     ucast_send_buf_size="800000" ucast_recv_buf_size="150000" loopback="false"/>
<PING timeout="2000" num_initial_members="3" up_thread="true" down_thread="true"/>
<FD shun="true" up_thread="true" down_thread="true" timeout="2500" max_tries="5"/>
<VERIFY_SUSPECT timeout="3000" num_msgs="3" up_thread="true" down_thread="true"/>
<pbcast.NAKACK gc_lag="50" retransmit_timeout="300,600,1200,2400,4800" max_xmit_size="8192"
     up_thread="true" down_thread="true"/>
<UNICAST timeout="300,600,1200,2400,4800" window_size="100" min_threshold="10" down_thread="true"/>
<pbcast.STABLE desired_avg_gossip="20000" up_thread="true" down_thread="true"/>
<pbcast.GMS join_timeout="5000" join_retry_timeout="2000" shun="true" print_local_addr="true"/>
Distributed Systems Serverless JMS Java Message Service based on JGroups –Peer-to-peer architecture rather than client/server Client publishing to a topic –Other clients subscribe to topics –The “publish/subscribe” paradigm Instead of sending a message to the server, which then distributes it to multiple clients, the publisher multicasts the message –JMS server is just another group member Handles persistent messages (DB)
Distributed Systems Serverless JMS [Figure: two panels comparing delivery cost. Server-based delivery, cost: 4 unicasts. Serverless delivery, cost: 1 multicast]
Distributed Systems Serverless JMS Clients are still able to publish even when server is down Caveat –works in scenario where client and server are in same multicast-reachable network
Distributed Systems Session Replication in Tomcat Tomcat and Jetty are both Java-based web servers Servlet sessions are replicated across Tomcat processes –New Tomcat instance gets sessions from existing Tomcat instance(s) –Modification (addition, removal of attributes) of session gets replicated
Distributed Systems Session replication in Tomcat Expiry of session will expire session everywhere Last timestamp update External load-balancer distributes requests to Tomcat instances –Round-robin –Sticky, next server on crash
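The "sticky, next server on crash" policy can be sketched as below. `StickyBalancer` and its `route` method are hypothetical, and the set of crashed servers is passed in directly rather than detected through heartbeats. Failing over to the next instance only preserves the session because, as described above, sessions are replicated across the Tomcat instances.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class StickyBalancer {
    private final List<String> servers;
    private final Map<String, Integer> affinity = new HashMap<>(); // session id -> server index
    private int next = 0; // round-robin cursor for new sessions

    StickyBalancer(List<String> servers) {
        this.servers = servers;
    }

    // New sessions are assigned round-robin; known sessions stay on their server
    // and move to the next live server only if that server has crashed.
    String route(String sessionId, Set<String> crashed) {
        Integer start = affinity.get(sessionId);
        if (start == null) {
            start = next;
            next = (next + 1) % servers.size();
        }
        for (int tries = 0; tries < servers.size(); tries++) {
            int candidate = (start + tries) % servers.size();
            if (!crashed.contains(servers.get(candidate))) {
                affinity.put(sessionId, candidate);
                return servers.get(candidate);
            }
        }
        throw new IllegalStateException("no live server");
    }
}
```

After a failover the session stays on its new server even if the old one comes back, which avoids bouncing a client between instances.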
Distributed Systems Summary We gave further examples of what can be built on top of Virtual Synchrony Brief examples of toolkits and uses –Isis, Ensemble, Horus, Spread, JGroups J2EE introduction as a bonus