A tutorial on building large-scale services

Presentation transcript:

A tutorial on building large-scale services James Ge (戈君) 2018.3.7

As a large-scale online service
- Stable 24/7.
- Aggregate feedback from users in time and iterate quickly.
- Highly efficient testing, deployment, and maintenance.
- Massively parallel development.

From the framework perspective

Building a service
TCP/IP guarantees reliable data transmission, but building a service needs more abstractions:
- What is the format of the transmitted data?
- Can multiple requests be sent through one TCP connection simultaneously?
- How do I talk to a cluster of many machines?
- What should I do when the connection is broken?
- What if the server does not respond?
- ...

RPC
- Abstracts network communication as "clients accessing functions on servers".
- Data needs to be serialized, which protobuf does well.
- Creating and reusing connections is transparent to users, but users can choose among connection types: short, pooled, single.
- Machines are discovered through Naming Services.
- RPC retries when the connection is broken.
- When the server does not respond within the given time, the client fails with a timeout error.
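The retry and timeout rules on this slide can be sketched in plain C++. This is only an illustration under stated assumptions, not brpc's implementation: `CallStatus`, `RetryPolicy`, and `CallWithRetry` are hypothetical names, and the `attempt` callback stands in for one network round trip. The sketch retries only when the connection is broken and gives up with a timeout once the overall time budget is spent.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical status codes for one RPC attempt (not brpc's error codes).
enum class CallStatus { kOk, kConnectionBroken, kTimeout };

// Hypothetical policy: retry count and an overall time budget for the call.
struct RetryPolicy {
    int max_retry = 3;
    std::chrono::milliseconds timeout{500};
};

// Runs `attempt` until it succeeds, times out, or retries are exhausted.
// `attempt` receives the time still available and reports what happened.
CallStatus CallWithRetry(
    const RetryPolicy& policy,
    const std::function<CallStatus(std::chrono::milliseconds)>& attempt,
    int* attempts_made = nullptr) {
    assert(policy.max_retry >= 0);
    using Clock = std::chrono::steady_clock;
    const Clock::time_point deadline = Clock::now() + policy.timeout;
    int attempts = 0;
    CallStatus status = CallStatus::kConnectionBroken;
    while (attempts <= policy.max_retry) {
        const auto remaining =
            std::chrono::duration_cast<std::chrono::milliseconds>(
                deadline - Clock::now());
        if (remaining <= std::chrono::milliseconds::zero()) {
            status = CallStatus::kTimeout;  // budget spent: fail with timeout
            break;
        }
        ++attempts;
        status = attempt(remaining);
        // Only a broken connection is retried; success and timeout are final.
        if (status != CallStatus::kConnectionBroken) break;
    }
    if (attempts_made != nullptr) *attempts_made = attempts;
    return status;
}
```

A caller would pass a callback that performs one attempt and maps its outcome to `CallStatus`; with the default policy a call is attempted at most four times (one try plus three retries).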

https://github.com/brpc/brpc
- An industrial-grade RPC framework, with 1,000,000+ instances (not counting clients) and thousands of kinds of services.
- Serve multiple protocols on one port, or access all sorts of services.
- Servers can handle requests synchronously or asynchronously.
- Clients can access servers synchronously, asynchronously, or semi-synchronously, or use combo channels to simplify sharded or parallel accesses.
- Debug services via http and run profilers.
- Better latency and throughput.

[Diagram: request flow on the client side and the server side. Different color = different thread. Components: NS, LB, Channel 1-3, Socket, Event Dispatcher, Acceptor, Parse, Process Request, Process Response, KeepWrite, Service 1-2. Annotations: work stealing scheduling; ABA-free; wait-free; 1 bthread for 1 request; no locking; bthread swap (saving a CS); concurrency within fd; locate context in O(1) time w/o global contention.]

The built-in services

bvar
- QPS & counting
- Percentiles & CDF
- Latency
- Min & Max
- System-wide stats
- Per-second stats
- Can be within any time window

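bvar

    #include <bvar/bvar.h>

    // Global recorder, exposed under the given name.
    bvar::LatencyRecorder g_reader_hole_latency("ds_common_log_channel_reader_hole");

    void table_search() {
        ...
        // base::Timer as written on the slide; open-source brpc names this namespace butil.
        base::Timer tm;
        tm.start();
        channel_reader_hole();
        tm.stop();
        // Record the elapsed time in microseconds.
        g_reader_hole_latency << tm.u_elapsed();
    }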

From the architectural perspective

A typical (isomorphic) service
[Diagram: a Client containing a Load Balancer and a Naming Service; the Client sends RPCs to three Servers.]

Add a server
[Diagram: Client (Load Balancer), Naming Service, three Servers (Active), one Server (Inactive). Labels: "Registration at start" on the server side; "Sync periodically or event-driven" between the Client and the Naming Service.]

Remove a server
[Diagram: Client (Load Balancer), Naming Service, three Servers (Active), one Server (Inactive), marked "Removable". Labels: "Unregistration" on the server side; "Sync periodically or event-driven" between the Client and the Naming Service.]
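The add and remove flows can be sketched with an in-process stand-in for the naming service. Everything here is an assumption for illustration: `NamingService`, `LoadBalancer`, and their methods are hypothetical names, not brpc's API, and round-robin is just one possible balancing policy. Servers register and unregister themselves; the client only sees a change after its next sync.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical in-process stand-in for a naming service.
class NamingService {
public:
    // Called by a server at start.
    void Register(const std::string& addr) {
        assert(!addr.empty());
        std::lock_guard<std::mutex> lock(mu_);
        if (std::find(servers_.begin(), servers_.end(), addr) == servers_.end()) {
            servers_.push_back(addr);
        }
    }
    // Called before a server is taken away.
    void Unregister(const std::string& addr) {
        std::lock_guard<std::mutex> lock(mu_);
        servers_.erase(std::remove(servers_.begin(), servers_.end(), addr),
                       servers_.end());
    }
    // Returns a copy of the current server list.
    std::vector<std::string> List() const {
        std::lock_guard<std::mutex> lock(mu_);
        return servers_;
    }

private:
    mutable std::mutex mu_;
    std::vector<std::string> servers_;
};

// Hypothetical client-side balancer working on a synced snapshot.
class LoadBalancer {
public:
    // Would run periodically or on a change event.
    void Sync(const NamingService& ns) { servers_ = ns.List(); }

    // Round-robin over the snapshot; empty string when no server is known.
    std::string Select() {
        if (servers_.empty()) return std::string();
        return servers_[next_++ % servers_.size()];
    }

private:
    std::vector<std::string> servers_;
    std::size_t next_ = 0;
};
```

Because the balancer selects from its own snapshot, a server that has unregistered can still receive traffic until the next sync, which is why it is only "removable" after clients have caught up.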

When a server crashes
[Diagram: the Client (Load Balancer, Naming Service) retries the request on another of the four Servers.]
- Idempotence sometimes needs to be handled properly.
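One common way to make retries safe is to deduplicate by request ID on the server. The slides do not say how idempotence is handled, so this is a sketch of that general technique under assumptions: `IdempotentCounter` and the `request_id` parameter are hypothetical, and a real service would also need to expire old IDs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical server-side state that applies each request at most once.
class IdempotentCounter {
public:
    // Adds `delta` once per `request_id`. A retry carrying the same ID
    // gets the remembered result instead of being applied again.
    std::int64_t Add(const std::string& request_id, std::int64_t delta) {
        assert(!request_id.empty());
        auto it = results_.find(request_id);
        if (it != results_.end()) return it->second;  // duplicate: replay result
        value_ += delta;
        results_.emplace(request_id, value_);
        return value_;
    }

    std::int64_t value() const { return value_; }

private:
    std::int64_t value_ = 0;
    std::unordered_map<std::string, std::int64_t> results_;
};
```

If the client retries because the first response was lost, the second `Add` with the same ID returns the same value and the counter is not incremented twice.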

When the NS crashes
[Diagram: the Client's Load Balancer fails to sync with the Naming Service; the three Servers keep serving.]
- RPCs are unaffected, but the server list is no longer updated.
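The behavior described here, where calls keep working from the last known server list, can be sketched as a client-side cache that is replaced only when a sync succeeds. `CachedServerList` and its `Fetch` callback are hypothetical names for illustration; the callback stands in for a query to the naming service and returns false when it is unreachable.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical client-side cache of the server list.
class CachedServerList {
public:
    // Fills the output vector and returns true, or returns false on failure.
    using Fetch = std::function<bool(std::vector<std::string>*)>;

    explicit CachedServerList(Fetch fetch) : fetch_(std::move(fetch)) {
        assert(fetch_ != nullptr);
    }

    // Replaces the cached list only on success; a failed sync keeps the
    // previous list so that RPCs can continue.
    bool Sync() {
        std::vector<std::string> fresh;
        if (!fetch_(&fresh)) return false;
        servers_.swap(fresh);
        return true;
    }

    const std::vector<std::string>& servers() const { return servers_; }

private:
    Fetch fetch_;
    std::vector<std::string> servers_;
};
```

The trade-off is staleness: while syncs fail, newly added servers receive no traffic and removed ones may still be selected.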

Experiments
[Diagram: Clients pass through an Experiment Framework, then the Load Balancer and Naming Service, to the Servers. Modules 1 to N each carry their own exps; exps are marked as either independent or mutually exclusive.]
- The Experiment Framework chooses exps according to probabilities and layers.
- A lot of modules are developed by different teams in parallel.
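Choosing experiments "according to probabilities and layers" is often done by hashing. The sketch below is one such scheme and an interpretation of the slide, not the author's framework: it assumes experiments within a layer are mutually exclusive (they split that layer's 100 buckets) and that layers are independent because each hashes the user with its own salt. `Experiment`, `Layer`, `ChooseInLayer`, and `ChooseExperiments` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical experiment: a name and its share of traffic in percent.
struct Experiment {
    std::string name;
    int percent;
};

// Hypothetical layer: experiments in one layer exclude each other.
struct Layer {
    std::string salt;  // differs per layer so that layers bucket independently
    std::vector<Experiment> experiments;
};

// 64-bit FNV-1a, used so that bucketing is stable across runs and platforms.
std::uint64_t Fnv1a(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Returns the experiment chosen for this user in this layer, or "" if none.
std::string ChooseInLayer(const Layer& layer, const std::string& user_id) {
    const int bucket = static_cast<int>(Fnv1a(layer.salt + ":" + user_id) % 100);
    int upper = 0;
    for (const Experiment& e : layer.experiments) {
        upper += e.percent;
        assert(upper <= 100);  // shares within a layer must not exceed 100%
        if (bucket < upper) return e.name;
    }
    return std::string();
}

// At most one experiment per layer; different layers can combine freely.
std::vector<std::string> ChooseExperiments(const std::vector<Layer>& layers,
                                           const std::string& user_id) {
    std::vector<std::string> chosen;
    for (const Layer& layer : layers) {
        const std::string name = ChooseInLayer(layer, user_id);
        if (!name.empty()) chosen.push_back(name);
    }
    return chosen;
}
```

The same user always lands in the same bucket of a given layer, so assignments stay stable between requests without storing any per-user state.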

Collect data
[Diagram: each Machine runs a Server, its Logs, and an Agent (linked by http). The collected data goes to Message Queues and Services.]
Data collected:
- Serving logs
- User-defined variables
- Tracing logs
- Profiling results
- ...

Dev cycles
[Diagram labels: Dev repo; Add new features; Auto tests; Failure, reject commit; Last good; Releasing branch; Periodically; More tests; Online Deployment; Running...; Unstable, rollback; Previous online; Data; Dashboard; Analyze; Turn off unstable features.]