Lessons from a SIP Wireless Deployment Jonathan Rosenberg Chief Scientist.

Presentation transcript:

Lessons from a SIP Wireless Deployment Jonathan Rosenberg Chief Scientist

SIP2003 Lessons 2
Background
- Market: 2.5G Wireless Network
- Initial Applications: Instant Messaging (IM) and Presence
- Subscriber Sizing: 500k Initially, Scaling up to Several Million
- PoPs: 12 Regional PoPs, 2 Centralized Data Centers
- Servers: Between 100 and 200 Separate Server Processes

Lessons Summary
- Data Distribution and Management Is Hard
- Network-Wide Diagnostics Are Essential
- UDP Non-INVITEs and Failover Interactions

The Data Distribution Problem
- SIP Applications Depend on Many Pieces of Data
  - Provisioned data
    - Buddy lists
    - White/black lists
    - Call forwarding numbers
  - Soft-state data
    - Presence
    - Registrations
- Many Parties Interested in Writing the Data
  - Wireless handset updates its buddy list
  - Web application updates a buddy list
  - Customer care updates a buddy list
- Many Parties Interested in Reading the Data
  - Wireless handset, to get its current buddy list
  - Web application, to display the current buddy list
  - Customer care, to tell a customer who is on their buddy list
  - Presence server, to support subscriptions to the buddy list
- Many Parties Interested in Finding Out Changes to the Data
  - Handset, for buddy list synchronization
  - Presence server, to send a SUBSCRIBE request to a new participant
  - Other applications

Requirements for Data Distribution
- Network Element Requirements for Data
  - “Close” to the element, for performance reasons
  - Replicated and consistent across all elements in a cluster within a PoP
  - Replicated to other PoPs to provide PoP failover
  - Soft-state data replicated to backup servers for failover support
- Operator Requirements for Data
  - Data survives crashes of any or all network elements
  - Data can be read/written by provisioning and customer support systems
  - Data can be accessed by provisioning and customer support from a single access point, independent of network scale and size
  - Data writes are validated before being propagated
  - Data propagation to elements survives network faults (IP router goes down), element failures, etc.
  - Distribution of provisioned data has minimal to no impact on element performance (i.e., a bulk load cannot take down a running system)
  - Recovery from data distribution failures needs to be possible

Key Lessons
- The Requirements for “Closeness” and “Performance” Conflict with Consistency Requirements
  - Ultimately, the data gets replicated across a potentially large number of elements
  - Large scale replication with transactional integrity is very costly in terms of performance
  - Seek compromise data distribution methods that provide good performance with reduced consistency
- The Data Distribution Piece Is at Least As Hard, If Not Harder, Than Getting the SIP Pieces Right
- Try to Solve This Problem Generally, Not Separately for Each Application
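
One way to picture a "good performance with reduced consistency" compromise is a single write point that validates each update and then pushes it to the element-local copies asynchronously. The sketch below is only an illustration under that assumption, not the design used in the deployment; the `Master` and `Replica` classes, the version counter, and the bounded `drain` batch are all hypothetical names introduced here.

```python
from collections import deque


class Replica:
    """A network element's local copy of provisioned data (hypothetical)."""

    def __init__(self):
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Ignore stale or duplicate updates so redelivery after a fault is harmless.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        # Reads are local and fast, but may lag behind the master.
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class Master:
    """Single write point: validate first, then propagate asynchronously."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.version = 0
        self.pending = {id(r): deque() for r in replicas}

    def write(self, key, value):
        # Validate before anything is queued for propagation.
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            raise ValueError("buddy list must be a list of strings")
        self.version += 1
        for replica in self.replicas:
            self.pending[id(replica)].append((key, self.version, list(value)))
        return self.version

    def drain(self, replica, max_updates=1):
        # Deliver a bounded batch so a bulk load cannot swamp a running element.
        queue = self.pending[id(replica)]
        delivered = 0
        while queue and delivered < max_updates:
            replica.apply(*queue.popleft())
            delivered += 1
        return delivered


near, far = Replica(), Replica()
master = Master([near, far])
master.write("joe", ["alice"])
master.drain(near)
print(near.read("joe"), far.read("joe"))  # ['alice'] None
```

The second replica still returns nothing until its queue is drained, which is the reduced consistency being traded for local read performance.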

Network-Wide Diagnostics
- Problem Statement
  - Joe calls customer service. He says his phone doesn’t work. When asked what the problem was, he reports that his IM never reached his intended target. He sent it yesterday or perhaps the day before.
- The Challenge
  - Find the element which failed and identify the specific problem in the deployed production network, without affecting performance of the network.

Why Is This Challenging?
- There Are a Multitude of Elements at the “SIP Layer”
  - A variety of proxies
  - A variety of databases
  - A variety of gateways
- There Are a Multitude of Elements at Other Layers
  - A variety of routers
  - A variety of GGSNs (Gateway GPRS Support Nodes) / PDSNs (Packet Data Serving Nodes)
  - A variety of base stations
  - A variety of Ethernet switches
- Continuous Logging Is Not Possible
  - Performance implications
- You Cannot Replicate the State of the Network When the Failure Occurred
  - Too many users and other variables

What Is the Solution?
- Design for Diagnostics
- Stimulate Your System
- Engineer for Evolution
- Know Your Network

Design for Diagnostics
- Extensive “Triggered” Logging
  - Look for conditions that may indicate an error
    - SIP transaction timeout
    - SIP request failure
    - Database timeout
    - Corrupted database data
  - On those conditions, produce mass amounts of trace data
    - Execution stacks
    - Message contents
- May Need to Store Trace Data in Memory in Sliding Window
  - Sometimes an error in one place causes an error in another
- Careful Draining of Trace Data
  - Cannot affect runtime performance
- Centralized Repository for Trace Data
  - Don’t want to have to go to each of the machines
  - Push it to a single place with well-identified correlation identifiers
- Don’t Forget the Handset!
  - The handset is part of the network
  - It should generate trace data too upon failure!
- Related To, but Not the Same As, Fault Management
  - This is something the network operations guys can’t fix
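
The sliding-window idea can be sketched as a fixed-size in-memory buffer that is cheap to write to on every message and is only flushed to a central sink when an error condition fires. This is a minimal illustration, not the deployed implementation: the `TriggeredTracer` class, the list standing in for the centralized repository, and the choice of correlation identifier (for example a SIP Call-ID) are assumptions made here.

```python
from collections import deque
import time


class TriggeredTracer:
    """Keep recent trace records in memory; ship them only when triggered."""

    def __init__(self, window=1000, sink=None):
        # deque(maxlen=...) silently drops the oldest record once full,
        # which gives the sliding window without extra bookkeeping.
        self.buffer = deque(maxlen=window)
        # The sink stands in for a centralized trace repository (hypothetical).
        self.sink = sink if sink is not None else []

    def trace(self, correlation_id, event, detail=""):
        # Cheap in-memory append on the normal processing path.
        self.buffer.append((time.time(), correlation_id, event, detail))

    def trigger(self, reason):
        # Called on conditions such as a SIP transaction timeout or a
        # database timeout: snapshot the window and push it to the sink.
        snapshot = list(self.buffer)
        self.buffer.clear()
        self.sink.append({"reason": reason, "records": snapshot})
        return len(snapshot)


tracer = TriggeredTracer(window=3)
for i in range(5):
    tracer.trace("call-1", f"event-{i}")
flushed = tracer.trigger("SIP transaction timeout")
```

In a real element the flush would be drained off the processing path, since the slide's constraint is that draining cannot affect runtime performance.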

Stimulate Your System
- The Best Problems Are the Ones You Find Before Your Customers Do!
- Look for Problems Through Active “Probing” of the Network
  - A usage which triggers the logging of data about how it was processed in each element
  - Usage must be a normal one
- SIP “Probe” Extensions
  - Headers that ask proxies and user agents to generate tracing information about message handling
  - May also designate a destination for sending the data
  - Alternatively, attach it to the message
    - What if it’s lost?
- Security Issues
  - Must carefully authenticate the sender of a probed message
  - Otherwise, a great source of DoS and other attacks
- Continuously Send Probes
  - For each use case of your network
  - For each PoP or site
  - Vary the transmission times and contents wherever possible
- IETF Work Just Begun
  - Develop requirements for such probes
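
The "for each use case, for each PoP, vary the transmission times" advice can be sketched as a small schedule generator. This only plans when probes go out; it does not model the SIP probe headers themselves, which the slides say were not yet defined. The function name, the use-case and PoP labels, and the interval and jitter values are all hypothetical.

```python
import itertools
import random


def probe_schedule(use_cases, pops, base_interval, jitter, rng):
    """Yield (use_case, pop, delay_seconds) for every use case at every PoP."""
    for use_case, pop in itertools.product(use_cases, pops):
        # Randomize the send time so probes do not always see the same network state.
        delay = base_interval + rng.uniform(-jitter, jitter)
        yield use_case, pop, delay


plan = list(
    probe_schedule(
        ["im", "presence"],        # hypothetical use cases
        ["pop-east", "pop-west"],  # hypothetical PoP names
        base_interval=60.0,
        jitter=15.0,
        rng=random.Random(7),      # seeded only to make the sketch repeatable
    )
)
```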

Engineer for Evolution
- Once You Find the Bug and Prepare a Fix, What Then?
- Need to Upgrade the Affected Servers
  - Cannot affect run time performance
  - Must be easy to do (so you can do it often!)
  - Must be easy to undo
- Solution: Automated Software Upgrade
- Basic Process
  - Vendor sends operator a new version
  - Operator types “install version” at the centralized management console
  - Console determines which servers are affected
  - For each server, gracefully terminates it one at a time
    - Remotely installs upgrade
      - Old one not removed
    - Updates configuration files if needed
    - Remotely verifies upgrade
    - Restarts server, and goes to the next one
- Old Server Versions and Configurations Are Kept, Rollback Is Allowed
- Process Must Be Automated and Easy
  - Model: Quicken
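
The basic upgrade process can be sketched as a loop that handles one server at a time, keeps the old version around, and rolls back if verification fails. This is a toy model under stated assumptions: the `Server` class and its in-memory `verify` check stand in for remote installation and verification, and halting at the first failure is a design choice made for the sketch, not something the slides specify.

```python
class Server:
    """Toy stand-in for a remotely managed server process (hypothetical)."""

    def __init__(self, name, version, healthy_versions):
        self.name = name
        self.version = version
        self.previous = None
        self.running = True
        # Versions that pass verification on this server; replaces a real remote check.
        self._healthy_versions = healthy_versions

    def stop(self):
        self.running = False

    def start(self):
        self.running = True

    def install(self, version):
        # The old version is kept so that rollback stays possible.
        self.previous = self.version
        self.version = version

    def verify(self):
        return self.version in self._healthy_versions

    def rollback(self):
        self.version, self.previous = self.previous, None


def rolling_upgrade(servers, new_version):
    """Upgrade servers one at a time; return (upgraded_names, failed_name)."""
    upgraded = []
    for server in servers:
        if server.version == new_version:
            continue  # not affected by this upgrade
        server.stop()
        server.install(new_version)
        if not server.verify():
            # Undo on this server and halt so the rest keep serving traffic.
            server.rollback()
            server.start()
            return upgraded, server.name
        server.start()
        upgraded.append(server.name)
    return upgraded, None


fleet = [
    Server("proxy-a", "1.0", {"1.0", "2.0"}),
    Server("proxy-b", "1.0", {"1.0", "2.0"}),
    Server("proxy-c", "1.0", {"1.0"}),  # the new version fails verification here
]
result = rolling_upgrade(fleet, "2.0")
```

Because only one server is stopped at any moment, the remaining servers keep handling traffic while the upgrade proceeds.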

Know Your Network
- Experience Is Ultimately the Only Way to Find Problems
- The People Who Design Elements Are Usually Not the Ones Who Have Experience Running Networks of Them
- Put Processes in Place to Feed Back Experience to the Developers and Architects

Non-INVITE UDP Failover Problem
- A SIP non-INVITE request is sent through a chain of proxies
- The final proxy has failed
- Upon transaction timeout, each of them generates a 408
- The “winning” 408 depends on relative timing
- Would like to mark the server as failed so it is not tried again
- How does each proxy know if the failure was its own next hop, or some other server downstream?
  - Timeout can occur first anywhere in the chain
  - Downstream 408s are discarded because the transaction has timed out
[Diagram: a chain of proxies (P) carrying the request (MSG), with a timeout and the resulting 408 responses]

Solutions
- Use TCP
  - TCP will provide a hop-by-hop acknowledgement for the data
  - If next hop fails, your TCP connection reports errors
- Bring Back 100 Responses for Non-INVITE
  - Tells a proxy that the next hop got the request
  - Means proxy was alive at the beginning of the transaction
  - Next hop considered dead if no 100 is received
  - Con: extra message traffic
- Extended Transactions
  - Two transaction timeouts
    - Currently defined one
    - Longer one used to wait for 408 responses from downstream nodes
  - If 408 is received before second timeout, but after first, failure is not the next hop
  - If no 408 is received before second timeout, downstream element has failed
  - Con: additional memory requirements for holding on to state of the transaction
- Conclusion: Needs to Be Worked in IETF
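
The extended-transaction rule can be written down as a small decision function. The first timeout uses the non-INVITE transaction timeout from RFC 3261 (Timer F, 64*T1, which is 32 seconds with the default T1 of 500 ms). The second, longer timeout value and the result labels are assumptions made for illustration, and the last case reads the slide's "downstream element has failed" as meaning the immediate next hop.

```python
TIMER_F = 32.0           # seconds; RFC 3261 Timer F = 64 * T1, with T1 = 0.5 s
EXTENDED_TIMEOUT = 40.0  # seconds; hypothetical second, longer timeout


def classify_failure(elapsed_408):
    """Classify a non-INVITE outcome from when a downstream 408 arrived.

    elapsed_408 is the number of seconds after sending the request at which
    a 408 was received, or None if none arrived before EXTENDED_TIMEOUT.
    """
    if elapsed_408 is None or elapsed_408 >= EXTENDED_TIMEOUT:
        # Nothing came back even within the longer window: suspect the next hop.
        return "next-hop-failed"
    if elapsed_408 >= TIMER_F:
        # The next hop was alive long enough to report its own timeout,
        # so the failure lies further downstream.
        return "failure-further-downstream"
    # A 408 inside the normal window is an ordinary final response.
    return "response-within-transaction"
```

The cost the slide mentions shows up here as the extra `EXTENDED_TIMEOUT - TIMER_F` seconds for which each proxy has to retain transaction state.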

Summary
- Building a Large Scale Distributed SIP Network Is Hard
- Many of the Problems Are Not Specific to SIP, and Show up in Any Similar System
  - IP networks
- Key General Lessons
  - Data distribution is hard
  - Worry about diagnostics
- SIP Lesson
  - Non-INVITE failover problem

Information Resource Jonathan Rosenberg Chief Scientist