Enabling Low Latency and High Reliability for IMS–NFV Muhammad Taqi Raza and Songwu Lu Computer Science Department University of California, Los Angeles
Emerging Multimedia Services A number of interactive multimedia services exist: HD Voice Service over IP, Rich Communication Service (chat), UHD Streaming Service, Smart City, Smart Home, Interactive Gaming, eMBMS Service, Connected Car Services
IP Multimedia Subsystem – IMS IMS is an architectural framework for delivering IP multimedia services. [Figure: IMS architecture. Labels: IMS Client, Radio Access, Base Station, 4G PS Core, LTE Gateway, IMS Core (Proxy Server, Serving Server, Media Gateway), IMS Signaling (SIP), IMS Service (RTP/RTCP), Telephony Network, Data Service, Internet]
Current / Future Trend: NFV of IMS Network Function Virtualization (NFV) of IMS is a reality. A number of operators are using virtualized IMS. The Network Function (NF) logic is moved from dedicated boxes to commercial off-the-shelf servers. NFV lowers CapEx and OpEx. [Figure: dedicated Proxy NF, Serving NF, and Media NF boxes replaced by Proxy NF, Serving NF, and Media NF running on commodity servers]
Our Questions How well can a virtualized implementation of IMS: meet low-latency multimedia requirements, and achieve carrier-grade high availability of five nines (99.999%)? At five nines, system downtime cannot be more than about 864 msec / day
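The daily downtime budget implied by an availability target can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name is hypothetical and not from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def max_downtime_ms_per_day(availability: float) -> float:
    """Return the allowed downtime in milliseconds per day for a given availability."""
    # Unavailable fraction of the day, converted from seconds to milliseconds.
    return (1.0 - availability) * SECONDS_PER_DAY * 1000.0


# Five nines (99.999%) leaves a budget of roughly 864 msec per day.
five_nines_budget_ms = max_downtime_ms_per_day(0.99999)
print(f"{five_nines_budget_ms:.1f} msec/day")
```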
Empirical Study Reliability in virtualized IMS (vIMS) Deployed a testbed of OpenIMS over OpenStack Modified OpenIMS source code to meet 3GPP standard
Higher Media-Plane Latencies Finding: Media latencies increase exponentially beyond a certain call rate. vIMS fails to meet QoS requirements on media traffic
Higher Media-Plane Latencies Analysis: Frequent interactions between different modules of different NFs. Media Controller and Processor modules interact frequently. AS NF defines policy; MRFC NF converts policy into control plane signals; MRFP NF processes media traffic as instructed by control plane signaling. [Figure: modules per network function. Labels: AS, Media Policy; MRFC, Media Control, Policing input, Low level media focus; MRFP, Mixer, Notification, Floor CTR, Processor, ...]
Higher Media-Plane Latencies Analysis: Frequent interactions between different modules of different NFs. Script Document and Interpreter modules form a loop. AS dynamically generates XML scripts; scripts provide media behavior; MRFC fetches scripts from AS over HTTP. [Figure: modules per network function. Labels: AS, Script Document; MRFC, Interpreter, Document Request, XML Interpreter Context; MRFP, Script execution, Implementation Platform]
Traffic Forwarding for Aborted Connection Finding Callee devices keep generating uplink media packets for caller device Speech mute issue for callee devices
Traffic Forwarding for Aborted Connection Analysis: Control-plane termination does not stop data-plane flow. P-CSCF and MRF keep two different dialogue finite state machines. [Figure: dialogue state machine at P-CSCF and dialogue state machine at MRF. Labels: 183 Session in Progress, 202 Accepted, ACK, Bye, Timeout, 200 OK, Confirmed, Terminated, Start, Preparative, Early, Established, Moratorium, Mortal, Morgue; <Created>, Connection progressing, Connection connected, Start, Created, Progressing, Connected]
Traffic Forwarding for Aborted Connection Analysis: Frequent interactions between different modules of different NFs. Module failure causes a hanging state machine. Failure of these bridging modules results in control-plane termination. The data plane, being decoupled from the control plane, stays connected. [Figure: control plane and data plane paths. Labels: Application Server (AS-ILCM, AS-OLCM), S-CSCF (ILCM, OLCM, ILSM, OLSM), Incoming Leg, Outgoing Leg, P-CSCF, MRF, PGW, Device, Data Plane, Control Plane]
Our Solution Pipelining media plane processing and media control commands to reduce latencies Quickly isolating faulty module by reconfiguring its neighboring modules to improve fault tolerance
Design Overview Predicts future metadata values and requests control instructions for all predicted values (red in the figure). Reconfigures each module by adding a backup path to each of its one-hop neighboring modules (blue in the figure). [Figure: design overview. Labels: Serving-CSCF (Incoming Leg Control, Session Controller), Media Resource Controller (Media Control, Resource Controller), Media Resource Processor (Media Processor, Media Update), Application Server (Media Policy), Proxy-CSCF (Registrar, Authentication), PDN-Gateway (LTE), Backup Link, Primary Link, Pre-fetch control info, Update, Pre-fetch]
Pipelining Control and Media Planes Converting the serial operations of processing media packets at MRFP and fetching control instructions from MRFC into parallel operations. When MRFP processes packets, it finds metadata (e.g. jitter, payload, etc.). Predicting the metadata that future media packets will generate. Prefetching control instructions for these packets. [Figure: Media Resource Controller (Media Control, Resource Controller), Media Resource Processor (Media Processor, Media Update), Application Server (Media Policy), Incoming Leg Control, Pre-fetch control info, Update, Pre-fetch]
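The idea of overlapping packet processing with control-instruction fetching can be sketched as below. This is a minimal illustration, not the authors' implementation: fetch_control_instruction, process_packet, and predict_next_metadata are hypothetical stand-ins, and the predictor simply assumes the next packet carries the same metadata as the current one.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_control_instruction(metadata):
    # Hypothetical stand-in for the MRFP -> MRFC round trip.
    return f"instruction-for-jitter-{metadata}"


def process_packet(packet_id, instruction):
    # Hypothetical stand-in for media processing at the MRFP.
    return (packet_id, instruction)


def predict_next_metadata(current):
    # Assumption: the next packet's metadata equals the current one.
    return current


def pipelined_processing(metadata_stream):
    """Process packets while prefetching instructions for predicted metadata."""
    results, prefetched, hits = [], {}, 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        for packet_id, metadata in enumerate(metadata_stream):
            future = prefetched.get(metadata)
            if future is not None:
                instruction = future.result()  # prefetch hit
                hits += 1
            else:
                # Misprediction: fall back to the serial fetch.
                instruction = fetch_control_instruction(metadata)
            predicted = predict_next_metadata(metadata)
            if predicted not in prefetched:
                # This fetch runs in the background while the packet is processed.
                prefetched[predicted] = pool.submit(fetch_control_instruction, predicted)
            results.append(process_packet(packet_id, instruction))
    return results, hits


results, hits = pipelined_processing([20, 20, 35])
```

In this toy run the second packet is served from the prefetched instruction, while the third (whose metadata was not predicted) takes the serial path.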
Pipelining Control and Media Planes Optimization using batch prefetching: Parallel operations alone do not shorten the control-instruction fetching loop. Use an exponential smoothing model to take historical metadata values into account. Generate a batch of predicted metadata and prefetch the corresponding control instructions. [Figure: Media Resource Controller (Media Control, Resource Controller), Media Resource Processor (Media Processor, Media Update), Application Server (Media Policy), Incoming Leg Control, Pre-fetch control info, Update, Pre-fetch]
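A batch of predicted metadata values can be produced from the history with an exponential smoothing forecaster. The slides name only "exponential smoothing", so the variant here is an assumption: Holt's linear (level plus trend) method, chosen because plain simple smoothing yields the same forecast for every future step. The function name and the alpha and beta defaults are illustrative.

```python
def holt_batch_forecast(history, batch_size, alpha=0.5, beta=0.5):
    """Forecast the next `batch_size` values with Holt's linear exponential smoothing."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    # Initialise the level with the first value and the trend with the first difference.
    level = history[0]
    trend = history[1] - history[0]
    for value in history[1:]:
        previous_level = level
        # Smooth the level towards the new observation.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Smooth the trend towards the latest change in level.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    # k-step-ahead forecasts form the batch to prefetch instructions for.
    return [level + k * trend for k in range(1, batch_size + 1)]


# Example: jitter values (msec) observed on recent packets.
predicted_batch = holt_batch_forecast([10, 12, 14, 16], batch_size=3)
```

Each value in predicted_batch would then be used as a key when requesting control instructions ahead of time.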
Fault Isolation and Module Refactoring Failure detection: Tracking IMS operations through a Finite State Machine (FSM). Failure detection starts in the Fast Retry state. Propose 5 retries of the failed SIP message before declaring failure. [Figure: FSM with states SIP Received, SIP Forwarded, Fast Retry, Failover, Replay. Transitions: 1. Start timer A; 2. SIP Sent; 3. Timer A expires; 4. Start timer B; 5. SIP Resent; 6. Timer B expires; 7. Modules reconfigured]
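The detection FSM can be sketched as a small state machine. The state names and the five-retry limit come from the slide; the exact wiring of transitions, the event names, and the "response_received" recovery path are assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    SIP_RECEIVED = auto()
    SIP_FORWARDED = auto()
    FAST_RETRY = auto()
    FAILOVER = auto()
    REPLAY = auto()


MAX_RETRIES = 5  # retries of the failed SIP message before declaring failure


class FailureDetector:
    def __init__(self):
        self.state = State.SIP_RECEIVED
        self.retries = 0

    def on_event(self, event):
        """Advance the FSM on one event and return the new state."""
        if self.state is State.SIP_RECEIVED and event == "sip_sent":
            self.state = State.SIP_FORWARDED  # timer A starts here
        elif self.state is State.SIP_FORWARDED and event == "timer_a_expired":
            self.state = State.FAST_RETRY     # first resend, timer B starts
            self.retries = 1
        elif self.state is State.FAST_RETRY and event == "timer_b_expired":
            if self.retries < MAX_RETRIES:
                self.retries += 1             # resend and restart timer B
            else:
                self.state = State.FAILOVER   # all retries used: declare failure
        elif self.state is State.FAST_RETRY and event == "response_received":
            self.state = State.SIP_RECEIVED   # assumed recovery path
            self.retries = 0
        elif self.state is State.FAILOVER and event == "modules_reconfigured":
            self.state = State.REPLAY
        return self.state


detector = FailureDetector()
for event in ["sip_sent", "timer_a_expired"] + ["timer_b_expired"] * 5:
    detector.on_event(event)
```

After the fifth timer B expiry the detector has resent the message five times and moves to Failover.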
Fault Isolation and Module Refactoring Failover procedure: The module that detected the failure deactivates its link with the failed module. It loads the failed module’s executable through preloaded configurations. It announces the module failure to the failed module’s neighbors and declares itself the in-service module
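The three failover steps can be illustrated with a toy model. Everything here is hypothetical: the Module class, its fields, and the idea of representing "loading the executable" as adopting a role from a preloaded configuration table are assumptions, not the authors' code.

```python
class Module:
    def __init__(self, name):
        self.name = name
        self.active_links = set()  # names of directly connected modules
        self.roles = {name}        # functions this module currently serves
        self.preloaded = {}        # neighbor name -> its preloaded configuration
        self.route = {}            # failed module name -> module now serving it

    def connect(self, other):
        self.active_links.add(other.name)
        other.active_links.add(self.name)

    def fail_over(self, failed, registry):
        """Take over a failed neighbor's function and notify its other neighbors."""
        # Step 1: deactivate the link with the failed module.
        self.active_links.discard(failed.name)
        # Step 2: load the failed module's function from the preloaded configuration.
        _config = self.preloaded[failed.name]
        self.roles.add(failed.name)
        # Step 3: announce the failure and declare this module in service.
        for neighbor_name in failed.active_links - {self.name}:
            neighbor = registry[neighbor_name]
            neighbor.active_links.discard(failed.name)
            neighbor.active_links.add(self.name)
            neighbor.route[failed.name] = self.name


# Example topology: detector <-> failed <-> downstream.
detector_mod, failed_mod, downstream = Module("A"), Module("B"), Module("C")
detector_mod.connect(failed_mod)
failed_mod.connect(downstream)
detector_mod.preloaded["B"] = "B-config"
registry = {m.name: m for m in (detector_mod, failed_mod, downstream)}
detector_mod.fail_over(failed_mod, registry)
```

After the call, module C routes traffic meant for B to A, which now serves both roles.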
Implementation Pipelining control instructions with media plane processing: Break the dependency between the Media Processor and Media Update modules and set up two interfaces between them. Failure detection procedure: Detect failure by observing state transitions in an FSM. Fail-over procedure: Keep a record of ongoing device sessions (before failure) in a hash table. Replay the failed operations
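Keeping a hash table of ongoing sessions and replaying unfinished operations after failover could look like the sketch below. The SessionLog class, its method names, and the use of SIP method names as operations are illustrative assumptions.

```python
class SessionLog:
    """Hash table of per-session operations that have not yet completed."""

    def __init__(self):
        self._pending = {}  # session id -> list of unacknowledged operations

    def record(self, session_id, operation):
        # Remember the operation until it is acknowledged.
        self._pending.setdefault(session_id, []).append(operation)

    def acknowledge(self, session_id, operation):
        # Drop an operation once it has completed successfully.
        operations = self._pending.get(session_id, [])
        if operation in operations:
            operations.remove(operation)

    def replay(self, send):
        """Resend every pending operation through `send` and return what was replayed."""
        replayed = []
        for session_id, operations in self._pending.items():
            for operation in operations:
                send(session_id, operation)
                replayed.append((session_id, operation))
        return replayed


log = SessionLog()
log.record("call-1", "INVITE")
log.acknowledge("call-1", "INVITE")  # completed before the failure
log.record("call-1", "BYE")          # in flight when the module failed
log.record("call-2", "INVITE")       # in flight when the module failed

sent = []
replayed = log.replay(lambda session_id, op: sent.append((session_id, op)))
```

Only the two operations that were still in flight are replayed; the acknowledged INVITE is not resent.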
Evaluation Reducing Latencies We increase the call rate 60% of the traffic remains below 100 msec (state-of-the-art case) Frequent interaction between Media Processor and Media Control modules
Evaluation Improving fault tolerance: Failure detection time is 1 msec for just 7 calls per second. Our fail-over procedure is 10X quicker than that of the cloud platform
Conclusion Interactions among IMS software modules can cause higher latencies and bring failures. We propose refactoring different modules to contain latencies and improve fault tolerance. In future work, we will consider the fail-stop failure model as well as data-plane failure cases