Worms, Viruses, and Cascading Failures in Networks
D. Towsley, U. Massachusetts
Collaborators: W. Gong, C. Zou (UMass); A. Ganesh, L. Massoulié (Microsoft)
o Internet as enabler of terrific apps
o … but also of malicious behavior: worms, viruses
o Internet as a complex system: critical DNS, BGP infrastructures
Worms and failures
o Code Red worm: more than 360,000 hosts infected in less than one day; disrupted parts of the BGP infrastructure
o SQL Slammer: less than 15 minutes to infect 75,000 hosts; congested parts of the Internet; BGP errors in one network → cascade of faults in BGP in another network
Goals
o what are appropriate models? deterministic, stochastic
o what makes a worm/virus/failure virulent?
o how does topology affect virulence?
Outline
o worms, deterministic models
o cascading failures, stochastic models
o summary
Worm spreading behavior
o scan for vulnerable hosts: sequential, random, topological; uniform, local preference
o virulence sensitive to: scanning strategy; host speed, bandwidth; protocol; …
Worm spreading model
o address space, size $\Omega$
o $N$ vulnerable hosts
o scan rate (per host), $\eta$
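Simple worm spreading model
o $I(t)$: number of infected hosts at time $t$
o Epidemic model: $\frac{dI(t)}{dt} = \frac{\eta}{\Omega}\, I(t)\,[N - I(t)]$, with initial condition $I(0)$, where $\eta$ is the per-host scan rate and $\Omega$ the size of the scanned address space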
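A minimal sketch of the simple epidemic model, assuming it takes the classic logistic form $dI/dt = (\eta/\Omega)\, I\, (N - I)$. The function name `infected_hosts` and the toy parameter values are illustrative assumptions, not taken from the slides; `eta` and `omega` stand for the per-host scan rate and the address-space size.

```python
import math

def infected_hosts(t, n_vulnerable, eta, omega, i0):
    """Closed-form I(t) for dI/dt = (eta / omega) * I * (N - I), I(0) = i0."""
    growth = eta * n_vulnerable / omega          # early exponential growth rate
    # Logistic solution, written so exp() cannot overflow for large t >= 0.
    return n_vulnerable / (1.0 + (n_vulnerable - i0) / i0 * math.exp(-growth * t))

# Illustrative (made-up) parameters: 1,000 vulnerable hosts in 10,000 addresses,
# 2 scans per minute per host, one initially infected host.
for minute in (0, 10, 20, 40, 80):
    print(minute, round(infected_hosts(minute, 1_000, 2.0, 10_000, 1.0), 1))
```

The curve starts at $I(0)$, grows roughly exponentially at rate $\eta N/\Omega$ while few hosts are infected, and saturates at $N$.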
Code Red: model
o measurements from two Class A networks (D. Goldsmith, K. Eichman): scan rate, $I(t)$
o epidemic model matches the increasing part of the observed Code Red data (Staniford)
o what about the decrease? human countermeasures; congestion (Zou et al., 2002)
[Figure: observed Code Red scan rate versus time]
Assumptions
o classic epidemic model: ignore countermeasures; ignore congestion
o Code Red parameters: scan rate $\eta$ = 358/min; $N$ = 360,000; uniform scan, address space $\Omega = 2^{32}$
o $I(0)$ = 10
o 100s of minutes to spread
Worm virulence increase
o increase $I(0)$; decrease the scanned address space $\Omega$
o smarter scanning
The perfect worm
o perfect worm: scans vulnerable nodes exactly once
o flash worm (Staniford, …): uniform scan of vulnerable nodes only (scanned address space $\Omega = N$)
Perfect Code Red worm
o $I(0)$ = 10; scan rate $\eta$ = 358/min
o $N$ = 360,000
o all hosts infected within 2 sec.
o adding a 2 sec. infection delay gives a six-fold slowdown
o random scan is almost perfect!
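The 2-second figure can be sanity-checked with a back-of-the-envelope fluid approximation. The sketch below assumes a perfect worm grows as $dI/dt = \eta I$ (every scan hits a fresh vulnerable host) and a flash worm as $dI/dt = \eta I (1 - I/N)$ (uniform scan over the $N$ vulnerable hosts only). These two ODEs and the function names are assumptions made for illustration; the slides do not state how the numbers were obtained, and the infection delay is not modelled.

```python
import math

ETA = 358.0        # scans per minute per host (Code Red value from the slides)
N = 360_000        # vulnerable hosts
I0 = 10            # initially infected hosts

def perfect_worm_seconds(n, eta, i0):
    """Time for dI/dt = eta * I to grow from i0 to n (every scan is a hit)."""
    return 60.0 * math.log(n / i0) / eta

def flash_worm_seconds(n, eta, i0, fraction=0.99):
    """Time for dI/dt = eta * I * (1 - I/n) to reach fraction * n."""
    target = fraction * n
    return 60.0 * math.log(target * (n - i0) / (i0 * (n - target))) / eta

print(f"perfect worm, all hosts: {perfect_worm_seconds(N, ETA, I0):.2f} s")
print(f"flash worm, 99% of hosts: {flash_worm_seconds(N, ETA, I0):.2f} s")
```

Under these assumptions both times come out at a few seconds, which is consistent with the slide's remark that random scanning of the vulnerable population is almost as fast as the perfect worm.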
Hitlist, routing worms
o hitlist worm: increases $I(0)$
o routing worm: decreases the scanned address space $\Omega$; BGP table information gives $\Omega = 0.29 \times 2^{32}$ (29% of IP address space)
Hitlist, routing worms
o Code Red style worm: scan rate $\eta$ = 358/min
o $N$ = 360,000
o hitlist, $I(0)$ = 10,000
o routing worm as effective as hitlist worm
o hitlist/routing worm extremely virulent
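To compare the three variants numerically, the sketch below inverts the logistic solution of $dI/dt = (\eta/\Omega)\, I\, (N - I)$ to get the time at which a given fraction of the vulnerable hosts is infected. The 99% target, the function name `minutes_to_fraction`, and the scenario labels are illustrative choices, not from the slides; the parameter values are the ones quoted above.

```python
import math

def minutes_to_fraction(fraction, n, eta, omega, i0):
    """Time t with I(t) = fraction * n for dI/dt = (eta/omega) * I * (n - I)."""
    growth = eta * n / omega
    target = fraction * n
    return math.log(target * (n - i0) / (i0 * (n - target))) / growth

N, ETA, FULL = 360_000, 358.0, 2 ** 32
scenarios = {
    "Code Red baseline": dict(omega=FULL, i0=10),
    "hitlist worm": dict(omega=FULL, i0=10_000),          # larger I(0)
    "routing worm": dict(omega=0.29 * FULL, i0=10),       # smaller address space
}
times = {name: minutes_to_fraction(0.99, N, ETA, **kw) for name, kw in scenarios.items()}
for name, minutes in times.items():
    print(f"{name}: {minutes:.0f} min to 99% infected")
```

A hitlist shortens only the slow start-up phase, whereas shrinking the address space raises the growth rate itself, which is why a routing worm needs no hitlist to compete.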
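Local preference worm
o $K$ subnetworks
o $p$: probability that a scan targets the local subnet
o $1-p$: probability that a scan targets outside the local subnet
[Figure: subnets 1, 2, …, K; a scan stays local with probability p and leaves with probability 1-p]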
Local preference worm
o $N_k$: number of vulnerable hosts in subnet $k$
o $I_k(t)$: number of infected hosts in subnet $k$
o fits the epidemic model for interacting groups: a set of coupled ODEs
Local preference worm
o $K$ = 116
o $N_k$ = 360,000/$K$
o $I_1(0)$ = 10; $I_k(0)$ = 0 for $k > 1$
o scan rate $\eta$ = 358/min
o provides some of the locality of a routing worm
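The coupled ODEs are not reproduced on the slide, so the following is only a sketch of one plausible form: $dI_k/dt = [\beta_{loc} I_k + \beta_{glob} \sum_{j \ne k} I_j]\,(N_k - I_k)$. It assumes each subnet is a /8 block of $2^{24}$ addresses, that local scans are uniform over the host's own subnet, and that non-local scans are uniform over the other $K-1$ subnets. The names `simulate_local_preference`, `beta_local`, `beta_global`, the subnet size, and the forward-Euler step are all assumptions.

```python
def simulate_local_preference(p, K=116, N=360_000, eta=358.0,
                              subnet_size=2 ** 24, i0=10.0,
                              dt=0.1, t_end=600.0):
    """Forward-Euler integration; returns [(time_min, total_infected), ...]."""
    n_k = N / K                                              # vulnerable hosts per subnet
    beta_local = p * eta / subnet_size                       # per-pair rate, same subnet
    beta_global = (1.0 - p) * eta / ((K - 1) * subnet_size)  # per-pair rate, other subnets
    infected = [0.0] * K
    infected[0] = i0                                         # worm starts in subnet 1
    history = []
    for step in range(int(round(t_end / dt)) + 1):
        total = sum(infected)
        history.append((step * dt, total))
        updated = []
        for i_k in infected:
            pressure = beta_local * i_k + beta_global * (total - i_k)
            updated.append(min(n_k, i_k + dt * pressure * (n_k - i_k)))
        infected = updated
    return history

# p = 1/K is close to a uniform scan of the 116 subnets; larger p favours the local subnet.
for p in (1 / 116, 0.5, 0.9):
    final_time, final_total = simulate_local_preference(p, t_end=200.0)[-1]
    print(f"p = {p:.3f}: {final_total:,.0f} infected after {final_time:.0f} min")
```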
Questions
o topological worms
o sequential scan
o bandwidth constraints
o topology?
o failure recovery?
Topology and fast/slow recovery
o model description
o general network topologies: conditions for fast/slow recovery
o specific network topologies: complete graphs (BGP routers); hypercubes (peer-to-peer networks); power-law graphs (Internet AS graph; e-mail address book graph)
Susceptible-Infective-Susceptible (SIS) epidemic model
Also known as the contact process; see [Liggett]
o topology: undirected, finite, connected graph $G = (V, E)$
o $X_v = 1$ if node $v$ is down (infected); $X_v = 0$ if node $v$ is up (healthy)
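Model
o $\{X_v\}_{v \in V}$: Markov process on $\{0,1\}^V$ with jump rates: $X_v \to 1$ with rate $\beta \sum_{w \sim v} X_w$; $X_v \to 0$ with rate $\delta$ (here $\beta$ is the infection rate per edge and $\delta$ the recovery rate)
o unique absorbing state at 0
o all other states communicate; 0 is reachable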
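A small event-driven (Gillespie-style) simulation makes the jump rates concrete. This is a sketch under the assumption that an up node goes down at rate `beta` times its number of down neighbours and a down node recovers at rate `delta`; the names `simulate_sis` and `ring`, and the `t_max` cut-off, are introduced here for illustration only.

```python
import random

def ring(n):
    """Adjacency lists of a ring with n nodes."""
    return {v: [(v - 1) % n, (v + 1) % n] for v in range(n)}

def simulate_sis(neighbors, beta, delta, initially_down, rng, t_max=float("inf")):
    """Return the absorption time (all nodes up), or t_max if it is reached first."""
    down = set(initially_down)
    t = 0.0
    while down:
        recovery_rate = delta * len(down)
        # pressure[v] = number of down neighbours of the up node v
        pressure = {}
        for w in down:
            for v in neighbors[w]:
                if v not in down:
                    pressure[v] = pressure.get(v, 0) + 1
        total = recovery_rate + beta * sum(pressure.values())
        t += rng.expovariate(total)              # exponential waiting time to next event
        if t >= t_max:
            return t_max
        if not pressure or rng.random() * total < recovery_rate:
            down.remove(rng.choice(sorted(down)))            # a uniformly chosen node recovers
        else:
            nodes = list(pressure)
            weights = [pressure[v] for v in nodes]
            down.add(rng.choices(nodes, weights=weights)[0])  # infection, weighted by pressure
    return t

rng = random.Random(0)
runs = [simulate_sis(ring(20), 0.2, 1.0, range(20), rng) for _ in range(200)]
print("mean absorption time:", sum(runs) / len(runs))
```

The `t_max` argument matters in practice: above the threshold the absorption time can be astronomically large, so an uncapped run may never return.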
Time to absorption
o system eventually recovers
o how long does this take?
o $T$ = time to hit 0 (from a given initial condition); how does $E[T]$ depend on $G$?
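Example
o $G$ = line segment or ring with $n$ nodes
o Theorem (Durrett and Liu): fix the recovery rate $\delta$. There is a critical infection rate $\beta_c > 0$ such that if $\beta < \beta_c$, then $E[T] = O(\log n)$; if $\beta > \beta_c$, then $\log E[T] \approx n^{a}$
o signature of the phase transition in the infinite 1-D lattice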
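Fast recovery, spectral radius
o $\rho$: spectral radius of the graph adjacency matrix $A$; $n = |V|$; $\beta$, $\delta$: infection and recovery rates
o Then $P(X(t) \neq 0) \le c\, n^{1/2} \exp([\beta\rho - \delta]\, t)$
o Hence, when $\beta\rho < \delta$, the survival time $T$ satisfies $E[T] \le [\log(n) + 1] / [\delta - \beta\rho]$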
Coupling proof
o Consider a "branching random walk", i.e. a Markov process $\{Y_v\}_{v \in V}$ with jump rates: $Y_v \to Y_v + 1$ with rate $\beta \sum_{w \sim v} Y_w = \beta (AY)_v$; $Y_v \to Y_v - 1$ with rate $\delta Y_v$
o Can couple the processes so that, for all $t$, $X(t) \le Y(t)$
Branching random walk bound
o By "linearity" of $Y$, $dE[Y(t)]/dt = (\beta A - \delta I)\, E[Y(t)]$, so $E[Y(t)] = \exp(t\,[\beta A - \delta I])\, Y(0)$
o Use $P(X(t) \neq 0) \le \sum_{v \in V} E[Y_v(t)]$
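The fast-recovery condition can be evaluated for a concrete graph by estimating the spectral radius and plugging it into the bound $E[T] \le [\log(n) + 1]/[\delta - \beta\rho]$. The sketch below uses plain power iteration on $A + I$; the shift is a choice made here so that the iteration also converges on bipartite graphs. The helper names are assumptions, and the graph must be connected.

```python
import math

def spectral_radius(neighbors, iterations=2000):
    """Largest adjacency eigenvalue of a connected undirected graph."""
    x = {v: 1.0 for v in neighbors}
    estimate = 0.0
    for _ in range(iterations):
        # y = (A + I) x: same eigenvectors as A, eigenvalues shifted by +1.
        y = {v: x[v] + sum(x[w] for w in neighbors[v]) for v in neighbors}
        # Rayleigh quotient of A + I at x, minus the shift.
        estimate = (sum(x[v] * y[v] for v in neighbors)
                    / sum(val * val for val in x.values())) - 1.0
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {v: val / norm for v, val in y.items()}
    return estimate

def fast_recovery_bound(n, rho, beta, delta):
    """Upper bound (log n + 1) / (delta - beta * rho), or None if beta * rho >= delta."""
    margin = delta - beta * rho
    return (math.log(n) + 1.0) / margin if margin > 0 else None

complete = {v: [w for w in range(5) if w != v] for v in range(5)}
star = {0: list(range(1, 10)), **{v: [0] for v in range(1, 10)}}
for name, graph in (("complete, n=5", complete), ("star, n=10", star)):
    rho = spectral_radius(graph)
    print(name, "rho =", round(rho, 6), "bound =", fast_recovery_bound(len(graph), rho, 0.1, 1.0))
```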
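Slow recovery
o Graph isoperimetric constant: the smallest ratio of "perimeter" to "area" over vertex sets $S$, $\eta(G) = \min_{S \subset V,\ 0 < |S| \le n/2} E(S, \bar{S}) / |S|$, where $E(S, \bar{S})$ is the number of edges between $S$ and its complement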
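Generalized isoperimetric constant
o $\eta_m = \min_{S \subset V,\ 0 < |S| \le m} E(S, \bar{S}) / |S|$ for $0 < m \le n/2$, where $E(S, \bar{S})$ is the number of edges between $S$ and its complement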
Slow die-out and isoperimetric constant
o Suppose that for some $m \le n/2$, $r := \beta \eta_m / \delta > 1$, where $\eta_m$ is the generalized isoperimetric constant
o Then, with positive probability, the epidemic survives for time at least $r^m / [2m]$
o Hence, if $m = n^a$, the survival time $T$ satisfies $\log(E[T]) = \Omega(n^a)$
Coupling proof
o Let $|X| = \sum_v X_v$. Then $|X|$ dominates a process $Z$ on $\{0, \dots, m\}$ with transition rates: $z \to z+1$ at rate $\beta \eta_m z$; $z \to z-1$ at rate $\delta z$
o Then study the absorption time for $Z$
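The absorption time of the comparison chain can be computed exactly with the standard first-passage formula for birth-death chains. The sketch below assumes $Z$ has up-rate $\beta\eta_m z$ for $z < m$, no up-jump from $m$, and down-rate $\delta z$; starting from $z = 1$ this gives $E[T] = \frac{1}{\delta} \sum_{j=1}^{m} r^{j-1}/j$ with $r = \beta\eta_m/\delta$. The reflecting boundary at $m$ and the function name are assumptions, not stated on the slide.

```python
def expected_absorption_time(beta_eta, delta, m):
    """E[time to hit 0 | Z(0) = 1] for up-rate beta_eta * z (z < m), down-rate delta * z."""
    r = beta_eta / delta
    # Term j of the birth-death formula: (1 / mu_j) * prod_{i<j} lambda_i / mu_i
    # = r**(j - 1) / (delta * j), because the factors of z cancel.
    return sum(r ** (j - 1) / (delta * j) for j in range(1, m + 1))

for m in (5, 10, 20, 40):
    print(m, expected_absorption_time(2.0, 1.0, m))
```

For $r > 1$ the last term dominates, so the expected time grows like $r^{m-1}/(\delta m)$, the same exponential-in-$m$ behaviour as the survival bound $r^m/[2m]$.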
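Complete graph
o Here the spectral radius is $\rho = n - 1$ and the generalized isoperimetric constant is $\eta_m = n - m$
o Picking $m = n^a$, $a < 1$, gives thresholds: fast recovery if $\beta/\delta < 1/(n-1)$; slow recovery if $\beta/\delta > 1/(n - n^a)$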
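Hypercube $\{0,1\}^d$
o Here $d = \log_2(n)$ and the spectral radius is $\rho = d$
o For $m = 2^k$, $k < d$: generalized isoperimetric constant $\eta_m = d - k$
o Hence, for $k = a d$, thresholds: fast recovery if $\beta/\delta < 1/d$; slow recovery if $\beta/\delta > 1/[d(1-a)]$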
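The values quoted for the complete graph and the hypercube can be checked by brute force on small instances. The sketch assumes the generalized isoperimetric constant is the minimum, over non-empty vertex sets $S$ with $|S| \le m$, of the number of edges leaving $S$ divided by $|S|$. That definition and the helper names are assumptions introduced here, and the enumeration is exponential, so it is only usable for tiny graphs.

```python
from itertools import combinations

def isoperimetric_constant(neighbors, m):
    """min over non-empty S with |S| <= m of (#edges leaving S) / |S|."""
    nodes = list(neighbors)
    best = float("inf")
    for size in range(1, m + 1):
        for subset in combinations(nodes, size):
            inside = set(subset)
            boundary = sum(1 for v in inside for w in neighbors[v] if w not in inside)
            best = min(best, boundary / size)
    return best

def complete_graph(n):
    return {v: [w for w in range(n) if w != v] for v in range(n)}

def hypercube(d):
    # Vertices are d-bit integers; neighbours differ in exactly one bit.
    return {v: [v ^ (1 << b) for b in range(d)] for v in range(2 ** d)}

print(isoperimetric_constant(complete_graph(6), 3))   # compare with n - m
print(isoperimetric_constant(hypercube(3), 2))        # compare with d - k for m = 2**k
```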
Erdős-Rényi random graph
o edge between each pair of nodes present with probability $p_n$, independent of others
o dense: $d_n := n p_n = \Omega(\log n)$; then spectral radius and isoperimetric constant satisfy $\rho \sim \eta \sim d_n$ with high probability
Star network
o spectral radius: $\rho = n^{1/2}$; isoperimetric constant: $\eta_m = 1$ for all $m < n/2$
o general results not useful
o Specialized analysis yields, for an arbitrary constant $c > 0$: if $\beta/\delta < c/n^{1/2}$, fast recovery, $E[T] = O(\log(n))$; if $\beta/\delta > n^{a-1/2}$ for $a > 0$, slow recovery, $\log(E[T]) = \Omega(n^a)$
Power-law random graph
o Power-law graph with exponent $\gamma$: number of degree-$k$ vertices $\propto k^{-\gamma}$
o E.g. Internet AS graph with $\gamma = 2.1$
o Expected degree PLRG [Chung et al.]: expected degrees $w_1 > \cdots > w_n$; edge $(i,j)$ present w.p. $w_i w_j / \sum_k w_k$
o particular choice: $w_i = c_1 (i + c_2)^{-1/(\gamma - 1)}$
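A sketch of how such an expected-degree graph could be sampled, with `gamma` standing for the power-law exponent. The clamp of the edge probability to 1 and the omission of self-loops are simplifications made here, and the function names and the constants `c1`, `c2` in the usage lines are arbitrary illustrative values.

```python
import random

def expected_degrees(n, gamma, c1, c2):
    """w_i = c1 * (i + c2) ** (-1 / (gamma - 1)) for i = 1..n (decreasing in i)."""
    return [c1 * (i + c2) ** (-1.0 / (gamma - 1.0)) for i in range(1, n + 1)]

def sample_plrg(weights, rng):
    """Include edge (i, j), i < j, independently with probability w_i * w_j / sum(w)."""
    total = sum(weights)
    edges = set()
    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            prob = min(1.0, weights[i] * weights[j] / total)   # clamp is an assumption
            if rng.random() < prob:
                edges.add((i, j))
    return edges

weights = expected_degrees(200, 2.1, 10.0, 5.0)
edges = sample_plrg(weights, random.Random(0))
print(len(edges), "edges; largest expected degree", round(weights[0], 2))
```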
Power-law random graph (2)
o Spectral radius of PLRG [Chung et al., 03]: denote by $m$ the maximum expected degree ($m = w_1$) and by $d$ the average of the expected degrees. Then:
PLRG, $\gamma > 2.5$
o Epidemics on the full graph live longer than on a sub-graph
o Look at the star induced by node 1: slow die-out for $\beta/\delta > m^{a - 1/2}$
o Compare to the spectral radius condition: fast die-out for $\beta/\delta < m^{-1/2}$
o The two thresholds differ by $m^{a}$; same gap as for the star
PLRG, $2 < \gamma < 2.5$
o Consider the top $N$ nodes, for suitable $N$: Erdős-Rényi core, with isoperimetric constant $\eta = F(\gamma)\,\rho$
o Gap between the thresholds $1/\rho$ and $1/\eta$: constant factor, $F(\gamma)$
Open problems
o gap between upper and lower bounds in: sparse ER graphs; power-law random graphs for $\gamma < 2.5$
o spectral radius bound tight in examples; always true?
o conditioned on slow recovery, how many nodes are down at intermediate times?
o extensions to other graphs and to SIR epidemics
Observations
o neither parameter tight
o gap for topologies with diverse degrees; spectral radius "seems" to be right
o nothing between $\log n$ and $\exp(n^{a})$?
Hitlist, routing worms
o hitlist worm: increases $I(0)$
o routing worm: decreases the scanned address space $\Omega$
o BGP table information: $\Omega = 0.29 \times 2^{32}$ (29% of IP address space)
o /8 aggregation: $\Omega = 0.45 \times 2^{32}$ (116 out of 256 possible 8-bit prefixes)
The appearance of phase transitions
o $N$ = 200, $k_s$ = 1, $k_l$ = 0.01
o Mean time to absorption goes down from $10^{47}$ to about 0 within a few states
Accuracy of fluid model
o population: 360,000
o scan rate: normally distributed, $N(358/\text{min},\ 100^2)$
o scanning space: $2^{32}$
o $I(0)$ = 1
o 100 simulations
Accuracy of fluid model
o population: 360,000
o scan rate: normally distributed, $N(358/\text{min},\ 100^2)$
o scanning space: $2^{32}$
o $I(0)$ = 10
o 100 simulations
Local preference worm
o $\beta$: local scan rate
o $\beta'$: global scan rate
o initial conditions $I_k(0)$
Erdős-Rényi random graph
o edge between each pair of nodes present with probability $p_n$, independent of others
o sparse: $p_n = c \log(n)/n$, $c > 1$; then spectral radius $\rho \le c' \log(n)$ and isoperimetric constant $\eta \ge c'' \log(n)$ with high probability, for some $c'' < c < c'$
o dense: $d_n := n p_n = \Omega(\log n)$; then $\rho \sim \eta \sim d_n$ with high probability