Mapping Influenza A Virus Transmission Networks with Whole Genome Comparisons (Methods) Adrienne Breland TTGTGGATTCTTGATCGTCTTTTCTTCAAATGTAT TTATCGTCGCCTTAAATACGGA AAGAACCTTTATGACAAGGTTCGACTACA GCTTAGGGATAATGCAAAGGAGCTGGT
Goal: to characterize global Influenza A Virus transmission as a complex network
[Figure: Proposed global H3N2 circulation] Russell (2008) The global circulation of seasonal influenza A (H3N2) viruses
Outline: Motivation, Major Questions, Data, Genome Comparison Method
Motivation: Delineating real disease networks is difficult
– Infection tracing: detecting exact transmission links
– Contact tracing: all potential transmission contacts
– Diary based: subject records all contacts
Motivation: [Figure panels: infection tracing, contact tracing, diary based] Keeling M & K Eames (2005) Networks and epidemic models. J. R. Soc. Interface 2:
Motivation: Delineating real disease networks is very useful
– Targeting an attack
Réka Albert, Hawoong Jeong and Albert-László Barabási. Error and attack tolerance of complex networks.
[Figure: al_network.png]
Motivation: Delineating real disease networks is very useful
– Correlation coefficients
Motivation: Delineating real disease networks is very useful
– Detecting more probable global routes
Motivation: Global routes
Breland A, S Nasser, K Schlauch, M Nicolescu, F Harris (2008) Efficient Influenza A Virus Origin Detection. Journal of Electronics and Computer Science, 10:1-12
Motivation: Delineating real disease networks is very useful
– Examining alongside other spatial data
Motivation: Spatial data
– Vegetation
– Population
– Climate change
Major questions
– Location and degree of host jumping
– Underlying structure (small world, power law, ...)
– Subtype independence
– Re-assortment
– Geographic routes
Data
– ≈ 4000 sequences
– Global regions (e.g., China, U.S., Africa, India, ...)
– All subtypes (e.g., H5N1, H1N1, ...)
– All host species (domestic avian, wild avian, etc.)
Data: ≈ 374 sequences per year
Data: Multiple host types
Data: Multiple subtypes
Genome Comparisons: Similarity matrix, N sequences: N(N-1)/2 comparisons
[Figure: N × N matrix with rows and columns indexed 1, 2, ..., N]
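A minimal sketch of how the N(N-1)/2 count arises when filling a symmetric similarity matrix. The function names and the `mismatch_fraction` placeholder distance are illustrative assumptions, not the method used in this work. If the ≈ 4000 sequences from the Data slide are all compared against each other, the count is 4000 × 3999 / 2 = 7,998,000 comparisons.

```python
from itertools import combinations


def mismatch_fraction(a, b):
    """Placeholder distance (assumption): fraction of differing positions."""
    length = max(len(a), len(b))
    if length == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    return mismatches / length


def pairwise_matrix(seqs, distance):
    """Fill a symmetric N x N matrix using one comparison per unordered pair."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    comparisons = 0
    for i, j in combinations(range(n), 2):
        d = distance(seqs[i], seqs[j])
        matrix[i][j] = matrix[j][i] = d  # symmetry: compute each pair once
        comparisons += 1
    return matrix, comparisons


seqs = ["ACGT", "ACGA", "TCGA", "ACGT"]
matrix, comparisons = pairwise_matrix(seqs, mismatch_fraction)
print(comparisons)  # N(N-1)/2 for N = 4
```

Only the upper triangle is computed; the lower triangle is mirrored, which is what halves the work relative to N² comparisons.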
Romanova,J (2006) The fight against new types of influenza virus. Biotechnology J, 1:
Genome Comparisons: 8 segments
[Figure: one similarity matrix, indexed 1, 2, ..., N, per segment]
HA ≈ 1750 bp, NS ≈ 900 bp, M ≈ 1000 bp, NA ≈ 1300 bp, NP ≈ 1500 bp, PA ≈ 2100 bp, PB1 ≈ 2200 bp, PB2 ≈ 2300 bp
Genome Comparisons: Alignment, O(n^2), n = max sequence length
.....AAAACTTGAACC
GGACTTGACCT.....
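To illustrate where the quadratic cost comes from, here is a sketch of a global alignment score computed by dynamic programming. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; the slides do not specify an alignment algorithm or scoring scheme.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score; fills a len(a) x len(b) table, hence O(n^2)."""
    # prev[j] holds the best score for aligning a[:i-1] with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]


# The two fragments shown on the slide
print(global_alignment_score("AAAACTTGAACC", "GGACTTGACCT"))
```

Every cell of the table is visited once, so two sequences of length n need on the order of n² steps, and this is repeated for every pair of sequences.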
Genome Comparisons: Alignment-free, k-mers, O(n)
Σ = {A, C, G, T/U}; 4^k possible k-mers, k ≥ 0
[Table: k-word vs. frequency, with rows AA, AC, AG, ..., TG, TT]
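A short sketch of k-mer counting in a single pass over the sequence, which is what makes the approach linear in sequence length. The function name `count_kmers`, the conversion of U to T, and the skipping of words containing other characters are assumptions made for this example.

```python
from collections import Counter

ALPHABET = set("ACGT")


def count_kmers(seq, k):
    """Count overlapping k-mers with one sliding-window pass over seq."""
    if k < 1:
        raise ValueError("this sketch assumes k >= 1")
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alike (assumption)
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= ALPHABET:  # skip ambiguous bases such as N (assumption)
            counts[word] += 1
    return counts


counts = count_kmers("AACGAA", 2)
for word in sorted(counts):
    print(word, counts[word])
```

A sequence of length n contains n - k + 1 overlapping k-mers, so the table of k-word frequencies has at most 4^k rows regardless of n.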
Genome Comparisons: Feature Frequency Profiles (FFP)
$C_k = \langle c_{k,1}, \ldots, c_{k,K} \rangle$, the counts of each of the $K = 4^k$ possible k-mers
$F_k = C_k / \sum_i c_{k,i}$
Sims GE, Jun SR, Wu GA, Kim SH (2009) Alignment-free genome comparison with feature frequency profiles (FFP) and optimal resolutions. Proc Natl Acad Sci U S A., 106(8):
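As a sketch of the profile construction, the k-mer counts can be laid out as a fixed-length vector over all 4^k possible k-mers and normalized to sum to 1. The function name `ffp` and the lexicographic feature order are assumptions for this example, not details taken from the cited paper.

```python
from collections import Counter
from itertools import product


def ffp(seq, k):
    """Return k-mer counts normalized to relative frequencies (sum to 1)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed feature order so profiles of different sequences are comparable
    features = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[f] for f in features)
    if total == 0:
        raise ValueError("sequence contains no valid k-mers")
    return [counts[f] / total for f in features]


profile = ffp("AACGAA", 2)
print(len(profile), sum(profile))
```

Normalizing by the total count is what lets sequences of different lengths, such as the ≈ 900 bp NS and ≈ 2300 bp PB2 segments, be compared on the same scale.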
Genome Comparisons: Jensen-Shannon Divergence (JS), compare($s_1$, $s_2$)
$P_k = \mathrm{FFP}(s_1)$, $Q_k = \mathrm{FFP}(s_2)$, $M_k = (P_k + Q_k)/2$
$JS(P_k, Q_k) = \tfrac{1}{2} KL(P_k, M_k) + \tfrac{1}{2} KL(Q_k, M_k)$
$KL(P_k, M_k) = \sum_i p_{k,i} \log (p_{k,i} / m_{k,i})$
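The comparison step can be sketched as follows, where the midpoint profile M is the average of the two input profiles. The use of base-2 logarithms (which bounds the result between 0 and 1) and the function names are assumptions for this example.

```python
from math import log2


def kl_divergence(p, m):
    """Kullback-Leibler divergence; terms with p_i = 0 contribute nothing."""
    return sum(pi * log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two frequency profiles."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    # m_i > 0 whenever p_i > 0 or q_i > 0, so the KL terms never divide by zero
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # identical profiles
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # profiles with no shared k-mers
```

Unlike plain KL divergence, this measure is symmetric and stays finite when a k-mer is absent from one sequence, which suits a similarity matrix.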
Genome Comparisons: k = ?
k s.t. N(k) ≥ N(k+1); k ≈ 4
Genome Comparisons: Actual & Predicted times
Questions/Comments? Thanks