Optimal k-mer superstrings for protein identification and DNA assay design. Nathan Edwards Center for Bioinformatics and Computational Biology University of Maryland, College Park
2 k-mer (Sub-)Problems Enumerate: For all (distinct) k-mers, do Existence: ...with respect to exact (& inexact) count ≥ x; Uniqueness: ...with respect to exact & inexact match; Near-neighbors: ...with respect to inexact match. Representation: Represent (distinct) k-mers for other tools; Fast annotation of k-mer counts on original sequences
3 Applications of k-mer sets Peptide Identification Represent all amino-acid 30-mers...that occur at least twice in human dbEST PCR Primer Design: Test DNA 20-mer primers for uniqueness What does it mean to be unique? DNA sequencing error / repeat detection Eliminate mers that are too rare or too frequent Pathogen signatures Near-neighbors imply potential false-positives
4 k-mer Superstring Problem Given: a set of sequences S = {S_1, ..., S_n} (sequence database) and a word size k. Find: a new set of sequences T = {T_1, ..., T_m} such that the total length of T is minimized, and T is complete and correct w.r.t. the k-mers of S
5 k-mer Superstring Problem Completeness All of the k-mers of S are represented Correctness No additional k-mers are present Minimize the total representation length Correlates with running time
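The completeness and correctness conditions above can be checked mechanically. The following is a minimal sketch, not the author's implementation: the function names are illustrative, and the word size k = 4 used with the example sequences from the later slides is an assumption, since the slides do not state it.

```python
def kmers(seqs, k):
    """Return all k-mers (with repeats) found in a collection of sequences."""
    return [s[i:i + k] for s in seqs for i in range(len(s) - k + 1)]

def check_enumeration(S, T, k):
    """Report whether T is complete, correct, and compact w.r.t. the k-mers of S."""
    source = set(kmers(S, k))
    out = kmers(T, k)
    return {
        "complete": source <= set(out),          # every k-mer of S appears in T
        "correct": set(out) <= source,           # T introduces no new k-mers
        "compact": len(out) == len(set(out)),    # no k-mer appears twice in T
        "total_length": sum(len(t) for t in T),
    }

S = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
T = ["ACDEFGEFGI", "DEFACG"]
result = check_enumeration(S, T, k=4)  # k = 4 is an assumed word size
```

Under that assumed k, the two output sequences use 16 characters in place of the 23 characters of the input.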
6 Shortest (common) superstring problem General strings (arbitrary length) Single output string Completeness for input sequences only Classical NP-hard problem Garey and Johnson Approximate within ~ 2.5*OPT Max-SNP hard One of the first algorithmic approaches to genome assembly
7 de Bruijn Sequences de Bruijn sequences represent all words of length k from some alphabet A. A = {0,1}, k = 3: s = A = {0,1}, k = 4: s =
8 de Bruijn Graph: A = {0,1}, k =
9 de Bruijn Sequences & Graphs de Bruijn graphs (k,A): Edges represent length k words from A Each node has in degree |A| out degree |A| Eulerian tour constructs de Bruijn sequence.
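The Eulerian-tour construction can be sketched as follows. This is an illustrative version using Hierholzer's algorithm, which the slides do not name; nodes are taken to be the (k-1)-mers, and `de_bruijn` is a hypothetical function name.

```python
from itertools import product

def de_bruijn(alphabet, k):
    """Cyclic de Bruijn sequence built from an Eulerian tour.

    Nodes are (k-1)-mers; the edge u -> (u + a)[1:] spells the k-mer u + a.
    """
    if k == 1:
        return "".join(alphabet)
    # Unused outgoing edge labels for every node.
    out = {"".join(p): list(alphabet) for p in product(alphabet, repeat=k - 1)}
    start = alphabet[0] * (k - 1)
    stack, tour = [start], []
    while stack:
        u = stack[-1]
        if out[u]:
            a = out[u].pop()
            stack.append((u + a)[1:])
        else:
            tour.append(stack.pop())
    tour.reverse()  # closed walk of nodes: tour[0] == tour[-1]
    # Each step of the walk contributes the last symbol of the node it enters.
    return "".join(node[-1] for node in tour[1:])

s3 = de_bruijn("01", 3)
s4 = de_bruijn("01", 4)
```

The result is cyclic: reading it with wrap-around yields each word of length k exactly once, so its length is |A|^k.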
10 Sequencing-by- Hybridization-graph ACDEFGI, ACDEFACG, DEFGEFGI
11 Compressed SBH-graph ACDEFGI, ACDEFACG, DEFGEFGI
12 Sequence Databases & CSBH-graphs Original sequences correspond to paths ACDEFGI, ACDEFACG, DEFGEFGI
13 C³ Enumeration Complete: All k-mers are present. Correct: No other k-mers are present. Compact: No k-mer is present more than once
14 Correct, Complete, Compact (C³) Enumeration Set of paths that use each edge exactly once: ACDEFGEFGI, DEFACG
15 Correct, Complete (C²) Enumeration Set of paths that use each edge at least once: ACDEFGEFGI, DEFACG
16 Patching the CSBH-graph Use artificial edges to fix unbalanced nodes
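One way to picture the patching step is sketched below. It is an illustration under stated assumptions, not the talk's algorithm: it works on the uncompressed SBH-graph, routes all artificial edges through a single hub node written "$", assumes the graph is connected with at least one unbalanced node, and assumes k = 4 for the example sequences.

```python
from collections import defaultdict

def c3_enumeration(seqs, k, hub="$"):
    """Cover every distinct k-mer exactly once with a set of paths."""
    edges = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    out = defaultdict(list)
    balance = defaultdict(int)            # #in - #out for each node
    for e in sorted(edges):
        u, v = e[:-1], e[1:]
        out[u].append(v)
        balance[v] += 1
        balance[u] -= 1
    for node, b in list(balance.items()):
        if b > 0:                         # surplus in-degree: a path ends here
            out[node].extend([hub] * b)
        elif b < 0:                       # surplus out-degree: a path starts here
            out[hub].extend([node] * -b)
    # Hierholzer's algorithm: one Eulerian circuit through the hub.
    stack, tour = [hub], []
    while stack:
        u = stack[-1]
        if out[u]:
            stack.append(out[u].pop())
        else:
            tour.append(stack.pop())
    tour.reverse()
    # Cut the circuit at the hub and spell each piece as a sequence.
    paths, current = [], []
    for node in tour[1:]:
        if node == hub:
            paths.append(current[0] + "".join(n[-1] for n in current[1:]))
            current = []
        else:
            current.append(node)
    return paths

paths = c3_enumeration(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
```

Each path with e edges spells e + k - 1 characters, so the total output length is the number of distinct k-mers plus k - 1 per path. Which two paths come back depends on traversal order; the count and total length do not.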
17 Patching the CSBH-graph Use matching-style formulations to choose artificial edges. Optimal C²/C³ enumeration in polynomial time. Chinese Postman Problem: Edmonds and Johnson, ’73. l-tuple DNA sequencing: Pevzner, ’89. Shortest (Common) Superstring: MAX-SNP-hard, 2.5 approx algorithm
18 Related work Chinese Postman Problem Undirected graph, weighted edges Shortest path that uses all the edges Solvable in polynomial time Construct minimum weighted matching between nodes of odd-degree Add matching to graph and find Eulerian path Minimize weight of extra edges used
19 C² Enumeration Chinese postman problem, except: Directed graph: add edges from nodes with surplus in-degree to nodes with surplus out-degree. Fixed cost teleportation option: can always “start” a new sequence. Find optimal set of additional edges: transportation problem / min cost flow instance
20 C³ Enumeration (figure labels: Cost: k; #in-#out)
21 Reusing Edges Sequences: ACDEHAC, ACDFHAC, ACDGHACD (figure labels: ACDHAC EHAC FHAC GHAC D)
22 Reusing Edges C³: ACDEHACDFHAC, ACDGHACD (figure labels: ACDHAC EHAC FHAC GHAC D $ACD)
23 Reusing Edges C²: ACDEHACDFHACDGHAC (figure labels: ACDHAC EHAC FHAC GHAC D D)
24 C² Enumeration “Shortcut paths” (figure label: #in-#out)
25 C³ Enumeration (figure labels: #in-#out; Cost: k; 0; 0; Cost: 0)
26 Sample Preparation for Peptide Identification Enzymatic Digest and Fractionation
27 Single Stage MS (figure: MS spectrum; axis: m/z)
28 Tandem Mass Spectrometry (MS/MS) Precursor selection (figure axis: m/z)
29 Tandem Mass Spectrometry (MS/MS) Precursor selection + collision induced dissociation (CID) (figure: MS/MS spectrum; axis: m/z)
30 Peptide Identification For each (likely) peptide sequence 1. Compute fragment masses 2. Compare with spectrum 3. Retain those that match well Peptide sequences from protein sequence databases Swiss-Prot, IPI, NCBI’s nr,... Automated, high-throughput peptide identification in complex mixtures
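The three steps can be illustrated with a toy scorer. This is a sketch only: it uses standard monoisotopic residue masses, singly charged b- and y-ions, and a simple shared-peak count, which is an assumed stand-in and not the scoring function of any particular search engine. All function names and the tolerance and threshold defaults are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728

def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for each backbone cleavage."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b + y

def match_score(peptide, peaks, tol=0.5):
    """Step 2: count theoretical fragments within tol m/z of an observed peak."""
    return sum(any(abs(f - p) <= tol for p in peaks)
               for f in fragment_masses(peptide))

def identify(candidates, peaks, min_score=3):
    """Step 3: retain candidates that match well, best first."""
    scored = [(match_score(pep, peaks), pep) for pep in candidates]
    return sorted((sp for sp in scored if sp[0] >= min_score), reverse=True)

# Hypothetical spectrum: the exact fragments of one candidate peptide.
peaks = fragment_masses("PEPTIDE")
hits = identify(["PEPTIDE", "GGGGGGG"], peaks)
```

The cost of the outer loop grows with the number of candidate peptides, which is why the size of the sequence database matters for running time.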
31 Novel Splice Isoform Human Jurkat leukemia cell-line Lipid-raft extraction protocol, targeting T cells von Haller, et al. MCP LIME1 gene: LCK interacting transmembrane adaptor 1 LCK gene: Leukocyte-specific protein tyrosine kinase Proto-oncogene Chromosomal aberration involving LCK in leukemias. Multiple significant peptide identifications
32 Novel Splice Isoform
33 Novel Splice Isoform
34 Novel Mutation HUPO Plasma Proteome Project Pooled samples from 10 male & 10 female healthy Chinese subjects Plasma/EDTA sample protocol Li, et al. Proteomics (Lab 29) TTR gene Transthyretin (pre-albumin) Defects in TTR are a cause of amyloidosis. Familial amyloidotic polyneuropathy late-onset, dominant inheritance
35 Novel Mutation Ala2→Pro associated with familial amyloid polyneuropathy
36 Novel Mutation
37 Compressed EST Peptide Sequence Database For all ESTs mapped to a UniGene gene: six-frame translation; eliminate ORFs < 30 amino-acids; eliminate amino-acid 30-mers observed once; compress to C² FASTA database. Complete, Correct for amino-acid 30-mers. Inclusive gene-centric peptide sequence database: Size: < 3% of naïve enumeration, FASTA entries. Running time: ~ 1% of naïve enumeration search. E-values: ~ 2% of naïve enumeration search results
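The filtering steps before compression might look roughly like the sketch below. It rests on several assumptions the slides leave open: the standard genetic code, an "ORF" taken as any stretch between stop codons, and 30-mer counts taken over all translations rather than per EST. The function names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frame(dna):
    """Translate both strands in all three frames ('X' marks an unknown codon)."""
    for strand in (dna, dna.translate(COMPLEMENT)[::-1]):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            yield "".join(CODON.get(c, "X") for c in codons)

def repeated_kmers(ests, k=30, min_orf=30, min_count=2):
    """Peptide k-mers from ORFs of length >= min_orf seen at least min_count times."""
    counts = Counter()
    for est in ests:
        for protein in six_frame(est):
            for orf in protein.split("*"):       # '*' is a stop codon
                if len(orf) >= min_orf:
                    counts.update(orf[i:i + k] for i in range(len(orf) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}

# Tiny illustration with shortened thresholds (k = 2, ORFs of length >= 2).
kept = repeated_kmers(["ATGGCCAAA", "ATGGCCAAA"], k=2, min_orf=2)
```

The surviving k-mer set is what the C² superstring construction would then compress into FASTA entries.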
39 Sequence Databases & CSBH-graphs All k-mers represented by an edge have the same count
40 CSBH-graph subgraphs Quickly determine those that occur twice
41 k-mer (Sub-)Problems Enumerate: For all (distinct) k-mers, do Existence: ...with exact (& inexact) count ≥ x; Uniqueness: ...exact & inexact match; Near-neighbors: ...inexact match. Representation: Represent (distinct) k-mers for other tools; Fast annotation of k-mer counts on original sequences
42 Large scale instances! CSBH-graph instances: partition set of all k-mers, determine non-trivial nodes; days on Condor grid (250 CPUs) to construct; ≥ 100,000,000 nodes and edges (sparse & dense). Min-cost flow instances: ≥ 500,000 nodes and edges. Algorithms must be linear in problem size. Out-of-core Eulerian path algorithm? Currently testing out-of-core connected-components
43 Grid computing Heterogeneous machines: varying disk/memory/MHz/cores capabilities. Centralized scheduler: jobs started asynchronously; other jobs may preempt current job. Input files may need to be staged: 250 simultaneous requests for a 3Gb file? How to guarantee integrity of input files? Problem decomposition may be non-trivial: job sizes need to fit the least capable machine; sometimes need to “game” the scheduler. Need to ensure the integrity of job output
44 Uniqueness Oracles Oracle for uniqueness of 20-mers in the Human genome (size: 3Gb) Count occurrences in the genome: 0,1,2+ Construct 20-mer superstring for 20-mers with count 1 Construct 20-mer superstring for 20-mers with count > 1 Easy(-ish) for exact sequence match: O(n) Fast automata, hash tables, suffix trees.
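For exact match, the oracle reduces to counting. The sketch below uses an in-memory hash table (one of the options the slide lists), which is only an illustration: it would not fit a 3Gb genome as written, and the function names are hypothetical.

```python
from collections import Counter

def count_kmers(genome, k):
    """Occurrence count of every k-mer in the genome (single linear pass)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

def uniqueness_oracle(genome, k=20):
    """Return a function mapping a k-mer to 0, 1, or 2 (meaning 2 or more)."""
    counts = count_kmers(genome, k)
    return lambda kmer: min(counts.get(kmer, 0), 2)

def split_by_uniqueness(genome, k=20):
    """The two k-mer sets that the two superstrings would represent."""
    counts = count_kmers(genome, k)
    unique = {m for m, c in counts.items() if c == 1}
    return unique, set(counts) - unique

oracle = uniqueness_oracle("ACGTACGA", k=3)
unique, repeated = split_by_uniqueness("ACGTACGA", k=3)
```

A k-mer absent from both sets has count 0, so two superstrings are enough to answer all three cases.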
45 Polymerase Chain Reaction
46 Polymerase Chain Reaction
47 Inexact sequence match Inexact sequence matching: O(n*m*k). Errors/Mismatches (k): 1, 2, 3. # distinct 20-mers (m): O(n). Achieve expected linear time using a hybrid approach (blastn): exact search for short chunks of primers; expensive alignment only where chunks match. Large chunks ⇒ fast, but miss occurrences. Small chunks ⇒ slow, find all matches
48 Inexact sequence match Baeza-Yates & Perleberg: correct and O(n) for small k. At least 1 chunk is observed with no error. Small k → large chunks → fast and correct. Form of locality sensitive hashing. (figure labels: q, g)
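The chunking argument can be sketched as a filter-then-verify search. This is an illustrative version restricted to mismatches (Hamming distance), assuming the primer is longer than k so that every chunk is non-empty; `find_with_mismatches` is a hypothetical name and plain `str.find` stands in for a faster exact matcher.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def find_with_mismatches(primer, genome, k):
    """All start positions where primer matches genome with at most k mismatches.

    Pigeonhole filter: cut the primer into k + 1 chunks; with at most k
    mismatches, at least one chunk must occur without error.
    """
    m = len(primer)
    bounds = [m * i // (k + 1) for i in range(k + 2)]
    hits = set()
    for lo, hi in zip(bounds, bounds[1:]):
        chunk = primer[lo:hi]
        pos = genome.find(chunk)
        while pos != -1:
            start = pos - lo                      # implied start of the primer
            if 0 <= start <= len(genome) - m:
                # Expensive check only where a chunk matched exactly.
                if hamming(primer, genome[start:start + m]) <= k:
                    hits.add(start)
            pos = genome.find(chunk, pos + 1)
    return sorted(hits)

positions = find_with_mismatches("ACGT", "ACGTTGCAACGTAGCTAGGATCCA", k=1)
```

Because some chunk is always error-free, the filter has no false negatives; smaller k gives longer chunks and fewer spurious candidates to verify.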
49 Locality Sensitive Hashing For each primer: store a (set of) hash(es) in hash-table At each position in the genome: look-up a (set of) hash(es) in hash-table if any hash is found, do more expensive check Need to weigh sensitivity (false negatives) vs specificity (false positives) Our application requires speed and no false negatives!
50 Random Projection Choose T templates of l random “care” positions (figure labels: q, g)
51 Random Projection Choose T templates of l random “care” positions (figure labels: t_1, g)
52 Random Projection Choose T templates of l random positions (figure labels: t_1, t_2, g)
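Random projection with a hash table can be sketched as below. The names (`random_templates`, `build_index`, `scan`) and the seed are illustrative assumptions. Note that randomly chosen templates give no guarantee against false negatives, which is the gap the seed-set design problem is meant to close.

```python
import random
from collections import defaultdict

def random_templates(m, l, T, seed=0):
    """Choose T templates, each a sorted tuple of l random care positions."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(m), l))) for _ in range(T)]

def project(mer, template):
    """Keep only the characters at the template's care positions."""
    return "".join(mer[i] for i in template)

def build_index(primers, templates):
    """For each primer, store one hash key per template."""
    index = defaultdict(set)
    for p in primers:
        for t_id, t in enumerate(templates):
            index[(t_id, project(p, t))].add(p)
    return index

def scan(genome, index, templates, m, k):
    """Yield (position, primer) pairs within k mismatches that a template detects."""
    for pos in range(len(genome) - m + 1):
        window = genome[pos:pos + m]
        candidates = set()
        for t_id, t in enumerate(templates):
            candidates |= index.get((t_id, project(window, t)), set())
        for p in candidates:                      # expensive check on hash hits only
            if sum(a != b for a, b in zip(p, window)) <= k:
                yield pos, p

primer = "ACGTACGTAC"
templates = random_templates(m=10, l=4, T=3)
index = build_index([primer], templates)
hits = list(scan("TT" + primer + "GG", index, templates, m=10, k=1))
```

An exact occurrence is always found, since every projection agrees; an occurrence with errors is found only if some template avoids all error positions.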
55 Gapped seed-set design problem Given: mer-size: m ( = 20 ) # errors: k ( = 1,2,3) # cares: l ( = 10,12,14 ) Find the smallest set of templates with no false negatives. Minimize running time.
56 Gapped seed set design formulation (for k = 2) Cover the edges of K_m with copies of K_{m-l}. How many triangles to cover K_6? (m = 6, k = 2, l = 3) Some instances of (m,2,m-3) cover each edge exactly once: Steiner triple systems
57 How many triangles cover K_6? 15 edges total. Is 5 triangles possible?
58 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO!
59 How many triangles cover K_6? 15 edges total. Is 5 triangles possible? NO! Each node requires 3 triangles (5 incident edges, at most 2 covered per triangle). Triangles must account for at least 18 “edges” (6 nodes × 3 triangles ÷ 3 nodes per triangle = 6 triangles), but 5 triangles supply only 15
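The counting argument can be confirmed by exhaustive search, since K_6 has only 20 triangles. A small sketch, with illustrative names:

```python
from itertools import combinations

nodes = range(6)
edges = set(combinations(nodes, 2))                     # the 15 edges of K_6
# Each triangle is represented by the set of 3 edges it covers.
triangles = [set(combinations(t, 2)) for t in combinations(nodes, 3)]

def covers(n):
    """True if some choice of n triangles covers every edge of K_6."""
    return any(set().union(*choice) == edges
               for choice in combinations(triangles, n))

smallest = next(n for n in range(1, len(triangles) + 1) if covers(n))
# smallest == 6: no 5 triangles suffice, and 6 do
```

One covering of size 6 is 012, 013, 045, 145, 234, 235, in which the edges 01, 23, and 45 are each covered twice.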
61 Gapped seed set design formulation #2 Set cover instance: Ground set: all possible placements of the k errors (alignments) Covering sets: all possible placements of the l care positions For (m=20,k=2,l=10), 190 elements, 184,756 sets! Greedy approximation algorithm works
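The set cover instance and a greedy heuristic for it can be sketched as follows. The function name is hypothetical, and the greedy rule shown (pick the template that handles the most uncovered error placements) is the textbook one, assumed here rather than taken from the talk. It is run only on the small (6,2,3) instance; the (20,2,10) sizes are computed directly.

```python
from itertools import combinations
from math import comb

def greedy_templates(m, k, l):
    """Greedy set cover over error placements; returns the chosen templates."""
    if m - l < k:
        raise ValueError("no template can avoid k errors when m - l < k")
    uncovered = set(combinations(range(m), k))           # placements of the k errors
    templates = [frozenset(t) for t in combinations(range(m), l)]
    chosen = []
    while uncovered:
        # A template handles an error placement if no error hits a care position.
        best = max(templates,
                   key=lambda t: sum(t.isdisjoint(e) for e in uncovered))
        chosen.append(sorted(best))
        uncovered = {e for e in uncovered if not best.isdisjoint(e)}
    return chosen

sizes = (comb(20, 2), comb(20, 10))   # (190, 184756)
small = greedy_templates(6, 2, 3)
```

For (6,2,3) each template's three don't-care positions form a triangle on the error pairs, so this is the K_6 covering question again and at least 6 templates are needed.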
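62 Gapped seed set design formulation #3 Bipartite graph: template nodes on one side, position nodes (m) on the other; each template is joined to its l care positions. Remove any k position nodes: at least 1 template must still have degree l.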
63 Gapped seed set design formulation #3 Polynomial size in terms of number of templates Select T in advance and test whether sufficient. Greedily add 1,2,3,... templates. Apply iteratively to achieve feasible solution
64 Solution for (20,2,10) Positions ********** t_1 ********** t_2 ***** ***** t_3 ***** ***** t_4 ***** ***** t_5 ********** t_6 Need > 4 templates, 6 is optimal
65 Remember the application! We are checking some templates twice! We compute hash(es) at each position in the genome Any template that is a shift of another will be computed at some nearby genomic position!
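The effect of shifts on the no-false-negatives test can be sketched as below. The model is an assumption made for illustration: a template is a set of care positions, a shift translates all of them, and every shift that keeps the template inside the m-mer is treated as available for free because hashes are computed at every genomic position. Function names are hypothetical.

```python
from itertools import combinations

def shifts(template, m):
    """All translates of a template that still fit inside positions 0..m-1."""
    lo, hi = min(template), max(template)
    return [frozenset(p + s for p in template) for s in range(-lo, m - hi)]

def no_false_negatives(templates, m, k, with_shifts=False):
    """True if every placement of k errors misses all care positions of some template."""
    if with_shifts:
        pool = [s for t in templates for s in shifts(t, m)]
    else:
        pool = [frozenset(t) for t in templates]
    return all(any(t.isdisjoint(e) for t in pool)
               for e in combinations(range(m), k))

# Toy instance (m = 4, k = 1, l = 2): one template fails alone, succeeds with shifts.
alone = no_false_negatives([{0, 1}], m=4, k=1)
shifted = no_false_negatives([{0, 1}], m=4, k=1, with_shifts=True)
```

Under this model a set of templates that are shifts of one another counts as a single template, which is why the template count can drop once shifts are considered.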
66 Solution for (20,2,10) Positions ********** t_1 ********** t_2 ***** ***** t_3 ***** ***** t_4 ***** ***** t_5 ********** t_6 Need at most 3 templates...can we do better?
67 Solution for (20,2,10) w/ shift Positions **** ** **** t_1 **** * ***** t_2 Optimal is 2 templates...
68 Gapped seed set design Solution strategies: Randomized algorithms. Greedy algorithm: directly to set cover instance; indirectly to bipartite instance. Integer programming: on set cover and bipartite instances; solution of greedy algorithm subproblem...in parallel, using COIN-OR SYMPHONY. Branch-and-bound enumeration: solution of greedy algorithm subproblem...in parallel, using COIN-OR ALPS library
69 What about edit-distance? Formulations can be generalized Similar solution strategies can be applied (All) symmetry lost! This may actually be helpful Much harder to solve Is greedy still good? Solutions typically require more templates
70 Uniqueness Oracles Integrated with CSBH-graph construction algorithm. Ensure edge-count property is preserved. Sequence database of unique / non-unique 20-mers for small genomes: D. melanogaster, up to edit-distance 2. Currently working to scale to human...
71 Other Projects / Interests HMMs for Peptide Spectrum Matching with UMd, CS Rapid Microorganism Identification Database Pathogen detection using Spectral Matching with USDA Locality sensitive hashing spectra, peptide sequence Statistical techniques statistical significance importance sampling CSBH-graph applications genome assembly Grid computing Web-applications Relational databases
72 Future Research Directions Extend k-mer superstring algorithms: range of word sizes, variable length words; other sequence properties (tryptic peptides, T_m). Identification of protein isoforms: optimize proteomics workflow for isoform detection; identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples. Aggressive peptide sequence enumeration: dbPep for genomic annotation. Open, flexible informatics infrastructure for peptide identification
73 Future Research Directions Proteomics for Microorganism Identification Specificity of tandem mass spectra Revamp RMIDb prototype Incorporate spectral matching Primer design Uniqueness oracle for inexact match in human Integration with Primer3 Tiling, multiplexing, pooling, & tag arrays
74 Acknowledgements Chau-Wen Tseng, Xue Wu UMCP Computer Science Catherine Fenselau, Steve Swatkoski UMCP Biochemistry Calibrant Biosystems PeptideAtlas, HUPO PPP, X!Tandem Funding: National Cancer Institute