Between Optimization and Enumeration (on Modeling and Computation)
30/May/2012, NII Shonan Meeting (open problem seminar)
Takeaki Uno, National Institute of Informatics & The Graduate University for Advanced Studies
Self Introduction, for Better Understanding
Takeaki Uno: National Institute of Informatics (an institute for activating joint research)
Research Area: Algorithms (data mining, genome science)
+ Enumeration, graph algorithms, pattern mining, similarity search (tree algorithms, dynamic programming, NP-hardness, reverse search, …)
+ Implementations for frequent itemset mining, string similarity search, …
+ Web systems for small-business optimization
+ Problem solver, or implementer, for research colleagues
Algorithm vs. Distributed Computation
For fast computation, both the "algorithm" and "distributed computation" are important. However, they sometimes conflict:
+ distributed algorithms sometimes do not build on the latest sequential algorithms, yet are still faster than them
+ new algorithms (or models) sometimes make existing distributed algorithms obsolete
We should find "fundamental problems/algorithms" that will not change in the future and that fit distributed computation.
Models between Optimization and Enumeration
Difference
☆ Optimization finds the best solution
☆ Enumeration finds all the solutions
They are at opposite poles, but both are search methods (branch and bound, A*, reverse search, …). From the viewpoint of parallel computation, the big difference is that "in optimization, search results affect the other search processes", e.g., branch and bound.
Effects on Other Processes
"In optimization, search results affect the other search processes." In branch and bound, when a good solution is found, some branches in the other search processes can be pruned (the objects to be found, namely solutions better than the current one, change). In enumeration this does not occur: every process always finds "all solutions". So enumeration seems well suited to parallel computation.
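To make the contrast concrete, here is a minimal sketch on a toy 0/1 knapsack (my own illustration, not from the talk): in branch and bound, the shared incumbent found by one branch prunes other branches, while in enumeration every branch is explored regardless of what the others found.

```python
# Illustrative sketch only: toy 0/1 knapsack, items = [(value, weight), ...].

def branch_and_bound(items, capacity):
    """Returns the best achievable value."""
    items = sorted(items, key=lambda vw: vw[0] / vw[1], reverse=True)
    best = [0]  # shared incumbent: a solution found in any branch prunes the others

    def upper_bound(i, value, room):
        # optimistic (fractional) bound over the remaining, density-sorted items
        for v, w in items[i:]:
            if w <= room:
                value, room = value + v, room - w
            else:
                return value + v * room / w
        return value

    def search(i, value, room):
        if i == len(items):
            best[0] = max(best[0], value)
            return
        if upper_bound(i, value, room) <= best[0]:
            return  # pruned because of a result found by some other branch
        v, w = items[i]
        if w <= room:
            search(i + 1, value + v, room - w)   # take item i
        search(i + 1, value, room)               # skip item i

    search(0, 0, capacity)
    return best[0]


def enumerate_feasible(items, capacity, i=0, chosen=()):
    """Enumeration: yields every feasible subset; branches never interact."""
    if i == len(items):
        yield chosen
        return
    v, w = items[i]
    if w <= capacity:
        yield from enumerate_feasible(items, capacity - w, i + 1, chosen + (i,))
    yield from enumerate_feasible(items, capacity, i + 1, chosen)
```

The only shared state in the first function is the incumbent `best`; the second function has none, which is why its branches can be distributed freely.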
Difficulty in Enumeration
However, enumeration has a different difficulty. In practice, we usually want to enumerate simple structures (easy to find with light computation):
+ computational cost ≈ problem size, so loading the problem is the bottleneck
+ each problem has many solutions, so writing the output to a common disk is also a bottleneck
Increasing the complexity of iterations / reducing the number of solutions are important for modeling.
Modeling to What?
Recently, many real-world applications have enumeration-flavored problems (such as mining, recommendation, Web search, …). We want to get "many good solutions" (not the one best, not all). Actually, "many" and "good" are difficult to define mathematically. What should we do? Maybe there is no perfect solution.
Free from Mathematical Taste
There is no perfect solution, so what should we do? Let's get away from the well-defined mathematical world; mathematics has limitations for these kinds of problems. The objectives can hardly be modeled well, so we should address the fundamentals and enumerate candidates. The bottlenecks are big data and complicated structures, so let's focus on these points and think about models that have advantages for parallel computation. The candidates can be polished up by further modeling/computation.
Cluster Mining
Cluster mining is not clustering (partitioning the data), but finding many (potential) clusters. It is important for big data composed of many kinds of elements, where elements can belong to several groups.
+ Clusters are not well defined (case by case)
+ Finding the best cluster is far from sufficient
+ But we want to avoid enumerating all the possibilities (e.g., enumerating all cliques)
Consider models that have computational advantages.
Cluster Computation
There are many optimization methods to find one cluster, but they usually take a relatively long time (e.g., cubic in the size). Repeating such a method until the clusters (doubly) cover the data takes a long time. Finding cliques (dense subgraphs) generates many solutions. There have been some methods for reducing the numbers, e.g., enumerate all maximal cliques and then delete similar ones, but the comparison is quite heavy and is not a local task.
Possible New Approaches
Find maximal cliques (semi-optimization) that include relatively many edges not included in other (already obtained) cliques, or refine the graph so that the number of cliques becomes small, in parallel:
+ treat a (clique minus a few edges) as a clique
+ delete isolated edges
This involves simple but heavy (pairwise) comparison of neighbors.
On a Web Link Graph
Tested on Web link data of sites written in Japanese: 5.5M nodes, 13M edges. The clique (biclique) enumeration problem has already been studied for Web link graphs. However, there are so many solutions on this graph that we cannot finish within a few days (cliques are found at up to 10,000/sec). So we do graph cleaning.
Graph Cleaning
Find all pairs of vertices s.t. #common neighbors > 20, and make a new graph whose edges are these pairs.
★ The computation is not heavy: 20 min. by a single thread.
Then enumerate all maximal cliques of size > 20: #cliques is reduced to 8,000, done in a minute. Graph cleaning can reduce #cliques and the computation cost.
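A rough single-thread sketch of this cleaning step as described above, assuming networkx is available and node labels are comparable (e.g., integers). The thresholds come from the slide; this is only illustrative and would not scale as written to the 5.5M-node graph.

```python
# Sketch of graph cleaning: keep only vertex pairs with many common neighbors,
# then enumerate large maximal cliques of the cleaned graph.
from collections import Counter
from itertools import combinations
import networkx as nx

def clean_graph(G, threshold=20):
    # Every vertex v contributes 1 to each pair of its neighbors,
    # so pair (u, w) ends up with count |N(u) ∩ N(w)|.
    common = Counter()
    for v in G:
        for u, w in combinations(sorted(G[v]), 2):
            common[(u, w)] += 1
    H = nx.Graph()
    H.add_edges_from(pair for pair, c in common.items() if c > threshold)
    return H

def large_maximal_cliques(H, min_size=20):
    # Enumerate maximal cliques of the cleaned graph and keep the large ones.
    return [c for c in nx.find_cliques(H) if len(c) > min_size]
```

The per-vertex neighbor-pair counting is exactly the "simple but heavy pairwise comparison of neighbors" mentioned earlier, and it is local to each vertex, which is what makes it attractive to parallelize.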
Conclusion (1)
Enumeration is a good problem to be parallelized. However, the huge number of solutions is the bottleneck. By shifting the problem slightly, or introducing some additional information processing, we can obtain a good problem with a small number of solutions. Then the parallelization of the enumeration, or of the additional process, becomes interesting.
GPGPU-type Parallelization of Optimization/Enumeration
Styles in Parallelization
Usually, the whole algorithm is parallelized. For GPGPU (or SIMD), whole algorithms are too complicated, so either address simple problems (the majority approach) or address subproblems (a minority approach, of course).
Why Minority?
Why is "subproblem research" a minority? Because it is narrow in scope, and its efficiency is limited (only a small part of the whole computation). If these objections did not hold, subproblem research would be valuable.
☆ narrow scope: find subproblems common to many algorithms
☆ limited efficiency: design the original algorithm so that the subproblem is the heaviest part (99.99%), of course without losing optimality of the computational cost
Similar Short String Pair Enumeration
Problem: For a given set S of strings of fixed length l, find all pairs of strings whose Hamming distance is at most d.
To compare long strings, we set S to (all) short substrings of length l, solve the problem, and obtain the similar long substrings. Also applicable to local similarity detection, such as assembly and mapping.
Input: ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, …
Output pairs: ATGCCGCG & AAGCCGCC, GCCTCTAT & GCTTCTAA, TGTAATGA & GGTAATGG, …
Application to LSH
LSH (locality-sensitive hashing) maps each record to a 0/1 bit; similar records get the same bit with high probability. We then have to compare the records having the same LSH bit. To reduce the number of candidates to be compared, we combine many LSHs into bit vectors (up to 1000 bits in total). By introducing Hamming distance on these bit vectors, we can reduce the number of bits needed.
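A hedged sketch of this candidate-generation idea, using SimHash-style random hyperplanes on numeric vectors; the bit count and distance threshold below are illustrative choices of mine, not the parameters of the talk.

```python
# Map each record to a bit signature; records whose signatures are within
# Hamming distance d become candidate pairs for exact comparison.
import numpy as np

def signatures(X, n_bits=64, seed=0):
    # X: (n_records, n_features) array; one 0/1 bit per random hyperplane
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ planes > 0).astype(np.uint8)

def candidate_pairs(sigs, d=3):
    # Naive quadratic filter; in practice the Hamming-distance pairs would be
    # found with the block-combination method of the following slides.
    n = sigs.shape[0]
    for i in range(n):
        diff = (sigs[i] != sigs[i + 1:]).sum(axis=1)
        for j in np.nonzero(diff <= d)[0]:
            yield i, i + 1 + j
```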
Existing Approaches
Actually, this is a new problem, but…
Similarity search: construct a data structure to find substrings similar to a query string; it is difficult to develop a fast and complete method, and there are very many queries, each taking non-negligible time.
Homology search algorithms for genomic strings (BLAST): find small exact matches (identical substring pairs of 11 letters) and extend them as long as the string pair stays similar; we find very many exact matches, and increasing the length "11" decreases the accuracy. Usually, frequently appearing strings are ignored.
Block Combination
Consider a partition of each string into k (> d) blocks. If two strings are similar (Hamming distance at most d), they share at least k−d identical blocks, so for each string, the candidates for similar strings are those sharing at least k−d blocks. For every combination of k−d blocks, we find the records having the same blocks (exact search). We have to do this C(k, d) times, but that is not so many.
An Example
Find all pairs with Hamming distance at most 1 from ABCDE, ABDDE, ADCDE, CDEFG, CDEFF, CDEGG, AAGAB. Partition each string into k = 3 blocks; for each of the C(3, 2) = 3 choices of two blocks, group the strings that agree on those blocks and compare only the strings within the same group.
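A small sketch of the block-combination method applied to this example, assuming roughly equal block lengths; the function and variable names are my own.

```python
# Partition each string into k blocks; for every choice of k-d blocks, bucket
# the strings by those blocks, then verify pairs exactly within each bucket.
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def similar_pairs(strings, k=3, d=1):
    l = len(strings[0])
    cuts = [round(i * l / k) for i in range(k + 1)]  # k roughly equal blocks

    def blocks(s):
        return tuple(s[cuts[i]:cuts[i + 1]] for i in range(k))

    found = set()
    for keep in combinations(range(k), k - d):       # C(k, d) combinations
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            b = blocks(s)
            buckets[tuple(b[i] for i in keep)].append(idx)
        for group in buckets.values():               # verify inside each bucket
            for i, j in combinations(group, 2):
                if hamming(strings[i], strings[j]) <= d:
                    found.add((min(i, j), max(i, j)))
    return sorted(found)

# Example from the slide: pairs with Hamming distance at most 1
print(similar_pairs(["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]))
```

Any pair within distance d agrees on at least k−d blocks, so it lands in the same bucket for at least one of the C(k, d) combinations; the buckets are small, so the exact verification stays cheap.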
Comparison: Human/Mouse Genomes
Comparison of the mouse X and human X chromosomes (150MB each, allowing 10% error): 15 min. on a PC (dot plot: mouse X chr. vs. human X chr.). Note: BLASTZ takes 2 weeks; MURASAKI takes 2–3 hours with 1% error.
GPGPU Parallelization
Strings are partitioned into many small groups, and the pairwise comparison within each group is easy to GPGPU-parallelize! Reducing the number of blocks makes the groups larger, so the pairwise comparison (of a good size, repeated C(k, d) times) becomes the bottleneck.
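A sketch of the per-group kernel written as dense array operations (numpy below, standing in for a real GPU kernel) to show why it suits SIMD/GPGPU: each group is a small block of fixed-length strings, and the all-pairs Hamming computation on it is regular, branch-free work.

```python
# All-pairs Hamming distances inside one group (ASCII, equal-length strings).
import numpy as np

def group_pairs(strings, d):
    A = np.frombuffer("".join(strings).encode(), dtype=np.uint8)
    A = A.reshape(len(strings), -1)                  # one row per string
    dist = (A[:, None, :] != A[None, :, :]).sum(-1)  # all-pairs Hamming
    i, j = np.nonzero(np.triu(dist <= d, k=1))       # keep each pair once
    return list(zip(i.tolist(), j.tolist()))
```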
Other "Subproblems"
+ Extracting the induced subgraph for a vertex set
+ Finding (some) maximal clique
+ Finding (some) path
+ Augmenting a matching
+ Majority voting
+ Frequency counting for sequences/graphs
+ Range search
+ All intersections of line segments
+ Convex hull
+ Hough transformation
+ Dynamic programming
…
These are also important for optimization.
Abstract
Recent "big data" increases the necessity of fast computation. However, we may be seeing some limitations of the usual approaches, especially for enumeration and optimization. In this talk, we discuss a new research direction: algorithmic modeling. In algorithmic modeling, algorithms are designed to increase the possibility of models that allow fast computation, and models are designed with first priority on fast computation. We first discuss the difference between optimization and enumeration, and clarify why they make fast computation difficult. Then we consider new models, located between optimization and enumeration, that allow fast computation, by considering and developing fast algorithms that we can use.