
Between Optimization and Enumeration (on Modeling and Computation)
30/May/2012 NII Shonan Meeting (open problem seminar)
Takeaki Uno, National Institute of Informatics & Graduate University for Advanced Studies

Self Introduction, for Better Understanding
Takeaki Uno: National Institute of Informatics (an institute for activating joint research)
Research area: algorithms (data mining, genome science)
+ enumeration, graph algorithms, pattern mining, similarity search (tree algorithms, dynamic programming, NP-hardness, reverse search, …)
+ implementations for frequent itemset mining, for string similarity search, …
+ Web systems for small business optimizations
+ problem solver, or implementer, for research colleagues

Algorithm vs. Distributed Computation
For fast computation, both the "algorithm" and "distributed computation" are important. However, they sometimes conflict:
+ a distributed algorithm sometimes doesn't match the latest sequential algorithm, but is still faster than it
+ a new algorithm (or model) can sometimes make the existing distributed algorithms vanish
We should find "fundamental problems/algorithms" that will not change in the future and that fit distributed computation.

Models between Optimization and Enumeration

Difference
☆ Optimization finds the best solution
☆ Enumeration finds all the solutions
 they are at opposite poles, but both are search methods (branch and bound, A*, reverse search, …)
From the viewpoint of parallel computation, the big difference is that "in optimization, search results affect the other search processes". ex) branch and bound

Effects on Others
"In optimization, search results affect the other search processes." In branch and bound, when a good solution is found, some branches in the other search processes can be pruned (the target of the search, solutions better than the current one, changes). In enumeration this does not occur:  every process always finds "all solutions". So enumeration seems good for parallel computation.
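The contrast above can be made concrete with a toy maximization: in branch and bound, the incumbent found in one branch prunes work in other branches, while plain enumeration would visit every node. A minimal sketch, assuming a knapsack-style instance (all names and numbers here are illustrative, not from the slides):

```python
# Toy branch and bound for a 0/1 knapsack-style maximization, showing how an
# incumbent found in one branch prunes other branches of the search.

def branch_and_bound(items, capacity):
    best = [0]      # incumbent value, shared across the whole search
    pruned = [0]    # number of branches cut off thanks to the incumbent

    def search(i, weight, value, remaining):
        if weight > capacity:
            return
        best[0] = max(best[0], value)
        if i == len(items):
            return
        # Optimistic bound: even taking every remaining item cannot beat
        # the incumbent, so this whole subtree can be skipped.
        if value + remaining <= best[0]:
            pruned[0] += 1
            return
        w, v = items[i]
        search(i + 1, weight + w, value + v, remaining - v)  # take item i
        search(i + 1, weight, value, remaining - v)          # skip item i

    search(0, 0, 0, sum(v for _, v in items))
    return best[0], pruned[0]

items = [(2, 3), (3, 4), (4, 5), (5, 6)]   # (weight, value) pairs
print(branch_and_bound(items, 7))
```

An enumeration variant of the same search would simply omit the bound test, so every process does the same work regardless of what the others have found; that independence is what makes enumeration pleasant to parallelize.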

Difficulty of Enumeration
However, enumeration has a difficulty of its own. In practice, we usually want to enumerate simple structures (easy to find with light computation):
+ computational cost ≈ problem size  loading the problem becomes the bottleneck
+ each problem has many solutions  output to a shared disk is also a bottleneck
Increasing the complexity of iterations, and reducing the number of solutions, are important for modeling.

Modeling to What?
Recently, many real-world applications pose enumeration-flavored problems (such as mining, recommendation, Web search, …). We want to get "many good solutions": not the single best one, and not all of them. Actually, "many" and "good" are difficult to define mathematically. What should we do? Perhaps there is no perfect solution.

Free from the Mathematical Taste
There is no perfect solution, so what should we do? Let's get away from the well-defined mathematical world:
 mathematics has limitations for problems of this kind
 the objectives can hardly be modeled well, so we should address the fundamentals and enumerate candidates
The bottlenecks are big data and complicated structures, so let's focus on these points:
 think about models having advantages for parallel computation; the candidates will be polished up by further modeling/computation

Cluster Mining
Cluster mining is not clustering (partitioning the data); it is finding many (potential) clusters. It is important in big data composed of many kinds of elements, where an element can belong to several groups.
+ clusters are not well defined (case by case)
+ finding the single best cluster is far from sufficient
+ but we want to avoid enumerating all the possibilities (ex., enumerating all cliques)
 consider models having computational advantages

Cluster Computation
There are many optimization methods to find a cluster, but they usually take a relatively long time (ex. cubic in the size). Repeating such a method until the clusters (doubly) cover the data takes long. Finding cliques (dense subgraphs) generates many solutions. There have been methods for reducing their number, ex) enumerate all maximal cliques, then delete similar ones
 but the comparison is quite heavy, and it is not a local task
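As a baseline for the "enumerate all maximal cliques" step mentioned above, here is a minimal Bron-Kerbosch sketch (the slides do not name an algorithm; this choice, and the toy graph, are assumptions for illustration):

```python
# Minimal Bron-Kerbosch maximal clique enumeration (no pivoting).
# R: current clique, P: candidates that extend R, X: already-processed vertices.

def maximal_cliques(adj):
    """adj: dict vertex -> set of neighbors. Yields each maximal clique once."""
    def expand(R, P, X):
        if not P and not X:
            yield sorted(R)          # R cannot be extended: it is maximal
            return
        for v in list(P):
            yield from expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)
    yield from expand(set(), set(adj), set())

# Two triangles sharing an edge: the maximal cliques are {0,1,2} and {1,2,3}.
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(sorted(maximal_cliques(adj)))   # -> [[0, 1, 2], [1, 2, 3]]
```

Even this simple enumerator illustrates the slide's point: on dense graphs the number of maximal cliques explodes, so post-filtering "similar" cliques requires heavy, non-local pairwise comparison.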

Possible New Approaches
Find maximal cliques (semi-optimally) that include relatively many edges not included in other (already obtained) cliques. …or refine the graph, in parallel, so that the number of cliques becomes small:
+ (clique − a few edges)  (clique)
+ isolated edges  delete
This involves a simple but heavy (pairwise) comparison of neighbors.

On a Web Link Graph
Tested on Web link data of sites written in Japanese: 5.5M nodes, 13M edges. The clique (biclique) enumeration problem has already been studied for Web link graphs. However, there are so many solutions in this graph that the computation cannot terminate within a few days (cliques are found at up to 10,000/sec). So we apply graph cleaning.

Graph Cleaning
Find all pairs of vertices with more than 20 common neighbors
 make a new graph whose edges are these pairs
★ the computation is not heavy: 20 min with a single thread
Enumerate all maximal cliques of size > 20 in the new graph
 the number of cliques is reduced to 8,000  done in a minute
Graph cleaning can reduce both the number of cliques and the computation cost.
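A minimal sketch of the cleaning step, assuming the graph is given as adjacency sets. The threshold and the toy graph below are illustrative; the slide uses a threshold of 20 on the 5.5M-node Web graph:

```python
# Graph cleaning: keep only vertex pairs with more than t common neighbors
# as the edges of a new, much sparser graph.
from itertools import combinations
from collections import defaultdict

def clean_graph(adj, t):
    """adj: dict vertex -> set of neighbors. Returns edges of the cleaned graph."""
    count = defaultdict(int)
    # Every pair of neighbors of v shares v as a common neighbor; summing over
    # all v gives |N(u) & N(w)| for each pair (u, w). This is exactly the
    # "simple but heavy pairwise comparison of neighbors" from the slides.
    for v, nbrs in adj.items():
        for u, w in combinations(sorted(nbrs), 2):
            count[(u, w)] += 1
    return {e for e, c in count.items() if c > t}

# Vertices 0 and 1 share three common neighbors (2, 3, 4); with t = 2 only
# the pair (0, 1) survives as an edge of the cleaned graph.
adj = {0: {2, 3, 4}, 1: {2, 3, 4}, 2: {0, 1}, 3: {0, 1}, 4: {0, 1}}
print(clean_graph(adj, 2))   # -> {(0, 1)}
```

The per-vertex pairwise counting is quadratic in the degree, which is why the slide notes it as heavy but well suited to parallelization.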

Conclusion (1)
Enumeration is a good problem to parallelize. However, the huge number of solutions is the bottleneck. By shifting the problem slightly, or by introducing some additional information processing, we can obtain a good problem with a small number of solutions. Then the parallelization of the enumeration, or of that additional processing, becomes interesting.

GPGPU-Type Parallelization of Optimization/Enumeration

Styles of Parallelization
Usually, the whole algorithm is parallelized. For GPGPU (or SIMD), however, whole algorithms are too complicated, so instead we can:
+ address simple problems (the majority approach)
+ address subproblems of larger algorithms (a minority approach, of course)

Why a Minority?
Why is "subproblem research" a minority approach?
 it is narrow in scope
 its efficiency gain is limited (only a small part of the whole computation)
If these objections did not hold, subproblem research would be valuable:
☆ narrow scope  find subproblems common to many algorithms
☆ limited efficiency  design the original algorithm so that the subproblem is the heaviest part (99.99%), of course without losing optimality of the computational cost

Similar Short String Pair Enumeration
Problem: for a given set S of strings of fixed length l, find all pairs of strings whose Hamming distance is at most d.
To compare long strings, we set S to (all) their short substrings of length l, solve the problem, and obtain similar non-short substrings. This is also applicable to local similarity detection, such as assembly and mapping.
ex) input: ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, …
 output: ATGCCGCG & AAGCCGCC, GCCTCTAT & GCTTCTAA, TGTAATGA & GGTAATGG, …
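As a reference point, the problem statement can be solved naively by comparing all pairs, which is quadratic in |S|; this is exactly the cost the block combination technique later avoids. A minimal sketch, with input strings adapted from the slide's example:

```python
# Naive baseline: compare every pair of fixed-length strings and report
# those within Hamming distance d. O(|S|^2 * l) time.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def similar_pairs_naive(strings, d):
    return [(s, t) for i, s in enumerate(strings)
            for t in strings[i + 1:] if hamming(s, t) <= d]

S = ["ATGCCGCG", "AAGCCGCC", "GCGTGTAC", "TGTAATGA", "GGTAATGG"]
print(similar_pairs_naive(S, 2))
# -> [('ATGCCGCG', 'AAGCCGCC'), ('TGTAATGA', 'GGTAATGG')]
```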

Application to LSH
LSH (locality-sensitive hashing) maps each record to a 0/1 bit:
 similar records receive the same bit with high probability
 we then only have to compare records having the same LSH bit
To reduce the number of candidates to be compared, many LSH functions are combined into bit vectors (up to 1000 bits in total)
 by introducing Hamming distance on these vectors, we can reduce the number of bits needed …
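One standard construction matching this description is bit sampling for Hamming space: each hash reads a few character positions, and strings colliding in some hash become comparison candidates. This construction and all its parameters are assumptions for illustration; the slides do not spell out the scheme they use:

```python
# Bit-sampling LSH sketch for Hamming space: each of num_hashes tables keys
# strings by a few randomly chosen positions; strings that collide in any
# table become candidate pairs for exact comparison.
import random

def make_lsh(length, positions_per_hash, num_hashes, seed=0):
    rng = random.Random(seed)
    tables = [tuple(rng.sample(range(length), positions_per_hash))
              for _ in range(num_hashes)]

    def candidates(strings):
        pairs = set()
        for pos in tables:
            buckets = {}
            for s in strings:
                buckets.setdefault(tuple(s[p] for p in pos), []).append(s)
            for group in buckets.values():
                for i in range(len(group)):
                    for j in range(i + 1, len(group)):
                        pairs.add((group[i], group[j]))
        return pairs

    return candidates

# "CCCCCCCC" differs from the A-strings at every position, so it can never
# collide with them; the two A-strings differ at one position only.
cand = make_lsh(8, 3, 20)(["AAAAAAAA", "AAAAAAAT", "CCCCCCCC"])
print(cand)
```

Strings within Hamming distance d collide in some table with high probability, so the exact comparison is restricted to a small candidate set, which is the point of the slide.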

Existing Approaches
Actually this is a new problem, but there is related work.
Similarity search: construct a data structure that finds substrings similar to a query string
 it is difficult to develop a fast and complete method
 there are many queries, each taking non-short time
Homology search for genomic strings (BLAST): find small exact matches (identical substring pairs of 11 letters), and extend them as long as the pair stays similar
 we find very many exact matches
 increasing the length "11" decreases the accuracy
 usually, frequently appearing strings are ignored
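The BLAST-style seed-and-extend idea described above can be sketched as follows. The seed length, the one-directional extension, and the mismatch-rate threshold are simplifying assumptions; real BLAST uses scored, bidirectional extension:

```python
# Toy seed-and-extend: index all exact k-mers of a, look up each k-mer of b,
# and extend each hit to the right while the mismatch rate stays low.

def seed_and_extend(a, b, k=4, max_mismatch_rate=0.1):
    """Report locally similar regions of a and b found from exact k-mer seeds."""
    seeds = {}
    for i in range(len(a) - k + 1):
        seeds.setdefault(a[i:i + k], []).append(i)
    hits = []
    for j in range(len(b) - k + 1):
        for i in seeds.get(b[j:j + k], []):
            ri, rj, mism = i + k, j + k, 0
            while ri < len(a) and rj < len(b):
                mism += a[ri] != b[rj]
                if mism > max_mismatch_rate * (ri - i + 1):
                    break
                ri, rj = ri + 1, rj + 1
            hits.append((i, j, ri - i))   # (start in a, start in b, length)
    return hits

# The only shared 4-mer is "ATGC" at position 2 of both strings; the
# extension stops at the first mismatch.
print(seed_and_extend("AAATGCA", "TTATGCC"))   # -> [(2, 2, 4)]
```

The sketch also exposes the drawbacks listed on the slide: frequent k-mers produce very many seed hits, and a longer k misses more true matches.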

Block Combination
Consider a partition of each string into k (> d) blocks.
 if two strings are similar, they share at least k−d identical blocks
 for each string, the candidate similar strings are those sharing at least k−d blocks
For each combination of k−d blocks, we find the records having the same blocks (exact search)
 this has to be done several times, but not so many

An Example
Find all pairs of Hamming distance at most 1 from ABC, ABD, ACC, EFG, FFG, AFG, GAB. (The accompanying figure works through the strings ABCDE, ABDDE, ADCDE, CDEFG, CDEFF, CDEGG, AAGAB, shown partitioned into blocks at each choice of block positions; strings that agree on the chosen blocks are grouped together as candidate pairs.)
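A compact implementation of the block combination, run on the seven strings from the figure. The block boundaries and the k = 2, d = 1 setting are illustrative choices:

```python
# Block combination: split each length-l string into k > d blocks. Two strings
# within Hamming distance d must agree on at least k-d blocks, so grouping by
# every (k-d)-block signature yields all candidate pairs, verified exactly.
from itertools import combinations
from collections import defaultdict

def similar_pairs(strings, d, k):
    """All pairs within Hamming distance d, via k-block signatures (k > d)."""
    l = len(strings[0])
    bounds = [l * i // k for i in range(k + 1)]          # block boundaries

    def blocks(s):
        return tuple(s[bounds[i]:bounds[i + 1]] for i in range(k))

    found = set()
    for keep in combinations(range(k), k - d):           # choose agreeing blocks
        buckets = defaultdict(list)
        for s in strings:
            b = blocks(s)
            buckets[tuple(b[i] for i in keep)].append(s)
        for group in buckets.values():                   # exact check on candidates
            for i, s in enumerate(group):
                for t in group[i + 1:]:
                    if sum(x != y for x, y in zip(s, t)) <= d:
                        found.add(tuple(sorted((s, t))))
    return found

S = ["ABCDE", "ABDDE", "ADCDE", "CDEFG", "CDEFF", "CDEGG", "AAGAB"]
print(sorted(similar_pairs(S, d=1, k=2)))
# -> [('ABCDE', 'ABDDE'), ('ABCDE', 'ADCDE'), ('CDEFF', 'CDEFG'), ('CDEFG', 'CDEGG')]
```

The exact grouping is repeated once per combination of kept blocks, i.e. C(k, k−d) = C(k, d) times, which matches the slide's remark that it is done "several times, but not so many".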

Comparison: Human/Mouse Genome
Comparison of the mouse X and human X chromosomes (150MB each, allowing 10% error): 15 min on a PC. (Figure: dot plot, mouse X chr. vs. human X chr.)
Note: BLASTZ takes 2 weeks; MURASAKI takes 2-3 hours with 1% error.

GPGPU Parallelization
Strings are partitioned into many small groups, and pairwise comparison is done within each group
 easy to GPGPU-parallelize!
Reducing the number of blocks increases the group size
 the comparison becomes the bottleneck (of a good size for the GPU), repeated kCd times

Other "Subproblems"
+ extracting the induced subgraph of a vertex set
+ finding (some) maximal clique
+ finding (some) path
+ augmenting a matching
+ majority voting
+ frequency counting for sequences/graphs
+ range search
+ all intersections of line segments
+ convex hull
+ Hough transform
+ dynamic programming …
These are also important for optimization.

Abstract
Recent "big data" increases the necessity of fast computation. However, we may be seeing some limitations of the usual approaches, especially for enumeration and optimization. In this talk, we discuss a new research direction: algorithmic modeling. In algorithmic modeling, algorithms are designed to increase the possibilities of models that allow fast computation, and models are designed with first priority on fast computation. We first discuss the difference between optimization and enumeration, and clarify why it is difficult for them to allow fast computation. Then we consider new models, located between optimization and enumeration, that allow fast computation, by considering/developing the fast algorithms that we can use.