1
Computational Challenges in BIG DATA
28/Apr/2012, China-Korea-Japan Workshop
Takeaki Uno
National Institute of Informatics & Graduate School for Advanced Studies
2
Self Introduction, for Better Understanding
Takeaki Uno: National Institute of Informatics (institute for activating joint research)
Research Area: Algorithms (data mining, genome science)
+ Enumeration, graph algorithms, pattern mining, similarity search (tree algorithms, dynamic programming, NP-hardness, reverse search, …)
+ Implementations for data mining algorithms, for string similarity search, …
+ Web systems for small business optimization
+ Problem solver, or implementer, for research colleagues
3
Difficulty on Big Data
One of the recent main topics in informatics is BIG DATA.
I will give my opinions on "what kind of problems shall we study" and "what kind of approaches shall we take" from the viewpoint of theoretical computer science.
Difficulties with big data are, roughly:
- modeling: noise, diversity, overly symbolic data, ill-formulated problems
- computation: time / memory efficiency, online / dynamic settings, …
Here we do not consider the topics of the natural sciences / engineering: devices, architectures, social implementations, …
4
Profiling Big Data
The features of big data are:
- huge size, of course
- noisy: accuracies differ, the objectives for collecting the data differ
- individuality: many kinds of records (people)
- very sparse (as a graph), but locally dense (has localities)
We have disadvantages, but at the same time advantages:
- data can be re-used for other objectives (so the cost is cheap)
- accuracy increases automatically by using many records
Locality / minority is important: we have to classify the data (or extract the necessary parts) according to the objectives of the analysis.
5
Objectives for Big Data
Big data is used for many objectives by different users, and the life span of a problem may not be long, owing to rapid changes in the world; as fundamental research, we should not focus on each specific application.
To be useful as fundamental tools in application areas with rapid change and diversity:
+ abstract general problems common to many applications
+ methods (algorithms, processes) should be as simple as possible, and the output should be clearly explainable
+ models should be determined according to computational efficiency
6
Algorithms as "Natural Science"
For fast computation, algorithms have to fit real-world data.
We have to observe / analyze the behavior of algorithms on real-world data, and design the algorithms according to the results.
This would be a "natural science" approach (there are not many): form a feasible hypothesis and verify it by experiments.
Parallelization is necessary, but since the best algorithms keep changing, we have to redo it from scratch many times (in optimization, improvements from algorithms > improvements from architecture).
So: extract simple, good problems and parallelize those.
7
A Direction: Find Similar String Pairs
Motivation: Similarity search on string data (especially genome data). Index construction is heavy, LSH (approximate hashing) is not good enough, and parallelization is not easy.
Modeling: Given a set S of strings of fixed length l, find all pairs of strings whose Hamming distance is at most d.
This avoids the overhead of executing many queries (big data needs many queries), and Hamming distance is simple / basic.
Example input: ATGCCGCG, GCGTGTAC, GCCTCTAT, TGCGTTTC, TGTAATGA, …
Example output pairs: ATGCCGCG & AAGCCGCC, GCCTCTAT & GCTTCTAA, TGTAATGA & GGTAATGG, …
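To make the problem statement concrete, here is a minimal brute-force sketch in Python; the helper names (hamming, all_close_pairs) are my own illustration, not part of the talk, and the quadratic scan serves only as a reference point for the block method on the next slide.

from itertools import combinations

def hamming(a, b):
    # Hamming distance of two equal-length strings
    return sum(x != y for x, y in zip(a, b))

def all_close_pairs(strings, d):
    # brute force: check every pair, O(n^2 * l) time
    return [(s, t) for s, t in combinations(strings, 2) if hamming(s, t) <= d]

# the example pairs from the slide are found with d = 2
S = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(all_close_pairs(S, 2))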
8
Block Combination
Consider a partition of each string into k ( > d ) blocks.
If two strings are similar (Hamming distance at most d), they share at least k-d identical blocks.
So, for each string, the candidates for similar strings are those that share at least k-d identical blocks.
For every combination of k-d blocks, we find the records that agree exactly on those blocks (exact search), as sketched below.
We have to repeat this for each of the C(k, k-d) combinations, but that is not so many.
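A rough Python sketch of this idea, under my own reading of the slide (the block boundaries, dictionary bucketing, and final verification step are illustrative choices, not the speaker's implementation): for every choice of k-d blocks, strings are bucketed by the content of exactly those blocks, and only strings in the same bucket are compared in full.

from itertools import combinations
from collections import defaultdict

def block_combination_pairs(strings, l, k, d):
    # strings within Hamming distance d agree on at least k-d of the k blocks,
    # so every similar pair meets in at least one bucket below
    assert k > d
    bounds = [round(i * l / k) for i in range(k + 1)]   # k nearly equal blocks
    blocks = list(zip(bounds[:-1], bounds[1:]))
    found = set()
    for chosen in combinations(range(k), k - d):        # C(k, k-d) passes
        buckets = defaultdict(list)
        for s in strings:
            key = tuple(s[lo:hi] for lo, hi in (blocks[b] for b in chosen))
            buckets[key].append(s)
        for group in buckets.values():                  # exact search per bucket
            for s, t in combinations(group, 2):
                if sum(x != y for x, y in zip(s, t)) <= d:
                    found.add((min(s, t), max(s, t)))
    return found

# e.g. l = 8, k = 4, d = 2 gives C(4, 2) = 6 passes over the data
S = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA", "TGTAATGA", "GGTAATGG"]
print(block_combination_pairs(S, 8, 4, 2))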
9
Cost for Mismatches
- No index, thus memory efficient (work space = input data size)
- Worst-case time complexity is O(n^2), but it works fine on real-world data
- Pairwise comparison can be parallelized easily: a good subproblem, with linear speed-up even in GPU computing (see the sketch below)
- Combination with LSH works for many kinds of data: Euclidean distance, Manhattan distance, …
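As a hedged illustration of why the pairwise step parallelizes so cleanly (my own sketch, not the implementation behind the reported numbers; a real GPU or C version would batch the work quite differently), the candidate pairs produced by the block passes can simply be split across worker processes, since each distance check is independent:

from functools import partial
from itertools import combinations
from multiprocessing import Pool

def check_pair(pair, d):
    # keep the pair only if its Hamming distance is at most d
    s, t = pair
    return pair if sum(x != y for x, y in zip(s, t)) <= d else None

def verify_pairs_parallel(candidate_pairs, d, workers=4):
    # each check is independent, so a process pool gives near-linear speed-up
    with Pool(workers) as pool:
        checked = pool.map(partial(check_pair, d=d), candidate_pairs)
    return [p for p in checked if p is not None]

if __name__ == "__main__":
    S = ["ATGCCGCG", "AAGCCGCC", "GCCTCTAT", "GCTTCTAA"]
    print(verify_pairs_parallel(list(combinations(S, 2)), 2))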
10
Implementation: Performance
Comparison of the mouse X chromosome and the human X chromosome (150 MB each, with 10% error): 15 minutes on a PC.
[Figure: dot plot, Mouse X chr. vs. Human X chr.]
Note: BLASTZ: 2 weeks; MURASAKI: 2-3 hours with 1% error.