EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Abel Carrión Ignacio Blanquer Vicente Hernández.


1 Estimating the Performance of BLAST Runs on the EGEE Grid
Abel Carrión, Ignacio Blanquer, Vicente Hernández
EGEE-III INFSO-RI-222667. Enabling Grids for E-sciencE. www.eu-egee.org
EGEE and gLite are registered trademarks.

2 Outline
The problem.
Factors affecting the performance.
Experiments on the grid.
The performance.
Table of performance per node.
Execution model.
Conclusions and further work.
Uppsala – User Forum 12/4/10

3 Introduction: The Problem
Sequence alignment is a key operation in Bioinformatics.
–It compares proteomic and genomic samples against annotated databases.
It is part of many Bioinformatics pipelines.
–Used to search for homologs when studying the function of different genes and regions.
–Used in phylogenetic taxonomy.
Many tools have been described in the literature.
–Based on heuristic approximations of the Smith-Waterman algorithm (e.g. BLAST).
–Based on hash tables (e.g. SSAHA, BLAT).
–Based on the Burrows-Wheeler transform (e.g. BWA, Bowtie).
–And combinations of them (e.g. SSAHA2).

4 Introduction: The Tool
BLAST (Basic Local Alignment Search Tool) is the most widely used tool for aligning novel sequences of any length against those contained in a given database.
–Although it can be inefficient in many cases, it has a proven reputation.
Because a typical use case entails aligning millions of sequences, these experiments are very computationally intensive (they can demand years of CPU time).

5 Introduction: The Approach
The problem is massively parallel and fits the requirements for using Grid infrastructures efficiently.
The parallelization follows the High-Throughput paradigm:
–The input data file is segmented into several chunks, which are aligned on independent computation nodes.
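The segmentation step can be sketched as follows. This is a minimal illustration, assuming the input is a FASTA file in which every record starts with a ">" header line; the function name `split_fasta` and the parameter `records_per_chunk` are hypothetical and do not come from the presentation.

```python
from typing import Iterable, Iterator, List


def split_fasta(lines: Iterable[str], records_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTA lines, each holding at most records_per_chunk records."""
    if records_per_chunk < 1:
        raise ValueError("records_per_chunk must be at least 1")
    chunk: List[str] = []
    records = 0
    for line in lines:
        if line.startswith(">"):
            # A new record begins; close the current chunk if it is full.
            if records == records_per_chunk:
                yield chunk
                chunk, records = [], 0
            records += 1
        chunk.append(line)
    if chunk:
        # Emit the last, possibly shorter, chunk.
        yield chunk


# Tiny made-up input with three records, split two records per chunk.
example = [">seq1\n", "MKT\n", ">seq2\n", "GAV\n", "LLP\n", ">seq3\n", "QQR\n"]
chunks = list(split_fasta(example, records_per_chunk=2))
```

Each chunk can then be written to its own file and submitted as an independent job. Splitting on record boundaries keeps every sequence whole, so the chunks can be aligned without any communication between nodes.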

6 Introduction: The Issues
Two factors are key to overall performance.
–A good selection of the computing and storage resources.
–A good partition strategy.
These affect not only the response time but also the failure ratio.
–Queues limit the maximum execution time of a job.
–Automatic fault-tolerant resubmission needs to know whether a job is running slowly or is simply blocked.
Thus, a key issue when scheduling thousands of jobs is estimating the response time of the tasks.
–Create a model to estimate the execution time of a job.
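One way an execution-time estimate could feed the resubmission decision is sketched below. The function `classify_job`, its status labels and the `slack` tolerance factor are all hypothetical; the presentation does not describe the actual resubmission policy.

```python
def classify_job(elapsed_s: float, estimated_s: float,
                 queue_limit_s: float, slack: float = 2.0) -> str:
    """Classify a running job from its elapsed time and its estimated time.

    slack is an assumed tolerance: a job is only suspected of being blocked
    once it has run for more than slack times its estimate.
    """
    if elapsed_s >= queue_limit_s:
        # The queue would kill the job at this point regardless of progress.
        return "exceeds_queue_limit"
    if elapsed_s > slack * estimated_s:
        # Far beyond the estimate: a candidate for pre-emptive resubmission.
        return "suspect_blocked"
    return "running"


def fits_queue(estimated_s: float, queue_limit_s: float) -> bool:
    """Check before submission that a chunk is expected to finish in time."""
    return estimated_s < queue_limit_s
```

Without an estimate, the only available signal is the queue limit itself, which is why a slow job and a blocked job are hard to tell apart.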

7 Factors affecting the performance
Quasi-deterministic factors (obtained through experiments)
–Application dependent: input data file size, DB file size, similarity, number of hits.
–Resource dependent: SPECint (SI00), SPECfp (SF00), memory.
Non-deterministic factors (not yet covered in the study)
–Load dependent: queue size, average waiting time.
–Site dependent: site availability, specific job failure rates.

8 Influence of the data size
The purpose of this experiment was to analyse the influence of the data size (input file and database).
Using the UniProt database, several files of different sizes were generated and run on the same machine.
The results show that the response time grows linearly with both the input file size and the database size.
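A linear dependence of this kind can be checked with an ordinary least-squares fit of response time against data size. The sketch below uses made-up numbers purely for illustration; they are not measurements from the study, and `fit_line` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Illustrative values only: input sizes and response times on a single machine.
sizes = [1.0, 2.0, 4.0, 8.0]
times = [12.0, 22.0, 42.0, 82.0]
slope, intercept = fit_line(sizes, times)
```

If the residuals of such a fit stay small across the whole range of sizes, a linear term in the performance model is a reasonable choice.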

9 Influence of the similarity
The heuristic nature of BLAST accelerates the comparison of two clearly unrelated sequences.
To check the influence of similarity on the searches, three new versions of the UniProt database were produced, replacing 1%, 5% and 10% of their contents respectively.
The results show that the response time is independent of the similarity between sequences.
–This factor will not be considered in the final performance model.

10 Influence of the parameter bhits
An experiment was run on different computers for different values of the “BLAST Hits” argument.
–Values range from Xxxx to XXX.
–Although the number of results in the output increases accordingly, no effect is observed on either the response time or the failure rate.

11 Influence of Resource Performance
The GlueSchema includes the values of the SpecInt 2000 and SpecFloat 2000 benchmarks.
Nine different job types were executed on 19 sites where this information was published.

12 Spec Benchmark
The SpecInt and SpecFloat benchmarks do not seem to be correlated with the actual performance of BLAST.
–The absolute value of the correlation index ranges from 0.08 to 0.55, with poor significance (p-values up to 0.77).
This might be caused by different factors.
–Unsuitability of the benchmark for BLAST.
–Lack of accuracy of the benchmark, which at many sites is not measured but taken from tables.
Moreover, the benchmark values are not always published. A new estimator is needed.

         T1      T2      T3      T4      T5      T6      T7      T8      T9
Pearson  -0.433  -0.309  -0.533  -0.42   -0.521  -0.549  -0.413  0.079   0.55
p        0.052   0.175   0.014   0.06    0.017   0.011   0.081   0.772   0.089
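For reference, the Pearson coefficient reported in the table can be computed as in the sketch below. It is a plain standard-library implementation with a hypothetical function name; how the benchmark values were paired with measured times in the study is not stated in the slides.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)
```

If SciPy is available, `scipy.stats.pearsonr` returns both the coefficient and the two-sided p-value, which corresponds to the two rows of the table.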

13 Empirical estimator
Ratio of the average performance with respect to a fixed node.
–The same tests are repeated on all nodes for different data sizes, and the speed-ups or slow-downs are computed.
–A single coefficient is obtained from the sum of the computing times of all jobs (the same number and type on every node).
A much stronger correlation is observed (high significance, always above 98% and generally above 99.99%).

         T1        T2        T3        T4        T5        T6        T7        T8      T9
Pearson  0.888     0.788     0.957     0.897     0.978     0.96      0.889     0.691   0.73
p        1.86E-07  3.33E-05  6.65E-11  8.72E-08  2.10E-13  3.12E-11  8.33E-07  0.004   0.015
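The empirical coefficient can be sketched as the ratio between the summed computing time on the reference node and the summed time on each other node, so a value above 1 means the node is faster than the reference. The node names and timings below are invented for illustration, and `speed_factors` is a hypothetical helper.

```python
from typing import Dict, Sequence


def speed_factors(times_by_node: Dict[str, Sequence[float]],
                  reference: str) -> Dict[str, float]:
    """Return, per node, reference total time divided by that node's total time."""
    n_jobs = len(times_by_node[reference])
    for node, times in times_by_node.items():
        # The coefficient is only meaningful if every node ran the same job set.
        if len(times) != n_jobs:
            raise ValueError(f"node {node!r} ran a different number of jobs")
    totals = {node: sum(times) for node, times in times_by_node.items()}
    ref_total = totals[reference]
    return {node: ref_total / total for node, total in totals.items()}


# Illustrative timings in seconds for the same three jobs on each node.
timings = {
    "ref": [100.0, 200.0, 300.0],
    "fast": [50.0, 100.0, 150.0],
    "slow": [200.0, 400.0, 600.0],
}
factors = speed_factors(timings, reference="ref")
```

Summing over the whole job set before dividing yields one coefficient per node rather than one per job type.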

14 Empirical estimator [figure]

15 Estimating the model
Current conclusions
–Direct (linear) dependence on the input data size.
–Direct (linear) dependence on the database size.
–No dependence on the similarity or the number of hits.
–Direct dependence on the process speed factor.
–Low dependence on the memory, except when it saturates.
Proposed model [equation not preserved in the transcript], where
–P_Sp, P_Tm and K are the unknown parameters of the regression model.
–T_inp and T_BD are the fixed values for the size of the input data and of the database.
–Sp is the ratio between the response time at a reference site and at each other site.
–T_inp_basal is 0.5 and T_BD_basal is 50.

16 Real execution time per node [figure]

17 Different model for each node [figure]

18 Single model with the speed adjustment [figure]

19 Conclusions and further work
This work presents a performance model for estimating the response time of BLAST runs on the EGEE grid.
–Direct dependence on the performance indicator and on the sizes of the input data and the reference database.
–No direct dependence on memory or BLAST hits.
The estimate is a very important parameter for load balancing and pre-emptive resubmission.
The work will be extended by introducing new parameters from other components.
–Workload Management System.
–Local Resource Management Systems.
–Bandwidth between CEs and SEs.

