EGEE-III INFSO-RI-222667 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks Abel Carrión Ignacio Blanquer Vicente Hernández.

Slides:



Advertisements
Similar presentations
LIBRA: Lightweight Data Skew Mitigation in MapReduce
Advertisements

Cloud Computing Resource provisioning Keke Chen. Outline  For Web applications statistical Learning and automatic control for datacenters  For data.
Challenge the future Delft University of Technology Blade Load Estimations by a Load Database for an Implementation in SCADA Systems Master Thesis.
Authors: Thilina Gunarathne, Tak-Lon Wu, Judy Qiu, Geoffrey Fox Publish: HPDC'10, June 20–25, 2010, Chicago, Illinois, USA ACM Speaker: Jia Bao Lin.
A Grid Resource Broker Supporting Advance Reservations and Benchmark- Based Resource Selection Erik Elmroth and Johan Tordsson Reporter : S.Y.Chen.
Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases Vinay D. Shet CMSC 838 Presentation Authors: Allison Waugh, Glenn.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Stuart K. PatersonCHEP 2006 (13 th –17 th February 2006) Mumbai, India 1 from DIRAC.Client.Dirac import * dirac = Dirac() job = Job() job.setApplication('DaVinci',
Ch 4. The Evolution of Analytic Scalability
Self-Organizing Agents for Grid Load Balancing Junwei Cao Fifth IEEE/ACM International Workshop on Grid Computing (GRID'04)
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
Modularizing B+-trees: Three-Level B+-trees Work Fine Shigero Sasaki* and Takuya Araki NEC Corporation * currently with 1st Nexpire Inc.
Predicting performance of applications and infrastructures Tania Lorido 27th May 2011.
Lecture 2 Process Concepts, Performance Measures and Evaluation Techniques.
INFSO-RI Enabling Grids for E-sciencE Logging and Bookkeeping and Job Provenance Services Ludek Matyska (CESNET) on behalf of the.
SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding Ahmet Sacan and I. Hakki Toroslu
EGEE-III INFSO-RI Enabling Grids for E-sciencE I. Blanquer(1), V. Hernandez(1), G. Aparicio (1), M. Pignatelli(2), J. Tamames(2)
An Empirical Study of Choosing Efficient Discriminative Seeds for Oligonucleotide Design Won-Hyong Chung and Seong-Bae Park Dept. of Computer Engineering.
Grid infrastructure analysis with a simple flow model Andrey Demichev, Alexander Kryukov, Lev Shamardin, Grigory Shpiz Scobeltsyn Institute of Nuclear.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Ignacio Blanquer Vicente Hernández Bioinformatics.
Enabling Grids for E-sciencE EGEE-III INFSO-RI Using DIANE for astrophysics applications Ladislav Hluchy, Viet Tran Institute of Informatics Slovak.
1 Large-Scale Profile-HMM on the Grid Laurent Falquet Swiss Institute of Bioinformatics CH-1015 Lausanne, Switzerland Borrowed from Heinz Stockinger June.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks WMSMonitor: a tool to monitor gLite WMS/LB.
FP6−2004−Infrastructures−6-SSA E-infrastructure shared between Europe and Latin America EELA Demo: Blast in Grids Ignacio Blanquer.
Job scheduling algorithm based on Berger model in cloud environment Advances in Engineering Software (2011) Baomin Xu,Chunyan Zhao,Enzhao Hua,Bin Hu 2013/1/251.
Using SWARM service to run a Grid based EST Sequence Assembly Karthik Narayan Primary Advisor : Dr. Geoffrey Fox 1.
Univariate Linear Regression Problem Model: Y=  0 +  1 X+  Test: H 0 : β 1 =0. Alternative: H 1 : β 1 >0. The distribution of Y is normal under both.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks SA2 Quality Plan for EGEE III Geneviève.
Enabling Grids for E- sciencE EGEE and gLite are registered trademarks EGEE-III INFSO-RI Analysis of Overhead and waiting times.
INFSO-RI Enabling Grids for E-sciencE SALUTE – Grid application for problems in quantum transport E. Atanassov, T. Gurov, A. Karaivanova,
SAGA: Array Storage as a DB with Support for Structural Aggregations SSDBM 2014 June 30 th, Aalborg, Denmark 1 Yi Wang, Arnab Nandi, Gagan Agrawal The.
O PTIMAL SERVICE TASK PARTITION AND DISTRIBUTION IN GRID SYSTEM WITH STAR TOPOLOGY G REGORY L EVITIN, Y UAN -S HUN D AI Adviser: Frank, Yeong-Sung Lin.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid-enabled parameter initialization for.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Design of an Expert System for Enhancing.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Services for advanced workflow programming.
Enabling Grids for E-sciencE INFSO-RI Tools for CIC Operations, Bologna, 24th May Monitoring workflow in EGEE GOC DB is used to get the list.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Grid Web Portal for Chemists M. Sterzel,
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
8 th CIC on Duty meeting Krakow /2006 Enabling Grids for E-sciencE Feedback from SEE first COD shift Emanoil Atanassov Todor Gurov.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Using GStat 2.0 for Information Validation.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks BiG: A Grid Service to Distribute Large BLAST.
OPERATING SYSTEMS CS 3530 Summer 2014 Systems and Models Chapter 03.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks CSTgrid: a whole genome comparison tool for.
Computer Organization Instruction Set Architecture (ISA) Instruction Set Architecture (ISA), or simply Architecture, of a computer is the.
David Wishart February 18th, 2004 Lecture 3 BLAST (c) 2004 CGDN.
INFSO-RI Enabling Grids for E-sciencE Activities of the UPV in NA4- Biomed Ignacio Blanquer Vicente Hernández Universidad Politécnica.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks John Gordon SA1 Face to Face CERN, June.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE III User Forum – Clermont Ferrand Analysis of Metagenomes on the EGEE Grid Gabriel.
IP Routing table compaction and sampling schemes to enhance TCAM cache performance Author: Ruirui Guo a, Jose G. Delgado-Frias Publisher: Journal of Systems.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks InGRID: A Generic Autonomous Expert System.
EGEE-II INFSO-RI Enabling Grids for E-sciencE Practical using WMProxy advanced job submission.
INFSO-RI Enabling Grids for E-sciencE gLite Test and Certification Effort Nick Thackray CERN.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks VOMS & Reliability Vincenzo Ciaschini & Andrea.
SA2 All Hands Meeting — V. Konoplev — 27 March 2009 – Rome Enabling Grids for E-sciencE Veniamin Konoplev (RRC-KI) All Hands meeting GARR/Rome.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Bioinformatics activity Christophe BLANCHET.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Mining Job Monitoring Data Automatic Error.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Robin McConnell Activity Manager UEDIN (NeSC)
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarksEGEE-III INFSO-RI MPI on the grid:
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks GOCDB4 Gilles Mathieu, RAL-STFC, UK An introduction.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Constraints on primordial non-Gaussianity.
Enabling Grids for E-sciencE Claudio Cherubino INFN DGAS (Distributed Grid Accounting System)
INFSO-RI Enabling Grids for E-sciencE EGEE is a project funded by the European Union under contract IST Report from.
Information System Evolution Enabling Grids for E-sciencE EGEE-III INFSO-RI LDAP LDAP_ADD LDAP_MODIFY Query Merge Update Provider Plugin LDIF.
Comparison of different gearshift prescriptions
Ch 4. The Evolution of Analytic Scalability
Fast Sequence Alignments
Presentation transcript:

EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Abel Carrión Ignacio Blanquer Vicente Hernández Estimating the Performance of BLAST runs on the EGEE Grid

Enabling Grids for E-sciencE EGEE-III INFSO-RI Outline The problem. Factors affecting the performance. Experiments on the grid. The Performance. Table of performance per node. Execution model. Conclusions and further work. Uppsala – User Forum 12/4/10 2

Enabling Grids for E-sciencE EGEE-III INFSO-RI Introduction: The Problem Sequence alignment is an key operation in Bioinformatics –It involves computing the comparison of proteomic and genomic samples with respect to annotated databases. It is a part of many Bioinformatics pipelines –Used to search for homologous in the study the functionality of different genes and regions. –Used in the phylogenetic taxonomy. There are many tools developed in the literature –Based on the Smith-Waterman transform (e.g. BLAST). –Based on Hash Tables (e.g. SSAHA, BLAT). –Based on Burroughs-Wheeler Transform (e.g. BWA, Bowtie). –And combinations of them (e.g. SSAHA2). Uppsala – User Forum 12/4/10 3

Enabling Grids for E-sciencE EGEE-III INFSO-RI Introduction: The Tool BLAST (Basic Local Alignment Search Tool) is the most widely used tool for performing the alignment of any length novel sequences against the ones contained in a determined database. –Although it could be inefficient for many cases, It has a proven reputation. Because a normal use case entails the alignment of millions of sequences, this kind of experiments are very computationally intensive (it demands years of CPU computation). Uppsala – User Forum 12/4/10 4

Enabling Grids for E-sciencE EGEE-III INFSO-RI Introduction: The Approach Problem is massive parallel and fits the requirements for using efficiently Grid infrastructures. The parallelization process follows the High-Throughput paradigm: –Segmenting the input data file into several chunks which are aligned in independent computation nodes. Uppsala – User Forum 12/4/10 5 nr

Enabling Grids for E-sciencE EGEE-III INFSO-RI Introduction: The Issues Two factors are key for the general performance –A good selection of the resources for computing and storage. –A good partition strategy. This not only affects the response time, but also the failure ratio –Queues have a limitation in the maximum job executing time. –Fault-tolerance automatic resubmission also needs to know if a job is executing slowly or simply it is blocked. Thus, a key issue when performing the scheduling of thousands of jobs is estimating the response time of the tasks. –Create a model to estimate the execution time of a job. Uppsala – User Forum 12/4/10 6

Enabling Grids for E-sciencE EGEE-III INFSO-RI Factors affecting the performance Quasi-Deterministic Factors –Application dependant.  Input data file size.  DB file size.  Similarity.  Number of hits. –Resource dependant.  SPECint (SI00).  SPECfp (SF00).  Memory. Undetermisitic Factors –Load dependant  Queue size  Average waiting time –Site dependant  Site availability.  Specific job failure rates. Uppsala – User Forum 12/4/10 7 Obtained through experiments Not yet covered in the study

Enabling Grids for E-sciencE EGEE-III INFSO-RI Influence of the data size The purpose of this experiment was to analyse the influence of the data size (input file and database). Using the UniProt database, several files with different sizes were generated and executed in the same machine. The results show that the input file size and database size have a direct linear impact on the response time. Uppsala – User Forum 12/4/10 8

Enabling Grids for E-sciencE EGEE-III INFSO-RI Influence of the similarity The heuristic nature of BLAST accelerates the comparison of two clearly unrelated sequences To check the influence of the similarity on the searches, three new versions of the UniProt database were produced, replacing 1%, 5% and 10% of their contents respectively. As it can be seen, the response time is independent from the similarity between sequences. –This factor will not be considered in the final performance model. Uppsala – User Forum 12/4/10 9

Enabling Grids for E-sciencE EGEE-III INFSO-RI Influence of the parameter bhits An experiment has been executed in different computers for different values of the “BLAST Hits” argument –Values range from Xxxx to XXX. –Although the number of results produced in the output increases accordingly, no effect is observed on the response time nor the failure rate. Uppsala – User Forum 12/4/10 10

Enabling Grids for E-sciencE EGEE-III INFSO-RI Influence of Resource Performance The GlueSchema includes the values of Benchmarks SpecInt 2000 and SpecFloat Different job types have been executed on 19 sites where this information was published. EGEE'08 - Vangelis Floros 11

Enabling Grids for E-sciencE EGEE-III INFSO-RI Spec Benchmark Uppsala – User Forum 12/4/10 12 The SpecInt and SpecFloat benchmarks do not seem to be correlated with actual performance for BLAST –Correlation index ranges from 0.08 to 0.55 with poor values of significance (up to 77% of randomness). This might be caused by different factors –Unsuitability of the benchmark for BLAST. –Lack of accuracy of the benchmark, which is not computed in many sites but obtained from tables. Moreover, the values of the benchmarks are not always published. A new estimator is needed. T1T2T3T4T5T6T7T8T9 Pearson -0,433-0,309-0,533-0,42-0,521-0,549-0,4130,0790,55 p 0,0520,1750,0140,060,0170,0110,0810,7720,089

Enabling Grids for E-sciencE EGEE-III INFSO-RI Empirical estimator Ratio of average performance with respect to a fixed node –Same tests are repeated in all nodes for different data sizes and speed-ups/downs are computed. –A single coefficient is obtained from the sum of all the computing times of all (the same number and type) of jobs. Obviously, much more correlation is shown (high significance, always above 98%, but generally above 99,99%). Uppsala – User Forum 12/4/10 13 T1T2T3T4T5T6T7T8T9 Pearson0,8880,7880,9570,8970,9780,960,8890,6910,73 p1,86E-073,33E-056,65E-118,72E-082,10E-133,12E-118,33E-070,0040,015

Enabling Grids for E-sciencE EGEE-III INFSO-RI Empirical estimator Uppsala – User Forum 12/4/10 14

Enabling Grids for E-sciencE EGEE-III INFSO-RI Estimating the model Current conclusions –Direct (linear) dependence on the Input data size. –Direct (linear) dependence on the database size. –No dependency on the similarity or number of hits. –Direct dependence on the process speed factor. –Low dependence on the memory, except for saturations. Proposed model Where – P Sp, P Tm and K are the unknown parameters of the regression model. – T inp and T BD are the fixed values for the size of data and database. – Sp is the ratio between the response time in a reference site and each other site. – T inp_basal is 0,5 y T BD_basal is 50. Uppsala – User Forum 12/4/10 15

Enabling Grids for E-sciencE EGEE-III INFSO-RI Real execution time per node Uppsala – User Forum 12/4/10 16

Enabling Grids for E-sciencE EGEE-III INFSO-RI Different model for each node Uppsala – User Forum 12/4/10 17

Enabling Grids for E-sciencE EGEE-III INFSO-RI Single model with the speed adjustment Uppsala – User Forum 12/4/10 18

Enabling Grids for E-sciencE EGEE-III INFSO-RI Conclusions and further work This work presents a performance model for estimating the response time of BLAST runs in the EGEE grid –Direct dependence with the performance indicator and the data size of input and reference database. –No direct dependence with memory or blast hits. Very important parameter for load balancing and pre- emptive resubmission. The work will be extended, introducing new parameters from other components –Workload Manager System. –Local Resource Management Systems. –Bandwidth between CEs y SEs. Uppsala – User Forum 12/4/10 19