Using SWARM service to run a Grid based EST Sequence Assembly
Karthik Narayan
Primary Advisor: Dr. Geoffrey Fox



Outline
• Objective
• EST Sequence Assembly
• The Problem
• SWARM
• Tools
• Results
• Future Work

Objective
• Use the SWARM service to run EST sequence assembly on high-performance clusters.

EST Sequence Assembly
• ESTs are a collection of random cDNA sequences, sequenced from a cDNA library.
• The ESTs are clustered and assembled to form contigs.
• The contigs are then used to identify potential unknown genes by running BLAST against a known protein database.

The Problem
• The input is typically large, on the order of 1 million sequences.
• The assembly is memory intensive.
• It is time consuming.
• It involves multiple programs.

SWARM
• A high-level job scheduling Web service framework, developed by the Pervasive Technology Institute – Indiana University.
• Can submit millions of jobs to several high performance clusters and monitor their status.
• Extensible, lightweight, and easily installable on a desktop or small server.

Tools

Task | Tools
Cleaning sequence reads | Repeat Masker
Clustering sequence reads | PaCE
Assemble reads | Cap3
Similarity search | Blast

Repeat Masker
• Developed by the Institute for Systems Biology.
• Screens sequences for interspersed repeats and low complexity regions.
• Sequence comparisons are done by cross_match.
• The input is split into buckets.
• A post-processing step follows.
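The bucket-splitting step can be sketched as follows. This is a minimal illustration, assuming FASTA input and a fixed number of sequences per bucket; the function names `read_fasta` and `split_into_buckets` and the `bucket_size` parameter are hypothetical and are not taken from the presentation or from Repeat Masker itself.

```python
from typing import Iterable, Iterator, List, Tuple

# A record is (header without the leading '>', concatenated sequence).
Record = Tuple[str, str]


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_buckets(records: Iterable[Record], bucket_size: int) -> List[List[Record]]:
    """Group records into consecutive buckets of at most bucket_size sequences."""
    if bucket_size < 1:
        raise ValueError("bucket_size must be at least 1")
    buckets: List[List[Record]] = []
    current: List[Record] = []
    for record in records:
        current.append(record)
        if len(current) == bucket_size:
            buckets.append(current)
            current = []
    if current:  # keep the final, possibly smaller, bucket
        buckets.append(current)
    return buckets
```

Each bucket could then be written to its own file and submitted as one job; the bucket size actually used in the pipeline is not stated in the slides.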

CAP3
• Developed by the Department of Computer Science, Michigan Technological University.
• CAP3 is very memory intensive and cannot be run on small servers.

PaCE
• Developed by the Department of Computer Science, Iowa State University.
• Clusters ESTs on parallel computers.
• A post-processing step follows.

CAP3
• Since the clustering step is already done, the load for CAP3 is considerably lower, but not trivial.

No. of Sequences | No. of Clusters by PaCE

PaCE Clusters

CAP3
• Sort the input files, and submit the Cap3 jobs both ways.

CAP3
• Set a threshold: files with fewer sequences than the threshold are submitted to the local machine, and the others are submitted to the Grid.
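The threshold rule above can be sketched as a small routing function. This is only an illustration under stated assumptions: the name `route_clusters`, the dictionary of per-file sequence counts, and the example threshold are hypothetical, and the slides do not say what threshold value was used.

```python
from typing import Dict, List, Tuple


def route_clusters(cluster_sizes: Dict[str, int], threshold: int) -> Tuple[List[str], List[str]]:
    """Split cluster files into (local, grid) lists by sequence count.

    cluster_sizes maps a cluster file name to its number of sequences.
    Files with fewer sequences than threshold go to the local machine;
    all others go to the Grid. Both lists are ordered by size, smallest first.
    """
    local: List[str] = []
    grid: List[str] = []
    # Sort by (size, name) so the ordering is deterministic.
    for name in sorted(cluster_sizes, key=lambda n: (cluster_sizes[n], n)):
        if cluster_sizes[name] < threshold:
            local.append(name)
        else:
            grid.append(name)
    return local, grid
```

For example, `route_clusters({"c1.fa": 5, "c2.fa": 500, "c3.fa": 50}, threshold=50)` places `c1.fa` in the local list and the other two files in the Grid list.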

CAP3
• CAP3 job distribution after clustering of clusters for 2 million sequences

BLAST
• NCBI BLAST is used for homology search.
• The input is split into buckets.
• Once all jobs are complete, update the status of the pipeline in the database, zip the output files, and email them to the user.
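The completion step can be sketched as a check over job statuses followed by zipping the outputs. This is a hedged illustration only: the status string `"COMPLETE"`, the function name `archive_if_complete`, and the in-memory dictionaries are assumptions, not the Swarm service's actual API or status values. The database update and the delivery to the user are left out.

```python
import io
import zipfile
from typing import Dict, Optional


def archive_if_complete(job_status: Dict[str, str], outputs: Dict[str, bytes]) -> Optional[bytes]:
    """Return a zip archive of the outputs once every job is complete.

    job_status maps a job id to a status string (hypothetical values);
    outputs maps an output file name to its contents. Returns None while
    any job is unfinished or when there are no jobs at all.
    """
    if not job_status or any(status != "COMPLETE" for status in job_status.values()):
        return None
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in sorted(outputs):  # stable order inside the archive
            archive.writestr(name, outputs[name])
    return buffer.getvalue()
```

Building the archive in memory keeps the sketch self-contained; a real pipeline would more likely zip files on disk.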

Workflow
• Log in and select the programs to run from the list of available programs.

Workflow
• Enter the parameters for the selected programs.

Workflow
• Upload the required files, if any.
• The job is then submitted to the Swarm service and a status message is displayed.
• An email is sent to the user once the job is completed.

Results
• Assembly results for 2 million sequences

No. of Sequences | Runtime for PaCE | No. of Clusters by PaCE | No. of jobs for CAP3 | Runtime for CAP3 | Total Runtime
 | :22 hours | | | :44 hours | 27:06 hours

Results
• Runtime for the entire pipeline for 2 million sequences

Program | No. of Jobs and Run time
Repeat Masker | 100011:56
PaCE | 101:22
CAP3 | :44
BLAST | 89349:00

Validation
• The assembly results for Daphnia pulex produced using Swarm were compared with the assembly results of EST Piper.
• The Blast results, for hits greater than e value of 2, compare as follows:

No. | Name | EST Piper | Swarm
1 | Number of Contigs | |
 | Number of hits | |
 | No. of unique top hit genes | |

Validation
• The number of genes commonly identified was …; that is, Swarm predicted 76.4% of the genes predicted by the assembly using EST Piper.
• There were 3284 genes identified by Swarm but not by EST Piper.
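The comparison above amounts to set arithmetic over the unique top-hit genes of the two assemblies. The sketch below shows one way to compute the shared count, the percentage of the reference recovered, and the genes unique to each side; the function name `overlap_summary` and the gene identifiers in the usage example are hypothetical and are not data from the presentation.

```python
from typing import Dict, Iterable


def overlap_summary(reference_genes: Iterable[str], candidate_genes: Iterable[str]) -> Dict[str, float]:
    """Compare two gene sets, treating reference_genes as the baseline."""
    reference = set(reference_genes)
    candidate = set(candidate_genes)
    common = reference & candidate
    # Share of the reference genes that the candidate assembly also found.
    percent = 100.0 * len(common) / len(reference) if reference else 0.0
    return {
        "common": len(common),
        "percent_of_reference": percent,
        "only_candidate": len(candidate - reference),
        "only_reference": len(reference - candidate),
    }
```

With EST Piper as the reference and Swarm as the candidate, `percent_of_reference` corresponds to the 76.4% figure and `only_candidate` to the genes found by Swarm but not by EST Piper. A toy call such as `overlap_summary(["g1", "g2", "g3", "g4"], ["g2", "g3", "g4", "g5", "g6"])` reports 3 common genes, 75.0 percent, 2 genes only in the candidate, and 1 only in the reference.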

Future Work
• Implement assembly programs like MIRA for next-gen sequences.
• Try different job scheduling strategies.
• Use cloud computing resources.