
Zach Miller Computer Sciences Department University of Wisconsin-Madison Supporting the Computation Needs of Structural Genomics

Overview › What is structural genomics? › Problems we are trying to solve › Applications we use and how they interface with Condor › Future work › Conclusion

What is structural genomics? › It is the branch of genomics that attempts to determine the three-dimensional structure of proteins. › Determining these structures often requires high-throughput computing.

Problems we are trying to solve › Target selection – which protein sequences are interesting and worth spending time calculating structures of?  BLAST › Protein structure determination – what is the 3D shape of a given protein sequence?  CNS  CYANA

BLAST › BLAST is developed and supported by NCBI, part of the NIH. › The NCBI BLAST home page is › BLAST is a search tool with special allowances for incomplete data and partial matches.

BLAST target selection › By comparing sets of whole or partial sequences against databases of known sequences, you can determine whether the sequence you are working on is already present in another database. › In this way you can identify the interesting sequences to work on.

BLAST and Condor › Large BLAST searches are easily split into smaller chunks that can be executed in parallel. › There are two basic approaches:  Split the input query into smaller chunks (our approach)  Split the database into smaller chunks (mpiBLAST approach)
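The query-splitting approach can be sketched as below. This is a minimal illustration, not the actual framework code; the chunk-file naming scheme and the chunk size are assumptions.

```python
def split_fasta(path, records_per_chunk):
    """Split a FASTA query file into chunk files of at most
    records_per_chunk sequences each; each chunk then becomes one
    Condor job. Assumes a well-formed FASTA file (first line is a
    '>' header). Returns the list of chunk filenames."""
    chunk_names = []
    record_count = 0
    out = None
    with open(path) as src:
        for line in src:
            if line.startswith(">"):  # start of a new sequence record
                if record_count % records_per_chunk == 0:
                    if out:
                        out.close()
                    name = "query.%d.fasta" % len(chunk_names)
                    chunk_names.append(name)
                    out = open(name, "w")
                record_count += 1
            out.write(line)
    if out:
        out.close()
    return chunk_names
```

Each resulting `query.N.fasta` can then be searched independently, and the per-chunk outputs concatenated afterwards.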

BLAST and Condor › Doing thousands of queries against multiple databases is easy using the Condor/BLAST framework. › Features of the framework:  Input queries can come from a file, ftp, or http  Input queries can be in FASTA or XML format

BLAST and Condor › More features of the framework:  Databases can also be local files, or automatically fetched via ftp or http, again in either FASTA or XML format  Database indexes can be built automatically using formatdb  Multiple input files are joined or split as appropriate to fine-tune throughput  Output can be delivered via ftp
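In Condor terms, each query chunk becomes one job in a cluster. A submit description along these lines would run the chunks in parallel (this is an illustrative sketch, not the framework's actual submit file; the file names, database, and blastall arguments are assumptions):

```
# One legacy NCBI blastall job per query chunk (illustrative).
universe    = vanilla
executable  = blastall
arguments   = -p blastp -d nr -i query.$(Process).fasta -o result.$(Process).txt

# Ship the chunk plus the formatdb-built protein index files.
transfer_input_files = query.$(Process).fasta, nr.phr, nr.pin, nr.psq
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT

output = blast.$(Process).out
error  = blast.$(Process).err
log    = blast.log

queue 100
```

The `$(Process)` macro expands to 0..99, so one `queue` statement fans out over all one hundred chunks.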

Some statistics › The BMRB here at the UW is using this framework to compare over 100,000 input sequences against five different databases:  nr( sequences )  pdb( sequences )  pdboh( 1122 sequences )  sg( sequences )  bmrb( 2736 sequences)

Some statistics › All in all, the BMRB is doing over 8 billion sequence comparisons for their weekly run. › Condor completes this in roughly eight hours of wall-clock time. › This is now a weekly routine which is fully automated, very reliable, and requires almost no “babysitting”.

Structure Calculation › CNS  Available from › CYANA  Available from › Both do structure calculations but use different methods

CNS and Condor › CNS can take a relatively long time to process a given entry (protein sequence), depending on the number of possible intermediate structures. › Each structure takes about 5 – 30 minutes, depending on the length of the sequence. › At 200 structures per entry, this works out to between 16 and 100 hours per entry.

CYANA › CYANA takes only about 2 – 16 hours per entry, depending on the sequence length. › The CYANA results are post-processed with CNS to refine them, which takes an additional 4 – 20 hours per entry.

CNS, CYANA, and Condor › Until now, each different group doing structure calculations would process their own entries using different programs or input parameters, making comparisons between different groups difficult. › By processing large numbers of entries in exactly the same way, it is possible to then compare apples to apples.

CNS, CYANA, and Condor › Working with the BMRB, I created a framework which allows you to easily process multiple entries at once with both CNS and CYANA. › Using this framework, Condor calculated structures for 600 entries (about 50,000 CPU-hours) in just 10 days.

CNS, CYANA, and Condor › The structure calculation framework is also very reliable and requires very little human time to do a fairly massive amount of computing. › This process can now be easily automated and done on a routine basis.

Challenges › Creating a job flow that doesn’t need babysitting requires that the framework be able to handle a variety of problems. › To this end, it employs some other Condor technologies:  Many steps are wrapped in ftsh, the fault-tolerant shell.  Condor watches for “misbehaving” jobs and kills them using the PERIODIC_REMOVE feature.  DAGMan oversees the whole run and retries failed jobs.
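The last two pieces look roughly like this in practice (a sketch under assumptions: the 24-hour threshold, node name, and submit-file name are illustrative, not the framework's actual values):

```
# In the job's submit file: remove any job that has been
# running (JobStatus == 2) for more than 24 hours of wall-clock time.
periodic_remove = (JobStatus == 2) && \
                  (CurrentTime - EnteredCurrentStatus > 24*60*60)
```

```
# In the DAGMan input file: resubmit a failed node up to 3 times
# before declaring the run failed.
JOB   blast0  blast0.submit
RETRY blast0  3
```

Together these let a multi-day run recover from hung jobs and transient failures without human intervention.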

Future Work › BLAST  Use STORK for data transfer which will improve reliability of all file transfers and instantly add support for many more methods of transferring input and output.  Create a wrapper around the framework which behaves just like NCBI’s BLAST but uses Condor behind the scenes.  Include this framework with the Condor distribution so it is BLAST-ready “out of the box”.

Future Work › CNS & CYANA  Use sequence length to better estimate runtime for fine-tuning throughput.  Use STORK for file transfer.

Conclusion › I have created tools which allow users to run coordinated BLAST, CNS, and CYANA runs on very large scales. › This makes it easy to process not only your data but other groups’ too, and end up with results that were all computed with the same protocols and inputs. › This will enable better collaboration by providing more consistency between the results of different groups.