Center for Computational Genomics and Bioinformatics, University of Minnesota


Source View
- Community Integrative Bioinformatics (NSF)
- Arabidopsis (reference organism)
- All cereals (NSF)
- Rice
- Legumes: Soy EST (USB), Soy Functional (NSF), Medicago (NSF)
- Trees: Pine EST (DOE), Pine Functional (NSF)

Partnerships
- Research Community Support: Shared Expertise and Knowledge
  - Bioinformatics Community
  - Plant Community
  - Metacomputing Community
- Federal Support: Grants and Contracts
- Corporate Support: Hardware, Software, and Data
  - Integrated Genomics

Application View
- BioData
  - All public genomic data
  - Sequence processing
  - Similarity Searches
  - Unigene Sets
- Diogenes Pipeline, Automation

Application View
- BioData
  - All public genomic data
  - Sequence processing
  - Similarity Searches
  - Unigene Sets
- Diogenes Pipeline, Automation
- Genomics Desktop
  - Functional Genomics
  - Array Design
  - SAGE
  - Clustering
  - Data Mining
  - Visualization & Exploration

Application View
- Warehouse
  - Multi-species Comparative Functional Genomics
  - Metafam
  - Metabolic Pathway Reconstruction
  - Relational Genbank
- BioData
  - All public genomic data
  - Sequence processing
  - Similarity Searches
  - Unigene Sets
- Diogenes Pipeline, Automation
- Genomics Desktop
  - Functional Genomics
  - Array Design
  - SAGE
  - Clustering
  - Data Mining
  - Visualization & Exploration

The Genomics Grid
- Distributed Computing: Condor, Globus, Sun Grid
  - Clusters of Workstations
- High Performance Networking
  - ATM / GBE / FCAL
  - Internet 2
- Special Purpose Hardware
  - Time Logic “DeCypher”
- Interoperable Software
  - “Grid Aware” Applications
  - Remote SQL Queries
  - Java
- Enterprise level data storage
  - Oracle
- High Throughput Genomics
- Visual Exploration of Global Data Resources
- Real Time, Visual Collaboration

Design Goals
- Scalable: Provide a workload management solution for large scale bioinformatics processing
- Extensible: Add new tools easily without modifying core components
- Portable: Deliver functionality in heterogeneous environments
- Collaborative: Combine processing resources to increase throughput

Underlying Components

Web Based Submission Tool
- Client: Data Files, Metadata, Context
- Unique Internal Identifiers
- Individual Data Items (chromatograms or sequence files)
- All metadata related to each individual sequence, in XML format
- “Preprocessing” database

Data submissions happen in batches, initiated by clients. File formats, processing requirements, and batch structure vary widely. Data arrives at CCGB in a well-structured format, amenable to automatic processing.

Data Submission Prototype
In this example of a data submission page, the user selects the appropriate data directory and uses Netscape’s file browser to upload the tab-delimited spreadsheet file.

Metadata Required for Processing

Name             Type
Sequence ID      String (used to identify which data file is associated with this metadata)
Sequence Name    String (used for GSS# or EST# in GB submission)
Experiment Type
Data Type
Date Sequenced   Date
Seq Primer       Identifier for Primer (CBC maintained list)
Contact Name     Identifier for NCBI Contact File (CBC maintained list)
Citation         Identifier for NCBI Citation File (CBC maintained list)
Library          Identifier for NCBI Library File (CBC maintained list)
Class
Organism         Identifier for organism (CBC maintained list)
Send to DB

Some quality control checking is done at submission time to ensure that the metadata are consistent and correct. This includes a spellcheck-like feature that verifies that primers, citations, and similar fields refer to entries known to CBC.
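The submission-time check described above might look roughly like the following Python sketch. It is not the CCGB implementation: the field names, the contents of the controlled lists, and the function `check_record` are all hypothetical, and the standard-library `difflib` module stands in for whatever matching the real “spellcheck” used.

```python
import difflib

# Hypothetical stand-ins for the maintained lists; the real contents are not
# given in the slides.
KNOWN_VALUES = {
    "seq_primer": {"T3", "T7", "M13F", "M13R"},
    "organism": {"Pinus taeda", "Glycine max", "Medicago truncatula"},
}


def check_record(record):
    """Return a list of problems found in one metadata record (a dict)."""
    problems = []
    for field_name, allowed in KNOWN_VALUES.items():
        value = record.get(field_name, "")
        if value in allowed:
            continue
        # Suggest the closest known entry, much like a spellchecker would.
        close = difflib.get_close_matches(value, sorted(allowed), n=1, cutoff=0.6)
        hint = f"; did you mean {close[0]!r}?" if close else ""
        problems.append(f"{field_name}: unknown value {value!r}{hint}")
    return problems


GOOD = {"seq_primer": "T7", "organism": "Pinus taeda"}
TYPO = {"seq_primer": "T7", "organism": "Pinus teada"}

print(check_record(GOOD))
print(check_record(TYPO))
```

Rejecting unknown identifiers at submission time keeps the later pipeline steps free of per-record error handling, which is presumably why the check sits at the front of the process.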

Tasks in Processing Biological Data
- Base Calling (Phred, Phran)
- Vector Filter (VF4)
- Artifact Filter (af)
- BLAST (blast, blastx, tblastx, blastn)
- Contig construction (Phrap)
- Microarray Design
- Primer Selection
- Functional Analysis & Annotation
- Submission to public repositories (Genbank)
- Publication

TkBatch
- User Interface: Provide a configurable interface to a set of tools.
- Batch Processing System: Enable batch submission of thousands of jobs.
- Dependency Management: Define Directed Acyclic Graphs (DAGs) for process flow. A DAG is not a tree.
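To make the remark that a DAG is not a tree concrete, here is a minimal Python sketch of dependency ordering. The step names and the edges between them are assumptions chosen for illustration, not TkBatch's actual process flow; the ordering itself uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical process flow: each step maps to the steps it must wait for.
DEPENDS_ON = {
    "base_calling": set(),
    "vector_filter": {"base_calling"},
    "artifact_filter": {"vector_filter"},
    "blast": {"artifact_filter"},
    "contig_construction": {"artifact_filter"},
    # Two parents: this is what makes the graph a DAG rather than a tree.
    "annotation": {"blast", "contig_construction"},
}


def execution_order(depends_on):
    """Return one valid order in which every step follows its prerequisites."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(depends_on).static_order())


def multi_parent_steps(depends_on):
    """Steps with more than one prerequisite, which a tree cannot express."""
    return sorted(step for step, parents in depends_on.items() if len(parents) > 1)


ORDER = execution_order(DEPENDS_ON)
print(ORDER)
print(multi_parent_steps(DEPENDS_ON))
```

In a tree every step would have exactly one parent; here `annotation` waits for two independent branches, so only a DAG can describe the flow.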

TkBatch: Use Outline
- Watchlist: Directed Acyclic Graph of processes which will act on the input data
- File List: Input data, possibly selected from diverse locations in the file system
- Compile to Job Description: Enumerates all tasks included in the job, all job dependencies, as well as a “status journal” indicating progress through the tasks.
- Submit to Distributed Processing: CONDOR metacomputing platform
  - Similar to GLOBUS and Sun’s GRID
  - Uses idle workstations to perform processing tasks
  - Dependency
- Observe through TkBatch
  - Building process monitoring capabilities into the TkBatch system.
  - Obtaining CONDOR source code to make improvements directly.
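The compile step can be pictured with a small Python sketch: a watchlist and a file list are expanded into one task per (tool, file) pair, with dependencies and a status journal. The class names, the task-ID format, and the tool and file names are all hypothetical; the slides do not describe TkBatch's actual job description format, and no Condor interface is modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    tool: str
    input_file: str
    depends_on: list = field(default_factory=list)
    status: str = "pending"


@dataclass
class Job:
    tasks: dict
    journal: list = field(default_factory=list)  # the "status journal"

    def mark(self, task_id, status):
        """Update a task and record the change in the journal."""
        self.tasks[task_id].status = status
        self.journal.append(f"{task_id} -> {status}")

    def runnable(self):
        """Pending tasks whose prerequisites have all finished."""
        return sorted(
            t.task_id
            for t in self.tasks.values()
            if t.status == "pending"
            and all(self.tasks[d].status == "done" for d in t.depends_on)
        )


def compile_job(watchlist, file_list):
    """Expand a watchlist (tool -> prerequisite tools) over every input file."""
    tasks = {}
    for path in file_list:
        for tool, prereqs in watchlist.items():
            task_id = f"{tool}:{path}"
            deps = [f"{p}:{path}" for p in prereqs]
            tasks[task_id] = Task(task_id, tool, path, deps)
    return Job(tasks)


WATCHLIST = {"phred": [], "vector_filter": ["phred"], "blast": ["vector_filter"]}
JOB = compile_job(WATCHLIST, ["a.scf", "b.scf"])
print(JOB.runnable())
```

Because dependencies are tracked per file, a slow input does not hold up later steps for the other files, which suits a pool of idle workstations that come and go.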

Application Configuration
- Some system abstraction, but still a very “close to the road” interface
- Tools cannot be selected unless they are appropriate to the current output type in the watchlist.
- Reasonable defaults are provided for command line options.
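The two rules above (type-gated tool selection and default options) can be sketched in Python as follows. The tool registry, the data-type names, and the default values are illustrative assumptions, not TkBatch's real configuration; for simplicity the watchlist here is a linear chain rather than a full DAG.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str
    accepts: str   # data type the tool reads
    produces: str  # data type the tool writes
    defaults: dict = field(default_factory=dict)


# Hypothetical registry; names, types, and option values are illustrative only.
TOOLS = {
    t.name: t
    for t in [
        Tool("base_caller", "chromatogram", "sequence"),
        Tool("vector_filter", "sequence", "sequence"),
        Tool("blast", "sequence", "blast_report", {"program": "blastn"}),
        Tool("assembler", "sequence", "contigs", {"minmatch": 40, "minscore": 80}),
    ]
}


def current_type(watchlist, input_type):
    """Output type of the last tool in the chain, or the raw input type."""
    return TOOLS[watchlist[-1]].produces if watchlist else input_type


def selectable(watchlist, input_type):
    """Only tools that accept the current output type may be chosen."""
    wanted = current_type(watchlist, input_type)
    return sorted(name for name, tool in TOOLS.items() if tool.accepts == wanted)


def add_tool(watchlist, input_type, name, **overrides):
    """Append a tool if it fits; return its options with defaults filled in."""
    if name not in selectable(watchlist, input_type):
        raise ValueError(f"{name} does not accept {current_type(watchlist, input_type)}")
    watchlist.append(name)
    return {**TOOLS[name].defaults, **overrides}


chain = []
print(selectable(chain, "chromatogram"))
add_tool(chain, "chromatogram", "base_caller")
print(selectable(chain, "chromatogram"))
```

Checking compatibility when a tool is selected, rather than when the job runs, surfaces configuration mistakes before any batch is submitted.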

Analysis Tools
- RelGB
  - A simple relational framework for GenBank data
  - Java-based UI for biologically relevant queries
- SSR identification & primer design for ESTs
  - All; UTR; BAC-end; BAC
- EST contigs: Diogenes-Blast; Primer3
  - Analysis Tools

BioData Summary
PERL and CGI scripts, operating on XML indexes to data directories, create a set of predefined web views on the data:
- Grant Summary: Grant Info; Grant Statistics; Contig list; Submission Set List
- Submission: Sequence Length Distribution; Submission Set Visualization; Search BLAST reports; Sequence List
- Contig Sets: Contig Info Table; Phrap Parameters; Submissions in the Contig Set; Contig Quality Graphs; Sequences in the Contig Set
- Contig Page: Sequence Info; Contig Visualization; Sequence Analysis Tools; BLAST Reports
- Sequence Info: Raw Sequence; Filtered Sequence; Sequence Quality Graph; Sequence Analysis Tools; BLAST Reports
- Project Statistics: Number of sequences; Number of submissions; Length Statistics; Contig Statistics; Quality Statistics

BioData File Tree

contig_dir_###
|
+-index.xml
| | <contigdata
| |   kingdom="Planta"
| |   family="Pinaceae"
| |   species="Pinus taeda"
| |   files="contigs"
| | >
| |
| | Xylem
| |
| | NXNV
| |
| | a b a b
| | a a a
| | b a b a
| | c a c b
| |
| | <phrapparams
| |   minmatch="40"
| |   minscore="80"
| | >
| |
| | <contigversionlist
| |   AssemblyProcessId="PtaedaNormalXylem"
| |   AssemblyProcessVersion="1"
| |   AssemblyStepNumber="1"
| | >
| |
| +-libraryname.fasta.screen.ace.1
| |
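As a rough illustration of how a script might read such an index, the Python sketch below parses a cut-down `index.xml`. The attribute names and values come from the slide, but the element nesting and the self-closing tags are assumptions, since the transcript lost the original tag structure; the slides say the real views were built with PERL and CGI scripts, so this is only an analogue.

```python
import xml.etree.ElementTree as ET

# Assumed layout: phrapparams and contigversionlist nested inside contigdata.
INDEX_XML = """
<contigdata kingdom="Planta" family="Pinaceae" species="Pinus taeda" files="contigs">
  <phrapparams minmatch="40" minscore="80"/>
  <contigversionlist AssemblyProcessId="PtaedaNormalXylem"
                     AssemblyProcessVersion="1"
                     AssemblyStepNumber="1"/>
</contigdata>
"""


def summarize(xml_text):
    """Pull the fields a contig summary page would need out of the index."""
    root = ET.fromstring(xml_text)
    params = root.find("phrapparams")
    version = root.find("contigversionlist")
    return {
        "species": root.get("species"),
        "minmatch": int(params.get("minmatch")),
        "minscore": int(params.get("minscore")),
        "assembly": version.get("AssemblyProcessId"),
        "step": int(version.get("AssemblyStepNumber")),
    }


SUMMARY = summarize(INDEX_XML)
print(SUMMARY)
```

Keeping an index file beside the data lets each web view be generated from the directory itself, without a separate database lookup.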

CCGB Condor Cluster
- 65 processors on 37 machines
- Performance
  - 4.75 Gflops
  - 25 BIPS
  - 19 GB memory
  - Figures are roughly equivalent to a 16 processor IBM SP2
- Customized usage policies