ILRI/BECA Bioinformatics Platform Introduction Etienne de Villiers ILRI - Kenya.



Outline
ILRI/BECA Bioinformatics Platform
Hardware
Specialized software:
  – Database searching
  – Assembly software
CGIAR Bioinformatics Grid

International Livestock Research Institute
A lab in Africa at the foot of Kenya’s Ngong Hills

ILRI Research Objectives
Overall mandate is livestock research for poverty alleviation in Africa and South East Asia.
Undertakes a balance of fundamental and applied research with long-, medium- and short-term objectives.
Livestock health, genetics, and management.

ILRI Facilities
State-of-the-art laboratories (2500 m²)
Large and small animal facilities
  – Level-2/3 biosafety facility for cattle and sheep
Bioinformatics unit
  – 64 CPU Paracel 64-bit HPC cluster
Sequencing unit
  – ABI 3730 and ABI 3100
Microarray facility
Proteomics facility
Oligonucleotide synthesis unit
FACS analysis facility
Tick unit

BECA - Biosciences East and Central Africa
Under NEPAD, several centers of excellence are being established in Africa.
One center is being established at ILRI: Biosciences East and Central Africa (BECA).
The center will provide state-of-the-art facilities for scientists in the region.
Facilities include:
  – Genetics and Genomics lab with high-throughput sequencers
  – Microarray laboratory
  – Proteomics laboratory
  – Immunology and molecular biology laboratories
  – Bioinformatics Platform

ILRI/BECA – Bioinformatics Platform
Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support.
EMBNet node for East and Central Africa

IBBP services
Access to bioinformatics tools through either:
  – web-based bioinformatics tools through the BBP website
  – secure shell (ssh) access for registered users
Facilities for storage of large datasets
Systems administration and backup of datasets
Training and support in the use of BBP resources
Graduate and Post-graduate Fellowships in Bioinformatics

IBBP Facilities
Training room
  – 18 computers with MS Windows and Linux
  – High-speed internet connection
Servers
  – 66 CPU Beowulf Linux cluster
  – High-availability Web server

IBBP Website

Selection of available tools on IBBP
Paracel Blast
GeneMatcher2
PTA
Oligocheck
EMBOSS 200+ bioinformatics tools
ClustalW multiple alignment software
T-coffee multiple alignment software
FastA sequence alignment tool
HMMER multiple alignment and sequence searching software
Staden sequence assembly and analysis package
Primer3 primer design package
Paup tree-inference package
Phylip tree-inference package
Phred/Phrap DNA editing and assembly tools
R statistical package
Rosetta – Ab initio protein prediction
SRS – sequence retrieval tool
Etc.

IBBP Hardware Systems
Paracel Blast Machine
  – Parallel NCBI-Blast (20 CPU)
  – Blast
  – PSI-Blast
  – Mega-Blast
GeneMatcher
  – CPU supercomputer
  – HMM
  – Smith-Waterman
  – GeneWise
  – Profile
HPC Linux cluster
  – 66 CPUs (AMD 64-bit)
  – 72 Gigabyte RAM
  – 3 Terabyte disk storage

Linux cluster
Rocks 4.1 (RedHat) operating system
Platform LSF batch queuing shares resources equally between users
MPI libraries
  – Parallel computations
[Diagram: software stack, top to bottom: Application Software (e.g. BLAST, EMBOSS, Rosetta); Middleware (Platform LSF); Operating System (Red Hat - ROCKS); Node Network (GigE)]
[Diagram: Turnkey HPC Integration: Application Integration; Batch Queue Setup; Cluster Build and Configuration]

Database searching
Heuristic Algorithms (FASTA and BLAST)
  – Gapped BLAST
  – Traditional ungapped BLAST
  These are fast but give approximate alignments
Dynamic Programming Algorithms
  – Global – Needleman-Wunsch
  – Local – Smith-Waterman
  These give the optimal alignment but are very slow
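The difference between the global and local dynamic programming methods named on this slide can be illustrated with a minimal sketch. The scoring values (match, mismatch, and a linear gap penalty) are illustrative assumptions, not parameters from the presentation, and the function only returns the optimal score, not the alignment itself.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-2, local=True):
    """Optimal alignment score of strings a and b by dynamic programming.

    local=True  -> Smith-Waterman (best-scoring pair of substrings)
    local=False -> Needleman-Wunsch (both sequences aligned end to end)
    The scoring values are illustrative assumptions.
    """
    cols = len(b) + 1
    # Row 0 of the DP matrix: zeros for local, cumulative gaps for global.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never go below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            curr.append(cell)
        prev = curr
    # Local: best cell anywhere in the matrix. Global: bottom-right cell.
    return best if local else prev[-1]


local_score = alignment_score("TTACGTT", "GGACGGG", local=True)
global_score = alignment_score("TTACGTT", "GGACGGG", local=False)
```

Every cell of the matrix is filled, so the cost grows with the product of the two sequence lengths. That is why these methods are slow on whole databases, and why heuristics such as BLAST trade optimality for speed.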

Paracel Blast Server
Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems
20 CPU parallel NCBI-Blast
20x faster than NCBI-Blast server
Blastn – Paracel Blast vs. NCBI Blast
  – Query – Chromosome 8: 1 sequence, 150,000,000 bases
  – Database – Human Ref. Seq: 10,300 sequences, 24,300,000 bases
  – Run time: Paracel Blast – 1h 9m 56s; NCBI – 6 days 2h 20m 34s
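The speedup implied by the two run times quoted for this single Blastn benchmark can be checked with a little arithmetic. The helper name `to_seconds` is hypothetical. The result applies only to this one query and database, not to Paracel Blast in general.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Run times as quoted on the slide.
paracel_s = to_seconds(hours=1, minutes=9, seconds=56)        # 4196 s
ncbi_s = to_seconds(days=6, hours=2, minutes=20, seconds=34)  # 526834 s

speedup = ncbi_s / paracel_s
print(f"Paracel: {paracel_s} s, NCBI: {ncbi_s} s, ratio: {speedup:.1f}x")
```

For this particular chromosome-sized query the ratio comes out well above the general 20x figure, which suggests the 20x claim describes typical rather than best-case behaviour.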

Paracel Blast Server BioView Viewer

Gene Structure Determination
To compare a cDNA or EST database to a genomic database, one must allow for introns
Two approaches:
  – Double-affine Smith-Waterman (separate gap penalty for introns)
  – GeneWise – protein or HMM versus genomic DNA (models the important features of protein families better)
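One common reading of a "separate gap penalty for introns" is that a gap is charged the cheaper of two affine cost functions. The first has a low opening cost and a per-base extension cost, suited to short indels. The second has a high opening cost and little or no extension cost, suited to intron-length gaps. The sketch below assumes that reading, and all parameter values are illustrative, not taken from the slides or from Paracel's implementation.

```python
def double_affine_gap_cost(length, gap_open=12, gap_extend=2,
                           intron_open=40, intron_extend=0):
    """Penalty for a gap of `length` bases under a double-affine model.

    Assumed model: the gap is scored as either an ordinary indel or an
    intron, whichever is cheaper. All default values are illustrative.
    """
    if length <= 0:
        return 0
    ordinary = gap_open + gap_extend * length
    intron = intron_open + intron_extend * length
    return min(ordinary, intron)


short_gap = double_affine_gap_cost(3)      # scored as an ordinary indel
intron_gap = double_affine_gap_cost(1000)  # scored as an intron
```

With a single affine penalty, a 1000-base gap would cost far more than any exon match could recover, so the aligner would never bridge an intron. Capping the cost lets a cDNA or EST align across it.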

How to get more distant homologs
Use dynamic programming algorithms
Use position-specific or HMM profiles
Do iterated searches
Use translated searches
  Must be careful in interpretation (statistics)
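A translated search compares sequences at the protein level by first translating the DNA in all six reading frames. The sketch below shows only that translation step, using the standard genetic code. The function names are hypothetical, and codons containing characters other than A, C, G, T are rendered as "X" by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X") for c in codons)


def six_frame_translations(dna):
    """Translate both strands at offsets 0, 1 and 2 (frames +1..+3, -1..-3)."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each of the six protein strings is then searched separately. This multiplies the number of comparisons, which is one reason the statistics of translated searches need careful interpretation.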

GeneMatcher2
Do things you either can’t or wouldn’t attempt at NCBI (100x faster)
A computer specialized for executing calculation-intensive methods in bioinformatics:
  – Especially fast in performing the very sensitive Smith-Waterman pairwise alignment method, which can compensate for frame shifts
  – GeneWise intron- and frameshift-tolerant search method
  – Needleman-Wunsch alignments
  – HMM searches
6,144 parallel processor computer

Why GeneMatcher2?
Comparison of sensitivity and selectivity of various sequence search methods
[Chart: axes run toward fewer false positives and more true positives; blue denotes a software method, yellow denotes a hardware-accelerated method]

GeneMatcher2 - Performance
Time-to-completion comparison of original methods and methods on GeneMatcher2
TBLASTX improvement is 20-fold
Other methods at least 100-fold
[Chart: runtime for an average query, in seconds, by method: NCBI TBLASTX, Paracel TBLASTX, Decypher TBLASTX, WUSTL HMM cluster, Decypher HMM, FASTA Smith-Waterman, GeneMatcher2 SW, EBI GeneWise, Paracel GeneWise]
Source: Genome Canada Bioinformatics Platform Project

BioView Workbench BioView Viewer

Assembly Software
Paracel Transcript Assembler (PTA)
  – High-capacity solution for EST-based transcript reconstruction
  – Can assemble large numbers of ESTs, allowing for splice variants
  – Complete pipeline for sequence cleaning, clustering and assembly
  – Detection, alignment and visualization of alternative splice forms
  – Visualization through intuitive graphical interfaces

Scientific problems for PTA
Proteomics
Gene discovery
Verify gene predictions for genome assembly
Detecting splice variants
Patterns of expression, tissue specificity
SNP detection
Combinations of all the above...

PTA – Contig view

PTA – Splice variant alignment

Paracel Oligocheck
Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2
Searches oligos quickly against a whole genome
Software used by companies designing and synthesizing oligonucleotides, e.g. MWG

Ensembl mirror
Ensembl is a joint project between EMBL-EBI and the Sanger Institute.
It is a software system that produces and maintains automatic annotation on selected eukaryotic genomes.
Our site provides free access to selected areas of the data and software from the Ensembl project.

CGIAR – HPC GRID computing
[Diagram: grid sites ILRI (Kenya), IRRI (Philippines), ICRISAT (India), CIP (Peru) and BECA/Partners; labels shown: 49 nodes, 89 CPUs, 33 nodes, GeneMatcher2, 4 nodes, 8 nodes, 4 nodes]

Thank you