ILRI/BECA Bioinformatics Platform Introduction Etienne de Villiers ILRI - Kenya
Outline ILRI/BECA Bioinformatics Platform Hardware Specialized software: –Database searching –Assembly software CGIAR Bioinformatics Grid
International Livestock Research Institute A lab in Africa at the foot of Kenya’s Ngong Hills
ILRI Research Objectives Overall mandate is livestock research for poverty alleviation in Africa and South East Asia. Undertakes a balance of fundamental and applied research with long, medium and short term objectives. Livestock health, genetics, and management.
ILRI Facilities State-of-the-art laboratories (2,500 m²) Large and small animal facilities –Level-2/3 biosafety facility for cattle and sheep Bioinformatics unit –64 CPU Paracel 64-bit HPC cluster Sequencing unit –ABI 3730 and ABI 3100 Microarray facility Proteomics facility Oligonucleotide synthesis unit FACS analysis facility Tick unit
BECA - Biosciences East and Central Africa Under NEPAD, several centers of excellence are being established in Africa. One center is being established at ILRI –Biosciences East and Central Africa (BECA). The center will provide state-of-the-art facilities for scientists in the region. Facilities include: Genetics and Genomics lab with high-throughput sequencers Microarray laboratory Proteomics laboratory Immunology and molecular biology laboratories Bioinformatics Platform
ILRI/BECA – Bioinformatics Platform Provides all East and Central African scientists with access to bioinformatics applications, large-volume data storage, a local mirror of all relevant databases, basic training and helpdesk support. EMBnet node for East and Central Africa
IBBP services Access to bioinformatics tools through either: –web-based bioinformatics tools through the BBP website –secure shell (ssh) access for registered users Facilities for storage of large datasets Systems administration and backup of datasets Training and support in the use of BBP resources Graduate and Post-graduate Fellowships in Bioinformatics
IBBP Facilities Training room –18 computers with MS Windows and Linux –High-speed internet connection Servers –66 CPU Beowulf Linux cluster –High-availability Web server
IBBP Website
Selection of available tools on IBBP Paracel Blast GeneMatcher2 PTA Oligocheck EMBOSS 200+ bioinformatics tools ClustalW multiple alignment software T-Coffee multiple alignment software FASTA sequence alignment tool HMMER multiple alignment and sequence searching software Staden sequence assembly and analysis package Primer3 primer design package PAUP tree-inference package PHYLIP tree-inference package Phred/Phrap DNA editing and assembly tools R statistical package Rosetta – ab initio protein prediction SRS – sequence retrieval tool etc.
IBBP Hardware Systems Paracel Blast Machine Parallel NCBI-Blast (20 CPU) Blast PSI-Blast Mega-Blast GeneMatcher CPU supercomputer HMM Smith-Waterman GeneWise Profile HPC Linux cluster 66 CPUs (AMD 64-bit) 72 Gigabyte RAM 3 Terabyte disk storage
Linux cluster Rocks 4.1 (Red Hat) operating system Platform LSF batch queuing shares resources equally between users MPI libraries Parallel computations Application Software (e.g. BLAST, EMBOSS, Rosetta) Middleware (Platform LSF) Operating System (Red Hat - Rocks) Node Network (GigE) Application Integration Batch Queue Setup Cluster Build and Configuration Turnkey HPC Integration
Database searching Heuristic Algorithms (FASTA and BLAST) –Gapped BLAST –Traditional ungapped BLAST Are fast but give approximate alignments Dynamic Programming Algorithms –Global – Needleman-Wunsch –Local – Smith-Waterman Give optimal alignment but are very slow
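To make the dynamic programming idea concrete, here is a minimal sketch of Smith-Waterman local alignment scoring. It is an illustration only, not the implementation used on the platform: the function name and the scoring values (match +2, mismatch -1, linear gap -2) are assumptions chosen for readability, and it returns only the best score, not the alignment itself.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the scoring
    matrix in memory. Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor of 0 is what makes the alignment local:
            # a poor prefix is dropped instead of dragging the score down.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # shared core "ACGT" scores 8
```

Every cell of the matrix is filled in, so the running time grows with the product of the two sequence lengths. That is why the optimal methods are slow on large databases and why heuristics such as BLAST and FASTA skip most of the matrix. A global (Needleman-Wunsch) variant would drop the floor of 0 and initialise the first row and column with cumulative gap penalties.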
Paracel Blast Server Paracel BLAST is the most advanced BLAST software written specifically for large-scale cluster systems 20 CPU parallel NCBI-Blast 20x faster than NCBI-Blast server Blastn – Paracel Blast vs. NCBI Blast: Paracel Blast – 1h 9m 56s; NCBI – 6 days 2h 20m 34s Query – Chromosome 8: 1 sequence, 150,000,000 bases Database – Human Ref. Seq: 10,300 sequences, 24,300,000 bases
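The two run times quoted for the Blastn benchmark can be turned into a speedup factor with simple arithmetic. The sketch below uses only the timings on the slide; the helper name is an assumption.

```python
def to_seconds(days=0, hours=0, minutes=0, seconds=0):
    """Convert a days/hours/minutes/seconds duration to seconds."""
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


paracel = to_seconds(hours=1, minutes=9, seconds=56)        # 4,196 s
ncbi = to_seconds(days=6, hours=2, minutes=20, seconds=34)  # 526,834 s
speedup = ncbi / paracel

print(f"Speedup for this query: about {speedup:.0f}x")
```

For this particular query the ratio works out to roughly 126-fold, well above the general 20x figure. The slide does not say what accounts for the difference, so the benchmark is best read as a single favourable case rather than a typical one.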
Paracel Blast Server BioView Viewer
Gene Structure Determination To compare a cDNA or EST database to a genomic database, one must allow for introns Two approaches: –Double-affine Smith-Waterman (separate gap penalty for introns) –GeneWise – protein or HMM versus genomic DNA (models the important features of protein families better)
How to get more distant homologs Use dynamic programming algorithms Use position-specific or HMM profiles Do iterated searches Use translated searches Must be careful in interpretation (statistics)
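A translated search compares sequences at the protein level, which requires translating the nucleotide sequence in all six reading frames first. The sketch below shows only that translation step, assuming the standard genetic code; the function names are hypothetical and it is not the code used by any of the tools listed here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one reading frame; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Each of the six protein strings is then searched separately, which is one reason translated searches cost several times more than a plain nucleotide search.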
GeneMatcher2 Do things you either can’t or wouldn’t attempt at NCBI (100x faster) Is a computer specialized for executing calculation-intensive methods in bioinformatics: –Especially fast in performing the very sensitive Smith-Waterman pairwise alignment method (can compensate for frame shifts) –GeneWise intron- and frameshift-tolerant search method –Needleman-Wunsch alignments –HMM searches 6,144-processor parallel computer
Why GeneMatcher2? Comparison of sensitivity and selectivity of various sequence search methods Blue denotes a software method Yellow denotes a hardware-accelerated method Fewer false positives More true positives
GeneMatcher2 - Performance Time-to-completion comparison of original methods and methods on GeneMatcher2 TBLASTX improvement is 20-fold Other methods at least 100-fold Source: Genome Canada Bioinformatics Platform Project [Chart: runtime in seconds for an average query, by method: NCBI TBLASTX, Paracel TBLASTX, Decypher TBLASTX, WUSTL HMM cluster, Decypher HMM, FASTA Smith-Waterman, GeneMatcher2 SW, EBI GeneWise, Paracel GeneWise]
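The two axes of this comparison can be computed from a search result once the true homologs are known. The sketch below is a hedged illustration: the slide does not define its terms, so it assumes sensitivity means the fraction of true homologs that were found and selectivity means the fraction of reported hits that are true homologs. The function name and the example identifiers are hypothetical.

```python
def sensitivity_and_selectivity(reported, true_homologs):
    """Return (sensitivity, selectivity) for a set of reported hits.

    sensitivity = TP / (TP + FN): share of true homologs that were found
    selectivity = TP / (TP + FP): share of reported hits that are real
    (assumed definitions; the source does not spell them out)
    """
    reported, true_homologs = set(reported), set(true_homologs)
    tp = len(reported & true_homologs)   # true positives
    fp = len(reported - true_homologs)   # false positives
    fn = len(true_homologs - reported)   # missed homologs
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    selectivity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, selectivity


# Hypothetical example: 3 of 5 real homologs found, plus 1 spurious hit.
sens, sel = sensitivity_and_selectivity(
    reported={"seqA", "seqB", "seqC", "seqX"},
    true_homologs={"seqA", "seqB", "seqC", "seqD", "seqE"},
)
print(sens, sel)  # 0.6 0.75
```

A method that moves toward "more true positives" raises the first number, and one that moves toward "fewer false positives" raises the second.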
BioView Workbench BioView Viewer
Assembly Software Paracel Transcript Assembler (PTA) –High-capacity solution for EST-based transcript reconstruction –Can assemble large numbers of ESTs, allowing for splice variants –Complete pipeline for sequence cleaning, clustering and assembly –Detection, alignment and visualization of alternative splice forms –Visualization through intuitive graphical interfaces
Scientific problems for PTA Proteomics Gene discovery Verify gene predictions for genome assembly Detecting splice variants Patterns of expression, tissue specificity SNP detection Combinations of all the above...
PTA – Contig view
PTA – Splice variant alignment
Paracel Oligocheck Oligocheck uses the sensitive Smith-Waterman alignment routine of GeneMatcher2 Searches oligos quickly against a whole genome Software used by companies that design and synthesize oligonucleotides, e.g. MWG
Ensembl mirror Ensembl is a joint project between EMBL-EBI and the Sanger Institute. It is a software system that produces and maintains automatic annotation on selected eukaryotic genomes. Our site provides free access to selected areas of the data and software from the Ensembl project.
CGIAR – HPC GRID computing ILRI Kenya IRRI Philippines ICRISAT India CIP Peru 49 nodes 89 CPUs 33 nodes GeneMatcher2 4 nodes 8 nodes 4 nodes BECA/Partners
Thank you