Cluster Computing Applications for Bioinformatics
Thurs., Aug. 9, 2007
- Introduction to cluster computing
- Working with Linux operating systems
- Overview of bioinformatics applications
Introduction
- Damian Christey
- Professional Technologist
- Departments of Mathematics and Biology
Cluster Computing
- High Availability (HA)
- High Performance (HPC)
  - Specialized software
  - Highly parallel
- Beowulf
  - Commodity hardware
  - Open Source software
Biology Cluster Hardware
- 12 nodes
- 2 processors per node
- Dual core 1GHz Opteron
- 8 GB RAM each
- Gigabit Ethernet
- 2 TB RAID storage
GNU/Linux
- Free, Open Source, Unix-like operating system
- Rocks cluster management system
- CentOS: derived from Red Hat
Why Linux?
- Cheap
- Reliable and Scalable
- Customizable
- Unix philosophy
- Text processing
Accessing the Cluster
- Monitoring -
- Secure Shell: ssh -X on Mac OS, or Windows users can download SSH and X server from:
- File transfer – SFTP for Windows, for Mac
- qrsh – command to get a shell on a node
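The login and file-transfer commands named on this slide can be sketched as follows. The user name `bob` and the address `cluster.example.edu` are placeholders invented for illustration, not the real cluster address; `qrsh` is assumed to come from the Grid Engine scheduler and is only expected to exist on the cluster itself.

```shell
# Hypothetical values: replace with your own account and the cluster's address.
CLUSTER_USER="bob"
CLUSTER_HOST="cluster.example.edu"

# Log in with X11 forwarding (-X) so graphical programs display locally.
login_cmd="ssh -X ${CLUSTER_USER}@${CLUSTER_HOST}"
# Open an interactive file-transfer session over the same secure channel.
transfer_cmd="sftp ${CLUSTER_USER}@${CLUSTER_HOST}"

echo "To log in:         $login_cmd"
echo "To transfer files: $transfer_cmd"

# qrsh is normally installed only on the cluster, so check before using it.
if command -v qrsh >/dev/null 2>&1; then
    echo "qrsh is available: run 'qrsh' to get a shell on a compute node"
else
    echo "qrsh not found here: log in to the cluster first"
fi
```

The script only prints the commands rather than running them, so it is safe to try on any machine before the real host name is filled in.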
Unix Filesystem
- Tree with a single root: /
- Folders may be physically stored on separate devices or on different machines
- /home/bob : Bob’s files
- /opt/Bio : Bioinformatics programs
- /share/bio : shared data, genome libraries
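One way to see that folders in the single tree can live on different devices or machines is to ask `df` which filesystem backs each directory. This is a minimal sketch; the helper name `backing_device` is made up for this example, and the slide's directories are only expected to exist on the cluster.

```shell
# Report which device (or remote server) backs a directory.
# df -P prints one POSIX-format line per filesystem; column 1 is the
# device or network export that holds the directory.
backing_device() {
    df -P "$1" | tail -n 1 | awk '{print $1}'
}

root_device=$(backing_device /)
echo "/ is stored on: $root_device"

# The directories from the slide exist only on the cluster, so check first.
for dir in /home /opt/Bio /share/bio; do
    if [ -d "$dir" ]; then
        echo "$dir is stored on: $(backing_device "$dir")"
    else
        echo "$dir is not present on this machine"
    fi
done
```

On a cluster, a shared directory would typically show a network export (server:/path) in the first column instead of a local disk.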
Unix Permissions
- 3x3 Matrix: (owner, group, other) x (read, write, execute)
- chgrp biouser file : change the group to which the file belongs
- chmod g+w file : give the group write permission to your file
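The effect of `chmod g+w` on the permission matrix can be checked on a scratch file. This is a small sketch: the file name is invented, and `chgrp biouser` is left commented out because the `biouser` group from the slide may not exist on your machine.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
file="$workdir/results.txt"
echo "sample data" > "$file"

# Start from a known state: read/write for the owner only.
chmod 600 "$file"
before=$(ls -l "$file" | cut -c1-10)

# Give the group write permission, as on the slide.
chmod g+w "$file"
after=$(ls -l "$file" | cut -c1-10)

echo "before: $before"   # -rw-------
echo "after:  $after"    # -rw--w----

# Changing the group only works for a group you belong to.
# chgrp biouser "$file"
```

The ten characters printed by `ls -l` are the file type followed by the three owner, three group, and three other permission bits, so the new `w` appears in the group column.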
Text Processing
- cat file : dump the contents of file to standard output
- head, tail : output the first / last n lines of file
- grep : return lines matching pattern in input or file
- grep -v : invert match (return lines that do not match)
- | : pipe output of one program to another
- > : redirect output to a file (overwrites the file)
- >> : append output to end of file
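These commands combine naturally on sequence files. The sketch below builds a tiny FASTA file with invented sequences (the file names are also made up) and then uses each tool from the slide on it.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# > creates (or overwrites) a file; >> appends to it.
printf '>seq1\nACGTACGT\n' > demo.fasta
printf '>seq2\nGGGTTTAA\n' >> demo.fasta
printf '>seq3\nTTTTCCCC\n' >> demo.fasta

cat demo.fasta          # whole file
head -n 2 demo.fasta    # first record
tail -n 2 demo.fasta    # last record

# Header lines start with '>'; the quotes stop the shell from
# treating > as a redirection.
grep '>' demo.fasta > headers.txt

# grep -v keeps everything except the headers, i.e. the bases.
grep -v '>' demo.fasta > bases.txt

# grep -c counts matching lines, which here is the number of sequences.
n_seqs=$(grep -c '>' demo.fasta)

# A pipe feeds grep's output into head to pick out the first header.
first_header=$(grep '>' demo.fasta | head -n 1)

echo "sequences: $n_seqs, first header: $first_header"
```

Quoting the `>` pattern matters: an unquoted `grep > demo.fasta` would be read as a redirection and would overwrite the file.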
Sequencing and Assembly Software
- Phred - reads DNA sequencing trace files, calls bases, and assigns quality values
- Phrap - assembles shotgun DNA sequence data
- Consed - viewing, editing, and finishing sequence assemblies created with phrap
- Artemis - genome viewer and annotation tool
Sequence Analysis and Screening Software
- (WU, NCBI, MPI) BLAST - find regions of local similarity between sequences
- ClustalW, T_Coffee, MUSCLE - multiple sequence alignment
- RepeatMasker - screens for interspersed repeats and low complexity sequences
- RepeatScout, PILER - de novo repeat finders
- EMBOSS – assorted analysis tools
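As an illustration of a local-similarity search, the sketch below assumes the legacy NCBI BLAST command-line tools (`formatdb` and `blastall`); WU-BLAST and mpiBLAST use different commands, and which one is installed on a given cluster is not stated here. The sequences and file names are invented, and the search is skipped if the tools are not found.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up sequences: a one-entry "database" and a query cut from it.
printf '>subject1\nACGTTGCAAGGCTTACGGATCCGATTGCAAGCTTGGCCAATTGGCCTTAAGG\n' > db.fasta
printf '>query1\nGGCTTACGGATCCGATTGCAAGCTTGGCC\n' > query.fasta

if command -v formatdb >/dev/null 2>&1 && command -v blastall >/dev/null 2>&1; then
    # Index the nucleotide database (-p F means "not protein").
    formatdb -i db.fasta -p F
    # Nucleotide search (blastn) with tabular output (-m 8)
    # and an E-value cutoff of 1e-5.
    blastall -p blastn -d db.fasta -i query.fasta -e 1e-5 -m 8 -o hits.tsv
    cat hits.tsv
else
    echo "legacy NCBI BLAST not found on this machine; skipping the search"
fi
```

Tabular output has one hit per line, which makes the results easy to filter with the text-processing commands covered earlier (grep, head, tail).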
Phylogenetics Software
- Phylip, Paup - packages for inferring phylogenies or evolutionary trees
- MrBayes - Bayesian inference of phylogeny
- Structure - model-based clustering method for inferring population structure