1 Operating System Packages for Bioinformatics Allen Day 2005.05.17

2 What is a package?  Software, config files, documentation, and/or data encapsulated in a single file.  Metadata describing: version, license, and package “category”; dependencies; what the package provides.

3  GMOD target audience Small MODs

4 Package Dependency Graph (diagram).  Labels: Dependencies; What the package provides.  Packages shown: chado, chado-Hsa, genome-Hsa-nib, ucsc-blat, genome-Hsa-annotation-affymetrix, genome-Hsa-annotation-gene, postgresql-AffxSeq, postgresql-server, perl-bioperl, obo-core, perl-go-perl.
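
A dependency graph like this one is what lets a package manager work out an install order. The sketch below is a minimal illustration in Python, not how RPM or yum implement it. The package names come from the slide, but the edges in `DEPENDS_ON` are assumptions, since the arrows of the original diagram are not recoverable here. The helper name `install_order` is also hypothetical.

```python
# Hypothetical dependency edges: the slide names these packages, but the
# arrows between them are assumed here purely for illustration.
DEPENDS_ON = {
    "chado-Hsa": ["chado", "genome-Hsa-nib"],
    "chado": ["postgresql-server", "perl-bioperl", "obo-core", "perl-go-perl"],
    "ucsc-blat": ["genome-Hsa-nib"],
    "perl-go-perl": ["perl-bioperl"],
    "genome-Hsa-nib": [],
    "postgresql-server": [],
    "perl-bioperl": [],
    "obo-core": [],
}


def install_order(target, depends_on):
    """Return the packages needed for `target`, dependencies first."""
    order = []           # finished packages, in a valid install order
    done = set()         # packages already placed in `order`
    in_progress = set()  # packages on the current recursion path

    def visit(pkg):
        if pkg in done:
            return
        if pkg in in_progress:
            raise ValueError(f"dependency cycle at {pkg}")
        in_progress.add(pkg)
        for dep in depends_on.get(pkg, []):
            visit(dep)   # recurse through the dependency graph
        in_progress.discard(pkg)
        done.add(pkg)
        order.append(pkg)

    visit(target)
    return order
```

Calling `install_order("chado-Hsa", DEPENDS_ON)` returns only the packages reachable from `chado-Hsa`, each listed after everything it depends on. Unrelated packages such as `ucsc-blat` are left out.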

5 Dependencies  Build Dependency  Installation Dependency

6 What is a Package Manager?  Tools to manage installation, upgrade, and uninstallation of packages: verify package integrity (checksums); maintain system integrity (transactional, allow rollbacks); dependency checking; dependency graph recursion; allow software customization (patches).
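
Two of these ideas, checksum verification and transactional installs with rollback, can be illustrated with a toy model. This is a sketch under stated assumptions, not the mechanism of any real package manager: the "system" is just a dictionary of installed versions, and the names `sha256_hex` and `install_transaction` are hypothetical.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Checksum used to verify that a package payload is intact."""
    return hashlib.sha256(payload).hexdigest()


def install_transaction(installed, packages):
    """Install every package or none of them.

    installed: dict mapping package name -> version (toy system state).
    packages: list of (name, version, payload_bytes, expected_sha256).
    Returns True on success, False if the transaction was rolled back.
    """
    snapshot = dict(installed)  # state to restore if anything fails
    try:
        for name, version, payload, expected in packages:
            if sha256_hex(payload) != expected:
                raise ValueError(f"checksum mismatch for {name}")
            installed[name] = version
    except ValueError:
        installed.clear()
        installed.update(snapshot)  # roll back to the snapshot
        return False
    return True
```

If any payload fails its checksum, the packages installed earlier in the same transaction are undone too, so the system state is never left half-updated.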

7 Current Generation of PMs  RPM  Dpkg  Apt  Yum  Emerge  tgz/bz2  Windows Installer

8 Why bioinformatics packages?  Consistency of installation process: bioinformatics package installs vary wildly and commonly lack documentation.  Automatic dependency installation: Perl modules are especially bad; bioperl has 60+ modules in its dependency tree.  Integrity/auditing of system state: know that an installed package works, which version it is, and how to replicate the system setup.  Tighter integration with the operating system: daemons, config and log file locations, etc.

9 What’s available?  RPM packages only right now; primary focus on Fedora Core 2.  Some RPMs also available for: Fedora Core 3, RedHat 9, Cygwin.

10 What’s available?  Three primary foci: applications, libraries, data sets.

11 Applications  Gbrowse  Textpresso  BLAT daemon  NCBI Toolkit (BLAST, etc)  HMMer

12 What’s available?  Libraries: Bioperl, R & Bioconductor, Squid, EMBOSS.

13 What’s available?  Data sets: genome & protein sequence, sequence features, ontologies.  All installed using a common directory structure.

14 What’s available?  UCSC tools (utilities, BLAT system service, CGI scripts)  Bioperl  R / Bioconductor  GMOD apps (Gbrowse, Textpresso, …)  Data packages: genome sequence (fa, nib, blastdb); genome features (Affy probeset alignments, mRNA, etc.)

15 GMOD Components Available (diagram).  Components shown: chado-Hsa, gbrowse, textpresso, gmod-web-Hsa, turnkey, chado, das2-Hsa, apollo-Hsa, cmap-Hsa, ucsc-BLAT, genome-Hsa-nib.  ‘Hsa’ can be replaced with your organism.  Currently built for ‘Cel’, ‘Hsa’, ‘Sce’.

16 More details… (diagram).  Packages shown: chado, chado-Hsa, genome-Hsa-nib, ucsc-blat, perl-go-perl, genome-Hsa-annotation-affymetrix, genome-Hsa-annotation-gene, postgresql-AffxSeq, postgresql-server, perl-bioperl, …

17 Gene Expression Components (diagram).  Components shown: chado-Hsa; Bioconductor; R; Quant/Norm Pipeline; chado-GEC; DAS/2 for Genotyping, GeneChip.

18 Resources  ~1000 RPMs for Fedora Core 2 and 3, available via yum.  See site for a configuration example.

19 TODO  Support more architectures: build for Cygwin & OS X (RPM has been ported to both).  Automate the package build process: build farm of multiple architectures, controllable via a scheduler (GridEngine).  Automate (if possible) inclusion of new software / data releases.

20 TODO  Build community interest and involvement: keep adding more packages! Keep existing packages current!

21 Acknowledgements  Patrick Alger  Jared Fox  Brian O’Connor  Todd Harris  Lincoln Stein  Stanley Nelson

22 Anatomy of a specfile  Metadata: Name, Depends, Provides, Changelog.  Build & install script hooks: %prep, %build, %install, %post, %preun.
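
To make the anatomy concrete, here is a small Python sketch that renders a specfile skeleton from metadata. It is an illustration only and not a buildable package. In RPM the dependency metadata listed as "Depends" is spelled `Requires`, with `BuildRequires` for build-time dependencies, which matches the build/installation split on slide 5. The function name `render_spec` and the example values in the usage note are hypothetical.

```python
def render_spec(name, version, summary, license_name,
                build_requires, requires, provides):
    """Render a minimal RPM specfile skeleton as a string."""
    # Metadata (preamble) tags.
    lines = [
        f"Name: {name}",
        f"Version: {version}",
        "Release: 1",
        f"Summary: {summary}",
        f"License: {license_name}",
    ]
    lines += [f"BuildRequires: {dep}" for dep in build_requires]  # build deps
    lines += [f"Requires: {dep}" for dep in requires]             # install deps
    lines += [f"Provides: {cap}" for cap in provides]
    lines += ["", "%description", summary]

    # Build and install script hooks, left as placeholders.
    hooks = [
        ("%prep", "unpack and patch the sources"),
        ("%build", "compile the software"),
        ("%install", "copy files into the build root"),
        ("%post", "runs after the package is installed"),
        ("%preun", "runs before the package is uninstalled"),
        ("%files", "list the files the package owns"),
        ("%changelog", "dated entries describing each release"),
    ]
    for section, note in hooks:
        lines += ["", section, f"# {note}"]
    return "\n".join(lines) + "\n"
```

For example, `render_spec("chado-Hsa", "0.1", "Chado schema loaded with human data", "Artistic", ["perl"], ["chado", "postgresql-server"], ["chado-organism"])` produces a skeleton whose preamble carries the metadata and whose sections appear in the order the hooks run.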

