Download presentation
Presentation is loading. Please wait.
Published byGinger Wells Modified over 8 years ago
1
Introduction to the Microsoft Biology Foundation (MBF) Microsoft Biology Initiative Module 02
2
Agenda Introduction to MBF ▫ What is MBF? ▫ Common usage scenarios MBF Architecture ▫ Sequences, alphabets and symbols ▫ Parsers and formatters ▫ Introduction to algorithms MBF Starter Project ▫ Creating a new C# project MBF Source Code ▫ Building the source ▫ Testing with nUnit
3
What is MBF? Microsoft Biology Foundation (MBF) is a bioinformatics toolkit ▫ built on top of the.NET Framework 4.0 ▫ open source under MS-PL license ▫ foundation upon which other tools can be built Provides various components useful for biological analysis ▫ parsers to read and write common bioinformatics formats ▫ support for DNA, RNA and protein sequences ▫ algorithm framework for analysis and transformation ▫ web connector framework for web-service interaction
4
What is MBF intended to do? Primarily focused on genomics ▫ reusable data structures to represent sequences + symbols ▫ I/O framework to load/save sequences ▫ algorithm framework to process loaded sequences Provides an alternative to other biology frameworks ▫ similar concepts to BioJava or BioPerl ▫ takes advantage of Microsoft developer tools and.NET ▫ will evolve as Microsoft and other contributors add features Designed to manipulate large data sets ▫ in-memory compression of sequence data ▫ data virtualization for sequences larger than memory ▫ scalable algorithms that take advantage of multiple cores
5
MBF Design Goals Extensibility was a primary goal ▫ core concepts mapped as interfaces and ABCs ▫ can easily provide alternative implementations or add any missing features you need Language Neutral ▫ built on top of.NET – use any supported language ▫ supports dynamic languages such as IronPython Designed and implemented using best practices ▫ commented source code provided so nothing is a black box ▫ algorithms all cite publications Interoperability ▫ code can be run on several mainstream platforms
6
MBF vs. your application MBF is not an application in itself ▫ it does not provide any visualization of the data being managed ▫ it provides the basis for visualizations to be built on top of.NET Framework 4.0 Your Application MBF
7
Creating Applications with MBF MBF allows you to work with your data however you need Console (Text) Win Forms & WPF ASP.NET / WCF SilverlightAzureNT Service
8
Example: Sequence Assembler sequence data is loaded from FASTA file and assembled using MBF drawn as nucleotide symbols and graphics using WPF
9
Deploying your applications Possible to target non-Windows platforms [1] ▫ using Silverlight / Mono / Moonlight
10
Getting MBF MBF is available as an open source, free download ▫ http://mbf.codeplex.com http://mbf.codeplex.com Downloads section lists most recent build
11
MBF Licensing MBF is licensed under Ms-PL ▫ http://msdn.microsoft.com/en-us/library/cc707818.aspx http://msdn.microsoft.com/en-us/library/cc707818.aspx ▫ allows you to take the code and use it in academic or commercial products
12
Installing MBF Official releases are packaged as Setup Files ▫ include all the pre-built assemblies you can use immediately ▫ installs full.NET 4.0 framework if not already installed Several other tools available ▫ Sequence Assembler sample ▫ Excel add-in (http://bioexcel.codeplex.com/)http://bioexcel.codeplex.com/
13
Installing MBF Run the Setup program to install the MBF elements ▫ only supports Windows installs, other platforms require manual installation using source code ▫ creates %Program Files%\Microsoft Biology Initiative\1.0\MBF
14
MBF Installation Installer creates several files and directories ▫ \Doc directory holds documentation files ▫ \Addin directory contains optional algorithms ▫ \Sdk directory contains samples and additional documentation ▫ Bio.dll is core MBF assembly ▫ WebServiceHandlers.dll provides web service capabilities
15
Documentation Several documents supplied with installation (in /Doc ) ▫ even more available from http://mbf.codeplex.com/documentationhttp://mbf.codeplex.com/documentation Two documents are required reading before you begin ▫ start with the MBF_Overview.docx ▫ then read the Programming_Guide.docx BioDotNet.chm help file provides API reference ▫ installed with SDK (full install) MBF_Overview.docx MBF_Programming_Guide.docx BioDotNet.chm
16
Download the source code Source code is also available online at CodePlex ▫ can download as a pre-packaged.ZIP file [1] Can also apply to contribute to the framework [2] ▫ provides TFS credentials to get access to repository
17
Architecture: Namespaces Bio Sequences Alphabets Alignments Genomic Intervals Phylogeny Bio.IO FASTA / FASTQ GenBank NEXUS … Bio.Algorithms Translation Alignment Sequence Assembly … Bio.Web BLAST ClustalW BioHPC …
18
MBF Core Types Bio namespace holds core types and base interfacesAlphabets ISequence DnaAlphabet ISequenceItem Nucleotide Sequence
19
Alphabets Valid symbols are defined in terms of an alphabet ▫ determines allowed characters and meaning Access standard alphabets through Instance properties ▫ supplies standard alphabets for DNA, RNA and protein sets ▫ can also access them using Alphabets static class Or create custom alphabets if necessary ▫ by implementing the IAlphabet interface var dnaAlphabet = DnaAlphabet.Instance;... var dnaAlphabet2 = Alphabets.DNA; var dnaAlphabet = DnaAlphabet.Instance;... var dnaAlphabet2 = Alphabets.DNA; These two statements retrieve the same alphabet
20
Sequence Items ISequenceItem defines a single symbol ▫ supplies name, attributes and character used to represent symbol ▫ most common form is Nucleotide var dnaAlphabet = DnaAlphabet.Instance; ISequenceItem dnaG = dnaAlphabet.LookupBySymbol("G"); Console.WriteLine("{0}: {1} {2} {3}, {4}, {5}", dnaG.Name, dnaG.IsAmbiguous, dnaG.IsGap, dnaG.IsTermination, dnaG.Symbol, dnaG.Value); var dnaAlphabet = DnaAlphabet.Instance; ISequenceItem dnaG = dnaAlphabet.LookupBySymbol("G"); Console.WriteLine("{0}: {1} {2} {3}, {4}, {5}", dnaG.Name, dnaG.IsAmbiguous, dnaG.IsGap, dnaG.IsTermination, dnaG.Symbol, dnaG.Value); Guanine: False False False, G, 0
21
Representing Sequences ISequence interface represents ordered list of sequence items ▫ store data relevant to DNA, RNA and Amino Acid structures ▫ can work with sequence as a list of items, or as a string public interface ISequence : IList { string ID { get; } string DisplayID { get; } IAlphabet Alphabet { get; } object Documentation { get; set; } MoleculeType MoleculeType { get; }... string ToString(); } public interface ISequence : IList { string ID { get; } string DisplayID { get; } IAlphabet Alphabet { get; } object Documentation { get; set; } MoleculeType MoleculeType { get; }... string ToString(); }
22
Sequence Implementations Several ISequence implementations in the framework ▫ each optimized for a specific purpose, most common is Sequence
23
Creating new sequences Sequence type is most basic ISequence implementation ▫ created as read-only by default ▫ insert Nucleotide items to populate ISequence sequence = new Sequence(Alphabets.DNA, "AGCT");... sequence.IsReadOnly = false; sequence.Add(DnaAlphabet.Instance.AC); sequence.Add(new Nucleotide('-',"Gap"));... sequence.RemoveAt(0); Console.WriteLine(sequence); ISequence sequence = new Sequence(Alphabets.DNA, "AGCT");... sequence.IsReadOnly = false; sequence.Add(DnaAlphabet.Instance.AC); sequence.Add(new Nucleotide('-',"Gap"));... sequence.RemoveAt(0); Console.WriteLine(sequence); GCTM-
24
Working with string-based data Common to work with sequences a strings ▫ provides a readable representation of the data ▫ lose some information (gaps, terminators, etc.) ▫ not efficient for larger sequences void ProcessSequence(ISequence sequence) { string data = sequence.ToString(); string reverse = new string(data.Reverse().ToArray()); foreach (char symbol in reverse) {... } } void ProcessSequence(ISequence sequence) { string data = sequence.ToString(); string reverse = new string(data.Reverse().ToArray()); foreach (char symbol in reverse) {... } }
25
Working with sequence-based data Better to work with real ISequenceItem data ▫ maintains full identity ▫ helper properties on ISequence perform common tasks [1] void ProcessSequence(ISequence sequence) { ISequence reverse = sequence.Reverse; foreach (ISequenceItem symbol in reverse) {... } } void ProcessSequence(ISequence sequence) { ISequence reverse = sequence.Reverse; foreach (ISequenceItem symbol in reverse) {... } }
26
Reading and writing sequences Most common way to obtain a sequence is through a parser ▫ loads sequence data from some persistent storage ▫ tied to a specific format ▫ can load one or more sequences together ▫ can support metadata and statistics for sequence Once loaded, sequence can be processed ▫ through methods of ISequence, or by algorithms Finally, sequences are saved using formatters ▫ writes collection of ISequence objects to persistent storage MBF has several available parsers and formatters [1] ▫ contained in the Bio.IO namespace ▫ designed for extensibility – to support your formats
27
Loading sequences with parsers Several supplied parsers load common bio sequence formats ▫ FastA, FastQ, GenBank, Gff All sequence parsers implement ISequenceParser ▫ provides consistent interface to parsing data ▫ supports loading data from files and streams (more on this later) public interface ISequenceParser : IParser { IList Parse(string filename); IList Parse(string filename, bool isReadOnly);... ISequence ParseOne(string filename); ISequence ParseOne(string filename, bool isReadOnly);... } public interface ISequenceParser : IParser { IList Parse(string filename); IList Parse(string filename, bool isReadOnly);... ISequence ParseOne(string filename); ISequence ParseOne(string filename, bool isReadOnly);... }
28
Loading data from specific formats If file format is known, specific parser can be used to load data ▫ easiest and least error prone method to loading data private IList LoadSequence(string filename) { FastaParser parser = new FastaParser(); IList data = parser.Parse(filename,true); return data; } private IList LoadSequence(string filename) { FastaParser parser = new FastaParser(); IList data = parser.Parse(filename,true); return data; } second parameter indicates to open in read-only mode for performance – indicating change tracking is not necessary
29
Handling multiple file formats SequenceParsers class manages built-in parser types ▫ can use FindParserByFile method to locate proper parser at runtime private IList LoadSequence(string filename) { ISequenceParser parser = SequenceParsers.FindParserByFile(filename); if (parser == null) return null; IList data = parser.Parse(filename,true); return data; } private IList LoadSequence(string filename) { ISequenceParser parser = SequenceParsers.FindParserByFile(filename); if (parser == null) return null; IList data = parser.Parse(filename,true); return data; } FindParserByFile returns null if file could not be identified [1]
30
Interrogating the parser list SequenceParsers also provides enumerable list of parsers private IList TryLoadSequence(string filename) { IList parsers = SequenceParsers.All; foreach (var parser in parsers) { try { return parser.Parse(filename, true); } catch { } } return null; } private IList TryLoadSequence(string filename) { IList parsers = SequenceParsers.All; foreach (var parser in parsers) { try { return parser.Parse(filename, true); } catch { } } return null; }
31
Saving sequences back to files Formatters take sequences and persists them ▫ same formats supported: FastA, FastQ, GenBank and Gff Abstracted by ISequenceFormatter interface ▫ supports file-based and stream-based writing (more on this later) public interface ISequenceFormatter : IFormatter { void Format(ICollection sequences, string filename); void Format(ISequence sequence, string filename); string FormatString(ISequence sequence); } public interface ISequenceFormatter : IFormatter { void Format(ICollection sequences, string filename); void Format(ISequence sequence, string filename); string FormatString(ISequence sequence); }
32
Saving a sequence SequenceFormatters provides list of available formatters void SaveSequence(string filename, IList seqList) { ISequenceFormatter formatter = SequenceFormatters.FindFormatterByFile(filename); if (formatter != null) { formatter.Format(seqList, filename); } } void SaveSequence(string filename, IList seqList) { ISequenceFormatter formatter = SequenceFormatters.FindFormatterByFile(filename); if (formatter != null) { formatter.Format(seqList, filename); } } void SaveFastASequence(string fname, IList seqList) { SequenceFormatters.Fasta.Format(seqList, fname); } void SaveFastASequence(string fname, IList seqList) { SequenceFormatters.Fasta.Format(seqList, fname); }
33
Running algorithms on Sequences MBF provides a small collection of popular algorithms ▫ alignment, translation, assembly, … ▫ designed specifically to plug in new algorithms Bio.Algorithms is where all the algorithmic code is located ▫ each algorithm is given unique namespace … can also be supplied in separate assemblies that MBF locates and provides access to at runtime [1]
34
Using the algorithm classes Algorithms generally ▫ take one or more ISequence elements as input and return one or more ISequence elements as output Algorithms come in two forms ▫ static methods – to run simple algorithm on a single sequence ▫ instance classes – to run algorithms on 1+ sequences using Bio.Algorithms.Translation;... ISequence DNAtoRNA(ISequence dnaSequence) { ISequence rnaSequence = Transcription.Transcribe(dnaSequence); return rnaSequence; } using Bio.Algorithms.Translation;... ISequence DNAtoRNA(ISequence dnaSequence) { ISequence rnaSequence = Transcription.Transcribe(dnaSequence); return rnaSequence; }
35
Using MBF in your applications Using MBF is as simple as adding a reference to Bio.dll ▫ can then begin consuming available types ▫ convenient to add assembly to project – to ensure it is available ▫ supplied distribution requires the full.NET 4.0 framework install
36
MBF Starter Project [Step 1] If you are starting fresh, you can use the MBF Starter Template ▫ added to Visual Studio 2010 when you installed MBF Select MBF Console Application from project types requires.NET Framework 4 and Visual C# project type selected
37
MBF Starter Project [Step 2] Select options you want to use from MBF in your new app ▫ each checkbox will add standard methods for you to utilize … click Finish to generate project provides simple text- file logging capability
38
MBF Starter Project [Step 3] Add your code to the Main method ▫ call the supplied methods to get / save and manipulate the sequences class Program { static void Main(string[] args) { // TODO: Your Code Goes Here } // Exports a given sequence to a file in FastA format static void ExportFastA(ISequence sequence, string filename); // Parses a FastA file which has one or more sequences. static IList ParseFastA(string filename); // Write a given string to the application log. static void WriteLog(string matter); // Method to align two sequences using NeedlemanWunschAligner. public static IList AlignSequences( ISequence referenceSequence, ISequence querySequence) } class Program { static void Main(string[] args) { // TODO: Your Code Goes Here } // Exports a given sequence to a file in FastA format static void ExportFastA(ISequence sequence, string filename); // Parses a FastA file which has one or more sequences. static IList ParseFastA(string filename); // Write a given string to the application log. static void WriteLog(string matter); // Method to align two sequences using NeedlemanWunschAligner. public static IList AlignSequences( ISequence referenceSequence, ISequence querySequence) }
39
Contributing back to MBF MBF is an open source project ▫ Microsoft wants your ideas, contributions and feedback Useful to download the source code ▫ read through to get good coding ideas ▫ to extended or repurpose Consider contributing changes / features back to the project ▫ read MBF_Onboarding.doc or download the guide from http://mbf.codeplex.com/Project/Download/FileDownload.aspx?D ownloadId=112159 http://mbf.codeplex.com/Project/Download/FileDownload.aspx?D ownloadId=112159 All contributions must be released under Ms-PL license
40
Examining the Source Code Can retrieve source code from TFS repository [1] ▫ or as self-contained.zip file Solution MBI.sln contains all projects ▫ Bio – MBF core library ▫ Bio.Workflow ▫ WebServiceHandlers ▫ Unit tests ▫ Sample SDK code
41
Unit Testing MBF includes suite of unit tests for all components ▫ uses nUnit (www.nunit.org), included with distributionwww.nunit.org ▫ test cases are in separate unit-test assemblies ▫ any contributions to codebase must include unit tests
42
Running unit tests Running unit tests involves three steps 1.Execute nUnit.exe from public/ext/nunit/bin/net-2.0 2.Open the Bio.Tests.dll assembly with nUnit 3.Click the Run button to execute the unit test
43
Writing your own unit tests Unit tests are just blocks of code written to test other code ▫ tests assumptions, edge cases, error cases, and functionality nUnit makes writing test cases easy ▫ uses.NET attributes to signal intent ▫ Assert class provides helper methods to test assertions [TestFixture] public class ReverseTests { [Test] public void TestReverse() { Sequence sequence = new Sequence(Alphabets.DNA, "AGCT"); string reverse = sequence.Reverse.ToString(); Assert.AreEqual("TCGA", reverse); } } [TestFixture] public class ReverseTests { [Test] public void TestReverse() { Sequence sequence = new Sequence(Alphabets.DNA, "AGCT"); string reverse = sequence.Reverse.ToString(); Assert.AreEqual("TCGA", reverse); } } identifies this as a unit test class identifies this as a unit test method verifies the two strings are equal
44
Summary MBF framework is used to build bioinformatics applications ▫ open source ▫ highly extensible ▫ scalable ▫ flexible data architecture Based on.NET ▫ language agnostic ▫ allows any application style ▫ can use all the power and flexibility of.NET Sequences are the core concept in the framework ▫ contain sequence items [symbols] ▫ based on alphabets ▫ read/written using parsers/formatters ▫ passed as arguments and returned from algorithms
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.