Practically Genomic: A hands-on bioinformatics IAP
Course Materials:
Instructors: Paola Favaretto, Sebastian Hoersch, Charlie Whittaker and Courtney Crummett
KI for Integrative Cancer Research at MIT and MIT Libraries
Students - Wide range of experience levels
Unix account access information will be provided
Evaluations - Please send comments to

Turning Biologists into Bioinformaticists - A practical approach
The teaching material should:
- be modular and practical
- have obvious contextual relevance
- serve as readily accessible and easily used reference material
The students should:
- become aware of the contents of a basic bioinformatics toolkit
- learn how to find instructions covering tools and methods
- experiment with different methods covered in classes
- gain familiarity and comfort with command-line computing
Target audience: KI biologists

Turning Biologists into Bioinformaticists - A practical approach – the specifics
1. Theory - Core Bioinformatics Concepts: important principles required to use bioinformatics
2. Tools - A Basic Bioinformatics Toolkit: the software of bioinformatics
3. Tasks - Bioinformatics Methods: data analysis with bioinformatics
Under Development!

IAP 2012 Agenda (subject to change)
- Introduction
- Getting more from Excel
- Unix Introduction
- Next Generation Sequence Analysis with Unix and Galaxy
- Visualization and Analysis of Genomics Data
rous.mit.edu

Theory – Genomic Data
All kinds of genomics data are described using at least 4 pieces of information:
1) The name of a DNA sequence
2) A position on that sequence
3) A feature that exists at that position
4) Genome assembly version
Example record: Sequence (a chromosome), Position, Feature (a mutation)
Sequence 1 is a long block of sequence arranged by a process called genome assembly. This is critical because the first 3 pieces of information are only meaningful for one specific assembly version. A new version of the genome will probably not have this mutation at the same position; it would be located elsewhere.
File formats: BED, GFF, GTF

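The four pieces of information above can be sketched as a small Python record. This is a minimal illustration, not part of the course materials: the class name `GenomicFeature`, the helper names, and the assembly labels `hg19` and `hg38` used in the usage note are assumptions made for the example. BED lines do not store the assembly version, so the sketch takes it as a separate argument and refuses to compare coordinates from different assemblies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicFeature:
    """One genomic feature: sequence name, position, feature, assembly."""
    assembly: str  # genome assembly version (supplied by the caller)
    chrom: str     # name of the DNA sequence
    start: int     # 0-based start, as in BED
    end: int       # end position, exclusive, as in BED
    name: str      # the feature found at that position


def parse_bed_line(line, assembly):
    """Parse one tab-delimited BED line (at least 3 columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED lines need at least chrom, start and end")
    # The name column is optional in BED; use "." when it is absent.
    name = fields[3] if len(fields) > 3 else "."
    return GenomicFeature(assembly, fields[0], int(fields[1]), int(fields[2]), name)


def overlaps(a, b):
    """Return True if two features overlap on the same sequence.

    Coordinates are only comparable within one assembly version,
    so a mismatch is treated as an error rather than as 'no overlap'.
    """
    if a.assembly != b.assembly:
        raise ValueError("features come from different genome assemblies")
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end
```

For example, `parse_bed_line("chr1\t1000\t1001\tmutationA", "hg19")` yields a feature that can be compared with other `hg19` features, while comparing it with an `hg38` feature raises an error.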
Theory – Microarray Data
1. Target features created on a surface
2. Labeled material hybridized
3. Image analysis
Data layout: ProbeID, Sample1, Sample2, Sample3, Sample4 (one row per probe, e.g. 1007_s_at)
Used for:
- Gene expression analysis
- Polymorphism detection
- Copy number analysis
- DNA binding studies
Data is gathered about the features present on the array.

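A table with a ProbeID column followed by one column per sample can be read with a few lines of Python. The sketch below assumes a tab-delimited layout and numeric values; the function name `read_expression_table` and the numbers in the example are made up for illustration.

```python
import csv
import io
from statistics import mean


def read_expression_table(handle):
    """Read a tab-delimited probe-by-sample table.

    Returns the list of sample names and a dict mapping each
    probe ID to its list of values (one per sample).
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is the probe ID
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = [float(value) for value in row[1:]]
    return samples, table


# Hypothetical two-sample table; the values are invented.
example = "ProbeID\tSample1\tSample2\n1007_s_at\t10.0\t12.0\n"
samples, table = read_expression_table(io.StringIO(example))

# Average signal per probe across all samples.
probe_means = {probe: mean(values) for probe, values in table.items()}
```

The same function works on a real file by passing an open file handle instead of the `io.StringIO` object.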
Theory – Next Generation Sequencing (NGS)
1. Generate DNA fragments
2. Attach to surface and amplify in situ
3. Subject surface to cycles of imaging/chemistry
4. Image analysis to call base sequences and qualities
Used for:
- Gene expression analysis
- Polymorphism/Mutation detection
- Copy number analysis
- Mixture quantification
- DNA or RNA binding studies
- others…
200+ million clusters per experiment
Data is gathered about everything in the input mixture.

Theory – NGS Alignment Files
SAM Format
Columns: Query, Flag, Reference, Position, MapQual, CIGAR, Sequence, Base Quality
Example: 2:75:1538: chr M 4:31:101: chr M CACCTACTTGCCA################
- Each line has a lot of information (not all columns are shown)
- One experiment = millions of lines = many GB of data
- The scale of the data causes problems with Excel etc.

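Because SAM files are too large for a spreadsheet, they are usually processed one line at a time. The sketch below shows one way this might look in Python. It assumes the 11 mandatory tab-delimited SAM columns and the Phred+33 quality encoding used by SAM; the names `SamRecord`, `parse_sam_line`, `phred_scores` and `count_reads_per_reference` are invented for the example, as is the read in the usage lines.

```python
from collections import Counter, namedtuple

# The 11 mandatory SAM columns, in file order.
SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Split one alignment line into its mandatory fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("SAM alignment lines have at least 11 columns")
    return SamRecord(
        f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
        f[5], f[6], int(f[7]), int(f[8]), f[9], f[10],
    )


def phred_scores(qual):
    """Decode a SAM quality string (Phred+33) into integer scores."""
    if qual == "*":  # "*" means qualities were not stored
        return []
    return [ord(char) - 33 for char in qual]


def count_reads_per_reference(lines):
    """Count mapped reads per reference sequence, one line at a time."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        record = parse_sam_line(line)
        if record.flag & 0x4:
            continue  # flag bit 0x4 marks an unmapped read
        counts[record.rname] += 1
    return counts


# Hypothetical alignment line; all values are invented.
example = "read1\t0\tchrM\t100\t37\t13M\t*\t0\t0\tCACCTACTTGCCA\t#############\n"
record = parse_sam_line(example)
counts = count_reads_per_reference([example])
```

Passing an open file handle to `count_reads_per_reference` reads the file lazily, so memory use stays small even with millions of lines. A quality character of `#` decodes to a score of 2, which indicates a very low-confidence base call.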