Download presentation
Presentation is loading. Please wait.
Published byDinah Richard Modified over 9 years ago
1
The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. They have been developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include: Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University. Dr. Ricardo Gonzalez-Mendez, University of Puerto Rico Medical Sciences Campus. Dr. Alade Tokuta, North Carolina Central University. Dr. Jaime Seguel and Dr. Bienvenido Velez, University of Puerto Rico at Mayaguez. Dr. Satish Bhalla, Johnson C. Smith University. Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon University. Permission is granted for use, modify, and reproduce these materials for teaching purposes. These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center 1
2
This material is targeted towards students with a general background in Biology. It was developed to introduce biology students to the computational mathematical and biological issues surrounding bioinformatics. This specific lesson deals with the following fundamental topics: Essential computing for bioinformatics Computer Science track This material has been developed by: Dr. Hugh B. Nicholas, Jr. National Center for Biomedical Supercomputing Pittsburgh Supercomputing Center Carnegie Mellon University These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center 2
3
Essential Computing for Bioinformatics Course Development Overview Bienvenido Vélez UPR Mayaguez July, 2008
4
Outline Essential Computing for Bioinformatics Course Description Educational Objectives Major Course Modules Module Descriptions Status Bioinformatics Data Management Course Description Educational Objectives Major Course Modules Module Descriptions Status 4 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
5
Essential Computing for Bioinformatics Course Description This course provides a broad introductory discussion of essential computer science concepts that have wide applicability in the natural sciences. Particular emphasis will be placed on applications to Bioinformatics. The concepts will be motivated by practical problems arising from the use of bioinformatics research tools such as genetic sequence databases. Concepts will be discussed in a weekly lecture and will be practiced via simple programming exercises using Python, an easy to learn and widely available scripting language. 5 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
6
Educational Objectives Awareness of the mathematical models of computation and their fundamental limits Basic understanding of the inner workings of a computer system Ability to extract useful information from various bioinformatics data sources Ability to design computer programs in a modern high level language to analyze bioinformatics data. Experience with commonly used software development environments and operating systems Experience applying computer programming to solve bioinformatics problems 6 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
7
Major Course Modules ModuleLecture MARC Lecture First Steps in Computing: Course Overview1 Using Bioinformatics Data Sources2 Mathematical Computing Models35 High-level Programming (Python): Basics51 High-level Programming (Python): Flow Control62 High-level Programming (Python): Container Objects73 High-level Programming (Python): Files84 High-level Programming (Python): BioPython9 7 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
8
Main Advantages of Python Familiar to C/C++/C#/Java Programmers Very High Level Interpreted and Multi-platform Dynamic Object-Oriented Modular Strong string manipulation Lots of libraries available Runs everywhere Free and Open Source Track record in bioInformatics (BioPython) 8 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
9
Example A Codon -> AminoAcid Dictionary >> code = { ’ttt’: ’F’, ’tct’: ’S’, ’tat’: ’Y’, ’tgt’: ’C’,... ’ttc’: ’F’, ’tcc’: ’S’, ’tac’: ’Y’, ’tgc’: ’C’,... ’tta’: ’L’, ’tca’: ’S’, ’taa’: ’*’, ’tga’: ’*’,... ’ttg’: ’L’, ’tcg’: ’S’, ’tag’: ’*’, ’tgg’: ’W’,... ’ctt’: ’L’, ’cct’: ’P’, ’cat’: ’H’, ’cgt’: ’R’,... ’ctc’: ’L’, ’ccc’: ’P’, ’cac’: ’H’, ’cgc’: ’R’,... ’cta’: ’L’, ’cca’: ’P’, ’caa’: ’Q’, ’cga’: ’R’,... ’ctg’: ’L’, ’ccg’: ’P’, ’cag’: ’Q’, ’cgg’: ’R’,... ’att’: ’I’, ’act’: ’T’, ’aat’: ’N’, ’agt’: ’S’,... ’atc’: ’I’, ’acc’: ’T’, ’aac’: ’N’, ’agc’: ’S’,... ’ata’: ’I’, ’aca’: ’T’, ’aaa’: ’K’, ’aga’: ’R’,... ’atg’: ’M’, ’acg’: ’T’, ’aag’: ’K’, ’agg’: ’R’,... ’gtt’: ’V’, ’gct’: ’A’, ’gat’: ’D’, ’ggt’: ’G’,... ’gtc’: ’V’, ’gcc’: ’A’, ’gac’: ’D’, ’ggc’: ’G’,... ’gta’: ’V’, ’gca’: ’A’, ’gaa’: ’E’, ’gga’: ’G’,... ’gtg’: ’V’, ’gcg’: ’A’, ’gag’: ’E’, ’ggg’: ’G’...... } >> 9 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
10
Example A DNA Sequence >>> cds = "atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac cgttttatcgggcggggtaa" >>> 10 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
11
Example CDS Sequence -> Protein Sequence >>> def translate(cds, code):... prot = ""... for i in range(0,len(cds),3):... codon = cds[i:i+3]... prot = prot + code[codon]... return prot >>> translate(cds, code) ’MSERLSITPLGPYIGAQ*’ 11 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
12
Using BioInformatics Data Sources Goal:Basic Experience Searching Nuceotide Sequence Databases Searching Amino Acid Sequence Database Performing BLAST Searches Using Specialized Data Sources IDEA: How can we expedite data collection and analysis?... writing programs to automate parts of the process. 12 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center Reference: Bioinformatics for Dummies (Ch 1-4)
13
Mathematical Computing Goal:General Awareness What is Computing? Mathematical Models of Computing Finite Automata Turing Machines The Limits of Computation Church/Turing Thesis What is an Algorithm? Big O Notation 13 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
14
High-Level Programming (Python) Goal:Knowledge and Experience Downloading and Installing the Interpreter Command-line versus Batch Mode Values, Expressions and Naming Designing your own Functional Building Blocks Controlling the Flow of your Program String Manipulation (Sequence Processing) Container Data Structures File Manipulation 14 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center Reference: How to Think Like A Computer Scientist: Learning with Python
15
CS Fundamentals will be Interleaved Throughout the Course Information Representation and Encoding Computer Architecture Programming Language Translation Methods The Software Development Cycle Fundamental Principles of Software Engineering Basic Data Structures for Bioinformatics Design and Analysis of Bioinformatics Algorithms 15 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
16
Outline Essential Computing for Bioinformatics Course Description Educational Objectives Major Course Modules Module Descriptions Status Bioinformatics Data Management Course Description Educational Objectives Major Course Modules Module Descriptions Status 16 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
17
Essential Computing for Bioinformatics Course Description An introduction to the concepts of Data Base Management and Information Retrieval most commonly applicable to Bio-informatics database systems. The course is intended for Computer Science graduate students wishing to focus their academic careers in Bio-Informatics. The first half of the course discusses relational databases and the second half covers Information Retrieval systems. 17 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
18
Educational Objectives Understanding of the pros and cons of the different data models available to manage bioinformatics data Experience with the most widely used query models of bioinformatics data retrieval Understanding of the relational data base model and its query language Understanding of the vector and boolean query models for unstructured text retrieval Experience programming tools to move information across daa models in order to efficiently process it Experience with widely used database management systems and tools 18 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
19
Major Course Modules ModuleLecture Overview of Bioinformatics Data Models1 Using Bioinformatics Data Sources2 Fundamentals of Unstructured Text Retrieval3 Fundamentals of The Relational Data Model4 Structure Query Language (SQL)5 Querying DNA/Protein sequence databases6 Using highly-Specialized biological databases7 19 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
20
Relational Databases and SQL The Relational Model The SQL Relational Query Language Downloading and Installing MySQL Creating databases in MySQL + Access Importing data into MySQL + Access Creating Analysis Reports with iReports Exporting Data into Spreadsheets 20 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
21
Extracting Information from Bioinformatics Databases Manipulating sequences Bioinformatics database file formats Parsing Bioinformatics database files to focus on information of interest Exporting data into relational databases and other tools 21 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
22
22 Questions?
23
Accomplishments for Year 1 Studied alternative HLL languages Gained experience with Python Worked on slides for Mathematical Computing Reviewed existing beginning courses in computing for Bioinformatics using Python Revised course syllabus and curricular structure Offered the course for the first time in Spring 2007 23 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
24
Plan for Year 2 Offer the course a second time Fall 2007 Develop slides and lecture notes for remaining modules Develop Problem Sets and Solutions Study BioPython and integrate throughout the course Round 1 of Assessment 24 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.