Michael Schroeder BioTechnological Center TU Dresden Biotec Programming for Bioinformatics.


The module…
- will teach students basic programming skills relevant to bioinformatics, enabling them to actively develop bioinformatics tools.
- will take a problem-driven approach.
- will present bioinformatics problems and show how to solve them using existing online tools, and how to implement such tools.
- will revisit some of the problems and databases discussed in applied bioinformatics.
- will be a very practical, hands-on introduction to basic computer science tools such as command-line operating systems, programming in Python, and relational databases.

Objectives
- Students will have an understanding of different operating systems
- Students will be able to automate simple repetitive information retrieval tasks
- Students will be able to write simple programs in Python
- Students will be able to work with relational databases
- Students will appreciate the principles, limits, and possibilities of programming
- Students will be able to formulate biological questions as information processing problems
- Students will understand when and how programming can help to automate bioinformatics problems

Module Structure
- Introduction
- Databases
  - Introduction to SQL
  - A Little Exercise
  - A Little Science
- Introduction to Python
  - Data types and loops
  - Sequences and lists
  - Patterns and functions
  - Dictionaries
  - Advanced topics
- More Python
  - Dynamic programming
  - Clustering
- Revision Class

Books
- You will need two books for the module: a reference book on MySQL and a book on Python

Books: Python
- We will follow a number of online resources (see course web page)
- Further, we will refer to:
  - Python in a Nutshell, Alex Martelli, O’Reilly
  - Wesley Chun's Core Python Programming
  - Python Cookbook (O’Reilly)
- The publisher O’Reilly has many general programming books on Linux, Python, etc.
- They allow you to read all books online for free for 2 weeks. This is a good way to decide what to buy and what not.
- You can also buy electronic copies of the books.

Books: MySQL
- There are many, many books on MySQL
- The following two are just suggestions, as many other books cover the same material:
  - MySQL Cookbook by Paul DuBois, O'Reilly, or
  - MySQL by Paul DuBois, Michael Widenius, O'Reilly

Structure of Labs
- Databases
  - Lab 1, 2: Simple SQL
  - Lab 3, 4: SQL to answer interesting scientific questions
- Python
  - Lab 5: Data types and loops, accessing a DB from Python
  - Lab 6: Sequences and lists
  - Lab 7: Patterns and functions
  - Lab 8: Dictionaries
  - Lab 9: BioPython
  - Lab 10: Python & PyMOL
- More Python
  - Lab 11: Dynamic programming revisited
  - Lab 12: Clustering revisited
  - Lab 13: Revision

Assessment
- Lab exercises
  - Each week during the lab you get exercises, which you start during the lab and finish on your own during the week
  - These exercises need to be handed in on paper at the next lecture
  - Results are discussed during the labs, and as part of the assessment you will have to present a solution once
  - Doing the exercises is compulsory, but there are no marks
- Project
  - You will demonstrate your programming skills by implementing and presenting a software project
- Exam
  - Pen and paper exam on material covered in the lecture

Programming Project
- Goal: Demonstrate ability to use SQL and programming
- Goal 2: Produce a science movie for the Long Night of Science
- You will work in a team and get a biological problem.
- Part 1: Programming. You have to implement some workflows which integrate data from various sites and use various tools programmatically. This includes an animation of your target protein in PyMOL.
- Part 2: Make a movie. Tell the story of your protein based on the data collected and the analysis carried out. Create a storyboard and turn all material and PyMOL animations into a movie.

Motivation: Databases
- In the last term,
  - we accessed most information online via the web
  - we interacted directly and manually with databases and tools
  - we had to manually submit queries, interpret results, select interesting results, cut and paste them, and submit queries again, …
- Pro:
  - Reasonably easy to get hold of information
- Con:
  - Not possible to ask many queries
  - Queries limited by the interface provided by the web page
  - Difficult or impossible to integrate information from different sites
- In this term, we will look at the databases underlying the online front ends
  - How is the data stored internally?
  - How can we, and more importantly computer programs, interact directly with the underlying data, so that we can ask more powerful queries, ask large queries, and integrate different systems?

What actually happens
You are limited by what the web server allows you to ask. Example CATH: you can search by PDB ID, CATH code, or general text.
But you cannot ask:
- In how many different PDB structures is there a P-loop domain?
- Is there a PDB entry with a P-loop and a DNA-binding domain?
- How many different superfamilies does the largest structure in PDB have?
With direct access to the underlying database you could answer all these questions (and many more).

Motivation: SCOP as Relational Database
- We worked with SCOP, the Structural Classification of Proteins
- Family: >30% sequence identity
- Superfamily: similar structure and function (possibly lower than 30% sequence identity)
[Figure; labels: 30%; Family; Same superfamily, but not family]

Motivation: Databases
We wish to answer the following questions:
- How many families and superfamilies are there?
- Do all superfamilies have roughly the same number of families?
- How many families does the immunoglobulin superfamily have?
- Which superfamily has the most families, and how many?
- What percentage of superfamilies have only one family?
- Which PDB structure has the largest number of distinct superfamilies?
- What percentage of PDB structures have only one type of superfamily, and what percentage have at least two?
- Which is the most popular superfamily?
- Are all superfamilies equally likely to co-occur, or do they have preferences?
- Which superfamily has the most co-occurrence partners?
- Are the number of co-occurrence partners and the frequency of the superfamily correlated?

What is a Database
- SCOP contains relevant information, but we cannot answer the above questions through the web interface of SCOP
- The problem is that we do not have access to the underlying database
- What is a database anyway?
- A database provides…
  - Logical organization of data
    - data models, schema design, dictionaries
  - Physical organization of data
    - fast retrieval, indexing, compact storage of data

Relational Database
- Central idea: data as relations in a table
- E.g. Employee

| id | name | salary | role     |
|    | pete |        | director |
|    | jane |        | nurse    |
|    | asif |        | driver   |

Relational Database
- Central idea: data as relations in a table
- E.g. SCOP, Structural Classification of Proteins

| id | type | sccs  | sid     | description                          |
|    | cf   | a.1   | -       | Globin-like                          |
|    | sf   | a.1.1 | -       | Globin-like                          |
|    | fa   | a     | -       | Truncated hemoglobin                 |
|    | dm   | a     | -       | Truncated hemoglobin                 |
|    | sp   | a     | -       | Ciliate (Paramecium caudatum)        |
|    | px   | a     | d1dlwa_ | 1dlw A:                              |
|    | sp   | a     | -       | Green alga (Chlamydomonas eugametos) |
|    | px   | a     | d1dlya_ | 1dly A:                              |
|    | sp   | a     | -       | Mycobacterium tuberculosis           |
|    | px   | a     | d1idra_ | 1idr A:                              |

SCOP Tables

mysql> select * from cla limit 1;
| sid     | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px |
| d1dlwa_ | 1dlw   | a    |    |    |    |    |    |    |    |

mysql> select * from des limit 1;
| id | type | sccs | sid | description        |
|    | cl   | a    | -   | All alpha proteins |

mysql> select * from astral limit 1;
| sid     | sccs | seq                                                        |
| d1dlwa_ | a    | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg... |

mysql> select * from subchain limit 1;
| id | px | chain_id | begin | end |
| 1  |    | A        |       |     |

Querying Relational Databases
- SQL = Structured Query Language
    select   Which attributes?
    from     Which tables?
    where    Which conditions?
- Select … from … where …
- Distinct
- Like
- Union/intersect
- Join
- Count/average/sum/min/max
- Group by
- Having
- Show tables
- Show databases
- Use
- Create database
- Create table … as
- Drop table
- Load data
- Insert into

Databases Given SCOP as relational database, we can answer all the questions raised above using the SQL constructs of the previous slide!
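
As an illustration of how two of the questions above map onto "group by" and "count(distinct …)", here is a minimal sketch. It makes several assumptions that are not in the slides: it uses Python 3 with the standard-library sqlite3 module in place of MySQL, a cut-down cla table with only four of its columns, and invented toy rows (the sid, pdb_id, sf and fa values are made up). The function names are hypothetical.

```python
import sqlite3


def build_toy_cla():
    """Create an in-memory stand-in for the cla table (toy rows, not SCOP data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cla (sid TEXT, pdb_id TEXT, sf INTEGER, fa INTEGER)"
    )
    conn.executemany(
        "INSERT INTO cla VALUES (?, ?, ?, ?)",
        [
            ("d0001a_", "0001", 10, 100),
            ("d0002a_", "0002", 10, 101),
            ("d0002b_", "0002", 20, 200),
            ("d0003a_", "0003", 10, 100),
        ],
    )
    return conn


def families_per_superfamily(conn):
    """Which superfamily has the most families, and how many?"""
    query = """
        SELECT sf, COUNT(DISTINCT fa) AS n_families
        FROM cla
        GROUP BY sf
        ORDER BY n_families DESC, sf
    """
    return conn.execute(query).fetchall()


def pdb_with_most_superfamilies(conn):
    """Which PDB structure has the largest number of distinct superfamilies?"""
    query = """
        SELECT pdb_id, COUNT(DISTINCT sf) AS n_superfamilies
        FROM cla
        GROUP BY pdb_id
        ORDER BY n_superfamilies DESC, pdb_id
        LIMIT 1
    """
    return conn.execute(query).fetchone()


conn = build_toy_cla()
print(families_per_superfamily(conn))
print(pdb_with_most_superfamilies(conn))
```

The same two select statements should carry over to the MySQL prompt largely unchanged, provided the real cla table stores one identifier per level in its sf and fa columns as the column names suggest.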

Programming
- We will use Python (created by Guido van Rossum, named after Monty Python) as a convenient extension to the operating system
- Easy to write quick programs
- More than just a scripting language
- Interpreted, interactive, indented
- Supports string processing well
- Widely used in bioinformatics
- Object oriented, general purpose
- Many nice libraries for database access, graphics, Web, GUI, R, …
- Scientific orientation: Numerical Python (math), Scientific Python, Biopython
- Beware: pure Python code is comparatively slow, but computationally expensive parts can be included as C libraries

Motivation: Families and Identity
- We said that SCOP families share >30% identity
- What does that mean?
  - Do any two structures in a family share >30%?
  - Does each member have at least one other member in the family with >30%?
- What is the average sequence similarity within a family? Within a superfamily?
- Given a sequence whose superfamily is already known, can we find the family within that superfamily that best suits the sequence?

Two approaches: Blast vs. DIY
- We can answer the above easily:
- We use the SCOP database and run database queries from a Python script
- For a given superfamily, select all corresponding sequences from the astral table
- For all pairs of selected sequences
  - call Blast and record the sequence identity
  - or run your own dynamic programming algorithm and record the sequence identity
- For the second problem: compare the sequence to all family sequences and assign it to the family which shares the highest similarity with the sequence (must be >30%)
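
The DIY route can be sketched as follows. This is a minimal sketch under stated assumptions, not the course's implementation: it uses a plain Needleman-Wunsch global alignment with illustrative scores (match +1, mismatch -1, gap -1), defines identity as the number of matching columns divided by the alignment length, and all function names are hypothetical. It is written in Python 3 syntax.

```python
from itertools import combinations


def align(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def identity(a, b):
    """Fraction of alignment columns in which both sequences agree."""
    gapped_a, gapped_b = align(a, b)
    if not gapped_a:
        return 0.0
    same = sum(1 for x, y in zip(gapped_a, gapped_b) if x == y)
    return same / len(gapped_a)


def pairwise_identities(seqs):
    """Identity for every unordered pair of sequences, keyed by index pair."""
    return {
        (i, j): identity(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }


def assign_family(query, families, threshold=0.30):
    """Return the family whose best member exceeds the threshold, else None."""
    best_name, best_value = None, threshold
    for name, members in families.items():
        for member in members:
            value = identity(query, member)
            if value > best_value:
                best_name, best_value = name, value
    return best_name
```

Here families is assumed to be a dictionary mapping a family name to the list of its sequences, for example as selected from the astral table.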

Motivation: Sequence vs. Structure
- Can we verify the plot below?
- Can we create a similar plot for specific superfamilies, e.g. DNA-binding domains?
[Figure; labels: 30%; Family; Same superfamily, but not family]

Motivation: Sequence vs. Structure
Again: select the relevant sequences from the astral table. Besides computing the sequence identity, we compute the structural similarity to the relevant structure using an algorithm like Dali or CE.
Then plot the two similarities against each other in a scatter plot.

Motivation: Amino Acid Composition of Families
- Can we characterise the amino acid composition of different families/superfamilies?
- Again: select the relevant sequences from astral and count the frequencies of amino acids
- Is the amino acid composition at the interface of a domain different from the rest of the domain?
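
Counting amino acid frequencies over a set of sequences could look like the sketch below (Python 3 syntax). The function name is hypothetical, and the sequences are assumed to have been selected from the astral table already; the sketch only does the counting.

```python
from collections import Counter


def composition(sequences):
    """Relative frequency of each residue letter across all given sequences."""
    counts = Counter()
    for seq in sequences:
        # The astral sequences in the slides are lower case; normalise the case.
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {residue: n / total for residue, n in counts.items()}
```

Comparing the dictionaries returned for two families, or for interface residues versus the rest of a domain, then amounts to comparing the values stored under the same keys.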

Motivation: Let’s rebuild SCOP families
- Given a SCOP superfamily and its sequences, how can we divide it into families?
- First, we need dynamic programming to determine the sequence similarity
- Then we do the following:
  - For all pairs of sequences, call the sequence similarity algorithm and record the similarity in a distance matrix
  - Next, run hierarchical clustering to cluster the sequences
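
The two steps can be sketched as below (Python 3 syntax). The assumptions are considerable: simple_identity is a gap-free stand-in for a real dynamic programming alignment, distance is taken as 1 minus identity, and the clustering is a naive single-linkage scheme that stops at a cutoff. The slides do not say which linkage or cutoff the course uses, and all names are hypothetical.

```python
from itertools import combinations


def simple_identity(a, b):
    """Stand-in similarity: identical positions without gaps, over the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / longest


def distance_matrix(seqs, similarity=simple_identity):
    """Symmetric matrix of distances, with distance = 1 - similarity."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = 1.0 - similarity(seqs[i], seqs[j])
    return dist


def single_linkage(dist, cutoff):
    """Merge the two closest clusters repeatedly until none is within the cutoff."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance is the closest pair of members.
            d = min(dist[i][j] for i in clusters[x] for j in clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        d, x, y = best
        if d > cutoff:
            break
        clusters[x] = sorted(clusters[x] + clusters[y])
        del clusters[y]
    return clusters


seqs = ["acgt", "acga", "tttt", "ttta"]
print(single_linkage(distance_matrix(seqs), cutoff=0.7))
```

A cutoff of 0.7 corresponds to the 30% identity mentioned for families. Any other similarity function, such as one based on alignment, can be passed in through the similarity parameter.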

What’s needed…
- …programming in Python

Python Programming Constructs
- Variables, strings
- For/while loops
- If statements
- File I/O
- Regular expressions
- Data structures: lists, hashes (dictionaries)
- Code structure: objects, classes, modules

Hello World in Python
Given a file helloworld.py, open a shell and type at the command prompt:

    helloworld.py

- The shell then executes your programme
- From the first line of the file, it realises that the Python interpreter needs to be loaded and that what follows is a Python program
- The line below prints a message

File: helloworld.py

    print "Hello World"

Read a text file in Python
- The command open opens a text file and creates a file object
  - "r" as second argument after the filename indicates that the file is read (this is the default, i.e. it can be left out)
  - "w" as second argument indicates that the file is written to
  - "a" as second argument indicates that the file is appended to
- The for-loop reads all lines of the file one by one (requires Python >2.2)
- The body of the loop prints them on the screen (note that print adds a new line automatically; avoid that by adding a trailing ",")

File: seq.txt

    acgt
    gggt

File: fileIO.py

    data = open("seq.txt", "r")
    for line in data:
        print "Line:", line,

Output

    Line: acgt
    Line: gggt

Variables in Python
- The = symbol is used to assign values to variables
- The + symbol is also used to concatenate strings (a number such as lineNo must be converted with str() first)

File: seq.txt

    acgt
    gggt

File: fileIO.py

    lineNo = 1
    for line in open("seq.txt"):
        print str(lineNo) + ": " + line,
        lineNo = lineNo + 1

Output

    1: acgt
    2: gggt
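
In Python, + joins two strings but not a string and a number, so a line counter has to be converted with str() before concatenation. The sketch below shows the same numbering in Python 3 syntax (print as a function), using the built-in enumerate in place of a hand-maintained counter. The function name is hypothetical, and io.StringIO stands in for open("seq.txt") so that the sketch needs no file on disk.

```python
import io


def number_lines(lines):
    """Return each line prefixed with its 1-based line number."""
    numbered = []
    for line_no, line in enumerate(lines, start=1):
        # str() turns the integer counter into text so that + can concatenate.
        numbered.append(str(line_no) + ": " + line.rstrip("\n"))
    return numbered


# Stand-in for the two-line file seq.txt used in the slides.
data = io.StringIO("acgt\ngggt\n")
for text in number_lines(data):
    print(text)
```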

If-then-else and strings in Python

File: seq.txt

    acgt
    gggt

File: seqcomp.py

    data = open("seq.txt")
    line1 = data.readline().rstrip()
    line2 = data.readline().rstrip()
    len1 = len(line1)
    len2 = len(line2)
    if len1 < len2:
        minLen = len1
    else:
        minLen = len2
    line3 = ""
    for i in range(minLen):
        if line1[i] == line2[i]:
            line3 = line3 + "*"
        else:
            line3 = line3 + " "
    print "Sequence comparison"
    print line1
    print line2
    print line3

Output

    Sequence comparison
    acgt
    gggt
      **
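
The same position-by-position comparison can be packaged as a function. This is a minimal sketch in Python 3 syntax with a hypothetical function name; it relies on the built-in zip, which stops at the end of the shorter sequence and so takes over the role of minLen.

```python
def match_line(seq1, seq2):
    """Return '*' where the two sequences agree and ' ' where they differ."""
    return "".join("*" if x == y else " " for x, y in zip(seq1, seq2))


print("Sequence comparison")
print("acgt")
print("gggt")
print(match_line("acgt", "gggt"))
```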

Programming Example