Bioinformatics For MNW 2 nd Year Jaap Heringa FEW/FALW Integrative Bioinformatics Institute VU (IBIVU)


Current Bioinformatics Unit Jens Kleinjung (1/11/02) Victor Simossis – PhD (1/12/02) Radek Szklarczyk – PhD (1/01/03) John Romein (1/12/02, Henri Bal)

Bioinformatics course 2nd year MNW spring 2003 Pattern recognition –Supervised/unsupervised learning –Types of data, data normalisation, lacking data –Search image –Similarity tables –Clustering –Principal component analysis –Discriminant analysis

Bioinformatics course 2nd year MNW spring 2003 Protein –Folding –Structure and function –Protein structure prediction –Secondary structure –Tertiary structure –Function –Post-translational modification –Prot.-Prot. Interaction -- Docking algorithm –Molecular dynamics/Monte Carlo

Bioinformatics course 2nd year MNW spring 2003 Sequence analysis –Pairwise alignment –Dynamic programming (NW, SW, shortcuts) –Multiple alignment –Combining information –Database/homology searching (Fasta, Blast, Statistical issues-E/P values)

Bioinformatics course 2nd year MNW spring 2003 Gene structure and gene finding algorithm Omics –DNA makes RNA makes protein –Expression data, Nucleus to ribosome, translation, etc. –Metabolomics –Physiomics –Databases DNA, EST Protein sequence Protein structure

Bioinformatics course 2nd year MNW spring 2003 –Microarray data –Protein structure (PDB) –Proteomics –Mass spectrometry/NMR/X-ray?

Bioinformatics course 2nd year MNW spring 2003 Bioinformatics method development IPR issues Programming and scripting languages Web solutions Computational issues –NP-complete problems –CPU, memory, storage problems –Parallel computing Bioinformatics method usage/application Molecular viewers (RasMol, MolMol, etc.)

Gathering knowledge Anatomy, architecture Dynamics, mechanics Informatics (Cybernetics – Wiener, 1948) (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems) Genomics, bioinformatics Rembrandt, 1632 Newton, 1726

Mathematics Statistics Computer Science Informatics Biology Molecular biology Medicine Chemistry Physics Bioinformatics

“Studying informational processes in biological systems” (Hogeweg, early 1970s) No computers necessary Back of envelope OK Applying algorithms with mathematical formalisms in biology (genomics) -- USA “Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)

Bioinformatics in the olden days Close to Molecular Biology: –(Statistical) analysis of protein and nucleotide structure –Protein folding problem –Protein-protein and protein-nucleotide interaction Many essential methods were created early on (BG era) –Protein sequence analysis (pairwise and multiple alignment) –Protein structure prediction (secondary, tertiary structure)

Bioinformatics in the olden days (Cont.) Evolution was studied and methods created –Phylogenetic reconstruction (clustering – NJ method)

The Human Genome June 2000

Dr. Craig Venter Celera Genomics -- Shotgun method Sir John Sulston Human Genome Project

Human DNA There are about 3bn ($3 \times 10^{9}$) nucleotides in the nucleus of almost all of the trillions ($3.5 \times 10^{12}$) of cells of a human body (an exception is, for example, red blood cells, which have no nucleus and therefore no DNA) – a total of ~$10^{22}$ nucleotides! Many DNA regions code for proteins, and are called genes (1 gene codes for 1 protein in principle). Human DNA contains ~30,000 expressed genes. Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). These nucleotides are sometimes also called bases.

Human DNA (Cont.) All people are different, but the DNA of different people only varies by 0.2% or less. So, only 2 letters in 1000 are expected to be different. Over the whole genome of about 3 billion letters, this means that up to about 6 million letters would differ between individuals. The structure of DNA is the so-called double helix, discovered by Watson and Crick in 1953, where the two strands are cross-linked by A-T and C-G base-pairs (nucleotide pairs – so-called Watson-Crick base pairing).

Up to here 3/2

DNA compositional biases Base composition of genomes: E. coli: 25% A, 25% C, 25% G, 25% T P. falciparum (Malaria parasite): 82%A+T Translation initiation: ATG is the near universal motif indicating the start of translation in DNA coding sequence.

Some facts about human genes Comprise about 3% of the genome Average gene length: ~ 8,000 bp Average of 5-6 exons/gene Average exon length: ~200 bp Average intron length: ~2,000 bp ~8% genes have a single exon Some exons can be as small as 1 or 3 bp. HUMFMR1S is not atypical: 17 exons bp long, comprising 3% of a 67,000 bp gene

Genetic diseases Many diseases run in families and are a result of genes which predispose such family members to these illnesses Examples are Alzheimer’s disease, cystic fibrosis (CF), breast or colon cancer, or heart diseases. Some of these diseases can be caused by a problem within a single gene, such as with CF.

Genetic diseases (Cont.) For other illnesses, like heart disease, at least genes are thought to play a part, and it is still unknown which combination of problems within which genes is responsible. A “problem” within a gene means that a single nucleotide, or a combination of nucleotides, within the gene causes the disease (or prevents the body from fighting the disease sufficiently). Persons with different combinations of these nucleotides could then be unaffected by these diseases.

Genetic diseases (Cont.) Cystic Fibrosis Known since very early on (“Celtic gene”) Inherited autosomal recessive condition (Chr. 7) Symptoms: –Clogging and infection of lungs (early death) –Intestinal obstruction –Reduced fertility and (male) anatomical anomalies CF gene CFTR has 3-bp deletion leading to Del508 (Phe) in 1480 aa protein (epithelial Cl⁻ channel) – protein degraded in ER instead of inserted into cell membrane

Genomic Data Sources DNA/protein sequence Expression (microarray) Proteome (xray, NMR, mass spectrometry) Metabolome Physiome (spatial, temporal) Integrative bioinformatics

Dinner discussion: Integrative Bioinformatics & Genomics VU metabolome proteome genome transcriptome physiome Genomic Data Sources Vertical Genomics

A gene codes for a protein
DNA: CCTGAGCCAACTATTGATGAA
(transcription)
mRNA: CCUGAGCCAACUAUUGAUGAA
(translation)
Protein: PEPTIDE
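The slide's example can be reproduced with a few lines of Python. This is a minimal sketch: the names `transcribe`, `translate` and `CODON_TABLE` are illustrative, the table is the standard genetic code, and "transcription" is simplified to copying the coding strand with T replaced by U.

```python
# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Simplified transcription: copy the coding strand, replacing T by U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate an mRNA in reading frame 0, stopping at the first stop codon."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna, translate(mrna))
```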

Humans have spliced genes…

DNA makes RNA makes Protein

Remark The problem of identifying (annotating) human genes is considerably harder than the early success story for β-globin might suggest. The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron. The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp.

DNA makes RNA makes Protein: Expression data More copies of mRNA for a gene lead to more protein. mRNA can now be measured for all the genes in a cell at once through microarray technology. Can have 60,000 spots (genes) on a single gene chip. Colour change gives intensity of gene expression (over- or under-expression).

Metabolic networks Glycolysis and Gluconeogenesis Kegg database (Japan)

High-throughput Biological Data Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming –genomic sequences –gene expression data –mass spec. data –protein-protein interaction –protein structures –......

Protein structural data explosion Protein Data Bank (PDB): Structures (6 March 2001) x-ray crystallography, 1810 NMR, 278 theoretical models, others...

Dickerson’s formula: equivalent to Moore’s law On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)! $n = e^{0.19(y-1960)}$ with y the year.

Sequence versus structural data Despite structural genomics efforts, growth of PDB slowed down in (i.e did not keep up with Dickerson’s formula) More than 100 completely sequenced genomes Increasing gap between structural and sequence data

Bioinformatics [Figure: disciplines arranged on a scale from large, external (integrative) to small, internal (individual), under the headings Science and Human: Planetary Science, Cultural Anthropology, Population Biology, Sociology, Sociobiology, Psychology, Systems Biology, Biology, Medicine, Molecular Biology, Chemistry, Physics.]

Offers an ever more essential input to –Molecular Biology –Pharmacology (drug design) –Agriculture –Biotechnology –Clinical medicine –Anthropology –Forensic science –Chemical industries (detergent industries, etc.)

High-throughput Biological Data The data deluge Hidden in these data is information that reflects –existence, organization, activity, functionality …… of biological machineries at different levels in living organisms Most effectively utilising this information will prove to be essential for Integrative Bioinformatics

Data Issues …… Data collection: getting the data Data representation: data standards, data normalisation ….. Data organisation and storage: database issues ….. Data analysis and data mining: discovering “knowledge”, patterns/signals, from data, establishing associations among data patterns Data utilisation and application: from data patterns/signals to models for bio-machineries Data visualization: viewing complex data …… Data transmission: data collection, retrieval, ….. ……

Up to here 5/2

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky) “Nothing in bioinformatics makes sense except in the light of Biology” Bioinformatics

Pair-wise alignment Combinatorial explosion
- 1 gap in 1 sequence: n+1 possibilities
- 2 gaps in 1 sequence: (n+1)n
- 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^{2}} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: ~$10^{88}$ alignments
2 sequences of 1000 a.a.: ~ alignments!
Example alignment:
TDWVTALK
TDWL--IK

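Dynamic programming Scoring alignments $S_{a,b} = \sum_{l} s(a_l, b_l) + \sum_{k} gp(k)$, where the first sum runs over the aligned residue pairs and the second over the gaps (penalties enter as negative contributions). $gp(k) = p_i + k \cdot p_e$ (affine gap penalties), where $p_i$ and $p_e$ are the penalties for gap initialisation and extension, respectively, and k is the gap length.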

Dynamic programming Scoring alignments 101 Amino Acid Exchange Matrix ($20 \times 20$) Gap penalties (open, extension)
TDWVTALK
TDWL--IK
Score: $s(T,T)+s(D,D)+s(W,W)+s(V,L)+P_o+2P_x+s(L,I)+s(K,K)$
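As an illustration of this scoring scheme, the sketch below scores a given alignment with an exchange function and affine gap penalties. The names `score_alignment` and `toy_exchange` are hypothetical, the toy exchange values (+2 for identity, -1 otherwise) stand in for a real 20 × 20 amino acid exchange matrix, and the penalties are passed as positive numbers that are subtracted from the score.

```python
from itertools import groupby


def toy_exchange(a, b):
    """Toy stand-in for an amino acid exchange matrix (assumed values)."""
    return 2 if a == b else -1


def column_kind(column):
    """Classify one alignment column as a residue pair or a gap in one row."""
    x, y = column
    if x == "-" and y == "-":
        raise ValueError("a column cannot contain two gap characters")
    if x == "-":
        return "gap_in_a"
    if y == "-":
        return "gap_in_b"
    return "pair"


def score_alignment(row_a, row_b, exchange, p_init, p_ext):
    """Sum exchange scores over residue pairs and subtract an affine
    penalty gp(k) = p_init + k * p_ext for every gap of length k."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for kind, columns in groupby(zip(row_a, row_b), key=column_kind):
        columns = list(columns)
        if kind == "pair":
            total += sum(exchange(x, y) for x, y in columns)
        else:
            total -= p_init + len(columns) * p_ext
    return total


print(score_alignment("TDWVTALK", "TDWL--IK", toy_exchange, p_init=10, p_ext=1))
```

With these assumed values the example alignment has four identities, two mismatches and one gap of length 2.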

Pairwise sequence alignment Global dynamic programming
Input: sequences MDAGSTVILCFVG and MDAASTILCGS, Amino Acid Exchange Matrix, Gap penalties (open, extension)
Search matrix
Output alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
Evolution

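Global dynamic programming
$$S_{i,j} = s_{i,j} + \max \begin{cases} \max_{0<x<i-1} \{ S_{x,j-1} - P_{\mathrm{i}} - (i-x-1)P_{\mathrm{x}} \} \\ S_{i-1,j-1} \\ \max_{0<y<j-1} \{ S_{i-1,y} - P_{\mathrm{i}} - (j-y-1)P_{\mathrm{x}} \} \end{cases}$$
where $s_{i,j}$ is the exchange score for residues i and j, and $P_{\mathrm{i}}$ and $P_{\mathrm{x}}$ are the gap initialisation and extension penalties.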
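A literal, cubic-time transcription of this recurrence might look as follows. It is a sketch under stated assumptions, not an optimised implementation: the name `global_dp` and the toy scoring function are hypothetical, the boundary row and column are set to zero (end gaps are not penalised), and the final score is taken as the maximum over the last row and last column.

```python
def toy_score(a, b):
    """Toy exchange score (assumed values): +2 for identity, -1 otherwise."""
    return 2 if a == b else -1


def global_dp(a, b, score, p_init, p_ext):
    """Fill S[i][j], the best score of an alignment ending with a[i-1]
    matched to b[j-1], and return the best end-point score."""
    n, m = len(a), len(b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]  # zero boundary: free end gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]            # no gap before this pair
            for x in range(1, i - 1):         # gap of length i-x-1 skipping residues of a
                best = max(best, S[x][j - 1] - p_init - (i - x - 1) * p_ext)
            for y in range(1, j - 1):         # gap of length j-y-1 skipping residues of b
                best = max(best, S[i - 1][y] - p_init - (j - y - 1) * p_ext)
            S[i][j] = score(a[i - 1], b[j - 1]) + best
    last_row = S[n][1:]
    last_column = [row[m] for row in S[1:]]
    return max(last_row + last_column)


print(global_dp("MDAGSTVILCFVG", "MDAASTILCGS", toy_score, p_init=1, p_ext=0.5))
```

Production implementations usually avoid the inner loops by tracking the best gap-ending score per row and column, which brings the cost down to quadratic time.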

Global dynamic programming

Up to here 17/02/03

Local dynamic programming (Smith & Waterman, 1981)
Input: sequences LCFVMLAGSTVIVGTR and EDASTILCGS, Amino Acid Exchange Matrix (with negative numbers), Gap penalties (open, extension)
Search matrix
Output local alignment:
AGSTVIVG
A-STILCG

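Local dynamic programming (Smith & Waterman, 1981)
$$S_{i,j} = \max \begin{cases} s_{i,j} + \max_{0<x<i-1} \{ S_{x,j-1} - P_{\mathrm{i}} - (i-x-1)P_{\mathrm{x}} \} \\ s_{i,j} + S_{i-1,j-1} \\ s_{i,j} + \max_{0<y<j-1} \{ S_{i-1,y} - P_{\mathrm{i}} - (j-y-1)P_{\mathrm{x}} \} \\ 0 \end{cases}$$
where $s_{i,j}$ is the exchange score for residues i and j, and $P_{\mathrm{i}}$ and $P_{\mathrm{x}}$ are the gap initialisation and extension penalties.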
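The local variant differs from the global recurrence only in the extra option 0, which lets an alignment restart anywhere, and in reading the result from the highest cell of the matrix. A minimal sketch, assuming a toy exchange score with negative values for mismatches (the names `local_dp` and `toy_score` are hypothetical):

```python
def toy_score(a, b):
    """Toy exchange score (assumed values); mismatches must score below zero."""
    return 2 if a == b else -1


def local_dp(a, b, score, p_init, p_ext):
    """Return the score of the best local alignment between a and b."""
    n, m = len(a), len(b)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    best_overall = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pred = S[i - 1][j - 1]
            for x in range(1, i - 1):
                pred = max(pred, S[x][j - 1] - p_init - (i - x - 1) * p_ext)
            for y in range(1, j - 1):
                pred = max(pred, S[i - 1][y] - p_init - (j - y - 1) * p_ext)
            # The 0 option discards any prefix that would lower the score.
            S[i][j] = max(0.0, score(a[i - 1], b[j - 1]) + pred)
            best_overall = max(best_overall, S[i][j])
    return best_overall


print(local_dp("LCFVMLAGSTVIVGTR", "EDASTILCGS", toy_score, p_init=2, p_ext=1))
```

Recovering the aligned segments themselves would additionally need a traceback from the highest-scoring cell, which is omitted here.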

Local dynamic programming

Sequence database searching – Homology searching DP too slow for repeated database searches
Fast heuristics: FASTA, BLAST and PSI-BLAST, QUEST
Hidden Markov modelling: HMMER, SAM-T98

FASTA Compares a given query sequence with a library of sequences and calculates for each pair the highest scoring local alignment Speed is obtained by delaying application of the dynamic programming technique to the moment where the most similar segments are already identified by faster and less sensitive techniques FASTA routine operates in four steps:

FASTA Operates in four steps:
1. Rapid searches for identical words of a user-specified length occurring in query and database sequence(s) (Wilbur and Lipman, 1983, 1984). For each target sequence the 10 regions with the highest density of ungapped common words are determined.
2. These 10 regions are rescored using the Dayhoff PAM-250 residue exchange matrix (Dayhoff et al., 1983) and the best scoring region of the 10 is reported under init1 in the FASTA output.
3. Regions scoring higher than a threshold value and being sufficiently near each other in the sequence are joined, now allowing gaps. The highest score of these new fragments can be found under initn in the FASTA output.
4. A full dynamic programming alignment (Chao et al., 1992) is performed over the final region, which is widened by 32 residues at either side; its score is written under opt in the FASTA output.

FASTA output example
DE METAL RESISTANCE PROTEIN YCF1 (YEAST CADMIUM FACTOR 1)....
SCORES Init1: 161 Initn: 161 Opt: 162 z-score: E(): 3.4e-06
Smith-Waterman score: 162; 35.1% identity in 57 aa overlap
test.seq MQRSPLEKASVVSKLFFSWTRPILRKGYRQRLE
:| :|::| |:::||:|||::|: |
YCFI_YEAST CASILLLEALPKKPLMPHQHIHQTLTRRKPNPYDSANIFSRITFSWMSGLMKTGYEKYLV
test.seq LSDIYQIPSVDSADNLSEKLEREWDRE
:|:|::| |:::||:|||::|: |
YCFI_YEAST EADLYKLPRNFSSEELSQKLEKNWENELKQKSNPSLSWAICRTFGSKMLLAAFFKAIHDV

FASTA (1) Rapid identical word searches: Searching for k-tuples of a certain size within a specified bandwidth along search matrix diagonals. For not-too-distant sequences (> 35% residue identity), little sensitivity is lost while speed is greatly increased. Technique employed is known as hash coding or hashing: a lookup table is constructed for all words in the query sequence, which is then used to compare all encountered words in each database sequence.
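The lookup-table idea can be sketched as follows: all k-tuples of the query are hashed once, and each database sequence is then scanned word by word, counting identical words per search-matrix diagonal. The function names are illustrative, and the sketch counts plain word hits; it does not reproduce FASTA's bandwidth restriction or its region scoring.

```python
from collections import Counter, defaultdict


def build_lookup(query, k):
    """Hash every k-tuple of the query to the list of positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(query, target, k):
    """Count identical k-tuples per diagonal (target position minus query position)."""
    table = build_lookup(query, k)
    hits = Counter()
    for j in range(len(target) - k + 1):
        for i in table.get(target[j:j + k], ()):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("ACDEFG", "XXACDEFGXX", k=2)
print(hits.most_common(1))  # the diagonal with the highest density of common words
```

Diagonals with many hits correspond to the ungapped regions that the later FASTA steps rescore.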

FASTA The k-tuple length is user-defined and is usually 1 or 2 for protein sequences (i.e. either the positions of each of the individual 20 amino acids or the positions of each of the 400 possible dipeptides are located). For nucleic acid sequences, the k-tuple is 5-20, and should be longer because short k-tuples are much more common due to the 4 letter alphabet of nucleic acids. The larger the k-tuple chosen, the more rapid, but less thorough, the database search.

BLAST
blastp compares an amino acid query sequence against a protein sequence database
blastn compares a nucleotide query sequence against a nucleotide sequence database
blastx compares the six-frame conceptual protein translation products of a nucleotide query sequence against a protein sequence database
tblastn compares a protein query sequence against a nucleotide sequence database translated in six reading frames
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database

BLAST For each tripeptide in the query sequence, a table of similar tripeptides is derived; their number is only a fraction of the total number possible. Using these tables of similar peptides, BLAST quickly scans a database of protein sequences for ungapped regions showing high similarity, which are called high-scoring segment pairs (HSPs). The initial search is done for a word of length W that scores at least the threshold value T when compared to the query using a substitution matrix. Word hits are then extended in either direction, for as long as the cumulative alignment score can be increased, in an attempt to generate an alignment with a score exceeding the threshold S.
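The table of similar words for one query word can be sketched by brute force: enumerate all words of length W over the alphabet and keep those scoring at least T against the query word. The function name `neighbourhood`, the three-letter alphabet and the toy scores are assumptions for illustration; a real search would use the 20 amino acids and a substitution matrix.

```python
from itertools import product


def toy_score(a, b):
    """Toy substitution score (assumed values)."""
    return 2 if a == b else -1


def neighbourhood(word, alphabet, score, threshold):
    """Return every word of the same length whose summed pairwise score
    against the query word is at least the threshold T."""
    similar = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            similar.append("".join(candidate))
    return similar


print(neighbourhood("ACD", "ACD", toy_score, threshold=3))
```

Raising T shrinks the table, which speeds up the scan at the cost of sensitivity.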

BLAST Extension of the word hits in each direction is halted:
–when the cumulative alignment score falls off by the quantity X from its maximum achieved value
–when the cumulative score goes to zero or below due to the accumulation of one or more negative-scoring residue alignments
–upon reaching the end of either sequence
The T parameter is the most important for the speed and sensitivity of the search resulting in the high-scoring segment pairs. A Maximal-scoring Segment Pair (MSP) is defined as the highest scoring of all possible segment pairs produced from two sequences.
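The three stopping conditions translate into a short ungapped extension routine. This is a sketch only: `ungapped_extend` and `toy_score` are hypothetical names, the scores are toy values, and the word hit is extended to the right first and then to the left, each time keeping the best cumulative score seen.

```python
def toy_score(a, b):
    """Toy substitution score (assumed values)."""
    return 2 if a == b else -1


def ungapped_extend(query, target, q_start, t_start, w, score, x_drop):
    """Extend a word hit of length w without gaps.
    Returns (best score, (start, end)) with the segment given in query coordinates."""

    def one_direction(pairs, start_score):
        best = running = start_score
        best_steps = 0
        for steps, (a, b) in enumerate(pairs, start=1):
            running += score(a, b)
            if running > best:
                best, best_steps = running, steps
            # Stop on a drop of X below the maximum, or a score of zero or below.
            if best - running >= x_drop or running <= 0:
                break
        return best, best_steps  # the loop also ends at the end of either sequence

    seed = sum(score(query[q_start + k], target[t_start + k]) for k in range(w))
    right = zip(query[q_start + w:], target[t_start + w:])
    best, right_steps = one_direction(right, seed)
    left = zip(reversed(query[:q_start]), reversed(target[:t_start]))
    best, left_steps = one_direction(left, best)
    return best, (q_start - left_steps, q_start + w + right_steps)


score, (start, end) = ungapped_extend("KKACDEFGKK", "RRACDEFGRR", 3, 3, 3, toy_score, 3)
print(score, "KKACDEFGKK"[start:end])
```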

PSI-BLAST Query sequences are first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition likely to lead to spurious hits; these are excluded from alignment. The program then initially operates on a single query sequence by performing a gapped BLAST search. Then, the program takes the significant local alignments found, constructs a multiple alignment and abstracts a position-specific scoring matrix (PSSM) from this alignment. The database is rescanned in a subsequent round to find more homologous sequences. Iteration continues until the user decides to stop or the search has converged.
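How a PSSM can be abstracted from a multiple alignment is sketched below. This is a deliberately simple log-odds scheme, not PSI-BLAST's actual weighting: it assumes a uniform background distribution, a fixed pseudocount per residue and no sequence weighting, and the name `build_pssm` and the four-letter alphabet are illustrative.

```python
import math
from collections import Counter


def build_pssm(alignment, alphabet, pseudocount=1.0):
    """Return one dict per alignment column, mapping each residue to
    log2(observed frequency / background frequency)."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    pssm = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            residue: math.log2(((counts[residue] + pseudocount) / total) / background)
            for residue in alphabet
        })
    return pssm


def score_segment(pssm, segment):
    """Score an ungapped segment of the same length as the PSSM."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


pssm = build_pssm(["ACD", "ACD", "AYD"], alphabet="ACDY")
print(round(score_segment(pssm, "ACD"), 2), round(score_segment(pssm, "YYY"), 2))
```

Gap characters in a column are simply skipped here; conserved columns give strongly positive scores for the conserved residue and negative scores for the rest.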

PSI-BLAST iteration [Figure: the query sequence Q is used in a gapped BLAST search; the database hits are turned into a PSSM (columns A, C, D, .., Y, with gap penalties Pi and Px), which replaces the query in the next gapped BLAST search and yields new database hits.]

PSI-BLAST output example

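Multiple alignment profiles (Gribskov et al.) [Figure: profile with one row per alignment position i, one column per amino acid (A, C, D, ..., W, Y) and a column of gap penalties.] Position dependent gap penalties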

Normalised sequence similarity The p-value is defined as the probability of seeing at least one unrelated score S greater than or equal to a given score x in a database search over n sequences. This probability follows the Poisson distribution (Waterman and Vingron, 1994): $P(x, n) = 1 - e^{-n \cdot P(S \ge x)}$, where n is the number of sequences in the database. Depending on x and n (fixed).

Normalised sequence similarity Statistical significance The E-value is defined as the expected number of non-homologous sequences with score greater than or equal to a score x in a database of n sequences: $E(x, n) = n \cdot P(S \ge x)$. If the E-value = 0.01, then the expected number of random hits with score $S \ge x$ is 0.01, which means that such a hit is expected by chance only once in 100 independent searches over the database. If the E-value of a hit is 5, then five fortuitous hits with $S \ge x$ are expected within a single database search, which renders the hit not significant.
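The two definitions are linked by P = 1 - e^(-E), which the following sketch makes explicit. The function names and the example numbers (a database of 1000 sequences and per-sequence tail probabilities) are assumptions chosen to reproduce the two E-values discussed on the slide.

```python
import math


def e_value(n, p_single):
    """Expected number of unrelated hits with S >= x in a database of n sequences."""
    return n * p_single


def p_value(n, p_single):
    """Probability of at least one unrelated hit with S >= x (Poisson assumption)."""
    return -math.expm1(-e_value(n, p_single))  # equals 1 - exp(-E)


for p_single in (1e-5, 5e-3):
    print(e_value(1000, p_single), p_value(1000, p_single))
```

For small E the two values nearly coincide, while for E = 5 the p-value is close to 1.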

Normalised sequence similarity Statistical significance Database searching is commonly performed using an E-value in between 0.1 and Low E-values decrease the number of false positives in a database search, but increase the number of false negatives, thereby lowering the sensitivity of the search.

HMM-based homology searching Most widely used HMM-based profile searching tools currently are SAM-T98 (Karplus et al., 1998) and HMMER2 (Eddy, 1998) formal probabilistic basis and consistent theory behind gap and insertion scores HMMs good for profile searches, bad for alignment HMMs are slow

The HMM algorithms Questions:
1. What is the most likely die (predicted) sequence? Viterbi
2. What is the probability of the observed sequence? Forward
3. What is the probability that the 3rd state is B, given the observed sequence? Backward
Forward: $\alpha_t(i) = P(\text{observed sequence, ending in state } i \text{ at base } t)$
Backward: $\beta_t(i) = P(\text{obs. after } t \mid \text{ending in state } i \text{ at base } t)$
Viterbi: $\delta_t(i) = \max P(\text{obs., ending in state } i \text{ at base } t)$
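The first two questions can be illustrated with a two-state die model. Everything numeric here is an assumption made for the sketch: a fair die "F" and a loaded die "L" that shows a six half of the time, with invented start and transition probabilities.

```python
STATES = ("F", "L")
START = {"F": 0.5, "L": 0.5}
TRANS = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
EMIT = {
    "F": {face: 1 / 6 for face in range(1, 7)},
    "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5},
}


def forward(observations):
    """Probability of the observed sequence, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][observations[0]] for s in STATES}
    for obs in observations[1:]:
        alpha = {
            s: EMIT[s][obs] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(observations):
    """Most likely state path for the observed sequence."""
    delta = {s: (START[s] * EMIT[s][observations[0]], [s]) for s in STATES}
    for obs in observations[1:]:
        new_delta = {}
        for s in STATES:
            prob, path = max(
                ((delta[r][0] * TRANS[r][s], delta[r][1]) for r in STATES),
                key=lambda item: item[0],
            )
            new_delta[s] = (prob * EMIT[s][obs], path + [s])
        delta = new_delta
    return max(delta.values(), key=lambda item: item[0])[1]


rolls = [6, 6, 6, 6, 6, 6]
print(forward(rolls), "".join(viterbi(rolls)))
```

For long sequences the products underflow, so practical implementations work with logarithms or rescale the forward values at each step.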

HMM-based homology searching Transition probabilities and Emission probabilities Gapped HMMs also have insertion and deletion states

Profile HMM: m = match state, I = insert state, d = delete state; go from left to right. I and m states output amino acids; d states are ‘silent’. States: Start, m0 to m5, I0 to I4, d1 to d4, End.

Homology-derived Secondary Structure of Proteins (HSSP) Sander & Schneider, 1991

Up to here 17/02/03

Bio-Data Analysis and Data Mining Existing/emerging bio-data analysis and mining tools for –DNA sequence assembly –Genetic map construction –Sequence comparison and database searching –Gene finding –…. –Gene expression data analysis –Phylogenetic tree analysis to infer horizontally-transferred genes –Mass spec. data analysis for protein complex characterization –…… Current mode of work: Often enough: developing ad hoc tools for each individual application

Bio-Data Analysis and Data Mining As the amount and types of data and their cross connections increase rapidly the number of analysis tools needed will go up “exponentially” –blast, blastp, blastx, blastn, … from BLAST family of tools –gene finding tools for human, mouse, fly, rice, cyanobacteria, ….. –tools for finding various signals in genomic sequences, protein-binding sites, splice junction sites, translation start sites, …..

Bio-Data Analysis and Data Mining Many of these data analysis problems are fundamentally the same problem(s) and can be solved using the same set of tools: e.g. clustering or optimal segmentation by Dynamic Programming Developing ad hoc tools for each application (by each group of individual researchers) may soon become inadequate as bio-data production capabilities further ramp up

Bio-data Analysis, Data Mining and Integrative Bioinformatics To have analysis capabilities covering wide range of problems, we need to discover the common fundamental structures of these problems; HOWEVER in biology one size does NOT fit all… Goal is development of a data analysis infrastructure in support of Genomics and beyond

Algorithms in bioinformatics: string algorithms, dynamic programming, machine learning (NN, k-NN, SVM, GA, ..), Markov chain models, hidden Markov models, Markov Chain Monte Carlo (MCMC) algorithms, stochastic context free grammars, EM algorithms, Gibbs sampling, clustering, tree algorithms, text analysis, hybrid/combinatorial techniques and more…

Sequence analysis and homology searching

Finding genes and regulatory elements

Expression data

Functional genomics Monte Carlo

Protein translation

Example of algorithm reuse: Data clustering Many biological data analysis problems can be formulated as clustering problems –microarray gene expression data analysis –identification of regulatory binding sites (similarly, splice junction sites, translation start sites,......) –(yeast) two-hybrid data analysis (for inference of protein complexes) –phylogenetic tree clustering (for inference of horizontally transferred genes) –protein domain identification –identification of structural motifs –prediction reliability assessment of protein structures –NMR peak assignments –......

Data Clustering Problems Clustering: partition a data set into clusters so that data points of the same cluster are “similar” and points of different clusters are “dissimilar” cluster identification -- identifying clusters with significantly different features than the background
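As a concrete instance of partitioning data so that similar points share a cluster, the sketch below performs single-linkage clustering with a distance cutoff: any two points closer than the cutoff end up in the same cluster. This particular method, the name `single_linkage` and the toy one-dimensional data are choices made for illustration; the slides do not say which clustering algorithm is meant.

```python
def single_linkage(points, distance, cutoff):
    """Group points into clusters: two points share a cluster when they are
    linked by a chain of pairs that are each at most `cutoff` apart."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if distance(points[i], points[j]) <= cutoff:
                parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, point in enumerate(points):
        clusters.setdefault(find(i), []).append(point)
    return sorted(clusters.values())


expression_levels = [1.0, 2.0, 3.0, 10.0, 11.0]
print(single_linkage(expression_levels, lambda a, b: abs(a - b), cutoff=1.5))
```

The same routine works for any data type for which a distance function can be supplied, which is the sense in which one algorithm can serve several of the listed applications.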

Application Examples –Regulatory binding site identification: CRP (CAP) binding site –Two hybrid data analysis –Gene expression data analysis Are all solvable by the same algorithm!

Other Application Examples Phylogenetic tree clustering analysis Protein sidechain packing prediction Assessment of prediction reliability of protein structures Protein secondary structures Protein domain prediction NMR peak assignments ……

Integrative VU Studying informational processes at biological system level From gene sequence to intercellular processes Computers necessary We have biology, statistics, computational intelligence (AI), HTC,.. VUMC: microarray facility Enabling technology: new glue to integrate New integrative algorithms Goals: understanding cells in terms of genomes, fighting disease (VUMC)

VU Progression: DNA: gene prediction, predicting regulatory elements mRNA expression Proteins: docking, domain prediction Metabolic pathways: metabolic control Cell-cell communication

Pyruvate kinase Phosphotransferase  barrel regulatory domain  barrel catalytic substrate binding domain  nucleotide binding domain 1 continuous + 2 discontinuous domains Protein structure and function can be complex…

VU Qualitative challenges: High quality alignments (alternative splicing) In-silico structural genomics In-silico functional genomics: reliable annotation Protein-protein interactions. Metabolic pathways: assign the edges in the networks Cell-cell communication: find membrane associated components New algorithms

VU Quantitative challenges: Understanding mRNA expression levels Understanding resulting protein activity Time dependencies Spatial constraints, compartmentalisation Are classical differential equation models adequate or do we need more individual modeling (e.g macromolecular crowding and activity at oligomolecular level)? Metabolic pathways: calculate fluxes through time Cell-cell communication: tissues, hormones, innervations Need ‘complete’ experimental data for good biological model system to learn to integrate

VU VUMC Neuropeptide – addiction Oncogenes – disease patterns Rheumatic disease CNCR From synapses to higher order behaviour Addiction FPP Genetic psychology – twin data bank

Integrative Genomics

Dinner discussion: Integrative Bioinformatics & Genomics VU Recurrent theme: Integration; from molecule to health CRCS VUMC Leiden-VU-TNO (Centre for Medical Systems Biology)

Dinner discussion: Integrative Bioinformatics & Genomics VU metabolome proteome genome transcriptome physiome

Integrative bioinformatics: Calculate from sequence to molecular behaviour; Calculate from molecular behaviour and interactions to cells; Calculate from cellular interactions to tissues; Calculate from tissue to organism; Calculate from organisms to ecosystem and society; Do this in conjunction with data analysis at all levels; AND CALCULATE BACK (induction)

VU Quantitative challenges: How much protein produced from single gene? What time dependencies? What spatial constraints (compartmentalisation)? Metabolic pathways: assign the edges in the networks Cell-cell communication: find membrane associated components

Integrate data sources Integrate methods Integrate data through method integration (biological model) Integrative bioinformatics

Bioinformatics tool Data Algorithm Biological Interpretation (model) tool

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky) “Nothing in Bioinformatics makes sense except in the light of Biology” Bioinformatics

Pair-wise sequence alignment (more than just string matching) Sequences: MDAGSTVILCFVG and MDAASTILCGS Amino Acid Exchange Matrix Gap penalties (open, extension) Search matrix Alignment: MDAGSTVILCFVG- / MDAAST-ILC--GS Evolution Global dynamic programming

Pair-wise alignment search explosions Combinatorial explosion - 1 gap in 1 sequence: n+1 possibilities - 2 gaps in 1 sequence: (n+1)n - 3 gaps in 1 sequence: (n+1)n(n-1), etc. $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$ 2 sequences of 300 a.a.: ~$10^{88}$ alignments 2 sequences of 1000 a.a.: ~ alignments! Example: TDWVTALK / TDWL--IK

Global dynamic programming
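Global dynamic programming can be illustrated with a minimal Needleman-Wunsch sketch. It is not the slide's exact scheme: the amino acid exchange matrix is replaced by a plain match/mismatch score and the open/extension gap penalties by a single linear gap penalty, and the function name and default values are assumptions chosen for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). The scoring values are
    illustrative assumptions, not an amino acid exchange matrix.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    # The two example sequences from the pair-wise alignment slide.
    print(needleman_wunsch("MDAGSTVILCFVG", "MDAASTILCGS"))
```

The table has (n+1)(m+1) cells, so the search costs time proportional to the product of the sequence lengths instead of enumerating every possible alignment.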

Three integrative methods to predict protein structural aspects: Iterative multiple alignment + protein secondary structure (Praline) Intermezzo: 2½-D structure prediction of flavodoxin fold by hand Protein domain delineation based on consistency of multiple ab initio model tertiary structures (SnapDRAGON) Protein domain delineation based on combining homology searching with domain prediction (Domaination) This talk – own kitchen

Comparing sequences - Similarity Score - Many properties can be used: Nucleotide or amino acid composition Isoelectric point Molecular weight Morphological characters

Multivariate statistics – Cluster analysis: Raw table (columns C1 C2 C3 C4 C5 C6 ..) -> Similarity criterion -> Similarity matrix (5×5) of scores -> Cluster criterion -> Phylogenetic tree

Human Evolution

Comparing sequences - Similarity Score - Many properties can be used: Nucleotide or amino acid composition Isoelectric point Molecular weight Morphological characters But: molecular evolution through sequence alignment

Multivariate statistics – Cluster analysis Phylogenetic tree Scores Similarity matrix 5×5 Multiple alignment Similarity criterion

Lactate dehydrogenase multiple alignment
Human         -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
Chicken       -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
Dogfish       -KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
Lamprey       SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
Barley        TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
Maizey casei  -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
Bacillus      TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
Lacto__ste    -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
Lacto_plant   QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
Therma_mari   MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
Bifido        -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
Thermus_aqua  MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
Mycoplasma    -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ
Distance Matrix over: Human Chicken Dogfish Lamprey Barley Maizey Lacto_casei Bacillus_stea Lacto_plant Therma_mari Bifido Thermus_aqua Mycoplasma
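The step from a multiple alignment to a distance matrix can be sketched as below. The slide does not say which distance measure was used; the fraction of differing residues over gap-free columns (a p-distance) is an assumption, as are the function names.

```python
def p_distance(s1, s2):
    """Fraction of mismatching residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant (assumption)
    differing = sum(1 for x, y in pairs if x != y)
    return differing / len(pairs)


def distance_matrix(alignment):
    """alignment: dict mapping sequence name -> aligned sequence (equal lengths)."""
    names = list(alignment)
    matrix = [[p_distance(alignment[a], alignment[b]) for b in names] for a in names]
    return names, matrix


if __name__ == "__main__":
    # First 20 alignment columns of three rows from the slide.
    aln = {
        "Human":   "-KITVVGVGAVGMACAISIL",
        "Chicken": "-KISVVGVGAVGMACAISIL",
        "Lamprey": "SKVTIVGVGQVGMAAAISVL",
    }
    names, d = distance_matrix(aln)
    for name, row in zip(names, d):
        print(name.ljust(8), " ".join(f"{v:.3f}" for v in row))
```

A matrix like this is the input to a cluster criterion such as neighbour joining, which then produces the tree.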

Multiple sequence alignment Why? It is the most important means to assess relatedness of a set of sequences Gain information about the structure/function of a query sequence (conservation patterns) Construct a phylogenetic tree Putting together a set of sequenced fragments (Fragment assembly) Comparing a segment sequenced by two different labs Many bioinformatics methods depend on it (e.g. secondary/tertiary structure prediction)

Flavodoxin fold: aligning 13 Flavodoxins + cheY 5(βα) fold

Flavodoxin-cheY multiple alignment Praline with pre-processing
1fx1       -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr       --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn       -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN ISWEMKKWI-DESSEFNLEGKLGAAf
3chy       ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM DGLELL-KTIRADGAMSALPVLM T

1fx1       GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD GLRIDGD--PRAARDDIVGWAHDVRGAI
FLAV_DESDE ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE GLKMEGD--ASNDPEAVASfAEDVLKQL
FLAV_DESVH GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD GLRIDGD--PRAARDDIVGwAHDVRGAI
FLAV_DESSA GCGDS-DY-TYFCGA-VDAIEEKLEKMgAVVIGD SLKIDGD--PE--RDEIVSwGSGIADKI
FLAV_DESGI GCGDS-SY-TYFCGA-VDVIEKKAEELgATLVAS SLKIDGE--PD--SAEVLDwAREVLARV
2fcr       GLGDAEGYPDNFCDA-IEEIHDCFAKQGAKPVGFSNPDDYDYEESKS-VRDGKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV
FLAV_AZOVI GLGDQVGYPENYLDA-LGELYSFFKDRgAKIVGSWSTDGYEFESSEA-VVDGKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
FLAV_ENTAG GLGDQLNYSKNFVSA-MRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L
FLAV_ANASP GTGDQIGYADNFQDA-IGILEEKISQRgGKTVGYWSTDGYDFNDSKA-LRNGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL
FLAV_ECOLI GCGDQEDYAEYFCDA-LGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
4fxn       G-----SY-GWGDGKWMRDFEERMNGYGCVVVET PLIVQNE--PDEAEQDCIEFGKKIANI
FLAV_MEGEL G-----SY-GWGSGEWMDAWKQRTEDTgATVIGT AIVNEM--PDNA-PECKElGEAAAKA
FLAV_CLOAB STANSIAGGSDIA---LLTILNHLMVKgMLVYSG----GVAFGKPKTHLGYVHINEIQENEDENARIfGERiANkVKQIF
3chy       VTAEAKK--ENIIAA AQAGAS GYVV-----KPFTAATLEEKLNKIFEKLGM G
Iteration 0 SP= AvSP= SId= 4009 AvSId= 0.313

Flavodoxin-cheY NJ tree

Integrating secondary structure prediction in multiple alignment (Victor Simossis). Praline multiple alignment method (Heringa, Comp. Chem. 23, 1999; Comp. Chem. 26, 2002; Kleinjung, Douglas & Heringa, Bioinformatics, in press, 2002). Combining sequence data and secondary structure prediction (Heringa, Curr. Prot. Pept. Sci. 1 (3), 2000). Secondary structure methods: PHD, Predator, PSIPred, Jpred, SSPRED,...

Using secondary structure in multiple alignment “Structure more conserved than sequence”

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE (oligomers) SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

Secondary structure-induced alignment

Using secondary structure in multiple alignment. Dynamic programming search matrix for MDAGSTVILCFV (predicted HHHCCCEEEEEE) against MDAASTILCGS (predicted HHHHCCEEECC). Amino acid exchange weights matrices: H, E, C and Default
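One way to read this slide is that each cell of the search matrix is scored with an exchange matrix picked by the predicted secondary structure of the two residues. The sketch below assumes that a state-specific matrix is used only when both residues share the same predicted state, with the default matrix otherwise; that rule, the toy match/mismatch values standing in for real exchange matrices, and all names are illustrative assumptions.

```python
# Toy stand-ins for four amino acid exchange matrices (values are assumptions).
DEFAULT = {"match": 2, "mismatch": -1}
BY_STATE = {
    "H": {"match": 3, "mismatch": -1},  # helix-specific weights
    "E": {"match": 4, "mismatch": -2},  # strand-specific weights
    "C": {"match": 1, "mismatch": 0},   # coil-specific weights
}


def pair_score(aa1, ss1, aa2, ss2):
    """Score one residue pair, choosing the table from the predicted states."""
    table = BY_STATE[ss1] if ss1 == ss2 and ss1 in BY_STATE else DEFAULT
    return table["match"] if aa1 == aa2 else table["mismatch"]


def search_matrix(seq1, ss1, seq2, ss2):
    """Fill the residue-by-residue score matrix used by dynamic programming."""
    return [
        [pair_score(a, sa, b, sb) for b, sb in zip(seq2, ss2)]
        for a, sa in zip(seq1, ss1)
    ]


if __name__ == "__main__":
    # Sequences and predicted states taken from the slide.
    m = search_matrix("MDAGSTVILCFV", "HHHCCCEEEEEE", "MDAASTILCGS", "HHHHCCEEECC")
    for row in m:
        print(" ".join(f"{v:2d}" for v in row))
```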

Flavodoxin-cheY predicted secondary structure (PREDATOR) 1fx1 -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF e eeee b ssshhhhhhhhhhhhhhttt eeeee stt tttttt seeee b ee sss ee ttthhhhtt ttss tt eeeee FLAV_DESVH MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf e eeeeee hhhhhhhhhhhhhhh eeeeee eeeeee hhhhhh eeeee FLAV_DESGI MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf e eeeeee hhhhhhhhhhhhhh eeeeee hhhhhh eeeeeee hhhhhh eeeeee FLAV_DESSA MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf eeeeee hhhhhhhhhhhhhh eeeee eeeee hhhhhhh h eeeee FLAV_DESDE MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf eeee hhhhhhhhhhhhhh eeeee hhhhhhhhhhheeeee hhhhhhh hh eeeee 2fcr --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF eeeee ssshhhhhhhhhhhhhggg b eeggg s gggggg seeeeeee stt s s s sthhhhhhhtggg tt eeeee FLAV_ANASP SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL QSDWEGLYSE-LDDVDFNGKLVAYf eeeee hhhhhhhhhhhh eee hhh hhhhhhheeeeee hhhhhhhhh eeeeee FLAV_ECOLI -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA QCDWDDFFPT-LEEIDFNGKLVALf eee hhhhhhhhhhhh eee hhh hhhhhhheeeee hhhhh eeeeee FLAV_AZOVI -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf eee hhhhhhhhhhhhh hhh hhhhhhheeeee hhhhhhhhh eeeeee FLAV_ENTAG MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf eeee hhhhhhhhhhhh hhh hhhhhhheeeee hhhhh eeeee 4fxn ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF eeeee ssshhhhhhhhhhhhhhhtt eeeettt sttttt seeeeee btttb ttthhhhhhh hst t tt eeeee FLAV_MEGEL 
M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf hhhhhhhhhhhhhh eeeee hhhhhhhh eeeee eeeee FLAV_CLOAB M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI SWEMKKWIDE-SSEFNLEGKLGAAf eee hhhhhhhhhhhhhh eeeeee hhhhhhhhhh eeee hhhhhhhhh eeeee 3chy ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM DGLELLKTIRADGAMSALPVLMV tt eeee s hhhhhhhhhhhhhht eeeesshh hhhhhhhh eeeee s sss hhhhhhhhhh ttttt eeee 1fx1 GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD GLRIDGD--PRAARDDIVGWAHDVRGAI eee s ss sstthhhhhhhhhhhttt ee s eeees gggghhhhhhhhhhhhhh FLAV_DESVH GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD GLRIDGD--PRAARDDIVGwAHDVRGAI eee hhhhhhhhhhhh eeeee eeeee hhhhhhhhhhhhhh FLAV_DESGI GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS SLKIDGE--P--DSAEVLDwAREVLARV eee hhhhhhhhhhhh eeeee hhhhhhhhhhh FLAV_DESSA GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD SLKIDGD--P--ERDEIVSwGSGIADKI hhhhhhhhhhhh eeeee e eee FLAV_DESDE ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE GLKMEGD--ASNDPEAVASfAEDVLKQL e hhhhhhhhhhhhhh eeeee ee hhhhhhhhhhh 2fcr GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV eee ttt ttsttthhhhhhhhhhhtt eee b gggs s tteet teesseeeettt ss hhhhhhhhhhhhhhhht FLAV_ANASP GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhh FLAV_ECOLI GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA hhhhhhhhhhhhhh eeee hhhhhhhhhhhhhhhhhh FLAV_AZOVI GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L-- e hhhhhhhhhhhhhh eeeee hhhhhhhhhhh FLAV_ENTAG GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L hhhhhhhhhhhhhhh eeee hhhhhhh hhhhhhhhhhhh 4fxn G-----SYGWGDGKWMRDFEERMNGYGCVVVET PLIVQNE--PDEAEQDCIEFGKKIANI e eesss shhhhhhhhhhhhtt ee s eeees ggghhhhhhhhhhhht FLAV_MEGEL G-----SYGWGSGEWMDAWKQRTEDTgATVIGT AIVNEM--PDNAPE-CKElGEAAAKA hhhhhhhhhhh eeeee eeee h hhhhhhhh 
FLAV_CLOAB STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF-- hhhhhhhhhhhhhh eeeee hhhh hhh hhhhhhhhhhhh h 3chy TAEAKKENIIAAAQAGASGY VVK----P-FTAATLEEKLNKIFEKLGM ess hhhhhhhhhtt see ees s hhhhhhhhhhhhhhht G Enough to predict 5(  ) topology

Secondary structure-induced alignment

Iteration Convergence Limit cycle Divergence
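The three outcomes named on this slide can be told apart by remembering every state an iteration has visited. This is a generic sketch rather than Praline's own bookkeeping: `step` is a hypothetical function that maps one alignment state (any hashable value) to the next, and "divergence" is taken to mean that no state repeated within `max_iter` rounds, which is an assumption.

```python
def classify_iteration(step, start, max_iter=50):
    """Run state = step(state) and report how the iteration behaves.

    Returns ("convergence", n_rounds), ("limit cycle", period) or
    ("divergence", max_iter).
    """
    seen = {start: 0}  # state -> iteration at which it first appeared
    state = start
    for it in range(1, max_iter + 1):
        state = step(state)
        if state in seen:
            period = it - seen[state]
            if period == 1:
                return ("convergence", it)  # same state twice in a row
            return ("limit cycle", period)  # revisits an older state
        seen[state] = it
    return ("divergence", max_iter)


if __name__ == "__main__":
    print(classify_iteration(lambda x: min(x + 1, 3), 0))
    print(classify_iteration(lambda x: (x + 1) % 3, 0))
    print(classify_iteration(lambda x: x + 1, 0, max_iter=10))
```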

3chy-AA SEQUENCE|| AA |ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQAGGYGFVISDWNMP| 3chy-ITERATION-0|| PHD | EEEEEEE HHHHHHHHHHHHHHHHH E HHHHHHHHHH HHHEEE | 3chy-ITERATION-1|| PHD | EEEEEEEE HHHHHHHHHHHHHHH HHHHHHHH EEEEEE | 3chy-ITERATION-2|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHH EEEEEE | 3chy-ITERATION-3|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE | 3chy-ITERATION-4|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEE | 3chy-ITERATION-5|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE | 3chy-ITERATION-6|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHH EEEEEE | 3chy-ITERATION-7|| PHD | EEEEEEEE HHHHHHHHHHHHHH EEE HHHHHH EEEEE | 3chy-ITERATION-8|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHH EEEEEE | 3chy-ITERATION-9|| PHD | EEEEEEEE HHHHHHHHHHHHHH HHHHHHHHHH EEEEE | 3chy-AA SEQUENCE|| AA |NMDGLELLKTIRADGAMSALPVLMVTAEAKKENIIAAAQAGASGYVVKPFTAATLEEKLNKIFEKLGM| 3chy-ITERATION-0|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHH HHHHHHHHHHHHHH | 3chy-ITERATION-1|| PHD | HHHHHHEEEEEE HHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH | 3chy-ITERATION-2|| PHD | HHHHHHEEEEEE HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH | 3chy-ITERATION-3|| PHD | HHHHHHHHHHHH HHHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH | 3chy-ITERATION-4|| PHD | HHHHH EEEEE HHHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH | 3chy-ITERATION-5|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH | 3chy-ITERATION-6|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH | 3chy-ITERATION-7|| PHD | HHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH | 3chy-ITERATION-8|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHHH EEE HHHHHHHHHHHHHH | 3chy-ITERATION-9|| PHD | HHHHHHHH EEEEE HHHHHHHHHHHHHHH EEEE HHHHHHHHHHHHHH | Flavodoxin-cheY multiple alignment/ secondary structure iteration cheY SSEs

4fxn-AA SEQUENCE|| AA |MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEV| 4fxn-ITERATION-0|| PHD | EEEEE HHHHHHHHHHHHHHH EEE EEEEE | 4fxn-ITERATION-1|| PHD | EEEEE HHHHHHHHHHHHHHH EEEE EEEEE | 4fxn-ITERATION-2|| PHD | EEEEE HHHHHHHHHHHHHHH EEEE EEEEE | 4fxn-ITERATION-3|| PHD | EEEEE HHHHHHHHHHHHHHH E EEEEE | 4fxn-ITERATION-4|| PHD | EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE | 4fxn-ITERATION-5|| PHD | EEEEEE HHHHHHHHHHHHHHH EE EEEEE | 4fxn-ITERATION-6|| PHD | EEEEEE HHHHHHHHHHHHHHH EEEE EEEEE | 4fxn-ITERATION-7|| PHD | EEEEEE HHHHHHHHHHHHHHH EE EEEEE | 4fxn-ITERATION-8|| PHD | EEEEEE HHHHHHHHHHHHHHH EEE EEEEE | 4fxn-ITERATION-9|| PHD | EEEEE HHHHHHHHHHHHHHH EEE EEEEE | 4fxn-AA SEQUENCE|| AA |LEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE| 4fxn-ITERATION-0|| PHD | EEEEE HHHHHHHHHHHHHHHHH EEE EEE | 4fxn-ITERATION-1|| PHD | HHHH EEEEE HHHHHHHHHHHHHHH EEE EE | 4fxn-ITERATION-2|| PHD | HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHH EEE EE | 4fxn-ITERATION-3|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHH EEE EE | 4fxn-ITERATION-4|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E | 4fxn-ITERATION-5|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E | 4fxn-ITERATION-6|| PHD | HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E | 4fxn-ITERATION-7|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E | 4fxn-ITERATION-8|| PHD | HHHHHHHHHHHH EEEEE HHHHHHHHHHHHHHHHH EEE E | 4fxn-ITERATION-9|| PHD | HHHHHHHHHHHH EEEEEE HHHHHHHHHHHHHHHH EEE E | 4fxn-AA SEQUENCE|| AA |PDEAEQDCIEFGKKIANI| 4fxn-ITERATION-0|| PHD | HHHHHHHHHHHHH | 4fxn-ITERATION-1|| PHD | HHHHHHHHHHHHH | 4fxn-ITERATION-2|| PHD | HHHHHHHHHHHHH | 4fxn-ITERATION-3|| PHD | HHHHHHHHHHHHH | 4fxn-ITERATION-4|| PHD | HHHHHHHHHHHH | 4fxn-ITERATION-5|| PHD | HHHHHHHHHHHHH | 4fxn-ITERATION-6|| PHD | HHHHHHHHHHHH | 4fxn-ITERATION-7|| PHD | HHHHHHHHHHHHH | 4fxn-ITERATION-8|| PHD | HHHHHHHHHHHHH | 4fxn-ITERATION-9|| PHD | HHHHHHHHHHHH |

C A B D Predicting sec. struct. with PHD, etc. A B C D

Secondary structure prediction using MA (SymSS) EEEEE HHHHHH EEEEE HH EEEE? ?HHHHH EEE H EEEEE HHHHH? ??EE HH EEEEEE ?HHHHH EEEE HH EEEEE HHHHHH EEE HH EEEE? ?HHHHH EEE H EEEEE HHHHH? ??EE HH EEEEE ?HHHHH EEEE HH EEEEE HHHH EEE HH EEEE? ?HHH EEE H EEEEE HHH? ??EE HH EEEEE HHH? EEEE HH EEEEE HHHHHH EEE HHHH EEEE? ?HHHHH EEE ?HHH EEEEE HHHHH? ??EE HHHH EEEEE ?HHHHH EEEE HHHH EEEEE HHHHH EEE HEEEE HHHH EE HHHEEEE HHHHH EEE HEEEE HHH EEE HH

Flavodoxin-cheY 3chy GYVVKPFTAATLEEKLNKIFEKLGM PHD hhhhhhhhhhhhhh > 0 ee ??hhhhhhhhhhh? 13 -> 1 ee ??hhhhhhhhhhh?? 13 -> 2 ee ??hhhhhhhhhhh? 13 -> 3 eee ?hhhhhhhhhhh? 13 -> 4 eee ?hhhhhhhhhhh? 13 -> 5 eee h?hhhhhhhhhhh 13 -> 6 eee hh hhhhhhhhhhh 13 -> 7 e eeeeeee hhhhhhhhhhhhh?? 13 -> 8 eeeeeee hhhhhhhhhhhhh?? 13 -> 9 eeeeeee hhhhhhhhhhhhh?? ????? 13 -> 10 eeeeeee hhhhhhhhhhhhh?? 13 -> 11 e eeeeeeee hhhhhhhhhhhhh??? 13 -> 12 eeeeeee hhhhhhhhhh 13 -> 13 hhhhhhhhhhhhhh h DSSP EEEESS HHHHHHHHHHHHHHHT......

Optimal segmentation of predicted secondary structures. Each sequence within an alignment gives rise to a library of n secondary structure predictions, where n is the number of sequences in the alignment. The predictions are recorded by secondary structure type and region position in a single matrix, with an H score, E score, C score and ? score per region. [Figure: library of predictions 1 -> 1, 1 -> 2, 1 -> 3, 1 -> 4 for one sequence, e.g. EEEEE HHHHHH EEEEE HH]

Optimal segmentation of predicted secondary structures by Dynamic Programming. The recorded values are used in a weighted function according to their secondary structure type, which gives each position a window-specific score. The more probable the secondary structure element, the higher the score. Restrictions: H only if window size (ws) >= 4; E only if ws >= 2. [Figure: matrix of sequence position against window size, storing the max score, offset and label (from the H, E, C and ? scores per region) in each cell; the segmentation score is the total score of each path.]
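The segmentation step can be sketched as a one-dimensional dynamic programme over segment end points. The slide's weighted, window-specific scoring function is not given, so the sketch assumes that a segment's score is simply the sum of per-position scores for its label; only the two length restrictions (H at least 4, E at least 2) come from the slide, and the function name is hypothetical.

```python
MIN_LEN = {"H": 4, "E": 2, "C": 1}  # H and E limits from the slide; C is an assumption


def segment(scores):
    """scores: list of dicts with per-position scores for "H", "E" and "C".

    Returns (best_total_score, label_string) for the best-scoring segmentation.
    """
    n = len(scores)
    # Prefix sums per label so any segment score is one subtraction.
    prefix = {lab: [0.0] for lab in MIN_LEN}
    for pos in scores:
        for lab in MIN_LEN:
            prefix[lab].append(prefix[lab][-1] + pos[lab])

    neg = float("-inf")
    best = [neg] * (n + 1)   # best[e] = best score for positions [0, e)
    back = [None] * (n + 1)  # back[e] = (start, label) of the last segment
    best[0] = 0.0
    for end in range(1, n + 1):
        for lab, min_len in MIN_LEN.items():
            for start in range(0, end - min_len + 1):
                if best[start] == neg:
                    continue
                cand = best[start] + prefix[lab][end] - prefix[lab][start]
                if cand > best[end]:
                    best[end] = cand
                    back[end] = (start, lab)

    # Walk the back-pointers to write out one label per position.
    labels = []
    end = n
    while end > 0:
        start, lab = back[end]
        labels.append(lab * (end - start))
        end = start
    return best[n], "".join(reversed(labels))


if __name__ == "__main__":
    toy = [{"H": 1.0, "E": 0.0, "C": 0.5}] * 3 + [{"H": 0.0, "E": 1.0, "C": 0.2}] * 3
    print(segment(toy))
```

In the toy input the helix signal covers only three positions, so the minimum-length rule forces the helix segment to extend one position into the strand region.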

Example of an optimally segmented secondary structure prediction library for sequence 3chy 3chy GYVV-----KPFTAATLEEKLNKIFEKLGM chy <- 1fx1 ??????????????? ee ?? hhhhhhhhhhhhhh ???????? 3chy <- FLAV_DESDE ??????????????? ee ?? hhhhhhhhhhhhhhh ???????? 3chy <- FLAV_DESVH ??????????????? ee ?? hhhhhhhhhhhhhh ???????? 3chy <- FLAV_DESGI ??????????????? eee ?? ??hhhhhhhhhhhhh ???????? 3chy <- FLAV_DESSA ??????????????? eee ?? ??hhhhhhhhhhhhh ???????? 3chy <- 4fxn ??????????????? eee ?? hhhhhhhhhhhhh ????????? 3chy <- FLAV_MEGEL ????????????????eee ?? hh?hhhhhhhhhhh ????????? 3chy <- 2fcr e ? eeeeeee hhhhhhhhhhhhhhh ?????? 3chy <- FLAV_ANASP ? eeeeeee hhhhhhhhhhhhhhh ?????? 3chy <- FLAV_ECOLI eeeeeee hhhhhhhhhhhhhhh hhhhh 3chy <- FLAV_AZOVI ? eeeeeee hhhhhhhhhhhhhhh ???? 3chy <- FLAV_ENTAG e eeeeeeee hhhhhhhhhhhhhhhh? ?????? 3chy <- FLAV_CLOAB eeeeeee hhhhhhhhhh ??????????? 3chy <- 3chy hhhhhhhhhhhhhh Consensus EEEE----- HHHHHHHHHHHHH Consensus-DSSP ****.....****xx*************** PHD HHHHHHHHHHHHHH PHD-DSSP xxxx.....******************x** DSSP EEEE.....SS HHHHHHHHHHHHHHHT LumpDSSP EEEE..... HHHHHHHHHHHHHHH......

What to do with a multiple alignment? Use it to eyeball and detect structural/functional features Use it to make a profile and search a database for homologs Give it to other bioinformatics methods and predict secondary structure, functional residues, correlated mutations, phylogenetic trees, etc.

Rules of thumb when looking at a multiple alignment (MA) Hydrophobic residues are internal Gly (Thr, Ser) in loops MA: hydrophobic block -> internal β-strand MA: alternating (1-1) hydrophobic/hydrophilic => edge β-strand MA: alternating 2-2 (or 3-1) periodicity => α-helix MA: gaps in loops MA: Conserved column => functional? => active site

Rules of thumb when looking at a multiple alignment (MA) Active site residues are together in 3D structure Helices often cover up core of strands Helices less extended than strands => more residues to cross protein β-α-β motif is right-handed in >95% of cases (with parallel strands) MA: ‘inconsistent’ alignment columns and match errors! Secondary structures have local anomalies, e.g. β-bulges

Periodicity patterns: Buried β-strand, Edge β-strand, α-helix

Buried and Edge strands: Parallel β-sheet, Anti-parallel β-sheet

 -  -  motif is right-handed in >95% of cases RH LH

Flavodoxin-cheY example: 5(βα) (the Praline alignment of the 13 flavodoxins + cheY shown earlier, Iteration 0)

Building flavodoxin RH 21345

Building flavodoxin try again RH 12345

Flavodoxin family - TOPS diagrams (Flores et al., 1994)

Protein structure evolution Insertion/deletion of secondary structural elements can ‘easily’ be done at loop sites

Protein structure evolution Insertion/deletion of structural domains can ‘easily’ be done at loop sites N C

SnapDRAGON (Richard A. George). George R.A. and Heringa J. (2002) J. Mol. Biol. 316. Integrating protein multiple alignment, secondary and tertiary structure prediction to predict structural domains in sequence data

A domain is a: Compact, semi-independent unit (Richardson, 1981). Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973). Recurring functional and evolutionary module (Bork, 1992). “Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).

The DEATH Domain Present in a variety of Eukaryotic proteins involved with cell death. Six helices enclose a tightly packed hydrophobic core. Some DEATH domains form homotypic and heterotypic dimers.

Delineating domains is essential for: Obtaining high resolution structures (x-ray, NMR) Sequence analysis Multiple sequence alignment methods Prediction algorithms (SS, Class, secondary/tertiary structure) Fold recognition and threading Elucidating the evolution, structure and function of a protein family (e.g. ‘Rosetta Stone’ method) Structural/functional genomics Cross genome comparative analysis

Pyruvate kinase (phosphotransferase): β-barrel regulatory domain, α/β-barrel catalytic substrate-binding domain, α/β nucleotide-binding domain; 1 continuous + 2 discontinuous domains. Structural domain organisation can be nasty…

Protein structure hierarchical levels VHLTPEEKSAVTALWGKVNVDE VGGEALGRLLVVYPWTQRFFE SFGDLSTPDAVMGNPKVKAHG KKVLGAFSDGLAHLDNLKGTFA TLSELHCDKLHVDPENFRLLGN VLVCVLAHHFGKEFTPPVQAAY QKVVAGVANALAHKYH PRIMARY STRUCTURE (amino acid sequence) QUATERNARY STRUCTURE SECONDARY STRUCTURE (helices, strands) TERTIARY STRUCTURE (fold)

Distance Regularisation Algorithm for Geometry OptimisatioN (Aszodi & Taylor, 1994) Domain prediction using DRAGON Folds proteins based on the requirement that (conserved) hydrophobic residues cluster together. First constructs a random high dimensional Cα distance matrix. Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.

The DRAGON target matrix is inferred from: A multiple sequence alignment of a protein (old) – Conserved hydrophobicity. Secondary structure information (SnapDRAGON) – predicted by PREDATOR (Frishman & Argos, 1996) – strands are entered as distance constraints from the N-terminal Cα to the C-terminal Cα

The Cα distance matrix is divided into smaller clusters. Separately, each cluster is embedded into a local centroid. The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures. [Figure: input data (multiple alignment and predicted secondary structure, e.g. CCHHHCCEEE) -> N×N target matrix and 100 randomised initial N×N Cα distance matrices -> embedding in 3 dimensions -> 100 predictions]

SnapDragon Generated folds by Dragon Boundary recognition Summed and Smoothed Boundaries CCHHHCCEEE Multiple alignment Predicted secondary structure

Domains in structures assigned using method by Taylor (1997) Domain boundary positions of each model against sequence Summed and Smoothed Boundaries (Biased window protocol) SnapDRAGON

Prediction assessment Test set of 414 multiple alignments; 183 single and 231 multiple domain proteins. Sequence searches using PSI-BLAST (Altschul et al., 1997) followed by redundancy filtering using OBSTRUCT (Heringa et al., 1992) and alignment by PRALINE (Heringa, 1999) Boundary predictions are compared to the region of the protein connecting two domains (min ±10 residues)

Average prediction results per protein Coverage is the % of linkers predicted: TP/(TP+FN) Success is the % of correct predictions made: TP/(TP+FP)
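The two measures can be computed per protein as in the sketch below. How true and false positives were counted is not spelled out on the slide; here a linker counts as a true positive when at least one predicted boundary falls inside it, and a prediction outside every linker counts as a false positive. Those counting rules and the function name are assumptions.

```python
def coverage_and_success(predicted, linkers):
    """predicted: list of boundary positions; linkers: list of (start, end), inclusive.

    Returns (coverage %, success %) = (100*TP/(TP+FN), 100*TP/(TP+FP)).
    """
    def inside(pos, linker):
        start, end = linker
        return start <= pos <= end

    tp = sum(1 for l in linkers if any(inside(p, l) for p in predicted))
    fn = len(linkers) - tp
    fp = sum(1 for p in predicted if not any(inside(p, l) for l in linkers))
    coverage = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    success = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return coverage, success


if __name__ == "__main__":
    # One of two linkers is hit and one of two predictions is wrong.
    print(coverage_and_success([50, 200], [(45, 55), (150, 160)]))
```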

SnapDRAGON Is very slow (can be hours for proteins>400 aa) – cluster computing implementation Uses consistency in the absence of standard of truth Goes from primary+secondary to tertiary structure to ‘just’ chop protein sequences SnapDRAGON webserver is underway

DOMAINATION Richard A. George Protein domain identification and improved sequence searching using PSI-BLAST (George & Heringa, Prot. Struct. Func. Genet., in press; 2002) Integrating protein sequence database searching and on-the-fly domain recognition

Domaination Current iterative homology search methods do not take into account that: –Domains may have different ‘rates of evolution’. –Common conserved domains, such as the tyrosine kinase domain, can obscure weak but relevant matches to other domain types –Premature convergence (false negatives) –Matrix migration / Profile wander (false positives).

PSI-BLAST Query sequence is first scanned for the presence of so-called low-complexity regions (Wootton and Federhen, 1996), i.e. regions with a biased composition (e.g. TM regions or coiled coils) likely to lead to spurious hits, which are excluded from alignment. Initially operates on a single query sequence by performing a gapped BLAST search. Then takes significant local alignments found, constructs a ‘multiple alignment’ and abstracts a position specific scoring matrix (PSSM) from this alignment. Rescans the database in a subsequent round to find more homologous sequences. Iteration continues until user decides to stop or search converges

PSI-BLAST iteration: Query sequence Q -> gapped BLAST search -> database hits -> PSSM (residue types A, C, D, .., Y; entries Pi .. Px) -> gapped BLAST search with the PSSM -> database hits -> updated PSSM, and so on
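The step from a set of aligned hits to a position specific scoring matrix can be sketched as follows. This is a deliberately simplified stand-in for what PSI-BLAST does: it uses a uniform background, a flat pseudocount and no sequence weighting, all of which are assumptions, and the function names are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """alignment: list of equal-length aligned sequences ('-' marks a gap).

    Returns one dict per column mapping residue -> log2-odds score.
    """
    background = 1.0 / len(ALPHABET)  # uniform background (assumption)
    pssm = []
    for col in range(len(alignment[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in alignment:
            if seq[col] in counts:  # gaps and unknown symbols are skipped
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background) for aa in ALPHABET})
    return pssm


def score_segment(pssm, segment):
    """Ungapped score of a segment laid along the profile columns."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    profile = build_pssm(["ACD", "ACD", "ACE"])
    for candidate in ("ACD", "ACE", "WWW"):
        print(candidate, round(score_segment(profile, candidate), 2))
```

Rescanning the database with such a profile instead of the single query is what lets later rounds pick up more distant homologs.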

DOMAINATION Chop and Join Domains

Post-processing low complexity Remove local fragments with > 15% LC

Identifying domain boundaries Sum N- and C-termini of gapped local alignments True N- and C-termini are counted twice (within 10 residues) Boundaries are smoothed using two windows (15 residues long) Combine scores using biased protocol: if Ni × Ci = 0 then Si = Ni + Ci, else Si = Ni + Ci + (Ni × Ci)/(Ni + Ci)
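The biased protocol on this slide translates directly into code. The sketch covers only the combination rule; the summing of alignment termini and the window smoothing that produce the per-position N and C scores are assumed to have happened already, and the function name is hypothetical.

```python
def combine_boundary_scores(n_scores, c_scores):
    """Combine per-position N-terminus and C-terminus scores into S.

    Si = Ni + Ci                       if Ni * Ci == 0
    Si = Ni + Ci + Ni*Ci / (Ni + Ci)   otherwise
    """
    combined = []
    for n, c in zip(n_scores, c_scores):
        if n * c == 0:
            combined.append(n + c)
        else:
            # Positions supported by both kinds of termini get a bonus.
            combined.append(n + c + (n * c) / (n + c))
    return combined


if __name__ == "__main__":
    print(combine_boundary_scores([0, 2, 3], [4, 2, 0]))
```

The extra term is largest when N and C evidence are balanced, which favours positions where one local alignment ends and another begins.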

Identifying domain deletions Deletions in the query (or insertions in the DB sequences) are identified when two adjacent segments in the query align to the same DB sequences (>70% overlap), which have a region of >35 residues not aligned to the query (remove N- and C-termini). [Figure: DB sequence against query]

Identifying domain permutations
A domain shuffling event is declared:
– when two local alignments (>35 residues) within a single DB sequence match two separate segments in the query (>70% overlap), but have a different sequential order.
[Figure: segments a and b appear in the order a, b in the query and b, a in the DB sequence.]
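The order comparison can be sketched for one pair of local alignments as below. The >70% overlap criterion is not modelled, coordinates are assumed 1-based and inclusive, and the names are hypothetical.

```python
from typing import NamedTuple


class LocalAlignment(NamedTuple):
    q_start: int   # first aligned residue in the query
    q_end: int     # last aligned residue in the query
    db_start: int  # first aligned residue in the DB sequence
    db_end: int    # last aligned residue in the DB sequence


def is_permutation(first, second, min_length=35):
    """True if both alignments are long enough and their order differs in query and DB."""
    for hit in (first, second):
        if hit.q_end - hit.q_start + 1 <= min_length:
            return False
    same_order_in_query = first.q_start < second.q_start
    same_order_in_db = first.db_start < second.db_start
    return same_order_in_query != same_order_in_db


seg_a = LocalAlignment(1, 100, 201, 300)
seg_b = LocalAlignment(150, 250, 1, 101)
```

`seg_a` precedes `seg_b` in the query but follows it in the DB sequence, so the pair is reported as a permutation.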

Identifying continuous and discontinuous domains Each segment is assigned an independence score (In). If In>10% the segment is assigned as a continuous domain. An association score is calculated between non-adjacent fragments by assessing the shared sequence hits to the segments. If score > 50% then segments are considered as discontinuous domains and joined.

Create domain profiles
– A representative set of the database sequence fragments that overlap a putative domain is selected for alignment using OBSTRUCT (Heringa et al. 1992), with > 20% and < 60% sequence identity (including the query seq).
– A multiple sequence alignment is generated using PRALINE (Heringa 1999, 2002; Kleinjung et al., 2002).
– Each domain multiple alignment is used as a profile in further database searches using PSI-BLAST (Altschul et al. 1997).
– The whole process is iterated until no new domains are identified.
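To illustrate the identity window, the following sketch selects fragments greedily so that every selected pair has between 20% and 60% identity. This is a stand-in for illustration only and is not OBSTRUCT's algorithm; it also assumes the fragments are already aligned to equal length, with `-` for gaps. All names are hypothetical.

```python
def percent_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def select_representatives(query, fragments, low=0.20, high=0.60):
    """Greedy selection: keep a fragment only if its identity to every
    already selected sequence (query included) lies strictly between low and high."""
    selected = [query]
    for frag in fragments:
        if all(low < percent_identity(frag, s) < high for s in selected):
            selected.append(frag)
    return selected


query_seq = "ACDEFGHIKL"
candidates = ["ACDEFGHIKL", "ACDEFWWWWW", "WWWWWWWWWW", "ACWWWGHIWW"]
representatives = select_representatives(query_seq, candidates)
```

The identical copy of the query (100%) and the unrelated fragment (0%) fall outside the window; the two fragments at intermediate identity are kept.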

Domain boundary prediction accuracy
– Set of 452 multidomain proteins
– 56% of proteins were correctly predicted to have more than one domain
– 42% of predictions are within ±20 residues of a true boundary
– 49.9% (±44.6%) correct boundary predictions per protein

23.3% of all linkers were found in the 452 multidomain proteins. Not a surprise since:
– Structural domain boundaries will not always coincide with sequence domain boundaries
– Proteins must have some domain shuffling
For discontinuous proteins:
– 34.2% of linkers were identified
– 30% of discontinuous domains were successfully joined

Change in domain prediction accuracy using various PSI-BLAST E-value cut-offs

Benchmarking versus PSI-BLAST
– A set of 452 non-homologous multidomain protein structures.
– Each protein was delineated into its structural domains.
– Database searches of the individual domains were used as a standard of truth.
– We then tested to what extent PSI-BLAST and DOMAINATION, when run on the full-length protein sequences, can capture the sequences found by the reference PSI-BLAST searches using the individual domains.

Two sets based on individual domain searches:
– Reference set 1: consists of database sequences for which PSI-BLAST finds all domains contained in the corresponding full-length query.
– Reference set 2: consists of database sequences found by searching with one or more of the domain sequences.
Therefore set 2 contains many more sequences than set 1.
[Figure: Ref set 1 and Ref set 2, shown as query domains matched against DB seqs.]
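One way to read these definitions is that reference set 1 is the intersection of the per-domain hit lists and reference set 2 is their union. The sketch below follows that reading, which is an interpretation rather than the authors' stated procedure; `reference_sets`, `coverage` and the toy identifiers are hypothetical.

```python
def reference_sets(hits_per_domain):
    """Build both reference sets from a mapping: domain name -> set of DB sequence ids.

    ref1: sequences hit by every domain of the query (intersection).
    ref2: sequences hit by at least one domain (union).
    """
    hit_sets = [set(hits) for hits in hits_per_domain.values()]
    if not hit_sets:
        return set(), set()
    ref1 = set.intersection(*hit_sets)
    ref2 = set.union(*hit_sets)
    return ref1, ref2


def coverage(found, reference):
    """Fraction of the reference set recovered by a full-length search."""
    if not reference:
        return 0.0
    return len(set(found) & reference) / len(reference)


domain_hits = {"domain1": {"s1", "s2", "s3"}, "domain2": {"s2", "s3", "s4"}}
ref1, ref2 = reference_sets(domain_hits)
full_length_hits = {"s2", "s3", "s4"}
```

With these toy hits the full-length search recovers all of reference set 1 and three of the four sequences in reference set 2.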

Sequences found over Reference sets 1 and 2

Reference 1
– PSI-BLAST finds 97.9% of sequences
– Domaination finds 99.1% of sequences
Reference 2
– PSI-BLAST finds 83.2% of sequences
– Domaination finds 90.6% of sequences

Test against SMART sequence domains A set of 15 sequences with domain definition in the SMART database (Ponting et al. 1999) Create two reference sets based on individual domain searches.

Sequences found over Reference sets 1 and 2 from 15 Smart sequences

SSEARCH significance test
– Verify the statistical significance of database sequences found by relating them to the original query sequence.
– SSEARCH (Pearson & Lipman 1988) calculates an E-value for each generated local alignment.
– This filter will lose distant homologies.
– Use the 452 proteins with known structure.

Significant sequences found in database searches At an E-value cut-off of 0.1 the performance of DOMAINATION searches with the full-length proteins is 15% better than PSI-BLAST

Summary
Algorithmic integration issues:
– Integrating data categories
– Integrating alternative methods (consensus)
– Making a web-integrated genomics pipeline that combines it all

Big task VU Needs: People Teams with an interest in Integrative Bioinformatics HTC/Dedicated cluster computing

NIMR MathBio Website

NIMR MathBio Website --Tools

Acknowledgements VU CvB FEW FALW Victor Simossis – NIMR to VU (1 November 2002) Jens Kleinjung – NIMR to VU (1 December 2002) Hans Westerhoff – FALW, VU Henri Bal – CS, FEW, VU Hans van Beek – VUMC/FALW, VU

View at NIMR (Mill Hill)