Presentation is loading. Please wait.

Presentation is loading. Please wait.

What is bioinformatics?. What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe...

Similar presentations


Presentation on theme: "What is bioinformatics?. What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe..."— Presentation transcript:

1 What is bioinformatics?

2 What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe... Compare molecular biological data Find patterns in molecular biological data –phylogenies –correlations (sequence / structure / expression / function / disease)‏ Goals: characterise biological patterns & processes predict biological properties –low level data ⇒ high level properties (eg., sequence ⇒ function)‏

3 Bioinformatics: neighbour disciplines Computational biology –Broader concept: includes computational ecology, physiology, neurology etc... -omics: –Genomics –Transcriptomics –Proteomics Systems biology –Putting it all together... –Building models, identify control & regulation

4 Bioinformatics: prerequisites Bio- side: –Molecular biology –Cell biology –Genetics –Evolutionary theory -informatics side: –Computer science –Statistics –Theoretical physics

5 Molecular biology data... >alpha-D ATGCTGACCGACTCTGACAAGAAGCTGGTCCTGCAGGTGTGGGAGAAGGTGATCCGCCAC CCAGACTGTGGAGCCGAGGCCCTGGAGAGGTGCGGGCTGAGCTTGGGGAAACCATGGGCA AGGGGGGCGACTGGGTGGGAGCCCTACAGGGCTGCTGGGGGTTGTTCGGCTGGGGGTCAG CACTGACCATCCCGCTCCCGCAGCTGTTCACCACCTACCCCCAGACCAAGACCTACTTCC CCCACTTCGACTTGCACCATGGCTCCGACCAGGTCCGCAACCACGGCAAGAAGGTGTTGG CCGCCTTGGGCAACGCTGTCAAGAGCCTGGGCAACCTCAGCCAAGCCCTGTCTGACCTCA GCGACCTGCATGCCTACAACCTGCGTGTCGACCCTGTCAACTTCAAGGCAGGCGGGGGAC GGGGGTCAGGGGCCGGGGAGTTGGGGGCCAGGGACCTGGTTGGGGATCCGGGGCCATGCC GGCGGTACTGAGCCCTGTTTTGCCTTGCAGCTGCTGGCGCAGTGCTTCCACGTGGTGCTG GCCACACACCTGGGCAACGACTACACCCCGGAGGCACATGCTGCCTTCGACAAGTTCCTG TCGGCTGTGTGCACCGTGCTGGCCGAGAAGTACAGATAA >alpha-A ATGGTGCTGTCTGCCAACGACAAGAGCAACGTGAAGGCCGTCTTCGGCAAAATCGGCGGC CAGGCCGGTGACTTGGGTGGTGAAGCCCTGGAGAGGTATGTGGTCATCCGTCATTACCCC ATCTCTTGTCTGTCTGTGACTCCATCCCATCTGCCCCCATACTCTCCCCATCCATAACTG TCCCTGTTCTATGTGGCCCTGGCTCTGTCTCATCTGTCCCCAACTGTCCCTGATTGCCTC TGTCCCCCAGGTTGTTCATCACCTACCCCCAGACCAAGACCTACTTCCCCCACTTCGACC TGTCACATGGCTCCGCTCAGATCAAGGGGCACGGCAAGAAGGTGGCGGAGGCACTGGTTG AGGCTGCCAACCACATCGATGACATCGCTGGTGCCCTCTCCAAGCTGAGCGACCTCCACG CCCAAAAGCTCCGTGTGGACCCCGTCAACTTCAAAGTGAGCATCTGGGAAGGGGTGACCA GTCTGGCTCCCCTCCTGCACACACCTCTGGCTACCCCCTCACCTCACCCCCTTGCTCACC ATCTCCTTTTGCCTTTCAGCTGCTGGGTCACTGCTTCCTGGTGGTCGTGGCCGTCCACTT CCCCTCTCTCCTGACCCCGGAGGTCCATGCTTCCCTGGACAAGTTCGTGTGTGCCGTGGG CACCGTCCTTACTGCCAAGTACCGTTAA DNA sequences

6 Molecular biology data... Amino acid sequences Protein structure: –X-ray crystallography –NMR

7 Cell biology & proteomics data... Subcellular localization

8 Cell biology & proteomics data... protein-protein interactions

9 DNA microarray technology Transcriptomics: DNA microarray technology

10 Proteomics & transcriptomics data Proteins encoded by periodically expressed genes: a functionally diverse protein category

11 Cell cycle regulation Multiprotein complexes in yeast (Saccharomyces cerevisiae). Coloured components are periodically expressed (dynamic). White components are always expressed (static). Complexes contain both kinds of components (just-in-time assembly). Dynamic complex formation during the yeast cell cycle. Science. 2005 Feb 4;307(5710):724-7. de Lichtenberg U, Jensen LJ, Brunak S, Bork P.

12 Cell cycle regulation 2 Transcriptional regulation is poorly conserved! Lars Juhl Jensen, Thomas Skøt Jensen, Ulrik de Lichtenberg, Søren Brunak and Peer Bork, Nature, Oct 5, 2006 Dynamic components overlap:

13 Phenotype data: human diseases

14 Prediction methods Homology / Alignment Simple pattern (“word”) recognition Statistical methods –Weight matrices: calculate amino acid probabilities –Other examples: Regression, variance analysis, clustering Machine learning –Like statistical methods, but parameters are estimated by iterative training rather than direct calculation –Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)‏ Combinations

15 Simple pattern (“word”) recognition‏ Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence (”KDEL-signal”). Pattern: [KRHQSA]-[DENQ]-E-L> NB: only yes/no answers!

16 Statistical methods ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV Estimate probabilities for nucleotides / amino acids Information content in sequences; logos; Position- Weight Matrices. Quantitative answers.

17 ADDGSLAFVPSEF--SISPGEKIVFKNNAGFPHNIVFDEDSIPSGVDASKISMSEEDLLN TVNGAI--PGPLIAERLKEGQNVRVTNTLDEDTSIHWHGLLVPFGMDGVPGVSFPG---I -TSMAPAFGVQEFYRTVKQGDEVTVTIT-----NIDQIED-VSHGFVVVNHGVSME---I IE--KMKYLTPEVFYTIKAGETVYWVNGEVMPHNVAFKKGIV--GEDAFRGEMMTKD--- -TSVAPSFSQPSF-LTVKEGDEVTVIVTNLDE------IDDLTHGFTMGNHGVAME---V ASAETMVFEPDFLVLEIGPGDRVRFVPTHK-SHNAATIDGMVPEGVEGFKSRINDE---- TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI Sequence profiles Conserved Non-conserved Matching anything but G => large negative score Anything can match TKAVVLTFNTSVEICLVMQGTSIV----AAESHPLHLHGFNFPSNFNLVDPMERNTAGVP TVNGQ--FPGPRLAGVAREGDQVLVKVVNHVAENITIHWHGVQLGTGWADGPAYVTQCPI Unlike Position-Weight Matrices, sequence profiles are able to handle gaps

18 HMM ( Hidden Markov Models ) Which parts of the sequence are –in the cytoplasm? –embedded in the membrane (TM-helices)?‏ –outside? Rule-of-thumb says amino acids in the TM helices are hydrophobic – but there are exceptions TMHMM predicts topology from sequence using a statistical model TMHMM: predicting the topology of α-helix transmembrane proteins

19 I K E E H V I I Q A E H E C IKEEHVIIQAE FYLNPDQSGEF….. Window Input Layer Hidden Layer Output Layer Weights Neural Networks: ”Architecture”

20 Neural networks Neural networks can learn higher order correlations! –What does this mean? A A => 0 A C => 1 C A => 1 C C => 0 No linear function can learn this pattern Neural Networks: advantages

21 How to estimate parameters for prediction?

22 Model selection Linear Regression Quadratic RegressionJoin-the-dots

23 The test set method

24

25

26

27

28 Problem: sequences are related If the sequences in the test set are closely related to those in the training set, we can not measure true generalization performance ALAKAAAAM ALAKAAAAN ALAKAAAAR ALAKAAAAT ALAKAAAAV GMNERPILT GILGFVFTM TLNAWVKVV KLNEPVLLL AVVPFIVSV

29 Protein sorting in eukaryotes Proteins belong in different organelles of the cell – and some even have their function outside the cell Günter Blobel was in 1999 awarded The Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell"

30 Secretory proteins have a signal peptide Initially, they are transported across the ER membrane Protein sorting: secretory pathway / ER

31 Signal peptides A signal peptide is an N- terminal part of the amino acid chain, containing a hydrophobic region. Signal peptides differ between proteins, and can be hard to recognize.

32 SignalP http://www.cbs.dtu.dk/services/SignalP/ Method for prediction of signal peptides from the amino acid sequence Based on Neural Networks The first SignalP paper (Nielsen H, et al., Protein Eng. 10[1]: 1-6, January 1997) was cited 3144 times in a 10 year period


Download ppt "What is bioinformatics?. What are bioinformaticians up to, actually? Manage molecular biological data –Store in databases, organise, formalise, describe..."

Similar presentations


Ads by Google