
Presentation transcript:

Introduction to Pattern Recognition
Prediction in Bioinformatics
What do we want to predict?
–Features from sequence
–Data mining
How can we predict?
–Homology / Alignment
–Pattern Recognition / Statistical Methods / Machine Learning
What is prediction?
–Generalization / Overfitting
–Preventing overfitting: Homology reduction
How do we measure prediction?
–Performance measures
–Threshold selection
Henrik Nielsen
Center for Biological Sequence Analysis
Technical University of Denmark

Sequence → structure → function

Prediction from DNA sequence
Protein-coding genes
–transcription factor binding sites
–transcription start/stop
–translation start/stop
–splicing: donor/acceptor sites
Non-coding RNA
–tRNAs
–rRNAs
–miRNAs
General features
–Structure (curvature/bending)
–Binding (histones etc.)

Prediction from amino acid sequence
Folding / structure
Post-Translational Modifications
–Attachment: phosphorylation, glycosylation, lipid attachment
–Cleavage: signal peptides, propeptides, transit peptides
–Sorting: secretion, import into various organelles, insertion into membranes
Interactions
Function
–Enzyme activity
–Transport
–Receptors
–Structural components
–etc.

Protein sorting in eukaryotes
Proteins belong in different organelles of the cell, and some even have their function outside the cell.
In 1999, Günter Blobel was awarded the Nobel Prize in Physiology or Medicine for the discovery that "proteins have intrinsic signals that govern their transport and localization in the cell".

Data: UniProt annotation of protein sorting
Annotations relevant for protein sorting are found in:
–the CC (comments) lines
–cross-references (DR lines) to GO (Gene Ontology)
–the FT (feature table) lines
ID INS_HUMAN Reviewed; 110 AA.
AC P01308;
...
DE Insulin precursor [Contains: Insulin B chain; Insulin A chain].
GN Name=INS;
...
CC -!- SUBCELLULAR LOCATION: Secreted.
...
DR GO; GO: ; C:extracellular region; IC:UniProtKB.
...
FT SIGNAL
Types of non-experimental qualifiers in the CC and FT lines:
–Potential: Predicted by sequence analysis methods
–Probable: Inconclusive experimental evidence
–By similarity: Predicted by alignment to proteins with known location

Problems in database parsing
Extreme example: A4_HUMAN, Alzheimer disease amyloid protein
CC -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane
CC protein. Note=Cell surface protein that rapidly becomes
CC internalized via clathrin-coated pits. During maturation, the
CC immature APP (N-glycosylated in the endoplasmic reticulum) moves
CC to the Golgi complex where complete maturation occurs (O-
CC glycosylated and sulfated). After alpha-secretase cleavage,
CC soluble APP is released into the extracellular space and the C-
CC terminal is internalized to endosomes and lysosomes. Some APP
CC accumulates in secretory transport vesicles leaving the late Golgi
CC compartment and returns to the cell surface. Gamma-CTF(59) peptide
CC is located to both the cytoplasm and nuclei of neurons. It can be
CC translocated to the nucleus through association with Fe65. Beta-
CC APP42 associates with FRPL1 at the cell surface and the complex is
CC then rapidly internalized. APP sorts to the basolateral surface in
CC epithelial cells. During neuronal differentiation, the Thr-743
CC phosphorylated form is located mainly in growth cones, moderately
CC in neurites and sparingly in the cell body. Casein kinase
CC phosphorylation can occur either at the cell surface or within a
CC post-Golgi compartment.
...
DR GO; GO: ; C:cell surface; IDA:UniProtKB.
DR GO; GO: ; C:extracellular region; TAS:ProtInc.
DR GO; GO: ; C:integral to plasma membrane; TAS:ProtInc.
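To make the parsing problem concrete, here is a minimal sketch of pulling the SUBCELLULAR LOCATION comment out of the CC lines of a flat-file record. The function name `subcellular_location` and the shortened sample record are assumptions made for illustration; the last sample line is a placeholder that is not from the slide, and this is not an official UniProt parser.

```python
TOPIC = "-!- SUBCELLULAR LOCATION:"

def subcellular_location(record_lines):
    """Return (location, note) from the CC lines, or None if absent.

    Hypothetical helper: joins the wrapped CC lines of the
    SUBCELLULAR LOCATION comment and splits off the free-text note.
    """
    parts = []
    in_block = False
    for line in record_lines:
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            # A new comment topic starts; only one topic is of interest.
            in_block = body.startswith(TOPIC)
            if in_block:
                body = body[len(TOPIC):].strip()
        if in_block:
            parts.append(body)
    if not parts:
        return None
    joined = " ".join(parts)
    location, _, note = joined.partition("Note=")
    return location.strip(), note.strip()

# Shortened excerpt of the A4_HUMAN comment shown above.
record = [
    "CC   -!- SUBCELLULAR LOCATION: Membrane; Single-pass type I membrane",
    "CC       protein. Note=Cell surface protein that rapidly becomes",
    "CC       internalized via clathrin-coated pits.",
    "CC   -!- OTHER TOPIC: placeholder line, not from the slide.",
]
location, note = subcellular_location(record)
print(location)
```

Even this simple case shows the difficulty: the controlled location term is short, while the note carries many additional compartments in free text. Words hyphenated across CC lines (such as "O-" followed by "glycosylated") would also need special handling that this sketch does not attempt.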

Prediction methods
Homology / Alignment
Simple pattern recognition
–Example: PROSITE entry PS00014, ER_TARGET: Endoplasmic reticulum targeting sequence. Pattern: [KRHQSA]-[DENQ]-E-L>
Statistical methods
–Weight matrices: calculate amino acid probabilities
–Other examples: Regression, variance analysis, clustering
Machine learning
–Like statistical methods, but parameters are estimated by iterative training rather than direct calculation
–Examples: Neural Networks (NN), Hidden Markov Models (HMM), Support Vector Machines (SVM)
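The PROSITE pattern above can be written as a regular expression. The sketch below assumes the usual PROSITE reading in which `>` anchors the pattern at the C-terminus, so it becomes `$` in the regex; the function name and the test sequences are made up for illustration.

```python
import re

# [KRHQSA]-[DENQ]-E-L>  ->  four residues anchored at the end of the sequence
ER_TARGET = re.compile(r"[KRHQSA][DENQ]EL$")

def has_er_target(sequence):
    """True if the sequence ends with the ER_TARGET motif (e.g. KDEL, HDEL)."""
    return ER_TARGET.search(sequence) is not None

# Hypothetical sequences, not real proteins.
print(has_er_target("MKTAYIAKQRKDEL"))  # motif at the C-terminus
print(has_er_target("KDELMKTAYIAKQR"))  # same residues, but not at the end
```

A pattern like this either matches or it does not; it cannot express that some residues are more likely than others, which is what weight matrices and machine learning methods add.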

Prediction of subcellular localisation from sequence
Homology: threshold about 30%-70% identity
Sorting signals ("zip codes")
–N-terminal: secretory (ER) signal peptides, mitochondrial & chloroplast transit peptides.
–C-terminal: peroxisomal targeting signal 1, ER-retention signal.
–internal: Nuclear localisation signals, nuclear export signals.
Global properties
–amino acid composition, aa pair composition
–composition in limited regions
–predicted structure
–physico-chemical parameters
Combined approaches

Signal-based prediction
Signal peptides
–von Heijne 1983, 1986 [WM]
–SignalP (Nielsen et al. 1997, 1998; Bendtsen et al. 2004) [NN, HMM]
Mitochondrial & chloroplast transit peptides
–Mitoprot (Claros & Vincens 1996) [linear discriminant using physico-chemical parameters]
–ChloroP, TargetP* (Emanuelsson et al. 1999, 2000) [NN]
–iPSORT* (Bannai et al. 2002) [decision tree using physico-chemical parameters]
–Protein Prowler* (Hawkins & Bodén 2006) [NN]
* = also includes signal peptides
Nuclear localisation signals
–PredictNLS (Cokol et al. 2000) [regex]
–NucPred (Heddad et al. 2004) [regex, GA]

Composition-based prediction
–Nakashima and Nishikawa 1994 [2 categories; odds-ratio statistics]
–ProtLock (Cedano et al. 1997) [5 categories; Mahalanobis distance]
–Chou and Elrod 1998 [12 categories; covariant discriminant]
–NNPSL (Reinhardt and Hubbard 1998) [4 categories; NN]
–SubLoc (Hua and Sun 2001) [4 categories; SVM]
–PLOC (Park and Kanehisa 2003) [12 categories; SVM]
–LOCtree (Nair & Rost 2005) [6 categories; SVM incl. regions, structure and profiles]
–BaCelLo (Pierleoni et al. 2006) [5 categories; SVM incl. regions and profiles]
Pro:
–does not require knowledge of signals
–works even if N-terminus is wrong
Con:
–cannot identify isoform differences
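The input to the composition-based methods listed above is, at its simplest, a vector of amino acid frequencies. The following is a minimal sketch of that feature vector only; it is not the feature set of any particular tool named above, and the function name and example peptide are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def composition(sequence):
    """Return a 20-element list of residue frequencies, in AMINO_ACIDS order.

    Non-standard characters are ignored when counting (an assumption made
    for this sketch).
    """
    counts = {aa: 0 for aa in AMINO_ACIDS}
    for residue in sequence.upper():
        if residue in counts:
            counts[residue] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

vector = composition("AAKL")  # hypothetical peptide
print(dict(zip(AMINO_ACIDS, vector))["A"])
```

Because the vector ignores residue order, it is unaffected by a wrongly annotated N-terminus, which matches the "Pro" points above. It also discards the position of any sorting signal, so it cannot make use of signal-specific information.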

A simple statistical method: Linear regression
Observations (training data): a set of x values (input) and y values (output).
Model: y = ax + b (2 parameters, which are estimated from the training data)
Prediction: Use the model to calculate a y value for a new x value
Note: the model does not fit the observations exactly. Can we do better than this?
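As a minimal sketch of the three steps above (observations, model, prediction), the two parameters can be estimated by ordinary least squares. The function name `fit_line` and the data values are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for the model y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical training data: roughly linear, with some noise.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.0]

a, b = fit_line(train_x, train_y)
new_x = 5.0
print("prediction for x = 5:", a * new_x + b)
```

The parameters here are calculated directly from the data in one pass, which is what distinguishes a statistical method from the iterative training used in machine learning.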

Overfitting
y = ax + b: 2-parameter model. Good description, poor fit.
y = ax^6 + bx^5 + cx^4 + dx^3 + ex^2 + fx + g: 7-parameter model. Poor description, good fit.
Note: It is not interesting that a model can fit its observations (training data) exactly. To function as a prediction method, a model must be able to generalize, i.e. produce sensible output on new data.
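The contrast between the two models can be reproduced with seven made-up points. In this sketch the 7-parameter model is obtained by Lagrange interpolation, which is one way to get the unique degree-6 polynomial through seven points; the underlying relation y = 2x + 1 and the noise values are assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Least-squares estimates (a, b) for y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return a, mean_y - a * mean_x

def interpolate(xs, ys, x):
    """Evaluate the degree-(n-1) polynomial through all n points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def true_y(x):
    return 2.0 * x + 1.0  # assumed underlying relation

train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
noise = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5]  # made-up noise
train_y = [true_y(x) + e for x, e in zip(train_x, noise)]

a, b = fit_line(train_x, train_y)
for x in [0.5, 7.0]:  # new data, not used for fitting
    line_error = abs(a * x + b - true_y(x))
    poly_error = abs(interpolate(train_x, train_y, x) - true_y(x))
    print(f"x={x}: line error {line_error:.2f}, polynomial error {poly_error:.2f}")
```

The polynomial reproduces every training point exactly, yet on the new x values its error is far larger than that of the straight line, because it has fitted the noise rather than the trend.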

A classification problem
How complex a model should we choose? This depends on:
–The real complexity of the problem
–The size of the training data set
–The amount of noise in the data set

How to estimate parameters for prediction?

Model selection
[Figure panels: Linear Regression, Quadratic Regression, Join-the-dots]

The test set method

Cross Validation

Which kind of Cross Validation? Note: Leave-one-out is also known as jack-knife
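The slides name the variants without spelling out the mechanics, so here is a minimal sketch of how k-fold cross validation partitions a data set. The function name `k_fold_splits` and the round-robin assignment of examples to folds are choices made for this illustration. Leave-one-out is the special case where k equals the number of examples.

```python
def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) for each of the k folds.

    Examples are assigned to folds round-robin; every example is used
    for testing exactly once and for training in the other k-1 rounds.
    """
    folds = [list(range(i, n_examples, k)) for i in range(k)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n_examples) if i not in held_out]
        yield train_idx, test_idx

# 5-fold cross validation on 10 examples
for train_idx, test_idx in k_fold_splits(10, 5):
    print("train:", train_idx, "test:", test_idx)

# Leave-one-out: k equals the number of examples
leave_one_out = list(k_fold_splits(4, 4))
```

In each round the model is fitted on the training indices only and evaluated on the held-out fold; the k test results are then combined into one performance estimate.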

Problem: sequences are related
If the sequences in the test set are closely related to those in the training set, we cannot measure true generalization performance.
ALAKAAAAM
ALAKAAAAN
ALAKAAAAR
ALAKAAAAT
ALAKAAAAV
GMNERPILT
GILGFVFTM
TLNAWVKVV
KLNEPVLLL
AVVPFIVSV

Solution: Homology reduction
The Hobohm algorithm
–Calculate all pairwise similarities in the data set
–Define a threshold for being "neighbours" (too closely related)
–Calculate numbers of neighbours for each example, and remove the example with most neighbours
–Repeat until there are no examples with neighbours left
Alternative: Homology partitioning
–keep all examples, but cluster them so that no neighbours end up in the same fold
–Should be combined with weighting
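The removal loop described above can be sketched as follows. The similarity measure used here (fraction of identical positions between equal-length peptides) and the threshold of 0.8 are assumptions made for illustration; real applications would use alignment-based similarity. The peptides are the ones from the previous slide.

```python
def identity(a, b):
    """Fraction of identical positions; assumes equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def homology_reduce(sequences, threshold):
    """Repeatedly remove the example with most neighbours until none remain."""
    remaining = list(sequences)
    while True:
        neighbour_counts = [
            sum(1 for j, other in enumerate(remaining)
                if j != i and identity(seq, other) > threshold)
            for i, seq in enumerate(remaining)
        ]
        if not neighbour_counts or max(neighbour_counts) == 0:
            return remaining
        # Ties are broken by taking the first example with the maximum count.
        remaining.pop(neighbour_counts.index(max(neighbour_counts)))

peptides = [
    "ALAKAAAAM", "ALAKAAAAN", "ALAKAAAAR", "ALAKAAAAT", "ALAKAAAAV",
    "GMNERPILT", "GILGFVFTM", "TLNAWVKVV", "KLNEPVLLL", "AVVPFIVSV",
]
reduced = homology_reduce(peptides, threshold=0.8)
print(reduced)
```

With these inputs the five ALAKAAAA variants are mutual neighbours (8 of 9 positions identical), so only one of them survives, while the five unrelated peptides are all kept.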

Defining a threshold for homology reduction
First approach: two sequences are too closely related if the prediction problem can be solved by alignment
The Sander/Schneider curve:
–For protein structure prediction, 70% identical classification of secondary structure means prediction by alignment is possible
–This corresponds to about 25% identical amino acids in a local alignment > 80 positions

Defining a threshold for homology reduction
Second approach: two sequences are too closely related if their homology is statistically significant
The Pedersen / Nielsen / Wernersson curve:
–Use the extreme value distribution to define the BLAST score at which the similarity is stronger than random