Computational analysis of membrane proteins implicated in metal transport in Arabidopsis thaliana
Stefanie Hartmann
Max Planck Institute for Molecular Plant Physiology
Supervisors: Joachim Selbig, Ute Krämer
[Figure: excerpt of a multiple sequence alignment of CDF proteins]
12 membrane proteins involved in metal transport in Arabidopsis
Metal transporters are of great importance because
- they provide an adequate supply of essential trace metals
- they prevent an excess of these potentially toxic ions
In silico analyses may help design further experiments for
- basic research on metal homeostasis
- development of new ways of phytoremediation
Cation Diffusion Facilitator (CDF) proteins
- are also referred to as cation efflux (CE) proteins
- occur in archaea, bacteria, and eukaryotes
- are involved in transporting heavy metals (Co2+, Cd2+, Zn2+, Ni2+)
- the CDF family of proteins had 13 members in 1997
- the CE Pfam family had 348 members in July 2003 and 426 in January 2004
- CDF signature sequence: S-X-[ASG]-[LIVMT](2)-[SAT]-[DA]-[SGAL]-[LIVFYA]-[HDN]-X(3)-D-X(2)-[AS]
The Arabidopsis thaliana CDF protein family
Protein  Locus      Signature region      Mismatches to the CDF signature
CDF1     At2g46800  S LAILTDAAHLLS D VAA  0 (exact match)
CDF2     At3g61940  S LAILADAAHLLT D VGA  0 (exact match)
CDF3     At3g58810  S LAILTDAAHLLS D VAA  0 (exact match)
CDF4     At2g29410  S LAVMTDAAHLLS D VAG  1
CDF5     At2g04620  S LGLISDACHMLF D CAA  1
CDF6     At2g47830  S TAIIADAAHSVS D VVL  1
CDF7     At2g39450  S LAIIASTLDSLL D LLS  2
CDF8     At1g16310  S MAVIASTLDSLL D LLS  2
CDF9     At1g79520  S MAVIASTLDSLL D LLS  2
CDF10    At3g58060  S IAIAASTLDSLL D LMA  3
CDF11    At3g12100  R VGLVSDAFHLTF G CGL  3
CDF12    At1g51610  S HVIMAEVVHSVA D FAN  4
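The mismatch counts can be checked mechanically. The following is a minimal sketch, not the procedure used in the talk: it encodes the CDF signature position by position and counts, for each of the 17-residue regions listed on the slide, how many positions fall outside the allowed residues. The names `SIGNATURE`, `REGIONS`, and `count_mismatches` are illustrative assumptions.

```python
# Each signature position is a string of allowed residues; None stands for X (any residue).
# Signature: S X (ASG) (LIVMT)2 (SAT) (DA) (SGAL) (LIVFYA) (HDN) X3 D X2 (AS)
SIGNATURE = [
    "S", None, "ASG", "LIVMT", "LIVMT", "SAT", "DA", "SGAL",
    "LIVFYA", "HDN", None, None, None, "D", None, None, "AS",
]

# Signature regions as listed on the slide (spaces removed).
REGIONS = {
    "CDF1": "SLAILTDAAHLLSDVAA",
    "CDF2": "SLAILADAAHLLTDVGA",
    "CDF3": "SLAILTDAAHLLSDVAA",
    "CDF4": "SLAVMTDAAHLLSDVAG",
    "CDF5": "SLGLISDACHMLFDCAA",
    "CDF6": "STAIIADAAHSVSDVVL",
    "CDF7": "SLAIIASTLDSLLDLLS",
    "CDF8": "SMAVIASTLDSLLDLLS",
    "CDF9": "SMAVIASTLDSLLDLLS",
    "CDF10": "SIAIAASTLDSLLDLMA",
    "CDF11": "RVGLVSDAFHLTFGCGL",
    "CDF12": "SHVIMAEVVHSVADFAN",
}


def count_mismatches(region, signature=SIGNATURE):
    """Count positions where the residue is not among the allowed residues."""
    if len(region) != len(signature):
        raise ValueError("region and signature must have the same length")
    return sum(
        1
        for residue, allowed in zip(region, signature)
        if allowed is not None and residue not in allowed
    )


mismatches = {name: count_mismatches(region) for name, region in REGIONS.items()}

for name, count in mismatches.items():
    print(f"{name}: {count} mismatch(es)")
```

Positions marked X are skipped, so only the 11 constrained positions can contribute a mismatch.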
Research questions
1. Can all 12 proteins be classified as CDF proteins, i.e., are there predicted structural and functional similarities among these 12 Arabidopsis proteins?
- approach: secondary structure prediction, inclusion in membrane and transporter databases, evaluation of common motifs, etc.
2. What are the relationships of the 12 Arabidopsis proteins among each other and to other published sequences?
- approach: intron/exon structure, phylogenetic reconstructions
3. Is it possible to predict the 3D structure of these proteins?
- approach: fold recognition by threading
Sequence retrieval - four ambiguous sequences
- databases: TIGR Arabidopsis thaliana database; TAIR: The Arabidopsis Information Resource; MIPS Arabidopsis thaliana genome database
- differences: different assignment of introns, use of alternative start codons
Sequence analysis - three additional ambiguous sequences
- SWALL/Pfam vs. TIGR/TAIR/MIPS
- differences: insertions and deletions, different amino acid sequence
Cloning and RT-PCR revealed correct sequences for six of the seven ambiguous CDFs
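Inclusion in membrane and transport databases
Databases checked:
- Pfam entry PF01545 (cation efflux)
- Arabidopsis Membrane Protein Library (AMPL)
- ARAMEMNON
- Transport Protein Database
- PlantsT
[Table: inclusion of CDF1 to CDF12 in each database. The check marks were lost in extraction. Surviving "-" entries: CDF5 (one), CDF7 (one), CDF8 (two), CDF9 (two), CDF10 (one), CDF12 (one). Parenthesized marks were shown for CDF5 to CDF9 (four each) and for CDF10 to CDF12 (two each).]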
Hidden Markov models used for secondary structure prediction
- states (loops, transmembrane domains, etc.) are defined
- states are connected in a biologically reasonable way (transitions)
- each state has a specific probability distribution over the 20 amino acids
- each transition has a specific transition probability
- amino acid probabilities and transition probabilities are learned
- models are first trained using a training set; the trained model is then used for the prediction
[Figure: model layout with cytoplasmic side, membrane, and non-cytoplasmic side]
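To make the idea of states, transitions, and emission probabilities concrete, here is a toy sketch of Viterbi decoding with three states (cytoplasmic loop "i", membrane "M", non-cytoplasmic loop "o"). It is far simpler than TMHMM or HMMTOP: residues are reduced to hydrophobic vs. other, and every probability below is an illustrative assumption rather than a trained value.

```python
import math

# Assumption: a crude two-letter alphabet (hydrophobic vs. other residues).
HYDROPHOBIC = set("AILMFVWC")
STATES = ("i", "M", "o")  # cytoplasmic loop, membrane, non-cytoplasmic loop

# Illustrative, untrained probabilities.
START = {"i": 0.5, "M": 0.0, "o": 0.5}
TRANS = {
    "i": {"i": 0.90, "M": 0.10, "o": 0.00},
    "M": {"i": 0.05, "M": 0.90, "o": 0.05},
    "o": {"i": 0.00, "M": 0.10, "o": 0.90},
}
P_HYDROPHOBIC = {"i": 0.3, "M": 0.8, "o": 0.3}


def safe_log(p):
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def emission(state, residue):
    """Probability that a state emits a hydrophobic or a non-hydrophobic residue."""
    p = P_HYDROPHOBIC[state]
    return p if residue in HYDROPHOBIC else 1.0 - p


def viterbi(sequence):
    """Return the most probable state path for the sequence as a string."""
    if not sequence:
        return ""
    # scores[s]: best log-probability of any path that ends in state s
    scores = {s: safe_log(START[s]) + safe_log(emission(s, sequence[0])) for s in STATES}
    backpointers = []
    for residue in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: scores[p] + safe_log(TRANS[p][s]))
            new_scores[s] = (
                scores[best_prev]
                + safe_log(TRANS[best_prev][s])
                + safe_log(emission(s, residue))
            )
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


toy_sequence = "KDEKDE" + "L" * 20 + "KDEKDE"
print(viterbi(toy_sequence))
```

In this toy model the two loop states are symmetric, so the side of the membrane is not meaningfully predicted; real predictors distinguish the sides with separately trained loop states.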
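Results of secondary structure predictions
Programs: TMHMM v2 (Sonnhammer et al. 1998), HMMTOP v2 (Tusnady and Simon, 1998, 2001), Memsat2 (Jones et al. 1994, McGuffin et al. 2000)
Protein  Number of TMD    N-terminus within cytoplasm (programs in agreement)
CDF1     6                2 / 3
CDF2     6                3 / 3
CDF3     6                2 / 3
CDF4     5-6              2 / 3
CDF5     6                3 / 3
CDF6     0-6              1 / 3
CDF7     4-6              2 / 3
CDF8     5-6              3 / 3
CDF9     5-6              3 / 3
CDF10    [not recovered]  [not recovered] / 3
CDF11    6                3 / 3
CDF12    [not recovered]  [not recovered] / 3
Additional value on the slide without a recoverable position: (14)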
[Figure: positions of the CDF signature and the CE signature]
Prediction of subcellular localization
- mTP: mitochondrial targeting peptide
- cTP: chloroplast transit peptide
- SP: signal peptide (ER/secretory pathway)
Prediction of subcellular localization - methods
- N-terminal sorting signals display characteristic amino acid compositions
- sequence-based methods for predicting N-terminal sorting signals are based on this observation
Program      Predicts      Approach
TargetP      mTP, cTP, SP  neural network-based
iPSORT       mTP, cTP, SP  decision list
Predotar     mTP, cTP      neural network-based
SignalP NN   SP            neural network-based
SignalP HMM  SP            based on hidden Markov models
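The observation that N-terminal sorting signals have characteristic amino acid compositions can be illustrated with a simple composition count. This is only a sketch of the kind of input feature such predictors build on; it is not how TargetP, iPSORT, Predotar, or SignalP work internally. The function name, the window length of 30 residues, and the example sequence are assumptions.

```python
from collections import Counter


def n_terminal_composition(sequence, window=30):
    """Return the relative frequency of each residue in the first `window` residues."""
    n_term = sequence[:window].upper()
    if not n_term:
        raise ValueError("sequence must not be empty")
    counts = Counter(n_term)
    return {residue: counts[residue] / len(n_term) for residue in sorted(counts)}


# Hypothetical N-terminus, made up for illustration only.
example = "MRRSAMRRSAGGLLKKDDEE"
composition = n_terminal_composition(example, window=10)
for residue, fraction in composition.items():
    print(f"{residue}: {fraction:.2f}")
```

A real predictor would feed such features (or the raw residues) into a trained model; the composition alone does not classify a sequence.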
Prediction of subcellular localization - results
Columns on the slide: TargetP, iPSORT, Predotar, SignalP NN, SignalP HMM. Empty cells collapsed in extraction, so entries are listed in their original order without a reliable column assignment.
- CDF1, CDF3, CDF4, CDF7, CDF9, CDF10, CDF11: no entries survive
- CDF2: 3/4
- CDF5: cTP, 1/4
- CDF6: mTP, cTP, mTP
- CDF8: cTP*, mTP*, 2/4*, Y*
- CDF12: mTP
Exon structure of the CDF proteins
[Figure: number of exons per CDF gene]
Gene organization of the CDF proteins
[Figure: gene models shown in the order CDF1, CDF2, CDF3, CDF4, CDF5, CDF11, CDF6, CDF12, CDF7, CDF8, CDF9, CDF10]
Phylogenetic Relationships within Cation Transporter Families of Arabidopsis
Plant Physiology 2001; 126 (4): 1646–1667
[Figure: tree from this publication containing CDF4, CDF3, CDF2, CDF1, CDF12, CDF10, CDF11, CDF6]
Omitted: CDFs 5, 7, 8, 9
Phylogenetic analysis of the Arabidopsis CDF proteins
Phylogenetic analysis of sequences containing the CE signature
[Figure: phylogenetic tree with the following labeled clusters]
- Arabidopsis group I sequences, monocot and dicot sequences, mammalian metal transporters
- Arabidopsis group II sequences, monocot and dicot sequences, prokaryotic and eukaryotic sequences
- several two-domain proteins
- outgroup
Working model: topology of Arabidopsis CDF proteins
[Figure: topology model showing the N- and C-termini, the CDF signature sequence, the cytoplasm, and the cell exterior/organelle]
Information derived from the 3D structure of a protein
- assignment of function
- guidance for mutagenesis experiments
- ligand and functional sites
- evolutionary relationships
- residue solvent exposure
- putative interaction sites
Structure determination
1. Classical approaches
- X-ray crystallography
- NMR spectroscopy
2. Computational approaches
- comparative (“homology”) modeling
- fold recognition (“threading”)
- ab initio methods
The basis of fold recognition (“threading”)
- The number of folds occurring in nature is limited
- There are many sequences with no significant sequence identity but with the same or similar folds
  …HEAIDHKPKLTGMKTGRVVSSMKSNFFADLP…
  …HDGRSSMTRFSRYFRKTGRVSEYYKKQERLLE…
[Figure: PDB statistics]
Fold recognition methods
Aim: to find an optimal sequence-structure alignment
1. “Threading” of an unknown target sequence into the backbone structure of template proteins of known structure
[Figure: target sequence …CLVFMSVEVVGGIKANSLAILTD… threaded onto a template backbone]
Fold recognition methods
2. Evaluation of the compatibility between target sequence and proposed 3D structure
- using environment-based mean force potentials, or
- using knowledge-based mean force potentials
3. Output: a list of folds (sorted or unsorted) with their “compatibility score”; sometimes other information such as SCOP descriptors, the alignment, a rudimentary 3D model of the query protein, raw scores, solvation energy for the model, and links
[Figure label: 4.99 Å]
No new insights regarding the structure of CDF proteins
- Membrane proteins are significantly under-represented in structural databases, and therefore also in fold libraries
- If there is no fold similar to the native fold of the target protein, this approach cannot succeed
- Threading methods cannot be used for modeling of transmembrane proteins
Will the 3D structure of CDFs be available soon?
- for fold recognition methods to be used successfully, significantly more 3D structures of membrane proteins are needed
- fold recognition methods specifically for integral membrane proteins may eventually be developed
- crystallization of bacterial homologs and subsequent extrapolation of structural features as an alternative?
- approach for globular proteins: predicting a protein’s solubility and propensity to crystallize, based on results from high-throughput structure determination
Can threading results be used as an independent way to verify group assignment?
1. Which hits were common to which of the CDF sequences? Were some structural hits specific for any of the CDF groups?
2. “Phylothreading”
Which hits were common to which of the CDF sequences?
Structural hits predicted
- for most CDF sequences
- for group I sequences
- for group II sequences
- for CDF5 and CDF11
- for CDF6 and CDF12
Results were unable to provide evidence to verify group assignments based on other methods
[Figure: hits per sequence, columns 1 to 12]
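The question of which hits are shared by which sequences amounts to a set comparison. The sketch below shows one way this bookkeeping could be done; it is not the procedure used in the talk. All sequence names, fold identifiers, and group memberships in it are hypothetical placeholders, not results.

```python
from collections import defaultdict


def sequences_per_hit(hits_per_sequence):
    """Invert {sequence: set of hits} into {hit: frozenset of sequences}."""
    inverted = defaultdict(set)
    for sequence, hits in hits_per_sequence.items():
        for hit in hits:
            inverted[hit].add(sequence)
    return {hit: frozenset(seqs) for hit, seqs in inverted.items()}


def hits_specific_to(group, hit_to_sequences, min_members=2):
    """Hits returned for at least `min_members` group members and for no outsider."""
    group = frozenset(group)
    return sorted(
        hit
        for hit, seqs in hit_to_sequences.items()
        if seqs <= group and len(seqs) >= min_members
    )


# Hypothetical threading output: sequence name -> set of fold identifiers.
hypothetical_hits = {
    "seqA": {"fold1", "fold2", "fold3"},
    "seqB": {"fold1", "fold2"},
    "seqC": {"fold1", "fold4"},
    "seqD": {"fold4", "fold5"},
}

hit_map = sequences_per_hit(hypothetical_hits)
print(hits_specific_to({"seqA", "seqB"}, hit_map))  # hits shared only within this group
print(hits_specific_to({"seqC", "seqD"}, hit_map))
```

Requiring at least two group members avoids counting a hit that occurs for a single sequence as group-specific; that threshold is a design choice of this sketch.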
“Phylothreading” Phylothreading results can neither verify nor refute group assignments based on other methods
Threading: non-transmembrane CDF fragments
- N-terminus
- histidine-rich loop between TMD 4 and 5
- C-terminus
[Figure: topology model with N- and C-termini, cytoplasm, and cell exterior/organelle]
“Phylothreading”: CDF C-terminal fragments “phylothreading” results confirm the assignment of CDF sequences to groups that were based on independent methods
Conclusions
- The 12 Arabidopsis protein sequences reveal structural and therefore probably functional conservation
- My results support the classification of these proteins as CDF metal transporters
- I propose that the CDF protein family of A. thaliana contains two groups, each containing at least four proteins that are structurally and functionally closely related
- Threading methods cannot (yet) be used for transmembrane proteins or for their non-transmembrane domains
- Threading results for multiple sequences may be used to confirm (or find?) relationships among these sequences (“phylothreading”)
- I was able to evaluate and compare a number of online tools that are available for the analysis of sequence data
Conclusions
1. Sequence retrieval revealed conflicting information for 7 of the 12 proteins
2. The 12 Arabidopsis protein sequences reveal striking structural and therefore probably functional conservation
3. My results support the classification of these proteins as CDF metal transporters
4. I propose that the CDF protein family of A. thaliana contains two groups, each containing four proteins that are structurally and functionally closely related
5. I was able to evaluate and compare a variety of online tools available for the analysis of sequence data
6. Threading methods cannot (yet) be used for transmembrane proteins or for their non-transmembrane domains
7. Threading results for multiple sequences can be used to confirm (or find?) relationships among these sequences (“phylothreading”)
METHODS
Phylogenetic analysis: tree-building methods
- distance-based methods: the overall distances between all pairs of sequences are calculated and then used to calculate a tree (Neighbor Joining)
- character-based methods: the individual substitutions among the sequences are used to determine the most likely ancestral relationships (Maximum Parsimony, Maximum Likelihood)
- Bayesian inference of phylogenies
Alignment excerpt:
...CLVFMSVEVVGGIKANSLAILTD...
...NTAYMVVEFVAGFMSNSLGLISD...
...CLLFMSIEVVCGIKANSLAILAD...
...CAIFIVVEVVGGIKANSLAILTD...
...YLIVMSVQIVGGFKANSLAVMTD...
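The first step of a distance-based method, computing distances between all pairs of sequences, can be sketched with the five alignment fragments shown on the slide. The sketch uses the uncorrected p-distance (fraction of differing positions), which is an assumption; the talk does not state which distance measure was used, and the tree-building step itself (e.g. Neighbor Joining) is not implemented here. The labels `seq1` to `seq5` are placeholders.

```python
from itertools import combinations

# The five aligned fragments from the slide, with placeholder labels.
FRAGMENTS = {
    "seq1": "CLVFMSVEVVGGIKANSLAILTD",
    "seq2": "NTAYMVVEFVAGFMSNSLGLISD",
    "seq3": "CLLFMSIEVVCGIKANSLAILAD",
    "seq4": "CAIFIVVEVVGGIKANSLAILTD",
    "seq5": "YLIVMSVQIVGGFKANSLAVMTD",
}


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Pairwise distance table keyed by (name, name).
distances = {
    (m, n): p_distance(FRAGMENTS[m], FRAGMENTS[n])
    for m, n in combinations(FRAGMENTS, 2)
}

for (m, n), d in distances.items():
    print(f"{m} - {n}: {d:.3f}")
```

A table like this would be the input to a tree-building algorithm; with fragments this short, the distances are only illustrative.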
Phylogenetic analysis: statistical evaluation of trees
Bootstrap analysis: how much support exists for particular branches in a phylogeny?
1. tree construction, determination of the “best” tree
2. bootstrap datasets (pseudosamples) are created from the original dataset by random sampling with replacement
3. tree construction using the bootstrap datasets
4. comparison of the bootstrap tree with the inferred tree
5. this is repeated several hundred times
6. bootstrap value: percentage of times an interior branch in the bootstrap tree was the same as the one in the inferred tree
Alignment excerpt:
...CLVFMSVEVVGGIKANSLAILTD...
...NTAYMVVEFVAGFMSNSLGLISD...
...CLLFMSIEVVCGIKANSLAILAD...
...CAIFIVVEVVGGIKANSLAILTD...
...YLIVMSVQIVGGFKANSLAVMTD...
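The resampling in steps 2 to 6 can be sketched as follows. To keep the example self-contained, full tree construction is replaced by a much simpler stand-in criterion: a grouping counts as recovered when two sequences are strictly closer to each other (by p-distance) than either is to any other sequence. That stand-in, the sequence labels, the number of replicates, and the seed are all assumptions made for illustration.

```python
import random

ALIGNMENT = {
    "seq1": "CLVFMSVEVVGGIKANSLAILTD",
    "seq2": "NTAYMVVEFVAGFMSNSLGLISD",
    "seq3": "CLLFMSIEVVCGIKANSLAILAD",
    "seq4": "CAIFIVVEVVGGIKANSLAILTD",
    "seq5": "YLIVMSVQIVGGFKANSLAVMTD",
}


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootstrap_alignment(alignment, rng):
    """Create one pseudosample by drawing alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def are_closest_pair(alignment, a, b):
    """Stand-in for 'branch present in the tree': a and b are strictly closer
    to each other than either is to any other sequence."""
    d_ab = p_distance(alignment[a], alignment[b])
    others = [name for name in alignment if name not in (a, b)]
    return all(
        d_ab < p_distance(alignment[x], alignment[o])
        for x in (a, b)
        for o in others
    )


def bootstrap_support(alignment, a, b, replicates=500, seed=1):
    """Percentage of pseudosamples in which the grouping of a and b is recovered."""
    rng = random.Random(seed)
    hits = sum(
        are_closest_pair(bootstrap_alignment(alignment, rng), a, b)
        for _ in range(replicates)
    )
    return 100.0 * hits / replicates


support = bootstrap_support(ALIGNMENT, "seq1", "seq4")
print(f"support for grouping seq1 with seq4: {support:.1f}%")
```

In a real analysis, each pseudosample would be passed to the same tree-building method as the original alignment, and each interior branch of the inferred tree would receive its own bootstrap value.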
Fold recognition methods
2. Evaluation of the compatibility between target sequence and proposed 3D structure
Using environment-based mean force potentials (Bowie, Fischer, Eisenberg):
- residue positions are categorized into environment classes
- the 3D protein structure is converted into a 1D sequence
- an alignment of this 1D string to the target sequence is generated
Using knowledge-based mean force potentials (Sippl):
- information is automatically learned from databases of protein structures
- pairwise interactions between structurally adjacent residues are calculated
- transformation of mean force potentials as a function of distance
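The knowledge-based idea of learning a potential from structure databases can be sketched with the inverse Boltzmann relation, E(r) = -kT ln(f_obs(r) / f_ref(r)), which is commonly used to turn observed distance frequencies into a potential of mean force. This is a minimal sketch under that assumption, not the implementation of any of the threading programs named in this talk; the distance bins and all counts are hypothetical.

```python
import math


def mean_force_potential(pair_counts, reference_counts, kT=1.0):
    """Convert distance-bin counts into energies via the inverse Boltzmann relation.

    pair_counts: how often a specific residue pair is observed in each distance bin
    reference_counts: how often any residue pair is observed in each distance bin
    Returns one energy per bin, in units of kT.
    """
    if len(pair_counts) != len(reference_counts):
        raise ValueError("both count lists must cover the same distance bins")
    if min(pair_counts) <= 0 or min(reference_counts) <= 0:
        raise ValueError("all counts must be positive (consider pseudocounts)")
    pair_total = sum(pair_counts)
    reference_total = sum(reference_counts)
    energies = []
    for observed, reference in zip(pair_counts, reference_counts):
        f_obs = observed / pair_total
        f_ref = reference / reference_total
        energies.append(-kT * math.log(f_obs / f_ref))
    return energies


# Hypothetical counts for three distance bins (e.g. short, medium, long range).
hypothetical_pair_counts = [10, 30, 60]
hypothetical_reference_counts = [20, 30, 50]
energies = mean_force_potential(hypothetical_pair_counts, hypothetical_reference_counts)
for index, energy in enumerate(energies):
    print(f"bin {index}: {energy:+.3f} kT")
```

Bins in which the pair is observed more often than expected from the reference receive a negative (favorable) energy; a threading score then sums such energies over the residue pairs of a sequence placed on a template.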
Fold recognition methods
2. Evaluation of the compatibility between target sequence and proposed 3D structure
- using environment-based mean force potentials*, or
- using knowledge-based mean force potentials*
* Mean force potentials: distance-dependent forces that act between atoms or residues (electrostatic and van der Waals interactions, influences of the surrounding medium on these interactions, contacts between two or three amino acids, angles between residue pairs, …)
[Figure label: 4.99 Å]
Threading methods used
- UCLA-DOE Fold Server: P. Mallick et al., 2002 (BLAST, PSI-BLAST, SDP, DASEY)
- Threader: D.T. Jones et al., 1992
- mGenThreader: L.J. McGuffin & D.T. Jones
- 3D-PSSM: L.A. Kelley et al., 2000
- Arby: I. Sommer et al., unpublished (PSI-BLAST, 123D, Jprop)
Selection of structural hits for further analysis
Methods: UCLA-DOE, Threader, mGenThreader, 3D-PSSM, Arby
Selection rules as listed on the slide (the assignment of rules to methods did not survive extraction):
- top 10 structural hits are returned, all were kept
- compatibility of target sequence and all 2000 available templates is evaluated; lists were sorted by Z-value, approximately best hits were kept
- top 20 structural hits are returned, all were kept
- a list of the best scores is returned; the corresponding hits were extracted from a large table
Evaluation of the top score for each CDF sequence
Score categories per method:
- UCLA: very poor score, poor score, borderline significant, significant, very significant
- Threader: no guidelines for scores
- mGenThreader: guess, low confidence, medium confidence, high confidence, certain
- 3D-PSSM: highly confident, worthy of attention
[Figure: top score of each CDF sequence for each method]
There is no consensus on the top fold predicted by different methods
Example: top two structural hits for CDF1
Method        PDB ID  Protein
Threader      1ONE    phosphopyruvate hydrolase
Threader      1C3Q    thiazole kinase
mGenThreader  1L8M    his-rich protein (model)
mGenThreader  1QGR    importin beta
UCLA-DOE      1B8F    histidine ammonia-lyase
UCLA-DOE      1HFA    clathrin assembly protein
3D-PSSM       1PW4    glycerol-3-phosphate transporter
3D-PSSM       1KPW    green cone pigment
Arby          1HZX    bovine rhodopsin
Arby          1EZV    yeast cytochrome bc1
No new insights regarding the structure of CDF proteins
- Membrane proteins are significantly under-represented in structural databases, and therefore also in fold libraries
- If there is no fold similar to the native fold of the target protein, this approach cannot succeed
- Threading methods cannot be used for modeling approaches
Threading results: C-termini
1. Structural information
- no information on domains for metal transport is available
- BUT: several of the returned hits are proteins in which bound metals have structural or catalytic roles
2. Verification of group assignment
i. Hits predicted for more than one C-terminus: 48 folds
- specific for group I: 3
- specific for group II: 2
- specific for CDF5 and CDF11: 2
ii. “Phylothreading”
Positions of conserved domains and signature sequences
[Figure: positions of TMD I to VI, the Pfam CE signature, the CDF signature, and BLOCKS (eMOTIF) motifs along the protein; residual labels: "12" and "10, 11"]
Arabidopsis CDF proteins
group I:
- contain a his-rich region between TMD 4 and 5
- one member is confirmed to transport Zn ions
- genome structure conserved (no introns)
group II:
- lack the his-rich region between TMD 4 and 5
- proteins may transport Mn ions
- C-terminal regions differ from group I sequences
no group assignment:
- CDF6, CDF12: possibly distant common ancestry and mitochondrial localization
- CDF5, CDF11: close relationship also in the Pfam tree
Phylogenetic analysis: tree-building methods
- maximum parsimony methods: the best tree topology minimizes the total amount of evolutionary change that has occurred
- distance methods: the best tree topology minimizes the total distance among taxa
- maximum likelihood methods: given a particular substitution model and a particular tree, how likely are the observed data?
Alignment excerpt:
...CLVFMSVEVVGGIKANSLAILTD...
...NTAYMVVEFVAGFMSNSLGLISD...
...CLLFMSIEVVCGIKANSLAILAD...
...CAIFIVVEVVGGIKANSLAILTD...
...YLIVMSVQIVGGFKANSLAVMTD...
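Inclusion in membrane and transport databases
Databases: Pfam entry PF01545 (cation efflux), Arabidopsis Membrane Protein Library (AMPL), ARAMEMNON, Transport Protein Database, PlantsT
The Pfam column did not survive extraction. The last column lists the surviving Transport Protein Database and PlantsT entries in their original order; where only one entry survives, its column is not recoverable.
Protein  AMPL                     ARAMEMNON                    Transport Protein Database / PlantsT
CDF1     CDF                      zinc transporter             CDF
CDF2     CDF                      putative MTP                 CDF
CDF3     CDF                      putative MTP                 CDF
CDF4     CDF                      putative MTP                 CDF
CDF5     singleton (CDF related)  putative cation transporter  CDF, -
CDF6     singleton                unknown protein              CDF
CDF7     family                   unknown protein              CDF, -
CDF8     family                   hypothetical protein         -, -
CDF9     family                   unknown protein              -, -
CDF10    family                   putative MTP                 -, CDF
CDF11    singleton                putative MTP                 CDF
CDF12    singleton                putative MTP                 -, CDF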