Bioinformatics databases & sequence retrieval Content of lecture I.Introduction II.Bioinformatics data & databases III.Sequence Retrieval with MRS Celia.


1 Bioinformatics databases & sequence retrieval
Content of lecture
  I. Introduction
  II. Bioinformatics data & databases
  III. Sequence Retrieval with MRS
Celia van Gelder, CMBI, UMC Radboud, September 2015

2 ©CMBI 2009 I. Bioinformatics questions
Lookup
  - Is the gene known for my protein (or vice versa)?
  - What sequence patterns are present in my protein?
  - To what class or family does my protein belong?
Compare
  - Are there sequences in the database which resemble the protein I cloned?
  - How can I optimally align the members of this protein family?
Predict
  - Can I predict the active site residues of this enzyme?
  - Can I predict a (better) drug for this target?
  - How can I predict the genes located on this genome?

3 ©CMBI 2009 Sequence similarity
  MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
  WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
  VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
  GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
  DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
Imagine you sequenced this human protein. You know it is a serine protease. Which residues belong to the active site? Is its sequence similar to the mouse serine protease?

4 ©CMBI 2009 Sequence Alignment
  MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE
  MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ
  *::*.**** **. :. : *:**:*** :.** * *.* *********: ****** *::

  WPWIVSIQKN GTHHCAGSLL TSRWVITAAH CFKDNLNKPY LFSVLLGAWQ LGNPGSRSQK
  WPWIVSILKN GSHHCAGSLL TNRWVVTAAH CFKSNMDKPS LFSVLLGAWK LGSPGPRSQK
  ******* ** *:******** *.***:**** ***.*::** *********: **.**.****

  VGVAWVEPHP VYSWKEGACA DIALVRLERS IQFSERVLPI CLPDASIHLP PNTHCWISGW
  VGIAWVLPHP RYSWKEGTHA DIALVRLEHS IQFSERILPI CLPDSSVRLP PKTDCWIAGW
  **:*** *** ******: * ********:* ******:*** ****:*::** *:*.***:**

  GSIQDGVPLP HPQTLQKLKV PIIDSEVCSH LYWRGAGQGP ITEDMLCAGY LEGERDACLG
  GSIQDGVPLP HPQTLQKLKV PIIDSELCKS LYWRGAGQEA ITEGMLCAGY LEGERDACLG
  ********** ********** ******:*. ********. ***.****** **********

  DSGGPLMCQV DGAWLLAGII SWGEGCAERN RPGVYISLSA HRSWVEKIVQ GVQLRGRAQG
  DSGGPLMCQV DDHWLLTGII SWGEGCAD-D RPGVYTSLLA HRSWVQRIVQ GVQLRG----
  ********** *. ***:*** *******: : ***** ** * *****::*** ******

=> Transfer of information
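A rough way to put a number on how similar the two aligned sequences are is to count identical columns. The sketch below is only an illustration: the function name `percent_identity` is made up here, the first sequence is assumed to be the human protein and the second the mouse one, and only the first 60-residue block of the alignment is used. Columns containing a gap (`-`) are skipped, which is one of several possible conventions.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of gap-free alignment columns that hold the same residue."""
    # The slide prints residues in blocks of ten; drop the separating spaces.
    a = aligned_a.replace(" ", "")
    b = aligned_b.replace(" ", "")
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0
    identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # assumption: gap columns are not counted
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# First block of the alignment (labels are an assumption, see above).
human = "MVVSGAPPAL GGGCLGTFTS LLLLASTAIL NAARIPVPPA CGKPQQLNRV VGGEDSTDSE"
mouse = "MMISRPPPAL GGDQFSILIL LVLLTSTAPI SAATIRVSPD CGKPQQLNRI VGGEDSMDAQ"

print(f"{percent_identity(human, mouse):.1f}% identical in the first block")
```

Percent identity is a crude measure; the `:` and `.` marks in the alignment also record conservative substitutions, which this count ignores.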

5 II. Bioinformatics data and databases
(figure labels: mRNA expression profiles, MS data)
  - Large amount of data
  - Growing very, very fast
  - Heterogeneous data types

6 ©CMBI 2015 EMBL DNA database
Note: in 2015, 609 million sequence entries and 1327 billion nucleotides

7 ©CMBI 2014 Genome projects http://www.genomesonline.org/

8 ©CMBI 2015

9 ©CMBI 2014 Ebola

10 ©CMBI 2010 Biological databases (1)
Primary databases contain biomolecular sequences or structures (experimental data!) and associated annotation information.
Sequences
  Nucleic acid sequences: EMBL, GenBank, DDBJ
  Protein sequences: SwissProt, TrEMBL, UniProt
Structures
  Protein structures: PDB
  Structures of small compounds: CSD
Genomes
  Ensembl, UCSC

11 ©CMBI 2009 Biological databases (2)
Secondary databases contain data derived from primary database(s).
  Patterns, motifs, domains: PROSITE, PFAM, PRINTS, INTERPRO, ...
  Disease mutations: OMIM / MIM
  SNPs: dbSNP
  Pathways: KEGG

12 ©CMBI 2015 Databases
Data must be in a certain format for software to recognize it. Every database can have its own format, but some data elements are essential for every database:
  1. Unique identifier, or accession code
  2. Name of depositor
  3. Literature references
  4. Deposition date
  5. The real data
Nomenclature:
  - Database entry or database record
  - Database fields
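The five essential elements can be pictured as the fields of one record. The following is a minimal sketch, not the schema of any real database: the class name `DatabaseEntry` and its field names are invented for illustration, and the depositor is a placeholder. The accession code, deposition date and sequence are taken from the crambin (1CRN) example used later in the lecture.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class DatabaseEntry:
    """One database entry (record); each attribute is a database field."""
    accession: str                  # 1. unique identifier / accession code
    depositor: str                  # 2. name of depositor
    deposition_date: date           # 4. deposition date
    data: str                       # 5. the real data (here: a sequence)
    references: List[str] = field(default_factory=list)  # 3. literature


entry = DatabaseEntry(
    accession="1CRN",
    depositor="(placeholder depositor)",  # hypothetical value
    deposition_date=date(1981, 4, 30),
    data="TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN",
)

# Retrieval by accession code is then a simple dictionary lookup.
index = {entry.accession: entry}
print(index["1CRN"].deposition_date.isoformat())
```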

13 ©CMBI 2015 Quality of Data
SwissProt
  - Data is only entered by annotation experts (annotation: attaching biological information to sequences)
EMBL, PDB
  - “Everybody” can submit data
  - No human intervention when submitted; some automatic checks

14 ©CMBI 2015 SwissProt database
  - Database of protein sequences
  - 549,000 sequence entries (Sept 2015)
  - SwissProt is manually annotated and reviewed
  - Obligatory deposit of sequence in SwissProt before publication
  - SwissProt is part of UniProt
  - The other main part of UniProt is TrEMBL (translated EMBL). TrEMBL is automatically annotated and is not reviewed.

15 ©CMBI 2015 Important fields in SwissProt (1)
  ID   HBA_HUMAN   Reviewed;   142 AA.
  AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
  DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
  DT   23-JAN-2007, sequence version 2.
  DT   23-SEP-2008, entry version 63.
  DE   RecName: Full=Hemoglobin subunit alpha;
  DE   AltName: Full=Hemoglobin alpha chain;
  DE   AltName: Full=Alpha-globin;
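Because every line of a SwissProt entry starts with a two-letter line code, the fields are easy to pull apart with a few lines of code. The sketch below works on the HBA_HUMAN lines shown above; the helper name `parse_fields` is hypothetical, and a real project would more likely rely on an existing parser than on this hand-rolled one.

```python
from collections import defaultdict

RECORD = """\
ID   HBA_HUMAN   Reviewed;   142 AA.
AC   P69905; P01922; Q3MIF5; Q96KF1; Q9NYR7;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   23-JAN-2007, sequence version 2.
DT   23-SEP-2008, entry version 63.
DE   RecName: Full=Hemoglobin subunit alpha;
DE   AltName: Full=Hemoglobin alpha chain;
DE   AltName: Full=Alpha-globin;
"""


def parse_fields(text: str) -> dict:
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # skip empty lines
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return dict(fields)


fields = parse_fields(RECORD)

# ID line: the first token is the entry name.
entry_name = fields["ID"][0].split()[0]

# AC line(s): semicolon-separated accession codes, primary one first.
accessions = [a.strip() for a in " ".join(fields["AC"]).split(";") if a.strip()]

print(entry_name, accessions[0], len(fields["DT"]))
```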

16 ©CMBI 2015 Important fields in SwissProt (2)
Cross references section: hyperlinks to all entries in other databases which are relevant for the protein sequence.
HBA_HUMAN links to: genes & mRNA, protein domains, structures, diseases

17 ©CMBI 2015 Important fields in SwissProt (3) Features section: post-translational modifications, signal peptides, binding sites, enzyme active sites, domains, disulfide bridges, local secondary structure, sequence conflicts between references etc. etc.

18 ©CMBI 2009 And finally, the amino acid sequence!

19 ©CMBI 2015 EMBL database
  - Nucleotide database
  - EMBL: 609 million sequence entries comprising 1327 billion nucleotides (Sept 2015)
  - EMBL records follow roughly the same scheme as SwissProt
  - Obligatory deposit of sequence in EMBL before publication
  - Most EMBL sequences are never seen by a human

20 ©CMBI 2015 Protein Data Bank (PDB)
Databank for 3-dimensional structures of biomolecules (by X-ray & NMR):
  - Protein
  - DNA
  - RNA
  - Ligands
Obligatory deposit of coordinates in the PDB before publication
~111,000 entries (Sept 2015) (~6000 “unique” structures)

21 ©CMBI 2009 Structure Visualization
Structures from PDB can be visualized with:
  1. Yasara / Yasaraview (www.yasara.org)
  2. SwissPDBViewer (http://spdbv.vital-it.ch/)
  3. Protein Explorer (http://www.umass.edu/microbio/rasmol/)
  4. Cn3D (http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml)

22 ©CMBI 2009 PDB important records (1)
PDB nomenclature
  Filename = accession number = PDB code
  Filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN)
HEADER: describes molecule & gives deposition date
  HEADER PLANT SEED PROTEIN 30-APR-81 1CRN
COMPND: name of molecule
  COMPND CRAMBIN
SOURCE: organism
  SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED

23 ©CMBI 2009 PDB important records (2)
SEQRES: sequence of protein. Be aware: 3D coordinates are not always present for every amino acid listed in SEQRES!
  SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE   1CRN 51
  SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS   1CRN 52
  SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR   1CRN 53
  SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN                           1CRN 54
SSBOND: disulfide bridges
  SSBOND 1 CYS 3 CYS 40
  SSBOND 2 CYS 4 CYS 32
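The SEQRES records list the sequence in three-letter codes; converting them to the one-letter codes used in sequence databases is a table lookup. The sketch below uses the four SEQRES lines shown above. The function name `seqres_to_sequence` is made up, and the lines are split on whitespace for simplicity, whereas real PDB files are defined by fixed column positions.

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

SEQRES_LINES = [
    "SEQRES 1 46 THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51",
    "SEQRES 2 46 ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52",
    "SEQRES 3 46 ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53",
    "SEQRES 4 46 CYS PRO GLY ASP TYR ALA ASN 1CRN 54",
]


def seqres_to_sequence(lines) -> str:
    """Translate SEQRES three-letter codes into a one-letter sequence."""
    residues = []
    for line in lines:
        tokens = line.split()
        if not tokens or tokens[0] != "SEQRES":
            continue
        # Keep only tokens that are amino acid codes; serial numbers,
        # the chain length and the trailing "1CRN nn" label are ignored.
        residues.extend(THREE_TO_ONE[t] for t in tokens[1:] if t in THREE_TO_ONE)
    return "".join(residues)


sequence = seqres_to_sequence(SEQRES_LINES)
print(sequence, len(sequence))
```

The length of the result can be checked against the residue count stated in each SEQRES line (46 for crambin).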

24 ©CMBI 2015 PDB important records (3)
And at the end of the PDB file the “real” data:
ATOM: one line for each atom with its unique name and its x, y, z coordinates (in Angstrom)
  ATOM   1  N    THR  1   17.047  14.099   3.625  1.00  13.79   1CRN 70
  ATOM   2  CA   THR  1   16.967  12.784   4.338  1.00  10.80   1CRN 71
  ATOM   3  C    THR  1   15.685  12.755   5.133  1.00   9.19   1CRN 72
  ATOM   4  O    THR  1   15.268  13.825   5.594  1.00   9.85   1CRN 73
  ATOM   5  CB   THR  1   18.170  12.703   5.337  1.00  13.02   1CRN 74
  ATOM   6  OG1  THR  1   19.334  12.829   4.463  1.00  15.06   1CRN 75
  ATOM   7  CG2  THR  1   18.150  11.546   6.304  1.00  14.23   1CRN 76
  ATOM   8  N    THR  2   15.115  11.555   5.265  1.00   7.81   1CRN 77
  ATOM   9  CA   THR  2   13.856  11.469   6.066  1.00   8.31   1CRN 78
  ATOM  10  C    THR  2   14.164  10.785   7.379  1.00   5.80   1CRN 79
  ATOM  11  O    THR  2   14.993   9.862   7.443  1.00   6.94   1CRN 80
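Once the coordinates are read, geometric questions become simple arithmetic. The sketch below parses the first three ATOM lines shown above and computes the distance between the backbone N and CA atoms of THR 1. It is an illustration under assumptions: `parse_atom` is a hypothetical helper, and it splits on whitespace because the slide shows these old-style records without a chain identifier. Real PDB files are fixed-column, so a robust parser should slice by column position or use an established library.

```python
import math

ATOM_LINES = [
    "ATOM 1 N THR 1 17.047 14.099 3.625 1.00 13.79",
    "ATOM 2 CA THR 1 16.967 12.784 4.338 1.00 10.80",
    "ATOM 3 C THR 1 15.685 12.755 5.133 1.00 9.19",
]


def parse_atom(line: str) -> dict:
    """Parse one whitespace-separated ATOM line as printed on the slide."""
    t = line.split()
    return {
        "serial": int(t[1]),
        "name": t[2],
        "residue": t[3],
        "res_seq": int(t[4]),
        "xyz": (float(t[5]), float(t[6]), float(t[7])),  # in Angstrom
    }


atoms = {atom["name"]: atom for atom in map(parse_atom, ATOM_LINES)}

# Euclidean distance between two atoms of the same residue.
n_ca = math.dist(atoms["N"]["xyz"], atoms["CA"]["xyz"])
print(f"N-CA distance in THR 1: {n_ca:.2f} Angstrom")
```

The value comes out close to 1.5 Angstrom, which is in the range expected for a covalent N-CA bond and is a quick sanity check that the columns were read correctly.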

25 ©CMBI 2009 Part III: Sequence Retrieval with MRS
Google: the best generic search and retrieval system
  - Google searches everywhere for everything
MRS: Maarten’s Retrieval System (http://mrs.cmbi.ru.nl)
  - MRS searches in selected data environments
  - MRS is the Google of the biological database world
Search engine (like Google)
  - Input/Query = word(s)
  - Output = entry/entries from database
Other programs exist: Entrez, SRS, ...

26 ©CMBI 2015 MRS Search Steps
  1. Select database(s) of choice
  2. Formulate your query
  3. Hit “Search”
  4. The result is a “hitlist”
  5. Analyze the results

27 ©CMBI 2011 http://mrs.cmbi.ru.nl

28 ©CMBI 2009 MRS Database Selection You can choose between selecting all databases or just one of them. But think about your query first!!

29 ©CMBI 2009 MRS Search options
Simply type your keywords in the keyword field and choose SEARCH. If you know the fields of the database you are searching in, you can specify your query further. But think about your query first!!

30 ©CMBI 2009 MRS Hitlist (1)

31 ©CMBI 2009 MRS Hitlist (2)

32 ©CMBI 2015 MRS Options
MRS creates a result, or a “hitlist”. With the result you can do different things in MRS:
  - View the hits
  - Blast single hit sequences
  - Clustal multiple hit sequences

33 ©CMBI 2009 MRS - View Hits

34 ©CMBI 2009 Combine in MRS
  AND or &   (AND is implicit)
  OR or |
  NOT or !
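One way to reason about combined queries is to treat each keyword's hitlist as a set of entry names: AND keeps the overlap, OR merges the lists, and NOT removes entries. The sketch below only illustrates that logic with Python sets. It says nothing about how MRS works internally, and the two hitlists are invented for the example.

```python
# Hypothetical hitlists for two single-keyword searches.
hits_hemoglobin = {"HBA_HUMAN", "HBB_HUMAN", "HBA_MOUSE"}
hits_human = {"HBA_HUMAN", "HBB_HUMAN", "INS_HUMAN"}

# hemoglobin AND human  (also what "hemoglobin human" means, AND being implicit)
both = hits_hemoglobin & hits_human

# hemoglobin OR human
either = hits_hemoglobin | hits_human

# hemoglobin NOT human
hemoglobin_not_human = hits_hemoglobin - hits_human

print(sorted(both))
print(sorted(either))
print(sorted(hemoglobin_not_human))
```

AND narrows a search, OR widens it, and NOT filters out unwanted entries, which is why it pays to think about the query before searching.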

35 ©CMBI 2009 Try it yourself with the exercises!
Ground rules for bioinformatics
  - Don't always believe what programs tell you: they're often misleading & sometimes wrong!
  - Don't always believe what databases tell you: they're often misleading & sometimes wrong!
  - Don't always believe what lecturers tell you: they're sometimes wrong!
  - Don't be a naive user; computers don’t do biology & bioinformatics, you do!
(free after Terri Attwood)

