1
1 Exercise 1 Bioinformatics Databases
2
2 What’s in a database? Sequences – genes, proteins, etc. Full genomes Annotation – information about the gene/protein: - function - cellular location - chromosomal location - introns/exons - protein structure - phenotypes, diseases Publications
3
3 NCBI and Entrez NCBI hosts one of the largest and most comprehensive collections of databases and belongs to the NIH, the National Institutes of Health (USA). Entrez is the search engine of NCBI. Search for: genes, proteins, genomes, structures, diseases, publications and more. http://www.ncbi.nlm.nih.gov/
4
4 Searching for published papers Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
5
5 Use fields! Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA] For the full list of field tags: go to help -> Search Field Descriptions and Tags
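The fielded query on this slide can also be assembled programmatically. The following is a minimal sketch, assuming you simply want to join term/tag pairs with AND; the function name `build_fielded_query` is hypothetical and not part of any NCBI tool.

```python
def build_fielded_query(pairs, operator="AND"):
    """Join (term, field_tag) pairs into one Entrez-style query string.

    `pairs` is a list of (term, tag) tuples, e.g. ("Yang", "AU").
    This only builds text; it does not contact NCBI.
    """
    return f" {operator} ".join(f"{term}[{tag}]" for term, tag in pairs)


# The query from the slide: author, title word, publication date, journal.
query = build_fielded_query([
    ("Yang", "AU"),
    ("glycoprotein", "TI"),
    ("2006", "DP"),
    ("J virol", "TA"),
])
print(query)
```

The resulting string can be pasted into the PubMed search box as-is.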
6
6 Exercise Retrieve all publications in which the first author is: Pe'er I and the last author is: Shamir R
7
7 Using Limits Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years
8
8 Google scholar http://scholar.google.com/
10
10 NCBI gene & protein databases: GenBank GenBank is an annotated collection of all publicly available DNA sequences Holds 65 billion bases (Oct. 2007) GenPept is a database of translated coding sequences from GenBank
11
11 Searching for CD4 human using Entrez Search demonstration
13
13 Using Field Descriptions, Qualifiers, and Boolean Operators Cd4[GENE] AND human[ORGN], or equivalently: Cd4[gene name] AND human[organism]. List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers Boolean operators: AND, OR, NOT. Note: do not use the field Protein name [PROT], only GENE!
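The same query can be sent to Entrez from a script. The sketch below is an assumption beyond the slides: it uses NCBI's E-utilities `esearch` endpoint and only builds the request URL (no network call is made). The helper name `esearch_url` and the choice of the `protein` database are illustrative.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Return an ESearch URL for a fielded Entrez query.

    db     : Entrez database name, e.g. "protein" or "pubmed"
    term   : query string using field tags and Boolean operators
    retmax : maximum number of IDs to return
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


url = esearch_url("protein", "CD4[GENE] AND human[ORGN]")
print(url)
```

Opening the printed URL in a browser returns an XML list of matching record IDs; brackets and spaces in the query are percent-encoded automatically.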
15
15 RefSeq REFSEQ: sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)
17
17 An explanation on GenBank records
18
18 Accession Numbers
- GenBank/EMBL: two letters followed by six digits, e.g. AY123456; or one letter followed by five digits, e.g. U12345
- GenPept (amino-acid translations of GenBank): three letters and five digits, e.g. AAA12345
- RefSeq: distinguished from GenBank accessions by a distinct prefix format of two characters plus an underscore, e.g. NP_015325. NM_: nucleotide, NP_: protein
- SWISS-PROT (another protein database): always six characters. Character 1: [O,P,Q]; character 2: [0-9]; characters 3 to 5: [A-Z,0-9]; character 6: [0-9]; e.g. P12345 and Q9JJS7
- PDB (Protein Data Bank, structure database): one digit followed by three letters or digits, e.g. 1hxw
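The formats above translate directly into regular expressions. This is a rough sketch that covers only the patterns listed on this slide (real databases accept additional formats); the function name `classify_accession` is hypothetical.

```python
import re

# Patterns are checked in order. Swiss-Prot comes before GenBank because an
# ID such as P12345 fits both "one letter + five digits" and the Swiss-Prot
# layout; this ordering is an assumption made for the sketch.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("Swiss-Prot", re.compile(r"^[OPQ]\d[A-Z0-9]{3}\d$")),
    ("GenBank/EMBL", re.compile(r"^([A-Z]{2}\d{6}|[A-Z]\d{5})$")),
    ("GenPept", re.compile(r"^[A-Z]{3}\d{5}$")),
    ("PDB", re.compile(r"^\d[A-Z0-9]{3}$")),
]


def classify_accession(accession):
    """Guess the source database of an accession number from its shape."""
    acc = accession.strip().upper()
    for name, pattern in PATTERNS:
        if pattern.match(acc):
            return name
    return "unknown"


for example in ["AY123456", "U12345", "AAA12345", "NP_015325", "Q9JJS7", "1hxw"]:
    print(example, "->", classify_accession(example))
```

A shape-based guess like this is only a convenience; the authoritative check is to look the ID up in the database itself.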
19
19 Swiss-Prot A protein sequence database which strives to provide a high level of annotation: * the function of a protein * domains structure * post-translational modifications * variants One entry for each protein
21
21 GenBank Vs. Swiss-Prot GenBank results Swiss-Prot results
22
22 Downloading a sequence & Fasta format
Fasta format:
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes searching quicker): Refseq: NP_000607.1, Swiss-Prot: P01730
24
24 PDB: Protein Data Bank Main database of 3D structures Includes ~47,000 entries (proteins, nucleic acids, others) Proteins organized in groups, families etc. Is highly redundant http://www.rcsb.org
25
25 CD4 in complex with gp120 gp120 CD4 PDB ID 1G9M
26
26 Organism specific databases Model organisms have independent databases, e.g. the HIV database: http://hiv-web.lanl.gov/content/index
27
27 Genecards All-in-one database of human genes (a project by the Weizmann Institute). Attempts to integrate as many databases and publications as possible, and all available knowledge. http://www.genecards.org
29
29 Summary
- General and comprehensive databases: NCBI, EMBL, DDBJ
- Genome specific databases: ENSEMBL, UCSC genome browser
- Highly annotated databases: Human genes: Genecards; Proteins: Swiss-Prot, Refseq; Structures: PDB
30
30 The MOST important of all 1. Google (or any search engine)
31
31 And always remember: 2. RT(F)M – Read the manual!!
32
32 Help! Read the Help section Read the FAQ section Google the question!
33
33 || || ||||| ||| || || ||||||||||||||||||| MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFE… ATGGTGAACCTGACCTCTGACGAGAAGACTGCCGTCCTTGCCCTGTGGAACAAGGTGGACG TGGAAGACTGTGGTGGTGAGGCCCTGGGCAGGTTTGTATGGAGGTTACAAGGCTGCTTAAG GAGGGAGGATGGAAGCTGGGCATGTGGAGACAGACCACCTCCTGGATTTATGACAGGAACT GATTGCTGTCTCCTGTGCTGCTTTCACCCCTCAGGCTGCTGGTCGTGTATCCCTGGACCCA GAGGTTCTTTGAAAGCTTTGGGGACTTGTCCACTCCTGCTGCTGTGTTCGCAAATGCTAAG GTAAAAGCCCATGGCAAGAAGGTGCTAACTTCCTTTGGTGAAGGTATGAATCACCTGGACA ACCTCAAGGGCACCTTTGCTAAACTGAGTGAGCTGCACTGTGACAAGCTGCACGTGGATCC TGAGAATTTCAAGGTGAGTCAATATTCTTCTTCTTCCTTCTTTCTATGGTCAAGCTCATGT CATGGGAAAAGGACATAAGAGTCAGTTTCCAGTTCTCAATAGAAAAAAAAATTCTGTTTGC ATCACTGTGGACTCCTTGGGACCATTCATTTCTTTCACCTGCTTTGCTTATAGTTATTGTT TCCTCTTTTTCCTTTTTCTCTTCTTCTTCATAAGTTTTTCTCTCTGTATTTTTTTAACACA ATCTTTTAATTTTGTGCCTTTAAATTATTTTTAAGCTTTCTTCTTTTAATTACTACTCGTT TCCTTTCATTTCTATACTTTCTATCTAATCTTCTCCTTTCAAGAGAAGGAGTGGTTCACTA CTACTTTGCTTGGGTGTAAAGAATAACAGCAATAGCTTAAATTCTGGCATAATGTGAATAG GGAGGACAATTTCTCATATAAGTTGAGGCTGATATTGGAGGATTTGCATTAGTAGTAGAGG TTACATCCAGTTACCGTCTTGCTCATAATTTGTGGGCACAACACAGGGCATATCTTGGAAC AAGGCTAGAATATTCTGAATGCAAACTGGGGACCTGTGTTAACTATGTTCATGCCTGTTGT CTCTTCCTCTTCAGCTCCTGGGCAATATGCTGGTGGTTGTGCTGGCTCGCCACTTTGGCAA GGAATTCGACTGGCACATGCACGCTTGTTTTCAGAAGGTGGTGGCTGGTGTGGCTAATGCC CTGGCTCACAAGTACCATTGA MVNLTSDEKTAVLALWNKVDVEDCGGEALGRLLVVYPWTQRFFE… Alignment teaser…
34
34 Pairwise Sequence Alignment
35
35 What is sequence alignment? Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences. MVNLTSDEKTAVLALWNKVDVEDCGGE || || ||||| ||| || || || MVHLTPEEKTAVNALWGKVNVDAVGGE
36
36 Why sequence alignment? Predict characteristics of a protein – use the structure or function information on known proteins with similar sequences available in databases in order to predict the structure or function of an unknown protein Assumptions: similar sequences produce similar proteins
37
37 Local vs. Global Global alignment – finds the best alignment across the entire length of both sequences. Local alignment – finds regions of high similarity in parts of the sequences.
Global:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
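The difference between the two modes can be seen in a small dynamic-programming sketch. The slides do not name specific algorithms; the code below assumes the standard Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a simple linear gap penalty, and the score values (+1, -1, -2) are arbitrary illustrative choices.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b.

    local=False: global alignment, both sequences aligned end to end.
    local=True : local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            cell = max(diag, up, left)
            if local:
                # Local mode: a negative prefix is dropped (restart at 0).
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


x, y = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(x, y))
print("local: ", alignment_score(x, y, local=True))
```

With identical scoring parameters the local score is never lower than the global one, because the local search is free to ignore poorly matching regions.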
38
38 Sequence evolution In the course of evolution, the sequences changed from the ancestral sequence by random mutations. Three types of changes: 1. Insertion: AAGA → AAGTA
39
39 Sequence evolution In the course of evolution, the sequences changed from the ancestral sequence by random mutations. Three types of changes: 1. Insertion: AAGA → AAGTA 2. Deletion: AAGA → AGA
40
40 Evolutionary changes in sequences In the course of evolution, the sequences changed from the ancestral sequence by random mutations. Three types of mutations: 1. Insertion: AAGA → AAGTA 2. Deletion: AAGA → AGA 3. Substitution: AAGA → AACA. Insertion + Deletion = Indel
41
41 Sequence alignment
Unaligned:
AAGCTGAATTCGAA
AGGCTCATTTCTGA
Aligned:
AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-
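Once two sequences are written with gap characters, every column is a match, a mismatch, or a gap. The short sketch below counts the three kinds for the aligned pair on this slide; the function name `column_counts` is hypothetical.

```python
def column_counts(top, bottom, gap_char="-"):
    """Count match, mismatch and gap columns of a pairwise alignment."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            counts["gap"] += 1
        elif x == y:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts


counts = column_counts("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-")
print(counts)  # {'match': 10, 'mismatch': 2, 'gap': 4}
```

These three counts are the raw ingredients of any alignment score: a scoring scheme simply assigns a value to each kind of column and sums them.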
42
42 Scoring scheme
- Match/mismatch scores: substitution matrices
  - Nucleic acids: transition-transversion
  - Amino acids: evolution (empirical data) based (PAM, BLOSUM); physico-chemical properties based (Grantham, McLachlan)
- Gap penalty
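For nucleic acids, a transition-transversion scheme penalises purine-to-pyrimidine changes (transversions) more heavily than changes within the same class (transitions). The sketch below illustrates the idea; the numeric scores and the gap penalty are assumptions chosen for illustration, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides.

    Transition   : A<->G or C<->T (same chemical class), milder penalty.
    Transversion : purine<->pyrimidine, stronger penalty.
    Any other mismatched pair (e.g. with N) is treated as a transversion here.
    """
    if x == y:
        return match
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(top, bottom, gap=-3):
    """Sum substitution scores and gap penalties over all alignment columns."""
    total = 0
    for x, y in zip(top, bottom):
        total += gap if "-" in (x, y) else substitution_score(x, y)
    return total


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))
```

Changing the gap penalty relative to the mismatch penalties shifts which alignments score best, which is why the scoring scheme has to be chosen before aligning.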
43
43 Amino Acid Scoring Matrices
PAM matrices: PAM80, PAM120, PAM250
- The number of a PAM matrix represents evolutionary distance
- Greater numbers denote greater distances
- Low PAM: strong similarities
- High PAM: weak similarities
- PAM120 for general use (40% identity)
- PAM60 for close relations (60% identity)
- PAM250 for distant relations (20% identity)
- If uncertain, try several different matrices
44
44 Amino Acid Scoring Matrices
BLOSUM matrices: BLOSUM45, BLOSUM62, BLOSUM80
- The number of a BLOSUM matrix represents average % identity
- Greater numbers denote greater identity
- Low BLOSUM: weak similarities
- High BLOSUM: strong similarities
- BLOSUM62 for general use
- BLOSUM80 for close relations
- BLOSUM45 for distant relations
- If uncertain, try several different matrices
45
45 Web servers for pairwise alignment
46
46 BLAST 2 sequences (bl2Seq) at NCBI Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine for local alignment. BLAST does not use an optimal algorithm but a heuristic.
47
47 Back to NCBI
48
48 BLAST – bl2seq
49
49 Bl2Seq - query blastn – nucleotide; blastp – protein
50
50 Bl2seq results
51
51 Bl2seq results (figure labels: Match, Dissimilarity, Gaps, Similarity, Low complexity)
52
52 Bl2seq results: Bits score – a score for the alignment according to the number of identities, similarities, etc. Expected score (E-value) – the number of alignments with the same score one can “expect” to observe by chance when searching a database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is real.
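The link between bit score and E-value can be made concrete. The sketch below assumes the standard Karlin-Altschul relation used in BLAST statistics, E = m · n · 2^(−S'), where S' is the bit score, m the query length and n the database length. It ignores the effective-length corrections that BLAST applies, so its numbers will not reproduce BLAST output exactly.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value: chance alignments expected with at least this bit score.

    Assumes E = m * n * 2**(-S'), without BLAST's edge-effect corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# A 150-residue query against a database of 10 million residues (made-up sizes).
for bits in (20, 40, 60):
    print(bits, "bits ->", expected_hits(bits, 150, 10_000_000))
```

Two consequences follow directly from the formula: each additional bit halves the E-value, and the same alignment looks less significant when the database is larger.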
53
53 BLAST – programs Query: DNA or Protein. Database: DNA or Protein.
54
54 BLAST – Blastp
55
55 Blastp - results
56
56 Blastp – results (cont’)
57
57 Blastp – acquiring sequences
58
58 Blastp – acquiring sequences (cont’)
59
59 Fasta format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
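A multi-sequence Fasta file is easy to read in a script: each record starts at a `>` header line, and the sequence is everything up to the next header. The parser below is a minimal sketch with hypothetical names; the two short records in the example are toy data, not real database entries.

```python
def parse_fasta(text):
    """Split Fasta-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


toy = """>seq1 toy record
MVHL
TPEE
>seq2 another toy record
MVNLTS
"""
for name, seq in parse_fasta(toy):
    print(name, len(seq), seq)
```

For real work an established library parser is usually preferable, since it also handles malformed files and very large inputs.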
60
60 Searching for remote homologs Sometimes BLAST isn’t enough: for a large protein family, BLAST only finds the close members, and we want the more distant members too. Options: PSI-BLAST; Profile HMMs (not discussed).
61
61 PSI-BLAST Position Specific Iterated BLAST: Regular BLAST → Construct profile from BLAST results → BLAST profile search → Final results
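The "construct profile" step can be illustrated with a toy position-specific scoring matrix. This is a heavily simplified sketch, not PSI-BLAST's actual procedure: it assumes ungapped hits of equal length, a uniform background frequency and add-one pseudocounts, whereas the real program weights sequences and derives pseudocounts differently. The five input strings are the first five residues of the globin sequences from the Fasta slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, alphabet=AMINO_ACIDS, pseudocount=1.0):
    """Return one {letter: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # uniform background (assumption)
    total = len(aligned_seqs) + pseudocount * len(alphabet)
    profile = []
    for pos in range(len(aligned_seqs[0])):
        counts = Counter(seq[pos] for seq in aligned_seqs)
        column = {}
        for letter in alphabet:
            freq = (counts[letter] + pseudocount) / total
            column[letter] = math.log2(freq / background)
        profile.append(column)
    return profile


def score_with_profile(profile, seq):
    """Score a sequence of the same length against the profile."""
    return sum(column[letter] for column, letter in zip(profile, seq))


hits = ["MVHLT", "MVHLT", "MVHFT", "MGHFT", "MGHFT"]
profile = build_profile(hits)
print(score_with_profile(profile, "MVHLT"))  # family-like: high score
print(score_with_profile(profile, "WWWWW"))  # unrelated: negative score
```

Unlike a fixed substitution matrix, the profile scores the same residue differently at different positions, which is what lets the next search round reach more distant family members.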
62
62 PSI-BLAST Advantage: PSI-BLAST looks for sequences that are close to the query, and learns from them to extend the circle of friends. Disadvantage: if we obtain a WRONG hit, we will get to unrelated sequences (contamination), and this gets worse with each iteration.
63
63 BLAST – PSI-Blast
64
64 PSI-Blast - results