1
1 Exercise: BIOINFORMATIC DATABASES and BLAST
2
2 Outline
- NCBI and Entrez
- PubMed
- Google Scholar
- RefSeq
- Swiss-Prot
- FASTA format
- PDB: Protein Data Bank
- Organism specific databases
- Summary
- Pairwise Sequence Alignment and BLAST
- Overview
- Query type: DNA or Protein
3
3 What's in a database?
- Sequences: genes, proteins, etc.
- Full genomes
- Annotation (information about genes/proteins): function, cellular location, chromosomal location, introns/exons, protein structure, phenotypes, diseases
- Publications
4
4 NCBI and Entrez
NCBI: the National Center for Biotechnology Information. It hosts one of the largest and most comprehensive collections of databases and belongs to the NIH (National Institutes of Health), the primary Federal agency for conducting and supporting medical research in the USA.
Entrez is the search engine of NCBI. Search for: genes, proteins, genomes, structures, diseases, publications and more.
http://www.ncbi.nlm.nih.gov/
5
5 PubMed: search for published papers
Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol. 2006 May;80(9):4388-95.
6
6 Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags: go to Help -> Search Field Descriptions and Tags
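The fielded query on this slide can be assembled programmatically. The following is a minimal sketch; the helper name `build_pubmed_query` is hypothetical, and only the four field tags from the slide (AU, TI, DP, TA) are used.

```python
def build_pubmed_query(clauses):
    """Join (value, tag) pairs into one AND-ed PubMed query string.

    build_pubmed_query is a hypothetical helper for illustration;
    the tags themselves (AU, TI, DP, TA) come from the slide.
    """
    return " AND ".join(f"{value}[{tag}]" for value, tag in clauses)


query = build_pubmed_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in the title
    ("2006", "DP"),          # date of publication
    ("J virol", "TA"),       # journal title abbreviation
])
print(query)
```

The resulting string can be pasted into the PubMed search box as-is.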
7
7 Exercise Retrieve all publications in which the first author is: Pe'er I and the last author is: Shamir R
8
8 Using limits Retrieve the publications of Friedman N, in the journals: Bioinformatics and Journal of Computational Biology, in the last 5 years
9
9 Google scholar http://scholar.google.com/
10
10
11
11 NCBI gene & protein databases: GenBank GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations) Holds 99 billion bases (2008)
12
12 Searching NCBI for the protein human CD4 Search demonstration
13
13
14
14 Using field descriptions, qualifiers, and boolean operators
Cd4[GENE] AND human[ORGN]
or
Cd4[gene name] AND human[organism]
List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
Boolean operators: AND, OR, NOT
Note: do not use the field Protein name [PROT], only GENE!
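The same fielded query can also be sent to Entrez from a script. The sketch below only builds the request URL for NCBI's E-utilities `esearch` endpoint and does not contact the server; the function name `esearch_url` and the `retmax` default are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL.

    esearch_url is a hypothetical helper; db, term and retmax are
    standard esearch parameters.
    """
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


url = esearch_url("protein", "CD4[GENE] AND human[ORGN]")
print(url)
```

Square brackets and spaces are percent-encoded by `urlencode`, so the query survives transport unchanged.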
15
15 This time we directly search in the protein database
16
16 RefSeq RefSeq: sub-collection of NCBI databases with only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)
17
17
18
18 An explanation on GenBank records
19
19 Accession Numbers
GenBank / EMBL:
- Two letters followed by six digits, e.g.: AY123456
- One letter followed by five digits, e.g.: U12345
RefSeq:
- RefSeq accession numbers can be distinguished from GenBank accessions by their prefix (2 characters + underscore), e.g.: NP_015325
- NM_: mRNA, NP_: protein
Swiss-Prot (another protein database):
- All are six characters. Character/Format: 1 [O,P,Q], 2 [0-9], 3 [A-Z,0-9], 4 [A-Z,0-9], 5 [A-Z,0-9], 6 [0-9], e.g.: P12345 and Q9JJS7
PDB (Protein Data Bank, structure database):
- One digit followed by three alphanumeric characters, e.g.: 1hxw
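The formats above can be turned into a rough classifier. This is a sketch under the slide's rules only, not an official validator; the function name `classify_accession` is hypothetical. Note that a Swiss-Prot accession such as P12345 also fits the GenBank "one letter + five digits" pattern, so the order of the checks is a design choice made here.

```python
import re

# Patterns follow the formats listed on the slide; checked in this order.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")),
    ("Swiss-Prot", re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")),
    ("GenBank/EMBL", re.compile(r"^([A-Z]{2}\d{6}|[A-Z]\d{5})$")),
    ("PDB", re.compile(r"^[0-9][A-Z0-9]{3}$")),
]


def classify_accession(accession):
    """Return the first database whose format matches, else 'unknown'."""
    acc = accession.strip().upper()
    for name, pattern in PATTERNS:
        if pattern.match(acc):
            return name
    return "unknown"


for example in ["AY123456", "U12345", "NP_015325", "P12345", "Q9JJS7", "1hxw"]:
    print(example, "->", classify_accession(example))
```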
20
20 Swiss-Prot
A protein sequence database which strives to provide a high level of annotation:
* the function of a protein
* domain structure
* post-translational modifications
* variants
One entry for each protein
21
21
22
22 GenBank Vs. Swissprot GenBank results Swiss-Prot results
23
23 Fasta format
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
The first line is the header (ID/accession, then description); the remaining lines are the sequence.
Save accession numbers for future use (makes searching quicker): RefSeq accession number: NP_000607.1
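To pull the accession number out of a header like the one on this slide, a small parser is enough. This sketch assumes the `gi|...|ref|...|` header layout shown here; other databases use other layouts, and `parse_ncbi_header` is a hypothetical name.

```python
def parse_ncbi_header(line):
    """Split a 'gi|<number>|ref|<accession>| description' FASTA header.

    Assumes the identifier block has no spaces and alternates
    tag|value pairs, as in the slide's example.
    """
    body = line.lstrip(">").strip()
    ids, _, description = body.partition(" ")
    parts = ids.strip("|").split("|")
    fields = dict(zip(parts[0::2], parts[1::2]))  # {'gi': ..., 'ref': ...}
    return {
        "gi": fields.get("gi"),
        "accession": fields.get("ref"),
        "description": description,
    }


header = ">gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]"
print(parse_ncbi_header(header))
```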
24
24 Downloading
25
25 PDB: Protein Data Bank
- Main database of 3D structures
- Includes ~56,000 entries (proteins, nucleic acids, others)
- Proteins organized in groups, families etc.
- Is highly redundant: different conformations (e.g., ligand dependent)
http://www.rcsb.org
26
26 Human CD4 in complex with HIV gp120 gp120 CD4 PDB ID 1G9M
27
27 Organism specific databases
Model organisms have independent databases.
HIV database: http://hiv-web.lanl.gov/content/index
http://gmod.org/wiki/Main_Page?q=node/71
28
28 Summary
- General and comprehensive databases: NCBI, EMBL, DDBJ
- Genome specific databases: ENSEMBL, UCSC genome browser
- Highly annotated databases:
  - Proteins: Swiss-Prot, RefSeq
  - Structures: PDB
29
29 And always remember: 1. Google (or any search engine) 2. RTFM - Read the manual!!! (/help/FAQ)
30
30 Pairwise Sequence Alignment and BLAST
31
31 What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
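The row of bars between the two sequences marks identical positions. A minimal sketch of how that row and the percent identity can be computed for two already-aligned sequences of equal length (function names are hypothetical):

```python
def match_line(a, b):
    """Return '|' where aligned characters are identical, ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


def percent_identity(a, b):
    """Identical columns as a percentage of the alignment length."""
    return 100.0 * match_line(a, b).count("|") / len(a)


seq1 = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
seq2 = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
print(seq1)
print(match_line(seq1, seq2))
print(seq2)
print(f"{percent_identity(seq1, seq2):.1f}% identity")  # 18 of 27 columns
```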
32
32 Local vs. Global
Global alignment: finds the best alignment across the whole two sequences.
Local alignment: finds regions of high similarity in parts of the sequences.
Global:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local:
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
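The difference between the two modes shows up clearly in the dynamic-programming recurrences. The sketch below computes only the optimal score (no traceback) in the style of Needleman-Wunsch (global) and Smith-Waterman (local); the slides do not name these algorithms, and the match, mismatch and gap values are arbitrary assumptions.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Optimal alignment score with a linear gap penalty.

    local=False: global alignment, both sequences aligned end to end.
    local=True: local alignment, scores are floored at 0 and the best
    cell anywhere in the table is returned.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]


# Sequences from the slide, with the gap character removed.
s1, s2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", alignment_score(s1, s2))
print("local: ", alignment_score(s1, s2, local=True))
```

A local score can never be lower than 0, whereas a global score is dragged down by every poorly matching region.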
33
33 Evolutionary changes in sequences
In the course of evolution, the sequences changed from the ancestral sequence by random mutations.
Three types of mutations:
1. Insertion: AAGA → AAGTA
2. Deletion: AAGA → AGA
3. Substitution: AAGA → AACA
Insertion + Deletion = Indel
34
34 Scoring scheme
Match/mismatch scores: substitution matrices
- Nucleic acids: transition-transversion
- Amino acids:
  - Evolution (empirical data) based: PAM, BLOSUM
  - Physico-chemical properties based: Grantham, McLachlan
Gap penalty
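A scoring scheme combines a substitution score per aligned pair with a penalty per gap. The sketch below scores a given nucleotide alignment with a transition/transversion scheme; all numeric values are assumptions chosen for illustration, not values from the slides.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair (illustrative values).

    A transition stays within purines or within pyrimidines;
    a transversion crosses between the two classes.
    """
    if x == y:
        return match
    pair = {x, y}
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(a, b, gap=-3):
    """Sum substitution scores and gap penalties over aligned columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for x, y in zip(a, b):
        total += gap if "-" in (x, y) else substitution_score(x, y)
    return total


print(score_alignment("AAGA", "AACA"))    # one G/C transversion
print(score_alignment("AAG-A", "AAGTA"))  # one gap
```

For proteins the per-pair score would come from a matrix such as PAM or BLOSUM instead of the two-class rule used here.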
35
35 Computation time: How do we search a database?
If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds (about 11.6 days) to complete one search.
150,000 searches (at least!!) are performed per day.
More than 82,000,000 sequence records in GenBank.
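The estimate on this slide is plain arithmetic and can be checked in a few lines. The variable names are illustrative; the numbers are the slide's.

```python
alignments_per_second = 10   # each pairwise alignment takes 1/10 s
database_size = 10**7        # sequences in the database

total_seconds = database_size // alignments_per_second
days = total_seconds / (24 * 60 * 60)

print(total_seconds, "seconds")
print(f"{days:.1f} days for a single exhaustive search")
```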
36
36 Conclusion
Running the exact pairwise alignment algorithm between the query and all DB entries is too slow.
37
37 Heuristic
Definition: a heuristic is a method for solving a problem that does not guarantee an exact solution (but is usually not too bad) and, in return, takes much less time than the exact algorithm.
38
38 BLAST
BLAST: Basic Local Alignment Search Tool
A heuristic for searching a database for similar sequences.
The heuristic is based on restrictions of the similarity (such as using ungapped word matching instead of single character matching).
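The word-matching idea can be illustrated with a toy seeding step: index every word of the query, then look up each word of the database sequence. This is only a sketch of exact word matching; real BLAST also extends each hit and, for proteins, accepts similar (not only identical) words. The function name and the word size of 3 are assumptions.

```python
def word_hits(query, subject, word_size=3):
    """Return (query_pos, subject_pos) pairs where identical words start."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)

    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


print(word_hits("ADLGAVF", "QQADLGQ"))
```

Only regions around such hits are examined further, which is what saves time compared with aligning every position against every position.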
39
39 Query type: DNA or Protein
Query: DNA or protein. Database: DNA or protein.
All types of searches are possible:
- blastn: nucleotide query vs. nucleotide database
- blastp: protein query vs. protein database
- blastx: translated nucleotide query vs. protein database
- tblastn: protein query vs. translated nucleotide database
- tblastx: translated nucleotide query vs. translated nucleotide database
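The list above amounts to a lookup from (query type, database type) to program name. A minimal sketch, with a hypothetical function name:

```python
# Program choices as listed on the slide, keyed by (query, database) type.
PROGRAMS = {
    ("dna", "dna"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("dna", "protein"): ["blastx"],
    ("protein", "dna"): ["tblastn"],
}


def blast_programs(query_type, database_type):
    """Return the BLAST programs that fit a query/database combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in PROGRAMS:
        raise ValueError("types must be 'dna' or 'protein'")
    return PROGRAMS[key]


print(blast_programs("DNA", "protein"))
```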
40
40 Query type
Information content in the letters:
- Nucleotides: 4 letter alphabet
- Amino acids: 20 letter alphabet
Two random DNA sequences will, on average, have 25% identity (assuming equally frequent letters).
Two random protein sequences will, on average, have 5% identity (same assumption).
The amino-acid sequence is often preferable for homology search.
Selection (and hence conservation) works (mostly) at the protein level.
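The 25% and 5% figures follow from the chance that two independently drawn letters coincide, which is the sum of squared letter frequencies. A short sketch, assuming uniform frequencies as the slide implicitly does:

```python
def expected_identity(frequencies):
    """Probability that two independent random letters are the same."""
    return sum(p * p for p in frequencies)


def uniform(alphabet_size):
    """Equal frequency for every letter (a simplifying assumption)."""
    return [1.0 / alphabet_size] * alphabet_size


print(f"DNA:     {100 * expected_identity(uniform(4)):.0f}%")
print(f"Protein: {100 * expected_identity(uniform(20)):.0f}%")
```

With real, non-uniform amino-acid frequencies the expected identity is somewhat higher than 5%, but still far below the DNA figure.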
41
41 E-value
The expected number of alignments with a score ≥ Y when searching a random sequence against a random database.
Theoretically, we could trust any result with an E-value ≤ 1.
In practice, BLAST uses estimations:
- E-values of 10^-4 and lower indicate a significant homology.
- E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).
- E-values between 10^-2 and 1 do not indicate a good homology.
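These rules of thumb can be written as a small decision function. The thresholds are the slide's; the function name, the wording of the labels, and the treatment of E-values above 1 (which the slide does not cover) are assumptions.

```python
def interpret_evalue(evalue):
    """Map a BLAST E-value to the rule-of-thumb categories on the slide."""
    if evalue < 0:
        raise ValueError("E-values cannot be negative")
    if evalue <= 1e-4:
        return "significant homology"
    if evalue <= 1e-2:
        return "check manually (similar domains, maybe non-homologous)"
    if evalue <= 1:
        return "no good evidence of homology"
    # Not covered by the slide; treated here as a chance-level hit.
    return "expected by chance"


for e in (1e-30, 1e-3, 0.5, 5):
    print(e, "->", interpret_evalue(e))
```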
42
42 Filtering low complexity
Low complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA).
Regions of low complexity generate high alignment scores, BUT this does not indicate homology.
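One simple way to picture low-complexity filtering is to mask windows whose letter composition has low Shannon entropy. This is an illustrative stand-in, not the filter BLAST itself applies (NCBI BLAST uses the SEG and DUST programs); the window size, threshold and function names are assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(window):
    """Entropy in bits of the letter composition of a string."""
    n = len(window)
    return -sum((c / n) * log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq, window=8, threshold=1.5, mask_char="X"):
    """Replace every window with entropy below threshold by mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


print(mask_low_complexity("MKTAYIAKQRPPPPPPPPPPGDLSHFEV"))
```

Masked positions are then ignored when scoring, so a proline-rich stretch cannot produce a high-scoring but meaningless hit.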
43
43 BLAST 2 sequences at NCBI
Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine.
BLAST does not use an optimal algorithm but a heuristic.
44
44 Back to NCBI
45
45 BLAST – bl2seq
46
46 blastn – nucleotide blastp – protein Bl2Seq - query
47
47 Bl2seq results
48
48 Bl2seq results Match Dissimilarity Gaps Similarity Low complexity
49
49 BLAST – programs
Query: DNA or protein. Database: DNA or protein.
50
50 BLAST – Blastp
51
51 Blastp - results
52
52 Blastp – results (cont’)
53
53 Blastp – acquiring sequences
54
54 Blastp – acquiring sequences (cont’)
55
55 Fasta format – multiple sequences
>gi|4504351|ref|NP_000510.1| delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi|4504349|ref|NP_000509.1| beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi|4885393|ref|NP_005321.1| epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi|6715607|ref|NP_000175.1| G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi|28302131|ref|NP_000550.2| A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH
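A multi-sequence FASTA file is just several records back to back, each starting at a `>` line. The following is a minimal reader sketch using only the standard library; the function name is hypothetical and the two records in the usage example are shortened toy data, not full sequences.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened toy records for illustration only.
example = """>seq1 delta globin (truncated)
MVHLTPEEKT
AVNALWGKVN
>seq2 beta globin (truncated)
MVHLTPEEKS
"""
for name, sequence in read_fasta(example):
    print(name, len(sequence))
```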