Exercise: BIOINFORMATIC DATABASES and BLAST

Presentation transcript:

1 Exercise: BIOINFORMATIC DATABASES and BLAST

2 Outline
• NCBI and Entrez
• PubMed
• Google Scholar
• RefSeq
• Swissprot
• FASTA format
• PDB: Protein Data Bank
• Organism-specific databases
• Summary
• Pairwise Sequence Alignment and BLAST
• Overview
• Query type: DNA or Protein

3 What’s in a database?
• Sequences: genes, proteins, etc.
• Full genomes
• Annotation (information about genes/proteins):
  - function
  - cellular location
  - chromosomal location
  - introns/exons
  - protein structure
  - phenotypes, diseases
• Publications

4 NCBI and Entrez
National Center for Biotechnology Information (NCBI)
• One of the largest and most comprehensive collections of databases; it belongs to the NIH (National Institutes of Health), the primary Federal agency for conducting and supporting medical research in the USA
• Entrez is the search engine of NCBI
• Search for: genes, proteins, genomes, structures, diseases, publications and more

5 PubMed: search for published papers
• Yang X, Kurteva S, Ren X, Lee S, Sodroski J. “Subunit stoichiometry of human immunodeficiency virus type 1 envelope glycoprotein trimers during virus entry into host cells”, J Virol May;80(9):

6 Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J virol[TA]
For the full list of field tags, go to Help -> Search Field Descriptions and Tags
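A fielded query like the one above can also be assembled programmatically. The following is a minimal sketch, not part of the original exercise: the helper names `build_query` and `esearch_url` are made up for illustration, and the URL targets NCBI's public E-utilities `esearch` endpoint. The code only builds strings and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(terms):
    """Join (value, tag) pairs into a fielded query: value[TAG] AND value[TAG] ..."""
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


def esearch_url(query, db="pubmed", retmax=20):
    """Return the esearch URL for a query (no network access happens here)."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query, "retmax": retmax})


# The query from the slide: author, title word, publication date, journal.
query = build_query([
    ("Yang", "AU"),
    ("glycoprotein", "TI"),
    ("2006", "DP"),
    ("J virol", "TA"),
])
print(query)
print(esearch_url(query))
```

Keeping the terms as (value, tag) pairs makes it easy to swap a tag or add a term without retyping the whole query string.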

7 Exercise
• Retrieve all publications in which the first author is Pe'er I and the last author is Shamir R

8 Using limits
Retrieve the publications of Friedman N in the journals Bioinformatics and Journal of Computational Biology from the last 5 years

9 Google scholar

10

11 NCBI gene & protein databases: GenBank
• GenBank is an annotated collection of all publicly available DNA sequences (and their amino-acid translations)
• Holds 99 billion bases (2008)

12 Searching NCBI for the protein human CD4 Search demonstration

13

14 Using field descriptions, qualifiers, and Boolean operators
• Cd4[GENE] AND human[ORGN]
  or: Cd4[gene name] AND human[organism]
• List of field codes:
• Boolean operators: AND, OR, NOT
Note: do not use the field Protein name [PROT], only [GENE]!

15 This time we directly search in the protein database

16 RefSeq
• RefSeq: a sub-collection of NCBI databases containing only non-redundant, highly annotated entries (genomic DNA, transcript (RNA), and protein products)

17

18 An explanation on GenBank records

19 Accession Numbers
GenBank / EMBL:
  - Two letters followed by six digits, e.g.: AY
  - One letter followed by five digits, e.g.: U12345
RefSeq:
  - RefSeq accession numbers can be distinguished from GenBank accessions by their prefix [2 characters + underscore], e.g.: NP_
  - NM_: mRNA, NP_: protein
SWISSPROT (another protein database):
  - All are six characters
  - Character / format: 1 [O,P,Q]; 2 [0-9]; 3 [A-Z,0-9]; 4 [A-Z,0-9]; 5 [A-Z,0-9]; 6 [0-9]
  - e.g.: P12345 and Q9JJS7
PDB (Protein Data Bank, structure database):
  - One digit followed by three letters or digits, e.g.: 1hxw
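The format rules on this slide translate directly into regular expressions. The sketch below is illustrative only: `PATTERNS` and `classify` are hypothetical names, the patterns cover just the formats listed on the slide (real databases accept more variants), and the PDB pattern allows digits after the first character because the ID 1G9M used later in this presentation contains one. A string such as P12345 fits both the one-letter GenBank format and the Swissprot format, so the function returns every database whose pattern matches.

```python
import re

# One pattern per database, following the slide's descriptions only.
PATTERNS = {
    "GenBank/EMBL": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "RefSeq": re.compile(r"[A-Z]{2}_\d+"),
    "Swissprot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "PDB": re.compile(r"[0-9][A-Z0-9]{3}"),
}


def classify(accession):
    """Return the names of all databases whose format the accession fits."""
    acc = accession.strip().upper()
    return [db for db, pattern in PATTERNS.items() if pattern.fullmatch(acc)]


# NP_123456 is a made-up accession used only to exercise the RefSeq pattern.
for example in ["U12345", "NP_123456", "Q9JJS7", "P12345", "1hxw", "1G9M"]:
    print(example, classify(example))
```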

20 Swissprot
• A protein sequence database which strives to provide a high level of annotation:
  - the function of a protein
  - domain structure
  - post-translational modifications
  - variants
• One entry for each protein

21

22 GenBank vs. Swissprot
[Figure: GenBank results alongside Swiss-Prot results]

23 FASTA format
>gi| |ref|NP_ | CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIK
ILGNQGSFLTKGPSKLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQL
LVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQLELQDSG
TWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGSGELWW
QAERASSSKSWITFDLKNKEVSVKRVTQDPKLQMGKKLPLHLTLPQALPQYAGSGNLTLA
LEAKTGKLHQEVNLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWV
LNPEAGMWQCLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFCV
RCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
The header line starts with ">" and holds the ID/accession followed by a description; the sequence follows on the next lines.
Save accession numbers for future use (makes searching quicker): RefSeq accession number: NP_
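Because the format is just a ">" header line followed by sequence lines, it is easy to read with a few lines of code. The following is a minimal sketch (the function names and the toy records are invented for illustration; libraries such as Biopython provide more robust parsers):

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # finish the previous record
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into (identifier, description) at the first space."""
    identifier, _, description = header.partition(" ")
    return identifier, description.strip()


# Toy records with made-up identifiers, used only to exercise the parser.
toy = """>demo1 first toy record
MNRGVPF
RHLLLVL
>demo2 second toy record
MVHLTPEEK
"""
records = list(parse_fasta(toy))
for header, sequence in records:
    print(split_header(header), len(sequence))
```

The same function handles a file with one record or with many, since a new ">" line simply closes the previous record.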

24 Downloading

25 PDB: Protein Data Bank
• Main database of 3D structures
• Includes ~56,000 entries (proteins, nucleic acids, others)
• Proteins organized in groups, families, etc.
• Is highly redundant: different conformations (e.g., ligand dependent)

26 Human CD4 in complex with HIV gp120
[Figure: structure with gp120 and CD4 labeled; PDB ID 1G9M]

27 Organism-specific databases
• Model organisms have independent databases
• Example: HIV database

28 Summary
• General and comprehensive databases: NCBI, EMBL, DDBJ
• Genome-specific databases: ENSEMBL, UCSC genome browser
• Highly annotated databases:
  - Proteins: Swissprot, RefSeq
  - Structures: PDB

29 And always remember:
1. Google (or any search engine)
2. RTFM - Read the manual!!! (/help/FAQ)

30 Pairwise Sequence Alignment and BLAST

31 What is sequence alignment?
Alignment: comparing two (pairwise) or more (multiple) sequences, searching for a series of identical or similar characters in the sequences.
MVNLTSDEKTAVLALWNKVDVEDCGGE
|| ||  ||||| ||| || |   |||
MVHLTPEEKTAVNALWGKVNVDAVGGE
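The bar line between two aligned sequences simply marks the columns where both sequences carry the same character. A small sketch that rebuilds it for the two sequences on this slide (`match_line` is a hypothetical helper name, and both inputs are assumed to be already aligned and of equal length):

```python
def match_line(a, b):
    """Return a string with '|' under identical columns and ' ' elsewhere."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    return "".join("|" if x == y and x != "-" else " " for x, y in zip(a, b))


top = "MVNLTSDEKTAVLALWNKVDVEDCGGE"
bottom = "MVHLTPEEKTAVNALWGKVNVDAVGGE"
bars = match_line(top, bottom)

print(top)
print(bars)
print(bottom)
print(f"{bars.count('|')} identical positions out of {len(top)}")
```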

32 Local vs. Global
• Global alignment: finds the best alignment across the whole two sequences.
• Local alignment: finds regions of high similarity in parts of the sequences.
Global:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
Local:
ADLG   CDRYFQ
||||   |||| |
ADLG   CDRYYQ
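The difference between the two modes shows up clearly in the dynamic-programming recurrence. The sketch below computes only the best score, in the style of Needleman-Wunsch (global) and Smith-Waterman (local). The scoring values (match +1, mismatch -1, linear gap -1) are arbitrary assumptions chosen for illustration, not values from the slides.

```python
def alignment_score(a, b, match=1, mismatch=-1, gap=-1, local=False):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of substrings.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global mode: leading gaps are penalised.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


s1, s2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global score:", alignment_score(s1, s2))
print("local score: ", alignment_score(s1, s2, local=True))
```

The only changes for local mode are the zero-initialised borders, the floor at zero in every cell, and taking the maximum over the whole table instead of the bottom-right corner.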

33 Evolutionary changes in sequences
In the course of evolution, sequences diverge from the ancestral sequence through random mutations.
Three types of mutations:
1. Insertion: AAGA → AAGTA
2. Deletion: AAGA → AGA
3. Substitution: AAGA → AACA
Insertion + Deletion = Indel

34 Scoring scheme
• Match/mismatch scores: substitution matrices
  - Nucleic acids: transition-transversion
  - Amino acids:
    - Evolution (empirical data) based: PAM, BLOSUM
    - Physico-chemical properties based: Grantham, McLachlan
• Gap penalty
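As a concrete illustration of such a scheme for nucleic acids, the sketch below scores an existing alignment with a transition-transversion rule and a linear gap penalty. All numeric values (+1, -1, -2, and a gap of -2) are invented for the example. Real searches use published matrices and usually affine gap penalties.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def nucleotide_score(x, y, match=1, transition=-1, transversion=-2):
    """Score one aligned pair of nucleotides."""
    if x == y:
        return match
    # A transition stays within purines or within pyrimidines.
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def score_alignment(a, b, gap=-2):
    """Sum column scores of two aligned sequences ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for x, y in zip(a, b):
        total += gap if "-" in (x, y) else nucleotide_score(x, y)
    return total


print(score_alignment("AAGA", "AACA"))    # one substitution (G/C)
print(score_alignment("AAGTA", "AAG-A"))  # one indel
```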

35 Computation time: How do we search a database?
• If each pairwise alignment takes 1/10 of a second, and the database contains 10^7 sequences, one search will take 10^6 seconds ≈ 11.6 days.
• 150,000 searches (at least!!) are performed per day.
• >82,000,000 sequence records in GenBank.
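The estimate is a one-line calculation. The snippet below reproduces it with the slide's own figures (10 alignments per second, 10^7 database sequences); the variable names are chosen for this example.

```python
alignments_per_second = 10   # each alignment takes 1/10 of a second
database_size = 10**7        # sequences in the database
seconds_per_day = 24 * 60 * 60

total_seconds = database_size // alignments_per_second
total_days = total_seconds / seconds_per_day

print(f"{total_seconds} seconds = {total_days:.2f} days per search")
```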

36 Conclusion
• Running the exact pairwise alignment algorithm between the query and every DB entry is too slow

37 Heuristic
• Definition: a heuristic is a method that does not guarantee an exact solution (but gives one that is not too bad) and in return takes much less time than the exact algorithm

38 BLAST
• BLAST: Basic Local Alignment Search Tool
• A heuristic for searching a database for similar sequences
• The heuristic is based on restricting the similarity search (such as matching ungapped words instead of single characters).
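The word-matching idea can be sketched as follows: index every k-letter word of the query, then scan the database sequence and report positions where one of those words occurs exactly. This is a deliberately simplified illustration and not the actual BLAST implementation, which (among other things) also considers similar "neighbourhood" words for proteins and extends each hit into a longer alignment. The function names and the word length are assumptions.

```python
from collections import defaultdict


def word_index(sequence, k):
    """Map every k-letter word of the sequence to its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def find_seeds(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a k-letter word matches exactly."""
    index = word_index(query, k)
    seeds = []
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            seeds.append((i, j))
    return seeds


print(find_seeds("ADLGAVF", "XXADLGXX", k=3))
```

Only database sequences that produce at least one seed need any further (more expensive) alignment work, which is where the speed-up over exhaustive alignment comes from.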

39 Query type: DNA or Protein
Query: DNA or Protein
Database: DNA or Protein
• All types of searches are possible:
  blastn – nuc vs. nuc
  blastp – prot vs. prot
  blastx – translated query vs. protein database
  tblastn – protein vs. translated nuc. DB
  tblastx – translated query vs. translated database
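The five combinations can be written down as a small lookup table. This sketch only encodes the list above; `PROGRAMS` and `choose_program` are illustrative names and are not part of any BLAST distribution.

```python
# (query type, database type, compared at protein level after translation?)
PROGRAMS = {
    ("dna", "dna", False): "blastn",
    ("protein", "protein", False): "blastp",
    ("dna", "protein", True): "blastx",    # query is translated
    ("protein", "dna", True): "tblastn",   # database is translated
    ("dna", "dna", True): "tblastx",       # both are translated
}


def choose_program(query, database, translated=False):
    """Return the BLAST program name for a query/database combination."""
    if query != database:
        translated = True  # mixed searches always compare at the protein level
    try:
        return PROGRAMS[(query, database, translated)]
    except KeyError:
        raise ValueError(f"no program for {query} vs. {database}") from None


print(choose_program("dna", "protein"))
print(choose_program("dna", "dna", translated=True))
```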

40 Query type
• Information content in the letters:
  - Nucleotides: 4-letter alphabet
  - Amino acids: 20-letter alphabet
• Two random DNA sequences will, on average, have 25% identity
• Two random protein sequences will, on average, have 5% identity
• The amino-acid sequence is often preferable for homology search
• Selection (and hence conservation) works (mostly) at the protein level
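The 25% and 5% figures follow from the probability that two independently drawn letters coincide, which is the sum of the squared letter frequencies. The sketch below assumes all letters are equally likely. Real sequences have skewed compositions, which raises the chance identity somewhat.

```python
from fractions import Fraction


def expected_identity(frequencies):
    """Probability that two independent random letters are identical."""
    return sum(p * p for p in frequencies)


dna = [Fraction(1, 4)] * 4          # uniform 4-letter alphabet
protein = [Fraction(1, 20)] * 20    # uniform 20-letter alphabet

print("DNA:    ", expected_identity(dna))
print("protein:", expected_identity(protein))
```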

41 E-value
• The number of alignments with a score ≥ Y that we expect to find by chance when searching a random sequence against a random database
• Theoretically, we could trust any result with an E-value ≤ 1
• In practice, BLAST uses estimations:
  - E-values of and lower indicate significant homology.
  - E-values between and should be checked (similar domains, maybe non-homologous).
  - E-values between and 1 do not indicate good homology
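For ungapped local alignments, the expected number of chance hits is commonly modelled with the Karlin-Altschul formula E = K·m·n·e^(−λS), where S is the alignment score, m and n are the query and database lengths, and K and λ depend on the scoring system. The sketch below just evaluates that formula. The parameter values in the usage lines are placeholders chosen for illustration, not values taken from the slides or from any particular BLAST run.

```python
import math


def expected_hits(score, query_length, database_length, k, lam):
    """Karlin-Altschul estimate: E = K * m * n * exp(-lambda * S)."""
    return k * query_length * database_length * math.exp(-lam * score)


# Placeholder parameters, for illustration only.
for s in (30, 60, 90):
    e = expected_hits(s, query_length=300, database_length=10**9, k=0.1, lam=0.3)
    print(f"score {s}: E = {e:.3g}")
```

The formula shows why E-values fall off so sharply: every additional score unit multiplies the expected number of chance hits by the constant factor e^(−λ).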

42 Filtering low complexity
• Low complexity regions: e.g., proline-rich areas (in proteins), Alu repeats (in DNA)
• Regions of low complexity generate high alignment scores, BUT this does not indicate homology
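One simple way to spot low-complexity stretches is to slide a window along the sequence and mask windows whose Shannon entropy is low. The sketch below is only an illustration of that idea: the window size, the threshold, and the function names are assumptions, and production filters (such as SEG for proteins or DUST for DNA) use their own, more refined measures.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the letter distribution within the window."""
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in Counter(window).values())


def mask_low_complexity(sequence, window=12, threshold=1.5, mask_char="X"):
    """Replace every residue inside a low-entropy window with mask_char."""
    masked = list(sequence)
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            for pos in range(start, start + window):
                masked[pos] = mask_char
    return "".join(masked)


# Toy sequence with an artificial proline-rich stretch in the middle.
toy = "MKTAYIAKQRQI" + "P" * 12 + "DVHLTAEEKS"
print(mask_low_complexity(toy))
```

Masked residues are then ignored when looking for matches, so a proline-rich query no longer hits every other proline-rich protein in the database.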

43 BLAST 2 sequences at NCBI
• Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine
• Does not use an optimal algorithm but a heuristic

44 Back to NCBI

45 BLAST – bl2seq

46 Bl2Seq - query
blastn – nucleotide
blastp – protein

47 Bl2seq results

48 Bl2seq results
[Figure labels: Match, Dissimilarity, Gaps, Similarity, Low complexity]

49 BLAST – programs
Query: DNA or Protein
Database: DNA or Protein

50 BLAST – Blastp

51 Blastp - results

52 Blastp – results (cont’)

53 Blastp – acquiring sequences

54 Blastp – acquiring sequences (cont’)

55 FASTA format – multiple sequences
>gi| |ref|NP_ | delta globin [Homo sapiens]
MVHLTPEEKTAVNALWGKVNVDAVGGEALGRLLVVYPWTQRFFESFGDLSSPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFSQLSELHCDKLHVDPENFRLLGNVLVCVLARNFGKEFTPQMQAAYQKVVAGVAN
ALAHKYH
>gi| |ref|NP_ | beta globin [Homo sapiens]
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLG
AFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVAN
ALAHKYH
>gi| |ref|NP_ | epsilon globin [Homo sapiens]
MVHFTAEEKAAVTSLWSKMNVEEAGGEALGRLLVVYPWTQRFFDSFGNLSSPSAILGNPKVKAHGKKVLT
SFGDAIKNMDNLKPAFAKLSELHCDKLHVDPENFKLLGNVMVIILATHFGKEFTPEVQAAWQKLVSAVAI
ALAHKYH
>gi| |ref|NP_ | G-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDAIKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTGVAS
ALSSRYH
>gi| |ref|NP_ | A-gamma globin [Homo sapiens]
MGHFTEEDKATITSLWGKVNVEDAGGETLGRLLVVYPWTQRFFDSFGNLSSASAIMGNPKVKAHGKKVLT
SLGDATKHLDDLKGTFAQLSELHCDKLHVDPENFKLLGNVLVTVLAIHFGKEFTPEVQASWQKMVTAVAS
ALSSRYH