NCBI Molecular Biology Resources

NCBI Molecular Biology Resources A Field Guide part 2 September 30, 2004 ICGEB

Database Searching with Entrez
- Using limits and field restriction to find the human MutL homolog
- Linking and neighboring with MutL
- Mapping SNPs onto structure and the genome

Global Entrez Search

Document Summaries: MutL[All Fields]

Entrez Nucleotides: Limits & Preview/Index Tabs

Entrez Nucleotides: Limits
Field restriction for the term MutL; option: Exclude bulk sequences
Available fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume

Entrez Nucleotides: Limits
MutL restricted to the Title field (Title == Definition line); Exclude Bulk Sequences

Document Summaries: Limits

Adding Terms: Preview/Index
Available index fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title, Uid, Volume

Human MutL Search Results

Human MutL RefSeq GenBank Records

NM_000249: Links

Literature Links: PubMed, OMIM

NM_000249: PubMed Books

Books Link

OMIM: Human Disease Genes Conserved Domain

Sequence Links: Nucleotide, Protein

NM_000249: Related Sequences (similarity): original GenBank mRNAs, original GenBank genomic, Genome Project BAC

The Tax Browser: NCBI’s Taxonomy

Taxonomy Link

The Tax Browser: Nucleotide, Protein, Structures, PopSet

Marsupial PopSets

Mammalian Phylogenetic Study

Batch Downloads

Batch Downloads: FASTA and GI list

Batch Entrez / Entrez-utilities
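As a rough illustration of the Entrez-utilities route to batch downloads, the sketch below only builds ESearch and EFetch request URLs; it does not contact NCBI. The helper names (`esearch_url`, `efetch_fasta_url`), the database name `nuccore`, and the query string are choices made for this example, not taken from the slides.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that returns the UIDs matching an Entrez query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def efetch_fasta_url(db, ids):
    """Build an EFetch URL that downloads the given records as FASTA text."""
    params = {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS}/efetch.fcgi?" + urlencode(params)


# Illustrative query: field-restricted search in the spirit of the MutL example.
search = esearch_url("nuccore", "MutL[Title] AND human[Organism]")
# Illustrative fetch: the RefSeq mRNA accession shown on the Links slides.
fetch = efetch_fasta_url("nuccore", ["NM_000249"])

print(search)
print(fetch)
```

Opening either URL in a browser or with any HTTP client should return the XML hit list or the FASTA record, subject to NCBI's usage guidelines.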

Links Between and Within Nodes
[Diagram: Entrez nodes (PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structures, Genomes, Taxonomy) with links between them. Computational links within nodes: word weight (PubMed abstracts), BLAST (nucleotide and protein sequences), VAST (3-D structures). Phylogeny is also labeled.]

Text: PubMed. Sequence: BLAST. Structure: VAST.

BLAST® Basic Local Alignment Search Tool
Why align sequences? Because it is the best way to infer structure-function relationships for unknown biomolecules.
- Global vs local alignments
- BLAST basics
- MegaBLAST
- Discontiguous MegaBLAST

Global vs Local Alignment
[Diagram: Seq 1 and Seq 2 shown twice, once labeled "Global alignment" and once labeled "Local alignment"]

Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16 aa)
Seq2: HEWASHEREBUTNOWISHERE (21 aa)
Global
  Seq1:  1 W--HEREISWALTERNOW 16
           W  HERE
  Seq2:  1 HEWASHEREBUTNOWISHERE 21
Local
  Seq1:  1 W--HERE  5        Seq1:  1 W--HERE  5
           W  HERE                    W  HERE
  Seq2:  3 WASHERE  9        Seq2: 15 WISHERE 21

Basic Local Alignment Search Tool
- Calculates similarity for biological sequences
- Finds best local alignments
- Heuristic approach based on the Smith-Waterman algorithm
- Searches for matching “words” and then extends the hits
- Uses statistical theory to determine if a match might have occurred by chance
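For reference, the exhaustive Smith-Waterman recurrence that BLAST approximates can be sketched as below. This is not BLAST itself: the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for illustration, and only the best local score is returned, not the alignment.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The two example sequences from the "Global vs Local Alignment" slide.
print(smith_waterman_score("WHEREISWALTERNOW", "HEWASHEREBUTNOWISHERE"))
```

The table has one cell per pair of positions, which is why an exhaustive search against a large database is slow and a word-based heuristic is used instead.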

Global Alignment: Align program (Lipman and Pearson)
Local alignment of the same pair, for comparison:
Human:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
           +A + +    + DL F K D+L I+  T+   W+      GR G IP+NYV + + +++       PW+
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125
Human:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
           GK+ R  AE+ L     E G FLVR+S +   D +L V  +  V+HYRI + H     I     F  L
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194
Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
           L+ HY  +ADGLC  L  P             Y   W ++ + ++L++ IG G+FG+V  G +  N  VA
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264
Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
           VK +K   A    FLAEA +M +LRH  L+ L  V   ++  + IVTE M + +L+ +L+ RGR
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332
Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
            L++ S  V   M YLE  NF+HRDLAARN+L++     K++DFGL     KE      TG + P+KWTA
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401
Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
           PEA    +F+TKSDVWSFGILL EI +FGR+PYP +   +V+ +V+ GY+M  P GCP  +Y++M+ CW
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471
Human: 424 LDAAMRPSFLQLREQLEHI 443
            D   RP+F  L+ +LE +
Worm:  472 SDPDKRPTFETLQWKLEDL 492
Global alignment, start (worm residues 1-70) and end (human to residue 450, worm to residue 500); match lines not shown because their column positions are not recoverable:
human  M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
worm   MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
human  REQLEHI--------KTHELHL
worm   QWKLEDLFNLDSSEYKEASINF 500
Global alignments force a full-length comparison. In this example, the important domains are picked up by both methods, but looking at the first 15 a.a. of the query (60 a.a. of worm) shows how forcing an alignment in this region is not very helpful.

How BLAST Works
1. Make a lookup table of all “words” in the query
2. Scan the database for matching words
3. Initiate extensions from these matches

Words
Query: GTQITVEDLFYNIATRRKALKN
Word size = 3: GTQ TQI QIT ITV TVE VED EDL DLF LFY …
Word size is adjustable: 2 or 3 for protein (default 3); 7 or more for blastn (default 11)
Make a lookup table of words
Neighborhood words (e.g., for ITV): LTV, MTV, ISV, LSV, etc.
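The lookup-table step can be sketched as follows, using the query from this slide. The function name `build_word_table` is made up for the example, and neighborhood words are left out because they additionally need a scoring matrix and a threshold.

```python
def build_word_table(query, word_size=3):
    """Map every overlapping word of the query to its start positions (0-based)."""
    table = {}
    for start in range(len(query) - word_size + 1):
        word = query[start:start + word_size]
        table.setdefault(word, []).append(start)
    return table


query = "GTQITVEDLFYNIATRRKALKN"
table = build_word_table(query, word_size=3)
words = list(table)  # dicts keep insertion order, so this follows the query

print(words[:9])
```

A query of length L yields L - w + 1 overlapping words of size w, so this 22-residue query gives 20 three-letter words.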

Scan Database… Initiate Extensions
Protein BLAST requires two hits: two neighborhood words (threshold score)
  GTQITVEDLFYNI
  <------ TVE FFN ------>
Nucleotide BLAST requires exact matches: exact word match
  ATCGCCATGCTTAATTGGGCTT
  <------ CATGCTTAATT ------>
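A minimal sketch of the nucleotide scan step: index the query's words, then slide along a subject sequence and report exact word hits as (query position, subject position) seeds. The subject string is hypothetical, invented so that it contains the 11-mer from the slide; real BLAST then extends each seed in both directions, which is not shown here.

```python
def find_exact_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a full word matches exactly."""
    table = {}
    for i in range(len(query) - word_size + 1):
        table.setdefault(query[i:i + word_size], []).append(i)
    seeds = []
    for j in range(len(subject) - word_size + 1):
        for i in table.get(subject[j:j + word_size], []):
            seeds.append((i, j))
    return seeds


query = "ATCGCCATGCTTAATTGGGCTT"  # query from the slide
subject = "GGCATGCTTAATTCC"       # hypothetical database sequence
seeds = find_exact_seeds(query, subject, word_size=11)

print(seeds)
```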

An Alignment That BLAST Can’t Find…
Hypera postica cysteine proteinase mRNA vs Boophilus microplus cathepsin L-like proteinase precursor
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Reason: no contiguous exact match of 7 bp.
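The stated reason can be checked along the ungapped alignment shown on this slide with a short sketch. It only measures runs of identical bases in the same columns (one diagonal), which is an assumption of this example; it does not search for shared words at other offsets.

```python
from itertools import groupby

# The two aligned sequences from the slide, top row and bottom row.
seq_top = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
seq_bottom = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)


def longest_identical_run(a, b):
    """Length of the longest stretch of identical characters in matching columns."""
    matches = (x == y for x, y in zip(a, b))
    runs = (sum(1 for _ in group) for same, group in groupby(matches) if same)
    return max(runs, default=0)


run = longest_identical_run(seq_top, seq_bottom)
print(run, "< 7: no seed at word size 7" if run < 7 else ">= 7")
```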

…but the corresponding amino acid sequences are conserved much better

Protein alignment looks good

…and they have the same domains, too

Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect value: E = number of database hits you expect to find by chance (applies to ungapped alignments)
$E = K m n e^{-\lambda S}$
$E = m n 2^{-S'}$
m = query length, n = size of database
K = scale for search space
$\lambda$ = scale for scoring system
$S' = \text{bit score} = (\lambda S - \ln K)/\ln 2$
[Plot: number of alignments against score, marking your score and the expected number of random hits]
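A small numeric sketch of these statistics, assuming the standard Karlin-Altschul forms $E = K m n e^{-\lambda S}$ and $S' = (\lambda S - \ln K)/\ln 2$. The values of `LAMBDA` and `K` below are assumptions (figures commonly quoted for gapped BLOSUM62 searches with gap costs 11/1), and the search-space sizes are made up; none of these numbers come from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S'); equals expect_from_raw for the matching bit score."""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041  # assumed parameters, see the note above
m, n = 450, 1_000_000     # hypothetical query length and database size

bits = bit_score(742, LAMBDA, K)
print(round(bits, 1))
print(expect_from_raw(742, m, n, LAMBDA, K), expect_from_bits(bits, m, n))
```

The point of the bit score is that the second formula needs no knowledge of the scoring system: two searches with different matrices can be compared once their scores are expressed in bits.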

Scoring Systems - Nucleotides
Identity matrix
     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
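The raw score on this slide can be reproduced with a short sketch. Match +1 and mismatch -3 come from the identity matrix; the gap cost of -3 per position is an assumption inferred from the arithmetic 19 - 9 (19 matches, two mismatches and one gap column).

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Score an already-aligned pair of rows column by column."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # assumed gap cost, see the note above
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
print(score)
```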

Scoring Systems - Proteins
Position Independent Matrices
- PAM Matrices (Percent Accepted Mutation)
  - Derived from observation; small dataset of alignments
  - Implicit model of evolution
  - All calculated from PAM1
  - PAM250 widely used
- BLOSUM Matrices (BLOck SUbstitution Matrices)
  - Derived from observation; large dataset of highly conserved blocks
  - Each matrix derived separately from blocks with a defined percent identity cutoff
  - BLOSUM62: default matrix for BLAST
Position Specific Score Matrices (PSSMs)
- PSI- and RPS-BLAST

BLOSUM62
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
- Positive for more likely substitutions
- Negative for less likely substitutions
- Common amino acids have low weights
- Rare amino acids have high weights
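To show how such a matrix is used, the sketch below loads the lower triangle into a symmetric lookup and scores three-letter words against each other, as in the neighborhood-word step. The A row (a single 4) is the standard BLOSUM62 value; it is missing from the flattened slide text. The helper names are made up for this example.

```python
ORDER = "ARNDCQEGHILKMFPSTWYVX"

# Lower triangle of BLOSUM62, one row per residue in ORDER.
LOWER_TRIANGLE = """
4
-1 5
-2 0 6
-2 -2 1 6
0 -3 -3 -3 9
-1 1 0 0 -3 5
-1 0 0 2 -4 2 5
0 -2 0 -1 -3 -2 -2 6
-2 0 1 -1 -3 0 0 -2 8
-1 -3 -3 -3 -1 -3 -3 -4 -3 4
-1 -2 -3 -4 -1 -2 -3 -4 -3 2 4
-1 2 0 -1 -3 1 1 -2 -1 -3 -2 5
-1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5
-2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6
-1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7
1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4
0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5
-3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11
-2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7
0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1
"""


def load_matrix(order, triangle):
    """Expand a lower-triangular text matrix into a symmetric dict of scores."""
    scores = {}
    rows = [line.split() for line in triangle.strip().splitlines()]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            scores[order[i], order[j]] = int(value)
            scores[order[j], order[i]] = int(value)
    return scores


BLOSUM62 = load_matrix(ORDER, LOWER_TRIANGLE)


def word_score(a, b):
    """Sum of substitution scores for two equal-length words."""
    return sum(BLOSUM62[x, y] for x, y in zip(a, b))


for neighbor in ("ITV", "LTV", "MTV", "ISV", "LSV"):
    print(neighbor, word_score("ITV", neighbor))
```

Whether a given neighbor is kept depends on the threshold score in use, which the slides do not state.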

Options for Advanced BLAST: Protein
Example Entrez queries
- proteins all[Filter] NOT mammalia[Organism]
- green plants[Organism]
- srcdb refseq[Properties]
Other advanced
- -W 2 (word size)
- -e 10000 (expect value)
- -v 2000 (descriptions)
- -b 2000 (alignments)
Limit by taxon
- Mus musculus[Organism]
- Mammalia[Organism]
- Viridiplantae[Organism]
Matrix selection
- PAM30: most stringent
- BLOSUM45: least stringent

Options for Advanced BLAST: Nucleotide
Example Entrez queries
- nucleotide all[Filter] NOT mammalia[Organism]
- green plants[Organism]
- biomol mrna[Properties]
- biomol genomic[Properties]
Other advanced
- -W 7 (word size)
- -e 10000 (expect value)
- -v 2000 (descriptions)
- -b 2000 (alignments)

Homology Searches
Find a homolog of human CSK in C. elegans
Query = c-src tyrosine kinase (CSK) NP_004374 (450 aa) [Homo sapiens]
Database = NCBI protein nr
Entrez limit: Caenorhabditis elegans [ORGN]
Program = BLASTP
Query=
>gi|4758078|ref|NP_004374.1| c-src tyrosine kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSIDEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPVKWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWHLDAAMRPSFLQLREQLEHIKTHELHL
Hits to the Conserved Domain Database:

BLAST Graphical Overview
[Figure: graphical overview with the SH3, SH2 and tyr kinase domain regions labeled]

BLAST Alignments
gi|7160701|emb|CAB04427.2| C. elegans KIN-22 protein (corresponding sequence F49B2.5) [Caenorhabditis elegans]
gi|17508235|ref|NP_493502.1| Tyrosine kinase with SH2, SH3 and N myristoylation domains, Drosophila suppressor of pole hole homolog (57.5 kD) (kin-22) [Caenorhabditis elegans]
Length = 507
Score = 290 bits (742), Expect = 1e-78
Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)
Pick one hit . . .
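The statistics line of a hit is easy to pick apart programmatically. The sketch below parses the counts shown on this slide and recomputes the percentages; truncating to a whole percent reproduces the printed values here, which is an observation about this example rather than a documented rule.

```python
import re

STATS = "Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)"


def parse_stats(line):
    """Return {name: (count, alignment_length, truncated_percent)}."""
    parsed = {}
    for name, count, length in re.findall(r"(\w+) = (\d+)/(\d+)", line):
        count, length = int(count), int(length)
        parsed[name.lower()] = (count, length, 100 * count // length)
    return parsed


stats = parse_stats(STATS)
print(stats)
```

Identities count identical residues, positives additionally count substitutions with a positive matrix score, and all three fractions share the alignment length (440 columns, gaps included) as denominator.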

3D Domains
SH2, SH3, TyrKc
In this example, 3D Domains and Conserved Domains are similar, with the Tyr kinase catalytic domain (CD) composed of two 3D domains. Yellow is the catalytic loop.

Low Complexity Filtering
[Panels: Filtered, Unfiltered]
sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)
Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
           S++S  SSS+S SS +  +     ++S      +  +  S   S   S+ +    E K
Sbjct: 29  SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
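To give a feel for what "low complexity" means, the sketch below flags windows with low Shannon entropy. This is a simplified stand-in, not the filter BLAST applies (protein searches use the SEG program); the window length and threshold are arbitrary assumptions for the example.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the residue composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, window=12, threshold=2.2):
    """Start positions (0-based) of windows whose entropy is below threshold."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) < threshold
    ]


# Subject segment from the unfiltered hit on this slide.
sbjct = "SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS"
flagged = low_complexity_windows(sbjct)
print(len(flagged), "of", len(sbjct) - 12 + 1, "windows flagged")
```

Runs dominated by one or two residues, like the serine-rich stretch above, score well against many unrelated proteins, which is why they are masked before searching.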

Intermission?