NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 August 2-3, 2005
Web Access: Entrez (text), BLAST (sequence), VAST (structure)
Searching with Sequences: Why do we need similarity searching?
  To identify and annotate sequences with…
    – incomplete (or no) annotations (GenBank)
    – incorrect annotations
  To assemble genomes
  To explore evolutionary relationships by…
    – finding homologous molecules
    – developing phylogenetic trees
  NOTE: Similar sequences may NOT have similar function!
Basic Local Alignment Search Tool (BLAST)
  Widely used similarity search tool
  Heuristic approach based on the Smith-Waterman algorithm
  Finds best local alignments
  Provides statistical significance
  All combinations (DNA/protein) of query and database:
    – DNA vs DNA
    – DNA translation vs Protein
    – Protein vs Protein
    – Protein vs DNA translation
    – DNA translation vs DNA translation
  WWW, standalone, and network clients
Global vs Local Alignment
  (figure: Seq 1 and Seq 2 shown twice, once as a global alignment and once as a local alignment)
NCBI FieldGuide Global vs. Local Alignment Human: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84 +A DL F K D+L I+ T+ W+ GR G IP+NYV PW+ Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125 Human: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151 GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194 Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220 L+ HY +ADGLC L P Y W L++ IG G+FG+V G + N VA Worm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264 Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289 VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332 Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353 L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401 Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423 PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471 Human: 424 LDAAMRPSFLQLREQLEHI 443 D RP+F L+ +LE + Worm: 472 SDPDKRPTFETLQWKLEDL 492 human M SAIQ AAWPSGT ECIAKYNFHG M S.. AA SG...A.... worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA human REQLEHI KTHELHL..::. :... worm QWKLEDLFNLDSSEYKEASINF 500 Align program (Lipman and Pearson) BLASTp
Nucleotide Words
  Query: GTACTGGACATGGACCCTACAGGAA
  Word Size = 11:
    GTACTGGACAT
    TACTGGACATG
    ACTGGACATGG
    CTGGACATGGA
    TGGACATGGAC
    GGACATGGACC
    GACATGGACCC
    ACATGGACCCT
  Make a lookup table of words
  Minimum word size = 7; blastn default = 11; megablast default = 28
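The word list and lookup table on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not BLAST's actual data structure; the function names `words` and `build_lookup` are made up for the example.

```python
def words(query: str, word_size: int) -> list[str]:
    """Return every overlapping word (k-mer) of the given size, in query order."""
    return [query[i:i + word_size] for i in range(len(query) - word_size + 1)]


def build_lookup(query: str, word_size: int) -> dict[str, list[int]]:
    """Map each word to the 0-based query offsets at which it starts."""
    table: dict[str, list[int]] = {}
    for offset, word in enumerate(words(query, word_size)):
        table.setdefault(word, []).append(offset)
    return table


query = "GTACTGGACATGGACCCTACAGGAA"       # query from the slide
nucleotide_words = words(query, 11)       # blastn default word size
lookup = build_lookup(query, 11)

# The first eight words are the ones listed on the slide.
for word in nucleotide_words[:8]:
    print(word, lookup[word])
```

A 25-base query yields 25 - 11 + 1 = 15 words; the slide lists the first eight. The same functions work for protein queries with a word size of 2 or 3.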
Protein Words
  Query: GTQITVEDLFYNIATRRKALKN
  Word Size = 3: GTQ TQI QIT ITV TVE VED EDL DLF ...
  Neighborhood Words (for ITV): LTV, MTV, ISV, LSV, etc.
  Make a lookup table of words
  Word Size can be 2 or 3 (default = 3)
Initial Matches and Extensions
  Protein BLAST requires two neighboring matches within 40 aa
    (figure: GTQITVEDLFYNI, neighborhood words, two matches)
  Nucleotide BLAST requires one exact match
    (figure: ATCGCCATGCTTAATTGGGCTT, exact word match, one match)
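The two-match requirement for protein searches can be sketched as follows. The slide only says "two neighboring matches within 40 aa"; the additional conditions used here (both hits on the same diagonal and not overlapping) come from the published description of gapped BLAST and are stated as assumptions. The function name `two_hit_seeds` and the hit representation are hypothetical.

```python
from collections import defaultdict


def two_hit_seeds(hits, window=40, word_size=3):
    """Find pairs of word hits that could trigger an extension.

    hits: iterable of (query_pos, subject_pos) word matches.
    Assumption: two hits qualify when they lie on the same diagonal
    (subject_pos - query_pos), do not overlap, and start at most
    `window` residues apart.
    Returns a list of (diagonal, first_query_pos, second_query_pos).
    """
    by_diagonal = defaultdict(list)
    for query_pos, subject_pos in hits:
        by_diagonal[subject_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Only consecutive hits on a diagonal are compared (a simplification).
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            if word_size <= distance <= window:
                seeds.append((diagonal, first, second))
    return seeds


# Two hits on diagonal 5 that are 10 residues apart, plus an unrelated hit.
example_seeds = two_hit_seeds([(0, 5), (10, 15), (3, 50)])
print(example_seeds)
```

Nucleotide BLAST skips this step: a single exact word match is enough to start an extension.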
NCBI FieldGuide An alignment that BLAST can’t find 1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG || | || || || | || || || || | ||| |||||| | | || | ||| | 1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT | || || || ||| || | |||||| || | |||||| ||||| | | 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT 121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC |||| || ||||| || || | | |||| || ||| 121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
An Alignment BLAST Can Make
  Solution: compare protein sequences; BLASTX
  BLAST 2 Sequences (blastx) output:
    Score = 290 bits (741), Expect = 7e-77
    Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
    Frame = +3
Scoring Systems - Nucleotides
  Identity matrix:
        A   G   C   T
    A  +1  -3  -3  -3
    G  -3  +1  -3  -3
    C  -3  -3  +1  -3
    T  -3  -3  -3  +1
  CAGGTAGCAAGCTTGCATGTCA
  || ||||||||||||  |||||
  CACGTAGCAAGCTTG-GTGTCA
  raw score = 19 - 9 = 10
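The raw score on this slide can be reproduced with a short Python sketch. The alignment has 19 matches, 2 mismatches, and 1 gap position. The flat gap penalty of 3 is an assumption chosen because it reproduces the slide's total of 9 penalty points (real BLAST searches use affine gap costs), and the function name `raw_score` is illustrative.

```python
def raw_score(top: str, bottom: str, match: int = 1, mismatch: int = -3, gap: int = -3) -> int:
    """Sum column scores for two aligned strings of equal length.

    match/mismatch follow the identity matrix on the slide;
    gap = -3 per gap column is an assumption made for this example.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap
        elif a == b:
            score += match
        else:
            score += mismatch
    return score


slide_score = raw_score("CAGGTAGCAAGCTTGCATGTCA", "CACGTAGCAAGCTTG-GTGTCA")
print(slide_score)  # 19 matches - 2*3 mismatches - 3 gap = 10
```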
Scoring Systems - Proteins
  Position Independent Matrices
    PAM Matrices (Percent Accepted Mutation)
      – Derived from observation; small dataset of alignments
      – Implicit model of evolution
      – All calculated from PAM1
      – PAM250 widely used
    BLOSUM Matrices (BLOck SUbstitution Matrices)
      – Derived from observation; large dataset of highly conserved blocks
      – Each matrix derived separately from blocks with a defined percent identity cutoff
      – BLOSUM62 - default matrix for BLAST
  Position Specific Score Matrices (PSSMs)
    PSI- and RPS-BLAST
BLOSUM62
  (matrix figure; rows and columns A R N D C Q E G H I L K M F P S T W Y V X; first entries A/A = 4, R/A = -1, R/R = 5)
  Common amino acids have low weights
  Rare amino acids have high weights
  Negative for less likely substitutions
  Positive for more likely substitutions
Gapped Alignments
  Gapping provides more biologically realistic alignments
  Statistical behavior is not completely understood for gapped alignments
  Gapped BLAST parameters must be found by simulations for each matrix
  Affine gap cost for a gap of length k = -(a + bk)
    a = gap open penalty
    b = gap extend penalty
  A gap of length 1 receives the score -(a + b)
Scores
  V D S – C Y
  V E T L C F
  (figure: the alignment scored under a BLOSUM matrix and under a PAM matrix)
  Simply add the scores for each pair of aligned residues
  Different matrices produce different scores!
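As a worked example of "add the scores for each pair", the sketch below scores the alignment from this slide with BLOSUM62 values and the affine gap cost -(a + bk) from the previous slide. Only the five matrix entries needed here are included. The gap costs a = 11 and b = 1 are assumed (they are the usual blastp defaults for BLOSUM62), and all function names are illustrative.

```python
# The few BLOSUM62 entries needed for this alignment (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("V", "V"): 4,
    ("D", "E"): 2,
    ("S", "T"): 1,
    ("C", "C"): 9,
    ("Y", "F"): 3,
}


def pair_score(a: str, b: str) -> int:
    """Look up a residue pair in either orientation."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    if (b, a) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(b, a)]
    raise KeyError(f"pair {a}/{b} is not in the example subset")


def gap_score(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Affine gap cost -(a + b*k) for a gap of k positions."""
    return -(gap_open + gap_extend * length)


def alignment_score(top: str, bottom: str) -> int:
    """Add pair scores and charge each run of gap columns once."""
    score = 0
    gap_len = 0
    gap_side = None
    for a, b in zip(top, bottom):
        side = "top" if a == "-" else "bottom" if b == "-" else None
        if gap_len and side != gap_side:
            score += gap_score(gap_len)   # close the gap run that just ended
            gap_len = 0
        if side is None:
            score += pair_score(a, b)
        else:
            gap_len += 1
        gap_side = side
    if gap_len:
        score += gap_score(gap_len)
    return score


total = alignment_score("VDS-CY", "VETLCF")
print(total)  # 4 + 2 + 1 - 12 + 9 + 3 = 7
```

Swapping in a different matrix (for example a PAM matrix) changes the individual pair scores and therefore the total, which is the point the slide makes.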
Local Alignment Statistics
  High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments)
  (figure: number of alignments plotted against score)
  $E = K m n e^{-\lambda S}$
  $E = m n \, 2^{-S'}$
    $K$ = scale for search space
    $\lambda$ = scale for scoring system
    $S'$ = bitscore = $(\lambda S - \ln K) / \ln 2$
  Expect Value: E = number of database hits you expect to find by chance
    (formula labels: size of database, your score, expected number of random hits)
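The two forms of the E-value formula are equivalent, which a small Python sketch can confirm. The parameter values λ = 0.267 and K = 0.041 are assumptions (values commonly reported for gapped BLOSUM62 searches with gap costs 11/1), and the query length m and database size n are made-up numbers. With those parameters, the raw score 741 from the blastx output shown earlier converts to roughly the reported 290 bits.

```python
import math


def bit_score(raw: float, lam: float, k: float) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw: float, m: float, n: float, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2**(-S')"""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041     # assumed scoring-system parameters
m, n = 331, 1_000_000_000    # hypothetical query length and database size

bits = bit_score(741, LAMBDA, K)
e_raw = expect_from_raw(741, m, n, LAMBDA, K)
e_bits = expect_from_bits(bits, m, n)
print(round(bits, 1), e_raw, e_bits)
```

Because the bit score already folds in λ and K, bit scores from searches with different matrices can be compared directly, while raw scores cannot.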
Advanced BLAST Options: Nucleotide
  Example Entrez Queries
    nucleotide all[Filter] NOT mammalia[Organism]
    green plants[Organism]
    biomol mrna[Properties]
    gbdiv est[Properties] AND rat[organism]
  Other Advanced
    -e expect value
    -v 2000 descriptions
    -b 2000 alignments
Advanced BLAST Options: Protein
  Matrix Selection
    PAM30 -- most stringent
    BLOSUM45 -- least stringent
  Example Entrez Queries
    proteins all[Filter] NOT mammalia[Organism]
    green plants[Organism]
    srcdb refseq[Properties]
  Other Advanced
    -e expect value
    -v 2000 descriptions
    -b 2000 alignments
  Limit by taxon
    Mus musculus[Organism]
    Mammalia[Organism]
    Viridiplantae[Organism]
Low Complexity Filtering (Filtered vs. Unfiltered)
  sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
  Length = 414
  Score = 40.2 bits (92), Expect =
  Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)
  Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
             S++S SSS+S SS S + + S S S+ + E K
  Sbjct: 29  SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
Low Complexity Filter
  >gi| |sp|Q96RF0|SNXI_HUMAN Sorting nexin 18
  Length = 628
  Score = 1048 bits (2710), Expect = 0.0
  Identities = 528/628 (84%), Positives = 528/628 (84%)
  Query: 1  MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
            MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR
  Sbjct: 1  MALRARALYDFRSENPGEISLREHEVLSLCSEQDIEGWLEGVNSRGDRGLFPASYVQVIR 60
  Query: 61 XXXXXXXXXXXXXXXXXXXNVPPGGFEXXXXXXXXXXXXXXXXXXXXXXXXXXXXXSTFQ 120
            NVPPGGFE STFQ
  Sbjct: 61 APEPGPAGDGGPGAPARYANVPPGGFEPLPVAPPASFKPPPDAFQALLQPQQAPPPSTFQ
  (the X runs mark low complexity sequence masked in the query)
Neighbors: Precomputed BLAST (Nucleotide, Protein)
  Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.
Blink – Protein BLAST Alignments
  Lists only 200 hits
  List is nonredundant
Blink – Best Hits
Megablast: NCBI’s Genome Annotator
  Long alignments of similar DNA sequences
  Greedy algorithm
  Concatenation of query sequences
  Faster than blastn; less sensitive
NCBI FieldGuide MegaBLAST AI AI AI BE C:\seq\hs.4.fsa > gnl|UG|Hs#S qd43b11.x1 Homo sapiens cDNA, 3' end CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTG GTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCT TTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGT GACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCG TCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAAC CACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC > gnl|UG|Hs#S qv37f11.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > gnl|UG|Hs#S qv33c06.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > gnl|UG|Hs#S e65f04.x1 Homo sapiens cDNA, 3' end TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGT TTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCT CCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAA GGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACA CCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAA AACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTC CTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAA GCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
Discontiguous Megablast
  Uses discontiguous word matches
  Better for cross-species comparisons
Templates for Discontiguous MegaBLAST
  Twelve templates (patterns shown as figures): W = 11 or 12; t = 16, 18, or 21; each in a coding and a non-coding version
  Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
Nucleotide vs. Protein BLAST
  Comparing ADSS from H. sapiens and A. thaliana
           aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
  Human:    N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
            +  +  V  +     V  L  G     Q  W  G  D  E  G
  A.th.:    S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
           agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
  BLASTp finds three matching words
  BLASTn finds no match, because the two DNA sequences share no identical 7 bp word (the minimum word size)
  Protein searches are generally more sensitive than nucleotide searches.
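The claim about word matches can be checked directly with a short Python sketch that compares exact words only. It does not model BLASTp's neighborhood words or the way BLAST groups overlapping hits, so the number of identical protein words it prints need not equal the slide's count. The helper name `word_set` is illustrative.

```python
def word_set(sequence: str, word_size: int) -> set[str]:
    """All distinct overlapping words of the given size."""
    return {sequence[i:i + word_size] for i in range(len(sequence) - word_size + 1)}


# Sequences from the slide (ADSS, H. sapiens vs. A. thaliana).
human_dna = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
plant_dna = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"
human_protein = "NRVTVVLGAQWGDEG"
plant_protein = "SQVSGVLGCQWGDEG"

# Exact 7 bp words shared by the DNA sequences (minimum blastn word size).
shared_dna_words = word_set(human_dna, 7) & word_set(plant_dna, 7)

# Exact 3-residue words shared by the protein sequences (default blastp word size).
shared_protein_words = word_set(human_protein, 3) & word_set(plant_protein, 3)

print(sorted(shared_dna_words))
print(sorted(shared_protein_words))
```

The DNA comparison comes back empty because synonymous codon differences interrupt every run of identical bases before it reaches 7 bp, while the conserved residues still give identical protein words.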
Translated BLAST
  Query                     Database                  Program
  Nucleotide (translated)   Protein                   blastx
  Protein                   Nucleotide (translated)   tblastn
  Nucleotide (translated)   Nucleotide (translated)   tblastx
  Particularly useful for nucleotide sequences without protein annotations, such as ESTs or genomic DNA
Genomic BLAST
  These pages provide customized nucleotide and protein databases for each genome
  If a Map Viewer is available, the BLAST hits can be viewed on the maps
BLAST the Chicken Genome
  Program
  Accession for human TPO mRNA
BLAST Hit on the Genome
BLASTn Hit on the Map Viewer
TBLASTN Results Using NP_000538
Linking Protein Sequence, Structure, and Function
  (diagram: BLASTp, PSI-BLAST, RPS-BLAST, and VAST linking sequence, structure, function (pfam, smart), and structure + function (cd))
Position Specific Substitution Rates
  Active site serine
  Weakly conserved serine
Position Specific Score Matrix (PSSM)
  (matrix figure: columns A R N D C Q E G H I L K M F P S T W Y V; rows are positions from 206 onward, D G V I S S C N G D S G G P L N C Q A)
  Serine is scored differently in these two positions
  Active site nucleophile
PSI-BLAST
  Create your own PSSM: Confirming relationships of purine nucleotide metabolism proteins
  (diagram labels: query, BLOSUM62, PSSM, Alignment)
PSI BLAST
  >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE AMINOH
  MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGFLAKFDYY
  VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVDLVNQGLQ
  EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAYEGAVKNG
  RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGAWDPKTTH
  VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKKELLERLY
  e value cutoff for PSSM
PSI Results: Initial BLAST Run
First PSSM Search
  Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence
  Just below threshold, another nucleotide metabolism enzyme
Entrez Domains (CDD)
  Pfam (Sanger; pfam01234): Pfam-A seeds, HMM based models representing a wide variety of functional domains derived from SWISS-PROT
  SMART (EMBL; smart00123): HMM based models originally concentrating on eukaryotic signaling domains, now expanding
  COG (NCBI; COG0123): BLAST based alignments derived from complete proteomes of prokaryotes
  CD (NCBI; cd01234): NCBI curated domains based on sequence and structural alignments
  Single domains: Pfam, SMART, CD; protein families: COG
Protein Links: Domains
Results of a CD-Search
  CD, SMART, Pfam
  Click on a colored bar to align your sequence to the CD
CDD Record – heme peroxidases
  aligned query
  red = high conservation; blue = low conservation
Curated CD Record
  Curated CDs (cd12345) are based on sequence and structure alignments
  Annotated features
  Structural evidence
  aligned query
Blink: Sequence to Structure
  related structures
Related Structures
  Cn3D
Entrez Structure
  MMDB: Molecular Modeling Data Base
  Derived from experimentally determined PDB records
  Adds value to PDB records by:
    – Adding explicit chemical bonding information
    – Validating and indexing the sequences
    – Annotating 3D domains and secondary structure
    – Adding links to CDD, Taxonomy, PubMed
    – Converting PDB data to ASN.1
  Structure neighbors determined by Vector Alignment Search Tool (VAST)
Structure Summary Page
  Conserved Domains
  VAST Neighbors for chain C (domain 0)
  VAST Neighbors for domain 2
  Cn3D
VAST: Structure Neighbors
  Vector Alignment Search Tool
  For each 3D domain, locate SSEs (secondary structure elements), and represent them as individual vectors (example: Human IL-4)
  VAST uses 3D Domains only! Whole polypeptides are assigned 3D domain 0 (zero).
VAST Neighbors
  1D2V, 1Q4G
  3D domains!
Viewing a VAST Alignment
  RMSD in Angstroms
  Sequence percent identity
  VAST P value
  Cn3D
Submitting a PDB File to VAST
  New! Redesigned interface!
  This is the best way to convert PDB into MMDB format!
Entrez PubChem (New!)
  PC Substance: Primary database of chemical samples
  PC Compound: Derived database of known chemicals from PC Substance records
  PC BioAssay: Primary database of bioactivity screens of samples in PC Substance
Links from Structure
  N-acetylglucosamine, heme, mannose, fucose
Search for thyroxine
  ChemID     24
  KEGG        4
  DTP-NCI     3
  NIST        3
  Biocyc      2
  BIND        2
  Chembank    2
  NIAID       1
  TOTAL      41
Sequence Polymorphisms
  SNP (General Polymorphisms)
    – Primary database of submitted SNPs
    – Curated database of reference SNPs
    – Contains more than just SNPs: true SNPs, MNP (multiple nucleotide), insertions, deletions, microsatellites, mixed, no variation (constant)
  OMIM (Human Phenotypes)
    – Clinical literature database
    – Curated at Johns Hopkins Univ
    – Links human genes and genetic disorders to human disease
    – Lists allelic variants that have clinical consequences
  Variations in SNP are not necessarily in OMIM, and vice versa!
Linking to SNP
  Entrez Gene - TPO
  Links to SNP are also available from Nucleotide and Protein
Entrez SNP
  primary data: ss#
  SNP UID: rs#
Find Non-synonymous SNPs
  #7 AND coding nonsynon[Function Class]
  Function Class
Non-synonymous TPO SNPs
  Link to Map Viewer
  View all SNPs in locus
  Link to related 3D structures
GeneView in dbSNP
Links to OMIM
  Entrez Gene - TPO
  Links to OMIM are also available from Nucleotide and Protein
OMIM Record
Explore a Disease SNP
  799
Curated CD Record
  E799
  Cn3D
For More Information…
  General addresses
  The (free!) NCBI Newsletter
  The NCBI Handbook
  The NCBI Education Page
  Follow the link from the NCBI Home Page