NCBI Molecular Biology Resources: A Field Guide, part 2. September 30, 2004, ICGEB
Database Searching with Entrez: Using limits and field restriction to find human MutL homolog; Linking and neighboring with MutL; Mapping SNPs onto structure and the genome
Global Entrez Search
Document Summaries: MutL[All Fields]
Entrez Nucleotides: Limits & Preview/Index Tabs
Entrez Nucleotides: Limits Accession All Fields Author Name EC/RN Number Feature key Filter Gene Name Issue Journal Name Keyword Modification Date Organism Page Number Primary Accession Properties Protein Name Publication Date SeqID String Sequence Length Substance Name Text Word Title Uid Volume Field Restriction MutL Exclude bulk sequences
Entrez Nucleotides: Limits MutL Title == Definition Exclude Bulk Sequences
Document Summaries: Limits
Adding Terms: Preview/Index Accession All Fields Author Name EC/RN Number Feature key Filter Gene Name Issue Journal Name Keyword Modification Date Organism Page Number Primary Accession Properties Protein Name Publication Date SeqID String Sequence Length Substance Name Text Word Title Uid Volume
Human MutL Search Results
Human MutL RefSeq GenBank Records
NM_000249: Links
Literature Links PubMed OMIM
NM_000249: PubMed Books
Books Link
OMIM: Human Disease Genes Conserved Domain
Sequence Links Nucleotide Protein
NM_000249: Related Sequences similarity Original GenBank mRNAs Original GenBank genomic Genome Project BAC
The Tax Browser NCBI’s Taxonomy Taxonomy Link
Taxonomy Link
The Tax Browser Nucleotide Protein Structures Popset
Marsupial PopSets
Mammalian Phylogenetic Study
Batch Downloads
Batch Downloads: FASTA and GI list
Batch Entrez / Entrez-utilities
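A field-restricted Entrez query like the MutL search can also be sent through the Entrez utilities. The sketch below only builds an ESearch URL and does not contact NCBI. The endpoint address is the present-day E-utilities URL, and the helper name `esearch_url` and the example term are illustrative assumptions, not taken from the slides.

```python
from urllib.parse import urlencode

# Present-day ESearch endpoint (assumption: not shown on the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for an Entrez query; no network access happens here."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Illustrative field-restricted term in the spirit of the MutL example.
url = esearch_url("nucleotide", "MutL[Title] AND Homo sapiens[Organism]")
print(url)
```

Fetching the URL would return an XML list of matching UIDs, which is the kind of list a batch download starts from.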
Links Between and Within Nodes [Figure: Entrez nodes (PubMed abstracts, Taxonomy, 3-D Structures, Genomes, Nucleotide sequences, Protein sequences) joined by links; computed neighbors within nodes are labeled Word weight, Phylogeny, VAST, and BLAST]
Text: PubMed; Sequence: BLAST; Structure: VAST
BLAST® Basic Local Alignment Search Tool. Why align sequences? Because it is the best way to infer structure-function relationships for unknown biomolecules. Topics: Global vs local alignments; BLAST basics; MegaBLAST; Discontiguous MegaBLAST
Global vs Local Alignment [Figure: Seq 1 and Seq 2 drawn as a global alignment and as a local alignment]
Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16aa)
Seq2: HEWASHEREBUTNOWISHERE (21aa)
Global
Seq1:  1   W--HEREISWALTERNOW 16
           W  HERE
Seq2:  1 HEWASHEREBUTNOWISHERE 21
Local
Seq1:  1 W--HERE 5
         W  HERE
Seq2:  3 WASHERE 9
Seq1:  1 W--HERE 5
         W  HERE
Seq2: 15 WISHERE 21
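The difference between the two alignment modes can be sketched with textbook dynamic programming: Needleman-Wunsch for global and Smith-Waterman for local. The scoring values below (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, so the best-scoring alignments need not be the ones drawn on the slide.

```python
def align_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Best alignment score by dynamic programming.

    local=False gives a Needleman-Wunsch (global) score,
    local=True gives a Smith-Waterman (local) score.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for a[:i] against b[:j].
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


SEQ1 = "WHEREISWALTERNOW"
SEQ2 = "HEWASHEREBUTNOWISHERE"
print("global:", align_score(SEQ1, SEQ2, local=False))
print("local: ", align_score(SEQ1, SEQ2, local=True))
```

The local score is at least 4 because both sequences contain HERE, while the global score also has to pay for every unmatched region.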
Basic Local Alignment Search Tool Calculates similarity for biological sequences Finds best local alignments Heuristic approach based on Smith-Waterman algorithm Searches for matching “words” and then extends the hits Uses statistical theory to determine if a match might have occurred by chance
Global Alignment: Align program (Lipman and Pearson)
Local alignment:
Human:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
           +A + +    + DL F K D+L I+  T+   W+      GR G IP+NYV + + +++       PW+
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125
Human:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
           GK+ R  AE+ L     E G FLVR+S +   D +L V  +  V+HYRI + H     I     F  L
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194
Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
           L+ HY  +ADGLC  L  P             Y   W ++ + ++L++ IG G+FG+V  G +  N  VA
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264
Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
           VK +K   A    FLAEA +M +LRH  L+ L  V   ++  + IVTE M + +L+ +L+ RGR
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332
Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
            L++ S  V   M YLE  NF+HRDLAARN+L++     K++DFGL     KE      TG + P+KWTA
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401
Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
           PEA    +F+TKSDVWSFGILL EI +FGR+PYP +   +V+ +V+ GY+M  P GCP  +Y++M+ CW
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471
Human: 424 LDAAMRPSFLQLREQLEHI 443
            D   RP+F  L+ +LE +
Worm:  472 SDPDKRPTFETLQWKLEDL 492
Global alignment, N-terminal end:
human M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
      M              S ..                      AA  SG.            . .A ... .
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA
      1                 20                  40                  60
Global alignments force a full-length comparison. In this example, the important domains are picked up by both methods, but the first 15 a.a. of the query (60 a.a. of worm) show that forcing an alignment in this region is not very helpful.
Global alignment, C-terminal end:
       440               450
human REQLEHI--------KTHELHL
      . .:: .        :   ...
worm  QWKLEDLFNLDSSEYKEASINF
                  500
How BLAST Works Make a lookup table of all “words” in the query Scan the database for matching words Initiate extensions from these matches
Words. Query: GTQITVEDLFYNIATRRKALKN. Word size = 3: GTQ TQI QIT ITV TVE VED EDL DLF LFY … Word size is adjustable: 2 or 3 for protein (3 default); ≥ 7 for blastn (11 default). Make a lookup table of words. Neighborhood words (e.g. for ITV): LTV, MTV, ISV, LSV, etc.
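The word list for the query can be generated with a sliding window. This is a minimal sketch of the lookup-table idea only; the function names are illustrative and BLAST's actual data structures are more elaborate.

```python
def words(query, size=3):
    """All overlapping words of the given size, in query order."""
    return [query[i:i + size] for i in range(len(query) - size + 1)]


def lookup_table(query, size=3):
    """Map each word to the list of query positions (0-based) where it starts."""
    table = {}
    for pos, word in enumerate(words(query, size)):
        table.setdefault(word, []).append(pos)
    return table


QUERY = "GTQITVEDLFYNIATRRKALKN"
print(words(QUERY)[:9])
print(lookup_table(QUERY)["ITV"])
```

A query of length L yields L - size + 1 words, so the 22-residue example gives 20 words of size 3.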
Scan Database…Initiate Extensions Protein BLAST requires two hits GTQITVEDLFYNI <------ TVE FFN ------> two neighborhood words (threshold score) Nucleotide BLAST requires exact matches ATCGCCATGCTTAATTGGGCTT <------ CATGCTTAATT ------> exact word match
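For nucleotide searches, the scan step amounts to finding exact word matches between query and database sequence. The sketch below indexes the query words and scans a subject for them; the subject sequence is a made-up example (an assumption), and only the query and the word CATGCTTAATT come from the slide.

```python
def exact_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs, 0-based, of exact word matches."""
    # Index every word of the query by its start position.
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    # Scan the subject and report every word that is also in the query.
    hits = []
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            hits.append((i, j))
    return hits


QUERY = "ATCGCCATGCTTAATTGGGCTT"   # from the slide
SUBJECT = "GGACATGCTTAATTCA"       # hypothetical database sequence
print(exact_seeds(QUERY, SUBJECT))
```

Each reported pair is a seed from which an extension would be started in both directions.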
An Alignment That BLAST Can’t Find…
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Hypera postica cysteine proteinase mRNA vs Boophilus microplus cathepsin L-like proteinase precursor
Reason: no contiguous exact match of 7 bp.
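The stated reason can be checked by measuring the longest run of identical columns along the alignment. The sketch below does this for the first 60 columns of the alignment above; the function name is illustrative.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical columns."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


# First row (columns 1-60) of the alignment shown on the slide.
TOP = "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
BOT = "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
print(longest_identical_run(TOP, BOT))
```

The longest run in this row is 6 bp (ACCACG), one short of the minimum blastn word size of 7, so no seed is found along this alignment.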
…but the corresponding amino acid sequences are conserved much better
Protein alignment looks good
…and they have the same domains, too
Local Alignment Statistics. High scores of local alignments between two random sequences follow the Extreme Value Distribution. Expect Value E = number of database hits you expect to find by chance. $E = K m n e^{-\lambda S}$ (applies to ungapped alignments), or equivalently $E = m n 2^{-S'}$, where $m$ = length of the query, $n$ = size of database, $K$ = scale for search space, $\lambda$ = scale for scoring system, $S$ = raw score, and $S' = (\lambda S - \ln K)/\ln 2$ = bit score. [Figure: number of alignments plotted against score, marking your score and the expected number of random hits]
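The two forms of the Expect formula can be checked numerically. In the sketch below, λ = 0.267 and K = 0.041 are assumed values (typical of gapped BLOSUM62 searches, not given on the slides) and the database size n is hypothetical; m = 450 is the length of the CSK query used later in the deck. With these assumptions, a raw score of 742 converts to the 290 bits reported for the KIN-22 hit.

```python
import math


def bit_score(raw, lam, K):
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(K)) / math.log(2)


def expect_from_raw(raw, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)"""
    return K * m * n * math.exp(-lam * raw)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041   # assumed statistical parameters
M, N = 450, 1_000_000      # query length; hypothetical database size

bits = bit_score(742, LAMBDA, K)
print(round(bits))
print(expect_from_raw(742, M, N, LAMBDA, K))
print(expect_from_bits(bits, M, N))
```

The bit score folds λ and K into the score itself, which is why E can be computed from S' without knowing the scoring system.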
Scoring Systems - Nucleotides
Identity matrix
     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1
CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
Scoring Systems - Proteins. Position Independent Matrices. PAM Matrices (Percent Accepted Mutation): derived from observation, small dataset of alignments; implicit model of evolution; all calculated from PAM1; PAM250 widely used. BLOSUM Matrices (BLOck SUbstitution Matrices): derived from observation, large dataset of highly conserved blocks; each matrix derived separately from blocks with a defined percent identity cutoff; BLOSUM62 is the default matrix for BLAST. Position Specific Score Matrices (PSSMs): PSI- and RPS-BLAST.
BLOSUM62
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
Positive for more likely substitutions; negative for less likely substitutions. Common amino acids have low weights; rare amino acids have high weights.
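The raw score of the example can be recomputed column by column. In this sketch the gap is charged -3, the same as a mismatch; that value is an assumption inferred from the slide's arithmetic (19 matches, minus 9 for two mismatches and one gap), not a statement of BLAST's actual gap costs.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Sum column scores of a pairwise alignment written with '-' for gaps."""
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += gap        # assumed gap cost, see lead-in
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


TOP = "CAGGTAGCAAGCTTGCATGTCA"
BOT = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(TOP, BOT))
```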
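The matrix can be used to score the neighborhood words listed earlier for ITV. The sketch below copies only the six BLOSUM62 entries this example needs; the dictionary and function names are illustrative.

```python
# Subset of BLOSUM62 entries copied from the matrix above.
BLOSUM62_SUBSET = {
    ("I", "I"): 4, ("T", "T"): 5, ("V", "V"): 4,
    ("I", "L"): 2, ("I", "M"): 1, ("T", "S"): 1,
}


def pair_score(x, y):
    """Look up a substitution score; the matrix is symmetric."""
    key = (x, y) if (x, y) in BLOSUM62_SUBSET else (y, x)
    return BLOSUM62_SUBSET[key]


def word_score(w1, w2):
    """Score two equal-length words position by position."""
    return sum(pair_score(x, y) for x, y in zip(w1, w2))


for neighbor in ["ITV", "LTV", "MTV", "ISV", "LSV"]:
    print(neighbor, word_score("ITV", neighbor))
```

A neighborhood word counts as a hit only if its score against the query word reaches the threshold score, so the more conservative substitutions are the ones most likely to be kept.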
Options for Advanced Blast: Protein. Example Entrez queries (proteins): all[Filter] NOT mammalia[Organism]; green plants[Organism]; srcdb refseq[Properties]. Other advanced: -W 2 (word size); -e 10000 (expect value); -v 2000 (descriptions); -b 2000 (alignments). Limit by taxon: Mus musculus[Organism]; Mammalia[Organism]; Viridiplantae[Organism]. Matrix selection: PAM30 (most stringent); BLOSUM45 (least stringent).
Options for Advanced Blasting: Nucleotide. Example Entrez queries (nucleotide): all[Filter] NOT mammalia[Organism]; green plants[Organism]; biomol mrna[Properties]; biomol genomic[Properties]. Other advanced: -W 7 (word size); -e 10000 (expect value); -v 2000 (descriptions); -b 2000 (alignments).
Homology Searches Find a homolog of human CSK in C. elegans Query = c-src tyrosine kinase (CSK) NP_004374 (450 aa) [Homo sapiens] Database = NCBI protein nr Entrez limit: Caenorhabditis elegans [ORGN] Program = BLASTP Query= >gi|4758078|ref|NP_004374.1| c-src tyrosine kinase [Homo sapiens] MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSIDEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPVKWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWHLDAAMRPSFLQLREQLEHIKTHELHL Hits to the Conserved Domain Database:
BLAST Graphical Overview SH3 SH2 tyr kinase domain
BLAST Alignments gi|7160701|emb|CAB04427.2| C. elegans KIN-22 protein (corresponding sequence F49B2.5) [Caenorhabditis elegans] gi|17508235|ref|NP_493502.1| Tyrosine kinase with SH2, SH3 and N myristoylation domains, Drosophila suppressor of pole hole homolog (57.5 kD) (kin-22) [Caenorhabditis elegans] Length = 507 Score = 290 bits (742), Expect = 1e-78 Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%) Pick one hit . . .
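The percentages in the alignment header follow from the counts. The numbers are consistent with rounding down to a whole percent; that rule is inferred from this output rather than stated on the slide.

```python
def blast_percent(count, length):
    """Whole-number percentage, rounded down (assumed from the reported values)."""
    return (100 * count) // length


print(blast_percent(170, 440))  # identities
print(blast_percent(245, 440))  # positives
print(blast_percent(21, 440))   # gaps
```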
3D Domains SH2 SH3 TyrKc In this example, 3D Domains and Conserved Domains are similar, with Tyr kinase catalytic domain (CD) composed of 2 3D domains. Yellow is the catalytic loop.
Low Complexity Filtering: Filtered vs Unfiltered
sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67) Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)
Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
           S++S  SSS+S SS +  +     ++S      +  +  S   S   S+ +    E K
Sbjct:  29 SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
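Why the serine-rich subject region counts as low complexity can be illustrated with Shannon entropy. This is not the SEG algorithm that BLAST uses for protein filtering, only a simple stand-in, and the two 20-residue segments are picked from the alignment above for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy in bits per residue; 0 means a single repeated residue."""
    counts = Counter(segment)
    total = len(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


LOW = "SSSSSESSSSSSSSSESESE"    # start of the subject segment
HIGH = "VYAHQMVRTDSREQKLDAFL"   # part of the query segment
print(round(shannon_entropy(LOW), 2))
print(round(shannon_entropy(HIGH), 2))
```

Segments with few distinct residues score close to 0 and match many unrelated proteins by composition alone, which is the kind of hit filtering is meant to suppress.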
Intermission?