1
NCBI Molecular Biology Resources
A Field Guide, Part 2
September 30, ICGEB
2
Database Searching with Entrez
Using limits and field restriction to find human MutL homolog
Linking and neighboring with MutL
Mapping SNPs onto structure and the genome
3
Global Entrez Search
4
Document Summaries: MutL[All Fields]
5
Entrez Nucleotides: Limits & Preview/Index
Tabs
6
Entrez Nucleotides: Limits
Field Restriction: Accession; All Fields; Author Name; EC/RN Number; Feature key; Filter; Gene Name; Issue; Journal Name; Keyword; Modification Date; Organism; Page Number; Primary Accession; Properties; Protein Name; Publication Date; SeqID String; Sequence Length; Substance Name; Text Word; Title; Uid; Volume
MutL
Exclude bulk sequences
7
Entrez Nucleotides: Limits
MutL
Title == Definition
Exclude Bulk Sequences
8
Document Summaries: Limits
9
Adding Terms: Preview/Index
Accession; All Fields; Author Name; EC/RN Number; Feature key; Filter; Gene Name; Issue; Journal Name; Keyword; Modification Date; Organism; Page Number; Primary Accession; Properties; Protein Name; Publication Date; SeqID String; Sequence Length; Substance Name; Text Word; Title; Uid; Volume
10
Human MutL Search Results
11
Human MutL RefSeq GenBank Records
12
NM_000249: Links
13
Literature Links PubMed OMIM
14
NM_000249: PubMed Books
15
Books Link
16
OMIM: Human Disease Genes
Conserved Domain
17
Sequence Links Nucleotide Protein
18
NM_000249: Related Sequences
similarity
Original GenBank mRNAs
Original GenBank genomic
Genome Project BAC
19
The Tax Browser NCBI’s Taxonomy
Taxonomy Link The Tax Browser NCBI’s Taxonomy
20
Taxonomy Link
21
The Tax Browser Nucleotide Protein Structures Popset
22
Marsupial PopSets
23
Mammalian Phylogenetic Study
24
Batch Downloads
25
Batch Downloads: FASTA and GI list
26
Batch Entrez / Entrez-utilities
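A field-restricted Entrez search like the human MutL query earlier in the deck can also be expressed as an E-utilities request. The sketch below only builds an ESearch URL and does not contact NCBI. The database name `nuccore`, the exact query string, and the helper name `esearch_url` are assumptions chosen for illustration, not taken from the slides.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL.

    esearch_url is a hypothetical helper for this example.
    """
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Assumed query: MutL restricted to the Title field, limited to human.
url = esearch_url("nuccore", "MutL[Title] AND Homo sapiens[Organism]")
print(url)
```

Field tags in square brackets are percent-encoded by `urlencode`, so the query can be pasted into the URL as typed in the Entrez search box.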
27
Links Between and Within Nodes
Figure: Entrez nodes and the computed links within them: PubMed abstracts (word weight), Taxonomy (phylogeny), Genomes, nucleotide sequences (BLAST), protein sequences (BLAST), 3-D structures (VAST).
28
Text: PubMed
Sequence: BLAST
Structure: VAST
29
BLAST® Basic Local Alignment Search Tool
Why align sequences? Because it is the best way to infer structure-function relationships for unknown biomolecules.
Global vs local alignments
BLAST basics
MegaBLAST
Discontiguous MegaBLAST
30
Global vs Local Alignment
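Figure: Seq 1 and Seq 2 shown twice, once as a global alignment (spanning both sequences end to end) and once as a local alignment (covering only the matching region).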
31
Global vs Local Alignment
Seq1: WHEREISWALTERNOW (16 aa)
Seq2: HEWASHEREBUTNOWISHERE (21 aa)

Global
Seq1: 1   W--HEREISWALTERNOW 16
          W  HERE
Seq2: 1 HEWASHEREBUTNOWISHERE

Local
Seq1:  1 W--HERE 5
         W  HERE
Seq2:  3 WASHERE 9

Seq1:  1 W--HERE 5
         W  HERE
Seq2: 15 WISHERE 21
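The idea of a local alignment can be made concrete with a small Smith-Waterman scorer run on the two example strings. This is a minimal sketch, not BLAST itself: it uses a simple +1/-1 scheme with a linear gap cost of -1, all of which are assumptions for illustration, and it returns only the best score rather than the alignment.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score between strings a and b.

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


seq1 = "WHEREISWALTERNOW"
seq2 = "HEWASHEREBUTNOWISHERE"
print(smith_waterman_score(seq1, seq2))  # 4, from the exact match HERE
```

Under this scheme the ungapped match HERE (score 4) outscores the gapped W--HERE alignment (score 3), which shows how the choice of gap penalty decides whether a local alignment extends across a gap.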
32
Basic Local Alignment Search Tool
Calculates similarity for biological sequences
Finds best local alignments
Heuristic approach based on Smith-Waterman algorithm
Searches for matching “words” and then extends the hits
Uses statistical theory to determine if a match might have occurred by chance
33
Align program (Lipman and Pearson)
Global Alignment

H:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
           +A DL F K D+L I+ T+ W GR G IP+NYV PW+
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
           GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
           L+ HY +ADGLC L P Y W L++ IG G+FG+V G + N VA
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
           VK +K A FLAEA +M +LRH L+ L V IVTE M + +L+ +L+ RGR
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
           L++ S V M YLE NF+HRDLAARN+L K++DFGL KE TG + P+KWTA
Worm:      LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
           PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
           D RP+F L+ +LE +
Worm:  472 SDPDKRPTFETLQWKLEDL 492

human M SAIQ AAWPSGT ECIAKYNFHG
      M S AA SG A ... .
worm  MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA

Global alignments force a full-length comparison. In this example, the important domains are picked up by both methods, but looking at the first 15 a.a. of the query (60 a.a. of worm) shows how forcing an alignment in this region is not very helpful.

human REQLEHI KTHELHL
      . .:: : ...
worm  QWKLEDLFNLDSSEYKEASINF 500
34
How BLAST Works
Make a lookup table of all “words” in the query
Scan the database for matching words
Initiate extensions from these matches
35
Words
Query: GTQITVEDLFYNIATRRKALKN
Word Size = 3
Word size is adjustable: 2 or 3 for protein (3 default); ≥ 7 for blastn (11 default)
Make a lookup table of words: GTQ TQI QIT ITV TVE VED EDL DLF LFY …
Neighborhood Words: LTV, MTV, ISV, LSV, etc.
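The word lookup table can be sketched as a dictionary that maps each overlapping word of the query to the positions where it starts. This is a minimal illustration for exact words only; it does not generate the scored neighborhood words that protein BLAST adds, and the function name `build_word_table` is hypothetical.

```python
def build_word_table(query, word_size=3):
    """Map each overlapping word of length word_size to its 0-based start positions."""
    table = {}
    for pos in range(len(query) - word_size + 1):
        table.setdefault(query[pos:pos + word_size], []).append(pos)
    return table


query = "GTQITVEDLFYNIATRRKALKN"
table = build_word_table(query)
print(list(table)[:9])
# ['GTQ', 'TQI', 'QIT', 'ITV', 'TVE', 'VED', 'EDL', 'DLF', 'LFY']
```

A query of length 22 yields 22 - 3 + 1 = 20 words, and a dictionary keeps the lookup during the database scan at constant time per word.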
36
Scan Database…Initiate Extensions
Protein BLAST requires two hits
GTQITVEDLFYNI
TVE, FFN: two neighborhood words (threshold score)
Nucleotide BLAST requires exact matches
ATCGCCATGCTTAATTGGGCTT
CATGCTTAATT: exact word match
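The exact-match requirement for nucleotide searches can be sketched as a scan of a subject sequence against the query's word table. The example reuses the 11-base word from the slide; which string plays the role of query and which the database sequence is an assumption here, and `find_seed_hits` is a hypothetical helper rather than any BLAST API.

```python
def find_seed_hits(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an exact word match starts."""
    words = {}
    for q in range(len(query) - word_size + 1):
        words.setdefault(query[q:q + word_size], []).append(q)
    hits = []
    for s in range(len(subject) - word_size + 1):
        for q in words.get(subject[s:s + word_size], []):
            hits.append((q, s))
    return hits


word = "CATGCTTAATT"                 # treated as the query (assumption)
subject = "ATCGCCATGCTTAATTGGGCTT"   # treated as the database sequence
print(find_seed_hits(word, subject))  # [(0, 5)]
```

Each hit is a candidate starting point for extension; a single base change inside the 11-base word would remove the hit entirely.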
37
An Alignment That BLAST Can’t Find…
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

Hypera postica cysteine proteinase mRNA vs Boophilus microplus cathepsin L-like proteinase precursor
Reason: no contiguous exact match of 7 bp.
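The stated reason can be checked directly by measuring the longest run of identical bases along the alignment shown. This sketch only inspects the ungapped diagonal from the slide; it does not search other offsets between the two sequences. The variable names are hypothetical.

```python
def longest_identical_run(a, b):
    """Length of the longest stretch of consecutive identical aligned positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


# The two aligned sequences from the slide, rows concatenated.
hypera = (
    "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
    "GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT"
    "GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC"
)
boophilus = (
    "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"
    "GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT"
    "GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC"
)

print(longest_identical_run(hypera, boophilus))  # 6
```

The longest run is 6 bases, one short of the 7-base minimum word size, so no seed is available to start an extension on this diagonal.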
38
…but the corresponding amino acid sequences are conserved much better
39
Protein alignment looks good
40
…and they have the same domains, too
41
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
Expect Value: E = number of database hits you expect to find by chance (applies to ungapped alignments)
$E = K m n e^{-\lambda S}$
$E = m n \, 2^{-S'}$
E = expected number of random hits; n = size of database; S = your score
K = scale for search space
$\lambda$ = scale for scoring system
$S'$ = bit score = $(\lambda S - \ln K)/\ln 2$
Figure: number of alignments plotted against score.
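The relation between raw score, bit score, and Expect value can be checked numerically. This is a minimal sketch assuming the standard Karlin-Altschul forms E = K·m·n·e^(-λS) and S' = (λS - ln K)/ln 2. The values of λ and K and the sizes m and n below are illustrative assumptions, not values from the slides.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw score S to a bit score S' = (lam*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score, m, n, lam, k):
    """E = K * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits, m, n):
    """E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041      # assumed scoring-system parameters
M, N = 450, 1_000_000         # hypothetical query length and database size
RAW = 742                     # example raw score

bits = bit_score(RAW, LAMBDA, K)
print(round(bits))                                # 290
print(expect_from_raw(RAW, M, N, LAMBDA, K))
print(expect_from_bits(bits, M, N))
```

Both Expect formulas give the same number because the bit score absorbs λ and K, which is why bit scores from searches with different scoring systems can be compared directly.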
42
Scoring Systems - Nucleotides
Identity matrix

     A    G    C    T
A   +1   -3   -3   -3
G   -3   +1   -3   -3
C   -3   -3   +1   -3
T   -3   -3   -3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA

raw score = 19 - 9 = 10
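The raw score on this slide can be reproduced by walking the aligned columns. The match and mismatch values come from the identity matrix; scoring the single gap column at -3 is an assumption, chosen because it is the value that reproduces the slide's total penalty of 9.

```python
MATCH, MISMATCH = 1, -3   # from the identity matrix
GAP = -3                  # assumption: gap column scored like a mismatch


def raw_score(top, bottom):
    """Sum column scores for two aligned strings of equal length."""
    score = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            score += GAP
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    return score


top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))  # 10
```

The alignment has 19 matches, 2 mismatches, and 1 gap column, giving 19 - 6 - 3 = 10.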
43
Scoring Systems - Proteins
Position Independent Matrices
  PAM Matrices (Percent Accepted Mutation)
    Derived from observation; small dataset of alignments
    Implicit model of evolution
    All calculated from PAM1
    PAM250 widely used
  BLOSUM Matrices (BLOck SUbstitution Matrices)
    Derived from observation; large dataset of highly conserved blocks
    Each matrix derived separately from blocks with a defined percent identity cutoff
    BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
  PSI- and RPS-BLAST
44
BLOSUM62 Common amino acids have low weights
Figure: BLOSUM62 matrix, with rows and columns A R N D C Q E G H I L K M F P S T W Y V X (surviving entries: row R begins -1, 5).
Positive for more likely substitutions
Negative for less likely substitutions
Common amino acids have low weights
Rare amino acids have high weights
45
Options for Advanced Blast: Protein
Example Entrez queries
  proteins all[Filter] NOT mammalia[Organism]
  green plants[Organism]
  srcdb refseq[Properties]
Other advanced
  -W 2 word size
  -e expect value
  -v descriptions
  -b alignments
Limit by taxon
  Mus musculus[Organism]
  Mammalia[Organism]
  Viridiplantae[Organism]
Matrix Selection
  PAM30 -- most stringent
  BLOSUM45 -- least stringent
46
Options for Advanced Blasting: Nucleotide
Example Entrez Queries
  nucleotide all[Filter] NOT mammalia[Organism]
  green plants[Organism]
  biomol mrna[Properties]
  biomol genomic[Properties]
Other advanced
  -W 7 word size
  -e expect value
  -v descriptions
  -b alignments
47
Homology Searches Find a homolog of human CSK in C. elegans
Query = c-src tyrosine kinase (CSK) NP_ (450 aa) [Homo sapiens]
Database = NCBI protein nr
Entrez limit: Caenorhabditis elegans [ORGN]
Program = BLASTP

Query= >gi| |ref|NP_ | c-src tyrosine kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSIDEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPVKWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWHLDAAMRPSFLQLREQLEHIKTHELHL

Hits to the Conserved Domain Database:
48
BLAST Graphical Overview
SH3 SH2 tyr kinase domain
49
BLAST Alignments
gi| |emb|CAB | C. elegans KIN-22 protein (corresponding sequence F49B2.5) [Caenorhabditis elegans]
gi| |ref|NP_ | Tyrosine kinase with SH2, SH3 and N myristoylation domains, Drosophila suppressor of pole hole homolog (57.5 kD) (kin-22) [Caenorhabditis elegans]
Length = 507
Score = 290 bits (742), Expect = 1e-78
Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)
Pick one hit . . .
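The percentages in the alignment header follow from the counts. The figures on this slide are consistent with truncating to a whole percent rather than rounding (170/440 is about 38.6%, reported as 38%). The sketch below assumes truncation on that basis; `blast_percent` is a hypothetical helper name.

```python
def blast_percent(count, length):
    """Whole-number percentage, truncated (assumed from the reported values)."""
    return count * 100 // length


alignment_length = 440
print(blast_percent(170, alignment_length))  # identities: 38
print(blast_percent(245, alignment_length))  # positives: 55
print(blast_percent(21, alignment_length))   # gaps: 4
```

Positives count identical residues plus substitutions that score above zero in the matrix, so that figure is always at least as large as the identities.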
50
3D Domains
SH2 SH3 TyrKc
In this example, 3D Domains and Conserved Domains are similar, with the Tyr kinase catalytic domain (CD) composed of two 3D domains. Yellow is the catalytic loop.
51
Low Complexity Filtering
Filtered / Unfiltered
sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67)
Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)
Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
           S++S SSS+S SS S S S S S E K
Sbjct: 29  SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
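Why a serine-rich stretch like this produces spurious hits can be illustrated with a simple composition measure. The sketch below masks windows whose Shannon entropy falls under a cutoff. It is not the SEG program that NCBI BLAST uses for protein filtering; the window size and threshold are arbitrary assumptions for illustration.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Entropy in bits of the residue composition of segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace residues in every low-entropy window with X (assumed parameters)."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = "X" * window
    return "".join(masked)


# Subject segment from the slide (gap-free as shown).
sbjct = "SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS"
print(mask_low_complexity(sbjct))
```

Masked residues cannot seed word hits, so matches that rest only on biased composition drop out of the results.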
52
Intermission?