NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 September 30, 2004 ICGEB.

NCBI FieldGuide Links Between and Within Nodes [Diagram: the Entrez nodes (PubMed abstracts, Nucleotide sequences, Protein sequences, 3-D Structures, Genomes, Taxonomy) and the computational links within them: Word weight, BLAST, VAST, Phylogeny.]

NCBI FieldGuide BLAST VAST Pubmed Text Sequence Structure

NCBI FieldGuide PubMed: Computation of Related Articles The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths. The value of a term depends on both global and local information: 1) the number of different documents in the database that contain the term; 2) the number of times the term occurs in a particular document.

NCBI FieldGuide Global and local weights The global weight of a term is greater for the less frequent terms. The presence of a term that occurred in most of the documents would really tell one very little about a document. On the other hand, a term that occurred in only 100 documents of one million would be very helpful in limiting the set of documents of interest. The local weight of a term is the measure of its importance in a particular document. Generally, the more frequent a term is within a document, the more important it is in representing the content of that document. However, this relationship is saturating, i.e., as the frequency continues to go up, the importance of the word increases less rapidly and finally comes to a finite limit.

NCBI FieldGuide How we define similar documents The similarity between two documents is computed by adding up the weights (local wt1 × local wt2 × global wt) of all of the terms the two documents have in common. This provides an indication of how related two documents are. Once the similarity score of a document in relation to each of the other documents in the database has been computed, that document's neighbors are identified as the most similar (highest scoring) documents found. These closely related documents are pre-computed for each document in PubMed.
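The scoring described above can be sketched in a few lines of Python. The slides give no formulas, so both weight functions here are illustrative assumptions rather than PubMed's actual ones: the global weight is taken as log(N / document frequency) and the saturating local weight as tf / (tf + k). All function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter

def global_weight(term, docs):
    """Assumed form: rarer terms weigh more, log(N / documents containing term)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def local_weight(tf, k=1.0):
    """Assumed saturating form: grows with term frequency but stays below 1."""
    return tf / (tf + k)

def similarity(doc_a, doc_b, docs):
    """Sum local wt1 * local wt2 * global wt over the terms both documents share."""
    shared = doc_a.keys() & doc_b.keys()
    return sum(
        local_weight(doc_a[t]) * local_weight(doc_b[t]) * global_weight(t, docs)
        for t in shared
    )

def neighbors(index, docs, top=2):
    """Indices of the highest-scoring other documents, best first."""
    scores = [
        (similarity(docs[index], other, docs), j)
        for j, other in enumerate(docs)
        if j != index
    ]
    return [j for _, j in sorted(scores, reverse=True)[:top]]

# Toy "database": each document is a bag of words (term -> count).
texts = [
    "tyrosine kinase domain kinase",
    "kinase domain structure",
    "plant genome assembly",
    "tyrosine kinase kinase signalling",
]
docs = [Counter(t.split()) for t in texts]
print(neighbors(0, docs))
```

In a real system the neighbor lists would be computed once and stored, which is what "pre-computed" means on the slide.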

NCBI FieldGuide Related articles: difficult task

NCBI FieldGuide E-utilities: Top Level of Entrez

NCBI FieldGuide E-utilities course

NCBI FieldGuide E-utilities A set of seven server-side programs that support a uniform URL syntax. They translate a standard set of URL-encoded input parameters for the array of programs comprising the Entrez system.

NCBI FieldGuide Entrez Functions and E-utilities Searches: esearch.fcgi DocSums: esummary.fcgi Links: elink.fcgi Uploads: epost.fcgi Downloads: efetch.fcgi Global Query: egquery.fcgi Information: einfo.fcgi
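As a rough illustration of the uniform URL syntax, the sketch below (in Python rather than the deck's Perl) builds a request URL for any of the seven programs. The base URL is the one used by the pipeline script in these slides; the `eutils_url` helper and the dictionary keys are hypothetical names chosen for this example.

```python
from urllib.parse import urlencode

# Base URL as it appears in the pipeline script in these slides.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Entrez function -> E-utility program, as listed on the slide.
PROGRAMS = {
    "search": "esearch.fcgi",
    "docsum": "esummary.fcgi",
    "link": "elink.fcgi",
    "upload": "epost.fcgi",
    "download": "efetch.fcgi",
    "global query": "egquery.fcgi",
    "information": "einfo.fcgi",
}

def eutils_url(function, **params):
    """Return the URL for one E-utility call with URL-encoded parameters."""
    return EUTILS_BASE + PROGRAMS[function] + "?" + urlencode(params)

url = eutils_url("search", db="homologene", term="CSK[gene name] AND human[orgn]")
print(url)
```

Only the URL is built here; nothing is sent to NCBI.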

NCBI FieldGuide A Docsum via esummary.fcgi and via the Web

NCBI FieldGuide A Simple Eutilities Pipeline

NCBI FieldGuide Search for upstream regions of homologous genes

#!/usr/local/bin/perl
# Gene names are read, one per line, from the file named on the command line.
use strict;
use warnings;
use LWP::Simple;    # LWP::Simple's get() downloads the content of a URL

# Base URL; each step appends a program name and its parameters
my $ebase = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

# NOTE: the XML tags inside the regular expressions were lost when the slide
# was extracted; the tags used below are a best-guess reconstruction.
while (my $gene = <>) {
    chomp $gene;
    my $term = $gene . "[gene+name]+AND+human[orgn]";    # human genes only

    # 1. Search in HomoloGene by gene name
    my $url    = $ebase . "esearch.fcgi?db=homologene&term=$term";
    my $result = get($url) or next;
    # Parse the UIDs out of the Id lines and join them with commas
    my $id = join ",", $result =~ /<Id>(\d+)<\/Id>/sg;

    # 2. Link HomoloGene -> Nucleotide (gi's of the homologous NM_ records)
    $url    = $ebase . "elink.fcgi?db=nucleotide&id=$id&dbfrom=homologene";
    $result = get($url) or next;
    $id     = join ",", $result =~ /<Link>[^<]*<Id>(\d+)<\/Id>/sg;

    # 3. Link Nucleotide -> Gene

NCBI FieldGuide Lots of precomputed data and a little bit of parsing

    # (step 3, continued) link to Entrez Gene to reach the genomic coordinates
    $url    = $ebase . "elink.fcgi?db=gene&id=$id&dbfrom=nucleotide";
    $result = get($url) or next;
    my @ids = $result =~ /<Link>[^<]*<Id>(\d+)<\/Id>/sg;

    foreach my $gene_id (@ids) {
        # 4. Fetch the XML gene report with the genomic sequence and coordinates
        $url    = $ebase . "efetch.fcgi?db=gene&id=$gene_id&retmode=xml";
        $result = get($url) or next;
        my ($gi)     = $result =~ /<Entrezgene_locus>.+?<Seq-id_gi>(\d+)/s or next;
        my ($from)   = $result =~ /<Seq-interval_from>(\d+)/ or next;
        my ($to)     = $result =~ /<Seq-interval_to>(\d+)/ or next;
        my ($strand) = $result =~ /<Na-strand value="(\w+)"/;
        $strand = (defined $strand && $strand eq "minus") ? 2 : 1;

        # Take the 1000 bases upstream of the gene start
        if ($strand == 1) { $to = $from; $from -= 1000; }
        else              { $from = $to; $to += 1000; }

        # 5. Fetch upstream sequence from Nucleotide as FASTA
        $url = $ebase . "efetch.fcgi?db=nucleotide&id=$gi&retmode=text&rettype=fasta"
             . "&seq_start=$from&seq_stop=$to&strand=$strand";
        $result = get($url) or next;
        $result =~ s/>ref/>lcl|$gene|/;    # label the FASTA header with the gene name
        print $result;
    }
}

NCBI FieldGuide A General Design Approach Know what you want before you begin –Do I need the full record? (EFetch) –Will a DocSum be sufficient? (ESummary) Know what Entrez database contains the data you want –If it’s not in Entrez, the eUtils can’t access it Try your pipeline in interactive web Entrez first –Some Entrez queries may surprise you –Some Entrez data may surprise you –Some Entrez links may surprise you

NCBI FieldGuide Others use E-utilities too: PubCrawler

NCBI FieldGuide MedBlast: searching for articles related to a sequence.

NCBI FieldGuide Why Regulate? Fairness issue. Gate is only so wide. Scripts use the resources of many to satisfy a few.

NCBI FieldGuide Scripts are like “fat” bunnies!!!

NCBI FieldGuide Web Servers and Browsers Your browser makes one connection. Each server has a finite number of slots. A slot is allotted to a connection first come, first served. Connections are (typically) not persistent. Scripts use more slots, and approach a “persistent” connection.

NCBI FieldGuide Normal Use

NCBI FieldGuide Scripting

NCBI FieldGuide Detection Web logs are monitored by a script. Alarm e-mails are sent hourly, plus a summary once a day. Analysis: copyright versus volume. Not automatic! Blocking occurs. –Copyrighted material can be very light in volume. –BLAST is “sensitive” and can also be light in volume. –Entrez and PubMed are mostly a “fairness” issue.

NCBI FieldGuide How you are blocked. The IP address is blacklisted from the main NCBI web servers. You get a very obvious error message. Remember Spock: “The needs of the many outweigh the needs of the few.”

NCBI FieldGuide How to avoid blockage. Plan your project. Can I use other methods? –FTP –Batch Entrez Write good scripts. –Expect errors –Multiple UIDs Follow the E-utils recommendations. Ask us for advice.

NCBI FieldGuide Recommendations. Use ‘eutils.ncbi.nlm.nih.gov’. Use the &tool and &email fields. Do not submit more than once every 3 seconds. Limit to 9 PM – 5 AM EST (our time).
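A script can enforce these recommendations itself. The sketch below (Python, not the deck's Perl) tags every URL with `tool` and `email`, spaces requests at least 3 seconds apart, and checks the 9 PM to 5 AM window. The class and function names are hypothetical, and the clock and sleep functions are injectable only so the throttling can be exercised without real waiting.

```python
import time
from urllib.parse import urlencode

EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def is_off_peak(hour_est):
    """True between 9 PM and 5 AM; hour is 0-23 in NCBI's local time."""
    return hour_est >= 21 or hour_est < 5

class PoliteClient:
    """Hypothetical helper: identifies the script and throttles its requests."""

    def __init__(self, tool, email, min_interval=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.tool, self.email = tool, email
        self.min_interval = min_interval
        self.clock, self.sleep = clock, sleep
        self.last_request = None

    def build_url(self, program, **params):
        """Add the tool and email fields to every request URL."""
        params.update(tool=self.tool, email=self.email)
        return EUTILS_BASE + program + "?" + urlencode(params)

    def wait_turn(self):
        """Block until min_interval seconds have passed since the last call."""
        if self.last_request is not None:
            remaining = self.min_interval - (self.clock() - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request = self.clock()
```

Calling `wait_turn()` immediately before each download keeps the script at or below one request every 3 seconds, however fast the surrounding loop runs.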

NCBI FieldGuide BLAST VAST Pubmed Text Sequence Structure

NCBI FieldGuide BLAST® Basic Local Alignment Search Tool Why align sequences? Because it is the best way to infer structure-function relationships for unknown biomolecules. Topics: Global vs local alignments; BLAST basics; MegaBLAST; Discontiguous MegaBLAST

NCBI FieldGuide Basic Local Alignment Search Tool  Calculates similarity for biological sequences  Finds best local alignments  Heuristic approach based on Smith-Waterman algorithm  Searches for matching “words” and then extends the hits  Uses statistical theory to determine if a match might have occurred by chance

NCBI FieldGuide Global Alignment

Local alignment of the human and worm sequences (residue numbers flank each row; the match lines did not survive extraction and are omitted):

H:  15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84
Worm:   63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125

H:  85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151
Worm:  126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194

H: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220
Worm:  195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264

H: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289
Worm:  265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332

H: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353
Worm:  333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401

H: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423
Worm:  402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471

H: 424 LDAAMRPSFLQLREQLEHI 443
Worm:  472 SDPDKRPTFETLQWKLEDL 492

Global alignment of the same two sequences, first and last rows only (Align program, Lipman and Pearson):

human  M--------------SAIQ----------------------AAWPSGT------------ECIAKYNFHG
worm   MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA

human  REQLEHI--------KTHELHL
worm   QWKLEDLFNLDSSEYKEASINF

NCBI FieldGuide How BLAST Works  Make a lookup table of all “words” in the query  Scan the database for matching words  Initiate extensions from these matches

NCBI FieldGuide Words Query: GTQITVEDLFYNIATRRKALKN, word size = 3. Word size is adjustable: 2 or 3 for protein (default 3); 7 or more for blastn (default 11). Make a lookup table of words: GTQ, TQI, QIT, ITV, TVE, VED, EDL, DLF, LFY, … Neighborhood words: LTV, MTV, ISV, LSV, etc.

NCBI FieldGuide Scan Database…Initiate Extensions Protein BLAST requires two hits GTQITVEDLFYNI two neighborhood words (threshold score) Nucleotide BLAST requires exact matches exact word match ATCGCCATGCTTAATTGGGCTT
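The first two steps of the search (build a lookup table of query words, then scan for matches) can be sketched as follows. This is a simplified Python illustration, not BLAST's implementation: it finds exact word matches only, so it corresponds to the nucleotide case, and it omits neighborhood words, the two-hit rule, and the extension step. The subject sequence is made up for the example.

```python
from collections import defaultdict

def build_lookup(query, w):
    """Map every word of length w in the query to its start positions."""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i + w]].append(i)
    return table

def scan(subject, table, w):
    """Return (query_pos, subject_pos) for every exact word match."""
    hits = []
    for j in range(len(subject) - w + 1):
        for i in table.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits

# Protein query from the slide, word size 3.
protein_query = "GTQITVEDLFYNIATRRKALKN"
print(list(build_lookup(protein_query, 3))[:9])

# Nucleotide query from the slide, word size 11; the subject is hypothetical:
# a 15-base piece of the query with two unrelated bases on each side.
nt_query = "ATCGCCATGCTTAATTGGGCTT"
subject = "GG" + nt_query[3:18] + "AA"
hits = scan(subject, build_lookup(nt_query, 11), 11)
print(hits)
```

All hits from one matching region share the same diagonal (subject position minus query position), which is what lets a search program merge them into a single seed before extending.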

NCBI FieldGuide An Alignment That BLAST Can’t Find…

1   GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
1   GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

61  GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
61  GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
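One way to see why this alignment is invisible to a word-based search is to measure the longest run of identical bases. The Python sketch below does that for the first 60 columns of the slide's alignment. It checks only the diagonal shown, and the helper name is hypothetical.

```python
def longest_identical_run(top, bottom):
    """Length of the longest stretch of consecutive identical columns."""
    best = run = 0
    for a, b in zip(top, bottom):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best

# First row of the alignment on the slide (columns 1-60).
row_top = "GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG"
row_bottom = "GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG"

print(longest_identical_run(row_top, row_bottom))  # 6
```

A run of 6 is shorter than the smallest nucleotide word size given in these slides, so no exact word match exists to start an extension from, even though many columns match.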

NCBI FieldGuide …but the corresponding amino acid sequences are conserved much better

NCBI FieldGuide Protein alignment looks good

NCBI FieldGuide …and they have the same domains, too

NCBI FieldGuide Local Alignment Statistics High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments). [Plot: number of alignments versus score.] Expect Value: E = number of database hits you expect to find by chance. $E = Kmn\,e^{-\lambda S}$, or equivalently $E = mn\,2^{-S'}$, where K = scale for search space, λ = scale for scoring system, and S' = bit score = $(\lambda S - \ln K)/\ln 2$. Annotations on the second form: size of database (mn), your score (S'), expected number of random hits (E).
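The two forms of the Expect value can be checked numerically. In this Python sketch the values λ = 0.267 and K = 0.041 are assumptions (the figures commonly quoted for BLOSUM62 with gap costs 11/1), not numbers taken from these slides; the raw score 742 and the 450-residue query come from the CSK example later in the deck, and the database length is a made-up round number.

```python
import math

def bit_score(raw, lam, k):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw - math.log(k)) / math.log(2)

def expect_from_raw(raw, m, n, lam, k):
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw)

def expect_from_bits(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)

LAMBDA, K = 0.267, 0.041   # assumed statistical parameters (see lead-in)
m = 450                    # query length (human CSK, from the slides)
n = 1_000_000              # hypothetical database length in residues

bits = bit_score(742, LAMBDA, K)
print(round(bits))         # bit score for a raw score of 742
print(expect_from_raw(742, m, n, LAMBDA, K))
print(expect_from_bits(bits, m, n))
```

The bit score folds λ and K into the score itself, which is why bit scores from searches with different scoring systems can be compared directly while raw scores cannot.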

NCBI FieldGuide Scoring Systems - Nucleotides

Identity matrix:

     A   G   C   T
A   +1  -3  -3  -3
G   -3  +1  -3  -3
C   -3  -3  +1  -3
T   -3  -3  -3  +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA

raw score = 19-9 = 10
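The raw score on this slide can be reproduced with a short Python function. Treating the single gap column as costing 3, the same as a mismatch, is an assumption made here because it reproduces the slide's 19 - 9; real searches usually charge gaps separately.

```python
def raw_score(top, bottom, match=1, mismatch=-3, gap=-3):
    """Score an alignment column by column with the +1/-3 identity matrix."""
    score = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap        # assumed gap cost (see lead-in)
        elif a == b:
            score += match
        else:
            score += mismatch
    return score

top = "CAGGTAGCAAGCTTGCATGTCA"
bottom = "CACGTAGCAAGCTTG-GTGTCA"
print(raw_score(top, bottom))  # 10: 19 matches, 2 mismatches, 1 gap
```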

NCBI FieldGuide Scoring Systems - Proteins Position Independent Matrices PAM Matrices (Percent Accepted Mutation) Derived from observation; small dataset of alignments Implicit model of evolution All calculated from PAM1 PAM250 widely used BLOSUM Matrices (BLOck SUbstitution Matrices) Derived from observation; large dataset of highly conserved blocks Each matrix derived separately from blocks with a defined percent identity cutoff BLOSUM62 - default matrix for BLAST Position Specific Score Matrices (PSSMs) PSI- and RPS-BLAST

NCBI FieldGuide BLOSUM62

A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X  0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X

Common amino acids have low weights. Rare amino acids have high weights. Negative for less likely substitutions. Positive for more likely substitutions.

NCBI FieldGuide Options for Advanced Blast: Protein Matrix Selection: PAM30 -- most stringent; BLOSUM45 -- least stringent. Example Entrez queries (proteins): all[Filter] NOT mammalia[Organism]; green plants[Organism]; srcdb refseq[Properties]. Other advanced: -W 2 word size; -e 10000 expect value; -v 2000 descriptions; -b 2000 alignments. Limit by taxon: Mus musculus[Organism]; Mammalia[Organism]; Viridiplantae[Organism]

NCBI FieldGuide Options for Advanced Blasting: Nucleotide Example Entrez queries (nucleotide): all[Filter] NOT mammalia[Organism]; green plants[Organism]; biomol mrna[Properties]; biomol genomic[Properties]. Other advanced: -W 7 word size; -e 10000 expect value; -v 2000 descriptions; -b 2000 alignments

NCBI FieldGuide Find a homolog of human CSK in C. elegans Query = c-src tyrosine kinase (CSK) NP_004374 (450 aa) [Homo sapiens] Database = NCBI protein nr Entrez limit: Caenorhabditis elegans [ORGN] Program = BLASTP Homology Searches Hits to the Conserved Domain Database: Query= >gi|4758078|ref|NP_004374.1| c-src tyrosine kinase [Homo sapiens] MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPA NYVQKREGVKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGK VEHYRIMYHASKLSIDEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGW ALNMKELKLLQTIGKGEFGDVMLGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQL LGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDL AARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPVKWTAPEALREKKFSTKSDVWSFGILLWE IYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWHLDAAMRPSFLQLREQL EHIKTHELHL

NCBI FieldGuide BLAST Graphical Overview SH3 SH2tyr kinase domain

NCBI FieldGuide BLAST Alignments gi|7160701|emb|CAB04427.2| C. elegans KIN-22 protein (corresponding sequence F49B2.5) [Caenorhabditis elegans] gi|17508235|ref|NP_493502.1| Tyrosine kinase with SH2, SH3 and N myristoylation domains, Drosophila suppressor of pole hole homolog (57.5 kD) (kin-22) [Caenorhabditis elegans] Length = 507 Score = 290 bits (742), Expect = 1e-78 Identities = 170/440 (38%), Positives = 245/440 (55%), Gaps = 21/440 (4%)

NCBI FieldGuide Local (BLAST) Alignment: Domains Catalytic loop

NCBI FieldGuide 3D Domains TyrKc SH3 SH2

NCBI FieldGuide Low Complexity Filtering (Filtered vs. Unfiltered)

sp|P27476|NSR1_YEAST NUCLEAR LOCALIZATION SEQUENCE BINDING PROTEIN (P67) Length = 414
Score = 40.2 bits (92), Expect = 0.013
Identities = 35/131 (26%), Positives = 56/131 (42%), Gaps = 4/131 (3%)

(match line omitted; its spacing did not survive extraction)
Query: 362 STTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDAFLQPLSKPLS---SQPQAIVTEDKTD 418
Sbjct: 29  SSSSSESSSSSSSSSESESESESESESSSSSSSSDSESSSSSSSDSESEAETKKEESKDS 88
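The idea behind low-complexity filtering can be illustrated with a sliding-window entropy measure. This Python sketch is a simplified stand-in, not the filter BLAST actually runs: the window length, the threshold, and the function names are all chosen for illustration.

```python
import math
from collections import Counter

def shannon_entropy(segment):
    """Entropy in bits per residue; 0 for a run of one repeated letter."""
    n = len(segment)
    return -sum(c / n * math.log2(c / n) for c in Counter(segment).values())

def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every window whose entropy is below threshold with X's."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            masked[i:i + window] = "X" * window
    return "".join(masked)

# Start of the serine-rich subject sequence shown on the slide.
subject = "SSSSSESSSSSSSSSESESESESESESS"
print(mask_low_complexity(subject))
```

Masked residues cannot seed or score in an alignment, which is how filtering suppresses hits that owe their score only to a shared composition such as long serine runs.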

NCBI FieldGuide PSI-BLAST Position-Specific Iterated BLAST Mining for protein domains Confirming relationships among related proteins

NCBI FieldGuide Position Specific Substitution Rates Active site serine Weakly conserved serine

NCBI FieldGuide Position Specific Score Matrix (PSSM)

        A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
206 D   0  -2   0   2  -4   2   4  -4  -3  -5  -4   0  -2  -6   1   0  -1  -6  -4  -1
207 G  -2  -1   0  -2  -4  -3  -3   6  -4  -5  -5   0  -2  -3  -2  -2  -1   0  -6  -5
208 V  -1   1  -3  -3  -5  -1  -2   6  -1  -4  -5   1  -5  -6  -4   0  -2  -6  -4  -2
209 I  -3   3  -3  -4  -6   0  -1  -4  -1   2  -4   6  -2  -5  -5  -3   0  -1  -4   0
210 D  -2  -5   0   8  -5  -3  -2  -1  -4  -7  -6  -4  -6  -7  -5   1  -3  -7  -5  -6
211 S   4  -4  -4  -4  -4  -1  -4  -2  -3  -3  -5  -4  -4  -5  -1   4   3  -6  -5  -3
212 C  -4  -7  -6  -7  12  -7  -7  -5  -6  -5  -5  -7  -5   0  -7  -4  -4  -5   0  -4
213 N  -2   0   2  -1  -6   7   0  -2   0  -6  -4   2   0  -2  -5  -1  -3  -3  -4  -3
214 G  -2  -3  -3  -4  -4  -4  -5   7  -4  -7  -7  -5  -4  -4  -6  -3  -5  -6  -6  -6
215 D  -5  -5  -2   9  -7  -4  -1  -5  -5  -7  -7  -4  -7  -7  -5  -4  -4  -8  -7  -7
216 S  -2  -4  -2  -4  -4  -3  -3  -3  -4  -6  -6  -3  -5  -6  -4   7  -2  -6  -5  -5
217 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -8  -7  -5  -6  -7  -6  -4  -5  -6  -7  -7
218 G  -3  -6  -4  -5  -6  -5  -6   8  -6  -7  -7  -5  -6  -7  -6  -2  -4  -6  -7  -7
219 P  -2  -6  -6  -5  -6  -5  -5  -6  -6  -6  -7  -4  -6  -7   9  -4  -4  -7  -7  -6
220 L  -4  -6  -7  -7  -5  -5  -6  -7   0  -1   6  -6   1   0  -6  -6  -5  -5  -4   0
221 N  -1  -6   0  -6  -4  -4  -6  -6  -1   3   0  -5   4  -3  -6  -2  -1  -6  -1   6
222 C   0  -4  -5  -5  10  -2  -5  -5   1  -1  -1  -5   0  -1  -4  -1   0  -5   0   0
223 Q   0   1   4   2  -5   2   0   0   0  -4  -2   1   0   0   0  -1  -1  -3  -3  -4
224 A  -1  -1   1   3  -4  -1   1   4  -3  -4  -3  -1  -2  -2  -3   0  -2  -2  -2  -3

Active site nucleophile. Serine scored differently in these two positions.
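Unlike a fixed substitution matrix, a PSSM is looked up by position as well as by residue. The Python sketch below loads three rows copied from the matrix on this slide and shows serine and threonine scoring differently at positions 211 and 216; the dictionary layout and the `score` helper are just one convenient, hypothetical representation.

```python
COLUMNS = "A R N D C Q E G H I L K M F P S T W Y V".split()

# Rows 211, 216 and 217 as they appear in the slide's matrix.
PSSM_ROWS = {
    211: "4 -4 -4 -4 -4 -1 -4 -2 -3 -3 -5 -4 -4 -5 -1 4 3 -6 -5 -3",
    216: "-2 -4 -2 -4 -4 -3 -3 -3 -4 -6 -6 -3 -5 -6 -4 7 -2 -6 -5 -5",
    217: "-3 -6 -4 -5 -6 -5 -6 8 -6 -8 -7 -5 -6 -7 -6 -4 -5 -6 -7 -7",
}

# position -> {residue: score}
pssm = {
    pos: dict(zip(COLUMNS, map(int, row.split())))
    for pos, row in PSSM_ROWS.items()
}

def score(position, residue):
    """Score for placing `residue` at `position` of the profile."""
    return pssm[position][residue]

print(score(211, "S"), score(216, "S"))  # 4 7
print(score(211, "T"), score(216, "T"))  # 3 -2
```

At position 211 a threonine is nearly as acceptable as serine, while at position 216 it is penalized; a single BLOSUM62 value for S/T cannot express that difference.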

NCBI FieldGuide >gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK PSI-BLAST e value cutoff for PSSM

NCBI FieldGuide RESULTS: Initial BLASTP Same results as protein-protein BLAST

NCBI FieldGuide Results of First PSSM Search Other purine nucleotide metabolizing enzymes not found by ordinary BLAST

NCBI FieldGuide Third PSSM Search: Convergence Just below threshold, another nucleotide metabolism enzyme Check to add to PSSM

NCBI FieldGuide MegaBLAST AI217550 AI251192 AI254381 BE645079 C:\seq\hs.4.fsa > 1133045 gnl|UG|Hs#S1133045 qd43b11.x1 Homo sapiens cDNA, 3' end CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTG GTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCT TTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGT GACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCG TCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAAC CACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC > 1141828 gnl|UG|Hs#S1141828 qv37f11.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > 1145899 gnl|UG|Hs#S1145899 qv33c06.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > 2291670 gnl|UG|Hs#S2291670 7e65f04.x1 Homo sapiens cDNA, 3' end TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGT TTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCT CCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAA GGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACA CCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAA AACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTC CTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAA GCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT

NCBI FieldGuide What is Discontiguous (Cross-species) MegaBLAST?

W = 11, t = 16, coding:     1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding:     1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding:     101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding:     101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding:     100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding:     100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111

Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive homology search", Bioinformatics 2002 Mar;18(3):440-5
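Each template is a mask of length t with W ones: a 1 means the base at that offset must match, a 0 means it is ignored. The Python sketch below applies a template to a window of sequence to get the discontiguous word. Three templates are copied from the slide; the function name and the example window are hypothetical.

```python
# (W, t, type) -> template, copied from the slide.
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"): "1111101101101101",
}

def discontiguous_word(window, template):
    """Keep only the bases at template positions marked 1."""
    if len(window) != len(template):
        raise ValueError("window and template must have the same length")
    return "".join(base for base, bit in zip(window, template) if bit == "1")

template = TEMPLATES[(11, 16, "coding")]
window = "ACGTACGTACGTACGT"
changed = "ACTTACGTACGTACGT"   # differs only at offset 2, a 0 in the template

print(discontiguous_word(window, template))   # ACTAGTCGACT
print(discontiguous_word(changed, template))  # ACTAGTCGACT
```

The coding templates ignore every third position, so two sequences that differ only at those positions still produce the same word and can seed an alignment.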

NCBI FieldGuide Neighbors: Precomputed BLAST Nucleotide Protein Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.

NCBI FieldGuide Blink – Protein BLAST Alignments Lists only 200 hits List is nonredundant

NCBI FieldGuide Blink – Linking Sequence to Structure Cn3D

NCBI FieldGuide BLAST: Related Structures

NCBI FieldGuide BLAST Databases: Non-redundant protein nr (non-redundant protein sequences) –GenBank CDS translations –NP_ RefSeqs –Outside Protein PIR, Swiss-Prot, PRF –PDB (sequences from structures)

NCBI FieldGuide BLAST Databases: Nucleic Acid nr (nt) –Traditional GenBank Divisions –NM_ and XM_ RefSeqs dbest –EST Division htgs –HTG division gss –GSS division chromosome –NC_ RefSeqs wgs –whole genome shotgun

NCBI FieldGuide Genomic BLAST These pages provide customized nucleotide and protein databases for each genome If a Map Viewer is available, the BLAST hits can be viewed on the maps

NCBI FieldGuide What if Your Favorite Gene is not found in the latest genome build? POSSIBLE VARIANTS: The gene does not exist; It exists, but there is a problem with assembly; It exists, but there is a problem with annotation

NCBI FieldGuide An example: finding prestin in the human genome We start with rat prestin, BLAST it against the human genome, and look for evidence that human prestin exists as well.

NCBI FieldGuide Searching the Human Genome >gi|12188917|emb|AJ303372.1|RNO303372 Rattus norvegicus ATGGATCATGCTGAAGAAAATGAAATTCCTGCAGAGATCAGAAGTACCTCGTGGAA GTCATCCGGTCCTCCAGGAGAGGCTGCACGTCAAGGACAAAGTCACAGACTCCATC GCAGGCATTCACGTGCACTCCTAAAAAAGTAAGAAACATCATCTACATGTTCTTGC TTGCCAGCATATAAATTCAAGGAGTATGTGCTGGGTGACTTGGTCTCGGGCATAAG AGCTCCCCCAAGGCTTAGCCTTCGCGATGCTGGCAGCTGTGCCTCCGGTGTTCGGC On for same species comparisons

NCBI FieldGuide BLAST Results 16 hits to one contig Human Genome Database 953 contigs 2.9 billion letters

NCBI FieldGuide Map Viewer: Genomic Context of BLAST Hits Genes Genome Scan Models Human EST hits Contig GenBank Mouse EST hits

NCBI FieldGuide Human prestin: now appears in Build 34

NCBI FieldGuide Now we can compare genes

NCBI FieldGuide Three prestin genes: finally together!

NCBI FieldGuide Same prestin, different assemblies

NCBI FieldGuide Does homology imply a common biological function? Not always; the existence of a common ancestor does not guarantee that a function was not lost or acquired after the divergence. An example: zeta-crystallin is a component of the transparent lens matrix of the vertebrate eye. Its homolog in E. coli is the metabolic enzyme quinone oxidoreductase.

NCBI FieldGuide BLAST VAST Entrez Text Sequence Structure

NCBI FieldGuide Structure similarity: No More BLASTing! Three-dimensional structure is what evolution conserves most. The existence of a common ancestor can still be detected from structural similarity. Spatial similarity is not calculated the same way as sequence similarity.

NCBI FieldGuide VAST: Structure Neighbors Vector Alignment Search Tool For each protein chain, locate SSEs (secondary structure elements), and represent them as individual vectors. 1 2 3 4 5 6 Human IL-4
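A minimal Python sketch of the vector idea: each secondary structure element is reduced to one vector, and pairs of elements can then be compared by the angle between them. The slides do not describe VAST's actual procedure, so taking the vector from the element's first to last coordinate and comparing by angle are simplifying assumptions, and the coordinates are invented.

```python
import math

def sse_vector(start, end):
    """Vector from the first to the last coordinate of one SSE."""
    return tuple(e - s for s, e in zip(start, end))

def angle_degrees(u, v):
    """Angle between two SSE vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    cosine = dot / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

# Hypothetical end-point coordinates (in angstroms) for three helices.
helix_1 = sse_vector((0.0, 0.0, 0.0), (10.0, 0.0, 0.0))
helix_2 = sse_vector((2.0, 5.0, 0.0), (2.0, 13.0, 0.0))
helix_3 = sse_vector((9.0, 1.0, 4.0), (-1.0, 1.0, 4.0))

print(angle_degrees(helix_1, helix_2))  # perpendicular pair
print(angle_degrees(helix_1, helix_3))  # antiparallel pair
```

Working with a handful of vectors per chain instead of every atom is what makes comparing a structure against a whole database tractable.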

NCBI FieldGuide VAST: Structure Neighbors

NCBI FieldGuide Structure Neighbors in Cn3D SH3SH2 C-Src kinase Human vs. Chicken

NCBI FieldGuide 3D Domain Neighbors Human C-Src Kinase (Tyr) vs. Chk1 kinase (Ser/Thr)

NCBI FieldGuide NCBI is changing From a sequence data storage facility to a one-stop shop with integrated databases of various kinds. You can be part of the future: work with us! Your expertise and data are indispensable.

NCBI FieldGuide GenBank

NCBI FieldGuide Refseq

NCBI FieldGuide Entrez Gene

NCBI FieldGuide Homologene database

NCBI FieldGuide New generation of databases: an example

NCBI FieldGuide Protein interaction database: a seed for future precomputed resources

NCBI FieldGuide New databases: GenSAT

NCBI FieldGuide PubChem

NCBI FieldGuide Headache? Take Aspirin

NCBI FieldGuide Aspirin has 432 neighbors

NCBI FieldGuide Link to 3D protein structures

NCBI FieldGuide For More Information… E-mail addresses: General Help, info@ncbi.nlm.nih.gov ; BLAST, blast-help@ncbi.nlm.nih.gov The NCBI Handbook: follow the link from the NCBI Home Page The NCBI Education Page: http://www.ncbi.nih.gov/Education/index.html The (free!) NCBI Newsletter: http://www.ncbi.nih.gov/About/newsletter.html
