Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) Institute of Biomedical Sciences, Academia Sinica
Bioinformatics Bioinformatics is the application of information technology to analyze, process, and manage biological data. It provides computational tools that facilitate the progression Data → Information → Knowledge → Discovery. Don’t believe everything you see in a database, even in GenBank! Quality control (QC) is the most important concern in bioinformatics.
Roadmap to Genomics Shotgun sequencing Full length cDNAs Functional Genomics
A Vision for the Future of Genome Research Francis S. Collins (National Human Genome Research Institute, NIH, USA) Nature 422:835 (2003)
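International Sequence Database Collaboration: GenBank (NCBI, NIH; retrieved through Entrez), EMBL (EBI; retrieved through SRS), and DDBJ (NIG, CIB; retrieved through getentry). Each database receives submissions and updates, and the three exchange records with one another.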
Lecture
Data Bases and Scientific Algorithms Integration: Bioinformatics integrates Medline (ASN.1), Entrez/NCBI (ASN.1), PDB (Oracle, 3D images), BLAST (FASTA), ClustalW (FASTA), KEGG (HTML text, binary images), OMIM (text file), and Microarray Data (RDBMS, Excel)
The (ever expanding) Entrez System Entrez PopSet Structure PubMed Books 3D Domains Taxonomy GEO/GDS UniGene Nucleotide Protein Genome OMIM CDD/CDART Journals SNP UniSTS PubMed Central
Web Access:
NCBI Web Traffic (figure: users per day, with Christmas and New Year’s Day marked)
The Entrez System: Text Searches
Types of Databases
Primary Databases
–Original submissions by experimentalists
–Content controlled by the submitter
Examples: GenBank, SNP, GEO
Derivative Databases
–Built from primary data
–Content controlled by third party (NCBI)
Examples: RefSeq, TPA, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
Entrez Nucleotides
Primary
  GenBank / EMBL / DDBJ    49,675,750
Derivative
  RefSeq                      545,503
  Third Party Annotation        4,544
  PDB                           5,561
Total                      50,231,358
Entrez Protein: Derivative Databases
  GenPept                   3,950,968
  RefSeq                    1,348,072
  Third Party Annotation        4,133
  Swiss Prot                  170,087
  PIR                         282,821
  PRF                          12,079
  PDB                          61,845
Total                       5,830,005
BLAST nr total              2,336,522
The Growth of GenBank Release 148: 45.2 million records 49.4 billion nucleotides Average doubling time ≈ 14 months*
Organization of GenBank: Traditional Divisions
Records are divided into 17 divisions: 11 traditional and 6 bulk.
Traditional divisions: direct submissions (Sequin and BankIt); accurate; well characterized.
PRI (28) Primate
PLN (13) Plant and Fungal
BCT (11) Bacterial and Archaeal
INV (7) Invertebrate
ROD (15) Rodent
VRL (4) Viral
VRT (7) Other Vertebrate
MAM (1) Mammalian
PHG (1) Phage
SYN (1) Synthetic (cloning vectors)
UNA (1) Unannotated
Entrez query: gbdiv_xxx[Properties]
Organization of GenBank: Bulk Divisions
Records are divided into 17 divisions: 11 traditional and 6 bulk.
Bulk divisions: Batch Submission ( and FTP); inaccurate; poorly characterized.
EST (355) Expressed Sequence Tag
GSS (132) Genome Survey Sequence
HTG (62) High Throughput Genomic
STS (5) Sequence Tagged Site
HTC (6) High Throughput cDNA
PAT (17) Patent
Entrez query: gbdiv_xxx[Properties]
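The gbdiv_xxx[Properties] pattern can be generated for any of the 17 division codes. The following is a minimal Python sketch, not an NCBI tool: the helper name division_query is hypothetical, and the organism term in the second call is only an illustration of combining the filter with another Entrez term.

```python
# Three-letter division codes as listed on the two division slides.
TRADITIONAL = {"pri", "pln", "bct", "inv", "rod", "vrl", "vrt", "mam", "phg", "syn", "una"}
BULK = {"est", "gss", "htg", "sts", "htc", "pat"}


def division_query(code, extra_term=None):
    """Build an Entrez query string of the form gbdiv_xxx[Properties].

    `extra_term` is an optional Entrez term that is AND-ed with the
    division filter (hypothetical convenience, not part of the slides).
    """
    code = code.lower()
    if code not in TRADITIONAL | BULK:
        raise ValueError(f"unknown GenBank division: {code}")
    query = f"gbdiv_{code}[Properties]"
    if extra_term:
        query = f"{extra_term} AND {query}"
    return query


print(division_query("EST"))
print(division_query("pln", "Malus[Organism]"))
```

Keeping the two code sets separate makes it easy to restrict a search to the accurate, well-characterized traditional divisions only.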
File Formats of the Sequence Databases Each sequence is represented by a text record called a flat file. GenBank/GenPept (useful for scientists) FASTA (the simplest format) ASN.1 & XML (useful for programmers)
A Traditional GenBank Record: The Flatfile Format

Header
LOCUS       AY bp mRNA linear PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.
ACCESSION   AY
VERSION     AY GI:
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, (2004)
REFERENCE   2 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
REFERENCE   3 (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:

Feature Table
FEATURES    Location/Qualifiers
source
            /organism="Malus x domestica"
            /mol_type="mRNA"
            /cultivar="'Law Rome'"
            /db_xref="taxon:3750"
            /tissue_type="peel"
gene
            /gene="AFS1"
CDS
            /gene="AFS1"
            /note="terpene synthase"
            /codon_start=1
            /product="(E,E)-alpha-farnesene synthase"
            /protein_id="AAO "
            /db_xref="GI: "
            /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
            NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
            EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
            DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
            GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
            LSLLFQPLVN"

Sequence
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//
The Header: the LOCUS through COMMENT lines of the record above.
Header: Locus Line
LOCUS AY bp mRNA linear PLN 04-MAY-2004
Labeled fields: locus name, length, molecule type, division, modification date.
Header: Database Identifiers
ACCESSION AY
VERSION AY GI:
Accession: stable, reportable, universal.
Version: tracks changes in sequence.
GI number: NCBI internal use.
Header: Organism
SOURCE Malus x domestica (cultivated apple)
ORGANISM Malus x domestica
Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
The organism name and lineage follow the NCBI-controlled taxonomy.
The Feature Table
FEATURES    Location/Qualifiers
source
            /organism="Malus x domestica"
            /mol_type="mRNA"
            /cultivar="'Law Rome'"
            /db_xref="taxon:3750"
            /tissue_type="peel"
gene
            /gene="AFS1"
CDS
            /gene="AFS1"
            /note="terpene synthase"
            /codon_start=1
            /product="(E,E)-alpha-farnesene synthase"
            /protein_id="AAO "
            /db_xref="GI: "
            /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
            NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
            EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
            NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
            LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
            ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
            EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
            KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
            DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
            GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
            LSLLFQPLVN"
Coding sequence: runs from the start codon (atg) to the stop codon (tag).
Implied protein: the /translation qualifier.
GenPept identifiers: /protein_id and /db_xref.
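The "implied protein" can be reproduced by translating the mRNA from its start codon with the standard genetic code (gcode 1, codon_start=1). The sketch below is a minimal Python illustration, assuming the first 120 bases of the ORIGIN block as input; the function name translate_cds is hypothetical. Taking the first atg as the start is a simplification that happens to work for this record, whereas a real parser would use the CDS location from the feature table.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_cds(mrna, start):
    """Translate `mrna` from index `start` until a stop codon or the end."""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":  # stop codon ends the coding sequence
            break
        protein.append(residue)
    return "".join(protein)


# First two ORIGIN lines of the record (bases 1 to 120), spaces removed.
mrna = (
    "ttcttgtatcccaaacatctcgagcttcttgtacaccaaattaggtattcactatggaat"
    "tcagagttcacttgcaagctgataatgagcagaaaatttttcaaaaccagatgaaacccg"
)
start = mrna.find("atg")  # simplification: first atg taken as the start codon
print(start + 1, translate_cds(mrna, start))
```

The translated residues agree with the beginning of the /translation qualifier (MEFRVHLQADNEQKIFQNQMKP...).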
The Sequence: 99.99% Accurate
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      ...
     1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//
FASTA Format
>gi|30256|emb|CAA | c-src-kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREG
VKAGTKLSLMPWFHGKITREQAERLLYPPETGLFLVRESTNYPGDYTLCVSCDGKVEHYRIMYHASKLSI
DEEVYFENLMQLVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVM
LGDYRGNKVAVKCIKNDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRS
RGRSVLGGDCLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLTKEASSTQDTGKLPV
KWTAPEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMK
NCWHLDAAMRPSFLQLREQLEHIKTHELHL
Definition line fields, left to right: ">", gi number, database identifier, Accession.Version, locus name, then the description with the organism in brackets.
Database Identifiers:
gb   GenBank
emb  EMBL
dbj  DDBJ
ref  RefSeq
sp   SWISS-PROT
pdb  Protein Databank
pir  PIR
prf  PRF
tpg  TPA-GenBank
tpe  TPA-EMBL
tpj  TPA-DDBJ
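Because FASTA is the simplest of the formats, it is also the easiest to read programmatically. The sketch below is a minimal Python reader, assuming the gi|number|db|accession|locus layout described on the slide; the function names, the accession XX00000.1, the GI 12345, and the locus HYPO1 are all made up for illustration, since the slide's own identifiers are incomplete.

```python
def read_fasta(text):
    """Yield (definition_line, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_defline(defline):
    """Split a gi|number|db|accession|locus definition line into labeled parts."""
    ids, _, description = defline.lstrip(">").partition(" ")
    fields = ids.split("|")
    record = {"description": description.strip()}
    if len(fields) >= 4 and fields[0] == "gi":
        record["gi"] = fields[1]
        record["db"] = fields[2]          # e.g. gb, emb, dbj, ref, sp
        record["accession"] = fields[3]   # Accession.Version
        record["locus"] = fields[4] if len(fields) > 4 else ""
    return record


# Hypothetical record: identifiers are placeholders, residues are the
# first 40 amino acids of the slide's example.
example = """>gi|12345|emb|XX00000.1|HYPO1 hypothetical kinase [Homo sapiens]
MSAIQAAWPSGTECIAKYNF
HGTAEQDLPFCKGDVLTIVA
"""

for defline, sequence in read_fasta(example):
    print(parse_defline(defline), len(sequence))
```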
Abstract Syntax Notation: ASN.1
Seq-entry ::= set {
  class nuc-prot,
  descr {
    title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.",
    source {
      org {
        taxname "Malus x domestica",
        common "cultivated apple",
        db { { db "taxon", tag id 3750 } },
        orgname {
          name binomial { genus "Malus", species "x domestica" },
          mod {
            { subtype cultivar, subname "'Law Rome'" },
            { subtype old-name, subname "Malus domestica", attrib "(10)cultivar='Law Rome'" } },
          lineage "Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus",
          gcode 1,,
Diagram labels: ASN.1, GenBank, GenPept, FASTA Nucleotide, FASTA Protein
Bulk Divisions
Batch Submission and htg ( and ftp); inaccurate; poorly characterized
Expressed Sequence Tag –first-pass, single-read cDNA
Genome Survey Sequence –first-pass, single-read gDNA
High Throughput Genomic –incomplete sequences of genomic clones
Sequence Tagged Site –PCR-based mapping reagents
EST Division: Expressed Sequence Tags
Diagram: RNA gene products (nucleus, 30,000 genes); make cDNA library (,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5’ and 3’).
>IMAGE: ', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE: ' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
gbdiv_est[Properties]
ESTs in Entrez
Total      26 million records
Human      6.0 million
Mouse      4.3 million
Rat        0.7 million
Zebrafish  0.6 million
Wheat      0.6 million
Barley     0.3 million
Maize      0.4 million
Genome Sequencing - HTG, GSS, (WGS) (diagram): a whole BAC insert (or genome) goes through shredding, cloning, isolating, sequencing, and assembly. Single reads go to the GSS division or the trace archive; draft sequence goes to the HTG division; whole genome shotgun assemblies go to a traditional division.
Maize Genome Survey Sequences Surveys of BAC Libraries BAC end sequences More than 100K per project
HTG Division: Rice Draft Sequences Unfinished sequences of BACs Gaps and unordered pieces Finished sequences move to traditional GenBank division
Whole Genome Shotgun Projects Traditional GenBank Divisions projects –Virus – Bacteria – Environmental sequences – Archaea –51 Eukaryotes featuring: Cow, Chicken, Rat, Mouse, Dog, Chimpanzee, Human Pufferfish (2) Honeybee, Anopheles, Fruit Flies (3), Silkworm Nematode (C. briggsae) Yeasts (8), Aspergillus (2) Rice
Zebrafish: WGS wgs_master[Properties]
Derivative Databases UniGene RefSeq TPA
Primary vs. Derivative Sequence Databases (diagram): labs and sequencing centers submit sequences to GenBank, the primary database, which is updated ONLY by submitters. Derivative databases are built from GenBank by algorithms (UniGene), curators (RefSeq), and genome assembly, and are updated continually by NCBI.
What is UniGene?
A gene-oriented view of sequence entries
MegaBlast based automated sequence clustering
New! Now informed by genome hits
Nonredundant set of gene-oriented clusters
Each cluster a unique gene
Information on tissue types and map locations
Includes known genes and uncharacterized ESTs
Useful for gene discovery and selection of mapping reagents
EST hits: Human mRNA Albumin mRNA 5’ EST hits 3’ EST hits
UniGene: Expressed Sequences
Expression Data
RefSeq: RELEASE 11 (May 13, 2005) AVAILABLE ON THE FTP SITE!
Forming the “best representative” sequence
Standardizing nomenclature and record structure
Adding annotation (references, sequence features)
Stable reference for, for example, gene identification, polymorphism discovery, comparative analysis
RefSeq Release 11 includes over 1,425,971 proteins and 2,928 organisms. The release is available by FTP at: ftp://ftp.ncbi.nih.gov/refseq/release/
The RefSeq number is still not fixed.
srcdb_refseq[Properties]
Curated RefSeq Records
COMMENT REVIEWED REFSEQ: This record has been curated by NCBI staff. The reference sequence was derived from X Summary: Adenylosuccinate synthetase catalyzes the first committed step in the conversion of IMP to AMP.
RefSeq Nucleotide
LOCUS ADSS 1368 bp mRNA linear PRI 27-AUG-2002
DEFINITION Homo sapiens adenylosuccinate synthase (ADSS), mRNA.
ACCESSION NM_
VERSION NM_ GI:
RefSeq Protein
LOCUS ADSS 455 aa linear PRI 27-AUG-2002
DEFINITION adenylosuccinate synthase; Adenylosuccinate synthetase (Ade(-)H-complementing) Homo sapiens.
ACCESSION NP_
VERSION NP_ GI:
DBSOURCE REFSEQ: accession NM_
X records (genome annotation): Inferred or Predicted
vs.
N records: Provisional, Reviewed, or Validated
RefSeq Accession Numbers
mRNAs and Proteins
  NM_  Curated mRNA
  NP_  Curated Protein
  NR_  Curated non-coding RNA
  XM_  Predicted mRNA
  XP_  Predicted Protein
  XR_  Predicted non-coding RNA
Gene Records
  NG_  Reference Genomic Sequence
Chromosome
  NC_  Microbial replicons, organelle genomes, human chromosomes
Assemblies
  NT_  Contig
  NW_  WGS Supercontig
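Since the prefix alone tells what kind of record an accession refers to, the table lends itself to a simple lookup. A minimal Python sketch follows; the function name classify_refseq is hypothetical and the accessions in the usage lines are placeholders, not real records.

```python
# Prefix meanings exactly as listed on the RefSeq accession slide.
REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "predicted mRNA",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "NG": "reference genomic sequence",
    "NC": "chromosome (microbial replicons, organelle genomes, human chromosomes)",
    "NT": "contig assembly",
    "NW": "WGS supercontig assembly",
}


def classify_refseq(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    prefix, separator, _ = accession.partition("_")
    if not separator:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix.upper())


# Placeholder accessions, for illustration only.
for accession in ("NM_000000.1", "XP_000000.1", "AY000000"):
    print(accession, "->", classify_refseq(accession))
```

The underscore is what separates RefSeq accessions from GenBank ones, so an accession without it is reported as None rather than guessed at.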
RefSeq Curation Processes (diagram labels): curated genomic DNA (NC, NT, NW); model mRNA (XM, XR); model protein (XP); curated mRNA (NM, NR); protein (NP)
RefSeq: NCBI’s Derivative Sequence Database
Curated transcripts and proteins
–reviewed
–human, mouse, rat, fruit fly, zebrafish, arabidopsis, microbial genomes (proteins), and more
Model transcripts and proteins
Assembled Genomic Regions (contigs)
–human genome
–mouse genome
–rat genome
Chromosome records
–human genome
–microbial
–organelle
ftp://ftp.ncbi.nih.gov/refseq/release/
srcdb_refseq[Properties]
RefSeq Benefits non-redundancy explicitly linked nucleotide and protein sequences updates to reflect current sequence data and biology data validation format consistency distinct accession series stewardship by NCBI staff and collaborators
Third Party Annotation (TPA) Database Annotations of existing GenBank sequences Allows for community annotation of genomes Direct submissions –BankIt –Sequin tpa[Properties]
TPA record: WGS Assembly CDS Feature TPA protein
Human Nucleotide Sequences
ISDC (GenBank/EMBL/DDBJ)   8,965,327
  PRI                        916,017 (WGS 601,855)
  EST                      6,003,916
  GSS                        905,645
  HTG                         18,364
  HTC                         49,373
  STS                        117,870
  PAT                        953,269
RefSeq                        35,934
TPA                              893
Total                      9,002,154
Other NCBI Databases
dbSNP: nucleotide polymorphism
GEO: Gene Expression Omnibus; microarray and other expression data
Gene: gene records; unifies LocusLink and Microbial Genomes
Structure: imported structures (PDB); Cn3D viewer, NCBI curation
CDD: conserved domain database; protein families (COGs), single domains (PFAM, SMART, CD)
NCBI’s SNP Database
Primary Database and Derivative (RefSNP)
Single Nucleotide Polymorphism
Repeat polymorphisms
Insertion-Deletion Polymorphisms
24 Species
Over 15 million submissions
Submitted SNP: Hemochromatosis SNP
Non-redundant Computational Analysis BLAST hits to genome, mRNA, protein and structure RefSNP
Sequence Similarity Searching Basic Local Alignment Search Tool (BLAST)
BLAST VAST Pubmed Text Sequence Structure
Pairwise Alignment Summary
Global: best score for aligning the full-length sequences. Dynamic programming. Algorithm: Needleman-Wunsch. Table cells are allowed any score.
Local: best score for aligning part of the sequences. Dynamic programming. Algorithm: Smith-Waterman. Table cells never score below zero.
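The only differences between the two dynamic-programming tables are the boundary conditions and the floor at zero. The sketch below shows both in one Python function under simplifying assumptions that are not from the slides: match +1, mismatch -1, a linear gap cost of -1 per residue, and score-only output with no traceback. The function name and the two test sequences are hypothetical.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-1):
    """Score-only dynamic programming for pairwise alignment.

    local=False: Needleman-Wunsch style (cells may take any score,
                 answer is the bottom-right cell).
    local=True:  Smith-Waterman style (cells never drop below zero,
                 answer is the best cell anywhere in the table).
    """
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment charges for leading gaps.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # local alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]


# Two hypothetical sequences that share only the block ACGT.
seq1, seq2 = "TTTTACGT", "ACGTCCCC"
print("local :", alignment_score(seq1, seq2, local=True))
print("global:", alignment_score(seq1, seq2, local=False))
```

The local score rewards the shared block alone, while the global score is pulled down by the unrelated flanking residues that must also be aligned.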
Global vs Local Alignment Seq 1 Seq 2 Seq 1 Seq 2 Global alignment Local alignment
Global Alignment Human: 15 IAKYNFHGTAEQDLPFCKGDVLTIVAVTKDPNWYKAKNKVGREGIIPANYVQKREGVKAGTKLSLMPWFH 84 +A DL F K D+L I+ T+ W+ GR G IP+NYV PW+ Worm: 63 VALFQYDARTDDDLSFKKDDILEILNDTQGDWWFARHKATGRTGYIPSNYVAREKSIES------QPWYF 125 Human: 85 GKITREQAERLLYPP--ETGLFLVRESTNYPGDYTLCVSCDGKVEHYRI-MYHASKLSIDEEVYFENLMQ 151 GK+ R AE+ L E G FLVR+S + D +L V + V+HYRI + H I F L Worm: 126 GKMRRIDAEKCLLHTLNEHGAFLVRDSESRQHDLSLSVRENDSVKHYRIQLDHGGYF-IARRRPFATLHD 194 Human: 152 LVEHYTSDADGLCTRLIKPKVMEGTVAAQDEFYRSGWALNMKELKLLQTIGKGEFGDVMLGDYRGN-KVA 220 L+ HY +ADGLC L P Y W L++ IG G+FG+V G + N VA Worm: 195 LIAHYQREADGLCVNLGAPCAKSEAPQTTTFTYDDQWEVDRRSVRLIRQIGAGQFGEVWEGRWNVNVPVA 264 Human: 221 VKCIK-NDATAQAFLAEASVMTQLRHSNLVQLLGVIVEEKGGLYIVTEYMAKGSLVDYLRSRGRSVLGGD 289 VK +K A FLAEA +M +LRH L+ L V ++ + IVTE M + +L+ +L+ RGR Worm: 265 VKKLKAGTADPTDFLAEAQIMKKLRHPKLLSLYAVCTRDE-PILIVTELMQE-NLLTFLQRRGRQCQMPQ 332 Human: 290 CLLKFSLDVCEAMEYLEGNNFVHRDLAARNVLVSEDNVAKVSDFGLT----KEASSTQDTG-KLPVKWTA 353 L++ S V M YLE NF+HRDLAARN+L++ K++DFGL KE TG + P+KWTA Worm: 333 -LVEISAQVAAGMAYLEEMNFIHRDLAARNILINNSLSVKIADFGLARILMKENEYEARTGARFPIKWTA 401 Human: 354 PEALREKKFSTKSDVWSFGILLWEIYSFGRVPYPRIPLKDVVPRVEKGYKMDAPDGCPPAVYEVMKNCWH 423 PEA +F+TKSDVWSFGILL EI +FGR+PYP + +V+ +V+ GY+M P GCP +Y++M+ CW Worm: 402 PEAANYNRFTTKSDVWSFGILLTEIVTFGRLPYPGMTNAEVLQQVDAGYRMPCPAGCPVTLYDIMQQCWR 471 Human: 424 LDAAMRPSFLQLREQLEHI 443 D RP+F L+ +LE + Worm: 472 SDPDKRPTFETLQWKLEDL 492 human M SAIQ AAWPSGT ECIAKYNFHG M S.. AA SG...A.... worm MGSCIGKEDPPPGATSPVHTSSTLGRESLPSHPRIPSIGPIAASSSGNTIDKNQNISQSANFVALFQYDA human REQLEHI KTHELHL..::. :... worm QWKLEDLFNLDSSEYKEASINF 500 Align program (Lipman and Pearson)
Basic Local Alignment Search Tool Widely used similarity search tool Heuristic approach based on Smith Waterman algorithm Finds best local alignments Provides statistical significance All combinations (DNA/Protein) query and database. –DNA vs DNA –DNA translation vs Protein –Protein vs Protein –Protein vs DNA translation –DNA translation vs DNA translation www, standalone, and network clients
What BLAST tells you BLAST reports surprising alignments –Different than chance Assumptions –Random sequences –Constant composition Conclusions –Surprising similarities imply evolutionary homology Evolutionary Homology: descent from a common ancestor Does not always imply similar function
BLAST/FASTA variants for different searches
Program         Query    Database  Comparison     Searching purpose
blastn/fasta    DNA      DNA       DNA level      homologous DNA
blastp/fasta    Protein  Protein   Protein level  homologous protein
blastx/fastx    DNA      Protein   Protein level  New genes from DNA
tblastn/tfasta  Protein  DNA       Protein level  New genes from peptide
tblastx/tfastx  DNA      DNA       Protein level  New genes from DNA
BLAST Web site:
FASTA Web sites: or
BLASTN Databases
nr      GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs    High-throughput genomic sequences (draft)
pat     Patented nucleotide sequences
mito    Mitochondrial sequences
vector  Vector subset of GenBank
month   GenBank, EMBL, DDBJ, PDB from the last 30 days
chrom   Contigs and chromosomes from RefSeq
BLASTP Databases
nr         GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot  SWISS-PROT
pat        Patented protein sequences
pdb        Protein Data Bank
month      GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days
Nucleotide Words
Query: GTACTGGACATGGACCCTACAGGAACGTATACGTAAG
Make a lookup table of words.
11-mers: GTACTGGACAT TACTGGACATG ACTGGACATGG CTGGACATGGA TGGACATGGAC GGACATGGACC GACATGGACCC ACATGGACCCT ...
28-mers: GTACTGGACATGGACCCTACAGGAACGT TGGACATGGACCCTACAGGAACGTATAC CATGGACCCTACAGGAACGTATACGTAA ...
WORD SIZE   Min.  Def.
megablast   12    28
blastn      7     11
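The lookup table maps every word in the query to the positions where it occurs, so a database sequence can be scanned once and each of its words checked in constant time. The following Python sketch illustrates the idea with the slide's query and the blastn default word size of 11; the function names and the subject sequence are made up for the example, and real BLAST uses a far more compact table.

```python
from collections import defaultdict


def word_table(query, word_size=11):
    """Map each word of length `word_size` to its start positions in the query."""
    table = defaultdict(list)
    for start in range(len(query) - word_size + 1):
        table[query[start:start + word_size]].append(start)
    return table


def exact_hits(table, subject, word_size=11):
    """Return (query_position, subject_position) for every exact word match."""
    hits = []
    for s in range(len(subject) - word_size + 1):
        for q in table.get(subject[s:s + word_size], ()):
            hits.append((q, s))
    return hits


query = "GTACTGGACATGGACCCTACAGGAACGTATACGTAAG"
# Hypothetical subject: 15 bases copied from the query, with unrelated flanks.
subject = "CCCC" + "CATGGACCCTACAGG" + "TTTT"

table = word_table(query)
hits = exact_hits(table, subject)
print(len(table), "query words")
print(hits)
```

All hits from one conserved stretch fall on the same diagonal (query position minus subject position is constant), which is what the extension step then builds on.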
Protein Words
Query: GTQITVEDLFYNIATRRKALKN
Make a lookup table of words: GTQ TQI QIT ITV TVE VED EDL DLF ...
Neighborhood Words: LTV, MTV, ISV, LSV, etc.
Word size = 3 (default). Word size can only be 2 or 3.
Minimum Requirements for a Hit
Nucleotide BLAST requires one exact match (one match): query ATCGCCATGCTTAATTGGGCTT, exact word match CATGCTTAATT.
Protein BLAST requires two neighboring matches within 40 aa (two matches): query GTQITVEDLFYNI, neighborhood words SEI and YYN.
BLASTP Summary
Query: IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILEV…
Example query words: YLS, HFL
Neighborhood words (neighborhood score threshold T (-f) = 11):
YLS 15, YLT 12, YVS 12, YIT 10, etc …
HFL 18, HFV 15, HFS 14, HWL 13, NFL 13, DFL 12, HWV 10, etc …
High-scoring pair (HSP):
Query 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI 47
          +E  YA YL K    F+YLSL +SP+ +DVNVHP+K  VHFL+++ I
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEI 333
Gapped extension with trace back gives the final HSP:
Query 1   IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESI-LEV… 50
Sbjct 287 LEETYAKYLHKGASYFVYLSLNMSPEQLDVNVHPSKRIVHFLYDQEIATSI… 337
BLAST Algorithm
(1) Break the query sequence into words of length W (W default = 11).
(2) Compare the word list to the database and identify exact matches.
(3) For each word match, extend alignment in both directions (4) Compute E-value
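The slide names step (4) without giving the formula. In the standard Karlin-Altschul statistics that BLAST reports, a raw score S is normalized to a bit score S' = (lambda * S - ln K) / ln 2, and the expected number of chance alignments is E = m * n * 2^(-S'), where m and n are the query and database lengths. The Python sketch below implements just these two expressions; the function names and all numeric values are illustrative assumptions, and BLAST itself additionally uses effective (edge-corrected) lengths.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_length, database_length):
    """Expected number of chance alignments scoring at least `bits`: E = m * n * 2^(-S')."""
    return query_length * database_length * 2.0 ** (-bits)


# Illustrative numbers only: lambda and K depend on the scoring system,
# and the lengths below are not taken from any real search.
lam, k = 0.3, 0.1
bits = bit_score(100, lam, k)
print(f"bit score = {bits:.1f}")
print(f"E-value   = {e_value(bits, 500, 1_000_000_000):.2e}")
```

Each additional bit halves the E-value, and doubling the database size doubles it, which is why the same alignment becomes less significant as databases grow.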
An alignment that BLAST can’t find
  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG
 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT
121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC
Here there are no exactly matching words longer than 6; for nucleotides there must be an exact match of at least 7.
An Alignment BLAST Can Make
Solution: compare protein sequences; BLASTX
BLAST 2 Sequences (blastx) output:
Score = 290 bits (741), Expect = 7e-77
Identities = 147/331 (44%), Positives = 206/331 (61%), Gaps = 8/331 (2%)
Frame = +3
Nucleotide vs. Protein BLAST
Comparing ADSS from H. sapiens and A. thaliana
            aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc
H.sapiens:   N  R  V  T  V  V  L  G  A  Q  W  G  D  E  G
             +  +  V  +     V  L  G     Q  W  G  D  E  G
A.thaliana:  S  Q  V  S  G  V  L  G  C  Q  W  G  D  E  G
            agtcaagtatctggtgtactcggttgccaatggggagatgaaggt
BLASTp finds three matching words.
BLASTn finds no match, because there are no 7 bp words.
Protein searches are generally more sensitive than nucleotide searches.
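The claim that blastn finds no seed here can be checked by measuring the longest run of identical bases between the two coding sequences. A minimal Python sketch, assuming the two sequences are already aligned position by position as on the slide; the function name is hypothetical.

```python
def longest_exact_run(a, b):
    """Length of the longest stretch of identical characters at the same positions."""
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0  # a mismatch resets the current run
        best = max(best, run)
    return best


human = "aaccgggtgacggtggtgctcggtgcgcagtggggcgacgaaggc"
plant = "agtcaagtatctggtgtactcggttgccaatggggagatgaaggt"

run = longest_exact_run(human, plant)
print("longest identical run:", run)
print("seed possible at minimum word size 7:", run >= 7)
```

Third-position codon changes break up the nucleotide identity every few bases even where the encoded protein is conserved, which is why the protein-level comparison succeeds where the DNA-level one does not.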
The Flavors of BLAST
Standard BLAST
–traditional “contiguous” word hit
–position independent scoring
–nucleotide, protein and translations (blastn, blastp, blastx, tblastn, tblastx)
Megablast
–optimized for large batch searches
–can use discontiguous words
PSI-BLAST
–constructs PSSMs automatically; uses as query
–very sensitive protein search
RPS BLAST
–searches a database of PSSMs
–tool for conserved domain searches
Megablast: NCBI’s Genome Annotator
Long alignments for similar DNA sequences
Concatenation of query sequences
Faster than blastn
Contiguous Megablast
–exact word match
–word size 28
Discontiguous Megablast
–initial word hit with mismatches
–cross-species comparison
MegaBLAST AI AI AI BE C:\seq\hs.4.fsa > gnl|UG|Hs#S qd43b11.x1 Homo sapiens cDNA, 3' end CATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGTTTG GTGAGAAGTGCTCGATTAGTTCAGACAACATCTGGCACTTGATGTCTGTCCTTCCCTCCT TTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAAGGT GACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACACCG TCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAAAAC CACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGC > gnl|UG|Hs#S qv37f11.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > gnl|UG|Hs#S qv33c06.x1 Homo sapiens cDNA, 3' end GAGAAGACGACAGAAGGGGAGAAGAGAGTAGGAAAAAGGAGGGAAGGACAGACATCAAGT GCCAGATGTTGTCTGAACTAATCGAGCACTTCTCACCAAACTTCATGTATAAATAAAATA CATATTTTTAAAACAAACCAATAAATGGCTTACATCAAAAAAAAAAAAAAAAAAAAAAAA GTCGTATCGATGT > gnl|UG|Hs#S e65f04.x1 Homo sapiens cDNA, 3' end TTTCATGTAAGCCATTTATTGGTTTGTTTTAAAAATATGTATTTTATTTATACATGAAGT TTGGTGAGAAGTGCTCGATTAGTTCAAACAACATCTGGCACTTGATGTCTGTCCTTCCCT CCTTTTTCCTACTCTCTTCTCCCCTCCTGCTGGTCATTGTGCAGTTCTGGAAATTAAAAA GGTGACAGCCAGGCTAAAAGCTAAGGGTTGGGTCTAGCTCACCTCCCACCCCCAACCACA CCGTCTGCAGCCAGCCCCAGGCACCTGTCTCAAAGCTCCCGGGCTGTCCACACACACAAA AACCACAGTCTCCTTCCGGCCAGCTGGGCTGGCAGCCCGACCTGCCTCCCAACCGCATTC CTGCCTGTGTAGCAGGCGGTGAGCACCCAGAAGGGGCACATACCTCTCCAAGCCTTGAAA GCAAAGCATGGAGATCTACAAAAATAGGATTTCCACTTGGAGAAATGTCGCTGGGACAGT
Templates for Discontiguous Words
W = 11, t = 16, coding:
W = 11, t = 16, non-coding:
W = 12, t = 16, coding:
W = 12, t = 16, non-coding:
W = 11, t = 18, coding:
W = 11, t = 18, non-coding:
W = 12, t = 18, coding:
W = 12, t = 18, non-coding:
W = 11, t = 21, coding:
W = 11, t = 21, non-coding:
W = 12, t = 21, coding:
W = 12, t = 21, non-coding:
W = word size; number of matches in template
t = template length (window size within which the word match is evaluated)
Reference: Ma, B, Tromp, J, Li, M. PatternHunter: faster and more sensitive homology search. Bioinformatics, March 2002; 18(3):440-5
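To make the W and t notation concrete, here is a minimal Python sketch of a discontiguous word match. The template string below is a hypothetical illustration with W = 11 and t = 18, chosen only for this example and not necessarily one of the MegaBLAST templates listed on the slide; the function names are likewise assumptions.

```python
# Hypothetical template for illustration: '1' = position must match, '0' = ignored.
TEMPLATE = "111010010100110111"  # W = 11 ones inside a window of t = 18


def template_hit(a: str, b: str, template: str = TEMPLATE) -> bool:
    """Return True if windows a and b agree at every '1' position of the template."""
    if not (len(a) == len(b) == len(template)):
        raise ValueError("both windows must have the template length t")
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


def contiguous_hit(a: str, b: str, word_size: int) -> bool:
    """Return True if a and b share an exact word of word_size at the same offset."""
    return any(
        a[i:i + word_size] == b[i:i + word_size]
        for i in range(len(a) - word_size + 1)
    )


window = "ACGTACGTACGTACGTAC"
# Same window with mismatches only at '0' positions of the template (indices 3, 8, 14).
diverged = "ACGAACGTTCGTACCTAC"

print(template_hit(window, diverged))        # discontiguous word still hits
print(contiguous_hit(window, diverged, 11))  # no exact 11-mer survives
```

The sketch compares two windows directly; a real search tool would index the words in a lookup table rather than test window pairs one at a time.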
Scoring Systems - Nucleotides
Identity matrix
     A    G    C    T
A   +1   -3   -3   -3
G   -3   +1   -3   -3
C   -3   -3   +1   -3
T   -3   -3   -3   +1

CAGGTAGCAAGCTTGCATGTCA
|| ||||||||||||  |||||
CACGTAGCAAGCTTG-GTGTCA
raw score = 19 - 9 = 10
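The raw score on this slide can be reproduced with a short Python sketch. The match and mismatch values (+1, -3) come from the identity matrix; the gap cost of 3 is an assumption, chosen because 19 matches, 2 mismatches and 1 gap then give the slide's total of 19 - 9 = 10. The function name is illustrative.

```python
def raw_score(query: str, subject: str, match: int = 1,
              mismatch: int = -3, gap: int = -3) -> int:
    """Score two aligned, equal-length nucleotide strings column by column."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            score += gap        # assumed flat cost per gapped column
        elif q == s:
            score += match      # diagonal of the identity matrix
        else:
            score += mismatch   # off-diagonal entry
    return score


print(raw_score("CAGGTAGCAAGCTTGCATGTCA",
                "CACGTAGCAAGCTTG-GTGTCA"))
```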
Scoring Systems - Proteins
Position Independent Matrices
PAM Matrices (Percent Accepted Mutation)
–Derived from observation; small dataset of alignments
–Implicit model of evolution
–All calculated from PAM1
–PAM250 widely used
BLOSUM Matrices (BLOck SUbstitution Matrices)
–Derived from observation; large dataset of highly conserved blocks
–Each matrix derived separately from blocks with a defined percent identity cutoff
–BLOSUM62 - default matrix for BLAST
Position Specific Score Matrices (PSSMs)
–PSI- and RPS-BLAST
BLOSUM62
[Matrix figure: rows and columns A R N D C Q E G H I L K M F P S T W Y V X; first entries shown: A/A = 4, R/A = -1, R/R = 5]
Common amino acids have low weights
Rare amino acids have high weights
Negative for less likely substitutions
Positive for more likely substitutions
Position-Specific Score Matrix DAF-1 Serine/Threonine protein kinases catalytic loop 174 PSSM scores 5 4
A R N D C Q E G H I L K M F P S T W Y V 435 K E S N K P A M A H R D I K S K N I M V K N D L Position-Specific Score Matrix catalytic loop [ >./blastpgp -i NP_ d nr -j 3 -Q NP_ pssm ]
Gapped Alignments
Gapping provides more biologically realistic alignments
Statistical behavior is not completely understood for gapped alignments
Gapped BLAST parameters must be found by simulations for each matrix
Gap costs: -(a + bk)
–a = gap open penalty
–b = gap extend penalty
–k = number of residues in the gap
For example: a gap of 1 residue receives the score -(a + b).
Scores
V D S – C Y
V E T L C F
BLOSUM = 7
PAM = 11
Simply add the scores for each pair of aligned residues and (as necessary) factor in the gaps!
Different matrices produce different scores!
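The BLOSUM total on this slide can be checked with a small Python sketch. It hard-codes only the five BLOSUM62 pair scores needed for this alignment and assumes gap costs of a = 11 (open) and b = 1 (extend), which reproduce the slide's value of 7; the slide itself does not state the gap costs, and the function names are illustrative.

```python
# Only the BLOSUM62 entries needed for this example alignment.
BLOSUM62_PAIRS = {
    ("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1,
    ("C", "C"): 9, ("Y", "F"): 3,
}


def gap_cost(k: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Affine gap cost -(a + b*k) for a gap spanning k residues."""
    return -(gap_open + gap_extend * k)


def pair_score(a: str, b: str, pairs: dict) -> int:
    """Look up a symmetric substitution score."""
    if (a, b) in pairs:
        return pairs[(a, b)]
    return pairs[(b, a)]  # raises KeyError if the pair is not in the table


def alignment_score(top: str, bottom: str, pairs: dict) -> int:
    """Sum pair scores and charge each run of gapped columns once."""
    score, gap_len = 0, 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_len += 1
            continue
        if gap_len:
            score += gap_cost(gap_len)
            gap_len = 0
        score += pair_score(a, b, pairs)
    if gap_len:
        score += gap_cost(gap_len)
    return score


# 4 + 2 + 1 - (11 + 1) + 9 + 3
print(alignment_score("VDS-CY", "VETLCF", BLOSUM62_PAIRS))
```

For simplicity the sketch treats any consecutive gapped columns as one gap, even if the gap switches from one sequence to the other.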
Matrix differences
BLOSUM
–Built from local alignments
–Built from a vast amount of data
–Based on groups of related sequences counted as one
–Better for finding local alignments
–Lower BLOSUM series means more divergence
PAM
–Built from global alignments
–Built from a small amount of data
–Based on minimum replacement or maximum parsimony
–Better for finding global alignments and remote homologs
–Higher PAM series means more divergence
Matrices - Rules of thumb
Need different levels of sensitivity?
–Close relationships: low PAM number (e.g. PAM 1) or high BLOSUM number (e.g. BLOSUM 80)
–Distant relationships: high PAM number (e.g. PAM 250) or low BLOSUM number (e.g. BLOSUM 45)
Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution (applies to ungapped alignments)
[Figure: distribution curve with axes labelled "Score" and "Alignments"]
Expect Value E = number of database hits you expect to find by chance
$E = K m n e^{-\lambda S}$
$E = m n 2^{-S'}$
–$K$ = scale for search space
–$\lambda$ = scale for scoring system
–$S'$ = bitscore = $(\lambda S - \ln K)/\ln 2$
Annotations on the formula: size of database; your score; expected number of random hits
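The relationship between raw score, bit score and Expect value can be sketched in Python as follows. The parameter values λ = 0.267 and K = 0.041 are an assumption: they are the values commonly reported for BLOSUM62 with gap costs 11/1 and are not given on the slide. With them, a raw score of 97 converts to about 42 bits, which agrees with the "42.0 bits (97)" line in the MutL alignment shown later in these slides. The function names and the example search-space sizes are illustrative.

```python
import math

# Assumed Karlin-Altschul parameters (gapped BLOSUM62, gap costs 11/1).
LAMBDA = 0.267
K = 0.041


def bit_score(raw: float, lam: float = LAMBDA, k: float = K) -> float:
    """S' = (lambda * S - ln K) / ln 2"""
    return (lam * raw - math.log(k)) / math.log(2)


def expect_from_raw(raw: float, m: float, n: float,
                    lam: float = LAMBDA, k: float = K) -> float:
    """E = K * m * n * exp(-lambda * S)"""
    return k * m * n * math.exp(-lam * raw)


def expect_from_bits(bits: float, m: float, n: float) -> float:
    """E = m * n * 2^(-S')"""
    return m * n * 2.0 ** (-bits)


print(round(bit_score(97), 1))   # bit score for raw score 97
# Both forms of E agree for any search space m * n (sizes here are made up).
m, n = 100, 1_000_000
print(expect_from_raw(97, m, n), expect_from_bits(bit_score(97), m, n))
```

Because the bit score already absorbs λ and K, the second formula needs only the search-space size, which is why bit scores can be compared across scoring systems.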
WWW BLAST
The BLAST homepage Standard databases Specialized Databases
BLAST Databases: Nucleic Acid
nr (nt)
–Traditional GenBank
–NM_ and XM_ RefSeqs
refseq_rna
refseq_genomic
–NC_ RefSeqs
dbest
–EST Division
est_human, mouse, others
htgs
–HTG division
gss
–GSS division
wgs
–whole genome shotgun
env_nt
–environmental samples
Options for Advanced Blasting: Nucleotide
Example Entrez Queries
–nucleotide all[Filter] NOT mammalia[Organism]
–green plants[Organism]
–biomol mrna[Properties]
–biomol genomic[Properties]
Other Advanced
-W 7 word size
-e expect value
-v 2000 descriptions
-b 2000 alignments
BLAST Databases: Non-redundant protein
nr (non-redundant protein sequences)
–GenBank CDS translations
–NP_ RefSeqs
–Outside Protein: PIR, Swiss-Prot, PRF
–PDB (sequences from structures)
pat
–protein patents
env_nr
–environmental samples
Advanced Options: Filter
all[Filter] NOT mammals[Organism]
gene_in_mitochondrion[Properties]
2003:2005 [Modification Date]
tpa[Filter]
Nucleotide
–biomol_mrna[Properties]
–biomol_genomic[Properties]
Default setting
Hides low complexity for initial word hits only
Masks regions of query in lower case (pre-masked)
BLAST Formatting Page
BLAST Output: Graphic mouse over Sort by taxonomy
BLAST Output: Descriptions link to entrez Sorted by e values 3 X Default e value cutoff 10 Gene Linkout
TaxBLAST: Taxonomy Reports
BLAST Output: Alignments
>gi|127552|sp|P23367|MUTL_ECOLI DNA mismatch repair protein mutL
Length = 615
Score = 42.0 bits (97), Expect = 3e-04
Identities = 26/59 (44%), Positives = 33/59 (55%), Gaps = 9/59 (15%)

Query  9    LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL  58
            L  +  P   L LEI P  VDVNVHP KHEV F     +H+   + +L V QQ +E+ L
Sbjct  280  LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL  338

Annotations: identical match; positive score (conservative); negative substitution; gap
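The Identities and Gaps figures reported for the MutL alignment can be recounted directly from the aligned Query and Sbjct strings. The following Python sketch does so; the function names are illustrative. Positives are not recomputed because that would require the full substitution matrix.

```python
QUERY = "LPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHF-----LHE---ESILEV-QQHIESKL"
SBJCT = "LGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQLETPL"


def alignment_stats(query: str, subject: str) -> tuple:
    """Return (identities, gapped columns, alignment length)."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


def percent(part: int, whole: int) -> int:
    """Percentage rounded to the nearest integer, as in the BLAST report."""
    return round(100 * part / whole)


identities, gaps, length = alignment_stats(QUERY, SBJCT)
print(f"Identities = {identities}/{length} ({percent(identities, length)}%)")
print(f"Gaps = {gaps}/{length} ({percent(gaps, length)}%)")
```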
BLAST Output: Alignments
>gi|730028|sp|P40692|MLH1_HUMAN DNA mismatch repair protein Mlh1 1)
Length = 756
Score = 233 bits (593), Expect = 8e-62
Identities = 117/131 (89%), Positives = 117/131 (89%)

Query: 1    IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  60
            IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL
Sbjct: 276  IETVYAAYLPKNTHPFLYLSLEISPQNVDVNVHPTKHEVHFLHEESILERVQQHIESKLL  335

Query: 61   GSNSSRMYFTQTLLPGLAGPSGEMVKXXXXXXXXXXXXXXDKVYAHQMVRTDSREQKLDA  120
            GSNSSRMYFTQTLLPGLAGPSGEMVK              DKVYAHQMVRTDSREQKLDA
Sbjct: 336  GSNSSRMYFTQTLLPGLAGPSGEMVKSTTSLTSSSTSGSSDKVYAHQMVRTDSREQKLDA  395

Query: 121  FLQPLSKPLSS  131
            FLQPLSKPLSS
Sbjct: 396  FLQPLSKPLSS  406

low complexity sequence filtered
Neighbors: Precomputed BLAST Nucleotide Protein Entrez Related Sequences produces a list of sequences sorted by BLAST score, but with no alignment details.
BLink – Protein BLAST Alignments Lists only 200 hits List is nonredundant
PSI-BLAST Position-Specific Iterated BLAST Mining for protein domains Confirming relationships among related proteins
Position-Specific Scoring Matrix (PSSM)
[Matrix figure: columns A R N D C Q E G H I L K M F P S T W Y V; rows are sequence positions starting at 206: D G V I S S C N G D S G G P L N C Q A]
Serine is scored differently in these two positions.
Active site nucleophile
Position Specific Iterative BLAST: PSI-BLAST Create your own PSSM: Finding protein families based on your own sequence. query BLOSUM62 PSSM Alignment
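The idea that a PSSM scores the same residue differently at different positions can be illustrated with a toy log-odds matrix in Python. This is a simplified sketch, not PSI-BLAST's actual procedure: it assumes a uniform background frequency of 1/20, a flat pseudocount, and no sequence weighting. The four aligned segments are hypothetical, loosely modelled on the G-D-S-G-G-P motif from the PSSM slide, and the function names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(aligned: list, pseudocount: float = 1.0) -> list:
    """Build one {residue: log2-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm: list, segment: str) -> float:
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical alignment: serine is invariant in column 2, occasional in column 4.
toy_alignment = ["GDSGGP", "GDSGGP", "GDSGSP", "GDSGGA"]
pssm = build_pssm(toy_alignment)

print(round(pssm[2]["S"], 2), round(pssm[4]["S"], 2))  # same residue, two scores
print(score_segment(pssm, "GDSGGP") > score_segment(pssm, "AAAAAA"))
```

In each PSI-BLAST iteration the alignment of hits above the inclusion threshold plays the role of `toy_alignment`, and the resulting matrix replaces BLOSUM62 as the scoring system for the next search.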
>gi|113340|sp|P03958|ADA_MOUSE ADENOSINE DEAMINASE (ADENOSINE MAQTPAFNKPKVELHVHLDGAIKPETILYFGKKRGIALPADTVEELRNIIGMDKPLSLPGF VIAGCREAIKRIAYEFVEMKAKEGVVYVEVRYSPHLLANSKVDPMPWNQTEGDVTPDDVVD EQAFGIKVRSILCCMRHQPSWSLEVLELCKKYNQKTVVAMDLAGDETIEGSSLFPGHVEAY RTVHAGEVGSPEVVREAVDILKTERVGHGYHTIEDEALYNRLLKENMHFEVCPWSSYLTGA VRFKNDKANYSLNTDDPLIFKSTLDTDYQMTKKDMGFTEEEFKRLNINAAKSSFLPEEEKK PSI-BLAST e value cutoff for PSSM
RESULTS: Initial BLASTP Same results as protein-protein BLAST
Results of First PSSM Search Other purine nucleotide metabolizing enzymes not found by ordinary BLAST
Third PSSM Search: Convergence Just below threshold, another nucleotide metabolism enzyme Check to add to PSSM
Reverse Position Specific Iterative-BLAST (a.k.a. RPS-BLAST or CDD Search)
A sequence search of the Conserved Domain Database (CDD) containing curated Position-Specific Scoring Matrices
*.... |.... *.... |.... *.... |.... *.... |.... *.... |.... *.... |
consensus 1 KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL 57
1FGI A 1 aWEIPRESLRLEVKLGQGCFGEVWMGTWNG TTRVAIKTLKPGTMS--PEAFL 311
1BYG A 1 RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI 74
gi GWALNMKELKLlqTIGKGEFGDVMLGDYRg NKVAVKCIKNDAt---AQAFL 62
gi KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky sLTVAVKTLKEDTm--eVEEFL 284
gi KWEIPRSELTIlrKLGRGNFGEVFYGKWRn sIDVAVKTLREGTm--sTAAFL 325
PSSM Sources
Source   Provider   PSSMs
Pfam     Sanger     7255
SMART    EMBL       663
COG      NCBI       4873
KOG      NCBI       4825
CD       NCBI       645
Reverse Position Specific Iterative-BLAST (a.k.a. RPS-BLAST or CD Search) Query: sequence Database: PSSMs P03958
Result: TyrKc
Questions: Searching for p53 protein homologs with annotation of CDD. Can you put codon 72 SNP into 3D protein structure?
Other Areas to Cover Genomic Data Annotation Common Domains prediction WWW Other Useful Genome Browsers