Published by Luke Williamson. Modified over 9 years ago.
1
Genomics and Personalized Health Care Databases Bailee Ludwig Quality Management
2
Molecular Biology Databases
An excellent means of storing a vast amount of information in a central, sharable location
Biological databases are designed specifically for storing, searching, and retrieving biological data
–Keyword searches
–Cross-referencing
–3D capabilities
3
Database Categories
Categories
–Nucleotide Sequence Databases
  Gene Databases
  Genome Databases
–Protein Sequence Databases
–Structure Databases
–Metabolic and Signaling Pathways
–Human Genes and Diseases
–Microarray Data and other Expression Databases
–…
Each contains specific information
The databases are interrelated
4
Nucleotide & Protein Sequence Databases
5
National Center for Biotechnology Information (NCBI)
Created as part of the National Library of Medicine in 1988 to:
–Establish public databases
–Perform research in computational biology
–Develop software tools for sequence analysis
–Disseminate biomedical information
Databases
–Sequence, such as GenBank, RefSeq, dbSNP
–Literature, such as PubMed, OMIM
Tools
–Entrez, BLAST, Cn3D, etc.
6
NCBI Homepage
7
NCBI Site Map
8
All Databases at NCBI
9
Let’s Check out NCBI
http://www.ncbi.nlm.nih.gov/sites/gquery?itool=toolbar
11
Multiple ways to find Genes…
12
Let’s Look at BRCA1
15
GenBank http://www.ncbi.nlm.nih.gov/Genbank/
16
GenBank
Nucleotide-only sequence database
GenBank data sources
–Direct submission of individual records (BankIt, Sequin)
–Batch submissions via email (EST, GSS, STS)
–FTP accounts established for sequencing centers
Data are shared nightly among three collaborating databases:
–GenBank
–DNA Data Bank of Japan (DDBJ)
–European Molecular Biology Laboratory database (EMBL)
18
GenBank Release 175.0
ftp://ftp.ncbi.nih.gov/genbank/
Full release every two months
Incremental and cumulative updates daily
Release 175.0 (12/15/2009)
–112,910,950 sequences
–110,118,557,163 bases
19
NCBI Reference Sequences
20
GenBank Record (Header)
LOCUS       NM_001963               4913 bp    mRNA    linear   PRI 20-SEP-2009
DEFINITION  Homo sapiens epidermal growth factor (beta-urogastrone) (EGF),
            mRNA.
ACCESSION   NM_001963
VERSION     NM_001963.3  GI:166362727
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 4913)
  AUTHORS   Hosgood,H.D. III, Menashe,I., He,X., Chanock,S. and Lan,Q.
  TITLE     PTEN identified as important risk factor of chronic obstructive
            pulmonary disease
  JOURNAL   Respir Med (2009) In press
   PUBMED   19625176
  REMARK    GeneRIF: Observational study of gene-disease association.
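As a rough illustration of how the LOCUS line of the header can be read programmatically, here is a minimal Python sketch. The `Locus` type and `parse_locus` function are hypothetical names introduced for this example, and splitting on whitespace is an assumption that fits the record shown on this slide; older or unusual records may lay out the line differently.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    """Fields of a GenBank LOCUS line (hypothetical helper type)."""
    name: str
    length_bp: int
    molecule: str
    topology: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Split a LOCUS line into its fields.

    Assumes the eight whitespace-separated tokens seen in the slide's example:
    LOCUS, name, length, 'bp', molecule type, topology, division, date.
    """
    fields = line.split()
    if len(fields) != 8 or fields[0] != "LOCUS" or fields[3] != "bp":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    _, name, length, _, molecule, topology, division, date = fields
    return Locus(name, int(length), molecule, topology, division, date)


if __name__ == "__main__":
    example = "LOCUS       NM_001963               4913 bp    mRNA    linear   PRI 20-SEP-2009"
    print(parse_locus(example))
```

For real work, an established parser such as the one in Biopython is likely a better choice than hand-written splitting.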
21
Summary
22
GenBank Record (Sequence)
ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
      121 gccttgctct gtcacagtga agtcagccag agcagggctg ttaaactctg tgaaatttgt
      181 cataagggtg tcaggtattt cttactggct tccaaagaaa catagataaa gaaatctttc
      241 ctgtggcttc ccttggcagg ctgcattcag aaggtctctc agttgaagaa agagcttgga
      301 ggacaacagc acaacaggag agtaaaagat gccccagggc tgaggcctcc gctcaggcag
      361 ccgcatctgg ggtcaatcat actcaccttg cccgggccat gctccagcaa aatcaagctg
      421 ttttcttttg aaagttcaaa ctcatcaaga ttatgctgct cactcttatc attctgttgc
      481 cagtagtttc aaaatttagt tttgttagtc tctcagcacc gcagcactgg agctgtcctg
      541 aaggtactct cgcaggaaat gggaattcta cttgtgtggg tcctgcaccc ttcttaattt
      601 tctcccatgg aaatagtatc tttaggattg acacagaagg aaccaattat gagcaattgg
      661 tggtggatgc tggtgtctca gtgatcatgg attttcatta taatgagaaa agaatctatt
      721 gggtggattt agaaagacaa cttttgcaaa gagtttttct gaatgggtca aggcaagaga
      781 gagtatgtaa tatagagaaa aatgtttctg gaatggcaat aaattggata aatgaagaag
      841 ttatttggtc aaatcaacag gaaggaatca ttacagtaac agatatgaaa ggaaataatt
      901 cccacattct tttaagtgct ttaaaatatc ctgcaaatgt agcagttgat ccagtagaaa
      961 ggtttatatt ttggtcttca gaggtggctg gaagccttta tagagcagat ctcgatggtg
23
FASTA: Sequence Format
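To make the relationship between the two formats concrete, the sketch below turns the ORIGIN block of a GenBank record into a FASTA record: a `>` header line followed by the bare sequence. The function name `origin_to_fasta`, the 70-character line width, and the choice to upper-case the bases are assumptions made for illustration, not something the slides prescribe.

```python
import re


def origin_to_fasta(identifier: str, description: str,
                    origin_block: str, width: int = 70) -> str:
    """Build a FASTA record from the ORIGIN section of a GenBank flat file."""
    bases = []
    for line in origin_block.splitlines():
        stripped = line.strip()
        # Skip blank lines, the ORIGIN keyword, and the '//' record terminator.
        if not stripped or stripped.startswith("ORIGIN") or stripped == "//":
            continue
        # Drop the leading position number and the spaces between 10-base groups.
        bases.append(re.sub(r"[^A-Za-z]", "", stripped))
    sequence = "".join(bases).upper()
    header = f">{identifier} {description}".rstrip()
    # Break the sequence into fixed-width lines.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *lines])


if __name__ == "__main__":
    block = """ORIGIN
        1 aaaaagagaa actgttggga gaggaatcgt atctccatat ttcttctttc agccccaatc
       61 caagggttgt agctggaact ttccatcagt tcttcctttc tttttcctct ctaagccttt
//"""
    print(origin_to_fasta("NM_001963.3", "Homo sapiens EGF, mRNA", block))
```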
24
Sequence Viewer Graphics
25
RefSeq
26
RefSeq
Database of reference sequences
–http://www.ncbi.nlm.nih.gov/RefSeq/
Curated
–Many experimentally validated
–Some partially validated via ESTs
–Some computationally predicted
Non-redundant: one record for each gene, or each splice variant, from each organism represented
27
Accession Numbers
DNA sequences and other molecular data are tagged with accession numbers, which identify a sequence or other record
RefSeq provides expertly curated accession numbers that correspond to the most stable, agreed-upon “reference” version of a sequence
RefSeq identifiers include the following formats:
–Complete chromosome: NC_######
–Genomic contig: NT_######
–mRNA (DNA format): NM_######
–Protein: NP_######
28
Accession Numbers: More Examples
Accession         Molecule  Description
AC_123456         Genomic   Alternate complete genomic
AP_123456         Protein   Protein products; alternate
NG_123456         Genomic   Incomplete genomic regions
NR_123456         RNA       Non-coding transcripts
NW_123456         Genomic   Genomic assemblies
NZ_ABCD12345678   Genomic   Whole genome shotgun data
XM_123456         mRNA      Transcript products
XP_123456         Protein   Protein products
XR_123456         RNA       Transcript products
YP_123456         Protein   Protein products
ZP_12345678       Protein   Protein products
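Because the two-letter prefix encodes the kind of record, an accession can be classified without querying any database. The following is a minimal sketch of that lookup: the `REFSEQ_PREFIXES` table is copied from the two accession slides (the "Genomic" label for NC_ and NT_ is an inference from their descriptions), and the name `describe_refseq` and the regular expression are assumptions for this example.

```python
import re

# Prefix -> (molecule type, description), taken from the accession slides.
REFSEQ_PREFIXES = {
    "NC": ("Genomic", "Complete chromosome"),
    "NT": ("Genomic", "Genomic contig"),
    "NM": ("mRNA", "mRNA (DNA format)"),
    "NP": ("Protein", "Protein"),
    "AC": ("Genomic", "Alternate complete genomic"),
    "AP": ("Protein", "Protein products; alternate"),
    "NG": ("Genomic", "Incomplete genomic regions"),
    "NR": ("RNA", "Non-coding transcripts"),
    "NW": ("Genomic", "Genomic assemblies"),
    "NZ": ("Genomic", "Whole genome shotgun data"),
    "XM": ("mRNA", "Transcript products"),
    "XP": ("Protein", "Protein products"),
    "XR": ("RNA", "Transcript products"),
    "YP": ("Protein", "Protein products"),
    "ZP": ("Protein", "Protein products"),
}

# Two letters, underscore, optional letters (as in NZ_ABCD...), digits, optional ".version".
ACCESSION_PATTERN = re.compile(r"^([A-Z]{2})_[A-Z]*\d+(?:\.(\d+))?$")


def describe_refseq(accession: str):
    """Return (prefix, molecule, description, version or None) for a RefSeq accession."""
    match = ACCESSION_PATTERN.match(accession.strip())
    if match is None or match.group(1) not in REFSEQ_PREFIXES:
        raise ValueError(f"not a recognized RefSeq accession: {accession!r}")
    prefix, version = match.group(1), match.group(2)
    molecule, description = REFSEQ_PREFIXES[prefix]
    return prefix, molecule, description, int(version) if version else None


if __name__ == "__main__":
    print(describe_refseq("NM_001963.3"))
```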
29
EST
30
EST
The Expressed Sequence Tags database (dbEST) is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or "Expressed Sequence Tags", from a number of organisms.
http://www.ncbi.nlm.nih.gov/sites/entrez?db=nucest&cmd=search&term=
31
EST
mRNA: product of genomic regions actively transcribed in the cell
cDNA (complementary DNA)
–Copy of mRNA, made using the mRNA as a template
–Sequence is complementary to the mRNA
EST: Expressed Sequence Tag (a short subsequence of a transcribed cDNA sequence)
–Partial cDNA sequence
–Can be from the 5’ or 3’ end
–Typical size: 200 to 500 bp
–Represents mRNA actively transcribed in the cell
–Used to identify genes, alternative splicing, etc.
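The "complementary" relationship between mRNA and cDNA can be sketched in a few lines of Python. The function name `first_strand_cdna` is hypothetical; the sketch assumes the mRNA is written 5’ to 3’ with the letters A, C, G, U and returns the first-strand cDNA, also written 5’ to 3’, which is why the complemented string is reversed.

```python
# Map each RNA base to its complementary DNA base.
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")


def first_strand_cdna(mrna: str) -> str:
    """Return the first-strand cDNA (5'->3') for an mRNA given 5'->3'."""
    complement = mrna.upper().translate(RNA_TO_CDNA)
    # The strands are antiparallel, so reverse to read the cDNA 5'->3'.
    return complement[::-1]


if __name__ == "__main__":
    print(first_strand_cdna("AUGGCC"))
```

Note that database records such as the NM_ entries store mRNA sequences in DNA letters (t instead of u), so this conversion is mainly useful for reasoning about the biology rather than for processing downloaded records.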
32
Access to dbEST Data
EST sequences are included in the EST division of GenBank, available from NCBI by anonymous FTP and through Entrez
The nucleotide sequences may be searched using the BLAST server
–The TBLASTN program, which takes an amino acid query sequence and compares it with six-frame translations of dbEST DNA sequences, is particularly useful
EST sequences are also available as a flat file in FASTA format by anonymous FTP in the /repository/dbEST directory at ftp.ncbi.nih.gov
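The phrase "six-frame translations" may be easier to follow with a small example. The sketch below translates a DNA sequence in three reading frames on the given strand and three on its reverse complement, using the standard genetic code. It only illustrates the idea and is not how TBLASTN itself is implemented; the names `translate` and `six_frame_translations`, the frame labels such as `+1`, and the use of `X` for codons containing ambiguous bases are assumptions of this example.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; '*' marks a stop codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Return translations keyed by frame: +1, +2, +3, -1, -2, -3."""
    forward = dna.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, sequence in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(sequence[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

Translating all six frames matters for ESTs because a single-pass read may come from either strand and its reading frame is not known in advance.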
33
UniGene
34
UniGene
www.ncbi.nlm.nih.gov/UniGene
Each UniGene entry is a set of transcript sequences that appear to come from the same transcription locus (gene or expressed pseudogene)
In addition to sequences of well-characterized genes, hundreds of thousands of novel expressed sequence tag (EST) sequences have been included
UniGene may be useful as a resource for gene discovery
UniGene has also been used by experimentalists to select reagents for gene mapping projects and large-scale expression analysis
35
Numbers of UniGene Entries
Bos taurus (cow)                   42,843
Canis lupus familiaris (dog)       27,853
Equus caballus (horse)              8,348
Homo sapiens (human)              123,396
Mus musculus (mouse)               78,289
Ovis aries (sheep)                 18,814
Rattus norvegicus (Norway rat)     63,434
Sus scrofa (pig)                   51,576
Danio rerio (zebrafish)            51,481
36
UniGene
UniGene is a useful tool for looking up information about expressed genes
UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression
37
Protein Structure
39
Now… let’s give these databases a closer look with a lab