Instructor: Kritika Karri


1 Biological Databases
Lecture 9, 2/16/2018
Instructor: Kritika Karri

2 Class Objectives
Why are databases the backbone of bioinformatics?
The basic structure of a database
Data storage versus annotation: RefSeq
Types of databases: GenBank, PubMed, and NCBI
Query strategies
Data quality issues

3 Biologists Collect Lots of Data
Hundreds of thousands of species
Millions of articles in the scientific literature
Genetic information:
  Gene names (thousands)
  Phenotypes of mutants
  Location of genes/mutations on chromosomes
  Linkage (distances between genes)

4 What is a Database?
A collection of data that needs to be:
  Structured
  Searchable
  Updated (periodically)
  Cross-referenced
Challenge: to turn "meaningless" data into useful information that can be accessed and analysed in the best way possible.
For example: how would you organise all biological sequences so that the biological information is optimally accessible?

5 A spreadsheet can be a Database
Columns are fields.
Rows are records.
You can search for a term within just one field, or combine searches across several fields.
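A minimal sketch of the spreadsheet idea, assuming each record is stored as a Python dict whose keys are the field names. The gene names and values are made-up placeholders, and `search` is a hypothetical helper, not part of any database tool.

```python
# Each record (row) is a dict; each key is a field (column).
# The rows below are made-up placeholders, not real database entries.
records = [
    {"gene": "geneA", "organism": "human", "chromosome": "10"},
    {"gene": "geneB", "organism": "mouse", "chromosome": "10"},
    {"gene": "geneC", "organism": "human", "chromosome": "17"},
]


def search(rows, **criteria):
    """Return the rows whose fields equal every given field=value pair."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Search within just one field.
one_field = search(records, organism="human")
# Combine searches across several fields.
two_fields = search(records, organism="human", chromosome="10")

print([r["gene"] for r in one_field])
print([r["gene"] for r in two_fields])
```

Adding a criterion narrows the result set, which is the behaviour the PubMed exercises later rely on.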

6 Database Organisation
Internal organisation controls speed and flexibility.
A unit of programs that store, extract, and modify data.
Flat file databases (flat DBMS): simple, restrictive, table
Hierarchical databases: simple, restrictive, tables
Relational databases (RDBMS): complex, versatile, tables
Object-oriented databases (ODBMS)
Data warehouses and distributed databases
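To make the relational case concrete, here is a small sketch using Python's built-in `sqlite3` module. The two tables, their column names, and all rows are invented for illustration; the point is only that related tables can be stored, extracted, and joined through a shared key.

```python
import sqlite3

# In-memory relational database with two linked tables (toy schema and data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes (gene_id INTEGER PRIMARY KEY, symbol TEXT, chromosome TEXT);
CREATE TABLE articles (article_id INTEGER PRIMARY KEY, title TEXT,
                       gene_id INTEGER REFERENCES genes(gene_id));
INSERT INTO genes VALUES (1, 'geneA', '10'), (2, 'geneB', '17');
INSERT INTO articles VALUES (1, 'Toy article one', 1),
                            (2, 'Toy article two', 1),
                            (3, 'Toy article three', 2);
""")


def articles_for(symbol):
    """Extract article titles linked to a gene symbol via a join."""
    rows = conn.execute(
        "SELECT a.title FROM articles AS a "
        "JOIN genes AS g ON g.gene_id = a.gene_id "
        "WHERE g.symbol = ? ORDER BY a.article_id",
        (symbol,),
    )
    return [title for (title,) in rows]


print(articles_for("geneA"))
```

A flat file would have to repeat the gene details on every article row; the join is what makes the relational layout more versatile, at the cost of a more complex schema.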

7 Where do the data come from?

8 Types of Data
Sequence or structure
Nucleic acid or protein
Important biological information, such as genes and their metabolic pathways, mutations, diseases, drugs, images, etc.

9 Biological Database Architecture

10 Types of Database
Primary databases:
  Original submissions by experimentalists
  Content controlled by the submitter
  Examples: GenBank, Trace, SRA, SNP, GEO
Secondary databases:
  Results of analysis of primary databases
  Aggregate of many databases
  Content controlled by a third party (NCBI)
  Examples: NCBI Protein, RefSeq, TPA, RefSNP, GEO DataSets, UniGene, HomoloGene, Structure, Conserved Domain


12 International Nucleotide Sequence Database Collaboration
National Center for Biotechnology Information (NCBI)
European Nucleotide Archive (ENA)
DNA Data Bank of Japan (DDBJ)

13 Data sharing collaboration
Ensure data consistency
Avoid duplication
Open data sharing

14 Biological Databases I: Biomedical Literature

15 Biological Database I: Biomedical Literature Database
MEDLINE: d/pmresources.html
  NLM journal citation database.
  Includes citations from 5,600 scholarly journals published around the world.
PubMed: ~28 million citations, mainly from:
  MEDLINE-indexed journals
  Journals/manuscripts deposited in PMC
  NCBI Bookshelf

16 PubMed query builder using MeSH terms
MeSH (Medical Subject Headings) is the U.S. National Library of Medicine's controlled vocabulary thesaurus, used for indexing articles for PubMed.
It is arranged in a hierarchical manner called the MeSH Tree Structures.
It is updated annually.

17 PubMed search demo

18 Hands-On Exercise I
Find all articles related to the PTEN gene on PubMed. How many articles did you find?
Modify your search to find PubMed entries for PTEN-related work authored by Hui Liang. How many articles did you find?
Restrict your search to find PTEN-related articles by author Hui Liang in the journal Cell Metabolism. What is the full title of the article? In which year was it published?
Reflection question: What are some advantages of using the MeSH term builder?
More tutorials on building PubMed queries for efficient search:
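The three search steps in this exercise can be sketched as query strings built from PubMed field tags. This is only an illustration: the helpers `pubmed_term` and `combine_and` are hypothetical, the choice of the `Title/Abstract` field for the gene name is an assumption (the exercise does not say which field to search), and the author is written as `Liang H` on the assumption that PubMed indexes authors by last name plus initials. The code only builds strings to paste into the search box; it does not contact PubMed.

```python
def pubmed_term(value, field):
    """Attach a PubMed field tag to a search value, e.g. PTEN[Title/Abstract]."""
    return f"{value}[{field}]"


def combine_and(*terms):
    """Join terms with the Boolean AND operator."""
    return " AND ".join(terms)


gene = pubmed_term("PTEN", "Title/Abstract")            # field choice is an assumption
author = pubmed_term("Liang H", "Author")               # author format is an assumption
journal = pubmed_term('"Cell Metabolism"', "Journal")

step1 = gene                                  # all PTEN-related articles
step2 = combine_and(gene, author)             # narrowed by author
step3 = combine_and(gene, author, journal)    # narrowed by author and journal

for query in (step1, step2, step3):
    print(query)
```

Each added term can only shrink the result set, so the article counts should fall from step 1 to step 3.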

19 Biological Databases II: Genomics and Transcriptomics

20 Biological Database II: Genomics and Transcriptomics
GenBank:
  Flat file
  Nucleotide-only sequence database
  Archival in nature: historical, redundant
  Data: direct submissions (traditional records), batch submissions, FTP accounts (genome data)
  Sample GenBank record (accession number U49845)
NCBI: ENA: DDBJ:

21 GenBank Flat File
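A rough sketch of how a flat file record can be read by a program, assuming the usual layout in which top-level keywords (LOCUS, DEFINITION, ACCESSION, ORIGIN) start in the first column, continuation lines are indented, and `//` ends the record. The record below is a made-up toy, not a real GenBank entry, and `parse_flat_file` is a hypothetical helper; for real work a maintained parser is a better choice.

```python
# Toy record in GenBank-style layout (invented, not a real entry).
TOY_RECORD = """
LOCUS       TOY000001                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record used only to illustrate the
            flat file layout.
ACCESSION   TOY000001
ORIGIN
        1 gatcctccat atacaacggt atct
//
"""


def parse_flat_file(text):
    """Return (fields, sequence) for a single GenBank-style record."""
    fields = {}
    sequence_parts = []
    current = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if not line.strip():               # skip blank lines
            continue
        if not line[0].isspace():          # a new top-level keyword
            current = line[:12].strip()
            fields[current] = line[12:].strip()
        elif current == "ORIGIN":          # sequence lines: drop numbers and spaces
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif current is not None:          # continuation of the previous keyword
            fields[current] = (fields[current] + " " + line.strip()).strip()
    fields.pop("ORIGIN", None)
    return fields, "".join(sequence_parts).upper()


fields, sequence = parse_flat_file(TOY_RECORD)
print(fields["ACCESSION"], len(sequence))
```

Because every field is plain text at a fixed position, the format is easy to read by eye and by script, but relationships between records are not represented, which is the restriction noted for flat files earlier.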

22 Ensembl
Contains all the vertebrate genome DNA sequences currently available in the public domain.
Automated annotation: features are identified in the DNA sequences using different software tools:
  Genes (known or predicted)
  Single nucleotide polymorphisms (SNPs)
  Repeats
  Homologies
Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

23 Nucleic Acid Structure Database
NDB: nucleic acid-containing structures
NTDB: thermodynamic data for nucleic acids
RNABase: RNA-containing structures from PDB and NDB
SCOR: structural classification of RNA (RNA motifs by structure, function and tertiary interactions)

24 Biological Databases III: Proteomics

25 Biological Database III: Proteomics
Protein sequence database:

26 GenPept

27 UniProt
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data.
UniProt is a collaboration between the European Bioinformatics Institute (EMBL-EBI), the SIB Swiss Institute of Bioinformatics and the Protein Information Resource (PIR).
Each entry belongs either to the Swiss-Prot section of UniProtKB (reviewed) or to the computer-annotated TrEMBL section (unreviewed).

28 Protein Structure database: PDB
The Protein Data Bank (PDB) archives information about the 3D shapes of proteins, nucleic acids, and complex assemblies that helps students and researchers understand all aspects of biomedicine and agriculture, from protein synthesis to health and disease.

29 Protein Family Database
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models.

30 Protein-Protein Interaction Database
STRING (Search Tool for the Retrieval of Interacting Genes/Proteins) is a biological database and web resource of known and predicted protein-protein interactions.
Information comes from numerous sources, including experimental data, computational prediction methods and public text collections.
Nodes: network nodes represent proteins.
Edges: edges represent protein-protein associations.
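The node and edge picture can be sketched as an adjacency map. Everything here is a toy: the protein names and scores are invented placeholders, `build_network` is a hypothetical helper, and the per-edge score with a cutoff is an added assumption for illustration rather than something stated on the slide.

```python
from collections import defaultdict

# (protein, protein, association score) triples; all values are made up.
edges = [
    ("PROT_A", "PROT_B", 0.90),
    ("PROT_A", "PROT_C", 0.40),
    ("PROT_B", "PROT_C", 0.70),
    ("PROT_C", "PROT_D", 0.95),
]


def build_network(edge_list, min_score=0.0):
    """Map each protein (node) to the set of its partners (edges are undirected)."""
    network = defaultdict(set)
    for a, b, score in edge_list:
        if score >= min_score:
            network[a].add(b)
            network[b].add(a)
    return network


network = build_network(edges, min_score=0.5)
print(sorted(network["PROT_C"]))   # partners of one protein
print(len(network["PROT_C"]))      # number of interactions for that protein
```

Counting the partners of one node is the same operation as the "number of protein-protein interactions" question in the next exercise, and the answer depends on the score cutoff chosen.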

31 Hands-On Exercise II
Search GenBank or Ensembl for the human PTEN gene.
What chromosome is this gene located on?
Is it a protein-coding gene?
How many transcripts does this gene have? How many transcripts are functional?
Does this gene have any alternative splicing events?
What protein does the PTEN gene code for? How many of those protein entries are reviewed?
How many protein-protein interactions are there for the PTEN gene in humans?
Are there any records of post-translational modification (PTM)?

32 Data vs Annotation Database
RefSeq provides a scientist-curated, non-redundant set of biological sequences (derivative).
Source: GenBank (INSDC)
Annotated: community collaboration, automated computation, NCBI staff curation
Advantages of using RefSeq:
  Non-redundancy
  Updates to reflect current sequence data and biology
  Data validation
  Format consistency
  Distinct accession series

33 Selected RefSeq Accessions
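RefSeq accessions are recognisable by a two-letter prefix followed by an underscore. The sketch below lists a few commonly seen prefixes; it is a partial table for illustration, not the full RefSeq series, and `describe_accession` is a hypothetical helper. The accession strings with zeros are placeholders rather than real records.

```python
# Partial table of RefSeq accession prefixes (illustrative, not exhaustive).
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule (e.g. chromosome)",
    "NG_": "genomic region",
    "NM_": "mRNA",
    "NR_": "non-coding RNA",
    "NP_": "protein",
    "XM_": "predicted (model) mRNA",
    "XR_": "predicted (model) non-coding RNA",
    "XP_": "predicted (model) protein",
}


def describe_accession(accession):
    """Look up the molecule type implied by a RefSeq-style prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "no RefSeq prefix from this table")


# Placeholder accessions (not real records), plus the GenBank sample U49845.
for acc in ("NM_000000.0", "NP_000000.0", "U49845"):
    print(acc, "->", describe_accession(acc))
```

The GenBank sample accession U49845 has no underscore prefix, which shows how the distinct accession series separates curated RefSeq records from archival GenBank ones.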

34 High-Throughput Sequencing Database
Gene Expression Omnibus (GEO) archives and freely distributes high-throughput gene expression data submitted by the scientific community.
NCBI Sequence Read Archive (SRA) archives raw sequencing data and alignment information from high-throughput sequencing platforms. An SRA experiment includes sequence data and metadata regarding how a biological sample was sequenced. Example dataset:
Database of Genotypes and Phenotypes (dbGaP): public repository for individual-level phenotype, exposure, genotype, and sequence data, and the associations between them.
European Genome-phenome Archive: repository for sequence and genotype experiments, including case-control, population, and family studies.
GEO (e.g. GSE37757): submitters supply their gene expression data in four sections:
  Platform: describes the list of features on the array (e.g., cDNAs, oligonucleotides, etc.)
  Sample: describes the biological material and the experimental conditions under which the sample was handled, and the abundance measurement of each feature derived from it.
  Series: defines a set of related Samples that are considered to be part of an experiment.
  Supplementary data: original microarray scan images or raw quantification data.
FTP download: all Platform, Sample and Series records and raw data, with annotation, are available for bulk download via FTP at
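The Platform, Sample, and Series sections fit together as nested records. The sketch below models that relationship with dataclasses; the class layout, field names, accession strings, and measurements are all invented for illustration and do not reproduce GEO's actual file formats.

```python
from dataclasses import dataclass, field


@dataclass
class Platform:
    accession: str
    features: list            # features on the array, e.g. probe identifiers


@dataclass
class Sample:
    accession: str
    description: str          # biological material and handling conditions
    platform: Platform
    abundance: dict           # feature -> measured abundance


@dataclass
class Series:
    accession: str
    samples: list = field(default_factory=list)   # related Samples in one experiment

    def feature_values(self, feature):
        """Collect one feature's abundance across all Samples in the Series."""
        return {s.accession: s.abundance.get(feature) for s in self.samples}


# Toy records; accessions and numbers are placeholders.
array = Platform("PLATFORM_TOY", ["probe_1", "probe_2"])
control = Sample("SAMPLE_TOY_1", "control", array, {"probe_1": 5.0, "probe_2": 1.5})
treated = Sample("SAMPLE_TOY_2", "treated", array, {"probe_1": 9.0, "probe_2": 1.4})
experiment = Series("SERIES_TOY", [control, treated])

print(experiment.feature_values("probe_1"))
```

Keeping the Platform separate from the Samples means the feature list is described once and shared by every Sample measured on that array.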

35 Other Specialised Databases
UCSC Xena:
Genotype-Tissue Expression (GTEx): correlations between genotype and tissue-specific gene expression levels help identify regions of the genome that influence whether and how much a gene is expressed.
miRBase: database of published miRNA sequences and annotation. Each entry represents a predicted hairpin portion of a miRNA transcript (termed mir in the database), with information on the location and sequence of the mature miRNA (termed miR).
PubChem: chemical information with structures, information and links.
DrugBank: combines detailed drug data with comprehensive drug target information.
And many more!

36 Database Retrieval
Problems with the traditional link method:
  Rapidly growing databases with complex and changing relationships
  Rapidly changing interfaces to match the above
  Many people don't know:
    Where to begin
    Where to click on a web page
    Why it might be useful to click there
Entrez GQuery is a retrieval system for searching several linked databases, such as PubMed and GenBank.

37 BLAST
BLAST stands for Basic Local Alignment Search Tool.
Good balance of sensitivity and speed
Reliable
Flexible
Produces local alignments: short, significant stretches of similarity, irrespective of where they are in the sequence
BLAST applies a heuristic approach, so it does not necessarily find the best hit for your search.
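To show what a local alignment score is, the sketch below uses the Smith-Waterman dynamic programming recurrence. This is not BLAST's algorithm: Smith-Waterman is exact and slow, whereas BLAST uses heuristics to approximate this kind of result quickly. The scoring values (match, mismatch, gap) and the two test sequences are arbitrary assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared stretch ACGTACGT sits at different positions in the two sequences.
query = "TTTTACGTACGT"
subject = "GGGGACGTACGTCC"
print(local_alignment_score(query, subject))
```

Resetting negative cells to zero is what makes the alignment local: the unrelated flanking bases do not drag down the score of the shared stretch.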

38 BLAST Output
List of sequences with scores
Raw score: higher is better; depends on aligned length
Expect value (E-value): smaller is better; corrected for sequence length and database size
The Expect value (E) is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially as the Score (S) of the match increases.
Where can I BLAST?
NCBI BLAST web service:
EBI BLAST web service:
FlyBase BLAST: Drosophila and other insects
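The relationship between score and E-value is usually written with the Karlin-Altschul formula, E = K * m * n * exp(-lambda * S), where m is the query length and n is the database length. The sketch below evaluates it with placeholder values for K and lambda; real values depend on the scoring system and are reported by BLAST itself, so the numbers printed here are illustrative only.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); k and lam are placeholder assumptions."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


query_len = 100          # m, length of the query (toy value)
db_len = 1_000_000       # n, total length of the database (toy value)

# Higher scores give exponentially smaller E-values.
for score in (20, 40, 60):
    print(score, expect_value(score, query_len, db_len))

# The same score is less surprising in a database twice as large.
print(expect_value(40, query_len, 2 * db_len) / expect_value(40, query_len, db_len))
```

This is why an E-value is only meaningful alongside the database it was computed against: the same alignment scored against a larger database gets a proportionally larger E-value.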

39 Hands-On Exercise III
This fragment of genomic DNA belongs to part of a gene.
>query 1
CTAAACTACCAAGGCCATCTCTACTTAAAAACAGTTGTCTTTTGTTTGTGATTTCAGGGGCCCTGGGTATAAGCGAAGTCCCTGTTTAGAGACCTTGTGATGGGTTCAAAATATCAAGAAAGATAGCAAAATATCACAAGCCTCCTGACCCGAGAAGATTAGCGTTGAAAGGGTCTGTCGTGTTTGTTTGGGCCTGGGGCTAAATTCCCAGCCCAAGTGCTGAGGCTGATAATAATCGGGGCGGCGATCAGACAGCCCCGGTGTGGGAAATCGTCCGCCCGGTCTCCCTAAGTCCCCGAAGTCGCCTCCCACTTTTGGTGACTGCTTGTTTATTTACATGCAGTCAATGATAGTAAATGGATGCGCGCCAGTATAGGCCGACCCTGAGGGTGGCGGGGTGCTCTTCGCAGCTTCTCTGTGGAGACCGGTCAGCGGGGCGGCGTGGCCGCTCGCGGCGTCTCCCTGGTGGCATCCGCACAGCCCGCCGCGGTCCGGTCCCGCTCCGGGTCAGAATTGGCGGCTGCGGGGACAGCCTTGCGGCTAGGCAGGGGGCGGGCCGCCGCGTGGGTCCGGCAGTCCCTCCTCCCGCCAAGGCGCCGCCCAGACCCGCTCTCCAGCCGGCCCGGCTCGCCACCCTAGACCGCCCCAGCCACCCCTTCCTCCGCCGGCCCGGCCCCCGCTCCTCCCCCGCCGGCCCGGCCCGGCCCCCTCCTTCTCCCCGCCGGCGCTCGCTGCCTCCCCCTCTTCCCTCTTCCCACACCGCCCTCAGCCGCTCCCTCTCGTACGCCCGTCTGAAGAAGAATCGAGCGCGGAACGCATCGATAGCTCTGCCCTCTGCGGCCGCCCGGCCCCGAACTCATCGGTGTGCTCGGAGCTCGATTTTCCTAGGCGGCGGCCGCGGCGGCGGAGGCAGCAGCGGCGGCGGCAGTGGCGGCGGCGAAGGTGGCGGCGGCTCGGCCAGTACTCCCGGCCCCCGCCATTTCGGACTGGGAGCGAGCGCGGCGCAGGCACTGAAGGCGGCGGCGGGGCCAGAGGCTCAGCGGCTCCCAG
Using a BLAST search, determine which gene or genes this query fragment is associated with.


