Biology 4900 Biocomputing. Chapter 2 Molecular Databases and Data Analysis.



Literature Databases
– Online databases available at CSU: Galileo, JSTOR
– Online databases at other sites: PubMed. If you find a useful article, you can check PubMed Central to see if it is available online for free.
– Where to get articles: PubMed Central, GIL, Interlibrary loan

Sources of Molecular Data (diagram): DNA → RNA → protein → phenotype. Data sources at each level: genomic DNA databases (DNA); cDNA, ESTs*, and UniGene (RNA); protein sequence databases (protein). *ESTs = Expressed Sequence Tags

Molecular Databases
Primary databases
– Archival: sequences submitted directly from experimental sequencing results
– Very little interpretation
– Anyone can submit; accuracy not checked
– Examples, nucleic acid: EMBL, DDBJ, GenBank
– Examples, protein: Swiss-Prot, PIR, PDB
Secondary databases
– Curated: sequences are validated/checked and may be annotated
– RefSeq (nucleic acids and proteins, but limited to certain organisms)
– TrEMBL, GenPept, UniProt

Nucleic Acid Databases
Contain:
– Nucleic acid sequences
  Chain termination method (Sanger sequencing)
  – Used for sequences [...] bp
  Whole Genome Shotgun (WGS) sequencing
  – Used for sequences >1000 bp
  – DNA is chopped into little chunks
  – Chunks are sequenced using the chain termination method (reads)
  – Numerous overlapping reads are collected and assembled into a sequence (computational methods)
– Annotations for each sequence
  – Putative identification of open reading frames (ORFs = parts of a gene that encode protein)
  – Putative intron (excised) / exon (retained) locations
  – Authors, dates, publication, etc.
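The assembly step in shotgun sequencing can be illustrated with a toy greedy overlap merge. This is only a minimal sketch: the function names, the `min_len` threshold, and the three short reads are made up for illustration, and real assemblers also handle sequencing errors, reverse complements, and repeats, which this code ignores.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when the best overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap.

    Stops when one contig is left or no pair overlaps by min_len bases.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # remaining contigs cannot be joined
        merged = contigs[best_i] + contigs[best_j][best_n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Hypothetical reads, chosen so that each overlaps the next by 3 bases.
toy_reads = ["ATGCGT", "CGTACC", "ACCTTA"]
print(greedy_assemble(toy_reads))
```

If two reads do not overlap by at least `min_len` bases they stay as separate contigs, which is why real projects collect many overlapping reads.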

International Nucleotide Sequence Database Collaboration (public nucleotide and protein sequence databases; the members share information daily)
– GenBank. Location: National Institutes of Health, National Center for Biotechnology Information
– European Molecular Biology Laboratory (EMBL). Location: European Bioinformatics Institute (EBI)
– DNA Database of Japan (DDBJ). Location: National Institute of Genetics, Mishima

GenBank
As of April 2011, there were approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions.
Most sequenced organisms:
– Homo sapiens: 14.9 billion bases
– Mus musculus: 8.9b
– Rattus norvegicus: 6.5b
– Bos taurus: 5.4b
– Zea mays: 5.0b
– Sus scrofa: 4.8b
– Danio rerio: 3.1b
– Strongylocentrotus purpuratus: 1.4b
– Oryza sativa (japonica): 1.2b
– Nicotiana tabacum: 1.2b

GenBank Home Page

NCBI Resources PubMed BLAST OMIM Taxonomy Browser Structure

NCBI key features: PubMed
– National Library of Medicine's search service
– 21 million citations from MEDLINE & others (as of 2011)
– Links to other online journals
– Starting point for most research

Literature Searches through PubMed

Use the pull-down menu to access related resources such as Medical Subject Headings (MeSH)

A “how to” pull-down menu links to tutorials

Use “Advanced search” to limit by author, year, language, etc.

PubMed search strategies
– Try the tutorial
– Use Boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
– Try using limits (see Advanced search)
– There are links to find Entrez entries and external resources
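The same Boolean queries can be sent to PubMed programmatically through NCBI's E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact the server; the helper name `pubmed_search_url` and the `retmax` default are assumptions made for this example, not part of the slides.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(terms, operator="AND", retmax=20):
    """Join search terms with a capitalized Boolean operator and build an ESearch URL."""
    op = operator.upper()
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    query = f" {op} ".join(terms)
    params = {"db": "pubmed", "term": query, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH}?{urlencode(params)}"


print(pubmed_search_url(["lipocalin", "disease"], "AND"))
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching PubMed identifiers; result counts will differ from the 2011 figures quoted in these slides.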

lipocalin AND disease (504 results)
lipocalin OR disease (2,500,000 results)
lipocalin NOT disease (2,370 results)
Venn diagrams: 1 AND 2; 1 OR 2; 1 NOT 2
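The three Boolean operators behave like set operations on the lists of matching articles. A minimal sketch with made-up article IDs (the numbers below are hypothetical, not real PubMed identifiers):

```python
# Hypothetical article IDs matching each search term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease       # lipocalin AND disease
either = lipocalin | disease     # lipocalin OR disease
only_first = lipocalin - disease  # lipocalin NOT disease

print(sorted(both), sorted(either), sorted(only_first))

# Every lipocalin article is either also about disease or not,
# so the AND and NOT result counts add up to the total for "lipocalin".
assert len(both) + len(only_first) == len(lipocalin)
```

Applied to the counts on the slide, the same identity gives 504 + 2,370 = 2,874 articles mentioning lipocalin in total.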

Save Searches, Save Results, Get Papers

PubMed Author Search

Google Scholar Search: includes references that may not be found in PubMed

NCBI key features
– A search from the NCBI main page will search: the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes
– String search
– Search by author, date, keyword, publication, etc.
Classroom exercise: author searches, paper searches, protein searches

NCBI key features: BLAST
– BLAST is the Basic Local Alignment Search Tool
– NCBI's sequence similarity search tool
– Supports analysis of DNA and protein databases
– Image: 3CLN

NCBI key features: OMIM
– Online Mendelian Inheritance in Man
– Catalog of human genes and genetic disorders

NCBI key features: Taxonomy Browser
– Browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses)
– Taxonomy information such as genetic codes
– Molecular data on extinct organisms
– Useful to find a protein or gene from a species

NCBI key features: Structure
– Molecular Modeling Database (MMDB): biopolymer structures obtained from the Protein Data Bank (PDB)
– Cn3D (a 3D-structure viewer)
– Vector Alignment Search Tool (VAST)

Cn3D
– A 3D-structure viewer
– Must download (ftp://ftp.ncbi.nlm.nih.gov/cn3d/Cn3D-4.3.msi)
– Use to align structures identified as similar by VAST

Example: Researching beta globin
Beta globin is a protein, so it will be found in 3 different types of databases:
– DNA: GenBank dbGSS, GenBank dbHTGS, GenBank dbSTS, GenBank, Entrez Gene
– RNA*: GenBank dbEST, UniGene, Gene Expression Omnibus
– Proteins: Entrez Protein, UniProt, PDB, SCOP, CATH
*Because RNA is unstable, it is reverse-transcribed into complementary DNA (cDNA)

Necessary (yet annoying) Definitions
Sequence Tagged Site (STS): small DNA fragments with both DNA sequence data and mapping data (genes assigned to chromosomes)
Expressed Sequence Tag (EST): partial DNA sequence of a complementary DNA (cDNA) clone
– Typically these are randomly selected cDNA clones sequenced on a single strand ([...] bp)
– Useful for identifying novel genes
– Higher rate of error

UniGene
Unique Gene (UniGene): project to create gene-oriented clusters by partitioning ESTs into non-redundant sets
– Ultimately there should be only 1 cluster per gene
– Usually there is more than 1 due to errors
– Types of errors:
  2 or more clusters may represent different parts of the same gene
  Sequence errors
  Cloning artifacts (DNA transcribed during creation of cDNA that doesn't correspond to an authentic transcript)
Diagram: ESTs grouped into a UniGene cluster
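Partitioning ESTs into non-redundant sets can be pictured as grouping sequences that share a stretch of identical bases. The sketch below is a toy illustration and not the procedure UniGene used: the shared k-mer rule, the function names, and the three short sequences are all assumptions made for this example, and it ignores reverse complements and sequencing errors.

```python
from collections import defaultdict


def kmers(seq, k):
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8):
    """Group named sequences into clusters; two ESTs sharing any k-mer are linked."""
    parent = {name: name for name in ests}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # k-mer -> name of the first EST containing it
    for name, seq in ests.items():
        for kmer in kmers(seq, k):
            if kmer in first_seen:
                union(first_seen[kmer], name)
            else:
                first_seen[kmer] = name

    clusters = defaultdict(set)
    for name in ests:
        clusters[find(name)].add(name)
    return sorted(clusters.values(), key=sorted)


# Hypothetical ESTs: est1 and est2 overlap, est3 is unrelated.
toy_ests = {
    "est1": "ATGGCCAAGTTCGA",
    "est2": "CAAGTTCGATTGCC",
    "est3": "GGGTTTAAACCCGGGTTT",
}
print(cluster_ests(toy_ests))
```

Two ESTs from opposite ends of the same gene share no k-mer and would land in separate clusters, which mirrors the first type of error listed on the slide.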

GenBank Flatfile
A format for organizing genomic sequence data. Includes the sequence and annotations, arranged as follows:
Header
– Locus name or accession number: unique to sequence description
– Size: number of nucleotide bases or amino acid residues
– Molecule: DNA, RNA, strandedness (ds, ss), and type of RNA or DNA
– GenBank division code: 18 divisions (PRI = primate, PLN = plant, BCT = bacterial, etc.)
– Date of last modification
Definition line: brief description of sequence (e.g. source organism, protein/gene name, function)
Accession: unique identifier for a record (may be more than one accession)
Version
– Record modification (accession.1; accession.2)
– GI: is specific to version; may be more than one
Keywords
Source: organism or clone description
Reference: publications that discuss data reported
– Authors and journal publication info
– PubMed identifier: link to sequence record (abstract)
Features: vary (chromosomal info, coding info, protein id, % of each nucleotide)
Sequence data
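The header fields listed above sit on the first line of a flatfile, which begins with the keyword LOCUS. The sketch below splits such a line into its fields, assuming the common eight-token nucleotide layout; the `Locus` class, the function names, and the sample line are hypothetical and do not come from a real record. Protein records and older layouts differ and are rejected here.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    name: str
    length: int
    unit: str       # "bp" for nucleotide records
    molecule: str   # e.g. DNA, mRNA, possibly prefixed with ss- or ds-
    topology: str   # linear or circular
    division: str   # three-letter division code such as PRI or PLN
    date: str       # date of last modification


def parse_locus(line):
    """Split a nucleotide LOCUS line into its whitespace-separated fields."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS":
        raise ValueError("expected an 8-field nucleotide LOCUS line")
    _, name, length, unit, molecule, topology, division, date = tokens
    return Locus(name, int(length), unit, molecule, topology, division, date)


def split_version(identifier):
    """Split 'accession.version' into (accession, version number or None)."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


# Made-up example line, for illustration only.
sample = "LOCUS       EXAMPLE01               1200 bp    mRNA    linear   PRI 01-JAN-2000"
print(parse_locus(sample))
print(split_version("EXAMPLE01.2"))
```

For real work, an established parser such as the one in Biopython is a safer choice than hand-splitting lines.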

What is an accession number?
An accession number is a label that is used to identify a sequence. It is a (unique) string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
– X02775: GenBank genomic DNA sequence
– NT_030059: genomic contig (overlapping DNA fragments)
– rs[...]: dbSNP (single nucleotide polymorphism)
RNA
– N[...]: an expressed sequence tag (1 of 170)
– NM_006744: RefSeq DNA sequence (from a transcript)
Protein
– NP_006735: RefSeq protein
– AAC02945: GenBank protein
– Q28369: SwissProt protein
– 1KT7: Protein Data Bank structure record

NCBI’s important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence.
RefSeq identifiers include the following formats:
– Complete genome: NC_######
– Complete chromosome: NC_######
– Genomic contig: NT_######
– mRNA (DNA format): NM_######, e.g. NM_[...]
– Protein: NP_######, e.g. NP_006735
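Because the RefSeq prefix encodes the record type, an identifier can be classified with a simple pattern match. This is a minimal sketch covering only the four prefixes listed above; the function name and the lookup table are made up for this example, and RefSeq defines further prefixes that are not handled.

```python
import re

# Only the prefixes named in the slide; other RefSeq prefixes exist.
REFSEQ_TYPES = {
    "NC": "complete genome or chromosome",
    "NT": "genomic contig",
    "NM": "mRNA",
    "NP": "protein",
}

# Two-letter prefix, underscore, digits, and an optional ".version" suffix.
REFSEQ_PATTERN = re.compile(r"^(N[CTMP])_(\d+)(?:\.(\d+))?$")


def classify_refseq(accession):
    """Return the record type for a RefSeq accession, or None if it does not match."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None
    return REFSEQ_TYPES[match.group(1)]


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, "->", classify_refseq(acc))
```

X02775 is a GenBank accession rather than a RefSeq one, so the function returns None for it.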

UniGene Name Search: Oncomodulin
– All results listed
– Allows filtering

UniGene Name Search: Select Human Oncomodulin
– 4 Expressed Sequence Tags from 1 complementary DNA library
– Identifies chromosome and map position on chromosome
– Compares cluster transcripts with RefSeq proteins

UniGene Name Search: Select Human Oncomodulin
– Click on link for menu of other links: Conserved domains, Gene summary, Protein sequence
– Clicking on the Protein sequence link then takes you to the predicted protein sequence file (NP_[...])

UniGene Name Search: Select Human Oncomodulin
Once here, you can:
1. Open FASTA file
2. Run BLAST
3. Identify and view conserved domains
4. See related proteins

Access to sequences: Gene at NCBI
Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
Example: RefSeq provides a curated, optimal accession number for each DNA (NM_[...] for beta globin DNA corresponding to mRNA) or protein (NP_000509). These reference sequences should be more reliable.

Gene Name Search: Oncomodulin
– Returns a list of gene entries for oncomodulin for different organisms
– Click on a highlighted link to see details

Gene Name Search: Select Human Oncomodulin
Summary of all gene information, including mapping (when available). Note that this sequence has been validated as a RefSeq. Scrolling down, you can find a link to protein data through UniProt.

Gene Name Search: Link to Oncomodulin Protein

Protein Name Search: Oncomodulin Notice that I filtered this search so that results show only human oncomodulin

You can change the display (as shown)…

FASTA format: versatile, compact with one header line followed by a string of nucleotides or amino acids in the single letter code
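Because the format is so simple, a FASTA file can be read with a few lines of code. The sketch below is one minimal way to do it; the function name and the two sample records are hypothetical, and real files may need extra care (for example, lowercase masking or very long sequences).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only.
sample_fasta = """>seq1 hypothetical peptide
MKTAYIAK
QRQISFVK
>seq2 hypothetical DNA
ACGT
"""
for name, seq in parse_fasta(sample_fasta):
    print(name, len(seq))
```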

Comparison of Gene to other resources
– Gene: collects key information on each gene/protein from major databases. It covers all major organisms.
– UniGene: database with information on where in a body, when in development, and how abundantly a transcript is expressed
– HomoloGene: gathers information on sets of related proteins based on common genetic ancestry

HomoloGene Name Search: Oncomodulin
Provides a list of homologous (related) genes

HomoloGene Name Search: Oncomodulin
Shows conserved domains of protein sequences. If you click on the graphic, it takes you to a summary of domain/family information.

ExPASy to access protein and DNA sequences
– ExPASy (Expert Protein Analysis System) is a sequence retrieval system
– Similar to Entrez for NCBI
– Example: search for calmodulin
– Related resource: Prosite

UniProt: a centralized protein database (uniprot.org)
It is separate from NCBI but interlinked with it.

UniProt: Calmodulin Search Results for bovine calmodulin (P62157)

Protein Secondary Structure: PDBsum (EMBL-EBI)
Either enter a PDB file or load a new/existing sequence

ExPASy: vast proteomics resources

Genome Browsers Genomic DNA is organized in chromosomes. Genome browsers display ideograms (pictures) of chromosomes, with user-selected “annotation tracks” that display many kinds of information. The two most essential human genome browsers are at Ensembl and UCSC. We will focus on UCSC (but the two are equally important). The browser at NCBI is not commonly used.

Ensembl genome browser: click human

Ensembl genome browser: enter beta globin

Ensembl output for beta globin includes views of chromosome 11 (top), the region (middle), and a detailed view (bottom). There are various horizontal annotation tracks.

The UCSC Genome Browser
– This browser's focus is on humans and other eukaryotes
– You can select which tracks to display (and how much information for each track)
– Tracks are based on data generated by the UCSC team and by the broad research community
– You can create "custom tracks" of your own data! Just format a spreadsheet properly and upload it
– The Table Browser is as important as the more visual Genome Browser, and you can move between the two
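One common layout for a custom track is BED: a `track` line followed by one tab-separated row per feature (chromosome, start, end, name), with 0-based start positions and an end position that is not included. The sketch below writes such text; the function name, track name, and the coordinates are hypothetical and chosen only for illustration, so they do not mark real genes.

```python
def bed_track(name, description, features):
    """Build BED-format text for a custom track.

    features is a list of (chrom, start, end, label) tuples,
    with 0-based start and exclusive end.
    """
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


# Made-up intervals on chromosome 11, for illustration only.
demo = bed_track(
    "demo",
    "Hypothetical example track",
    [("chr11", 1000, 2000, "featureA"), ("chr11", 3000, 3500, "featureB")],
)
print(demo)
```

The resulting text can be saved to a file, or exported from a spreadsheet with the same columns, and then uploaded through the browser's custom track page.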

[1] Visit the UCSC site and click Genome Browser [2] Choose an organism, enter a query (beta globin), and hit submit

[4] On the UCSC Genome Browser: --choose which tracks to display

Protein Databases
What do they contain?
– Amino acid sequences
  Primary sequence: direct submissions from protein sequencing (SWISS-PROT, PIR)
  Secondary sequence: translations, i.e. putative proteins resulting from modifying (e.g. intron splicing) a nucleic acid sequence (GenPept, TrEMBL)
– Structure: Protein Data Bank
– Annotations: function, domains, etc.

SWISS-PROT
– Created by Amos Bairoch in 1986 at the Department of Medical Biochemistry in Geneva
– Maintained by the Swiss Institute of Bioinformatics (SIB) and funded by GeneBio
– Few redundancies
– Direct submission (from sequencing, not translation)
PIR
– PIR (The Protein Information Resource) was created by M.O. Dayhoff in 1965
– Maintained by many
– In 2004, joined with other databases (Swiss-Prot and TrEMBL) to become part of the UniProt consortium

Protein Data Bank
– Archive of 3-D structural data of biological macromolecules
– Based on experimental data
– Managed by the Research Collaboratory for Structural Bioinformatics (RCSB): Rutgers & UCSD
– As of January 11, 2012, contained [...] structures; ~5000 membrane proteins

PDB: Source of protein sequence and structure data