Introduction to Bioinformatics and Biological databases Nicky Mulder:



What is Bioinformatics?
- The application of computer technology to the management of biological information.
- Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics.
- Doing biology on a computer (computational biology).

Why is Bioinformatics needed?
- Small- and large-scale biological analyses
- New laboratory technologies, e.g. sequencing
- Move away from single gene to whole genome
- Collection and storage of biological information
- Manipulation of biological information
- Computers are capable of both storage and manipulation, and are cheap

Hypothesis-driven bioinformatics
[Flow diagram starting from a gene of interest. Steps shown: search PubMed for more information and retrieve articles; retrieve the protein sequence; retrieve the DNA sequence; analysing the DNA sequence; sequence similarity search; sequence alignments and finding motifs; finding domains and classifying proteins; phylogenetics; looking at whole genomes. Topic areas shown: biological databases, sequence analysis, phylogenetics, genomics.]

Hypothesis-generating bioinformatics
[Flow diagram starting from a high-throughput experiment (microarray, proteomics, NGS). Steps shown: experimental design; data processing; statistics; gene lists; gene set enrichment; pathway analysis; data mining; data integration; systems biology. Topic areas shown: proteomics, NGS, statistics, systems biology.]

Two major components to Bioinformatics
- Storing and retrieving data:
  - Biological databases
  - Querying these to retrieve data
- Manipulating the data with tools, e.g.:
  - Sequence similarity searches
  - Protein families and function prediction
  - Comparing sequences (phylogenetics)

What is a database?
- An organized body of related information
- A data collection that is:
  - Structured (computer readable)
  - Searchable
  - Updatable
  - Cross-linked
  - Publicly available

Biological Databases
Where do you go to find:
- A video -> YouTube
- Info on S. Hawking -> Wikipedia
- A book -> Amazon
- A friend -> Facebook
- DNA sequence -> EMBL
- Protein sequence -> UniProtKB, RefSeq…
Biological databases:
- Order data and make it available to the public
- Turn data into computer-readable form
- Provide the ability to retrieve data from various sources
Databases can be primary (archival) or secondary (curated).

Categories of Databases for Life Sciences
- Sequences (DNA, protein)
- Genomics
- Mutation
- Protein domain/family
- Proteomics
- 3D structure
- Metabolism
- Bibliography
- Protein interaction


Why do we need sequence DBs?
- >5000 genomes sequenced (single organism, varying sizes, including viruses)
- Thousands of ongoing genome sequencing projects
- cDNA sequencing projects (ESTs or cDNAs)
- Metagenome sequencing projects (~200): environmental samples containing multiple ‘unknown’ organisms (microbiome)
- Personal human genomes
- Cost of sequencing is coming down, making it an alternative to other technologies

Sequence databases
- Used for retrieving a known gene/protein sequence
- Useful for finding information on a gene/protein
- Can find out how many genes are available for a given organism
- Can compare your sequence to the others in the database
- Can submit your sequence to store with the rest
- Main databases: nucleotide and protein sequence DBs
- Should be interconnected with other databases

Protein centric view of database network
[Diagram linking: DNA sequences, gene annotation, gene expression data, protein sequences, macromolecular structure data]

Nucleotide sequence databases
- EMBL, DDBJ, GenBank
- Data submitted by sequence owner
- Must provide certain information, and the CDS if applicable
- No additional annotation added
- Entries never merged, so there is some redundancy
[Diagram labels: promoter, exons, CDS (coding sequence)]

[Figure: example entry with labelled parts: accession number, taxonomy, references, cross-references, features]

[Figure: example entry with labelled parts: annotation (predicted or experimentally determined), sequence, CDS (CoDing Sequence, proposed by submitters)]

Feature lines in EMBL entries
- Describe features on a sequence that are important for function, replication, recombination, structure etc.
- Feature key: functional group, e.g. CDS (protein-coding sequence), ribosome binding site
- Location: how to find the feature
- Qualifier: additional info

Summary of information in EMBL entries
- Provides the taxonomy of the organism from which the sequence came
- Provides information on submitters and references
- Describes features on a sequence that are important for function, replication, recombination, structure etc.
- Shows if the DNA encodes a protein (CDS) and provides the protein sequence
- Provides the actual nucleotide sequence
- Describes the sequence type, e.g. genomic DNA, RNA, EST

CDS: mRNA versus genomic sequence
[Figure: alignment of an mRNA-derived contig against the corresponding genomic sequence. Matching positions are marked with asterisks. The contig aligns to the genomic sequence across the exons and is absent across the introns; the start codon (ATG) and stop codon (TAA) are highlighted. Labels: exon, intron.]

Other nucleotide databases
- RefSeq
- dbEST
- Short read archive
- Trace archive
- WGS collections

Protein sequences
[Diagram: DNA -> RNA -> Protein, with modification labels S S and Ac on the protein]
- Protein cleavage
- Protein modification
- Transported to organelle or membrane
- Folded into secondary or tertiary structure
- Performs a specific function
All this information needs to be captured in a database.

Protein Sequence Databases
- UniProt:
  - Swiss-Prot: manually curated, distinguishes between experimental and computationally derived annotation
  - TrEMBL: automatic translation of EMBL, no manual curation, some automatic annotation
- GenPept: GenBank translations
- RefSeq: non-redundant sequences for certain organisms
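The idea of automatically translating a nucleotide coding sequence into a protein sequence can be sketched in a few lines. This is only an illustration, assuming the standard genetic code; the function name `translate_cds` and the example sequence are made up here and are not taken from any database pipeline.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(dna: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon.

    Codons containing ambiguous bases (e.g. N) are translated as 'X'.
    """
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Illustrative sequence: start codon ATG, five more codons, stop codon TAA.
print(translate_cds("ATGAAAGGTCGAAACCTGTAA"))  # MKGRNL
```

A real translation step would also need the correct reading frame and the organism's translation table, both of which are assumed here.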

A UniProtKB/Swiss-Prot entry
Protein existence levels:
1: Evidence at protein level
2: Evidence at transcript level
3: Inferred from homology
4: Predicted
5: Uncertain

Swiss-Prot annotation is mainly found in:
- Comment (CC) lines: function, pathway, cofactor, regulation, disease, subcellular location
- Feature table (FT): features on the sequence, e.g. domain, active site
- Keyword (KW) lines: set of a few hundred controlled vocabulary terms
- Description (DE) lines: protein name/function
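Because each line of a Swiss-Prot flat-file entry starts with a two-letter line code, the annotation types listed above can be pulled apart with very little code. The sketch below assumes the usual flat-file layout (code in the first two columns, content from column six); the function name `group_by_line_code` and the entry text are invented for illustration and do not describe a real protein.

```python
from collections import defaultdict


def group_by_line_code(flat_file_text: str) -> dict:
    """Group the lines of one flat-file entry by their two-letter line code."""
    sections = defaultdict(list)
    for line in flat_file_text.splitlines():
        code = line[:2]
        # Skip the entry terminator and lines without a line code
        # (sequence lines start with spaces).
        if code == "//" or not code.strip():
            continue
        sections[code].append(line[5:].rstrip())
    return dict(sections)


# Made-up entry, for illustration only.
EXAMPLE_ENTRY = """\
ID   EXAMPLE_HUMAN           Reviewed;         120 AA.
DE   RecName: Full=Hypothetical example kinase;
CC   -!- FUNCTION: Illustrative text only.
KW   Kinase; Transferase.
FT   DOMAIN          10..90
//
"""

sections = group_by_line_code(EXAMPLE_ENTRY)
for code in ("DE", "CC", "KW", "FT"):
    print(code, sections[code])
```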

Other parts to UniProt
- UniParc: archive of all sequences
- UniProt: Swiss-Prot + TrEMBL
- UniProt NREF100 (100% seqs merged)
- UniProt NREF90 (90% seqs merged)
- UniProt NREF50 (50% seqs merged)
- UniMES: metagenomic sequences

Submitting sequences to EMBL or UniProt
- WEB-IN: web-based submission tool for submitting DNA sequences to the EMBL database
- Protein sequences are submitted when the peptides have been directly sequenced; submit through SPIN

Sequence formats
- Not MS Word, but text!
- Most include an ID/name/annotation of some sort
- FASTA, e.g.:
    >xyz some other comment
    ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
    caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
    ccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
- Others are specific to programs, e.g. GCG, abi, clustal, etc.
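Since FASTA is plain text with a ">" header line followed by sequence lines, reading it needs no special software. A minimal sketch of a reader follows; the function name `parse_fasta` is an assumption for this example, and established libraries handle many more edge cases.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# The FASTA example from the slide.
EXAMPLE = """>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
ccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
"""

for header, sequence in parse_fasta(EXAMPLE):
    identifier, _, comment = header.partition(" ")
    print(identifier, "|", comment, "|", len(sequence), "bases")
```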

Accession numbers
- GenBank/EMBL/DDBJ: 1 letter & 5 digits, e.g. U12345, or 2 letters & 6 digits, e.g. AY
- GenPept sequence records: 3 letters & 5 digits, e.g. AAA12345
- UniProt: all 6 characters, [A,B,O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g. P12345 and Q9JJS7
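The accession formats above translate directly into regular expressions. The sketch below encodes only the patterns given on this slide, reading the one-letter nucleotide format as one letter plus five digits (as in U12345); the databases have since added further formats, and the names `ACCESSION_PATTERNS` and `classify_accession` are made up for this example.

```python
import re

# Patterns as described on the slide; newer accession formats are not covered.
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}"),
    "GenPept": re.compile(r"[A-Z]{3}[0-9]{5}"),
    "UniProt": re.compile(r"[ABOPQ][0-9][A-Z0-9]{3}[0-9]"),
}


def classify_accession(accession: str) -> list:
    """Return the names of all database formats the accession is consistent with."""
    accession = accession.strip().upper()
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for acc in ("U12345", "AAA12345", "Q9JJS7", "P12345"):
    print(acc, classify_accession(acc))
```

Note that the formats overlap: P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, so the format alone cannot always tell you which database an accession belongs to.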

Cross-referencing identifiers
- So many different IDs for the same thing, e.g. Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID, etc.
- Need mapping files to move between them, to avoid having to parse every entry
- PICR enables mapping between IDs
- UniProt website mapper

Taxonomy Databases
- Most used is NCBI’s taxonomy database
- Provides entries for all known organisms
- Provides taxonomic lineage and translation table for organisms
- Sequence entries for each organism
- The UniProt-specific taxonomy database is Newt

Example taxonomy entry

Literature database: PubMed/Medline
- Source of medical-related & scientific literature
- PubMed has articles published after 1965
- Can search by many different means, e.g. author, title, date, journal etc., or keywords for each
- PubMed has a list of tags to search specific fields, e.g. [AU], [TI], [DP] etc.
- Can save queries and results
- Can usually retrieve abstracts and full papers

Types of search fields
- Title Words [TI]
- Title/Abstract Words [TIAB]
- Text Words [TW]
- Substance Name [NM]
- Subset [SB]
- Secondary Source ID [SI]
- Subheadings [SH]
- Publication Type [PT]
- Publication Date [DP]
- Personal Name as Subject [PS]
- Page Number [PG]
- MeSH Terms [MH]
- MeSH Major Topic [MAJR]
- MeSH Date [MHDA]
- Language [LA]
- Journal Title [TA]
- Issue [IP]
- Filter [FILTER]
- Entrez Date [EDAT]
- EC/RN Number [RN]
- Author Name [AU]
- All Fields [ALL]
- Affiliation [AD]
- Unique Identifiers [UID]
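A fielded query is simply a search term followed by its tag in square brackets, with terms joined by Boolean operators. The helper below is a hypothetical sketch of how such a query string could be assembled; `pubmed_term` and `build_query` are names invented here, the author name is made up, and nothing is sent to PubMed.

```python
def pubmed_term(value: str, tag: str) -> str:
    """Attach a search field tag such as AU, TI or DP to a search term."""
    return f"{value}[{tag.upper()}]"


def build_query(terms, operator: str = "AND") -> str:
    """Join (value, tag) pairs into a single query string."""
    return f" {operator} ".join(pubmed_term(value, tag) for value, tag in terms)


# Hypothetical search: an author, a title word and a publication year.
query = build_query([("Smith J", "au"), ("kinase", "ti"), ("2010", "dp")])
print(query)  # Smith J[AU] AND kinase[TI] AND 2010[DP]
```

The resulting string can be pasted into the PubMed search box as is.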

Querying biological databases
- Databases hold a wealth of information
- Data is held in specific formats and controlled vocabulary, which makes searching easy
- Can retrieve and save the data you need
- In many cases you can retrieve data from multiple sources

How to query databases
- Query languages, e.g. SQL
- Can query with a single word or phrase
- Boolean queries
- Regular expressions
- Basic database querying is usually done through a web interface:
  - Text or sequence-based searches
  - Can use Boolean queries and regular expressions
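To give a feel for what a query language such as SQL looks like, the sketch below builds a tiny in-memory database and queries it. The table layout, the accessions and the helper names `make_example_db` and `find_proteins` are all invented for illustration; real biological databases have their own, much larger schemas.

```python
import sqlite3


def make_example_db() -> sqlite3.Connection:
    """Create an in-memory database with a small, made-up protein table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, organism TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [
            ("EX0001", "Tyrosine-protein kinase example", "Homo sapiens"),
            ("EX0002", "Serine protease example", "Homo sapiens"),
            ("EX0003", "Protein kinase example", "Mus musculus"),
            ("EX0004", "Hexokinase example", "Homo sapiens"),
        ],
    )
    return conn


def find_proteins(conn, name_fragment: str, organism: str) -> list:
    """Select proteins whose name contains a fragment, restricted to one organism."""
    cursor = conn.execute(
        "SELECT accession, name FROM protein "
        "WHERE name LIKE ? AND organism = ? ORDER BY accession",
        (f"%{name_fragment}%", organism),
    )
    return cursor.fetchall()


conn = make_example_db()
for accession, name in find_proteins(conn, "kinase", "Homo sapiens"):
    print(accession, name)
```

The `LIKE` condition matches substrings, so "Hexokinase" is returned as well; this is one reason fielded or whole-word searches are often preferable.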

Words and phrases
- Most searches are case insensitive
- Keywords are single words searched
- Phrases are groups of words
- E.g. tyrosine protein kinase returns anything with any of the words “tyrosine”, “protein” or “kinase” (keywords)
- “tyrosine protein kinase” returns anything with the complete phrase only
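The difference between a keyword search and a phrase search can be shown on a handful of made-up record titles. This sketch follows the behaviour described above (unquoted words are treated as alternatives); actual search engines differ in their defaults, and the function names are assumptions for this example.

```python
def keyword_search(records, query: str) -> list:
    """Return records containing at least one of the query words (case insensitive)."""
    wanted = set(query.lower().split())
    return [r for r in records if wanted & set(r.lower().split())]


def phrase_search(records, phrase: str) -> list:
    """Return records containing the complete phrase (case insensitive)."""
    return [r for r in records if phrase.lower() in r.lower()]


# Made-up record titles.
RECORDS = [
    "Tyrosine protein kinase receptor",
    "Serine protein kinase",
    "Tyrosine phosphatase",
    "Hexokinase regulator",
]

print(keyword_search(RECORDS, "tyrosine protein kinase"))  # first three records
print(phrase_search(RECORDS, "tyrosine protein kinase"))   # first record only
```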

Boolean operators (George Boole)
- Operators: & (AND), | (OR), ! (NOT), e.g.:
  - protein & kinase ! tyrosine
  - tyrosine & protein & kinase
- More complex: (tyrosine OR kinase) AND (NOT serine)
- Operators don’t work inside quotation marks, e.g. “tyrosine and kinase”
- Wildcards: * and ?, e.g. cell*ase finds all words starting with “cell” and ending in “ase”
- Attributes are used to be more specific about where to find the keyword
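The Boolean and wildcard examples above can be reproduced on a few made-up records with Python's set operations and the standard `fnmatch` module. This is a sketch of the logic only, not of how any particular database implements its search; the helper names are invented.

```python
from fnmatch import fnmatchcase


def words(text: str) -> set:
    """Lower-cased set of words in a record."""
    return set(text.lower().split())


def protein_and_kinase_not_tyrosine(records) -> list:
    """Evaluate the query: protein & kinase ! tyrosine."""
    return [
        r for r in records
        if {"protein", "kinase"} <= words(r) and "tyrosine" not in words(r)
    ]


def tyrosine_or_kinase_not_serine(records) -> list:
    """Evaluate the query: (tyrosine OR kinase) AND (NOT serine)."""
    return [
        r for r in records
        if ({"tyrosine", "kinase"} & words(r)) and "serine" not in words(r)
    ]


def wildcard_words(text: str, pattern: str) -> list:
    """Return the words in text matching a pattern with * and ? wildcards."""
    return [w for w in text.lower().split() if fnmatchcase(w, pattern)]


# Made-up record titles.
RECORDS = [
    "Tyrosine protein kinase receptor",
    "Serine protein kinase",
    "Tyrosine phosphatase",
]

print(protein_and_kinase_not_tyrosine(RECORDS))
print(tyrosine_or_kinase_not_serine(RECORDS))
print(wildcard_words("cellulase cellobiase cellulose cell", "cell*ase"))
```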

Resources for searching databases
- Sequence Retrieval System (SRS)
- EBI: search across all EBI databases
- NCBI: Entrez
- Each database usually has its own web interface allowing simple queries, e.g. EnsMart allows querying of the Ensembl database

Sequence Retrieval System
- Integrates over 150 different databases
- A database can be searched and results linked to other databases in SRS
- Searches can be simple or complex; previous queries can be viewed and combined
- Results can be viewed in default formats or user-defined views
- Users can save their results and launch software packages (>100 applications) to analyse results