Introduction to Bioinformatics and Biological databases Nicky Mulder: nicola.mulder@uct.ac.za

What is Bioinformatics?
The application of computer technology to the management of biological information. Specifically, it is the science of developing computer databases and algorithms to facilitate and expedite biological research, particularly in genomics. (www.informatics.jax.org/mgihome/other/glossary.shtml)
Doing biology on a computer (computational biology)

Why is Bioinformatics needed?
Small- and large-scale biological analyses
New laboratory technologies, e.g. sequencing
Move away from single gene to whole genome
Collection and storage of biological information
Manipulation of biological information
Computers are capable of both, and are cheap

Hypothesis-driven bioinformatics
[Diagram: workflow starting from a gene of interest]
–Gene of interest
–Search PubMed for more information; retrieve articles
–Retrieve the protein sequence
–Retrieve the DNA sequence; analysing DNA sequence
–Sequence similarity search
–Sequence alignments, finding motifs
–Finding domains, classifying proteins
–Phylogenetics
–Looking at whole genomes
Areas involved: Genomics, Sequence analysis, Biological databases, Phylogenetics

Hypothesis-generating bioinformatics
[Diagram: workflow starting from a high-throughput experiment]
–High-throughput experiment (microarray, proteomics, NGS)
–Experimental design
–Statistics
–Data processing
–Gene lists
–Pathway analysis
–Gene set enrichment
–Data mining
–Data integration
–Systems biology

Two major components to Bioinformatics
Storing and retrieving data:
–Biological databases
–Querying these to retrieve data
Manipulating the data with tools, e.g.:
–Sequence similarity searches
–Protein families and function prediction
–Comparing sequences (phylogenetics)

What is a database?
An organized body of related information (www.cogsci.princeton.edu/cgi-bin/webwn)
Data collection that is:
–Structured (computer readable)
–Searchable
–Updatable
–Cross-linked
–Publicly available

Biological Databases
Where do you go to find:
–A video -> YouTube
–Info on S. Hawking -> Wikipedia
–A book -> Amazon
–A friend -> Facebook
–DNA sequence -> EMBL
–Protein sequence -> UniProtKB, RefSeq…
Biological databases:
–Order data and make it available to the public
–Turn data into computer-readable form
–Provide the ability to retrieve data from various sources
Can be primary (archival) or secondary (curated) databases

Categories of Databases for Life Sciences
–Sequences (DNA, protein)
–Genomics
–Mutation
–Protein domain/family
–Proteomics
–3D structure
–Metabolism
–Bibliography
–Protein interaction

Why do we need sequence DBs?
>5000 genomes sequenced (single organism, varying sizes, including viruses)
Thousands of ongoing genome sequencing projects
cDNA sequencing projects (ESTs or cDNAs)
Metagenome sequencing projects (~200) = environmental samples: multiple ‘unknown’ organisms = microbiome
Personal human genomes
Cost of sequencing is coming down, making it an alternative to other technologies

Sequence databases
Used for retrieving a known gene/protein sequence
Useful for finding information on a gene/protein
Can find out how many genes are available for a given organism
Can compare your sequence to the others in the database
Can submit your sequence to store with the rest
Main databases: nucleotide and protein sequence DBs
Should be interconnected with other databases

Protein-centric view of database network
[Diagram: linked data types]
–DNA sequences
–Gene annotation
–Gene expression data
–Protein sequences
–Macromolecular structure data

Nucleotide sequence databases
EMBL, DDBJ, GenBank
Data submitted by sequence owner
Must provide certain information, and CDS if applicable
No additional annotation added
Entries never merged, so there is some redundancy
[Diagram labels: Promoter, Exons, CDS (coding sequence)]

[Figure: example EMBL entry with labelled parts: accession number, taxonomy, references, cross-references, features]

[Figure: example EMBL entry with labelled parts: annotation (predicted or experimentally determined), sequence, CDS (CoDing Sequence, proposed by submitters)]

Feature lines in EMBL entries
Describe features on a sequence that are important for function, replication, recombination, structure etc.
–Feature key: functional group, e.g. CDS (protein-coding sequence), ribosome binding site
–Location: how to find the feature
–Qualifier: additional information
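The key/location/qualifier structure of feature lines can be sketched in a few lines of Python. This is a simplified illustration, not a full EMBL parser: the sample lines are made up, and the code assumes every location and qualifier fits on a single line starting with "FT".

```python
from typing import Dict, List

# Made-up feature-table lines, for illustration only.
SAMPLE_LINES = [
    "XX",
    "FT   source          1..1500",
    'FT                   /organism="Escherichia coli"',
    "FT   CDS             14..1495",
    'FT                   /product="hypothetical protein"',
    "XX",
]


def parse_features(lines: List[str]) -> List[Dict]:
    """Collect feature key, location and qualifiers from FT lines."""
    features: List[Dict] = []
    for line in lines:
        if not line.startswith("FT"):
            continue  # only feature-table lines are of interest here
        body = line[2:].strip()
        if body.startswith("/"):
            # Qualifier: additional info attached to the most recent feature
            if features:
                name, _, value = body[1:].partition("=")
                features[-1]["qualifiers"][name] = value.strip('"')
            continue
        parts = body.split(None, 1)
        if len(parts) == 2:
            # Feature key followed by its location on the sequence
            key, location = parts
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


for feature in parse_features(SAMPLE_LINES):
    print(feature["key"], feature["location"], feature["qualifiers"])
```

For real entries, an established parser (for example the one in Biopython) is a safer choice, since locations and qualifier values often span several lines.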

Summary of information in EMBL entries
Provides taxonomy from which sequence came
Provides information on submitters and references
Describes features on a sequence that are important for function, replication, recombination, structure etc.
Shows if the DNA encodes a protein (CDS) and provides protein sequence
Provides actual nucleotide sequence
Describes sequence type, e.g. genomic DNA, RNA, EST
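How a CDS relates to the protein sequence given in an entry can be shown with a small translation sketch. It assumes the standard genetic code (NCBI translation table 1); organisms that use another table would need a different amino-acid string. The example CDS is a toy sequence, not taken from a real entry.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    seq = cds.upper()
    protein = []
    for i in range(0, len(seq) - 2, 3):
        # Codons containing ambiguous bases (e.g. N) are reported as 'X'
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATGAAAGGTCGATAA"))
```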

CDS: mRNA versus genomic sequence
[Figure: pairwise alignment of an mRNA-derived contig (CONTIG) against the genomic sequence (Genomic). The contig matches the genomic sequence across the exons, from the ATG start codon to the TAA stop codon; the long gaps in the contig correspond to introns in the genomic sequence. Labels: exon, intron.]
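The relationship between the genomic sequence and the spliced mRNA can be illustrated with a toy example. The sequence and exon coordinates below are invented for illustration (exons in upper case, introns and flanking sequence in lower case); they are not taken from the alignment in the figure.

```python
from typing import List, Tuple

# Toy genomic sequence: flank + exon 1 + intron + exon 2 + flank.
GENOMIC = "cc" + "ATGAAA" + "gtaagtttcag" + "GGTCGATAA" + "tt"

# Exon coordinates, 1-based and inclusive, as in a feature-table location.
EXONS: List[Tuple[int, int]] = [(3, 8), (20, 28)]


def splice(genomic: str, exons: List[Tuple[int, int]]) -> str:
    """Join the exon segments, dropping introns and flanking sequence."""
    return "".join(genomic[start - 1:end] for start, end in exons)


cds = splice(GENOMIC, EXONS)
print(cds)
# A complete CDS starts with a start codon, ends with a stop codon,
# and has a length that is a multiple of three.
print(cds.startswith("ATG"), cds[-3:] in {"TAA", "TAG", "TGA"}, len(cds) % 3 == 0)
```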

Other nucleotide databases
–RefSeq
–dbEST
–Short read archive
–Trace archive
–WGS collections

Protein sequences
DNA -> RNA -> Protein
[Diagram labels: S S, Ac]
–Protein cleavage
–Protein modification
–Transported to organelle or membrane
–Folded into secondary or tertiary structure
–Performs a specific function
All this info needs to be captured in a database

Protein Sequence Databases
UniProt:
–Swiss-Prot: manually curated, distinguishes between experimental and computationally derived annotation
–TrEMBL: automatic translation of EMBL, no manual curation, some automatic annotation
GenPept: GenBank translations
RefSeq: non-redundant sequences for certain organisms

A UniProtKB/Swiss-Prot entry
Protein existence levels:
1: Evidence at protein level
2: Evidence at transcript level
3: Inferred from homology
4: Predicted
5: Uncertain

Swiss-Prot annotation mainly found in:
Comment (CC) lines
–Function, pathway, cofactor, regulation, disease, subcellular location
Feature table (FT)
–Features on the sequence, e.g. domain, active site
Keyword (KW) lines
–Set of a few hundred controlled vocabulary terms
Description (DE) lines
–Protein name/function

Other parts to UniProt
UniParc: archive of all sequences
UniProt: Swiss-Prot + TrEMBL
UniProt NREF100 (sequences with 100% identity merged)
UniProt NREF90 (sequences with 90% identity merged)
UniProt NREF50 (sequences with 50% identity merged)
UniMES: metagenomic sequences

Submitting sequences to EMBL or UniProt
WEB-IN: web-based submission tool for submitting DNA sequences to the EMBL database
Protein sequences are submitted when the peptides have been directly sequenced; submit through SPIN

Sequence formats
Not MS Word, but text!
Most include an ID/name/annotation of some sort
FASTA, e.g.:
>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
ccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
Others specific to programs, e.g. GCG, abi, clustal, etc.
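Because FASTA is plain text, it is easy to read programmatically. The following is a minimal sketch of a FASTA reader using the example record from the slide; the function name `parse_fasta` is an illustrative choice, and the code assumes the first word after ">" is the record ID.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: first word is the ID, the rest is a free-text comment
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated until the next header
            records[current_id] += line
    return records


example = """>xyz some other comment
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgc
caatatgcgctctttgtccgcgcccaggagctacacaccttcgaggtga
ccggccaggaaacggtcgccagatcaaggctcatgtagcctcactgg
"""

records = parse_fasta(example.splitlines())
print(list(records), len(records["xyz"]))
```

For production work a maintained parser is preferable, since real files can contain variations this sketch does not handle.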

Accession numbers
GenBank/EMBL/DDBJ: 1 letter & 5 digits, e.g. U12345, or 2 letters & 6 digits, e.g. AY123456
GenPept sequence records: 3 letters & 5 digits, e.g. AAA12345
UniProt: all 6 characters, [A,B,O,P,Q] [0-9] [A-Z,0-9] [A-Z,0-9] [A-Z,0-9] [0-9], e.g. P12345 and Q9JJS7
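These formats translate directly into regular expressions. The sketch below encodes only the patterns listed on this slide; the databases have since added further accession formats, so treat it as a teaching example rather than a validator. The function name `classify_accession` is an illustrative choice.

```python
import re
from typing import List

# Patterns transcribed from the slide; real databases accept further formats.
ACCESSION_PATTERNS = {
    "GenBank/EMBL/DDBJ": re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}"),
    "GenPept": re.compile(r"[A-Z]{3}\d{5}"),
    "UniProt": re.compile(r"[ABOPQ]\d[A-Z0-9]{3}\d"),
}


def classify_accession(accession: str) -> List[str]:
    """Return the names of all slide patterns that the accession fully matches."""
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(accession)
    ]


for acc in ["U12345", "AY123456", "AAA12345", "P12345", "Q9JJS7"]:
    print(acc, classify_accession(acc))
```

Note that P12345 fits both the one-letter nucleotide pattern and the UniProt pattern, so the format alone does not always identify the source database.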

Cross-referencing identifiers
So many different IDs for the same thing, e.g. Ensembl, EMBL, HGNC, UniGene, UniProt, Affy ID, etc.
Need mapping files to move between them, to avoid having to parse every entry
PICR (http://www.ebi.ac.uk/Tools/picr/) enables mapping between IDs
UniProt website mapper (www.uniprot.org)
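A mapping file is typically a simple table that can be loaded into a lookup structure. The sketch below assumes a hypothetical tab-separated layout with the columns `uniprot`, `database` and `external_id`; the rows reuse the example accessions from the previous slide, but the pairings between them are invented for illustration.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO

# Hypothetical mapping file content; the ID pairings are made up.
MAPPING_TSV = (
    "uniprot\tdatabase\texternal_id\n"
    "P12345\tEMBL\tU12345\n"
    "P12345\tGenPept\tAAA12345\n"
    "Q9JJS7\tEMBL\tAY123456\n"
)


def load_mapping(handle: TextIO) -> Dict[str, Dict[str, str]]:
    """Build {uniprot_accession: {database: external_id}} from a TSV file."""
    mapping: Dict[str, Dict[str, str]] = defaultdict(dict)
    for row in csv.DictReader(handle, delimiter="\t"):
        mapping[row["uniprot"]][row["database"]] = row["external_id"]
    return dict(mapping)


mapping = load_mapping(io.StringIO(MAPPING_TSV))
print(mapping["P12345"]["EMBL"])
```

Loading the table once and looking IDs up in memory avoids parsing every database entry each time an identifier needs converting.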

Taxonomy Databases
Most used is NCBI’s taxonomy database: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy
Provides entries for all known organisms
Provides taxonomic lineage and translation table for organisms
Sequence entries for organism
UniProt-specific taxonomy database is Newt: http://www.ebi.ac.uk/newt

Example taxonomy entry

Literature database: PubMed/Medline
Source of medical-related & scientific literature
PubMed has articles published after 1965
Can search by many different means, e.g. author, title, date, journal etc., or keywords for each
PubMed has a list of tags to search specific fields, e.g. [AU], [TI], [DP] etc.
Can save queries and results
Can usually retrieve abstracts and full papers

Types of search fields
–Title Words [TI]
–Title/Abstract Words [TIAB]
–Text Words [TW]
–Substance Name [NM]
–Subset [SB]
–Secondary Source ID [SI]
–Subheadings [SH]
–Publication Type [PT]
–Publication Date [DP]
–Personal Name as Subject [PS]
–Page Number [PG]
–MeSH Terms [MH]
–MeSH Major Topic [MAJR]
–MeSH Date [MHDA]
–Language [LA]
–Journal Title [TA]
–Issue [IP]
–Filter [FILTER]
–Entrez Date [EDAT]
–EC/RN Number [RN]
–Author Name [AU]
–All Fields [ALL]
–Affiliation [AD]
–Unique Identifiers [UID]
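Field tags are appended to a search term in square brackets, and tagged terms can be combined with Boolean operators. The sketch below builds such a query string; the helper name `build_query` and the example search terms are hypothetical, and only a subset of the tags listed above is included.

```python
from typing import Iterable, Tuple

# A subset of the search field tags listed on the slide.
SEARCH_TAGS = {"AU", "TI", "TIAB", "TW", "DP", "TA", "MH", "LA", "PT", "AD"}


def build_query(terms: Iterable[Tuple[str, str]], operator: str = "AND") -> str:
    """Join (text, tag) pairs into a fielded query such as 'x[AU] AND y[TI]'."""
    parts = []
    for text, tag in terms:
        if tag not in SEARCH_TAGS:
            raise ValueError(f"Unknown search tag: {tag}")
        parts.append(f"{text}[{tag}]")
    return f" {operator} ".join(parts)


# Hypothetical search: an author name, a title word and a publication year.
query = build_query([("Smith J", "AU"), ("bioinformatics", "TI"), ("2010", "DP")])
print(query)
```

The resulting string can be pasted into the PubMed search box.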

Querying biological databases
Databases hold a wealth of information
Data is held in specific formats and controlled vocabulary, which makes searching easy
Can retrieve and save data you need
In many cases you can retrieve data from multiple sources

How to query databases
Query languages, e.g. SQL
Can query with single word or phrase
Boolean queries
Regular expressions
Basic database querying is usually done through a web interface
–Text or sequence-based searches
–Can use Boolean queries and regular expressions
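As an illustration of querying with SQL, the sketch below creates a tiny in-memory SQLite table and runs one query against it. The table layout, accessions and protein names are invented for the example and do not reflect the schema of any real biological database.

```python
import sqlite3

# Invented rows: (accession, name, organism).
ROWS = [
    ("X0001", "tyrosine protein kinase", "Homo sapiens"),
    ("X0002", "serine protein kinase", "Mus musculus"),
    ("X0003", "DNA polymerase", "Homo sapiens"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (accession TEXT PRIMARY KEY, name TEXT, organism TEXT)"
)
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", ROWS)

# Find human proteins whose name contains the word 'kinase'.
hits = conn.execute(
    "SELECT accession, name FROM protein "
    "WHERE name LIKE ? AND organism = ? ORDER BY accession",
    ("%kinase%", "Homo sapiens"),
).fetchall()
print(hits)
conn.close()
```

Web interfaces usually build queries of this kind behind the scenes from the search form.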

Words and phrases
Most searches are case insensitive
Keywords are single words searched
Phrases are groups of words
E.g.:
–tyrosine protein kinase: returns anything with any of the words “tyrosine”, “protein” or “kinase” (keywords)
–“tyrosine protein kinase”: returns anything with the complete phrase only
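The difference between a keyword search and a phrase search can be sketched on a handful of invented protein descriptions. This is a simplified model of how a text search might behave, not the implementation used by any particular database.

```python
import re
from typing import List

# Invented protein descriptions.
DESCRIPTIONS = [
    "Tyrosine protein kinase receptor",
    "Serine/threonine protein kinase",
    "Tyrosine phosphatase",
    "DNA polymerase",
]


def tokens(text: str) -> set:
    """Lower-case the text and split it into words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_search(query: str, records: List[str]) -> List[str]:
    """Return records containing ANY of the query words (case insensitive)."""
    wanted = tokens(query)
    return [r for r in records if wanted & tokens(r)]


def phrase_search(phrase: str, records: List[str]) -> List[str]:
    """Return records containing the complete phrase (case insensitive)."""
    return [r for r in records if phrase.lower() in r.lower()]


print(keyword_search("tyrosine protein kinase", DESCRIPTIONS))
print(phrase_search("tyrosine protein kinase", DESCRIPTIONS))
```

The keyword search returns three of the four descriptions, while the phrase search returns only the one containing the exact phrase.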

Boolean operators (George Boole)
Operators, e.g. & (AND), | (OR), ! (NOT), e.g.:
–protein & kinase ! tyrosine
–tyrosine & protein & kinase
More complex: (tyrosine OR kinase) AND (NOT serine)
Operators don’t work inside quotes, e.g. “tyrosine and kinase”
Wildcards * and ?, e.g. cell*ase finds all words starting with “cell” and ending in “ase”
Attributes are used to be more specific about where to find the keyword
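The example queries on this slide can be mimicked with Python set operations and shell-style wildcard matching from the standard `fnmatch` module. The descriptions are invented, and the code is only a sketch of the logic, not how a database search engine is implemented.

```python
import re
from fnmatch import fnmatchcase

# Invented protein descriptions.
DESCRIPTIONS = [
    "Tyrosine protein kinase",
    "Serine/threonine protein kinase",
    "Tyrosine phosphatase",
    "Cellulase",
    "Cellobiase",
    "Cellulose synthase",
]


def tokens(text: str) -> set:
    """Lower-case the text and split it into words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# protein & kinase ! tyrosine
and_not = [
    d for d in DESCRIPTIONS
    if {"protein", "kinase"} <= tokens(d) and "tyrosine" not in tokens(d)
]

# (tyrosine OR kinase) AND (NOT serine)
or_and_not = [
    d for d in DESCRIPTIONS
    if tokens(d) & {"tyrosine", "kinase"} and "serine" not in tokens(d)
]

# Wildcard cell*ase: words starting with "cell" and ending in "ase"
wildcard = [
    d for d in DESCRIPTIONS
    if any(fnmatchcase(word, "cell*ase") for word in tokens(d))
]

print(and_not)
print(or_and_not)
print(wildcard)
```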

Resources for searching databases
Sequence Retrieval System (SRS)
EBI: search across all EBI databases
NCBI: Entrez
Each database usually has its own web interface allowing simple queries, e.g. EnsMart allows querying of the Ensembl database

Sequence Retrieval System
http://srs.ebi.ac.uk
Integrates over 150 different databases
A database can be searched and results linked to other databases in SRS
Searches can be simple or complex; can view previous queries and combine them
Results can be viewed in default formats or user-defined views
Users can save their results and launch software packages (>100 applications) to analyse results
