“Explosion”, “Avalanche”, “Tsunami” of data Complexity of data

New, Growing, and Evolving Fields of Genetics, Genomics, and Computational Biology
“Explosion”, “Avalanche”, “Tsunami” of data
Complexity of data
  30,000 “genes”: structure, function unknown
  pathways (circuits) unknown
  so much of biology that we do not know: chromosomal packaging, transcriptional regulation, post-translational modification, evolutionary questions (introns early/late, etc.)
  Stone Age of medicine for determining drugs and treatment?
Multiple model organism genomes
Requires the basics of genetics and biology
Also represents an increasing need for computational expertise

Biomedical information tsunami
overwhelming volume of data multitude of sources Taken from Ken Buetow, NCI

Incredible developments in biomedical information generation
Taken from Ken Buetow, NCI

Treatment of Disease Accelerated by Human Genome Project time
Disease with genetic component ID genes Diagnostics Understanding basic biological defect Preventative medicine Gene therapy Drug therapy Pharmacogenomics Adapted from Francis Collins

Stone Age of Medicine
1. Try compound on model organism.
2. Does it work? If no, try compound on a different model organism or for another result.
3. Does it harm/kill? If yes, try compound on a different model organism or for another result.
4. If it works and does no harm: seek approval (human testing, clinical trials, FDA): 16 years, millions of dollars. Done.
Knowledge from the genome is moving us away from this.

Motivation of this course
Intro to Bioinformatics provides a first exposure to some available computational techniques and resources; however, the emphasis is on utilization.
In this course, I try to emphasize tools and techniques that you would use to develop your own computational resources (software, systems, tools, etc.).
Computational Methods in Molecular Biology covers advanced topics.

What is genomics?
Genomics: mapping, sequencing and analysis of genomes
Functional genomics: simultaneous study of function of all genes in a pathway or system
Proteomics: simultaneous study of all proteins in a pathway or system

What is Bioinformatics?
“… addresses problems related to the storage, retrieval and analysis of information about biological structure, sequence and function.” Altman, R. (1998) Bioinformatics

Questions people hope to answer
Can we find the genes and assign them functions? Can we link genotype to phenotype? Can we use genotype/phenotype to predict relevant outcome? Can we predict protein structures and functions? Can we reconstruct metabolic, signaling, and other pathways? Can we reconstruct informational networks? Can we use cross-species comparisons to learn something?

Bioinformatics
Goals: understand a living cell and how it functions at the molecular level
Scope: develop computational tools and databases; apply these tools and databases to understand living systems
Applications: knowledge-based drug design; biotechnology of all kinds
Limitations: dependent on quality information; errors in sequencing or annotation; algorithm errors due to invalid assumptions
Future: systems biology, i.e. molecular simulation of all cellular processes

Applications
Structure analysis
Sequence analysis
Function analysis
Software development
Database construction and curation

Topics
Biomolecular databases and genome viewers (1)
Sequence similarity and alignments (2)
Annotation and protein analyses/predictions (3)
Pathways, protein interactions and structure (4)
Comparative genomics and genetic information (5)

Gathering knowledge Anatomy, architecture Dynamics, mechanics
Informatics (Cybernetics – Wiener, 1948) (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems)‏ Genomics, bioinformatics Rembrandt, 1632 Newton, 1726

Bioinformatics Bioinformatics Chemistry Biology Mathematics Molecular
Statistics Bioinformatics Computer Science Informatics Medicine Physics

Bioinformatics
“Studying informational processes in biological systems” (Hogeweg, early 1970s)
  No computers necessary
  Back of envelope OK
“Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith)
Applying algorithms with mathematical formalisms in biology (genomics) -- USA

Bioinformatics in the olden days
Close to Molecular Biology:
  (Statistical) analysis of protein and nucleotide structure
  Protein folding problem
  Protein-protein and protein-nucleotide interaction
Many essential methods were created early on (BG era)
  Protein sequence analysis (pairwise and multiple alignment)
  Protein structure prediction (secondary, tertiary structure)

Bioinformatics in the olden days (Cont.)‏
Evolution was studied and methods created
  Phylogenetic reconstruction (clustering – NJ method)

But then the big bang….

The Human Genome -- 26 June 2000
Dr. Craig Venter Celera Genomics -- Shotgun method Dr. Francis Collins / Sir John Sulston Human Genome Project

Human DNA
There are about 3 billion (3 × 10^9) nucleotides in the nucleus of almost all of the trillions (5-10 × 10^12) of cells of a human body (an exception is, for example, red blood cells, which have no nucleus and therefore no DNA): a total on the order of 10^22 nucleotides!
Many DNA regions code for proteins and are called genes (in principle, 1 gene codes for 1 protein).
Human DNA contains ~30,000 expressed genes.
Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). These nucleotides are sometimes also called bases.

Human DNA (Cont.)
All people are different, but the DNA of different people varies by only 0.2% or less. Only about 1 letter in ~1400 is expected to be different; over the whole genome, this means that roughly 2 to 3 million letters differ between individuals.
The structure of DNA is the so-called double helix, discovered by Watson and Crick in 1953, in which the two strands are cross-linked by A-T and C-G base pairs (nucleotide pairs: so-called Watson-Crick base pairing).
The Human Genome has recently been announced as complete (in 2004).

Genome size

Organism                    Number of base pairs
φX-174 virus                5,386
Epstein-Barr virus          172,282
Mycoplasma genitalium       580,000
Haemophilus influenzae      1.8 × 10^6
Yeast (S. cerevisiae)       … × 10^6
Human                       3 × 10^9
Wheat                       … × 10^9
Lilium longiflorum          90 × 10^9
Salamander                  … × 10^9
Amoeba dubia                … × 10^9

Humans have spliced genes…

A gene codes for a protein
DNA      CCTGAGCCAACTATTGATGAA
  | transcription
mRNA     CCUGAGCCAACUAUUGAUGAA
  | translation
Protein  PEPTIDE
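The transcription and translation steps on this slide can be sketched in a few lines of Python. This is only an illustration: the codon table below is deliberately reduced to the seven codons that occur in the slide's example (a real program would use the full 64-codon standard genetic code), and the function names are made up for this sketch.

```python
# Reduced codon table: only the codons that occur in the slide's example.
# (Assumption for brevity; the standard genetic code has 64 codons.)
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}


def transcribe(dna: str) -> str:
    """Turn coding-strand DNA into mRNA by replacing T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read the mRNA three letters at a time and look up each codon."""
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```

The example sequence was evidently chosen so that its one-letter amino acid codes spell the word PEPTIDE.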

Genome revolution has changed bioinformatics
More high-throughput (HTP) applications (cluster computing, GRID, etc.)
More automatic pipeline applications
More user-friendly interfaces
Greater emphasis on biostatistics
Greater influence of computer science (machine learning, software engineering, etc.)
More integration of disciplines, databases and techniques

New areas interfacing bioinformatics
Systems Biology Cellular networks Quantitative studies Time processes Cellular compartmentation Multi-scale modelling Link with experiment Neurobiology From genome information to behaviour Brain modelling

Protein Sequence-Structure-Function
Ab initio prediction and folding Sequence Structure Function Threading Function prediction from structure Homology searching (BLAST)‏

Luckily for bioinformatics…
There are many annotated databases (i.e. DBs with experimentally verified information).
Based on evolution, we can relate biological macromolecules and then “steal” annotation of “neighbouring” proteins or DNA in the DB. This works for sequence as well as structural information.
Problem we discuss in this course: how do we score the evolutionary relationships? That is, we need to develop a measure to decide which molecules are (probably) neighbours and which are not.
Sequence – structure/function gap: there are far more sequences than solved tertiary structures and functional annotations. This gap is growing, so there is a need to predict structure and function.

What is a database? A collection of information, usually stored in an electronic format that can be searched by a computer.

Databases (DB)‏ Sets of data Stored on computer Explicit data model
What is in the DB? What is not? Well-defined data structure

Analysis methods Searches Comparison Features Keywords or free text
Similarity Comparison Sequence-sequence alignment Structures Features Identification of (sites, domains)‏ Prediction of (secondary structure)‏

Different Databases in Bioinformatics
Sequence Databases SNP/mutation DBs Structure Databases Expression Databases Spectral Databases Metabolism Databases Drug Databases Cell/Strain Databases Organism Databases Interaction Databases Function/Ontology DBs Bibliographic DBs Disease Databases Databases of Databases

Why So Many Databases? To collect and preserve valuable data
To make data accessible and easily searched To standardize data representation or data formats To organize data into knowledge

Different Types of Databases
Public (web-accessible, downloadable) Local or Private (restricted to registered users or not web-accessible) Archival (anything goes) Curated (managed data submission) Specialty databases (special interest) General databases (wide interest)

Where To Go For More Info?
Nucleic Acids Research Database Issue (every January)

Sequence Databases*
GenBank
EMBL/TrEMBL/UniProt
DDBJ
PIR
SwissProt

Sequence Databases Some specialize in DNA sequence data (GenBank, DDBJ) Some specialize in protein sequence data (PIR, Swiss-Prot, UniProt) Some are specific for organisms or classes of organisms (yeast, fruitflies, human, certain bacteria)

Sequence Annotation Most sequence databases usually include more information than just the raw sequence data This additional information is called “annotation” and it may be done either automatically or manually Annotations include information about gene/protein name, length, position, references, corrections, etc

Different Levels of Annotation
Sparse: typical of most archival DNA sequence databases (GenBank, DDBJ)
Moderate: typical of more curated databases or protein-specific databases (PIR, TrEMBL)
Detailed: typical of organism-specific databases or databases with a very high level of curation (Swiss-Prot, EcoCyc, BacMap)

Different Levels of Database Annotation*
GenBank (large # of sequences, minimal annotation) PIR (large # of sequences, slightly better annotation) SwissProt (small # of sequences, even better annotation) Organism-specific DB (very small # of sequences, best annotation)

Important web sites
EBI: European Bioinformatics Institute, www.ebi.ac.uk/
  Databases: EMBL, UniProt, Ensembl, …
NCBI: National Center for Biotechnology Information
  Databases: PubMed, Entrez, OMIM, …

Historical background, 1
“Atlas of Protein Sequence and Structure”, Margaret Dayhoff et al., 1965
Printed book with all published sequences
New editions into the 1970s
Basis for Protein Information Resource (PIR), pir.georgetown.edu/
Since 2003 part of UniProt

Historical background, 2
SwissProt
Amos Bairoch, University of Geneva
Swiss Institute of Bioinformatics
Data from literature, carefully curated
Started in 1986
Since 2003 part of UniProt

DB properties Quality Comprehensiveness Redundancy
Error rates, types of errors Update policy Comprehensiveness Data sources Redundancy Multiple entries for same biological item?

Consider when choosing a DB
Central data type Data entry and quality Primary or derived data Maintainer status Availability

Central data types
Nucleotide sequences: EMBL, GenBank
Protein sequences: UniProt (PIR, Swiss-Prot)
Genes, genomes: Ensembl, EntrezGene
3D structure: PDB (RCSB)

Data entry and quality Method of data entry Quality control
Scientists deposit data Curators enter data (from literature)‏ Quality control Consistency, redundancy, conflicts Are checks applied? Update policy Regularity Are errors removed?

Primary and Secondary Databases
Primary databases: REAL EXPERIMENTAL DATA
  Biomolecular sequences or structures and associated annotation information (organism, function, mutation linked to disease, functional/structural patterns, bibliographic etc.)
  A primary database contains annotation, administration, etc., but also real experimental data.
Secondary databases: DERIVED INFORMATION
  Fruits of analyses of sequences in the primary sources (patterns, blocks, profiles etc.), which represent the most conserved features of multiple alignments. These conserved features provide potent discriminators of family members for newly determined sequences.
  A secondary database contains data from primary database(s), extra annotation and administration, and added value (calculation, annotation, etc.).

Primary Databases
Sequence Information
  DNA: EMBL, GenBank, DDBJ
  Protein: SwissProt, TrEMBL, PIR, OWL
Genome Information
  GDB, MGD, ACeDB, Ensembl
Structure Information
  PDB, NDB, CCDB/CSD

Notes:
DDBJ: The DNA Data Bank of Japan.
OWL: a non-redundant protein sequence database produced from the following source databases: SWISSPROT, PIR (1-3), GenBank translations, NRL-3D.
PIR (1-3): The Protein Identification Resource.
EMBL: The European Molecular Biology DNA Sequence Database.
PDB: The Protein Databank (3D structures).
ACeDB: originally developed for the C. elegans genome project, from which its name was derived (A C. elegans DataBase). However, its tools have been generalized to be much more flexible, and the same software is now used for many different genomic databases, from bacteria to fungi to plants to man.
CCDB: Cambridge Crystallographic Database.
A primary database contains annotation, administration, etc., but also REAL EXPERIMENTAL DATA.

Secondary Databases
Sequence-related Information: ProSite, REBase
Genome-related Information: OMIM, TransFac
Structure-related Information: DSSP, HSSP, FSSP, PDBFinder
Pathway Information: KEGG, Pathways
Function-related: Enzyme, GO

Notes:
PROSITE: A Dictionary of Protein Sites and Patterns (1492 patterns, Oct 2001).
EC-Enzyme: The EC Enzyme Classification Database.
OMIM: Online Mendelian Inheritance in Man.
SWISS-2DPAGE: Two-dimensional Polyacrylamide Gel Electrophoresis Database.
REBASE: The Restriction Enzyme Database.
Refbase: A Protein Sequence Citation Database.
KEGG: Kyoto Encyclopedia of Genes and Genomes.
DSSP: database of secondary structure assignments (and much more) for all of the entries in the PDB. 15xxx entries.
HSSP: homology-derived structures of proteins; a derived database merging structural (2D and 3D) and sequence (1D) information. Implied secondary and tertiary structure. 15xxx entries.
FSSP: families of structurally similar proteins. Structural alignment of proteins in PDB entries.
A secondary database contains data from primary database(s), extra annotation and administration, and added value (calculation, annotation, etc.).

Databases
Data must be in a certain format for programs to recognize them. Every database can have its own format, but some data elements are essential for every database:
1. Unique identifier, or accession code
2. Name of depositor
3. Literature references
4. Deposition date
5. The real data
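The five essential data elements listed above map naturally onto a small record type. The following is a minimal sketch only: the class and field names are invented for illustration, and the example values are taken from the crambin SwissProt entry shown later in this deck (with the first author of the sequence reference standing in as depositor).

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class DatabaseRecord:
    """Hypothetical container for the five essential data elements."""
    accession: str          # 1. unique identifier, or accession code
    depositor: str          # 2. name of depositor
    references: List[str]   # 3. literature references
    deposition_date: date   # 4. deposition date
    data: str               # 5. the real data (here: a protein sequence)


record = DatabaseRecord(
    accession="P01542",
    depositor="Teeter M.M.",
    references=["Teeter M.M., Mazer J.A., L'Italien J.J.; Biochemistry 20 (1981)"],
    deposition_date=date(1986, 7, 21),
    data="TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN",
)
print(record.accession, record.deposition_date.isoformat(), len(record.data))
```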

3 examples 1. SwissProt 2. EMBL 3. PDB

Quality of databases
SwissProt: data is only entered by annotation experts
EMBL, PDB: everybody can submit data; data are accepted the way they are submitted

SwissProt database Database of protein sequences
Produced by Amos Bairoch (University of Geneva) and the EMBL Data Library
Data derived from:
  translations of DNA sequences (from EMBL Database)
  adapted from the PIR collection
  extracted from the literature
  directly submitted by researchers
SwissProt & SwissNew, July 2001: ~86,600 entries, ~15,000 new entries / year
SwissNew: 53,000 entries
Ca. 200 annotation experts worldwide
Keyword-organised flat file
Notes: 31 million / 86,593 = 357 aa per entry on average. 20 annotators do 99% of the work; 200 remote experts handle the 1% of problem cases.

SwissProt records (1)
ID: identification line
ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
ID CRAM_CRAAB STANDARD; PRT; 46 AA.
Format for the ENTRY_NAME: NAME_SPECIES (up to 10 characters); here: Crambin (Crambe abyssinica)
For a number of organisms (16), SPECIES has a recognizable name: HUMAN, MOUSE, CHICK, BOVIN, YEAST, ECOLI, …
N.B. The ID can change, e.g. serotonin receptors have got a new nomenclature.
SWISSPROT:CRAA_HUMAN; alpha crystallin A chain.
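Because the entry name follows the NAME_SPECIES convention described above, it can be split mechanically. A small sketch, with a helper name invented for this example:

```python
from typing import Tuple


def split_entry_name(entry_name: str) -> Tuple[str, str]:
    """Split a SwissProt-style entry name into (NAME, SPECIES)."""
    name, separator, species = entry_name.partition("_")
    if not (separator and name and species):
        raise ValueError(f"not of the form NAME_SPECIES: {entry_name!r}")
    return name, species


print(split_entry_name("CRAM_CRAAB"))  # ('CRAM', 'CRAAB')
print(split_entry_name("CRAA_HUMAN"))  # ('CRAA', 'HUMAN')
```

Since the entry name can change between releases, code like this is useful for display or grouping by species, but the accession number remains the safer key for storing references to an entry.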

SwissProt records (2)
AC: accession number
AC P01542;
The AC is unique: name, sequence, everything can change, but the AC stays the same.
DT: deposition date
DT 21-JUL-1986 (Rel. 01, Created)
DT 30-MAY-2000 (Rel. 39, Last sequence update)
DT 30-MAY-2000 (Rel. 39, Last annotation update)
1) You cannot see what the last annotation update was
2) No depositor record (implicit: author of first reference)
DT is relevant in connection with protein vs. DNA sequencing.

SwissProt records (3)
DE: description
DE CRAMBIN.
DE 6-phosphofructo-2-kinase 1 (EC ) (Phosphofructokinase 2 I)
1) General descriptive information
2) Free-format
GN: gene name
GN THI2.
OS & OC & OG: Organism Species; Organism Classification; OrGanelle
OS Crambe abyssinica (Abyssinian crambe).
OC Eukaryota; Viridiplantae; Embryophyta; Tracheophyta; Spermatophyta;
OC Magnoliophyta; eudicotyledons; Rosidae; eurosids II; Brassicales;
OC Brassicaceae; Crambe.
Organelle: notably mitochondrion & chloroplast

SwissProt records (4)
RN: References
RN [1]
RP SEQUENCE.
RX MEDLINE;
RA Teeter M.M., Mazer J.A., L'Italien J.J.;
RT "Primary structure of the hydrophobic plant protein crambin.";
RL Biochemistry 20: (1981).
RP = keyword describing what the reference is about, for example STRUCTURE or SEQUENCE. If RP = SEQUENCE, the first author of this reference is the depositor of the data.
CC: Comments or notes
CC -!- FUNCTION: THE FUNCTION OF THIS HYDROPHOBIC PLANT SEED PROTEIN
CC IS NOT KNOWN.
CC -!- MISCELLANEOUS: TWO ISOFORMS EXISTS, A MAJOR FORM PL (SHOWN HERE)
CC AND A MINOR FORM SI.
CC -!- SIMILARITY: BELONGS TO THE PLANT THIONIN FAMILY.
Other CC topics: CATALYTIC ACTIVITY, TISSUE SPECIFICITY

SwissProt records (5)
DR: Database Cross Reference
DR PIR; A01805; KECX.
DR PDB; 1CRN; 16-APR-87.
DR PDB; 1CBN; 31-JAN-94.
DR PDB; 1CCM; 31-OCT-93.
DR PDB; 1CCN; 31-JAN-94.
DR PDB; 1CNR; 31-AUG-94.
DR PDB; 1AB1; 12-AUG-97.
DR INTERPRO; IPR001010; -.
DR PFAM; PF00321; plant_thionins; 1.
DR PRINTS; PR00287; THIONIN.
DR PROSITE; PS00271; THIONIN; 1.
KW: Keyword. Not standardized (under control of depositor)
KW Thionin; 3D-structure.
PROSITE pattern (plant thionins signature): C-C-x(5)-R-x(2)-[FY]-x(2)-C [The three C's are involved in disulfide bonds]
InterPro: Integrated Resource of Protein Domains and Functional Sites.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family; its diagnostic power is refined by iterative scanning of OWL. Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space. Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs: the database thus provides a useful adjunct to PROSITE.
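The plant thionins signature above is written in PROSITE pattern syntax, which translates almost directly into a regular expression. The converter below is a simplified sketch written for this example (the function name is invented, and it handles only plain residues, x, [..] sets, {..} exclusions and (n) or (n,m) repeats, not the full PROSITE syntax). It is applied to the crambin sequence from the SQ record of the same entry.

```python
import re

# One pattern element: a residue, x, [set] or {excluded set}, optionally
# followed by a repeat count such as (5) or (2,4).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into Python regex syntax."""
    parts = []
    for element in pattern.strip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":                 # x = any amino acid
            residue = "."
        elif residue.startswith("{"):      # {P} = anything except P
            residue = "[^" + residue[1:-1] + "]"
        if low is not None:                # x(5) -> .{5}, x(2,4) -> .{2,4}
            residue += "{" + low + ("," + high if high else "") + "}"
        parts.append(residue)
    return "".join(parts)


regex = prosite_to_regex("C-C-x(5)-R-x(2)-[FY]-x(2)-C")
crambin = "TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN"
hit = re.search(regex, crambin)
print(regex)                      # CC.{5}R.{2}[FY].{2}C
print(hit.start() + 1, hit.group())  # 3 CCPSIVARSNFNVC
```

The match starts at residue 3, so the signature covers residues 3 to 16 of crambin, which is in line with the PROSITE cross-reference (PS00271) listed in the DR lines.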

SwissProt records (6) FT Feature table data
FT DISULFID FT DISULFID FT DISULFID FT VARIANT P -> S (IN ISOFORM SI). FT VARIANT L -> I (IN ISOFORM SI). FT STRAND FT HELIX FT TURN FT HELIX FT TURN FT STRAND FT TURN

Feature table Other features: post-translational modifications, binding sites, enzyme active sites, local secondary structure or other characteristics reported in the cited references. Sequence conflicts between references are also included. FT CONFLICT MISSING (IN REF. 2). FT MUTAGEN G->R,L,M: DNA BINDING LOST. FT MOD_RES PHOSPHORYLATION (BY PKC). FT LIPID 1 1 MYRISTATE. FT CARBOHYD GLUCOSYLGALACTOSE. FT METAL COPPER (POTENTIAL). FT BINDING HEME (COVALENT). FT PROPEP ACTIVATION PEPTIDE. FT DOMAIN EXTRACELLULAR (POTENTIAL). FT ACT_SITE ACCEPTS A PROTON DURING CATALYSIS. FT VARSPLIC GRP -> DVR (IN SHORT FORM).

SwissProt records (7)
SQ: sequence header
SQ SEQUENCE 46 AA; MW; 919E68AF159EF722 CRC64;
Sequence data
TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN
//: termination line
The number 919E68AF159EF722 (CRC64) is the so-called checksum: a number calculated by the computer to check whether the data are still correct.
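Since every line of a SwissProt record starts with a two-letter line code, a record can be read with a very small parser. The sketch below is illustrative only: the record text is an abbreviated version of the crambin entry pieced together from these slides (values lost in the slides, such as the molecular weight, are simply left out), and the function name is invented.

```python
from collections import defaultdict

# Abbreviated crambin record assembled from the slides (not a complete entry).
RECORD = """\
ID   CRAM_CRAAB     STANDARD;      PRT;    46 AA.
AC   P01542;
DE   CRAMBIN.
OS   Crambe abyssinica (Abyssinian crambe).
KW   Thionin; 3D-structure.
SQ   SEQUENCE   46 AA;
     TTCCPSIVAR SNFNVCRLPG TPEALCATYT GCIIIPGATC PGDYAN
//
"""


def parse_flat_file(text):
    """Group lines by their two-letter code; collect the sequence separately."""
    fields = defaultdict(list)
    sequence = []
    for line in text.splitlines():
        if line.startswith("//"):          # termination line
            break
        code, content = line[:2], line[5:].strip()
        if code.strip():                   # ordinary line with a line code
            fields[code].append(content)
        else:                              # sequence lines start with blanks
            sequence.append(content.replace(" ", ""))
    return dict(fields), "".join(sequence)


fields, sequence = parse_flat_file(RECORD)
print(fields["AC"], fields["ID"][0].split()[0], len(sequence))
```

The parsed sequence length can then be compared with the length stated on the SQ line, a simpler consistency check in the same spirit as the checksum.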

EMBL database
Nucleotide database
EMBL & EMNEW, July 2001:
  EMBL: 3,951,820 entries, EMNEW: 323,703
  EMEST*: 8,092,600, EMNEWEST*: 619,777
  *) EMEST/EMNEWEST = EST section of EMBL; EST = expressed sequence tag
EMBL records follow roughly the same scheme as SwissProt
Obligatory deposit of sequence in EMBL (or SwissProt) before publication

Protein Data Bank (PDB)
Databank for macromolecular structure data (3-dimensional coordinates)
Obligatory deposit of coordinates in the PDB before publication
~16,000 entries (October 2001)
PDB file is a keyword-organised flat file (80 columns):
  human readable
  every line starts with a keyword (3-6 letters)
  platform independent
Started ca. 25 years ago (on punched cards!)
Besides proteins, also DNA & RNA!

PDB records (1) Filename= accession number= PDB Code 1) Filename is 4 positions (often 1 digit & 3 letters, e.g. 1CRN) 2) Be aware: 0HYK means entry HYK does not contain coordinates HEADER describes molecule & gives deposition date HEADER PLANT SEED PROTEIN APR CRN 1CRND 1 CMPND name of molecule COMPND CRAMBIN CRN 4 SOURCE organism SOURCE ABYSSINIAN CABBAGE (CRAMBE ABYSSINICA) SEED CRN 5

PDB records (2) AUTHOR AUTHOR W.A.HENDRICKSON,M.M.TEETER 1CRN 6 JRNL
The depositor JRNL JRNL AUTH M.BLABER,X.-J.ZHANG,B.W.MATTHEWS L 10 JRNL TITL STRUCTURAL BASIS OF ALPHA-HELIX PROPENSITY AT TWO L 11 JRNL TITL 2 SITES IN T4 LYSOZYME L 12 JRNL REF SCIENCE V L 13 JRNL REFN ASTM SCIEAS US ISSN L 14 REMARK Not standardized: many different REMARK records & subrecords! REMARK 1 REFERENCE CRNC 10 REMARK 1 AUTH M.M.TEETER,W.A.HENDRICKSON CRN 16 REMARK 1 TITL HIGHLY ORDERED CRYSTALS OF THE PLANT SEED PROTEIN 1CRN 17 REMARK 1 TITL 2 CRAMBIN CRN 18 REMARK 1 REF J.MOL.BIOL V CRN 19 REMARK 1 REFN ASTM JMOBAK UK ISSN CRN 20 REMARK CRN 21 REMARK 2 RESOLUTION. 1.5 ANGSTROMS CRN 22

PDB records (3)
SEQRES: sequence of the protein. Be aware: 3D coordinates are not always present for all the amino acids in SEQRES!
SEQRES THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE 1CRN 51
SEQRES ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS 1CRN 52
SEQRES ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR 1CRN 53
SEQRES CYS PRO GLY ASP TYR ALA ASN CRN 54
HET & FORMUL: metals, cofactors, ions, etc.
HET NAD A NAD CO-ENZYME MDH 219
HET SUL A SULFATE MDH 220
HET NAD B NAD CO-ENZYME MDH 221
HET SUL B SULFATE MDH 222
FORMUL 3 NAD 2(C21 H28 N7 O14 P2) MDH 223
FORMUL 4 SUL 2(O4 S1) MDH 224
FORMUL 5 HOH *471(H2 O1) MDH 225
The HET and FORMUL lines above come from this entry:
HEADER OXIDOREDUCTASE(NAD(A)-CHOH(D)) APR MDH MDH 3
COMPND CYTOPLASMIC MALATE DEHYDROGENASE (E.C ) MDH 4
SOURCE PORCINE (SUS $SCROFA) HEART MDH 5
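SEQRES records list the residues with three-letter codes, so a common first step is converting them to the one-letter codes used in sequence databases. A minimal sketch, assuming the four SEQRES lines as they appear on this slide (the real records also carry serial numbers and a residue count, which the filter below simply skips); the function name is invented for this example.

```python
# Three-letter to one-letter codes for the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

SEQRES_LINES = [
    "SEQRES THR THR CYS CYS PRO SER ILE VAL ALA ARG SER ASN PHE",
    "SEQRES ASN VAL CYS ARG LEU PRO GLY THR PRO GLU ALA ILE CYS",
    "SEQRES ALA THR TYR THR GLY CYS ILE ILE ILE PRO GLY ALA THR",
    "SEQRES CYS PRO GLY ASP TYR ALA ASN",
]


def seqres_to_one_letter(lines):
    """Concatenate the residues of all SEQRES lines as one-letter codes."""
    residues = []
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == "SEQRES":
            # Keep only tokens that are residue names; this skips serial
            # numbers, chain identifiers and residue counts.
            residues.extend(THREE_TO_ONE[t] for t in tokens[1:] if t in THREE_TO_ONE)
    return "".join(residues)


pdb_sequence = seqres_to_one_letter(SEQRES_LINES)
print(pdb_sequence, len(pdb_sequence))
```

Comparing the result with the SwissProt sequence shows a single difference at position 25 (I in the PDB entry, L in SwissProt), which appears to correspond to the L -> I variant listed in the SwissProt feature table.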

PDB records (4) HELIX/SHEET/TURN Secondary structure elements as provided by the crystallographer (subjective) HELIX H1 ILE PRO /10 CONFORMATION RES 17, CRN 55 SHEET S1 2 CYS ILE CRN 58 TURN T1 PRO TYR CRN 59 SSBOND disulfide bridges SSBOND 1 CYS CYS CRN 60 SSBOND 2 CYS CYS CRN 61 CRYST1, ORIGX1, ORIGX2, ORIGX3, SCALE1, SCALE2, SCALE3 crystallographic parameters CRYST P CRN 63 ORIGX CRN 64 ORIGX CRN 65 ORIGX CRN 66 SCALE CRN 67 SCALE CRN 68 SCALE CRN 69

PDB records (5)
ATOM: one line for each atom with its unique name and its x,y,z coordinates
ATOM N THR CRN 70
ATOM CA THR CRN 71
ATOM C THR CRN 72
ATOM O THR CRN 73
ATOM CB THR CRN 74
ATOM OG1 THR CRN 75
ATOM CG2 THR CRN 76
ATOM N THR CRN 77
ATOM CA THR CRN 78
ATOM C THR CRN 79
ATOM O THR CRN 80
TER: record terminates the amino acid chain
ATOM OD1 ASN CRN 394
ATOM ND2 ASN CRN 395
ATOM OXT ASN CRN 396
TER ASN CRN 397
One TER record per molecule, so if the protein is a dimer, then there are two TER records.

PDB records (6) HETATM atomic coordinate records for atoms within “HET & FORMUL”-lines (metals, cofactors, ions, …) and for water molecules HETATM 5158 AP NAD B MDH5495 HETATM 5159 AO1 NAD B MDH5496 HETATM 5160 AO2 NAD B MDH HETATM O HOH MDH5544 HETATM O HOH MDH5545 HETATM O HOH MDH5546

DB vs. Interface Confusion: Interface is not same as DB!
Interface is the method of access Database (DB) is the data itself Same DB accessed by different interfaces (UniProt from ExPASy or EBI)‏ One interface may be used to access different databases (SRS)‏

Maintainer status Large, academic, public institute
EBI, NCBI Quasi-academic institute TIGR, SIB Research group or scientist Company

Availability Publicly available, no restrictions
EMBL, GenBank Available, but with copyright May not be re-used in other DB UniProt Commercial Copyright May be accessible to academics at no charge

DB identifiers
Identify a DB item uniquely
A primary key for an item
Permanent
“Accession code”
  E.g. P01112
  Use this for UniProt, EMBL, etc.
“Entry name”
  E.g. RASH_HUMAN
  Warning: may change!
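Accession codes and entry names are easy to tell apart programmatically. The sketch below uses the regular expression for accession numbers given in the current UniProt documentation; treating that pattern as applicable here is an assumption of this example, and the helper name is invented.

```python
import re

# Accession-number pattern as published in the UniProt documentation
# (assumed here; check the current documentation before relying on it).
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def looks_like_accession(identifier: str) -> bool:
    """True if the whole identifier has the shape of a UniProt accession."""
    return UNIPROT_ACCESSION.fullmatch(identifier) is not None


for identifier in ("P01112", "P01542", "RASH_HUMAN"):
    print(identifier, looks_like_accession(identifier))
```

A check like this only tests the shape of the identifier; it does not tell you whether the accession actually exists in the database.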

Accession codes and updates
DB items may be merged or split
  Two sequence entries merged, e.g. they were actually the same protein
  A sequence entry split, e.g. actually from two different genes
Primary accession code: the new, recommended code
Secondary accession code: the old one, kept for traceability
Version numbers in some DBs
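One way to picture the relation between secondary and primary accession codes is a lookup table that is followed until a current code is reached. This is a toy sketch: the accession codes below are made up, and real databases publish this history in their own formats.

```python
# Hypothetical merge history: old (secondary) code -> code that replaced it.
SECONDARY_TO_PRIMARY = {
    "OLD001": "NEW001",
    "OLD002": "NEW001",   # two entries merged into one
    "OLDER01": "OLD001",  # an even older code, replaced twice
}


def resolve_accession(accession: str) -> str:
    """Follow secondary codes until the current primary code is reached."""
    seen = set()
    while accession in SECONDARY_TO_PRIMARY:
        if accession in seen:
            raise ValueError("cycle in accession history")
        seen.add(accession)
        accession = SECONDARY_TO_PRIMARY[accession]
    return accession


print(resolve_accession("OLDER01"))  # NEW001
print(resolve_accession("NEW001"))   # NEW001
```

A split entry does not fit this simple one-to-one table, because one old code then points to several new entries; handling that case would need a mapping from each code to a list of successors.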

Nucleotide sequence DBs
Primary EMBL, GenBank, Collaboration and synchronization Data submitted directly from sequencing projects, scientists Large, with subdivisions Redundant, fragments, rather messy… Publicly available, no restrictions

Genome DBs, 1 Nucleotides, but complete genome
Usually high quality, careful annotation Ensembl Eukaryotes Automatic annotation, links to other DBs SG Vega Manually curated, built on top of Ensembl

Genome DBs, 2 UniGene EntrezGene (LocusLink)‏
“An organized view of the transcriptome” EntrezGene (LocusLink)‏ d=Retrieve&dopt=full_report&list_uids=3265 Species-specific databases SGD, Saccharomyces cerevisiae, Berkeley Drosophila Genome Project,

Protein sequence DBs UniProt Swiss-Prot TrEMBL www.uniprot.org
UniProtKB = UniProt Knowledge Base UniProt = Swiss-Prot + PIR + TrEMBL Swiss-Prot Credible sequences Manual expert annotation TrEMBL Translations of EMBL nucleotide sequences Automatic, basic annotation Eventually integrated into Swiss-Prot

UniProt ExPASY interface (Swiss Inst Bioinfo)‏ EBI interface
EBI interface srv/uniProtView.do?proteinId=RASH_HUMAN&pag er.offset=0 Same data, different look; your choice

Protein sequence domains
Domains, motifs, families Patterns of similar residues in a section of a protein sequence Often a functional and structural unit The presence of a domain: hints on function Pfam Protein sequence domains Hidden Markov Models, software HMMER

Macromolecular 3D structure
Protein Data Bank, PDB Oldest computer-based bio-DB (1971)‏ 3D structures of proteins, oligonucleotides X-ray crystallography and NMR SCOP, Structural Classification Hierarchical scheme of classification Similar 3D structures in families

Others, 1 OMIM Online Mendelian Inheritance in Man
Human genes and genetic disorders d=190020

Others, 2 GeneCards GeneLynx Aggregate database; human
Links from gene to other DBs GeneLynx Aggregate database; human, rat, mouse

Databases Flat-file Relational Object-oriented

Flat-file databases Each entry is its own text file
Easy to set up and understand Computationally intensive to search No connections between records Doesn’t work for large databases

Relational databases Data is organized in sets of tables
Columns represent fields and rows represent records
Columns are indexed with an attribute that can be cross-referenced by other tables
Databases are queried using a structured query language (SQL)
Require much more planning to set up, but are much faster and easier to access than flat files
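To make the table idea concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The schema (table and column names) is invented for illustration; the rows reuse accession codes, entry names and cross-references that appear elsewhere in this deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE protein (
    accession  TEXT PRIMARY KEY,
    entry_name TEXT NOT NULL
);
CREATE TABLE cross_reference (
    accession  TEXT NOT NULL REFERENCES protein(accession),
    db_name    TEXT NOT NULL,
    identifier TEXT NOT NULL
);
CREATE INDEX idx_xref_accession ON cross_reference(accession);
""")

conn.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [("P01542", "CRAM_CRAAB"), ("P01112", "RASH_HUMAN")],
)
conn.executemany(
    "INSERT INTO cross_reference VALUES (?, ?, ?)",
    [("P01542", "PDB", "1CRN"), ("P01542", "PDB", "1CBN"),
     ("P01542", "PFAM", "PF00321")],
)

# Cross-reference the two tables through the shared accession column.
rows = conn.execute(
    """
    SELECT x.identifier
    FROM protein AS p
    JOIN cross_reference AS x ON x.accession = p.accession
    WHERE p.entry_name = ? AND x.db_name = ?
    ORDER BY x.identifier
    """,
    ("CRAM_CRAAB", "PDB"),
).fetchall()
pdb_ids = [row[0] for row in rows]
print(pdb_ids)  # ['1CBN', '1CRN']
```

The planning cost mentioned above shows up in the CREATE TABLE statements: the structure has to be decided before any data is loaded, whereas a flat file can simply be appended to.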

Object oriented databases
Store data as objects
Allows hierarchical associations to be made
Makes some programming tasks easier
Better suited to complex data types (e.g. protein-protein interactions, where there are multiple interactions between proteins)

Exercise Text-based searching of sequence databases
Refining searches using query logic (Boolean)‏ Retrieving data and formatting it for other programs

Exercise 1: Sequence database searching by keyword
Purpose: To practice with Boolean search terms to learn how to refine searches based on keywords. To familiarize yourself with navigating between databases within the Entrez system. To become familiar with text searches of integrated biological databases through Entrez. Activities: Find references to the genes in the literature Locate the database references for a particular human gene. Practice query searching with limits and wildcards.

1 AND 2: lipocalin AND disease (60 results)
1 OR 2: lipocalin OR disease (… results)
1 NOT 2: lipocalin NOT disease (530 results)
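The three Boolean operators correspond to set operations on the records that match each term. A toy sketch with made-up record numbers (not real search results):

```python
# Hypothetical record IDs matching each search term.
lipocalin = {1, 2, 3, 4}
disease = {3, 4, 5}

both = lipocalin & disease         # lipocalin AND disease -> intersection
either = lipocalin | disease       # lipocalin OR disease  -> union
only_first = lipocalin - disease   # lipocalin NOT disease -> difference

print(sorted(both))        # [3, 4]
print(sorted(either))      # [1, 2, 3, 4, 5]
print(sorted(only_first))  # [1, 2]
```

Note that the AND and NOT results together make up the full lipocalin set, just as the 60 and 530 hits in the slide's example would add up to all lipocalin records.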

Text based search of NCBI Entrez
Query 1: human MAP kinase inhibitor How many nucleotide, protein and gene records does this query return? Query 2: put quotes around the original query How many records are returned from the 3 databases if you search with quotes around the entire query?

Query 3: Drop the quotes and limit the records to human only.
How many records were returned from the nucleotide, protein and Gene database? From this final query, look at the records returned from the Gene database and go through the annotations of the first records. What family of kinase inhibitors is represented? With what kinases do they interact? What is the function of this family of kinase inhibitors? Despite searching for kinase inhibitors, why were so many kinase records returned? What strategy did you employ to restrict the searches to records from human only?
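The same kind of query can also be sent to Entrez programmatically through the NCBI E-utilities. The sketch below only builds the search URL and does not contact NCBI; the exact query string, and the use of the [Organism] field tag to restrict results to human, are one possible answer to the exercise rather than the required one, and the helper name is invented.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str) -> str:
    """Build an ESearch URL for one Entrez database and one query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


# Same query against the three databases used in the exercise.
query = "MAP kinase inhibitor AND human[Organism]"
urls = {db: esearch_url(db, query) for db in ("nucleotide", "protein", "gene")}
for db, url in urls.items():
    print(db, url)
```

Opening one of these URLs in a browser returns an XML document whose Count element gives the number of matching records.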

Query 4: Turn off any limits that you may have set in the previous searches and do the following query of Entrez: human catalase. How many records were returned from the nucleotide, protein and Gene databases? What is the official gene symbol for human catalase? What are 2 proteins that catalase interacts with? On what KEGG pathways is it located (include pathway name and ID)? What reaction does it catalyze? What is the enzyme classification number for human catalase? What co-factors are required for catalase to catalyze this reaction?

Query 5: Toll-like receptor genes.
Answer the following questions about human Toll-like receptor proteins. How many human toll-like receptors are there? What are their Refseq accession numbers, official genes symbols and gene names? Which ones have a mouse homolog? What is the mouse homolog Refseq accession number? What are the adapter molecules for each toll-like receptor? What is the major role of toll-like receptors in human immunology? Which toll-like receptors dimerize with each other?

Query 6: Do the following query of Entrez: HIV CCR5 receptor
How many records were returned in the nucleotide, protein and Gene databases? What is the official gene symbol for the HIV CCR5 receptor? What are two aliases for the official gene symbol? What is the official gene name for the HIV CCR5 receptor? What does the Gene record tell you about its structure? What is its importance in the biology of an HIV infection? How many exons does the gene have? How long is the protein?
