Biological Information and Biological Databases Meena K Sakharkar Bioinformatics Centre National University of Singapore.

Biological Information

Nature of Life Science Information Descriptive Classification and Nomenclatural Observational and Phenomenological Experimental Deduced/Computed Simulated? Theoretical?

Descriptive

Classify and Give Names Classification and Nomenclature Linnaeus - binomial nomenclature Group into kingdoms, phyla, classes, orders, families, genera, species, subspecies, strains, etc Associate descriptions to these classification schema, and classify according to description etc

Observational/Phenomenological Like descriptive, yet more active. Observe many biological phenomena. Charles Darwin; Gregor Mendel to McClintock. Start to do some experiments.

Experimental From dissections to complex genetic engineering experiments

BioInformatics Deduced/Computed Simulated? Theoretical?

What is BioInformatics? Many related terms and buzzwords A multiplicity of names: – bioinformatics – biocomputing – biological computing – computational biology – computational genomics – biological data mining

Overview of the challenges of Molecular Biology Computing The huge dataset problem –automated DNA sequencers –the Human Genome Project –bulk sequencing of cDNAs (ESTs)

Human Genome Project
What is the Human Genome Project?
– 15-year effort formally begun in October, coordinated by the U.S. Department of Energy and the National Institutes of Health.
– identify all the estimated 80,000 genes in human DNA,
– determine the sequences of the 3 billion chemical bases that make up human DNA,
– store this information in databases,
– develop tools for data analysis, and
– address the ethical, legal, and social issues (ELSI) that may arise from the project.

Who is head of the U.S. Human Genome Project? –The DOE Human Genome Program is directed by Ari Patrinos, and Francis Collins directs the NIH Human Genome Program. –Ari Patrinos also heads the Department of Energy Office of Biological and Environmental Research.

What are the comparative genome sizes of humans and other organisms being studied? If compiled in books, the data would fill an estimated 200 volumes the size of a Manhattan telephone book (at 1000 pages each), and reading it would require 26 years working around the clock
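The telephone-book comparison can be checked with simple arithmetic. The sketch below assumes the 3 billion base figure quoted on the Human Genome Project slide and a 365-day year; the variable names are illustrative only.

```python
# Back-of-envelope check of the "200 volumes / 26 years" comparison.
GENOME_BASES = 3_000_000_000   # 3 billion bases, as quoted for the human genome
VOLUMES = 200
PAGES_PER_VOLUME = 1000
YEARS_READING = 26

# Letters that must fit on each page for the genome to fill 200 volumes.
bases_per_page = GENOME_BASES / (VOLUMES * PAGES_PER_VOLUME)

# Reading speed implied by finishing in 26 years without a break.
seconds_reading = YEARS_READING * 365 * 24 * 3600
bases_per_second = GENOME_BASES / seconds_reading

print(f"{bases_per_page:.0f} bases per page")
print(f"{bases_per_second:.2f} bases per second")
```

Under these assumptions each page holds 15,000 letters and the reader covers between three and four bases per second.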

Informatics: Data Collection and Interpretation HUMAN GENETIC DIVERSITY The Ultimate Human Genetic Database Any two individuals differ in about 3 x 10^6 bases (0.1%). The population is now about 5 x 10^9. A catalog of all sequence differences would require 15 x 10^15 entries. This catalog may be needed to find the rarest or most complex disease genes.
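The catalog estimate follows from multiplying the per-person difference count by the population. The sketch below reproduces that arithmetic, assuming a genome of 3 billion bases (from the Human Genome Project slide) and treating every individual as contributing a full set of differences, which is an upper bound because many variants are shared.

```python
# Reproduce the slide's estimate of the size of a complete variation catalog.
GENOME_BASES = 3 * 10**9           # assumed genome size (3 billion bases)
DIFFERENCES_PER_PAIR = 3 * 10**6   # bases that differ between two individuals
POPULATION = 5 * 10**9             # population figure used on the slide

# Fraction of the genome that differs between two people.
fraction_different = DIFFERENCES_PER_PAIR / GENOME_BASES

# One catalog entry per difference per person (upper bound, no sharing).
catalog_entries = DIFFERENCES_PER_PAIR * POPULATION

print(f"{fraction_different:.1%} of bases differ")
print(f"{catalog_entries:.1e} catalog entries")
```

The product is 1.5 x 10^16, which is the same number as the 15 x 10^15 entries stated on the slide.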

Databases

Basic Terminology What is a nucleotide/protein sequence database and databank? A database is a collection of nucleotide/protein sequences and their associated annotations. Databanks are the groups which collect, compile, maintain and distribute the database.

Fundamental Dogma

Work from the Code of Life

Deduced and Computed Information in the Era of Computational Biology

Databases What are the different kinds of databases and their formats? Nucleic Acid Sequence: EMBL at EBI, GenBank at NCBI, DDBJ in Japan. Protein Sequence: SWISS-PROT, NBRF (PIR).

Databases Protein structure databases PDB: structural data for the proteins/nucleic acids whose 3-D structure has been solved by X-ray crystallography/NMR. NRL_3D: a sequence-structure database that can be used in conjunction with PIR (PDB with PIR).

GenBank Entry

EMBL Entry

SwissProt Entry

Other databases Genome Databases –GDB :Genome Data Bank –OMIM Pattern Databases –Prosite –TFD

Usage of databases
Annotation Searches - KW, Authors, Features.
– What is the protein sequence for human insulin?
– What does the 3D structure of calmodulin look like?
– What is the genetic location of the cystic fibrosis gene?
– List all introns in rat.
Homology Searches
– Is there any protein sequence that is similar to mine?
– Is this gene known in any other species?
– Has someone already cloned this sequence?

Usage of databases
Pattern searches
– Does my sequence contain any known motif (that can give me a clue about the function)?
– Which known sequences contain this motif?
– Is any part of my sequence recognised by a transcription factor?
– List all known start, splice and stop signals in my genomic sequence.
Prediction - Use the database as a knowledge base
– What may the structure of my protein be? Secondary structure prediction; modeling by homology.
– What is the gene structure of my genomic sequence?
– Which parts of my protein have a high antigenicity?
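A pattern search of the kind listed above can be sketched by translating a PROSITE-style pattern into a regular expression. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P}; the function names and the toy sequence are hypothetical, and the translation covers only the common pattern elements (x, [..], {..}, repeat counts, anchors).

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")                          # optional trailing period
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")    # sequence-end anchors
    return regex

def find_motif(sequence: str, pattern: str):
    """Return (1-based position, matched text) for every hit, overlaps included."""
    regex = prosite_to_regex(pattern)
    # A lookahead lets matches that overlap each other all be reported.
    return [(m.start() + 1, m.group(1))
            for m in re.finditer(f"(?=({regex}))", sequence)]

toy_sequence = "MNGTAANPSQNLSK"   # made-up sequence for illustration
hits = find_motif(toy_sequence, "N-{P}-[ST]-{P}")
print(hits)   # [(2, 'NGTA'), (11, 'NLSK')]
```

The candidate at position 7 (NPSQ) is rejected because proline follows the asparagine, which is exactly what the {P} element excludes.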

Usage of Databases Comparisons: –Gene Families –Phylogenetic Trees

GenBank Growth Chart (figure: bases by year)

Evolutionary basis of Alignment Alignment enables the researcher to determine whether two sequences display sufficient similarity to justify the inference of homology. Similarity is an observable quantity that may be expressed as, say, % identity or some other measure. Homology is a conclusion drawn from these data: that the two genes share a common evolutionary history.
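As an illustration of similarity as an observable quantity, the sketch below computes % identity for two sequences that have already been aligned. Several conventions exist for the denominator; this version assumes it is the number of alignment columns, skipping columns where both sequences have a gap. The function name and the example alignment are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap and b == gap:
            continue              # column carries no information
        compared += 1
        if a == b:
            matches += 1          # identical residues (never two gaps here)
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared

# Made-up alignment: 9 columns, 6 identical, 2 mismatches, 1 gap.
identity = percent_identity("MASS-VPPM", "MASTSVPAM")
print(f"{identity:.1f}% identity")
```

The number alone does not establish homology; it is the measurement from which that inference is argued.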

Sequence Formats

Fasta Format
>SANJAY REFORMAT of: SANJAY.seq check: 8826 from: 1 to: 573 March 12, 1998
MASSSVPPMITEEEARFEAEVSAVESWWRTDRFRLTRRPYSARDVVSLRGTLHHSYASDQ
MAKKLWRTLKSHQSAGTASRTFGALDPVQVTMMAKHLDTIYVSGWQCSSTHTATNEPGPD
LADYPYNTVPNKVEHLFFAQLYHDRKQHEARVSMTREQRAKTPYVDYLRPIIADGDTGFG
GATATVKLCKLFVERGAAGVHIEDQSSVTKKCGHMAGKVLVAVSEHINRLVAARLQFDVM
GVETVLVARTDAVAATLIQSNVDLRDHQFILGATNPDFKRRSLAAVLSAAMAAGKTGAVL
QAIEDDWLSRAGLMTFSDAVINGINRQLPEYEKQRRLNEWAAATEYSKCVSNEQGREIAE
RLGAGEIFWDWDIARTREGFYRFRGSVEAAVVRGRAFAPHADLIWMETSSPDLVECGKFA
QGMKASHPEIMLAYNLSPSFNWDAAGMTDEEMRDFIPRIAKMGFCWQFITLGGFHADALV
TDTFAREFAKQGMLAYVERIQREERNNGVDTLAHQKWSGANYYDRYLKTVQGGISSTAAM
GKGVTEEQFKEESRTGTRGLDRGGITVNAKSRL
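Reading this format is straightforward: a line starting with ">" opens a record, and every following line up to the next ">" belongs to its sequence. The sketch below is a minimal standard-library parser; the function name and the two toy records are hypothetical.

```python
def parse_fasta(text: str):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:                # close the final record
        records.append((header, "".join(chunks)))
    return records

example = """>seq1 toy protein
MASSSVPPMI
TEEEARFEAE
>seq2 toy peptide
MSTKYSASAE
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Joining the wrapped lines recovers the full sequence, so the line width used when the file was written does not matter to the reader.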

GCG Format
ckl.seq Length: 473 September 15, :25 Type: P Check:
1 MSTKYSASAE SASSYRRTFG SGLGSSIFAG HGSSGSSGSS RLTSRVYEVT
51 KSSASPHFSS HRASGSFGGG SVVRSYAGLG EKLDFNLADA INQDFLNTRT
101 NEKAELQHLN DRFASYIEKV RFLEQQNSAL TVEIERLRGR EPTRIAELYE
151 EEMRELRGQV EALTNQRSRV EIERDNLVDD LQKLKLRLQE EIHQKEEAEN
201 NLSAFRADVD AATLARLDLE RRIEGLHEEI AFLRKIHEEE IRELQNQMQE
251 SQVQIQMDMS KPDLTAALRD IRLQYEAIAA KNISEAEDWY KSKVSDLNQA
301 VNKNNEALRE AKQETMQFRH QLQSYTCEID SLKGTNESLR RQMSEDGGAA
351 GREAGGYQDT IARLEAEIAK MKDEMARHLR EYQDLLNVKM ALDVEIATYR
401 KLLEGEESRI SLPVQSFSSL SFRESSPEQH HHQQQQPQRS SEVHSKKTVL
451 IKTIETRDGE VVSESTQHQQ DVM
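In this layout each sequence line starts with the position of its first residue, followed by blocks of ten residues. A minimal sketch for turning such a numbered body back into a plain sequence is shown below; it handles only the sequence lines, not the header line, and the function name is hypothetical.

```python
def parse_gcg_body(body: str) -> str:
    """Strip position numbers and block spacing from GCG-style sequence lines."""
    blocks = []
    for line in body.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if fields[0].isdigit():
            fields = fields[1:]           # drop the leading position number
        blocks.extend(fields)
    return "".join(blocks)

# Last two lines of the entry shown on the slide.
body = """401 KLLEGEESRI SLPVQSFSSL SFRESSPEQH HHQQQQPQRS SEVHSKKTVL
451 IKTIETRDGE VVSESTQHQQ DVM"""
tail = parse_gcg_body(body)
print(len(tail), tail[:10])
```

The two lines hold 50 and 23 residues, so the entry ends at position 473, in agreement with the Length field in its header.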

Taxonomy Database

Blast Results

Examples of the New Biology
1. Full genome-genome comparisons
2. Rapid assessment of polymorphic genetic variations
3. Complete construction of orthologous or paralogous groups of genes
4. Structure determination of large macromolecular assemblies/complexes
5. Dynamic simulation of realistic oligomeric systems
6. Rapid structural/topological clustering of proteins
7. Prediction of unknown molecular structures; protein folding
8. Computer simulation of membrane structure and dynamic function
9. Simulation of genetic networks and the sensitivity of these pathways to component stoichiometry and kinetics
10. Integration of observations across scales of vastly different dimensions and organization to yield realistic environmental models for basic biology and societal needs

Theoretical? The day will dawn when we will have sufficient information to understand how basic life functions are integrated into a living cell, and how such cells intercommunicate and interoperate to function as a living whole. Then maybe, we can start talking about theoretical biology

Categories of BioDbs - by domain of information DNA RNA Protein Genomic Mapping Pathways Structure Bibliographic Biochemical/Molecular/Miscellaneous

Other categories By category of species By families or superfamilies of molecules etc Demo

Demonstration of BioDatabases The majority of life science databases are online, accessible on the Web via the Internet. Catalogs of databases are available. There is a need for a registry to keep track and offer quality control.