MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez 1 Essential Computing for Bioinformatics Lecture.

Slides:



Advertisements
Similar presentations
Zoology 305 Library Databases/Indexes Lab Goals for session: 1) Meet your librarian Kevin Messner 2) Understand.
Advertisements

NCBI/WHO PubMed/Hinari Course NCBI Literature Databases: PubMed Background.
NCBI data, sliding window programs and dot plots Sept. 25, 2012 Learning objectives-Become familiar with OMIM and PubMed. Understand the difference between.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
Genome databases and webtools for genome analysis Become familiar with microbial genome databases Use some of the tools useful for analyzing genome Visit.
1.
On line (DNA and amino acid) Sequence Information Lecture 7.
The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
How to use the web for bioinformatics Molecular Technologies Ethan Strauss X 1171
Bioinformatics for biomedicine Summary and conclusions. Further analysis of a favorite gene Lecture 8, Per Kraulis
GENBANK, SWISSPROT AND OTHERS As Problem Sources for CSE 549 Andriy Tovkach Genetics.
Evidence-Based Information Retrieval in Bioinformatics
Archives and Information Retrieval
Biological Databases Notes adapted from lecture notes of Dr. Larry Hunter at the University of Colorado.
Biological Databases Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center
IST Computational Biology1 Information Retrieval Biological Databases 2 Pedro Fernandes Instituto Gulbenkian de Ciência, Oeiras PT.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProt Jennifer McDowall, Ph.D. Senior InterPro Curator Protein Sequence Database:
Signaling Pathways and Summary June 30, 2005 Signaling lecture Course summary Tomorrow Next Week Friday, 7/8/05 Morning presentation of writing assignments.
An Introduction to Bioinformatics Molecular Biology Databases.
DEMO CSE fall. What is GeneMANIA GeneMANIA finds other genes that are related to a set of input genes, using a very large set of functional.
BTN323: INTRODUCTION TO BIOLOGICAL DATABASES Day2: Specialized Databases Lecturer: Junaid Gamieldien, PhD
Introductory Overview
On line (DNA and amino acid) Sequence Information
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
 The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students.
Bioinformatics.
Erice 2008 Introduction to PDB Workshop From Molecules to Medicine: Integrating Crystallography in Drug Discovery Erice, 29 May - 8 June Peter Rose
Databases in Bioinformatics and Systems Biology Carsten O. Daub Omics Science Center RIKEN, Japan May 2008.
Bioinformatics for biomedicine
X-ray crystallography NMR cryoEM Experimental approaches for structural biology.
Information Resources for Bioinformatics 1 MARC: Developing Bioinformatics Programs July, 2008 Alex Ropelewski Hugh Nicholas
NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015.
Searching PubMed® NCBI, NLM Resources, Micromedex -GSBS TTUHSC Preston Smith Library presents Rev. 08/17/14.
Biological Databases By : Lim Yun Ping E mail :
Doug Raiford Lesson 3.  More and more sequence data is being generated every day  Useless if not made available to other researchers.
MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
PIRSF Classification System PIRSF: Evolutionary relationships of proteins from super- to sub-families Homeomorphic Family: Homologous proteins sharing.
Protein and RNA Families
Overview of Bioinformatics 1 Module Denis Manley..
Other biological databases and ontologies. Biological systems Taxonomic data Literature Protein folding and 3D structure Small molecules Pathways and.
BIOLOGICAL DATABASES. BIOLOGICAL DATA Bioinformatics is the science of Storing, Extracting, Organizing, Analyzing, and Interpreting information in biological.
Protein Domain Database
Bioinformatics and Computational Biology
Computer Storage of Sequences
Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.
March 28, 2002 NIH Proteomics Workshop Bethesda, MD Lai-Su Yeh, Ph.D. Protein Scientist, National Biomedical Research Foundation Demo: Protein Information.
An Introduction to NCBI & BLAST National Center for Biotechnology Information Richard Johnston Pasadena City College.
Protein databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen and from CSC bio-opas
GENBANK FILE FORMAT LOCUS –LOCUS NAME Is usually the first letter of the genus and species name, followed by the accession number –SEQUENCE LENGTH Number.
Information retrieval and sliding window programs April 5, 2011 Hand in Homework #1. Homework #2 due Tuesday, April 12. Learning objectives- Understand.
Alex Ropelewski Pittsburgh Supercomputing Center National Resource for Biomedical Supercomputing Bienvenido Vélez
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
 What is MSA (Multiple Sequence Alignment)? What is it good for? How do I use it?  Software and algorithms The programs How they work? Which to use?
Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Using Molecular Biology to Teach Computer Science High-level Programming with Python Finding Patterns.
High-level Programming with Python Expressions and Variables Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer.
MARC: Developing Bioinformatics Programs Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist: Learning.
1 A Short Introduction to Analyzing Biological Data Using Relational Databases Part II: Creating a Relational Database to Model Biological Data Alex Ropelewski.
MARC: Developing Bioinformatics Programs June 2012 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez Reference: How to Think Like a Computer Scientist:
Biological Databases By: Komal Arora.
Demo: Protein Information Resource
Archives and Information Retrieval
생물정보학 Bioinformatics.
Genome Annotation Continued
Mangaldai College, Mangaldai
Introduction to Bioinformatics
Tutorial: Bioinformatics Resources
SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.
Presentation transcript:

MARC: Developing Bioinformatics Programs July 2009 Alex Ropelewski PSC-NRBSC Bienvenido Vélez UPR Mayaguez 1 Essential Computing for Bioinformatics Lecture 2 Using Bioinformatics Data Sources

The following material is the result of a curriculum development effort to provide a set of courses to support bioinformatics efforts involving students from the biological sciences, computer science, and mathematics departments. They have been developed as a part of the NIH funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum development effort include: Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University. Dr. Ricardo González Méndez, University of Puerto Rico Medical Sciences Campus. Dr. Alade Tokuta, North Carolina Central University. Dr. Jaime Seguel and Dr. Bienvenido Vélez, University of Puerto Rico at Mayagüez. Dr. Satish Bhalla, Johnson C. Smith University. Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon University. Permission is granted for use, modify, and reproduce these materials for teaching purposes. Most recent versions of these presentations can be found at Essential Computing for Bioinformatics

Course Outline Course Overview Introduction to Information Needs and Databases Unstructured Data Repositories – Query models and implementation issues Structured Data Repositories – Query models and implementation issues Biology-specific Repositories – Query models and implementation issues These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 3

Outline These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 4

Definition: Biological Database Any repository containing Biological information which can be used to: – assess the current state of knowledge – Formulate new scientific hypotheses – Validate these hypotheses Some Examples of Biological Databases These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 5  Sequence  Structure  Family/Domain  Species  Taxonomy  Function/Pathway  Disease/Variation  Publication Journal  And many other ways

How is Biological Information Stored? From a computer-science perspective, there are several ways that data can be organized and stored: – In a flat text file – In a spreadsheet – In an image – In an video animation – In a relational database – In a networked (hyperlinked) model – In any combination of the above – Others These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 6

Sequence Data Libraries Organized according to sequence When one talks about “searching sequence databases” these are the libraries that they are searching Main sources for sequence libraries are direct submissions from individual researchers, genome sequencing projects, patent applications and other public resources. – Genbank, EMBL, and the DNA Database of Japan (DDBJ) are examples of annotated collections publicly available DNA sequences. – The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and annotation data These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 7

Structural Data Libraries Contain information about the (3-dimensional) structure of the molecule Main sources of structural data are direct submissions from researchers. Data can be submitted via a variety of experimental techniques including – X-ray crystallography – NMR structure depositions. – EM structure depositions. – Other methods (including Electron diffraction, Fiber diffraction). The Protein Data Bank and the Cambridge Structural Database are two well-known repositories of structural information These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 8

Family and Domain Libraries Typically built from sets of related sequences and contain information about the residues that are essential to the structure/function of the sequences Used to: – Generate a hypothesis that the query sequence has the same structure/function as the matching group of sequences. – Quickly identify a good group of sequences known to share a biological relationship. Some examples: – PFAM, Prosite, BLOCKS, PRINTS These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 9

Species Libraries Goal is to collect and organize a variety of information concerning the genome of a particular species Usually each species has its own portal to access information such as genomic-scale datasets for the species. Examples: – EuPathDB - Eukaryotic Pathogens Database (Cryptosporidium, Giardia, Plasmodium, Toxoplasma and Trichomonas) – Saccharomyces Genome Database – Rat Genome Database – Candida Genome Database These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 10

Taxonomy Libraries The science of naming and classifying organisms Taxonomy is organized in a tree structure, which represents the taxonomic lineage. Bottom level leafs represents species or sub-species Top level nodes represent higher ranks like phylum, order and family Examples: – NEWT – NCBI Taxonomy These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 11

Taxonomy Libraries - NEWT These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 12

NCBI Taxonomy Browser These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 13

Function/Pathway Collection of pathway maps representing our knowledge on the molecular interaction and reaction networks for: – Metabolism – Genetic Information Processing – Environmental Information Processing – Cellular Processes – Human Diseases – Drug Development Examples: – KEGG Pathway Database – NCI-Nature Pathway Interaction Database These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 14

Disease/Variation Catalogs of genes involving variations including within populations and among populations in different parts of the world as well as genetic disorders and other diseases. Examples: – OMIM, Online Mendelian Inheritance in Man - focuses primarily on inherited, or heritable, genetic diseases in humans – HapMap - a catalog of common genetic variants that occur in humans. These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 15

Journal U.S. National Library of Medicine PubMed is the premiere resources for scientific literature relevant to the biomedical sciences. Includes over 18 million citations from MEDLINE and other life science journals for articles back to the 1950s. PubMed includes links to full text articles and other related resources. Common uses of PubMed: – Find journal articles that describe the structure/function/evolution of sequences that you are interested in – Find out if anyone has already done the work that you are proposing These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 16

Current databases are loosely integrated In order to prove a hypothesis one must often collect information from several independent databases and tools Lots of time are spent converting data back and forth among the multiple specific formats required by the various tools and databases Discovery process may take a long time, weeks or even months, to complete and tools do not effectively assist the scientist in saving intermediate results in order to continue the search from that point at a later time. These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 17 What has been done about this?

Integrated Information Resources Integrated resources typically use a combination of relational databases and hyperlinks to databases maintained by others to provide more information than any single data source can provide Many Examples: – NCBI Entrez – NCBI’s cross-database tool – iProClass - proteins with links to over 90 biological databases. including databases for protein families, functions and pathways, interactions, structures and structural classifications, genes and genomes, ontologies, literature, and taxonomy – InterPro - Integrated Resource Of Protein Domains And Functional Sites. These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 18

NCBI Entrez Data Integration These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 19

NCBI Entrez These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 20

NCBI Entrez Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 21

NCBI Entrez PubMed Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 22

NCBI Entrez OMIM Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 23

NCBI Entrez Core Nucleotide Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 24

NCBI Entrez Core Nucleotide Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 25

NCBI Entrez Core Nucleotide Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 26

NCBI Entrez Core Nucleotide Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 27

NCBI Entrez Saving Sequences These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 28

NCBI Sequence Identifiers Accession Number: unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ). These identifiers follow an accession.version format. Updates increment the version, while the accession remains constant. GI: GenInfo Identifier. If a sequence changes a new GI number will be assigned. A separate GI number is also assigned to each protein translation. These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 29

iProClass Protein Knowledgebase Protein centric Links to over 90 biological data libraries Goal is to provide a comprehensive picture of protein properties that may lead to functional inference for previously uncharacterized "hypothetical" proteins and protein groups. Uses both data warehousing in relational databases as well as hypertext links to outside data sources These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 30

iProclass Integration These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 31

iProclass Search Form These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 32

iProclass Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 33

iProClass SuperFamily Summary These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 34

iProClass SuperFamily Summary These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 35

iProClass SuperFamily Summary These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 36

iProClass SuperFamily Summary These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 37

iProClass PDB Structure 1a27 These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 38

iProClass Domain Architecture These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 39

PIRSF Family Hierarchy These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 40

iProClass Taxonomy Nodes These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 41

iProClass Enzyme Function: KEGG These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 42

iProClass Pathway: KEGG These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 43

iProClass: Saving Sequences These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 44 Check Entries Save Format

InterPro Integrated resource of protein families, domains, repeats and sites from member databases (PROSITE, Pfam, Prints, ProDom, SMART and TIGRFAMs). Member databases represent features in different ways: Some use hidden Markov models, some use position specific scoring meaticies, some use ambiguous consensus patterns. Easy way to search several libraries at once with a query. These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 45

InterPro – Searching with InterProScan These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 46

InterPro - InterProScan Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 47

InterPro - InterProScan Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 48

InterPro - InterProScan Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 49

InterPro - InterProScan Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 50

InterPro - InterProScan Results These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 51

A Vision: Computer Assisted Bioinformatics Goal – The computer assists the scientist in the collection of all bioinformatics information relevant to the hypothesis at hand A single software application that can: Understand multiple data formats specifically devised to represent structure, function, metabolism, evolution, etc. Assist scientists in creating and maintaining relationships among different types of information collected from multiple sources Support simultaneous searches across multiple data sources of a similar nature (e.g. multiple sequence databases) These materials were developed with funding from the US National Institutes of Health grant #2T36 GM to the Pittsburgh Supercomputing Center 52 Remains an Open Research Problem