CS177 Lecture 8 Bioinformatics Databases (and genetic diseases) Tom Madej 10.31.05.


Lecture overview Very brief and fast overview of on-line databases. Formulating queries in Entrez. Molecular biology of diseases, including an extensive example involving a lot of linking between a number of Entrez databases.

Bioinformatics Resources Reference: Chapter 3 in Sequence – Evolution – Function, E.V. Koonin and M.Y. Galperin, Kluwer Academic. Available on the NCBI Bookshelf.

Sequence Databases GenBank, EMBL, DDBJ: archival (International Nucleotide Sequence Database Collaboration); sequences have a common accession. SWISS-PROT: curated, non-redundant, entries hyperlinked e.g. to PubMed. TrEMBL: entries not yet ready for SWISS-PROT. Motifs: PROSITE, BLOCKS, PRINTS. Domains: Pfam, SMART, ProDom, COGs (NCBI). Motifs/domains: InterPro, CDD (NCBI).

More databases… Structure: PDB/RCSB, MMDB (NCBI), SCOP, CATH, FSSP. Organism-specific: e.g. E. coli, B. subtilis, Synechocystis sp. (bacteria); yeast (unicellular eukaryote); Arabidopsis, C. elegans (WormBase), fruit fly, human. COGs: clusters of orthologous groups. KEGG: biochemical pathways. BIND: protein-protein interactions. ENZYME, LIGAND: enzymes and their substrates. PubChem (NCBI): chemical substances.

PubChem (new)

The (ever expanding) Entrez System Entrez: PopSet, Structure, PubMed, Books, 3D Domains, Taxonomy, GEO/GDS, UniGene, Nucleotide, Protein, Genome, OMIM, CDD/CDART, Journals, SNP, UniSTS, PubMed Central, Gene, HomoloGene, NLM Catalog, PubChem (BioAssays, Compounds, Substances).

Links Between and Within Nodes [Figure: Entrez nodes (PubMed abstracts, nucleotide sequences, protein sequences, 3-D structures, genomes, taxonomy) joined by computational links labeled word weight, BLAST, VAST, and phylogeny.]

PubMed: Computation of Related Articles The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths. The value of a term depends on Global and Local types of information: G - the number of different documents in the database that contain the term; L - the number of times the term occurs in a particular document.

Global and local weights The global weight of a term is greater for the less frequent terms. The presence of a term that occurred in most of the documents would really tell one very little about a document. The local weight of a term is the measure of its importance in a particular document. Generally, the more frequent a term is within a document, the more important it is in representing the content of that document.

How we define similar documents The similarity between two documents is computed by adding up the weights (local wt1 × local wt2 × global wt) of all of the terms the two documents have in common. All results are ranked and the most similar documents become Related Articles
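
The scoring rule above can be sketched in a few lines. The lecture does not give the weight formulas, so the two used here are assumptions made only for illustration: global weight = log(N / G), and local weight = term count divided by document length (a crude length adjustment). The function names and the toy documents are hypothetical, and this is not PubMed's actual implementation.

```python
import math
from collections import Counter

def global_weights(docs):
    """Assumed global weight: log(N / G), G = number of documents containing the term."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n / g) for term, g in doc_freq.items()}

def local_weights(doc):
    """Assumed local weight: occurrences of the term divided by document length."""
    counts = Counter(doc)
    return {term: c / len(doc) for term, c in counts.items()}

def similarity(doc1, doc2, gw):
    """Sum of local wt1 * local wt2 * global wt over the terms both documents share."""
    l1, l2 = local_weights(doc1), local_weights(doc2)
    return sum(l1[t] * l2[t] * gw[t] for t in set(l1) & set(l2))

def related(index, docs, top=2):
    """Rank all other documents by similarity to docs[index]; ties broken by position."""
    gw = global_weights(docs)
    scores = [(similarity(docs[index], d, gw), j)
              for j, d in enumerate(docs) if j != index]
    return [j for _, j in sorted(scores, key=lambda s: (-s[0], s[1]))[:top]]

# Hypothetical toy "abstracts", already split into words.
docs = [
    "hemochromatosis hfe mutation iron".split(),
    "hfe mutation iron overload patients".split(),
    "p53 tumor suppressor mutation".split(),
    "sickle cell hemoglobin mutation".split(),
]
print(related(0, docs, top=1))
```

In this toy set the word "mutation" occurs in every document, so under the assumed formula its global weight is zero and it contributes nothing to any score, which mirrors the point that a term found in most documents tells one very little.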

Entrez database queries The databases are indexed by different sets of terms. You can get to a particular DB by selecting it and then entering a “null” query. The “Preview/Index” tab displays the index terms and can be used to formulate a query (if you can’t remember the syntax for the index). “Limits” can be used e.g. to select publications in a specified time range. “Details” shows the interpretation of the query.

Exercises! How many protein structures are there that include DNA and are from bacteria? “bacteria [orgn] AND 1:100 [DNAChainCount]” In PubMed, how many articles are there from the journal Science and have “Alzheimer” in the title or abstract, and “amyloid beta” anywhere? How many since the year 2000? Notice that the results are not 100% accurate! In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse? “0:2 [HelixCount] AND 8:10 [StrandCount] AND mouse [orgn]”
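
The exercise queries can also be assembled programmatically. The sketch below only builds query strings and a search URL for NCBI's E-utilities `esearch` endpoint; it does not contact the network. The helper names are hypothetical, and it is an assumption that the database name `structure` and the field names from these slides are still indexed as written.

```python
from urllib.parse import urlencode, urlparse, parse_qs

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def fielded(value, field):
    """Attach an index field to a value, e.g. bacteria[orgn] or 1:100[DNAChainCount]."""
    return f"{value}[{field}]"

def all_of(*terms):
    """Combine terms with AND, as in the exercise queries."""
    return " AND ".join(terms)

def esearch_url(db, term):
    """Build (but do not fetch) an esearch URL for the given database and query."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term})

# First exercise: bacterial structures that include 1 to 100 DNA chains.
structure_query = all_of(fielded("bacteria", "orgn"),
                         fielded("1:100", "DNAChainCount"))
url = esearch_url("structure", structure_query)  # db name is an assumption
print(structure_query)
print(url)
```

Pasting the printed query into the Entrez search box and checking "Details" is a quick way to confirm how the terms were actually interpreted.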

Investigating genetic diseases Now we will see examples of how bioinformatics databases can be used to investigate genetic diseases.

Gene variants that can affect protein function Mutation to a stop codon; truncates the protein product! Insertion/deletion of multiple bases; changes the sequence of amino acid residues. Single point change could alter folding properties of the protein. Single point change could affect the active site of the protein. Single point change could affect an interaction site with another molecule.

Lodish et al. Molecular Cell Biology, W.H. Freeman 2000

Sickle cell anemia The first “molecular disease”, i.e. the first genetic disease with a known molecular basis. The most common variant is caused by a Glu6Val mutation in the Hemoglobin β-chain (HbS). However, there are hundreds of other mutations that can cause this (OMIM lists 524 variants!). This mutation causes the hemoglobin to polymerize; in turn, the red blood cells form sickle shapes and clump together under low oxygen conditions or high hemoglobin concentrations. Confers some resistance to malaria, by inhibiting parasite growth.
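
A small sketch of how a single base change maps to a residue change, using the standard genetic code. The codon change shown for HbS (GAG to GTG) is the one usually cited for Glu6Val; the function name and the category labels are illustrative choices, not part of the lecture.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO,
))

def classify(ref_codon, alt_codon):
    """Label a codon change as synonymous, nonsense (new stop), or missense."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return f"missense {ref_aa}->{alt_aa}"

print(classify("GAG", "GTG"))  # Glu -> Val, the HbS change
print(classify("GAG", "TAG"))  # Glu -> stop, truncates the product
print(classify("GAG", "GAA"))  # still Glu
```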

NHLBI web site

Exercise! Find an appropriate Hemoglobin structure and view it in Cn3D. Check the position of the Glu6Val mutation.

p53 tumor suppressor protein Li-Fraumeni syndrome; only one functional copy of p53 predisposes to cancer. Mutations in p53 are found in most tumor types. p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein, cdk2. This prevents the cell from progressing through the cell cycle.

G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003)

Exercise! Use Cn3D to investigate the binding of p53 to DNA. Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).

Important note! Most diseases (e.g. cancer) are complex and involve multiple factors (not just a single malfunctioning protein!).

Investigating a genetic disease… The following EST comes from a hemochromatosis patient; your task is to identify the gene and specific mutation causing the illness, and why the protein is not functioning properly. The sequence:
TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
TGGATCAGCCCCTCATTGTGATCTGGG

ESTs Expressed Sequence Tags; useful for discovering genes, obtaining data on gene expression/regulation, and in genome mapping. Short nucleotide sequences ( bases or so) derived from mRNA expressed in cells. The introns from the genes will already be spliced out. mRNA is unstable, however, and so it is “reverse transcribed” into cDNA.
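
As a minimal illustration of the reverse transcription step, the first-strand cDNA is the reverse complement of the mRNA, written in the DNA alphabet. The function name and the six-base example are hypothetical.

```python
# Map each RNA base to its complementary DNA base.
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")

def first_strand_cdna(mrna):
    """Return the first-strand cDNA (5'->3') complementary to an mRNA given 5'->3'."""
    return mrna.upper().translate(RNA_TO_CDNA)[::-1]

print(first_strand_cdna("AUGGCC"))
```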

Hemochromatosis 2 BLAST the example EST vs. the Human genome (could take a few minutes). - Which chromosome is hit? - What is the contig that is hit (reference assembly)? - Is the EST identical to the genomic sequence? - Take note of the coords of the difference. Click on “Genome View”. Select the map element at the bottom corresponding to the contig.

Hemochromatosis 3 What gene is hit? Zoom in on the BLAST hit a few times. Display the entire gene sequence via “dl” and “Display”. Copy and save the genomic sequence. Record the coords for the start of the genomic sequence.

Hemochromatosis 4 Add the UniGene map to the view (if it is not already there). Click on the UniGene link Hs Note: Expression profile presents data for the expression level of the gene in various tissues. How many mRNAs and ESTs are there for the HFE gene? Take note of the mRNA accession NM_

Hemochromatosis 5 Go to “spidey”: To determine the intron/exon structure, paste the HFE gene sequence into the upper box, and enter the HFE mRNA accession NM_ in the lower box. Click “Align”.

Hemochromatosis 6 How many exons are there? Which exon codes the residue that is changed in the original EST? (You have to do a little arithmetic!) Record some of the protein sequence around the changed residue: EQRYTCQVEHPG
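
The comparison between the EST and the recorded protein sequence can be checked with a short script: translate the EST in all three forward frames and look for the window that best matches EQRYTCQVEHPG. This is only a sketch; the function names are hypothetical, and it assumes the EST is on the coding strand and contains no indels in this region.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in BASES for b in BASES for c in BASES),
    AMINO,
))

# The EST from the lecture, with the slide's line breaks removed below.
EST = "".join("""
TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
TGGATCAGCCCCTCATTGTGATCTGGG
""".split())

def translate(dna, frame):
    """Translate one forward reading frame (0, 1, or 2); stop codons become '*'."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(frame, len(dna) - 2, 3))

def best_match(peptide, dna):
    """Find the translated window with the fewest mismatches to the peptide."""
    best = None
    for frame in range(3):
        protein = translate(dna, frame)
        for start in range(len(protein) - len(peptide) + 1):
            window = protein[start:start + len(peptide)]
            diffs = [(i + 1, ref, obs)
                     for i, (ref, obs) in enumerate(zip(peptide, window))
                     if ref != obs]
            if best is None or len(diffs) < len(best[2]):
                best = (frame, window, diffs)
    return best

frame, window, diffs = best_match("EQRYTCQVEHPG", EST)
print(window, diffs)
```

Each entry in `diffs` is (position within the 12-residue peptide, reference residue, residue encoded by the EST); mapping that position to the full-length protein numbering still requires the arithmetic described in the following steps.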

Hemochromatosis 7 From the Map Viewer page click on the HFE gene link. How many HFE transcripts are there? Which is the longest isoform? Follow “Links” to “Protein” and then to the report for NP_ Determine the residue number that corresponds to the mutation.

RNA splicing and isoforms

Hemochromatosis 8 What effect does the mutation in the original EST have on the protein? (Look at the table for the Genetic Code.) Go back to the Gene Report; read the summary and take note of the GeneRIF bibliography; notice the ‘C282Y’ entries. Now go to “Links” and then to “GeneView in dbSNP” to a list of known SNPs.

Hemochromatosis 9 In the SNP list note that the one you want is currently shown. Select “view rs in gene region” and then click on “view rs” (actually, this is the default view). How many nonsynonymous substitutions do you see? Do you see the one we are particularly interested in?

Digression: SNPs Single Nucleotide Polymorphisms. A single base change that can occur in a person’s DNA. On average SNPs occur about 1% of the time; most are outside of protein-coding regions. Some SNPs may cause a disease; some may be associated with a disease; others may affect predisposition to a disease; others may be simple genetic variation. dbSNP archives SNPs and other variations such as small-scale deletion/insertion polymorphisms (DIPs), etc.

Hemochromatosis 10 Back to the Gene Report, click on “Links” and go to “OMIM” (can also get there via the Map Viewer). In the OMIM entry you can read a bit; also click on “View List” for Allelic Variants, where you can see the mutation again.

Hemochromatosis 11 From the Gene Report again follow “Links” to “Protein” and scroll down to NP_ Click on “Domains” and then “Show Details”. What is the Conserved Domain in the region of interest? Follow the link to the CD. Click on “View 3D Structure”.

Hemochromatosis 12 Look for residue position 282 in the query sequence. Highlight that column. Is the Cys282 conserved in the family? The C282Y mutation therefore likely has the effect of …

Aligning a sequence on a structure with Cn3D (example) Example: Use structure 1n3eA, align sequence for 1m5xA. In the Sequence/Alignment Viewer window select the menu item “Imports/Show Imports”. In the Import Viewer window select the menu item “Edit/Import Sequences”. In the Select Chain dialogue box select 1N3E A and click OK. In the Select Import Source dialogue box select “Network via GI/Accession” and click OK. In the Import Identifier dialogue box enter the accession and click OK. The new sequence will appear. Select “Algorithms/BLAST single” and use the cursor to click anywhere on the 1m5xA sequence to align it using BLAST.

Aligning a sequence on a structure with Cn3D (example cont.) Select the menu item “Alignments/Merge All”. The alignment should now appear in the Sequence/Alignment Viewer window; aligned residues will be red. Close the Import Viewer window and, if desired, pick another color style for the alignment (e.g. identity). You can do this with multiple sequences; this is especially useful if there is no CD for the structure.

PDB

PDB File: Header
HEADER ISOMERASE/DNA 01-MAR-00 1EJ9
TITLE CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: DNA TOPOISOMERASE I;
COMPND 3 CHAIN: A;
COMPND 4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES ;
COMPND 5 EC: ;
COMPND 6 ENGINEERED: YES;
COMPND 7 MUTATION: YES;
COMPND 8 MOL_ID: 2;
COMPND 9 MOLECULE: DNA (5'-
COMPND 10 D(*C*AP*AP*AP*AP*AP*GP*AP*CP*TP*CP*AP*GP*AP*AP*AP*AP*AP*TP*
COMPND 11 TP*TP*TP*T)-3');
COMPND 12 CHAIN: C;
COMPND 13 ENGINEERED: YES;
COMPND 14 MOL_ID: 3;
COMPND 15 MOLECULE: DNA (5'-
COMPND 16 D(*C*AP*AP*AP*AP*AP*TP*TP*TP*TP*TP*CP*TP*GP*AP*GP*TP*CP*TP*
COMPND 17 TP*TP*TP*T)-3');
COMPND 18 CHAIN: D;
COMPND 19 ENGINEERED: YES
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE 3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE 4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE 5 MOL_ID: 2;
SOURCE 6 SYNTHETIC: YES;
SOURCE 7 MOL_ID: 3;
SOURCE 8 SYNTHETIC: YES
KEYWDS PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN
REMARK 1
REMARK 2
REMARK 2 RESOLUTION ANGSTROMS.
REMARK 3
REMARK 3 REFINEMENT.
REMARK 3 PROGRAM : X-PLOR 3.1
REMARK 3 AUTHORS : BRUNGER
…
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK
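
A rough sketch of reading such header records. It splits on whitespace rather than using the fixed-column layout of the real PDB format, handles only single-line COMPND values (not the multi-line DNA sequences above), and all function names are hypothetical. The sample text is a subset of the 1EJ9 header shown on the slide.

```python
from collections import defaultdict

SAMPLE = """\
HEADER    ISOMERASE/DNA                           01-MAR-00   1EJ9
TITLE     CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: DNA TOPOISOMERASE I;
COMPND   3 CHAIN: A;
KEYWDS    PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN
"""

def parse_records(text):
    """Group line bodies by record name (the first word on each line)."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, rest = line.partition(" ")
        records[name].append(rest.strip())
    return records

def parse_header(body):
    """Split a HEADER body into classification, deposition date, and PDB ID."""
    *classification, date, pdb_id = body.split()
    return {"classification": " ".join(classification),
            "date": date, "id": pdb_id}

def compound_fields(compnd_bodies):
    """Turn single-line COMPND bodies into (key, value) pairs."""
    fields = []
    for body in compnd_bodies:
        token, _, rest = body.partition(" ")
        text = rest if token.isdigit() else body  # drop the continuation number
        key, _, value = text.rstrip(";").partition(":")
        fields.append((key.strip(), value.strip()))
    return fields

records = parse_records(SAMPLE)
print(parse_header(records["HEADER"][0]))
print(compound_fields(records["COMPND"]))
```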

From Coordinates to Models 1EJ9: Human topoisomerase I

Building the Structure Summary [Figure: Structure summary page with links to Taxonomy, PubMed, Protein, 3D Domains, Domains, and Nucleotide.]

Indexing into MMDB Structure Import only experimentally determined structures. Convert to ASN.1. Verify sequences. Add secondary structure. Add chemical bonds. Create “backbone” model (Cα, P only). Create single-conformer model. ASN.1 fragments shown on the slide:
inter-residue-bonds { { atom-id-1 { molecule-id 1, residue-id 1, atom-id 1 }, atom-id-2 { molecule-id 1, residue-id 2, atom-id 9 } },
id 1, name "helix 1", type helix, location subgraph residues interval { { molecule-id 1, from 49, to 61 } } },

Structure Indexing Entrez index fields: MMDB-ID, MMDB entry date, EC number, Organism. PDB: Accession, Release date, Class, Source, Description, Comment. Ligands: PDB code, PDB name, PDB description. Literature: Article title, Author, Journal, Publication date. Experimental Method. Resolution. Counters: Ligand types, Modified amino acids, Modified nucleotides, Modified ribonucleotides, Protein chains, DNA chains, RNA chains. Example query: topoisomerase AND 2[dnachaincount] AND human[organism]

Creating Sequence Records One record per chain. Protein: 1EJ9A. Nucleotide: 1EJ9C, 1EJ9D.

Annotating Secondary Structure 1EJ9: Human topoisomerase I α-Helices β-strands coils/loops

Creating 3D Domains 3D Domain 0: 1EJ9A0 = entire polypeptide

Creating 3D Domains 3D Domains: 1EJ9A1, 1EJ9A2, 1EJ9A3, 1EJ9A4, 1EJ9A5. [Figure label: < 3 Secondary Structure Elements]

3D Domain Indexing Entrez index fields: SDI, MMDB-ID, Accession, MMDB entry date, Organism, Domain number, Cumulative number. PDB: Accession, Release date, Class, Source, Description, Comment. Literature: Article title, Author, Publication date. Counters: Modified amino acids, α-Helices, β-Strands, Residues, Molecular weight. REMEMBER: 3D Domain 0 is the entire polypeptide chain! Find all viral four helix bundles: 4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism]
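
To make the meaning of the counter fields concrete, the sketch below applies the same conditions as the four-helix-bundle query to a handful of local records. Everything here is hypothetical: the record class, the accessions, and the counts are invented placeholders, not MMDB data, and real searches run inside Entrez rather than in local code.

```python
from dataclasses import dataclass

@dataclass
class Domain3D:
    accession: str       # placeholder accession, not a real entry
    domain_no: int       # 0 means the entire polypeptide chain
    helix_count: int
    strand_count: int
    organism_group: str

def in_range(value, spec):
    """Interpret an Entrez-style count: '4' means exactly 4, '0:2' means 0 through 2."""
    low, _, high = spec.partition(":")
    return int(low) <= value <= int(high or low)

def is_viral_four_helix_chain(d):
    """Mirror 4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism]."""
    return (in_range(d.helix_count, "4")
            and in_range(d.strand_count, "0")
            and d.domain_no == 0
            and d.organism_group == "viruses")

# Invented records for illustration only.
records = [
    Domain3D("XXX1A0", 0, 4, 0, "viruses"),
    Domain3D("XXX1A1", 1, 4, 0, "viruses"),    # a sub-domain, not the whole chain
    Domain3D("XXX2A0", 0, 4, 2, "viruses"),    # has strands
    Domain3D("XXX3A0", 0, 4, 0, "bacteria"),   # wrong organism group
]
hits = [d.accession for d in records if is_viral_four_helix_chain(d)]
print(hits)
```

Requiring domain number 0 keeps the search at the level of whole chains, which is why the slide's reminder about 3D Domain 0 matters for this query.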