Tri-I Bioinformatics Workshop: Public data and tool repositories
Alex Lash & Maureen Higgins
Bioinformatics Core, Memorial Sloan-Kettering Cancer Center

Workshop sections
1. Retrieving data from public resources
   a) public databases at NCBI, EBI, Ensembl
   b) locate and utilize some of the myriad of publicly available bioinformatics tools
   c) common data formats
2. Genome Browsers
   a) genome build process, ongoing and complete genome projects
   b) genome browsers of Ensembl, UCSC and NCBI Mapviewer
   c) broad survey of analysis tools and tutorials available on the Web for use directly and after download

Public data and tool repositories
Section 1: Retrieving data from public resources

Goals
A. Understand the scope and organization of the major public databases: NCBI, EBI/Ensembl.
B. Understand the importance of unique identifiers, database fields, logical operators and wildcards.
C. Be able to query, retrieve and display publications and sequences.
D. Be able to visualize and analyze protein structure.

Amyloid Precursor Protein (APP)
1. G-protein coupled receptor that binds heparin and laminin
2. Controls nerve cell growth
3. Interacts with protein-synthesis machinery
Figure labels: β-secretase; a second secretase (prefix lost in the source); amyloid fibril; amyloid plaque

NCBI
Strengths are data storage, annotation and BLAST:
1. PubMed: Biomedical publications
2. Heritable diseases and syndromes
3. GenBank: Nucleotide and protein sequences
4. BLAST: Pairwise sequence comparison
5. Curated gene-centric data, including reference sequences
6. Genome builds
7. Nucleotide sequence traces
Ex: Finding Entrez Gene record for APP

Indexing and logical operators
Query: app[Gene Name] AND homo sapiens[Organism]
Figure: an alphabetical term index (… aardvark … app … homo sapiens … mus musculus …); the entries for app and homo sapiens are combined with AND.
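The idea behind the slide can be sketched with a toy inverted index. This is a minimal illustration, not Entrez's implementation: the records, the UIDs and the helper names `build_index` and `lookup` are all hypothetical. Each (field, term) pair maps to the set of record UIDs that contain it, and AND becomes a set intersection.

```python
from collections import defaultdict

# Hypothetical records: UID -> {field: value}. Not real Entrez data.
RECORDS = {
    1: {"Gene Name": "app", "Organism": "homo sapiens"},
    2: {"Gene Name": "app", "Organism": "mus musculus"},
    3: {"Gene Name": "psen1", "Organism": "homo sapiens"},
}

def build_index(records):
    """Map each (field, term) pair to the set of UIDs containing it."""
    index = defaultdict(set)
    for uid, fields in records.items():
        for field, value in fields.items():
            index[(field, value.lower())].add(uid)
    return index

def lookup(index, term, field):
    """Return the UIDs indexed under term[field]; empty set if absent."""
    return index.get((field, term.lower()), set())

INDEX = build_index(RECORDS)

# app[Gene Name] AND homo sapiens[Organism]
hits = lookup(INDEX, "app", "Gene Name") & lookup(INDEX, "homo sapiens", "Organism")
print(sorted(hits))  # [1]
```

Swapping `&` for `|` or `-` gives OR and NOT on the same index.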

An Entrez Query
1. Query parsed: terms, fields and operators organized in a tree (if syntax incorrect, generate error or warning)
2. Unfielded terms matched to synonyms, and extra terms, fields and operators added as needed
3. For each database:
   a) According to order of operations:
      i. Term found in appropriate index (if term not found, then generate warning)
      ii. Bit map pulled and uncompressed
      iii. Pairwise operations performed with previous result (if zero result, then stop)
   b) Number of results generated
4. If Global Query, display results summary and stop
5. List of UIDs generated from final result
6. UIDs sorted by user preference
7. Records pulled and displayed by user preference
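Steps 3 and 5 to 6 above can be sketched with integers used as bit maps. This is an assumption-laden toy, not NCBI's code: the bit maps, the UIDs and the function `evaluate` are hypothetical, parsing and synonym expansion (steps 1 and 2) are skipped, and operators are applied strictly left to right.

```python
# Hypothetical per-term bit maps: bit i is set when the record at position i
# contains the term. Python ints serve as arbitrarily long bit sets.
UIDS = [101, 102, 103, 104]                # made-up UID for each bit position
BITMAPS = {
    ("app", "Gene Name"): 0b0011,          # positions 0 and 1
    ("homo sapiens", "Organism"): 0b0101,  # positions 0 and 2
    ("mus musculus", "Organism"): 0b1010,  # positions 1 and 3
}

OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "NOT": lambda a, b: a & ~b,
}

def evaluate(first, rest):
    """Evaluate term OP term OP ... left to right.

    first: (term, field); rest: list of (operator, (term, field)).
    Returns (sorted UIDs, warnings).
    """
    warnings = []

    def fetch(key):
        # Step 3a-i: find the term in its index, warning if it is absent.
        if key not in BITMAPS:
            warnings.append(f"term not found: {key[0]}[{key[1]}]")
            return 0
        return BITMAPS[key]

    result = fetch(first)
    for op, key in rest:
        # A zero result cannot recover under AND or NOT, so skip the lookup.
        if result == 0 and op != "OR":
            continue
        result = OPS[op](result, fetch(key))  # step 3a-iii: pairwise operation

    # Steps 5 and 6: turn set bits into UIDs and sort them (ascending here).
    uids = [uid for pos, uid in enumerate(UIDS) if (result >> pos) & 1]
    return sorted(uids), warnings

uids, warnings = evaluate(("app", "Gene Name"), [("AND", ("homo sapiens", "Organism"))])
print(uids, warnings)  # [101] []
```

The sketch skips rather than stops on a zero result so that a later OR is still honoured; for a pure AND/NOT chain the effect is the same as the early stop on the slide.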

Gene-centric questions
1. Where is a gene located?
2. What’s its genomic sequence?
3. What variations are associated with it?
4. What’s its exon-intron structure?
5. What are the mRNA sequences of its alternate transcripts?
6. What are the protein sequences of its isoforms?
7. What post-translational modification is possible?
8. What regulates its transcription?
9. What are its co-regulated partners?
10. What’s its normal function?
11. What’s its function in disease?
12. How does it fit into the larger cellular context?
May depend upon cellular “state”
Ex: Looking over the Entrez Gene record for APP

Common id and record formats
1. Ids
   a) GenBank accession
      i. Nucleotide: BI559391, Y00264
      ii. Protein: AAB23646
      iii. RefSeq
   b) Ensembl
   c) UniGene: Hs
   d) PDB Structures: 1iyt
   e) HUGO Gene Names: APP
2. Formats
   a) Flat
      i. GenBank and GenPept
      ii. FASTA
      iii. Multiple FASTA
      iv. Alignment
      v. Multiple alignment
      vi. Tab-delimited
   b) Hierarchical
      i. ASN.1
      ii. XML
      iii. HTML
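Of the flat formats listed, FASTA and multiple FASTA are simple enough to read by hand: a header line starting with ">" followed by one or more sequence lines. The reader below is a minimal sketch with hypothetical names (`parse_fasta`, `EXAMPLE`, and the record ids `seq1` and `seq2`); the peptide is the 42-residue sequence that appears on the amyloid fibril slide, and the nucleotide fragment is made up.

```python
def _record(header, chunks):
    """Split a header into id and description and join the sequence lines."""
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)

def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)

# Hypothetical two-record (multiple FASTA) input.
EXAMPLE = """\
>seq1 hypothetical peptide
DAEFRHDSGYEVHHQKLVFF
AEDVGSNKGAIIGLMVGGVVIA
>seq2 hypothetical nucleotide fragment
ACGTACGT
"""

records = list(parse_fasta(EXAMPLE))
for ident, desc, seq in records:
    print(ident, len(seq))
```

For real work an established parser (for example the one in Biopython) is the safer choice; this sketch ignores format variants such as ";" comment lines.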

NCBI’s RefSeq project
1. Is a project to create curated sequence records for the biopolymers of the Central Dogma: DNA, mRNA and protein
2. First release
3. …,079 organisms, 3,234,358 proteins
4. Goals
   1. non-redundancy
   2. explicitly linked nucleotide and protein sequences
   3. updates to reflect current knowledge of sequence data and biology
   4. data validation and format consistency
   5. distinct accession series
   6. ongoing curation by NCBI staff and collaborators, with reviewed records indicated
5. What’s its relationship to the BLAST database called “nr”?
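The "distinct accession series" goal means a RefSeq record can be recognized from its accession alone: two letters, an underscore, digits and an optional version. The sketch below is illustrative only. The prefix table is a partial list taken from NCBI's documented convention rather than from these slides, and `classify_refseq` and the sample accession `NM_000000.1` are hypothetical.

```python
import re

# Partial table of RefSeq accession prefixes (assumption: not exhaustive).
REFSEQ_PREFIXES = {
    "NC": "genomic (complete molecule, e.g. chromosome)",
    "NG": "genomic region",
    "NM": "mRNA",
    "NR": "non-coding RNA",
    "NP": "protein",
    "XM": "predicted mRNA (model)",
    "XR": "predicted non-coding RNA (model)",
    "XP": "predicted protein (model)",
}

# Two letters, underscore, digits, optional ".version".
REFSEQ_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")

def classify_refseq(accession):
    """Return the molecule type for a RefSeq-style accession, else None."""
    match = REFSEQ_RE.match(accession.strip())
    if not match or match.group(1) not in REFSEQ_PREFIXES:
        return None
    return REFSEQ_PREFIXES[match.group(1)]

print(classify_refseq("NM_000000.1"))  # mRNA (made-up accession)
print(classify_refseq("Y00264"))       # None: GenBank-style, no underscore
```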

UniGene versus Entrez Gene
1. UniGene
   1. Automated process that compares and clusters transcript-source sequences (no assembly)
   2. Gene discovery tool: predates Entrez Gene, genome assemblies
   3. Based primarily on EST sequences
   4. ID turn-over and retirement is common
   5. Currently 76 taxa and 1,299,304 clusters
2. Entrez Gene
   1. Curated clearinghouse of gene-centric information
   2. Grew out of LocusLink (eukaryote model organisms) and Entrez Genome (bacteria, viruses, organelles)
   3. ID turn-over and retirement happens, but is less common since it is based primarily on sequenced genomes
   4. Currently 3882 taxa and 2,479,759 genes
3. Hs: 85,793 UniGene clusters compared to 38,604 Entrez Gene records

EBI/Ensembl
Strengths are data storage and analysis software:
1. Biomedical publications
2. Nucleotide and protein sequences
3. Protein domains/signatures
4. Sequence comparison
5. Sequence analysis
6. Structure analysis
7. Protein function analysis
8. Ensembl genome browser
Ex: Looking at the APP gene in the EBI/Ensembl resources

Ensembl ids
1. Human
   1. ENSG: gene
   2. ENST: transcript
   3. ENSE: exon
   4. ENSP: protein
2. Other organisms
   1. ENS{species 3-letter code}{G|T|P}{11 digits}
   2. RNO=rat
   3. MUS=mouse
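The naming pattern above is regular enough to check with a short expression. The following is a sketch under stated assumptions: `parse_ensembl_id` is a hypothetical helper, the ids are made-up strings that only follow the pattern, exon ids (E) are accepted for every species although the slide lists only G, T and P for non-human organisms, and only the two species codes given on the slide are translated.

```python
import re

# ENS, optional 3-letter species code, feature letter, 11 digits.
ENSEMBL_RE = re.compile(r"^ENS([A-Z]{3})?([GTEP])(\d{11})$")

FEATURES = {"G": "gene", "T": "transcript", "E": "exon", "P": "protein"}
SPECIES = {None: "human", "RNO": "rat", "MUS": "mouse"}  # codes from the slide

def parse_ensembl_id(identifier):
    """Return (species, feature) for an Ensembl-style id, or None if malformed."""
    match = ENSEMBL_RE.match(identifier.strip())
    if not match:
        return None
    code, feature, _digits = match.groups()
    # Unknown species codes are passed through unchanged.
    return SPECIES.get(code, code), FEATURES[feature]

human_gene = f"ENSG{1:011d}"        # made-up id: ENSG00000000001
mouse_protein = f"ENSMUSP{1:011d}"  # made-up id: ENSMUSP00000000001

print(parse_ensembl_id(human_gene))     # ('human', 'gene')
print(parse_ensembl_id(mouse_protein))  # ('mouse', 'protein')
print(parse_ensembl_id("APP"))          # None
```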

Amyloid Precursor Protein (APP)
G-protein coupled receptor that binds heparin and laminin
Figure labels: β-secretase; a second secretase (prefix lost in the source); amyloid fibril; amyloid plaque
Ex: Viewing the structure of an amyloid fibril
DAEFRHDSGYEVHHQKLVFFAEDVGSNKGAIIGLMVGGVVIA

Other structure tools
1. Structure visualization. Free applications:
   a) RasMol
   b) Cn3D
   c) VMD
2. Structure prediction servers/applications
   a) CASP: Critical Assessment of Techniques for Protein Structure Prediction
   b) General method:
      i. Sequence similarity search to identify closest homolog with known structure
      ii. Fit to homolog’s known structure, minimizing some constraint
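Step i of the general method can be illustrated with a deliberately simple stand-in for a similarity search: percent identity over sequences that are assumed to be already aligned to the same length. Real servers use proper alignment and scoring, and step ii (fitting to the template structure) is not covered here. The function names and the template sequences are hypothetical.

```python
def percent_identity(a, b):
    """Percent of identical positions between two pre-aligned sequences.

    Columns where either sequence has a gap ("-") are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return 100.0 * sum(x == y for x, y in compared) / len(compared)

def closest_template(query, templates):
    """Return (name, identity) of the template most similar to the query."""
    scored = ((name, percent_identity(query, seq)) for name, seq in templates.items())
    return max(scored, key=lambda item: item[1])

QUERY = "DAEFRHDSGY"  # first ten residues of the peptide on the previous slide
TEMPLATES = {         # made-up sequences standing in for known structures
    "template_A": "DAEFGHDSGY",
    "template_B": "DAQFRHNSGF",
}

print(closest_template(QUERY, TEMPLATES))  # ('template_A', 90.0)
```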

Problems
1. Query Entrez Gene with the following two queries separately, and then explain the differences between the two results using a logical NOT operation:
   a) tyrosine kinase[Gene Ontology] AND human[Organism]
   b) cd00192[Domain] AND human[Organism]
2. Retrieve the APP gene record from NCBI and use the Display dropdown menu to display Conserved Domain Links. Use the ids of the listed domains to query Entrez Gene for records with the same domains.
3. Use the SNP Geneview link at NCBI to identify coding SNPs in the APP gene. Which SNP that was present in the Ensembl APP protein record is missing from this display?
4. Use the Homologene link at NCBI to identify possible functional orthologs for human APP. How does this list compare to the Ensembl list of orthologs that we reviewed previously?
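For Problem 1, the NOT comparison can also be written as a single query and sent to NCBI's E-utilities `esearch` service instead of the web form. The sketch below only builds the URLs and performs no network request. The helper names are hypothetical, and whether the field names `[Gene Ontology]` and `[Domain]` from the slide are still accepted by the current Entrez Gene index is not verified here.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# The two queries from Problem 1, copied from the slide.
QUERY_A = "tyrosine kinase[Gene Ontology] AND human[Organism]"
QUERY_B = "cd00192[Domain] AND human[Organism]"

def esearch_url(term, db="gene", retmax=20):
    """Build an esearch URL for the given query term (nothing is fetched)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})

def only_in_first(first, second):
    """Combine two queries so that only records unique to the first remain."""
    return f"({first}) NOT ({second})"

# Records matched by query a) but not by query b), and the reverse.
url_a_not_b = esearch_url(only_in_first(QUERY_A, QUERY_B))
url_b_not_a = esearch_url(only_in_first(QUERY_B, QUERY_A))
print(url_a_not_b)
print(url_b_not_a)
```

Comparing the two result lists shows which genes carry the annotation without the domain, and which carry the domain without the annotation.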