all.joiner: A file that describes joinable fields in the UCSC Genome Databases



Basic example of an identifier

The central concept of all.joiner is the identifier, which appears in fields of multiple tables, sometimes even in multiple databases. $gbd is a variable that contains a comma-separated list of genome databases. An identifier consists of an identifier line, a required comment in quotes, and a list of database.table.field entries where the identifier is used. The first field listed is the master key; it contains all values of the identifier. Later fields need not contain all of them.

    identifier softberryGeneName
    "Link together Fshgene++ gene structure, peptide, and homolog"
        $gbd.softberryGene.name
        $gbd.softberryPep.name
        $gbd.softberryHom.name

Variables

Variables are defined by the set keyword. In practice they are mostly used for comma-separated lists of databases.

    set fish tetraodon,fugu,zebrafish
    set worms elegans,briggsae

After these two sets, typing $fish,$worms is equivalent to typing:

    tetraodon,fugu,zebrafish,elegans,briggsae
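The substitution described on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not the parser used by the UCSC tools; the function names parse_sets and expand are hypothetical, and the sketch assumes every variable reference is a whole comma-separated item of the form $name.

```python
def expand(text, variables):
    """Replace each $name item in a comma-separated list with its value."""
    parts = []
    for item in text.split(","):
        if item.startswith("$"):
            parts.append(variables[item[1:]])
        else:
            parts.append(item)
    return ",".join(parts)


def parse_sets(lines):
    """Collect `set name value` lines, expanding variables defined earlier."""
    variables = {}
    for line in lines:
        words = line.split()
        if len(words) == 3 and words[0] == "set":
            variables[words[1]] = expand(words[2], variables)
    return variables


variables = parse_sets([
    "set fish tetraodon,fugu,zebrafish",
    "set worms elegans,briggsae",
])
expanded = expand("$fish,$worms", variables)
print(expanded)
```

Because parse_sets expands values as it goes, a later set line that refers to earlier variables ends up holding a flat list of database names.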

Databases by organism

    # Define databases used for various organisms
    set hg hg15,hg16,hg17,hg18
    set mm mm3,mm4,mm5,mm6,mm7,mm8
    set rn rn2,rn3,rn4
    set fr fr1,fr2
    set ce ce1,ce2,ce4
    set cb cb1,cb3
    set dm dm1,dm2,dm3
    set dp dp2,dp3
    set sc sc1
    set sacCer sacCer1
    set panTro panTro1,panTro2
    set galGal galGal2,galGal3

    # Define all genome databases.
    set gbd $hg,$mm,$rn,$fr,$ce,$cb,$dm,$dp,$sc,$sacCer,$panTro,$galGal

    # Only consider one of members of gbd at a time.
    exclusiveSet $gbd

    # Define other databases that we check
    set otherDb visiGene,uniProt,go,proteome,hgFixed …

    # Set up list of databases we ignore and those we check. Program
    # will complain about other databases.
    databasesChecked $gbd,$otherDb
    databasesIgnored mysql,lost+found,$proteinDb,$zooDb,hgcentraltest,hgcentralbeta

    # Define databases that support known genes
    set kgDb $hg,$mm,$rn

    # Define databases that support Gene Sorter
    # (which once was the gene family browser)
    set familyDb $hg,$mm,$ce,$sacCer,$dm

    # Magic for tables split between chromosomes
    set split splitPrefix=chr%_

    # Stuff to link together self chains and nets
    identifier chainSelf
    "Link together self chain info"
        $gbd.chainSelf.id $split
        $gbd.chainSelfLink.chainId $split
        $gbd.netSelf.chainId exclude=0

The splitPrefix= setting allows logical tables to be split. The % acts as a wildcard (SQL style). The exclude=0 says that the master key need not include 0.

Chains and nets are more complex than other identifiers.
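To make the split-table idea concrete, here is a rough Python sketch of how a prefix such as chr%_ could be matched against physical table names. It is not the matching code used by joinerCheck. It assumes that only % is a wildcard and that every other character, including the underscore, is literal; the function names and the table listing are made up for illustration.

```python
import re


def split_table_regex(split_prefix, table):
    """Build a regex for a logical table stored under a split prefix.

    Assumption: only '%' is a wildcard (any run of characters); all other
    characters in the prefix, including '_', are matched literally.
    """
    literal_parts = [re.escape(part) for part in split_prefix.split("%")]
    return re.compile("^" + ".*".join(literal_parts) + re.escape(table) + "$")


def physical_tables(split_prefix, table, all_tables):
    """Return the physical tables that make up one logical split table."""
    pattern = split_table_regex(split_prefix, table)
    return [name for name in all_tables if pattern.match(name)]


# Hypothetical table listing for one database.
tables = ["chr1_chainSelf", "chrX_chainSelf", "chr1_chainSelfLink", "netSelf", "chainSelf"]
matched = physical_tables("chr%_", "chainSelf", tables)
print(matched)
```

With this listing, the logical table chainSelf resolves to the two per-chromosome tables, while chr1_chainSelfLink is left for the chainSelfLink entry.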

    set chainDest Hg15,Hg16,Hg17,Mm4,Mm5,Mm6…

    identifier chain[${chainDest}]Id
    "Link together chain info"
        $gbd.chain[].id $split
        $gbd.chain[]Link.chainId $split
        $gbd.net[].chainId
        $gbd.allChain[].id
        $gbd.netRxBest[].chainId exclude=0
        $gbd.net[]NonGap.chainId exclude=0
        $gbd.netSynteny[].chainId exclude=0

Other chains and nets use a macro expansion of sorts, so we don't need to define a separate identifier for each one.
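The macro expansion can be pictured with the following Python sketch. It reflects one reading of the slide: the bracketed variable in the identifier name is expanded once per list item, and each empty [] in the field list is replaced by the same item. The function expand_macro is hypothetical, and the destination list is deliberately shortened.

```python
def expand_macro(name_template, field_templates, variables):
    """Expand one bracketed identifier into one identifier per list item.

    `name_template` looks like 'chain[${chainDest}]Id'; each field template
    uses '[]' where the current list item should be substituted.
    """
    start = name_template.index("[${")
    end = name_template.index("}]", start)
    var_name = name_template[start + 3:end]
    prefix, suffix = name_template[:start], name_template[end + 2:]

    expanded = {}
    for item in variables[var_name].split(","):
        fields = [template.replace("[]", item) for template in field_templates]
        expanded[prefix + item + suffix] = fields
    return expanded


# A shortened, illustrative destination list (the real one is longer).
variables = {"chainDest": "Hg15,Hg16,Mm4"}
identifiers = expand_macro(
    "chain[${chainDest}]Id",
    ["$gbd.chain[].id", "$gbd.chain[]Link.chainId", "$gbd.net[].chainId"],
    variables,
)
for name, fields in identifiers.items():
    print(name, fields)
```

Each destination assembly gets its own identifier (chainHg15Id, chainHg16Id, and so on) without a separate block having to be written for it.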

    # Genbank/trEMBL Accessions and meaningful subsets thereof
    identifier genbankAccession external=genbank
    "Generic Genbank Accession. More specific Genbank accessions follow"
        $gbd.seq.acc

    identifier stsAccession external=genbank typeOf=genbankAccession
    "Genbank accession of a Sequence Tag Site (STS) sequence."
        $gbd.stsInfo2.genbank dupeOk

    identifier bacEndAccession typeOf=genbankAccession
    "Genbank accession of a BAC end read."
        $gbd.all_bacends.qName dupeOk
        $gbd.bacEndPairs.lfNames comma
        $hg.fishClones.beNames comma minCheck=0.70

The typeOf line allows joins between parent and child, but not between siblings.
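The parent/child rule can be stated as a small predicate. The Python sketch below only restates the rule from this slide and is not taken from the Table Browser source; the can_join function and the type_of mapping are illustrative names.

```python
def can_join(a, b, type_of):
    """Allow a join within one identifier or between a parent and its child.

    `type_of` maps a child identifier to the parent named in its typeOf=.
    Two children of the same parent (siblings) are not joinable.
    """
    return a == b or type_of.get(a) == b or type_of.get(b) == a


# typeOf relationships from the Genbank accession identifiers.
type_of = {
    "stsAccession": "genbankAccession",
    "bacEndAccession": "genbankAccession",
}

print(can_join("stsAccession", "genbankAccession", type_of))
print(can_join("stsAccession", "bacEndAccession", type_of))
```

An STS accession can be joined to the generic Genbank accession, but not to a BAC end accession, even though both are Genbank accessions.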

    identifier hugoName external=HUGO fuzzy
    "International Human Gene Identifier"
        $hg.refLink.name
        $hg.atlasOncoGene.locusSymbol
        $hg.kgAlias.alias
        $hg.kgXref.geneSymbol
        $hg.refFlat.geneName
        $hg.jaxOrtholog.humanSymbol
        hg13,hg15.geneBands.name

"Biological" names for human genes are so messy that no validation is done (note the 'fuzzy' keyword).

    identifier ensemblTranscriptId external=Ensembl dependency
    "Ensembl Transcript ID"
        $gbd.ensGene.name chopAfter=.
        $gbd.superfamily.name
        $gbd.ensGeneXref.transcript_name chopAfter=. minCheck=0.20
        mm3,hg13.ensemblXref.transcript_name chopAfter=. minCheck=0.20
        mm3.ensemblXref2.transcript_name chopAfter=. minCheck=0.20
        $gbd.ensGtp.transcript chopAfter=. minCheck=0.98
        $gbd.ensPep.name chopAfter=. minCheck=0.98
        $gbd.ensTranscript.transcript_name chopAfter=. minCheck=0.20
        $kgDb.knownToEnsembl.value chopAfter=.
        $gbd.sfDescription.name chopAfter=.
        mm3.superfamily.name chopAfter=.

Ensembl isn't 'fuzzy' but requires a relaxed 'minCheck'.
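The slide does not spell out what chopAfter and minCheck do, so the Python sketch below rests on two assumptions: that chopAfter=. drops the first dot and everything after it before values are compared, and that minCheck is the minimum fraction of a field's values that must be found in the master key. The accession strings are invented for the example.

```python
def chop_after(value, marker):
    """Drop the marker and everything after it, e.g. a '.N' version suffix."""
    return value.split(marker, 1)[0]


def passes_min_check(field_values, master_keys, min_check=1.0, marker=None):
    """Check that enough of a field's values appear in the master key.

    Assumption: min_check is the required fraction of values that must hit.
    """
    if not field_values:
        return True
    if marker is not None:
        field_values = [chop_after(value, marker) for value in field_values]
    hits = sum(1 for value in field_values if value in master_keys)
    return hits / len(field_values) >= min_check


# Hypothetical transcript names; the master key is chopped the same way.
master = {chop_after(name, ".") for name in ["ENST0001.1", "ENST0002.3"]}
xref_names = ["ENST0001.2", "ENST0002.1", "ENST0009.1", "ENST0010.1", "ENST0011.1"]

relaxed = passes_min_check(xref_names, master, min_check=0.20, marker=".")
strict = passes_min_check(xref_names, master, min_check=0.98, marker=".")
print(relaxed, strict)
```

Here two of the five names are found in the master key, so the field passes a 0.20 threshold but would fail at 0.98.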

    # Table types - describe tables sharing a common format.
    type genePred
        $hg.acembly
        $gbd.ECgene
        $gbd.geneid
        $gbd.genscan
        $gbd.sgpGene
        $gbd.softberryGene
        $gbd.twinscan
        $gbd.ensGene
        $gbd.vegaGene
        $gbd.refGene
        $gbd.jgiFilteredModels

The Table Browser looks for a genePred.as file based on this, and fills in descriptions in 'describe schema'.

    # Dependencies not already captured in identifiers.
    # The joinerCheck program can quickly check times and
    # dependencies sort of like make.
    dependency $mm.affyGnfU74ADistance $mm.knownToU74 hgFixed.gnfMouseU74aMedianRatio
    dependency $mm.affyGnfU74BDistance $mm.knownToU74 hgFixed.gnfMouseU74bMedianRatio
    dependency $mm.affyGnfU74CDistance $mm.knownToU74 hgFixed.gnfMouseU74cMedianRatio
    dependency $hg.gnfU95Distance $hg.knownToU95 hgFixed.gnfHumanU95MedianRatio
    dependency $ce.kimExpDistance hgFixed.kimWormLifeMedianRatio
    dependency $dm.arbExpDistance $dm.bdgpToCanonical hgFixed.arbFlyLifeMedianRatio
    dependency $sacCer.choExpDistance hgFixed.yeastChoCellCycle
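A make-style time check over such lines might look like the Python sketch below. It assumes that the first table on a dependency line is the derived table and that the remaining tables are what it was built from. The stale_tables function and the update times are hypothetical; a real check would read the times from the database server.

```python
from datetime import datetime


def stale_tables(dependencies, update_times):
    """Return (table, source) pairs where the table is older than its source.

    `dependencies` maps a derived table to the tables it depends on
    (assumed reading: first word of a dependency line, then the rest).
    """
    problems = []
    for table, sources in dependencies.items():
        for source in sources:
            if update_times[table] < update_times[source]:
                problems.append((table, source))
    return problems


# Hypothetical update times, for illustration only.
update_times = {
    "mm8.affyGnfU74ADistance": datetime(2006, 3, 1),
    "mm8.knownToU74": datetime(2006, 3, 15),
    "hgFixed.gnfMouseU74aMedianRatio": datetime(2005, 1, 10),
}
dependencies = {
    "mm8.affyGnfU74ADistance": ["mm8.knownToU74", "hgFixed.gnfMouseU74aMedianRatio"],
}
problems = stale_tables(dependencies, update_times)
print(problems)
```

In this example the distance table is reported as stale because knownToU74 was rebuilt after it.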

    # Ignored tables - no linkage here that we check at least.
    tablesIgnored go
        instance_data
        source_audit

    tablesIgnored $gbd
        ancientRepeat
        axtInfo
        chromInfo
        cpgIsland
        …
        trackDb%
        chr%_mrna

joinerCheck squawks about any table (or database) not mentioned.

joinerCheck

Checks database vs. all.joiner in various ways. Very handy for QA, but:
– Full joinerCheck takes a long, long time to run.
– Output is verbose because it complains about missing stuff.
– The -times check is fast, but sometimes we make tables out of order without it being a true error.

    joinerCheck - Parse and check joiner file
    usage:
       joinerCheck file.joiner
    options:
       -identifier=name - Just validate given identifier.
       -database=name - Just validate given database.
       -fields - Check fields in joiner file exist, faster with -fieldListIn
       -fieldListOut=file - List all fields in all databases to file.
       -fieldListIn=file - Get list of fields from file rather than mysql.
       -keys - Validate (foreign) keys. Takes at least an hour.
       -tableCoverage - Check that all tables are mentioned in joiner file
       -dbCoverage - Check that all databases are mentioned in joiner file
       -times - Check update times of tables are after tables they depend on
       -all - Do all tests: -fields -keys -tableCoverage -dbCoverage -times

joinerCheck - the tool you'll love to hate!

all.joiner in summary

Together with the .as files, all.joiner describes our large, messy, useful database.
Missing info in all.joiner results in missing functionality in the Table Browser.
QA can automatically catch many problems with joinerCheck.
Full path: src/hg/makeDb/schema/all.joiner
See also src/hg/makeDb/schema/joiner.doc