UCSC Genome Tools and Databases Jim Kent - Genome Bioinformatics Group University of California Santa Cruz.

1 UCSC Genome Tools and Databases Jim Kent - Genome Bioinformatics Group University of California Santa Cruz

3 Behind the Genome Browser
‘Genome’ database, one for each assembly of each genome.
 – hg17 (human genome assembly 17)
 – mm6 (Mus musculus 6)
 – canFam1 (Canis familiaris 1)
hg17 has 1616 tables, but not really.
 – Some tables are split across chromosomes for speed
 – 228 logical tables
 – Only ~30 different types of tables
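The gap between physical and logical tables can be illustrated with a small sketch. It assumes that a split table is named with a chromosome prefix followed by the logical name (for example chr1_est); the regular expression and the table list below are illustrative assumptions, not an actual hg17 listing.

```python
import re
from collections import defaultdict

# Assumed naming convention: a split table is "<chrom>_<logicalName>",
# e.g. "chr1_est". Tables without such a prefix are their own logical table.
SPLIT_TABLE = re.compile(r"^chr(?:\d+|[XYM]|Un)(?:_random)?_(.+)$")

def logical_name(physical_name):
    """Return the logical table that a physical table belongs to."""
    match = SPLIT_TABLE.match(physical_name)
    return match.group(1) if match else physical_name

def group_tables(physical_names):
    """Map each logical table name to its sorted physical tables."""
    groups = defaultdict(list)
    for name in physical_names:
        groups[logical_name(name)].append(name)
    return {logical: sorted(names) for logical, names in groups.items()}

# Illustrative table list only.
tables = ["chr1_est", "chr2_est", "chrX_est", "chr1_random_est",
          "chr1_mrna", "chr2_mrna", "cpgIsland", "ensGene"]
grouped = group_tables(tables)
print(len(tables), "physical tables ->", len(grouped), "logical tables")
```

Grouping by the stripped name is one plausible way a count like 1616 physical tables could collapse to a few hundred logical ones.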

9 Results of selecting fields from related tables: Ensembl Gene (ensGene) and Superfamily Description (sfDescription).

18 Custom Track Output
Useful for visualizing the results of queries in the genome browser.
The way to produce more complex queries.
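As a rough illustration of what custom track output looks like, the sketch below formats query results as a BED-style custom track: a `track` line followed by one tab-separated line per item. The function name, track name, and coordinates are made up for the example.

```python
def custom_track(name, description, rows):
    """Format rows of (chrom, start, end, item_name) as a BED custom track."""
    lines = ['track name="%s" description="%s"' % (name, description)]
    for chrom, start, end, item_name in rows:
        lines.append("%s\t%d\t%d\t%s" % (chrom, start, end, item_name))
    return "\n".join(lines) + "\n"

# Made-up query results; the coordinates are placeholders.
rows = [("chr1", 1000, 2000, "hitA"), ("chr2", 500, 900, "hitB")]
text = custom_track("queryResults", "Results of a table query", rows)
print(text, end="")
```

Text in this form can be pasted into the browser's custom track page, and the resulting track can then serve as input to a further query.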

24 681/3,329 (20%) of Ensembl genes that are not known are also not conserved.
1,728/33,666 (5%) of Ensembl genes in general are not conserved.
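The two percentages can be checked with a few lines of arithmetic. The variable names are only labels for the counts quoted on the slide.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

not_known = percent(681, 3329)    # Ensembl genes not known, not conserved
overall = percent(1728, 33666)    # all Ensembl genes, not conserved
fold = not_known / overall        # how much higher the first rate is
print("%.1f%% vs %.1f%% (about %.1f-fold)" % (not_known, overall, fold))
```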

25 Meta-data behind Table Browser
The trackDb table describes each track.
Table and field descriptions are in AutoSql .as files, which also generate SQL code and C code to load/save from the database and tab-separated files.
Descriptions of how tables are connected are in the all.joiner file, which, along with the joinerCheck program, checks database integrity.

26 .as Files - table and field docs

table cpgIsland
"Describes the CpG Islands"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "CpG Island"
    uint length;       "Island Length"
    uint cpgNum;       "Number of CpGs in island"
    uint gcNum;        "Number of C and G in island"
    float perCpg;      "Percentage of island that is CpG"
    float perGc;       "Percentage of island that is C or G"
    )

autoSql generates code from these. They also help document.
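To give a feel for how code can be generated from a .as description, here is a minimal sketch that turns the cpgIsland definition into a CREATE TABLE statement plus a dictionary of field documentation. This is not autoSql itself: the type mapping, the `as_to_sql` function, and the exact SQL layout are assumptions made for illustration, and the real tool handles many more types and also emits C code.

```python
import re

AS_TEXT = '''
table cpgIsland
"Describes the CpG Islands"
    (
    string chrom;      "Human chromosome or FPC contig"
    uint chromStart;   "Start position in chromosome"
    uint chromEnd;     "End position in chromosome"
    string name;       "CpG Island"
    uint length;       "Island Length"
    uint cpgNum;       "Number of CpGs in island"
    uint gcNum;        "Number of C and G in island"
    float perCpg;      "Percentage of island that is CpG"
    float perGc;       "Percentage of island that is C or G"
    )
'''

# Assumed type mapping for this sketch only.
SQL_TYPES = {"string": "varchar(255)", "uint": "int unsigned", "float": "float"}

# One field per line: <type> <name>; "<comment>"
FIELD = re.compile(r'^\s*(\w+)\s+(\w+);\s*"([^"]*)"\s*$')

def as_to_sql(as_text):
    """Return (CREATE TABLE statement, {field: description}) for a simple .as table."""
    table = re.search(r"^\s*table\s+(\w+)", as_text, re.MULTILINE).group(1)
    columns, docs = [], {}
    for line in as_text.splitlines():
        match = FIELD.match(line)
        if match:
            as_type, name, comment = match.groups()
            columns.append("    %s %s not null" % (name, SQL_TYPES[as_type]))
            docs[name] = comment
    sql = "CREATE TABLE %s (\n%s\n);" % (table, ",\n".join(columns))
    return sql, docs

sql, docs = as_to_sql(AS_TEXT)
print(sql)
```

Keeping the schema and the field descriptions in one file means the documentation and the generated code cannot drift apart.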

27 all.joiner - basic example
The central concept is an identifier that appears in fields in multiple tables, sometimes even multiple databases.
$gbd is a variable that contains a comma-separated list of databases.
An identifier record ends with a blank line.

identifier softberryGeneName
"Link together Fgenesh++ gene structure, peptide, and homolog"
    $gbd.softberryGene.name
    $gbd.softberryPep.name
    $gbd.softberryHom.name
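The role of a variable such as $gbd can be sketched as a simple expansion step: each field line names one or more databases, a table, and a field. The `expand_field` helper and the value given to `gbd` below are hypothetical; the real database list is defined in all.joiner.

```python
def expand_field(spec, variables):
    """Expand 'dbSpec.table.field' into (database, table, field) triples.

    dbSpec is a comma-separated list whose items are either literal
    database names or $variables that hold such a list themselves.
    """
    db_spec, table, field = spec.split(".")
    databases = []
    for part in db_spec.split(","):
        if part.startswith("$"):
            databases.extend(variables[part[1:]].split(","))
        else:
            databases.append(part)
    return [(db, table, field) for db in databases]

# Hypothetical value for $gbd, chosen for the example.
variables = {"gbd": "hg17,mm6,canFam1"}
record = ["$gbd.softberryGene.name",
          "$gbd.softberryPep.name",
          "$gbd.softberryHom.name"]
for spec in record:
    print(expand_field(spec, variables))
```

The same helper also copes with a literal list of databases, such as `hg13,hg15.geneBands.name`.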

28
# Genbank/trEMBL Accessions and meaningful subsets thereof
identifier genbankAccession external=genbank
"Generic Genbank Accession. More specific Genbank accessions follow"
    $gbd.seq.acc

identifier bacEndAccession typeOf=genbankAccession
"Genbank accession of a BAC end read."
    $gbd.all_bacends.qName dupeOk
    $gbd.bacEndPairs.lfNames comma
    $hg.fishClones.beNames comma minCheck=0.70

typeOf - allows joins between parent and child, but not between siblings.
dupeOk - allows more than one row with the same identifier in the primary table.
comma - indicates the field is a comma-separated list of identifiers.
minCheck - indicates that only a portion of the identifiers in the field are in the primary table.
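The kind of integrity check these keywords control can be sketched as follows. This is not joinerCheck's actual algorithm; it assumes the first field in a record acts as the primary one, and the `check_field` function and the accessions are made up for illustration.

```python
def check_field(primary_values, field_values, comma=False, min_check=1.0):
    """Return (fraction_found, ok) for one non-primary field.

    comma     - treat each value as a comma-separated list of identifiers
    min_check - minimum fraction of identifiers that must be in the primary field
    """
    primary = set(primary_values)
    identifiers = []
    for value in field_values:
        if comma:
            identifiers.extend(v for v in value.split(",") if v)
        else:
            identifiers.append(value)
    if not identifiers:
        return 1.0, True
    found = sum(1 for ident in identifiers if ident in primary)
    fraction = found / len(identifiers)
    return fraction, fraction >= min_check

# Made-up accessions for illustration only.
primary = ["AQ000001", "AQ000002", "AQ000003"]
pairs = ["AQ000001,AQ000002", "AQ000003,AQ999999"]
print(check_field(primary, pairs, comma=True, min_check=0.70))
```

With the default `min_check` of 1.0 the same data would fail the check, which is why a field known to be only partly covered carries an explicit minCheck value.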

29
identifier hugoName external=HUGO fuzzy
"International Human Gene Identifier"
    $hg.refLink.name
    $hg.atlasOncoGene.locusSymbol
    $hg.kgAlias.alias
    $hg.kgXref.geneSymbol
    $hg.refFlat.geneName
    $hg.jaxOrtholog.humanSymbol
    hg13,hg15.geneBands.name

“Biological” names for human genes are so messy, no validation is done (note ‘fuzzy’ keyword).

30 Other Databases
Genome databases - one for each assembly of each organism: hg17, mm6, canFam1, etc.
hgCentral - home to dbDb and user settings info. One database shared by all web servers.
hgFixed - mostly microarray data.
uniProt - relationalized SwissProt/trEMBL database.
go - Gene Ontology terms and term/gene associations.
genePix - gene image database.

31 Gene Pix
Image browser for in-situ and other gene-oriented pictures.
Hopefully in the long run it will have a million images covering almost all vertebrate genes.
(Needs a new name: Gene Pix is a microarray analysis program. VisiGene?)

33 Data Sets
Paul Gray - ~1000 mouse transcription factor genes - whole embryo & sections. These are in the database now.
Other potential sources:
 – German AxelDB frog in situs
 – Japanese NIBB frog in situs (have nice browser)
 – Genepaint.org - mouse stuff
 – EMAGE and Jackson Lab mouse images
From development and other journals, copyright issues.
 – Nathaniel Heintz BAC expression constructs
 – Eddy Rubin lab mouse embryos
 – UCSF cell-localization stuff?

34 Types of images
Whole animal vs. sectioned tissues vs. single cell.
Single vs. multiple probes within same image.
Single image vs. image series (movies even).
RNA, antibody, fusion protein.
Mitotic cell 3 stains

35 Gene Pix Programs
genePixLoad - loads SQL database from a well-defined format involving a .ra file and a tab-separated file. See genePixLoad.doc.
loadMahoney - converts Paul Gray (Mahoney center) spreadsheet and image directory into genePixLoad format.
hg/lib/genePix.c - interface with SQL database.
hgGenePix - cgi script to display images.
knownToGenePix - makes table in mm5 (or other) genome database to connect known genes to genePix IDs.

36 Gene Pix Database Just a single database for all assemblies of all organisms. A knownToGenePix table in the assembly database.

37 GenePix tables
fileLocation - directory
bodyPart - whole, brain, etc.
sliceType - transverse, sagittal
treatment - tech details
contributor - who done it
journal - scientific journal
submissionSet - info about a whole set of images from one author
sectionSet - links together separate sections of same specimen
gene - gene info
geneSynonym
antibody - info on an antibody
probeType - antibody, RNA, fusion protein
probe - links gene, primers, sequence, Ab
probeColor - color probe is
imageFile - file containing image
image - a single image
imageProbe - links image and probe
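Because one image can carry several probes, the image, probe, and imageProbe tables form a many-to-many link. The sketch below shows that pattern with an in-memory SQLite database. Only the three table names come from the slide; every column, file name, and gene name is hypothetical, and the gene name is stored directly on the probe row as a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical columns; only the table names come from the slide.
    CREATE TABLE image (id INTEGER PRIMARY KEY, fileName TEXT);
    CREATE TABLE probe (id INTEGER PRIMARY KEY, geneName TEXT);
    CREATE TABLE imageProbe (image INTEGER, probe INTEGER);
    INSERT INTO image VALUES (1, 'embryo1.jpg'), (2, 'embryo2.jpg');
    INSERT INTO probe VALUES (10, 'geneA'), (11, 'geneB');
    INSERT INTO imageProbe VALUES (1, 10), (1, 11), (2, 10);
""")

def images_for_gene(conn, gene_name):
    """List image file names that have a probe for the given gene."""
    rows = conn.execute(
        "SELECT image.fileName FROM image "
        "JOIN imageProbe ON imageProbe.image = image.id "
        "JOIN probe ON probe.id = imageProbe.probe "
        "WHERE probe.geneName = ? ORDER BY image.fileName",
        (gene_name,))
    return [row[0] for row in rows]

print(images_for_gene(conn, "geneA"))
```

A separate link table lets an image with three stains point at three probes without duplicating the image row.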

38 Some Anatomy Required

39 Especially with slices

40 Edinburgh mouse atlas

41 Theiler Stages

42 Later Stages

43 NIBB Japanese Frog Site

44 Earlier Stages

47 Who you gonna call? Angie Hinrichs - developer of 2nd and 4th versions of Table Browser. Genome browser hacker extraordinaire. Hiram Clawson - main mouse man at the moment. Developed ‘wiggle’ tracks.

48 Kate Rosenbloom - ENCODE project and multiple alignment display. Bob Kuhn - Software and database quality assurance. David Haussler - Ideas. Money. Comparative genomics.

49 More Acknowledgements
UCSC - Robert Baertsch, Gill Bejerano, Galt Barber, Ron Chao, Mark Diekhans, Jorge Garcia, Patrick Gavin, Rachel Harte, Fan Hsu, Yontao Lu, Crystal Lynch, Donna Karolchik, Jennifer Jackson, Ann Pace, Jacob Pedersen, Andy Pohl, Katie Pollard, Ali Sultan-Qurraie, Brian Raney, Krishna Roskin, Adam Siepel, Chuck Sugnet, Paul Tatarsky, Daryl Thomas, Heather Trumbower
Penn State - Scott Schwartz, Laura Elnitski, Belinda Giardine, Ross Hardison, Minmei Hou, Webb Miller, Anton Nekrutenko
Funding - NHGRI, HHMI, NCI, UCSC

