CS173 Lecture 14: Personal Genomics, GSEA/GREAT

Presentation transcript:

CS173 Lecture 14: Personal Genomics, GSEA/GREAT MW  11:00-12:15 in Beckman B302 Prof: Gill Bejerano TAs: Jim Notwell & Harendra Guturu http://cs173.stanford.edu [BejeranoWinter12/13]

Announcements
Coming Monday (3/4), lecture is again in LK101 (see the class website for room reminders).
I’ll be working on grad student admissions, so Harendra will lecture about his work (we’ll prepare the ground today).

Quick recap

Sequencing Public project: Celera project:

Human Structural Variation

Human Disease Cancer Congenital defects Disease Association studies Genic and cis-regulatory contributions

Personal genomics

Gameplan
1. As your budget allows, characterize all the variants in an individual’s genome: against the reference genome; against variants known in the population; if possible, against unaffected relatives.
2. Compare the structural variants you observe to the body of knowledge about genome content & function. Seek culprit mutations.
3. Having detected a smoking gun mutation, attempt to recreate it in a cell population or organism to obtain a “disease model”.
Variant types: Single Nucleotide Variants (SNVs), small insertions / deletions (indels), Copy Number Variants (CNVs), Structural Variants (SVs), novel sequence.

Targeted sequencing, or looking under the lamp, is 50x cheaper
[Figure: capture methods vs. shotgun. Genomic DNA with Exon 1 and Exon 2; exome library vs. shotgun library.]
Targeted sequencing refers to methods that enrich for a certain portion of the genome prior to the actual sequencing procedure.
Uses oligo baits (biotinylated) that attach to beads (streptavidin).
Targeted sequencing allows for much higher coverage at less cost.
It will only capture known sites.
These methods also introduce significant capture bias, including failure to capture sites that differ significantly from the reference genome (analogous to microarrays).
The problem is that a large amount of sequence is needed for accurate SNV calls, regardless of the model.
Modified from Meyerson et al. 2010. Advances in understanding cancer genomes through second-generation sequencing. Nature Reviews Genetics 11, no. 10 (October): 685-696.

Consumer genomics

Gameplan
1. Collect scientific literature about all structural variant correlations with human disease & traits.
2. Genotype customers for as many informative loci as is commercially viable.
3. Offer counseling for your findings, and their meaning.
4. Ask customers to phenotype themselves.
5. Discover new associations!

Pay, send biosample, get genotyped

Trait associations

Disease Risk Alleles https://www.23andme.com/

Side Effects: Serious Ethical Issues

Gene set enrichment analysis: The genic version

Imagine you did a microarray experiment

Cluster all genes for differential expression
[Figure: Experiment (replicates) vs. Control (replicates); genes ordered from most significantly up-regulated, through unchanged, to most significantly down-regulated.]

Determine cut-offs, examine individual genes
[Figure: Experiment (replicates) vs. Control (replicates); genes ordered from most significantly up-regulated, through unchanged, to most significantly down-regulated.]

Genes usually work in groups
Biochemical pathways, signaling pathways, etc.
Asking about the expression perturbation of groups of genes is both more appealing biologically, and more powerful statistically (you sum perturbations).

Ask about whole gene sets
[Figure: Experiment vs. Control with an ES/NES statistic on a + to - scale; Gene set 3 is up-regulated, Gene set 2 is down-regulated.]

One approach: GSEA
[Figure: dataset distribution vs. Gene set 3 distribution; axes: gene expression level, number of genes.]
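
The ES statistic mentioned on the previous slide can be illustrated with a minimal sketch. It assumes an unweighted, Kolmogorov-Smirnov-style running sum over a ranked gene list; the published GSEA method additionally weights each step by the gene's correlation with the phenotype and normalizes against permutations (the NES), both of which are omitted here. The function name and the toy gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (simplified, not full GSEA).

    Walk down the ranked list: step up for genes in the set, step down
    otherwise, and report the largest deviation from zero.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")

    running = 0.0
    best = 0.0
    for hit in in_set:
        running += 1.0 / n_hits if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking: g1 is the most up-regulated gene, g10 the most down-regulated.
ranked = ["g%d" % i for i in range(1, 11)]
es_top = enrichment_score(ranked, {"g1", "g2", "g3"})      # set at the top
es_bottom = enrichment_score(ranked, {"g8", "g9", "g10"})  # set at the bottom
```

A set concentrated at the top of the ranking gets a score near +1, and a set concentrated at the bottom gets a score near -1; a set scattered evenly stays near 0.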

Another popular approach: DAVID
Input: list of genes of interest (without expression values).

Multiple Testing Correction
Note that statistically you cannot just run individual tests on 1,000 different gene sets. You have to apply further statistical corrections, to account for the fact that even in 1,000 random experiments a handful may come out good by chance.
(E.g. experiment = throw a coin 10 times and ask if it is biased. If you repeat it 1,000 times, you will most likely get at least one all-heads series from a fair coin. You mustn’t deduce that the coin is biased.)
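
The coin example can be checked numerically. The sketch below computes the chance of seeing at least one all-heads series in 1,000 repeats of a fair 10-toss experiment, and then applies a Bonferroni correction, which is one common correction and an assumption here (the slide does not name a specific method, although a later slide reports Bonferroni-corrected p-values). The function name is hypothetical.

```python
# Probability that a fair coin gives 10 heads in a row in one experiment.
p_all_heads = 0.5 ** 10            # 1/1024

# Probability of at least one all-heads series across 1,000 experiments.
n_experiments = 1000
p_at_least_one = 1 - (1 - p_all_heads) ** n_experiments


def bonferroni(p_values):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# One "striking" result among 1,000 tests no longer looks significant.
corrected = bonferroni([p_all_heads] + [0.5] * 999)
```

Here `p_at_least_one` comes out at roughly 0.62, so an all-heads series is more likely than not to appear somewhere, and the corrected p-value for that single series is close to 1.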

What will you test?
Also note that this is a very general approach to test gene lists. Instead of a microarray experiment you can do RNA-seq. Instead of up/down-regulated genes you can test all the genes in a personal genome where you see surprising mutations. Any gene list can be tested.

Gene Sets: Cataloging biological knowledge

Keyword lists are not enough
The sheer number of terms is too much to remember and sort: a question of scale.
Need standardized, stable, carefully defined terms.
Need to describe domains at different levels of detail, so defined terms need to be related in a hierarchy.
With structured vocabularies/hierarchies:
Parent/child relationships exist between terms.
Increased depth -> increased resolution.
Can annotate data at the appropriate level.
May query at the appropriate level.
[Figure: anatomy keywords (organ system, embryo, cardiovascular, heart, …) vs. anatomy hierarchy (Organ system > Cardiovascular system > Heart).]
And thus we started the GO.

Annotate genes to most specific terms TJL-2004

General Implementations for Vocabularies
[Figure: a hierarchy (organ system, embryo, cardiovascular, heart, …) next to a DAG with the terms molecular function, chaperone regulator, chaperone activator, enzyme regulator, enzyme activator. A query for a term returns things annotated to its descendants.]
Reminder: a DAG is a directed acyclic graph, so a term may have more than one parent.
What does structure buy you?
1. Annotate at the appropriate level, query at the appropriate level.
2. Queries for higher-level terms include annotations to lower-level terms.
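
The "query returns things annotated to descendants" behaviour can be sketched as a small graph traversal. The term names come from the slide, but the exact parent/child edges and the two gene annotations below are assumptions made for illustration, as are all function and variable names.

```python
from collections import defaultdict

# child term -> list of parent terms (edges assumed for illustration)
PARENTS = {
    "chaperone regulator": ["molecular function"],
    "enzyme regulator": ["molecular function"],
    "enzyme activator": ["enzyme regulator"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "chaperone activator": ["chaperone regulator", "enzyme activator"],
}

# gene -> most specific terms it is annotated to (hypothetical genes)
ANNOTATIONS = {
    "geneA": ["chaperone activator"],
    "geneB": ["enzyme regulator"],
}


def descendants(term, parents=PARENTS):
    """Return every term reachable by following child edges from `term`."""
    children = defaultdict(set)
    for child, term_parents in parents.items():
        for parent in term_parents:
            children[parent].add(child)
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def query(term, annotations=ANNOTATIONS):
    """Genes annotated to `term` or to any of its descendants."""
    wanted = descendants(term) | {term}
    return sorted(g for g, terms in annotations.items() if wanted & set(terms))
```

With these assumed edges, `query("enzyme regulator")` returns both genes, because geneA's specific annotation sits below that term, while `query("chaperone regulator")` returns only geneA.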

Gene Sets
Gene Ontology (“GO”): Biological Process, Molecular Function, Cellular Location
Pathway Databases: KEGG, BioCarta, Broad Institute

Other Gene Sets
Transcription factor targets: all the genes regulated by particular TFs
Protein complex components: sets of genes whose protein products function together (ion channel receptors, RNA / DNA polymerase)
Paralogs: families of genes descended (in eukaryotic times) from a common ancestor

Natural Language Processing (NLP) Opportunities
Map genes to ontology using literature.
[Figure: Ontology, Literature, Genes.]

Gene set enrichment analysis: The gene regulatory version

Combinatorial Regulatory Code
2,000 different proteins can bind specific DNA sequences.
[Figure: DNA, proteins, protein binding sites, gene.]
A regulatory region encodes 3-10 such protein binding sites. When all are bound by proteins the regulatory region turns “on”, and the nearby gene is activated to produce protein.

ChIP-Seq: first glimpses of the regulatory genome in action
[Figure: peak calling; cis-regulatory peak.]

What is the transcription factor I just assayed doing?
Collect known literature of the form:
Function A: Gene1, Gene2, Gene3, ...
Function B: Gene1, Gene2, Gene3, ...
Function C: ...
Ask whether the binding sites you discovered are preferentially binding (regulating) any one or more of the functions listed above.
Form hypothesis and perform further experiments.
[Figure: cis-regulatory peaks and gene transcription start sites.]

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
[Figure: gene transcription start sites and SRF binding ChIP-seq peaks.]
ChIP-seq identified 2,429 SRF binding peaks in human Jurkat cells [1] (Jurkat: human T cell lymphoblast-like cell line).
SRF is known as a “master regulator of the actin cytoskeleton”.
In the ChIP-Seq peaks, we expect to find binding sites regulating (genes involved in) actin cytoskeleton formation.
Description: serum response factor (c-fos serum response
RefSeq Summary (NM_003131): This gene encodes a ubiquitous nuclear protein that stimulates both cell proliferation and differentiation. It is a member of the MADS (MCM1, Agamous, Deficiens, and SRF) box superfamily of transcription factors. This protein binds to the serum response element (SRE) in the promoter region of target genes. This protein regulates the activity of many immediate-early genes, for example c-fos, and thereby participates in cell cycle regulation, apoptosis, cell growth, and cell differentiation. This gene is the downstream target of many pathways; for example, the mitogen-activated protein kinase pathway (MAPK) that acts through the ternary complex factors (TCFs). [provided by RefSeq].

Example: inferring functions of Serum Response Factor (SRF) from its ChIP-seq binding profile
[Figure: gene transcription start sites, SRF binding ChIP-seq peaks, and genes annotated with ontology term π (e.g. ‘actin cytoskeleton’).]
Existing, gene-based method to analyze enrichment: ignore distal binding events, count affected genes, rank by enrichment hypergeometric p-value.
N = 8 genes in genome
K = 3 genes annotated with π
n = 2 genes selected by proximal peaks
k = 1 selected gene annotated with π
P = Pr(k ≥ 1 | n = 2, K = 3, N = 8)
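
The hypergeometric p-value for the toy numbers on this slide can be worked out directly. This is a minimal sketch using only the standard library; the function name is hypothetical, and the variable names follow the slide (N, K, n, k).

```python
from math import comb


def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance that at least k of n selected genes carry the
    annotation, when K of the N genes in the genome are annotated."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Slide example: N = 8 genes, K = 3 annotated, n = 2 selected, k = 1 hit.
p_slide = hypergeom_tail(k=1, n=2, K=3, N=8)   # 18/28
```

For the slide's numbers this gives 18/28, about 0.64, so a single annotated gene among two selected genes is entirely unsurprising.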

We have (reduced ChIP-Seq into) a gene list! What is the gene list enriched for?
Pro: A lot of tools out there for the analysis of gene lists.
Con: These tools are built for microarray analysis. Does it matter?
[Figure: microarray data and gene regulation data fed into a microarray tool.]

SRF gene-based enrichment results
Original authors can only state: “basic cellular processes, particularly those related to gene expression” are enriched [1].
SRF acts on genes both in nucleus and cytoplasm that are involved in transcription and various types of binding.
Where’s the signal? Top “actin” term is ranked #28 in the list.
[1] Valouev A. et al., Nat. Methods, 2008

Associating only proximal peaks loses a lot of information
[Figure: relationship of binding peaks to nearest genes for eight human (H) and mouse (M) ChIP-seq datasets.]
Restricting to proximal peaks often leads to complete loss of key enrichments.

Bad Solution: Associating distal peaks brings in many false enrichments
Why bad? 14% of human genes are tagged ‘multicellular organismal development’, but 33% of base pairs have such a gene nearest upstream/downstream. Large “gene deserts” are often next to key developmental genes.
The SRF ChIP-seq set has >2,000 binding events. Throw a random set of 2,000 regions at the genome. What do you get from a gene list analysis?
Term                                   Bonferroni-corrected p-value
nervous system development             5x10^-9
system development                     8x10^-9
anatomical structure development       7x10^-8
multicellular organismal development   1x10^-7
developmental process                  2x10^-6

Real Solution: Do not convert to gene list. Analyze the set of genomic regions.
GREAT = Genomic Regions Enrichment of Annotations Tool
[Figure: gene transcription start sites, gene regulatory domains, genomic regions (ChIP-seq peaks), and genes annotated with ontology term π (e.g. ‘actin cytoskeleton’).]
p = 0.33 of genome annotated with π
n = 6 genomic regions
k = 5 genomic regions hit annotation
P = Pr_binom(k ≥ 5 | n = 6, p = 0.33)
The fraction of the genome resulting in an annotation is explicitly used in the enrichment calculation.
Since 33% of base pairs are near a ‘multicellular organismal development’ gene, we now expect 33% of genomic regions to hit this term by chance. => Toss 2,000 random regions at the genome, get NO (false) enrichments.
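
The binomial test on this slide can be evaluated for the toy numbers. This is a minimal sketch with the standard library only; the function name is hypothetical, and it is not GREAT's own implementation. The naive sum is fine for small n; for thousands of regions a library routine such as `scipy.stats.binom.sf(k - 1, n, p)` is the safer choice, because the intermediate binomial coefficients become too large to convert to floats.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance that at least k of n
    regions land in the annotated fraction p of the genome."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Slide example: 5 of 6 regions hit a term covering 33% of the genome.
p_slide = binom_tail(k=5, n=6, p=0.33)

# Null expectation for 2,000 random regions under the same coverage.
expected_random_hits = 2000 * 0.33
```

For the slide's numbers the tail probability is about 0.017. Under the genomic-fraction null, 2,000 random regions are expected to hit the term about 660 times, so hitting it that often is no longer evidence of enrichment.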

How does GREAT know how to assign distal binding peaks to genes?
Future: High-throughput assays based on chromosome conformation capture (3C) methods will elucidate complex regulation mechanisms.
Currently: Flexible computational definitions allow assignment of peaks to the nearest gene, nearest two genes, etc.
Default: each gene has a “basal regulatory domain” of 5 kb upstream and 1 kb downstream of its transcription start site, which extends to the basal domain of the nearest genes, within 1 Mb.
Though some associations may be missed or incorrect, in general signal richness and robustness are greatly improved by associating distal peaks.
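
The default rule described above can be sketched for a single chromosome. This is an illustration under stated assumptions, not GREAT's actual code: the 1 Mb cap is measured from the transcription start site, domains are half-open intervals, and the gene names, coordinates and function names are hypothetical.

```python
def regulatory_domains(genes, upstream=5_000, downstream=1_000, max_ext=1_000_000):
    """genes: list of (name, tss, strand) on one chromosome.
    Returns {name: (start, end)} regulatory domains."""
    basal = []
    for name, tss, strand in genes:
        if strand == "+":
            start, end = tss - upstream, tss + downstream
        else:  # minus strand: upstream lies at higher coordinates
            start, end = tss - downstream, tss + upstream
        basal.append((name, tss, max(0, start), end))
    basal.sort(key=lambda b: b[1])  # order genes by TSS position

    domains = {}
    for i, (name, tss, b_start, b_end) in enumerate(basal):
        # Extend left up to the closest basal domain on the left, capped at max_ext.
        left_limit = max([0, tss - max_ext] + [b[3] for b in basal[:i]])
        # Extend right up to the closest basal domain on the right, capped at max_ext.
        right_limit = min([tss + max_ext] + [b[2] for b in basal[i + 1:]])
        # The basal domain itself is always kept, even if neighbours overlap it.
        domains[name] = (min(b_start, left_limit), max(b_end, right_limit))
    return domains


def genes_for_peak(position, domains):
    """Names of all genes whose regulatory domain contains the peak position."""
    return sorted(n for n, (start, end) in domains.items() if start <= position < end)


toy_genes = [("A", 100_000, "+"), ("B", 500_000, "-"), ("C", 3_000_000, "+")]
toy_domains = regulatory_domains(toy_genes)
```

In this toy layout a peak at position 300,000 is assigned to both A and B, while a peak at 1,800,000 falls outside every domain and is assigned to no gene, which is the kind of missed association the slide acknowledges.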

GREAT infers many specific functions of SRF from its binding profile
Top GREAT enrichments of SRF (compare the top gene-based enrichments of SRF, where the top actin-related term is 28th in the list)
Ontology          Term                     # Genes   Binomial P-value
Gene Ontology     actin cytoskeleton       30        7x10^-9
Gene Ontology     actin binding            31        5x10^-5
Pathway Commons   TRAIL signaling          32        5x10^-7
Pathway Commons   Class I PI3K signaling   26        2x10^-6
TreeFam           FOS gene family          5         1x10^-8
TF Targets        Targets of SRF           84        5x10^-76
TF Targets        Targets of GABP          28        4x10^-9
TF Targets        Targets of YY1           44        1x10^-6
TF Targets        Targets of EGR1          23        2x10^-4
Experimental support*: Gene Ontology: Miano et al. 2007. Pathway Commons: Bertolotto et al. 2000; Poser et al. 2000. TreeFam: Chai & Tarnawski 2002. TF Targets: positive control; ChIP-Seq support; Natesan & Gilman 1995.
SRF is “master regulator of the actin cytoskeleton.”
Demonstrated associations between SRF and TRAIL signaling. SRF is needed for PI3K-dependent cell proliferation.
cFOS and FOSB are known targets of SRF.
SRF is key regulator of FOS oncogene and has been shown to act in conjunction with YY1 to regulate FOS.
* Known from literature: the function is known, SOME of the genes are known, and the binding sites highlighted are NOT.
Similar results for GABP, NRSF, Stat3, p300 ChIP-Seq [McLean et al., Nat Biotechnol., 2010]

Limb P300: I was blind and I can see Gene List Fraction of genome resulting in annotation explicitly used in enrichment calculation

GREAT works with ANY cis-regulatory rich set Example: GWAS Compendium set Height-associated unlinked SNPs Fraction of genome resulting in annotation explicitly used in enrichment calculation

GREAT analysis of histone mark combinations

GREAT includes multiple ontologies
Twenty ontologies spanning broad categories of biology
44,832 total ontology terms tested in each GREAT run
Terms per ontology (the ontology names were in the figure): 2,800; 6,700; 5,215; 3,079; 834; 911; 5,781; 615; 427; 19; 456; 222; 9; 150; 1,253; 6,857; 288; 8,272; 706; 238
Michael Hiller

Advantages of the GREAT approach
Tailored to the biology of gene regulation:
Distal sites are incorporated, not ignored
Variable length gene regulatory domains
Multiple bindings next to same target gene rewarded
Extensive ontologies, some home-made

Algorithmic Optimization: A: it works; B: make it efficient

Enter GREAT.stanford.edu
Choose genome
Input peak list
Hit submit!

GREAT web app: (Optional) alter association rules
http://great.stanford.edu
Three association rule choices
Literature-curated domains for a small subset of genes
[Figure: Lnp, Evx2, HoxD cluster; adapted from Spitz, Gonzalez, & Duboule, Cell, 2003]

GREAT web app: output summary
Additional ontologies, term statistics, multiple hypothesis corrections, etc.
Ontology-specific enrichments
Cool visualization opportunities!

GREAT web app: term details page
Genes annotated as “actin binding” with associated genomic regions
Genomic regions annotated with “actin binding”
Drill down to explore how a particular peak regulates Plectin and its role in actin binding

You can also submit any track straight from the UCSC Table Browser.
A simple, well documented programmatic interface allows any tool to submit directly to GREAT. (See our Help / Inquiries welcome!)

GREAT web app: export data
HTML output displays all user selected rows and columns
Tab-separated values also available for additional postprocessing

GREAT Web Stats
200-400 job submissions per day, from 7,000 IP addresses

Adding a new species to GREAT
We need:
A good assembly
A high quality gene set
Good gene annotations*
*Most valuable for species with independent annotations!

Adapting GREAT for zebrafish
We need: a good assembly, a high quality gene set, good gene annotations
Assembly   # Scaffolds   Avg. Scaffold Length   # Assembly Gaps
Zv8        11,724        129 Kb                 ~55,000
Zv9        1,133         1,250 Kb               ~27,000
Zv9 = UCSC danRer7
Older assemblies? -> liftover to Zv9/danRer7

Adapting GREAT for zebrafish
We need: a good assembly, a high quality gene set, good gene annotations
Carefully combine (95% identity, 80% coverage): RefSeq transcripts, Ensembl coding genes, RefSeq proteins, Uniprot proteins
-> Obtain 14,567 genes, all with ZFIN gene identifiers
Using only RefSeq would miss 1,912 annotated genes
Using only Ensembl would miss 1,218 annotated genes

Adapting GREAT for zebrafish
We need: a good assembly, a high quality gene set, good gene annotations
Curate zebrafish:
Gene Ontology (GO) - Function, Process, Cellular Component
ZFIN Phenotype
Wiki Pathways
ZFIN Wildtype Expression
InterPro - protein domains, families and functional sites
TreeFam - gene families of paralogs

96% of our gene set is annotated
At least one gene is annotated with the term