Published by窗堀艳 颜 Modified over 7 years ago
1
Clench 2.0: A program for cluster enrichment analysis and integrated visualization of expression, annotation and transcription factor binding site data
2
What is Gene Ontology?
An ontology is a specification of the concepts and relationships that can exist in a domain of discourse. (There are different ontologies for various purposes.) The Gene Ontology (GO) project is an effort to provide consistent descriptions of gene products. The project began as a collaboration between three model organism databases: FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). Since then, the GO Consortium has grown to include most model organism databases. GO creates terms for three aspects: Biological Process, Molecular Function and Cellular Component.
3
Structure of GO relationships
4
Using GO for interpreting gene clusters from Microarray data
The terms used in annotating genes via GO are controlled and have specific parent-child relationships, so a user can form queries such as:
‘Are photosynthesis genes being overexpressed in my experiment?’ The genes that are involved in photosynthesis in some way can be found systematically and included in the analysis.
‘What is the predominant function for my <group> of interesting genes?’ Here, interesting genes are genes that have similar expression profiles and/or pass some test of significance in the data analysis.
One can also use a ‘reduced’ version of GO, ‘Slim GO’, where the terms used in annotation are coarser.
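A query such as the photosynthesis one relies on walking the parent-child relationships downward from a term. The following is a minimal sketch of that idea, not Clench's own code: the `children` mapping, the gene identifiers and the function names are hypothetical, and a real run would load the hierarchy and annotations from GO files.

```python
def descendants(term, children):
    """Return the term itself plus every term below it in the hierarchy."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def genes_under(term, children, annotations):
    """Return genes annotated to the term or to any of its descendants."""
    terms = descendants(term, children)
    return {gene for gene, gene_terms in annotations.items()
            if terms & set(gene_terms)}


# Toy data (hypothetical): a two-level hierarchy and three annotated genes.
children = {
    "photosynthesis": ["photosynthesis, light reaction",
                       "photosynthesis, dark reaction"],
}
annotations = {
    "geneA": ["photosynthesis, light reaction"],
    "geneB": ["photosynthesis"],
    "geneC": ["response to stress"],
}

photosynthesis_genes = genes_under("photosynthesis", children, annotations)
```

Here `geneA` is picked up even though it is annotated only to the more specific child term, which is the point of using the hierarchy rather than exact term matches.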
5
TF-sites in promoters of gene clusters from Microarray data
Analyze the promoters of genes with a similar expression profile and/or a similar function for enrichment of known transcription factor (TF) binding sites. This is based on the premise that co-expression of these genes can be due to co-regulation … [not always!] It becomes a tedious analysis if we have to do it for gene groups formed on the basis of expression + common function. If we can somehow see groupings along all three data types (expression, annotation, promoter TF sites) simultaneously, it would be very valuable in interpreting gene clusters.
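Searching promoters for a known TF site can be sketched as turning a consensus sequence written in IUPAC symbols into a regular expression and recording which promoters contain it. This is an illustrative sketch only: the function names, the toy sequences and the example consensus are hypothetical, and scanning both strands is an assumption rather than something the slides state.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC symbol (e.g. K = G/T pairs with M = A/C).
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(consensus):
    return consensus.upper().translate(COMPLEMENT)[::-1]


def consensus_to_regex(consensus):
    return re.compile("".join(IUPAC[symbol] for symbol in consensus.upper()))


def promoters_with_site(consensus, promoters):
    """Return ids of promoters containing the site on either strand."""
    forward = consensus_to_regex(consensus)
    reverse = consensus_to_regex(reverse_complement(consensus))
    return {gene for gene, seq in promoters.items()
            if forward.search(seq.upper()) or reverse.search(seq.upper())}


# Toy promoters (hypothetical sequences, far shorter than real ones).
promoters = {
    "g1": "TTTACGTGTCAAA",   # site on the forward strand
    "g2": "TTTGACACGTAAA",   # site on the reverse strand
    "g3": "AAAAAAAAAAAAA",   # no site
}
hits = promoters_with_site("ACGTGKC", promoters)
```

Counting `len(hits)` for the cluster and for the background gives the two numbers needed for an enrichment test on that site.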
6
[Figure: two example groups (Group 1, Group 2). Groups are clear from the standpoint of expression, ill-defined from the standpoint of annotations, and absent from the standpoint of promoter sequences.]
Look for grouping based on different data types. The process of ‘grouping’ can be done sequentially:
1. First using expression, then promoter sequences and/or annotation
2. Using expression, then promoter and annotation jointly
3. First using promoter, then expression and/or annotation
Or it can be done jointly using expression, promoter and annotation data, with or without a formal model to integrate them.
7
Clench takes in a group of genes formed on the basis of expression data and analyzes it for similarity in promoter TF-site and annotation data. It allows the user to visualize all three data types simultaneously to draw meaningful inferences.
8
Clench inputs
A list of ‘background genes’, listed as one AGI-identifier per line.
A list of ‘cluster genes’, listed as one AGI-identifier per line.
A list of ‘slim terms’, listed in GO-ids, one per line, OR a slim term file in GO flat file format.
A FASTA format file containing the promoter sequences of the genes under study.
A tab delimited file containing the TF sites (consensus sequence in IUPAC symbols) that you want to search for in the promoters of genes.
A tab delimited file containing the expression data for the cluster genes.
9
What does Clench do?
Given a list of genes, it retrieves GO annotations for them:
* From TAIR GO flat files
* From locally stored annotations
For the background list of genes:
* Counts the number of genes assigned to each GO term.
* Searches the promoters to count the number of promoters that have a particular TF site.
For the ‘cluster’ list of genes:
* For each GO term, counts the number of genes assigned to that term from the cluster.
* For those genes, for each TF site, counts the number of genes that have that site in their promoter.
Then computes p-values for enrichment.
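The counting step described on this slide can be sketched as follows. This is a simplified illustration under stated assumptions, not Clench's implementation: annotations are held in a plain dictionary, and the gene ids and the helper name `count_genes_per_term` are hypothetical.

```python
from collections import Counter


def count_genes_per_term(genes, annotations):
    """Count how many distinct genes in the list are assigned to each term."""
    counts = Counter()
    for gene in set(genes):                      # ignore duplicate ids
        for term in set(annotations.get(gene, ())):
            counts[term] += 1
    return counts


# Toy data (hypothetical gene ids).
annotations = {
    "g1": ["photosynthesis"],
    "g2": ["photosynthesis", "chloroplast"],
    "g3": ["chloroplast"],
    "g4": [],
}
background = ["g1", "g2", "g3", "g4"]
cluster = ["g1", "g2"]

background_counts = count_genes_per_term(background, annotations)
cluster_counts = count_genes_per_term(cluster, annotations)
```

The pair `(cluster_counts[term], background_counts[term])`, together with the two list sizes, is what an enrichment p-value is computed from.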
10
How does CLENCH compute p-values?
Uses a theoretical distribution to estimate: “How surprising is it that n genes from my cluster are annotated as ‘yyyy’, knowing that m genes are annotated as ‘yyyy’ in the whole genome (or chip)?” CLENCH uses the hypergeometric, chi-square and binomial distributions. The p-value calculation can be viewed in different ways:
1 - We can think of the process of assigning genes to a category as a binomial process.
2 - We can try to find the significance of the difference between two proportions using chi-square.
3 - We can view it as a 2-category assignment, belonging to ‘yyyy’ or ‘not-yyyy’ {without replacement}, and use the hypergeometric distribution.
We assume independence of categories, which is not really true.
[Diagram labels: M, N, m, n]
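Two of these views can be made concrete with a short sketch. The code below computes upper-tail p-values (the probability of seeing at least the observed number of annotated genes in the cluster) under the hypergeometric and binomial models. The parameter names are assumptions chosen for readability and do not correspond to the symbols on the slide; the chi-square variant is not shown.

```python
import math


def hypergeom_pvalue(background_size, background_hits, cluster_size, cluster_hits):
    """P(X >= cluster_hits) when drawing cluster_size genes without replacement."""
    total = math.comb(background_size, cluster_size)
    upper = min(cluster_size, background_hits)
    tail = sum(
        math.comb(background_hits, i)
        * math.comb(background_size - background_hits, cluster_size - i)
        for i in range(cluster_hits, upper + 1)
    )
    return tail / total


def binomial_pvalue(background_size, background_hits, cluster_size, cluster_hits):
    """P(X >= cluster_hits) when each gene is annotated with probability hits/size."""
    p = background_hits / background_size
    return sum(
        math.comb(cluster_size, i) * p**i * (1 - p) ** (cluster_size - i)
        for i in range(cluster_hits, cluster_size + 1)
    )


# Toy example: 20 genes on the chip, 5 annotated as 'yyyy';
# a cluster of 4 genes contains 3 of them.
p_hyper = hypergeom_pvalue(20, 5, 4, 3)   # 155 / 4845, about 0.032
p_binom = binomial_pvalue(20, 5, 4, 3)    # 0.05078125
```

In this toy case the binomial model gives a larger p-value than the hypergeometric one, since sampling with replacement allows more spread than sampling without replacement from a small background.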
11
Controlling false positives
Clench performs simulations to estimate the False Discovery Rate (FDR) at a (hypergeometric) p-value cutoff of 0.05. The FDR falls when the p-value cutoff for calling a category significant is reduced, or when the number of genes in a category is large enough. If the FDR is too high, Clench will reduce the p-value cutoff until the FDR is acceptable. The reduction can be the same for all categories, or “stepped”, where the cutoff is reduced by a larger amount for smaller categories and by a smaller amount for large categories [depending on the specific FDR for a category of a particular size]. The FDR can also be reduced by using Slim Term mapping, which ensures that each category has a large number of genes assigned to it and hence the FDR is low to begin with. However, we cannot be *sure* that the process of assigning GO annotations to genes follows some theoretical distribution; therefore it is essential to get an estimate of the false positives to expect in our result. Generally, the FDR of a category with more than 5-6 genes is less than 0.3. A p-value estimated during the simulation follows the binomial distribution more closely than the hypergeometric (contrary to popular belief!).
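One common way to estimate an FDR by simulation is to draw random clusters of the same size from the background, count how many categories pass the cutoff by chance, and divide the average by the number of categories called significant in the real cluster. The sketch below follows that scheme as an assumption; the slides do not spell out Clench's exact procedure, and all names and toy data here are hypothetical.

```python
import math
import random


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) for n draws without replacement from N genes, K of them annotated."""
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1))
    return tail / math.comb(N, n)


def significant_terms(cluster, background, term_genes, cutoff):
    """Return the terms whose enrichment p-value in the cluster is <= cutoff."""
    cluster = set(cluster)
    called = set()
    for term, genes in term_genes.items():
        k = len(cluster & genes)
        if k and hypergeom_pvalue(len(background), len(genes), len(cluster), k) <= cutoff:
            called.add(term)
    return called


def estimate_fdr(cluster, background, term_genes, cutoff=0.05, n_sim=100, seed=0):
    """Average number of chance calls in random clusters / number of observed calls."""
    observed = len(significant_terms(cluster, background, term_genes, cutoff))
    if observed == 0:
        return 0.0
    rng = random.Random(seed)
    chance_calls = 0
    for _ in range(n_sim):
        random_cluster = rng.sample(background, len(cluster))
        chance_calls += len(significant_terms(random_cluster, background, term_genes, cutoff))
    return min(1.0, (chance_calls / n_sim) / observed)


# Toy data: 50 background genes, one small and one large category.
background = [f"g{i}" for i in range(50)]
term_genes = {"termA": set(background[:5]), "termB": set(background[5:30])}
cluster = background[:4] + background[40:42]

fdr = estimate_fdr(cluster, background, term_genes)
```

If the estimate exceeds an acceptable level, the same function can be called again with a smaller `cutoff`, which mirrors the cutoff reduction described above.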
12
The result
1 - Overview (as a visual matrix) of TF sites found in the promoters of each cluster.
2 - Overview (as a visual matrix) of annotations for the cluster.
3 - Overview of the expression profiles for the cluster. [1, 2, 3 are shown on the next slide]
4 - A directed graph for each GO 'aspect' (P = process / F = function / C = component). [4 is shown on slide 13]
5 - An HTML table (one for P/F/C) showing:
* Genes assigned to each term and their expression profiles,
* TF sites in their promoters,
* Cross tabulated annotations (cross tabulated annotation = F & C annotations for a group of genes assigned to term XXX in aspect P). [5 is shown on slide 14]
Enriched TF sites and annotations are colored red and the thumbnails are linked to full images. [shown on slide 15]
13
Overview Images
Thumbnails showing an overview of the TF-site, annotation and expression data. Each thumbnail is linked to the larger image.
14
Screen shot of the result
15
DAG of GO terms
The graph shows relations between enriched GO terms.
Red: Enriched terms
Cyan: Informative high-level terms with a large number of genes but not statistically enriched
White: Non-informative terms (defined as an ‘ignore list’ by the user)
16
A row from the result table
TF-sites in the promoters
P-values and gene description for the genes assigned to the ‘photosynthesis, light reaction’ term
Function and component annotations (this row is from process)
17
TF-site & Annotation matrices
TF-sites show that the ABRE consensus and GBF sites are enriched in these genes. Component and function annotations show that these genes are located in the chloroplast thylakoid membrane, are part of the light-harvesting complex and have the molecular function of chlorophyll binding and electron transport. * The relationship between the terms is seen in the DAG *
18
Usefulness of Clench
Clench helps the biologist interpret a list of genes and form a result statement such as: The photosynthesis genes located in the chloroplast are repressed in response to ozone stress and have the ABRE binding site enriched in their promoters. See more at Or at
19
Look under the hood
A graph showing the various ‘parts’ of Clench 2.0 and their interaction
20
Configuring Clench
AnnotationSource = TAIR
StoredAnnotations = C:\<PATH>\SavedAnnotations.txt
AnnotationSourceFile = C:\<PATH>\ATH_GO txt
PromoterSource = TAIR
StoredPromoters = C:\<PATH>\StoredPromoters.txt
PromoterSourceFile = C:\<PATH>\At_upstream_1000_
PromoterType = At_upstream_1000
MotifFile = C:\CLENCH\SampleFiles\MotifListWithNames.txt
ExpressionFile = C:\CLENCH\SampleFiles\SampleExpression_Profile.txt
SlimTerms = C:\CLENCH\SampleFiles\SlimTermsGeneric.txt
MappingType = Complete
NumberOfSim = 100
FDRcutoff = 0.3
CorrectionType = stepped
mygo = DBI:mysql:mygo
user = root
password = clench
IgnoreList = GO: |GO: |GO: |GO:000367
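The settings are simple `key = value` pairs, so they are easy to read from other tools. The following is a minimal sketch of such a reader, assuming one setting per line; it is not Clench's own parser, and the function name `parse_config` is hypothetical. The sample text reuses a few values from the slide.

```python
def parse_config(text):
    """Parse 'key = value' lines into a dict, skipping blank or malformed lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


sample = """
AnnotationSource = TAIR
MappingType = Complete
NumberOfSim = 100
FDRcutoff = 0.3
CorrectionType = stepped
"""

settings = parse_config(sample)
n_sim = int(settings["NumberOfSim"])        # number of simulations
fdr_cutoff = float(settings["FDRcutoff"])   # acceptable FDR level
```

Values are kept as strings and converted only where a number is needed, which avoids guessing the type of path-like or identifier-like settings.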
21
Running Clench
TotalChipMips - Provide full path and name of the file containing a list of genes (one per line) to use as a 'reference'.
ClusterMips - Provide full path and name of the file containing a list of genes (one per line) that belong to a cluster, OR a file containing a list of Clusterfiles.
WantDetailResult?(Y/S) - Enter Y for getting all rows in the result. Enter S for getting only the 'significant' rows.
MapToParents(Y/N) - Enter Y for assigning a gene annotated to a term to all its parent terms. Enter N for keeping the annotations as is.
ConvertToGOSlim?(YC/YT/N) - Enter YC to convert fine level annotations to an arbitrary coarse level. Enter YT to convert fine level annotations to the coarse level provided by TAIR. Enter N to leave annotations as is.
Remove Duplicates?(Y/N) - Enter Y to remove duplicate ids in Total or Cluster.
ResultPrefix(Optional) - When analyzing multiple files as a list you can assign a prefix to each result file, e.g. prefix with ClenchRun-1.
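The MapToParents option can be illustrated with a small sketch: each gene annotated to a term is also assigned to every ancestor of that term. The code below is an assumption about what such a mapping looks like, not Clench's implementation; the `parents` mapping is a toy hierarchy and the gene ids are hypothetical.

```python
def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def map_to_parents(annotations, parents):
    """Assign each gene to its annotated terms plus all of their ancestors."""
    mapped = {}
    for gene, terms in annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        mapped[gene] = expanded
    return mapped


# Toy hierarchy (hypothetical, three levels deep).
parents = {
    "photosynthesis, light reaction": ["photosynthesis"],
    "photosynthesis": ["metabolism"],
}
annotations = {
    "geneA": ["photosynthesis, light reaction"],
    "geneB": ["metabolism"],
}

mapped = map_to_parents(annotations, parents)
```

With the mapping applied, `geneA` counts toward the broader terms as well, so higher-level categories collect more genes than they would from direct annotations alone.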