GSEA-Pro Tutorial
Anne de Jong, University of Groningen
Introduction

The main principle of a Gene Set Enrichment Analysis (GSEA) is to discover which biological function or functions are overrepresented in a set of genes or proteins.

For such an analysis GSEA-Pro uses the Genome2D database, which describes the relation between genes/proteins and functions (functional classification). For example, all genes encoding enzymes for a specific metabolic pathway belong to the same class.

GSEA-Pro uses multiple classifications: GO, InterPro, KEGG, COG, PFAM, SMART and Superfamily.

In GSEA-Pro, locus-tags are used as IDs for genes as well as for proteins.
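To make "overrepresented" concrete, here is a minimal sketch of one common way to score a class against a gene set: a one-sided hypergeometric test. This is an assumption for illustration only; the tutorial does not state which statistical test GSEA-Pro applies. The function name and the toy numbers are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(genome_size, class_size, set_size, hits):
    """Probability of seeing at least `hits` class members in a gene set
    of `set_size` genes drawn at random from `genome_size` genes, of which
    `class_size` belong to the class (hypergeometric upper tail).

    Assumption: a hypergeometric model; GSEA-Pro may use a different test.
    """
    total = comb(genome_size, set_size)
    upper = min(class_size, set_size)
    tail = sum(
        comb(class_size, k) * comb(genome_size - class_size, set_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical toy case: a genome of 10 genes, a class with 4 members,
# and a gene set of 3 genes that all belong to the class.
p_toy = overrepresentation_pvalue(genome_size=10, class_size=4, set_size=3, hits=3)
print(f"p = {p_toy:.4f}")
```

A small p-value means the class occurs in the gene set more often than random sampling from the genome would explain.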
Introduction: Overview of Functional Analysis of Gene Sets

[Figure: -omics data (Transcriptomics, Proteomics, Metagenomics) yield one or multiple sets of genes; the goal is to unravel the biological function of a “Gene Set”.]
Input

STEP 1: Select Genome
GSEA-Pro is integrated into the Genome2D web-server, which contains classifications of all ‘complete’ genomes of the NCBI. Be sure to select the correct strain (check your locus-tags). Preferably use the RefSeq locus-tag names; old-locus-tags are also supported if a genome is selected from the RefSeq database. The ‘old’ non-RefSeq NCBI genome database is also supported and still contains gene names and locus-tags that NCBI has discarded in the RefSeq database.

STEP 2: Four types of data tables can be used as input
Single list of locus-tags: A bare list of genes (as locus-tags) deduced from transcriptome or proteome analysis results.
Single list of locus-tags with ratio values: The first column contains the locus-tags, the second the ratio values generated by differential expression (DE) analysis.
Experiments: For time series or perturbation experiments, GSEA-Pro selects the gene set of each experiment on the basis of the ratio data. The default threshold values can be changed on the webserver.
Clustering: Clustering algorithms group genes that show similar behavior over perturbation experiments or time series. GSEA-Pro handles each cluster as a gene set and shows the biological function of each cluster. The first column of the input table should contain the locus-tags, and the column with cluster IDs should have the header “clusterID” (or change this on the web-server).
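The "Experiments" input type can be pictured with a short sketch: for each experiment column, keep the locus-tags whose ratio passes a threshold. Everything specific here is an assumption made for illustration: the ratios are treated as log2 fold changes, the cutoff of 1.0 is hypothetical (the tutorial does not give GSEA-Pro's defaults), and the locus-tags and column names are made up.

```python
import csv
import io

# Hypothetical tab-delimited "Experiments" table: locus-tags in the first
# column, one ratio column per experiment.
EXPERIMENT_TABLE = (
    "locus_tag\texp1\texp2\n"
    "GENE_0001\t2.5\t0.1\n"
    "GENE_0002\t-1.8\t1.2\n"
    "GENE_0003\t0.3\t-0.4\n"
    "GENE_0004\t0.9\t-2.0\n"
)


def select_gene_sets(tsv_text, threshold=1.0):
    """Return {experiment: [locus-tags]} for genes with |ratio| >= threshold.

    Assumption: ratios are log2 fold changes and 1.0 is an illustrative
    cutoff, not a documented GSEA-Pro default.
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    experiments = header[1:]
    gene_sets = {name: [] for name in experiments}
    for row in reader:
        if not row:
            continue  # skip blank lines
        locus_tag = row[0]
        for name, value in zip(experiments, row[1:]):
            if abs(float(value)) >= threshold:
                gene_sets[name].append(locus_tag)
    return gene_sets


experiment_gene_sets = select_gene_sets(EXPERIMENT_TABLE)
for name, tags in experiment_gene_sets.items():
    print(name, tags)
```

Each resulting list plays the role of one gene set, analysed separately per experiment.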
Input

STEP 3: Examples of input data tables
Tables can be uploaded to the webserver as a tab-delimited file or pasted directly from, e.g., Excel.

[Figure: example input tables for the four input types: Single list, Single list + ratio data, Experiments, Clustering (value columns will be ignored).]
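As an illustration of the Clustering table layout, the sketch below reads a tab-delimited table, takes the locus-tags from the first column, groups them by the “clusterID” column, and ignores the value columns. The table contents and the function name are hypothetical; this only mirrors the layout described in the tutorial and is not the webserver's actual parser.

```python
import csv
import io
from collections import defaultdict

# Hypothetical clustering table: locus-tags first, value columns (ignored),
# and a column with the header "clusterID".
CLUSTER_TABLE = (
    "locus_tag\tt1\tt2\tclusterID\n"
    "GENE_0001\t0.5\t1.9\t1\n"
    "GENE_0002\t0.4\t2.1\t1\n"
    "GENE_0003\t-1.2\t-0.3\t2\n"
)


def gene_sets_from_clusters(tsv_text, cluster_column="clusterID"):
    """Group locus-tags (first column) by cluster ID; other columns are ignored."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    locus_column = reader.fieldnames[0]
    clusters = defaultdict(list)
    for row in reader:
        clusters[row[cluster_column]].append(row[locus_column])
    return dict(clusters)


cluster_gene_sets = gene_sets_from_clusters(CLUSTER_TABLE)
print(cluster_gene_sets)
```

The `cluster_column` parameter corresponds to the option of changing the cluster header name.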
Results

Results are normally ready within seconds and comprise 4 main tables:

Table 1: All combinations of class / experiment are represented in one table.
Values are only shown if the p-value is lower than the cutoff value (0.01).
Within brackets: the number of genes of the class that are differentially expressed (TopHits).
The light to dark blue coloring represents low to high significance, respectively. The intensity of the color is based on (TopHits/ClassSize) * -log2(adj-pvalue).
Items in the ClassID column link to external databases describing the class IDs.
Items in the Experiment columns link to the genes and gene annotations that are members of that specific class / experiment combination.
The ClassSize column shows the total number of genes that are members of the classID in the selected organism.

Table 2: Heatmap of Class x Experiments, clickable to the ‘GSEA-Pro BarGraph’.
The GSEA-Pro BarGraph shows the overrepresented classes and their p-values (as -log). A detailed table links to online information on the classIDs and to the genes found for the specific class.

Table 3: Heatmap of Class x Experiments, clickable to the full class table.

Table 4: Overview of the locus-tags of each experiment or cluster used for the GSEA.

TreeMap: Global visualization and quick mining through the GSEA-Pro results.
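The color-intensity expression quoted for Table 1, (TopHits/ClassSize) * -log2(adj-pvalue), can be worked through in a few lines. The function names and example numbers are hypothetical. Applying the 0.01 cutoff to the adjusted p-value is an assumption, since the tutorial does not say whether the cutoff uses the raw or the adjusted value.

```python
from math import log2

P_VALUE_CUTOFF = 0.01  # cutoff value stated in the tutorial


def color_score(top_hits, class_size, adj_pvalue):
    """Intensity score: (TopHits / ClassSize) * -log2(adj-pvalue)."""
    return (top_hits / class_size) * -log2(adj_pvalue)


def is_shown(adj_pvalue, cutoff=P_VALUE_CUTOFF):
    """Assumption: a cell is shown only when the adjusted p-value is below the cutoff."""
    return adj_pvalue < cutoff


# Hypothetical cell: 5 of the 10 genes in a class are TopHits,
# with an adjusted p-value of 2**-10 (about 0.00098).
example_score = color_score(top_hits=5, class_size=10, adj_pvalue=2 ** -10)
print(example_score, is_shown(2 ** -10))
```

The score grows both with the fraction of the class that is hit and with the significance, so a fully covered small class can score higher than a sparsely covered large class with a similar p-value.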